CN114143074A

CN114143074A - Webshell attack recognition device and method

Info

Publication number: CN114143074A
Application number: CN202111433509.1A
Authority: CN
Inventors: 赵玉芳
Original assignee: Hangzhou DPTech Technologies Co Ltd
Current assignee: Hangzhou DPTech Technologies Co Ltd
Priority date: 2021-11-29
Filing date: 2021-11-29
Publication date: 2022-03-04
Anticipated expiration: 2041-11-29
Also published as: CN114143074B

Abstract

The disclosure relates to a webshell attack identification method and device, an electronic device, and a computer-readable medium. The method comprises: acquiring webpage data on a target webpage; scanning the webpage data based on a static feature library and a code feature library; when a matching result exists in the scanning result, acquiring behavior data of the target webpage; scanning the behavior data based on a behavior feature library; and when a hit result exists in that scanning result, determining that a webshell attack exists on the target webpage. The method, device, electronic device, and computer-readable medium can identify webshell attacks quickly, effectively, and accurately, improving identification efficiency and shortening identification time while preserving identification accuracy, and thereby improving user experience.

Description

Webshell attack recognition device and method

Technical Field

The disclosure relates to the field of computer information processing, in particular to a webshell attack identification method and device, electronic equipment and a computer readable medium.

Background

Currently, with the continuous development of the internet, network security is receiving more and more attention, and the webshell, an important attack tool for network attackers, has become a research hotspot in the field. A webshell is essentially a piece of code that an attacker uses to carry out a malicious attack, so webshell detection usually takes one of two approaches. The first treats the webshell as a piece of code and performs static analysis on it: features common to different webshell codes are identified and extracted, and then used to identify the code under test. However, this approach has an obvious drawback: feature extraction depends heavily on the experience of researchers, and most webshell authors now use obfuscated encryption encoding to bypass detection, so this kind of detection has a high probability of false alarms. The second approach observes the code as it executes in the network and analyzes the data stream produced during execution to determine whether the code is a webshell. This approach also has its disadvantages. More recently, researchers have introduced machine learning into the field of network security and used machine learning models and algorithms to detect webshells, but these methods often suffer from low efficiency and a high false alarm rate.

Therefore, a new webshell attack identification method, apparatus, electronic device, and computer-readable medium are needed.

The above information disclosed in this background section is only for enhancement of understanding of the background of the application and therefore it may contain information that does not constitute prior art that is already known to a person of ordinary skill in the art.

Disclosure of Invention

In view of this, the application provides a webshell attack identification method, a webshell attack identification device, an electronic device, and a computer readable medium, which can quickly, effectively, and accurately identify a webshell attack, improve identification efficiency, shorten identification time, and improve user experience on the premise of ensuring identification accuracy.

Other features and advantages of the present application will be apparent from the following detailed description, or may be learned by practice of the application.

According to one aspect of the application, a webshell attack identification method is provided, and the method comprises the following steps: acquiring webpage data on a target webpage; scanning webpage data based on the static feature library and the code feature library; when the matching result exists in the scanning result, acquiring behavior data of the target webpage; scanning the behavior data based on a behavior feature library; and when the hit result exists in the scanning result, determining that the webshell attack exists on the target webpage.

In an exemplary embodiment of the present application, the method further comprises: intercepting the target webpage and generating warning information when a webshell attack exists on the target webpage.

In an exemplary embodiment of the present application, further comprising: establishing the static feature library based on analysis of webshell data in webpage data of a plurality of webpages; establishing the code feature library based on analysis of encrypted webshell data in webpage data of a plurality of webpages; and establishing the behavior feature library based on analysis of behavior data in the webpage data of a plurality of webpages.

In an exemplary embodiment of the present application, acquiring web page data on a target web page includes: acquiring webpage identifiers of a plurality of webpages; and comparing the webpage identification with webpage identifications in a black list and a white list to determine the target webpage.

In an exemplary embodiment of the present application, comparing the webpage identifier with webpage identifiers in a blacklist and a whitelist to determine the target webpage includes: when the webpage identifier is in neither the blacklist nor the whitelist, determining that the webpage is a target webpage.

In an exemplary embodiment of the present application, the establishing the static feature library based on the analysis of webshell data in web page data of a plurality of web pages includes: acquiring webshell data in webpage data of a plurality of webpages; performing feature extraction on the webshell data to generate static feature data; generating the static feature library based on static feature data.

In an exemplary embodiment of the present application, the establishing the code feature library based on the analysis of encrypted webshell data in web page data of a plurality of web pages includes: acquiring encrypted webshell data in webpage data of a plurality of webpages; converting the encrypted webshell data into an opcode file; extracting code characteristic data in the opcode file through a natural language processing technology; generating the code feature library based on the code feature data.

In an exemplary embodiment of the present application, extracting code feature data in the opcode file by a natural language processing technique includes: converting data in the opcode file into a plurality of word vectors; determining the weight of a plurality of word vectors according to the depth and the times of data in the opcode file; extracting the code feature data from the plurality of word vectors based on the weights.

In an exemplary embodiment of the present application, the creating the behavior feature library based on analysis of behavior data in web page data of a plurality of web pages includes: acquiring behavior data in webpage data of a plurality of webpages meeting preset conditions; performing feature extraction on the behavior data to generate behavior feature data; generating the behavioral characteristic library based on the behavioral characteristic data.

According to an aspect of the present application, a webshell attack recognition apparatus is provided, the apparatus including: the data module is used for acquiring webpage data on a target webpage; the characteristic module is used for scanning the webpage data based on the static characteristic library and the code characteristic library; the behavior module is used for acquiring the behavior data of the target webpage when the matching result exists in the scanning result; the scanning module is used for scanning the behavior data based on a behavior feature library; and the attack module is used for determining that the webshell attack exists on the target webpage when the hit result exists in the scanning result.

According to an aspect of the present application, an electronic device is provided, the electronic device including: one or more processors; and storage means for storing one or more programs; the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method described above.

According to an aspect of the application, a computer-readable medium is proposed, on which a computer program is stored, which program, when being executed by a processor, carries out the method as above.

According to the webshell attack identification method, device, electronic device, and computer-readable medium of the application, webpage data on a target webpage is acquired; the webpage data is scanned based on the static feature library and the code feature library; when a matching result exists in the scanning result, behavior data of the target webpage is acquired; the behavior data is scanned based on a behavior feature library; and when a hit result exists in that scanning result, it is determined that a webshell attack exists on the target webpage. In this way, webshell attacks can be identified quickly, effectively, and accurately, identification efficiency is improved and identification time is shortened while identification accuracy is preserved, and user experience is improved.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.

Drawings

The above and other objects, features and advantages of the present application will become more apparent by describing in detail exemplary embodiments thereof with reference to the attached drawings. The drawings described below are only some embodiments of the present application, and other drawings may be derived from those drawings by those skilled in the art without inventive effort.

FIG. 1 is a flow chart illustrating a method for webshell attack identification in accordance with an exemplary embodiment.

FIG. 2 is a flowchart illustrating a method for webshell attack identification, according to an example embodiment.

FIG. 3 is a flowchart illustrating a method for webshell attack identification, according to another example embodiment.

FIG. 4 is a flowchart illustrating a method for webshell attack identification, according to another example embodiment.

FIG. 5 is a flowchart illustrating a method for webshell attack identification, according to another example embodiment.

FIG. 6 is a block diagram illustrating a webshell attack recognition apparatus according to an exemplary embodiment.

FIG. 7 is a block diagram illustrating an electronic device in accordance with an example embodiment.

FIG. 8 is a block diagram illustrating a computer-readable medium in accordance with an example embodiment.

Detailed Description

Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The same reference numerals denote the same or similar parts in the drawings, and thus, a repetitive description thereof will be omitted.

Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the application. One skilled in the relevant art will recognize, however, that the subject matter of the present application can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations, or operations have not been shown or described in detail to avoid obscuring aspects of the application.

The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.

The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.

It will be understood that, although the terms first, second, third, etc. may be used herein to describe various components, these components should not be limited by these terms. These terms are used to distinguish one element from another. Thus, a first component discussed below may be termed a second component without departing from the teachings of the present concepts. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.

It will be appreciated by those skilled in the art that the drawings are merely schematic representations of exemplary embodiments, and that the blocks or processes shown in the drawings are not necessarily required to practice the present application and are, therefore, not intended to limit the scope of the present application.

The inventor of the application has found that, in one prior-art scheme, a large number of webshells are analyzed directly without first being classified, a traditional feature library is built directly from them, and the data flow captured during monitoring is then checked to judge whether it is a webshell. However, many current webshells use obfuscated encryption encoding to bypass feature-library matching.

In another prior-art scheme, to prevent webshells that use obfuscated encryption encoding from bypassing feature-library matching, the php file is compiled directly into an opcode file, and keywords in the opcode file are then identified by some algorithm to build the feature library. The data flow captured during monitoring is then checked to judge whether it is a webshell. However, this approach is prone to extremely high false alarm and false negative rates.

The webshell attack identification method of the application takes full account of these difficulties in the prior art. It combines the high efficiency of static detection with the ability of dynamic detection to catch webshells that use obfuscated encryption encoding, and it also exploits the high efficiency of black-and-white lists for monitoring webpages whose safety has already been determined.

The following is a detailed description with the aid of specific examples.

FIG. 1 is a flow chart illustrating a method for webshell attack identification in accordance with an exemplary embodiment. The webshell attack recognition method 10 includes at least steps S102 to S110.

As shown in FIG. 1, in S102, webpage data on the target webpage is acquired. Webpage identifiers of a plurality of webpages can be acquired, and each webpage identifier is compared with the webpage identifiers in a blacklist and a whitelist to determine the target webpage.

More specifically, when the webpage identifier is in neither the blacklist nor the whitelist, the webpage is determined to be the target webpage. The activity of webpages on the blacklist is prohibited, and the activity of webpages on the whitelist is released automatically.

Furthermore, the whitelist is constructed after the business requirements and safety have been determined, and its accuracy must be ensured to avoid missed detections. The blacklist is constructed from log analysis. The black-and-white lists are built to optimize the efficiency of the system: in the first step, part of the webpage traffic is filtered out by the lists, which improves the efficiency of webshell detection in the system.
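The list-based filtering described above can be sketched as follows. This is only an illustrative sketch: the names `Triage`, `triage`, and `select_targets` are hypothetical, and webpage identifiers are assumed to be plain strings such as request paths.

```python
from enum import Enum


class Triage(Enum):
    BLOCK = "block"      # identifier is on the blacklist: activity prohibited
    ALLOW = "allow"      # identifier is on the whitelist: activity released
    INSPECT = "inspect"  # on neither list: treated as a target webpage


def triage(page_id, blacklist, whitelist):
    """Classify one webpage identifier against the two lists."""
    if page_id in blacklist:
        return Triage.BLOCK
    if page_id in whitelist:
        return Triage.ALLOW
    return Triage.INSPECT


def select_targets(page_ids, blacklist, whitelist):
    """Keep only the pages that still need feature-library scanning."""
    return [p for p in page_ids
            if triage(p, blacklist, whitelist) is Triage.INSPECT]
```

Storing both lists as sets keeps each lookup at constant expected time, which fits the stated purpose of filtering traffic cheaply before any scanning is done.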

In S104, the webpage data is scanned based on the static feature library and the code feature library. For a webpage of unknown security, the static feature library and the code feature library are called first to scan the webpage data. When a php file in the webpage under test hits either the static feature library or the code feature library, the behavior of that php file is monitored.

In S106, when there is a matching result in the scanning result, behavior data of the target webpage is acquired. A matching result suggests that the webpage data carries a risk, but to avoid a false alarm the next scan is still performed before a conclusion is drawn.

In S108, the behavior data is scanned based on the behavior feature library.

At S110, when there is a hit in the scan result, it is determined that there is a webshell attack on the target webpage.

In one embodiment, when a webshell attack exists on the target webpage, the target webpage can be intercepted and warning information can be generated. For example, a webpage that has been intercepted and alarmed many times is added to the blacklist only after manual verification. This is mainly because the detection system uses the blacklist as its first detection gate, so the accuracy of the blacklist must be ensured to prevent false alarms.

According to the webshell attack identification method of the application, webpage data on a target webpage is acquired; the webpage data is scanned based on the static feature library and the code feature library; when a matching result exists in the scanning result, behavior data of the target webpage is acquired; the behavior data is scanned based on a behavior feature library; and when a hit result exists in that scanning result, it is determined that a webshell attack exists on the target webpage. In this way, webshell attacks can be identified quickly, effectively, and accurately, identification efficiency is improved and identification time is shortened while identification accuracy is preserved, and user experience is improved.

It should be clearly understood that this application describes how to make and use particular examples, but the principles of this application are not limited to any details of these examples. Rather, these principles can be applied to many other embodiments based on the teachings of the present disclosure.

FIG. 2 is a flowchart illustrating a method for webshell attack identification, according to an example embodiment. The process 20 shown in FIG. 2 is a detailed description of the process shown in FIG. 1.

As shown in FIG. 2, in S202, it is determined whether the blacklist is hit.

In S204, it is determined whether the whitelist is hit.

In S206, it is determined whether the static feature library is hit.

In S208, it is determined whether the code feature library is hit.

In S210, the webpage behavior is monitored to generate behavior data. If the static feature library or the code feature library is hit, the webpage is marked and its behavior track is followed.

In S212, it is determined whether the behavior feature library is hit.

In S214, the webpage is intercepted and an alarm is raised. Network security personnel can check the addresses of webpages that have been intercepted and alarmed many times, and add a webpage to the blacklist once the check confirms that it contains a webshell.

In S216, no processing is performed.
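The decision flow of S202 to S216 can be sketched as a single function. This is a minimal sketch under stated assumptions: the function name `identify`, the string outcomes, and the four callables passed in are all hypothetical, and returning "block" on a blacklist hit is one reading of "activity is prohibited" rather than a branch the flowchart text spells out.

```python
def identify(page_id, page_data, blacklist, whitelist,
             static_hit, code_hit, collect_behavior, behavior_hit):
    """Return "block", "pass", or "intercept_and_alarm" for one webpage."""
    if page_id in blacklist:                      # S202
        return "block"
    if page_id in whitelist:                      # S204
        return "pass"
    # S206 / S208: a hit in either library is enough to continue.
    if not (static_hit(page_data) or code_hit(page_data)):
        return "pass"                             # S216
    behavior = collect_behavior(page_id)          # S210: monitor behavior
    if behavior_hit(behavior):                    # S212
        return "intercept_and_alarm"              # S214
    return "pass"                                 # S216
```

Behavior monitoring, the expensive step, runs only for pages that pass both list checks and hit a feature library, which is where the efficiency gain described in the text comes from.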

FIG. 3 is a flowchart illustrating a method for webshell attack identification, according to another example embodiment. The process 30 shown in FIG. 3 is a detailed description of "building the static feature library based on analysis of webshell data in web page data of multiple web pages".

As shown in FIG. 3, in S302, webshell data in the webpage data of the plurality of webpages is acquired. A large amount of data can be collected from the GitHub website, and the webshell data is distinguished from normal non-webshell data. The webshell attack samples are then classified into encrypted webshell data and webshell data that uses obfuscated encryption encoding techniques.

In S304, feature extraction is performed on the webshell data to generate static feature data.

In S306, the static feature library is generated based on the static feature data.

The static feature data used to build the static feature library can be extracted from features typical of common static php webshells, such as command execution functions and other common sensitive functions, for example file operation functions, database operation functions, and callback functions.
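A static feature library of this kind can be sketched as a set of per-category patterns. The function names listed below are ordinary PHP built-ins chosen purely for illustration; the source names only the categories, so the concrete entries, `STATIC_FEATURES`, `build_patterns`, and `scan_static` are all assumptions.

```python
import re

# Illustrative entries only; the source lists categories, not function names.
STATIC_FEATURES = {
    "command_execution": ["system", "exec", "shell_exec", "passthru",
                          "popen", "proc_open"],
    "file_operation": ["fopen", "fwrite", "file_put_contents", "unlink"],
    "database_operation": ["mysqli_query"],
    "callback": ["call_user_func", "call_user_func_array", "array_map"],
}


def build_patterns(features):
    """Compile one case-insensitive call pattern per category."""
    patterns = {}
    for category, names in features.items():
        alternation = "|".join(re.escape(name) for name in names)
        # The lookbehind rejects longer identifiers and variables,
        # e.g. my_exec( or $system( do not count as hits.
        patterns[category] = re.compile(
            r"(?<![\w$])(" + alternation + r")\s*\(", re.IGNORECASE)
    return patterns


def scan_static(source, patterns):
    """Return {category: sorted function names} for every category hit."""
    hits = {}
    for category, pattern in patterns.items():
        found = sorted({m.group(1).lower() for m in pattern.finditer(source)})
        if found:
            hits[category] = found
    return hits
```

A non-empty result from `scan_static` would correspond to a hit on the static feature library. Plain pattern matching like this is fast, but it is exactly what obfuscated samples evade, which is why the code feature library is built separately.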

FIG. 4 is a flowchart illustrating a method for webshell attack identification, according to another example embodiment. The process 40 shown in FIG. 4 is a detailed description of "building the code feature library based on analysis of encrypted webshell data in web page data of multiple web pages".

As shown in fig. 4, in S402, encrypted webshell data is obtained from web page data of a plurality of web pages.

In S404, the encrypted webshell data is converted into an opcode file. The encrypted webshell data, which is in PHP form, is compiled to generate the opcode.

In S406, code feature data in the opcode file is extracted by a natural language processing technique: the data in the opcode file is converted into a plurality of word vectors; the weights of the word vectors are determined according to the depth and the number of occurrences of the data in the opcode file; and the code feature data is extracted from the word vectors based on the weights. The word vectors are sorted by weight from high to low, and the data corresponding to the top N word vectors is taken as the code feature data.

First, a bag-of-words model with weight labels is used to construct the word vectors, converting the opcode file into word vectors; when the model is constructed, a label representing depth in the document is attached to each word vector.

Then the keywords among all the word vectors are found using the WTF-IDF algorithm, which is an improvement on the TF-IDF (term frequency-inverse document frequency) algorithm.

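In the TF-IDF algorithm, TF (term frequency) indicates how often a single word vector occurs in a document, i.e. its number of occurrences. But since the lengths of documents often differ greatly, the count is normalized by the document length:

$$TF = \frac{\text{number of occurrences of the word vector in the document}}{\text{total number of word vectors in the document}}$$

IDF (inverse document frequency) reflects the number of documents containing a word in the entire document library to be analyzed:

$$IDF = \log\frac{\text{total number of documents in the library}}{\text{number of documents containing the word vector}}$$

$$TF\text{-}IDF = TF \times IDF$$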
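For reference, conventional TF-IDF can be sketched as follows, assuming the common definitions: TF is the occurrence count divided by the document length, and IDF is the logarithm of the number of documents divided by the number of documents containing the token. The opcode names in the usage note are illustrative tokens, and every function name here is hypothetical.

```python
import math
from collections import Counter


def tf(tokens):
    """Occurrences of each token, normalized by document length."""
    counts = Counter(tokens)
    total = len(tokens)
    return {token: count / total for token, count in counts.items()}


def idf(corpus):
    """log(number of documents / number of documents containing the token)."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for tokens in corpus:
        doc_freq.update(set(tokens))  # count each document at most once
    return {token: math.log(n_docs / df) for token, df in doc_freq.items()}


def tf_idf(tokens, idf_table):
    """TF-IDF = TF x IDF for every token of one document in the corpus."""
    return {token: value * idf_table[token]
            for token, value in tf(tokens).items()}
```

With `corpus = [["ECHO", "ECHO", "RETURN"], ["INCLUDE_OR_EVAL", "RETURN"]]`, the token `RETURN` appears in every document, so its IDF and therefore its TF-IDF score are zero, while `ECHO` in the first document scores (2/3) × ln 2.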

The inventor of the application observes that, when extracting keywords, the ordinary TF-IDF algorithm considers only how often a word vector occurs and treats a higher frequency as more critical. In webshell data, however, some words may appear only once or twice and yet be extremely critical. The application therefore improves the TF-IDF algorithm and proposes the WTF-IDF algorithm.

Words at different depths in the data of one webshell differ in importance, so when keywords are extracted, the depth of a word is also taken as one of the criteria for judging its importance.

In webshell traffic, locations at shallower depth are more important. Any opcode file can therefore be divided by depth into several parts, for example 10 parts, whose importance is assigned as a weight value W running from 10 for the shallowest part down to 1 for the deepest.
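The depth weighting can be sketched as below. The sketch assumes that "depth" means the position of a token in the opcode sequence, which the source does not state explicitly; `depth_weights` is a hypothetical helper.

```python
def depth_weights(tokens, parts=10):
    """Give each token a weight from `parts` (shallowest) down to 1 (deepest).

    The token sequence is cut into `parts` slices of near-equal length by
    position; every token in a slice receives that slice's weight.
    """
    n = len(tokens)
    weights = []
    for position in range(n):
        part = position * parts // n   # slice index, 0 .. parts - 1
        weights.append(parts - part)   # slice 0 -> parts, last slice -> 1
    return weights
```

For a file of 20 tokens and the default 10 parts, the first two tokens receive weight 10, the next two weight 9, and so on down to weight 1 for the last two.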

This gives:

$$WTF_i = W_i \times \frac{N_i}{N}$$

where N_i represents the number of times word vector i appears in the opcode file, N represents the total number of word vectors in the opcode file, and W_i represents the weight value of word vector i.

$$IDF_i = \log\frac{RN}{RN_i}$$

where RN represents the total number of opcode files to be parsed and RN_i represents the number of opcode files containing word vector i.
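A possible reading of WTF-IDF is sketched below. It rests on several assumptions that the text does not confirm: the weighted term frequency is taken as W_i × N_i / N, the inverse document frequency as log(RN / RN_i), the final score as their product by analogy with TF-IDF, and W_i as the depth weight of the token's first (shallowest) occurrence. All function names are hypothetical.

```python
import math
from collections import Counter


def position_weight(position, length, parts=10):
    """Depth weight: `parts` for the shallowest slice, 1 for the deepest."""
    return parts - (position * parts // length)


def wtf(tokens, parts=10):
    """Assumed weighted term frequency W_i * N_i / N for one opcode file."""
    counts = Counter(tokens)                      # N_i for every token
    n = len(tokens)                               # N
    first_weight = {}
    for position, token in enumerate(tokens):
        # Assumption: W_i is the weight of the shallowest occurrence.
        first_weight.setdefault(token, position_weight(position, n, parts))
    return {t: first_weight[t] * counts[t] / n for t in counts}


def idf(corpus):
    """Assumed IDF_i = log(RN / RN_i) over all opcode files."""
    rn = len(corpus)                              # RN
    rn_i = Counter()
    for tokens in corpus:
        rn_i.update(set(tokens))                  # RN_i
    return {t: math.log(rn / c) for t, c in rn_i.items()}


def top_keywords(tokens, corpus, top_n=3, parts=10):
    """Rank the tokens of one file (which must be in corpus); keep the top N."""
    idf_table = idf(corpus)
    weighted = wtf(tokens, parts)
    scores = {t: weighted[t] * idf_table[t] for t in weighted}
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]
```

In a file such as `["INCLUDE_OR_EVAL", "ECHO", "ECHO", "ECHO", "RETURN"]`, the single shallow `INCLUDE_OR_EVAL` can outrank the more frequent `ECHO` when `ECHO` is common across the corpus, which matches the stated motivation that a rare word may still be extremely critical.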

In S408, the code feature library is generated based on the code feature data.

FIG. 5 is a flowchart illustrating a method for webshell attack identification, according to another example embodiment. The flow 50 shown in fig. 5 is a detailed description of "establishing the behavior feature library based on analysis of behavior data in web page data of a plurality of web pages".

In S502, behavior data in the web page data of the plurality of web pages that satisfy the preset condition is acquired.

In S504, feature extraction is performed on the behavior data to generate behavior feature data.

In S506, the behavior feature library is generated based on the behavior feature data.

In one embodiment, the attacker's purpose in using a webshell file can be analyzed. A webshell is used to attack a target machine and obtain authority in order to gather further information, and its behavior often falls into the following categories: modifying, deleting, viewing, and downloading files on the target machine; operating on directories on the target machine; operating on the registry of the target machine; and adding, deleting, modifying, and searching records in databases on the target machine. These operation behaviors are sorted and analyzed, and the behavior feature library is finally established from them.
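The behavior categories listed above can be sketched as a small lookup structure. This is only an illustration: the event shape (`target`, `action`), the action names, and the use of `None` to mean "any operation" are assumptions, and the file actions follow one reading of the translated text.

```python
from collections import namedtuple

# Hypothetical shape of one monitored behavior event.
BehaviorEvent = namedtuple("BehaviorEvent", ["target", "action"])

# None means any operation on that kind of target counts as a feature.
BEHAVIOR_FEATURES = {
    "file": {"modify", "delete", "view", "download"},
    "directory": None,
    "registry": None,
    "database": {"add", "delete", "modify", "search"},
}


def matching_events(events, features=BEHAVIOR_FEATURES):
    """Return the events that fall into one of the listed categories."""
    hits = []
    for event in events:
        if event.target not in features:
            continue
        actions = features[event.target]
        if actions is None or event.action in actions:
            hits.append(event)
    return hits


def behavior_hit(events, features=BEHAVIOR_FEATURES):
    """True when at least one monitored event matches the library."""
    return bool(matching_events(events, features))
```

A real library would likely need finer conditions than a category match, since ordinary webpages also read files and query databases.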

Those skilled in the art will appreciate that all or part of the steps implementing the above embodiments may be implemented as computer programs executed by a CPU. When executed by the CPU, the programs perform the functions defined by the methods provided herein. The programs may be stored in a computer-readable storage medium. It should be noted that the above figures are only schematic illustrations of the processes involved in the method according to exemplary embodiments of the present application and are not meant to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.

The following are embodiments of the apparatus of the present application that may be used to perform embodiments of the method of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application.

FIG. 6 is a block diagram illustrating a webshell attack recognition apparatus according to an exemplary embodiment. As shown in FIG. 6, the webshell attack recognition apparatus 60 includes: a data module 602, a feature module 604, a behavior module 606, a scanning module 608, and an attack module 610.

The data module 602 is configured to obtain web page data on a target web page;

the feature module 604 is configured to scan the web page data based on the static feature library and the code feature library;

the behavior module 606 is configured to obtain behavior data of the target webpage when a matching result exists in the scanning result;

the scanning module 608 is configured to scan the behavior data based on a behavior feature library;

the attack module 610 is configured to determine that a webshell attack exists on the target webpage when a hit result exists in the scanning result.

According to the webshell attack recognition device of the application, webpage data on a target webpage is acquired; the webpage data is scanned based on the static feature library and the code feature library; when a matching result exists in the scanning result, behavior data of the target webpage is acquired; the behavior data is scanned based on a behavior feature library; and when a hit result exists in that scanning result, it is determined that a webshell attack exists on the target webpage. In this way, webshell attacks can be identified quickly, effectively, and accurately, identification efficiency is improved and identification time is shortened while identification accuracy is preserved, and user experience is improved.

An electronic device 700 according to this embodiment of the present application is described below with reference to fig. 7. The electronic device 700 shown in fig. 7 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.

As shown in fig. 7, electronic device 700 is embodied in the form of a general purpose computing device. The components of the electronic device 700 may include, but are not limited to: at least one processing unit 710, at least one memory unit 720, a bus 730 that connects the various system components (including the memory unit 720 and the processing unit 710), a display unit 740, and the like.

The memory unit 720 stores program code that can be executed by the processing unit 710, so that the processing unit 710 performs the steps according to the various exemplary embodiments of the present application described in this specification. For example, the processing unit 710 may perform the steps shown in FIG. 1 to FIG. 5.

The memory unit 720 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM) 7201 and/or a cache memory unit 7202, and may further include a read-only memory unit (ROM) 7203.

The memory unit 720 may also include a program/utility 7204 having a set (at least one) of program modules 7205, such program modules 7205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.

Bus 730 may be any representation of one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.

The electronic device 700 may also communicate with one or more external devices 700' (e.g., keyboard, pointing device, Bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 700, and/or with any device (e.g., router, modem, etc.) that enables the electronic device 700 to communicate with one or more other computing devices. Such communication may occur via an input/output (I/O) interface 750. Also, the electronic device 700 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the internet) via the network adapter 760. The network adapter 760 may communicate with other modules of the electronic device 700 via the bus 730. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 700, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.

Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, as shown in fig. 8, the technical solution according to the embodiment of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and which includes several instructions to enable a computing device (which may be a personal computer, a server, or a network device, etc.) to execute the above method according to the embodiment of the present application.

The software product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

The computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the Internet using an Internet service provider).

The computer readable medium carries one or more programs which, when executed by a device, cause the device to perform the functions of: acquiring webpage data on a target webpage; scanning the webpage data based on a static feature library and a code feature library; when a matching result exists in the scanning result, acquiring behavior data of the target webpage; scanning the behavior data based on a behavior feature library; and when a hit result exists in the scanning result, determining that a webshell attack exists on the target webpage. The programs may also cause the device to perform the following functions: establishing the static feature library based on analysis of webshell data in webpage data of a plurality of webpages; establishing the code feature library based on analysis of encrypted webshell data in webpage data of a plurality of webpages; and establishing the behavior feature library based on analysis of behavior data in the webpage data of a plurality of webpages.
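By way of non-limiting illustration, the two-stage flow recited above may be sketched as follows. The feature patterns, the behavior event names, and the function names `scan_page_data`, `scan_behavior` and `identify_webshell` are hypothetical and are not taken from the present application; behavior data is passed as a callable so that it is acquired only after the first scan produces a match.

```python
import re
from typing import Callable, Iterable

# Hypothetical feature libraries; these patterns and event names are
# illustrative assumptions, not the libraries of the application.
STATIC_FEATURES = [re.compile(r"eval\s*\(\s*\$_(POST|GET|REQUEST)")]
CODE_FEATURES = [re.compile(r"base64_decode\s*\(")]
BEHAVIOR_FEATURES = {"spawn_shell", "write_outside_webroot"}


def scan_page_data(page_data: str) -> bool:
    """Stage 1: True if any static or code feature matches the page data."""
    return any(p.search(page_data) for p in STATIC_FEATURES + CODE_FEATURES)


def scan_behavior(events: Iterable[str]) -> bool:
    """Stage 2: True if any observed event is in the behavior library."""
    return any(event in BEHAVIOR_FEATURES for event in events)


def identify_webshell(page_data: str,
                      fetch_behavior: Callable[[], Iterable[str]]) -> bool:
    """Report an attack only when both scans produce a result."""
    if not scan_page_data(page_data):
        return False  # no match: behavior data is never acquired
    return scan_behavior(fetch_behavior())
```

In this sketch, a page that passes the first scan without a match incurs no behavior collection at all, which is one way the staged arrangement can shorten identification time.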

Those skilled in the art will appreciate that the modules described above may be distributed in the apparatus according to the description of the embodiments, or may be located, with corresponding changes, in one or more apparatuses different from those of the embodiments. The modules of the above embodiments may be combined into one module, or further split into multiple sub-modules.

Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiment of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and which includes several instructions to enable a computing device (which can be a personal computer, a server, a mobile terminal, or a network device, etc.) to execute the method according to the embodiment of the present application.

Exemplary embodiments of the present application are specifically illustrated and described above. It is to be understood that the application is not limited to the details of construction, arrangement, or method of implementation described herein; on the contrary, the intention is to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims

1. A webshell attack identification method is characterized by comprising the following steps:

acquiring webpage data on a target webpage;

scanning webpage data based on the static feature library and the code feature library;

when the matching result exists in the scanning result, acquiring behavior data of the target webpage;

scanning the behavior data based on a behavior feature library;

and when the hit result exists in the scanning result, determining that the webshell attack exists on the target webpage.

2. The method of claim 1, further comprising:

and intercepting the target webpage and generating warning information when the webshell attack exists on the target webpage.

3. The method of claim 1, further comprising:

establishing the static feature library based on analysis of webshell data in webpage data of a plurality of webpages;

establishing the code feature library based on analysis of encrypted webshell data in webpage data of a plurality of webpages;

and establishing the behavior feature library based on analysis of behavior data in the webpage data of a plurality of webpages.

4. The method of claim 1, wherein obtaining web page data on a target web page comprises:

acquiring webpage identifiers of a plurality of webpages;

and comparing the webpage identification with webpage identifications in a black list and a white list to determine the target webpage.

5. The method of claim 4, wherein comparing the web page identifier to web page identifiers in a black list and a white list to determine the target web page comprises:

and when the webpage identification is not in the black list and is not in the white list, determining that the webpage is a target webpage.
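By way of non-limiting illustration, the selection recited in claims 4 and 5 may be sketched as follows. The function name `select_target_pages` and the use of plain strings as webpage identifications are assumptions made for the example; how webpages found on either list are handled is outside this sketch.

```python
from typing import Iterable, List


def select_target_pages(page_ids: Iterable[str],
                        black_list: Iterable[str],
                        white_list: Iterable[str]) -> List[str]:
    """Return the identifications found in neither list, in input order."""
    black, white = set(black_list), set(white_list)
    return [pid for pid in page_ids if pid not in black and pid not in white]
```

For example, `select_target_pages(["a", "b", "c"], ["a"], ["c"])` keeps only `"b"` as a target webpage.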

6. The method of claim 3, wherein building the static feature library based on analysis of webshell data in web page data of a plurality of web pages comprises:

acquiring webshell data in webpage data of a plurality of webpages;

performing feature extraction on the webshell data to generate static feature data;

generating the static feature library based on static feature data.

7. The method of claim 3, wherein building the code feature library based on analysis of encrypted webshell data in web page data of a plurality of web pages comprises:

acquiring encrypted webshell data in webpage data of a plurality of webpages;

converting the encrypted webshell data into an opcode file;

extracting code characteristic data in the opcode file through a natural language processing technology;

generating the code feature library based on the code feature data.

8. The method of claim 7, wherein extracting code feature data in the opcode file by natural language processing techniques comprises:

converting data in the opcode file into a plurality of word vectors;

determining the weights of the plurality of word vectors according to the depth and the number of occurrences of data in the opcode file;

extracting the code feature data from the plurality of word vectors based on the weights.
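By way of non-limiting illustration, the weighting recited in claim 8 may be sketched as follows. The claim does not define "depth" or give a weighting formula; this sketch assumes that each opcode arrives with a nesting depth and that the weight is the occurrence count multiplied by one plus the mean depth. The word-vector conversion is reduced here to treating each distinct opcode as one token, and the opcode names and the functions `weight_opcodes` and `top_features` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def weight_opcodes(tokens: Iterable[Tuple[str, int]]) -> Dict[str, float]:
    """Map each opcode to a weight from its (opcode, depth) occurrences."""
    counts: Dict[str, int] = defaultdict(int)
    depth_sums: Dict[str, int] = defaultdict(int)
    for opcode, depth in tokens:
        counts[opcode] += 1
        depth_sums[opcode] += depth
    # Assumed formula: occurrences x (1 + mean depth); not from the claims.
    return {op: counts[op] * (1 + depth_sums[op] / counts[op]) for op in counts}


def top_features(tokens: Iterable[Tuple[str, int]], k: int) -> List[str]:
    """Keep the k highest-weighted opcodes; ties break alphabetically."""
    weights = weight_opcodes(tokens)
    return sorted(weights, key=lambda op: (-weights[op], op))[:k]
```

Under these assumptions, an opcode that occurs often and deep in the opcode file outranks one that occurs once at the top level, and the retained opcodes would serve as the code feature data.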

9. The method of claim 3, wherein building the behavioral characteristic library based on analysis of behavioral data in web page data of a plurality of web pages comprises:

acquiring behavior data in webpage data of a plurality of webpages meeting preset conditions;

performing feature extraction on the behavior data to generate behavior feature data;

generating the behavioral characteristic library based on the behavioral characteristic data.

10. A webshell attack recognition device, comprising:

the data module is used for acquiring webpage data on a target webpage;

the characteristic module is used for scanning the webpage data based on the static characteristic library and the code characteristic library;

the behavior module is used for acquiring the behavior data of the target webpage when the matching result exists in the scanning result;

the scanning module is used for scanning the behavior data based on a behavior feature library;

and the attack module is used for determining that the webshell attack exists on the target webpage when the hit result exists in the scanning result.