CN114422148B - Framework depiction and detection method, device and equipment of Webshell - Google Patents

Info

Publication number
CN114422148B
Authority
CN
China
Prior art keywords
webshell
skeleton
token
token sequence
code
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210299498.0A
Other languages
Chinese (zh)
Other versions
CN114422148A (en
Inventor
陈靖远
李昌志
蒋倩
张嘉欢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Changting Future Technology Co ltd
Original Assignee
Beijing Changting Future Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Changting Future Technology Co ltd filed Critical Beijing Changting Future Technology Co ltd
Priority to CN202210299498.0A priority Critical patent/CN114422148B/en
Publication of CN114422148A publication Critical patent/CN114422148A/en
Application granted granted Critical
Publication of CN114422148B publication Critical patent/CN114422148B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00: Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/32: Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols including means for verifying the identity or authority of a user of the system or for message authentication, e.g. authorization, entity authentication, data integrity or data verification, non-repudiation, key authentication or verification of credentials
    • H04L9/3247: Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols including means for verifying the identity or authority of a user of the system or for message authentication, e.g. authorization, entity authentication, data integrity or data verification, non-repudiation, key authentication or verification of credentials involving digital signatures
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/903: Querying
    • G06F16/90335: Query processing
    • G06F16/90344: Query processing by using string matching techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/253: Grammatical analysis; Style critique
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00: Network arrangements or protocols for supporting network services or applications
    • H04L67/01: Protocols
    • H04L67/02: Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00: Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/32: Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols including means for verifying the identity or authority of a user of the system or for message authentication, e.g. authorization, entity authentication, data integrity or data verification, non-repudiation, key authentication or verification of credentials
    • H04L9/321: Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols including means for verifying the identity or authority of a user of the system or for message authentication, e.g. authorization, entity authentication, data integrity or data verification, non-repudiation, key authentication or verification of credentials involving a third party or a trusted authority
    • H04L9/3213: Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols including means for verifying the identity or authority of a user of the system or for message authentication, e.g. authorization, entity authentication, data integrity or data verification, non-repudiation, key authentication or verification of credentials involving a third party or a trusted authority using tickets or tokens, e.g. Kerberos
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00: Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/32: Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols including means for verifying the identity or authority of a user of the system or for message authentication, e.g. authorization, entity authentication, data integrity or data verification, non-repudiation, key authentication or verification of credentials
    • H04L9/3236: Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols including means for verifying the identity or authority of a user of the system or for message authentication, e.g. authorization, entity authentication, data integrity or data verification, non-repudiation, key authentication or verification of credentials using cryptographic hash functions

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Mobile Radio Communication Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention provides a method, a device and equipment for describing and detecting the skeleton of a known Webshell, which are used for carrying out lexical analysis on the known Webshell and reasonably generalizing the lexical analysis, so as to describe the skeleton of the known Webshell and construct a Webshell skeleton database according to the skeleton; and describing the skeleton of the code to be detected in the same way, and inquiring whether a matched skeleton exists in the constructed Webshell skeleton database so as to judge whether the code to be detected is a Webshell. The embodiment of the invention ensures that the Webshell detection is efficient and accurate, has extremely strong adaptability and expansibility, can detect codes written in multiple languages, and can achieve ideal detection effect in a real detection environment.

Description

Framework depiction and detection method, device and equipment of Webshell
Technical Field
The embodiment of the invention relates to the technical field of network security, in particular to a method, a device and equipment for describing and detecting a framework of a Webshell.
Background
The existing regular-expression-based Webshell detection approach is limited by the generalization capability of regular languages, is easily bypassed by obfuscated Webshells, and requires rules to be added manually, so that the rule-writing process is tedious and error-prone. Other Webshell detection approaches based on machine learning and neural networks lack interpretability, are difficult to operate, require a long training process, and are inconvenient to use.
Disclosure of Invention
Therefore, the embodiment of the invention provides a method, a device and equipment for describing and detecting the skeleton of a Webshell, which are used for solving the technical problems of how to improve the expansibility, convenience, generalization capability and accuracy of Webshell detection.
In order to achieve the above object, the embodiment of the present invention provides the following technical solutions:
according to a first aspect of an embodiment of the present invention, an embodiment of the present application provides a method for skeleton characterization and detection of Webshell, where the method includes:
analyzing the known Webshell file into a first Token sequence;
performing skeleton depiction on the first Token sequence to obtain a second Token sequence;
generalizing the second Token sequence to obtain a third Token sequence;
constructing a Webshell skeleton database according to a first digest value of the signature of the third Token sequence;
and extracting the skeleton of the code to be detected and judging whether the code to be detected is a Webshell or not based on the Webshell skeleton database.
Further, parsing the known Webshell file into a corresponding first Token sequence includes:
defining a BaseToken structure, the BaseToken structure comprising: Name and Text of type string, and Type of type int;
defining a Token array of a BaseToken type, wherein the Token array is used for storing a Token sequence with analysis completed;
creating an input character stream according to the known Webshell file, and assigning a corresponding lexical analyzer to the input character stream;
creating a lexical symbol stream and assigning the lexical symbol stream to a corresponding lexical analyzer;
decomposing the character stream into a plurality of lexical symbol objects by a lexical analyzer;
and obtaining all Token in the lexical symbol stream to a Token array defined previously to form a first Token sequence.
Further, performing skeleton characterization on the first Token sequence to obtain a second Token sequence, including:
filtering a first Token without substantial influence on Webshell meaning from the first Token sequence;
extracting key functions in the first Token sequence to serve as key nodes of a first Webshell skeleton;
extracting symbols with key meanings from the first Token sequence to serve as key nodes of a second Webshell skeleton;
and forming the second Token sequence by using the filtered first Token sequence and the first Webshell skeleton key node and the second Webshell skeleton key node.
Further, generalizing the second Token sequence to obtain a third Token sequence, including:
generalizing each second Token, that is, each non-skeleton node that was not extracted, to the corresponding one of IDENTIFIER, STRINGLITERAL and INTEGERLITERAL;
wherein the second Token includes: variables, class names, functions, character strings and numbers of non-skeleton nodes in the second Token sequence.
Further, constructing a Webshell skeleton database according to the first digest value of the third Token sequence signature, including:
respectively marking non-skeleton nodes generalized to IDENTIFIER, STRINGLITERAL, INTEGERLITERAL in the third Token sequence as a first generalization point, a second generalization point and a third generalization point;
marking the first generalization point, the second generalization point and the third generalization point in the forms of IDE_x, STR_y and INT_z respectively to be used as Webshell signatures corresponding to the Webshell files, wherein x, y and z are integer numbers and represent the serial numbers of the types of the generalization points;
calculating a first Hash value of the Webshell signature, and taking the first Hash value as a first digest value;
and establishing and storing an index of the Webshell by using the first digest value for subsequent detection and query, thereby completing the construction of the Webshell skeleton database.
Further, extracting a skeleton of the code to be detected and judging whether the code to be detected is a Webshell based on the Webshell skeleton database, including:
checking whether the code to be detected is encoded or obfuscated;
if the code to be detected is encoded or obfuscated, decoding or restoring the code to be detected; extracting and generalizing the Webshell skeleton of the decoded or restored code to be detected;
if the code to be detected is not encoded or obfuscated, directly extracting a Webshell skeleton of the code to be detected and generalizing;
inquiring whether a Webshell skeleton matched with the code to be detected exists in a constructed Webshell skeleton database;
if the constructed Webshell skeleton database has a Webshell skeleton matched with the code to be detected, the code to be detected is a Webshell;
if the constructed Webshell skeleton database does not have the Webshell skeleton matched with the code to be detected, the code to be detected is not the Webshell.
Further, extracting and generalizing the decoded or restored Webshell skeleton of the code to be detected, or directly extracting and generalizing the Webshell skeleton of the code to be detected, including:
analyzing the decoded or restored code to be detected into a fourth Token sequence, or directly analyzing the code to be detected into the fourth Token sequence;
filtering a third Token without substantial influence on Webshell meaning from the fourth Token sequence;
extracting key functions in the fourth Token sequence to serve as third Webshell skeleton key nodes;
extracting symbols with key meanings from the fourth Token sequence to serve as key nodes of a fourth Webshell skeleton;
forming a fifth Token sequence by using the filtered fourth Token sequence, the third Webshell skeleton key node and the fourth Webshell skeleton key node;
generalizing each fourth Token, that is, each non-skeleton node that was not extracted, to the corresponding one of IDENTIFIER, STRINGLITERAL and INTEGERLITERAL to obtain a sixth Token sequence;
wherein the fourth Token comprises: variables, class names, functions, character strings and numbers of non-skeleton nodes in the fifth Token sequence.
Further, resolving the decoded or restored code to be detected into a fourth Token sequence, or directly resolving the code to be detected into the fourth Token sequence, including:
defining a BaseToken structure, the BaseToken structure comprising: Name and Text of type string, and Type of type int;
defining a Token array of a BaseToken type, wherein the Token array is used for storing a Token sequence with analysis completed;
creating an input character stream according to the code to be detected (decoded or restored where applicable), and assigning a corresponding lexical analyzer to the input character stream;
creating a lexical symbol stream and assigning the lexical symbol stream to a corresponding lexical analyzer;
decomposing the character stream into a plurality of lexical symbol objects by a lexical analyzer;
and obtaining all Token in the lexical symbol stream into a Token array defined previously to form a fourth Token sequence.
Further, querying whether the constructed Webshell skeleton database has the Webshell skeleton matched with the code to be detected or not includes:
marking non-skeleton nodes generalized to IDENTIFIER, STRINGLITERAL, INTEGERLITERAL in the sixth Token sequence as a fourth generalization point, a fifth generalization point and a sixth generalization point respectively;
marking the fourth generalization point, the fifth generalization point and the sixth generalization point in the forms of IDE_x, STR_y and INT_z respectively as code signatures to be detected, wherein x, y and z are integer numbers and represent serial numbers of the types of the generalization points;
calculating a second Hash value of the code signature to be detected, and taking the second Hash value as a second digest value;
querying whether a first digest value matching the second digest value exists in the constructed Webshell skeleton database;
if a first digest value matching the second digest value exists, a Webshell skeleton matching the code to be detected exists in the constructed Webshell skeleton database;
and if no first digest value matching the second digest value exists, no Webshell skeleton matching the code to be detected exists in the constructed Webshell skeleton database.
According to a second aspect of the embodiments of the present invention, an embodiment of the present application provides a device for skeleton characterization and detection of Webshell, where the device includes:
the analysis module is used for analyzing the known Webshell file into a first Token sequence;
the depicting module is used for carrying out skeleton depicting on the first Token sequence to obtain a second Token sequence;
the generalization module is used for generalizing the second Token sequence to obtain a third Token sequence;
the storage module is used for constructing a Webshell skeleton database according to a first digest value of the signature of the third Token sequence;
and the detection module is used for extracting the skeleton of the code to be detected and judging whether the code to be detected is a Webshell based on the Webshell skeleton database.
According to a third aspect of an embodiment of the present invention, there is provided a device for skeleton characterization and detection of Webshell, the device including: a processor and a memory;
the memory is used for storing one or more program instructions;
the processor is configured to execute one or more program instructions to perform the steps of a Webshell skeleton characterization and detection method as described in any one of the above.
Compared with the prior art, the embodiment of the invention provides a method, a device and equipment for describing and detecting the skeleton of a Webshell, which are used for carrying out lexical analysis on a known Webshell and reasonably generalizing the lexical analysis, so as to describe the skeleton of the known Webshell, thereby constructing a Webshell skeleton database; and describing the skeleton of the code to be detected in the same way, and inquiring whether a matched skeleton exists in the constructed Webshell skeleton database so as to judge whether the code to be detected is a Webshell. The embodiment of the invention ensures that the Webshell detection is efficient and accurate, has extremely strong adaptability and expansibility, can detect codes written in multiple languages, and can achieve ideal detection effect in a real detection environment.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It will be apparent to those of ordinary skill in the art that the drawings in the following description are exemplary only and that other implementations can be obtained from the extensions of the drawings provided without inventive effort.
The structures, proportions, sizes, etc. shown in the present specification are shown only for the purposes of illustration and description, and are not intended to limit the scope of the invention, which is defined by the claims, so that any structural modifications, changes in proportions, or adjustments of sizes, which do not affect the efficacy or the achievement of the present invention, should fall within the ambit of the technical disclosure.
Fig. 1 is a schematic structural diagram of a device for describing and detecting a Webshell skeleton according to an embodiment of the present invention;
fig. 2 is a schematic flow chart of a framework depiction and detection method of Webshell according to an embodiment of the present invention.
Detailed Description
Other advantages and effects of the present invention will become apparent to those skilled in the art from the following detailed description. The described embodiments are some, but not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
It should be noted that, in the embodiments of the present application, a detection program implemented in the Go language is used to detect ASP (Active Server Pages) code. Go is a statically and strongly typed, compiled language developed by Robert Griesemer, Rob Pike and Ken Thompson at Google. The language in which the program is implemented and the language of the code to be detected may also be other programming languages, which is not limited in this application.
The abbreviations and key term definitions in the examples of the present invention are explained below:
webshell is a code execution environment in the form of a webpage file such as asp, php, jsp or cgi, and is mainly used for operations such as website management, server management, authority management and the like.
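Token has two meanings: in computer authentication it is a (temporary) credential, generally used for invitation and login systems; in lexical analysis it is a lexical unit (mark). The Token sequences in the embodiments of the present application use the lexical-analysis meaning.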
BaseToken is a structure defined in the embodiments of the present application for describing the basic information of Token.
int is a data type; in programming languages such as C, C++, C# and Java, int is the keyword used to define integer variables.
string is the string type in programming languages such as C++, Java and VB; a string is a special object and belongs to the reference types.
Text: TEXT is a macro commonly encountered in Windows programming; in the present embodiment, Text refers to the text content extracted from the Webshell.
asp (Active Server Pages, dynamic server page), a server-side scripting environment developed by Microsoft corporation, can be used to create dynamic interactive web pages and build powerful web applications.
IDENTIFIER is one of the Token types identified in lexical analysis; it marks the position of a key function (which may be a specific function name or a generalized label of the form IDE_x).
STRINGLITERAL is the string Token type identified in lexical analysis (generalized label of the form STR_y).
INTEGERLITERAL is the integer Token type identified in lexical analysis (generalized label of the form INT_z).
In order to solve the above technical problems, as shown in fig. 1, an embodiment of the present application provides a device for describing and detecting a framework of a Webshell, where the device includes: the device comprises an analysis module 1, a characterization module 2, a generalization module 3, a storage module 4 and a detection module 5.
Specifically, the parsing module 1 is configured to parse a known Webshell file into a first Token sequence; the depicting module 2 is used for carrying out skeleton depicting on the first Token sequence to obtain a second Token sequence; the generalization module 3 is used for generalizing the second Token sequence to obtain a third Token sequence; the storage module 4 is used for constructing a Webshell skeleton database according to the first digest value of the third Token sequence signature; the detection module 5 is used for extracting the skeleton of the code to be detected and judging whether the code to be detected is a Webshell based on the Webshell skeleton database.
Compared with the prior art, the device for describing and detecting the frameworks of the Webshells, provided by the embodiment of the application, is used for carrying out lexical analysis on the known Webshells and reasonably generalizing the lexical analysis, so as to describe the frameworks of the known Webshells, and accordingly, a Webshell framework database is constructed; and describing the skeleton of the code to be detected in the same way, and inquiring whether a matched skeleton exists in the constructed Webshell skeleton database so as to judge whether the code to be detected is a Webshell. The embodiment of the invention ensures that the Webshell detection is efficient and accurate, has extremely strong adaptability and expansibility, can detect codes written in multiple languages, and can achieve ideal detection effect in a real detection environment.
Corresponding to the above disclosed framework depiction and detection device of the Webshell, the embodiment of the invention also discloses a framework depiction and detection method of the Webshell. The following describes a method for describing and detecting the skeleton of the Webshell in detail in combination with the device for describing and detecting the skeleton of the Webshell.
As shown in fig. 2, specific steps of a method for describing and detecting a skeleton of a Webshell provided in the embodiment of the present application are described in detail below.
Step S11: the known Webshell file is parsed into a first Token sequence by a parsing module 1.
Further, step S11 specifically includes: defining a BaseToken structure, the BaseToken structure comprising: Name and Text of type string, and Type of type int; defining a Token array of the BaseToken type, wherein the Token array is used for storing the Token sequence once parsing is complete; creating an input character stream (inputStream) from the known Webshell file, and assigning the input character stream to the corresponding lexical analyzer; creating a lexical symbol stream (lexer) bound to that lexical analyzer; decomposing the character stream into a plurality of lexical symbol objects by the lexical analyzer; and collecting all Tokens in the lexical symbol stream into the previously defined Token array to form the first Token sequence.
More specifically, step S11 includes the following.
Defining the BaseToken structure:
    type BaseToken struct {
        Name string // symbolic Token type name, e.g. "WS" or "LPAREN"
        Text string // source text of the Token
        Type int    // numeric Token type
    }
Defining a Token array of the BaseToken type:
    var tokens []BaseToken
Creating an input character stream, which is generated from the known Webshell file:
    inputStream := antlr.NewInputStream(webshell)
Creating an instance of the lexical symbol stream, lexer, bound to the corresponding ASP lexical analyzer, which decomposes the character stream into lexical symbol objects:
    lexer := NewASPLexer(inputStream)
Collecting all Tokens in the lexical symbol stream into the previously defined Token array to form the first Token sequence:
    // Copy each lexer Token into the BaseToken array declared above
    // (assumes the generated lexer exposes the runtime's SymbolicNames table).
    for _, t := range lexer.GetAllTokens() {
        tokens = append(tokens, BaseToken{
            Name: lexer.SymbolicNames[t.GetTokenType()],
            Text: t.GetText(),
            Type: t.GetTokenType(),
        })
    }
This completes the basic lexical analysis of the known Webshell file, which has now been parsed into the corresponding first Token sequence.
It should be noted that the above code examples are only for explaining the steps, and are not a complete implementation code.
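For readers who do not have a generated ASP lexer at hand, the following self-contained Go sketch illustrates the same idea of turning source text into a BaseToken sequence. It is not the lexical analyzer of the embodiment: the function Tokenize, the numeric type constants and the very coarse token classes are assumptions made only for illustration.

```go
package main

import (
	"fmt"
	"unicode"
)

// BaseToken mirrors the structure defined in step S11.
type BaseToken struct {
	Name string // symbolic token type name
	Text string // source text of the token
	Type int    // numeric token type
}

// Hypothetical token type numbers; a generated lexer would define its own.
const (
	TypeWS = iota + 1
	TypeIdentifier
	TypeString
	TypeInteger
	TypeSymbol
)

// Tokenize is a simplified stand-in for a generated lexer. It splits src into
// whitespace runs, identifiers, double-quoted strings, integers and single symbols.
func Tokenize(src string) []BaseToken {
	var tokens []BaseToken
	rs := []rune(src)
	for i := 0; i < len(rs); {
		start := i
		name, typ := "SYMBOL", TypeSymbol
		switch r := rs[i]; {
		case unicode.IsSpace(r):
			for i < len(rs) && unicode.IsSpace(rs[i]) {
				i++
			}
			name, typ = "WS", TypeWS
		case unicode.IsLetter(r) || r == '_':
			for i < len(rs) && (unicode.IsLetter(rs[i]) || unicode.IsDigit(rs[i]) || rs[i] == '_') {
				i++
			}
			name, typ = "IDENTIFIER", TypeIdentifier
		case unicode.IsDigit(r):
			for i < len(rs) && unicode.IsDigit(rs[i]) {
				i++
			}
			name, typ = "INTEGERLITERAL", TypeInteger
		case r == '"':
			i++ // skip the opening quote
			for i < len(rs) && rs[i] != '"' {
				i++
			}
			if i < len(rs) {
				i++ // include the closing quote
			}
			name, typ = "STRINGLITERAL", TypeString
		default:
			i++ // any other character is a one-rune symbol
		}
		tokens = append(tokens, BaseToken{Name: name, Text: string(rs[start:i]), Type: typ})
	}
	return tokens
}

func main() {
	for _, t := range Tokenize(`execute request("#")`) {
		fmt.Printf("%-15s %q\n", t.Name, t.Text)
	}
}
```

A grammar-driven lexer distinguishes far more token types (keywords, brackets, script tags), which is what makes the later filtering by type name possible.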
Step S12: and carrying out skeleton depiction on the first Token sequence through a depiction module 2 to obtain a second Token sequence.
Further, step S12 specifically includes: filtering out, from the first Token sequence, first Tokens that have no substantial effect on the meaning of the Webshell, wherein the first Tokens include spaces, blank lines and repeated bracket pairs of various forms; in the present embodiment, the first Tokens are Tokens of types such as "WS", "new", "LPAREN", "RPAREN", "LBRACE" and "RBRACE"; extracting key functions from the first Token sequence as first Webshell skeleton key nodes, which in the present embodiment are selected from "eval", "execute global", "request", "CreateObject", "exec", "StdOut", "ReadAll", "Files", "SaveAs", "MapPath" and the like; extracting symbols with key meanings from the first Token sequence as second Webshell skeleton key nodes, which in the present embodiment are selected from ',', '(', ')', '{', '=', '- |'; and forming the second Token sequence from the filtered first Token sequence together with the first Webshell skeleton key nodes and the second Webshell skeleton key nodes.
In the embodiment of the invention, the selection of the specific Webshell key skeleton node is not limited, and the dynamic adjustment can be performed according to specific detection scenes and languages, so that the method can adapt to various Webshell detection environments.
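The filtering and key-node extraction of step S12 can be sketched as follows. This is a minimal illustration under stated assumptions, not the embodiment's implementation: the SkeletonToken wrapper, the Depict function and the three lookup tables are hypothetical, the tables hold only a small illustrative subset of entries, and key functions are compared case-insensitively because VBScript identifiers are case-insensitive.

```go
package main

import (
	"fmt"
	"strings"
)

// BaseToken mirrors the structure defined in step S11.
type BaseToken struct {
	Name string
	Text string
	Type int
}

// SkeletonToken is a hypothetical wrapper that records whether a token
// was extracted as a skeleton key node.
type SkeletonToken struct {
	BaseToken
	KeyNode bool
}

// Illustrative, adjustable tables (assumptions, not the full lists of the embodiment).
var (
	filteredTypes = map[string]bool{"WS": true}
	keyFunctions  = map[string]bool{"eval": true, "execute": true, "request": true, "createobject": true}
	keySymbols    = map[string]bool{"(": true, ")": true, "=": true, ",": true}
)

// Depict drops tokens whose type is filtered and marks key functions and
// key symbols as skeleton key nodes; all other tokens are kept unmarked.
func Depict(first []BaseToken) []SkeletonToken {
	var second []SkeletonToken
	for _, t := range first {
		if filteredTypes[t.Name] {
			continue
		}
		key := keyFunctions[strings.ToLower(t.Text)] || keySymbols[t.Text]
		second = append(second, SkeletonToken{BaseToken: t, KeyNode: key})
	}
	return second
}

func main() {
	first := []BaseToken{
		{Name: "IDENTIFIER", Text: "execute"},
		{Name: "WS", Text: " "},
		{Name: "IDENTIFIER", Text: "request"},
		{Name: "LPAREN", Text: "("},
		{Name: "STRINGLITERAL", Text: `"#"`},
		{Name: "RPAREN", Text: ")"},
	}
	for _, t := range Depict(first) {
		fmt.Printf("%-10q key=%v\n", t.Text, t.KeyNode)
	}
}
```

Keeping the tables as plain maps reflects the point made above that the key skeleton nodes can be adjusted per detection scenario and language.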
Step S13: and generalizing the second Token sequence through a generalization module 3 to obtain a third Token sequence.
Further, the step S13 specifically includes: generalizing the second Token correspondence for non-skeleton nodes that are not extracted to IDENTIFIER, STRINGLITERAL, INTEGERLITERAL; wherein the second Token includes: variables, class names, functions, character strings and numbers of non-skeleton nodes in the second Token sequence.
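A minimal sketch of the generalization in step S13 is given below. The Node and GenNode types and the classify helper are hypothetical; in particular, classify decides the class from the token text alone, whereas an implementation built on a real lexer would more plausibly use the token type reported by the lexer.

```go
package main

import (
	"fmt"
	"strings"
)

// Node is a hypothetical element of the second Token sequence.
type Node struct {
	Text    string // source text of the token
	KeyNode bool   // true if the token was extracted as a skeleton key node
}

// GenNode is a hypothetical element of the third Token sequence.
type GenNode struct {
	Class string // "IDENTIFIER", "STRINGLITERAL", "INTEGERLITERAL", or "" for key nodes
	Text  string // original text, kept so later numbering can tell contents apart
}

// classify picks a generalization class from the token text (an assumption
// made for this sketch; see the lead-in).
func classify(text string) string {
	switch {
	case strings.HasPrefix(text, `"`):
		return "STRINGLITERAL"
	case text != "" && strings.Trim(text, "0123456789") == "":
		return "INTEGERLITERAL"
	default:
		return "IDENTIFIER"
	}
}

// Generalize leaves key nodes untouched and assigns every other node
// to one of the three generalization classes.
func Generalize(second []Node) []GenNode {
	third := make([]GenNode, 0, len(second))
	for _, n := range second {
		g := GenNode{Text: n.Text}
		if !n.KeyNode {
			g.Class = classify(n.Text)
		}
		third = append(third, g)
	}
	return third
}

func main() {
	second := []Node{
		{"execute", true}, {"request", true}, {"(", true}, {`"#"`, false}, {")", true},
	}
	for _, g := range Generalize(second) {
		fmt.Printf("%-10q %s\n", g.Text, g.Class)
	}
}
```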
Step S14: and constructing a Webshell skeleton database according to the first digest value of the third Token sequence signature through the storage module 4.
Further, step S14 specifically includes: marking the non-skeleton nodes generalized to IDENTIFIER, STRINGLITERAL and INTEGERLITERAL in the third Token sequence as first, second and third generalization points respectively; labeling the first, second and third generalization points in the forms IDE_x, STR_y and INT_z respectively, the result serving as the Webshell signature (Signature) corresponding to the Webshell file, wherein x, y and z are integers representing serial numbers within each type of generalization point, that is, different contents of the same type receive different serial numbers; calculating a first Hash value of the Webshell signature, and taking the first Hash value as the first digest value (digest); and establishing and storing an index of the Webshell by using the first digest value for subsequent detection and query, thereby completing the construction of the Webshell skeleton database.
In an embodiment of the present invention, the first digest value (digest) is typically calculated with a hash function such as MD5: digest = md5(Signature). The following is a reference example of a known Webshell file:
<script language=VBScript runat=server>execute request("#")</script>
the Webshell Signature (Signature) obtained after skeleton characterization and generalization is:
“ServerScript language=vbscript execute request ( STR_1 ) ”
the first digest value (digest) is calculated as: "ad75ec4ca33996ee4f0ef70db4c2cceb".
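The numbering of generalization points and the digest calculation of step S14 can be sketched as follows. The names GenNode, Signature, Digest and SkeletonDB are hypothetical, joining the parts with single spaces is an assumption, and the sketch is not expected to reproduce the digest quoted above, since the exact byte layout of the signature string in the embodiment is not specified.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"strings"
)

// GenNode is a hypothetical element of the third Token sequence.
type GenNode struct {
	Class string // "IDENTIFIER", "STRINGLITERAL", "INTEGERLITERAL", or "" for key nodes
	Text  string // original token text
}

// Signature replaces each generalization point with IDE_x, STR_y or INT_z.
// Within one class, identical contents share a serial number and new
// contents receive the next one, starting at 1.
func Signature(third []GenNode) string {
	prefix := map[string]string{"IDENTIFIER": "IDE_", "STRINGLITERAL": "STR_", "INTEGERLITERAL": "INT_"}
	seen := map[string]map[string]int{}
	parts := make([]string, 0, len(third))
	for _, n := range third {
		p, ok := prefix[n.Class]
		if !ok {
			parts = append(parts, n.Text) // key node: keep verbatim
			continue
		}
		if seen[n.Class] == nil {
			seen[n.Class] = map[string]int{}
		}
		idx, dup := seen[n.Class][n.Text]
		if !dup {
			idx = len(seen[n.Class]) + 1
			seen[n.Class][n.Text] = idx
		}
		parts = append(parts, fmt.Sprintf("%s%d", p, idx))
	}
	return strings.Join(parts, " ")
}

// Digest returns the hex-encoded MD5 of a signature.
func Digest(sig string) string {
	sum := md5.Sum([]byte(sig))
	return hex.EncodeToString(sum[:])
}

// SkeletonDB indexes known Webshell skeletons by digest (value: a free-form label).
type SkeletonDB map[string]string

func main() {
	third := []GenNode{
		{"", "execute"}, {"", "request"}, {"", "("}, {"STRINGLITERAL", `"#"`}, {"", ")"},
	}
	sig := Signature(third)
	db := SkeletonDB{Digest(sig): "asp one-line execute shell"}
	fmt.Println(sig)
	fmt.Println(len(db), "skeleton(s) indexed")
}
```

Because only the digest is stored and compared, a lookup is a single map (or database index) probe regardless of how long the original Webshell was.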
Step S15: and extracting the skeleton of the code to be detected through the detection module 5 and judging whether the code to be detected is a Webshell based on the Webshell skeleton database.
Further, step S15 specifically includes: checking whether the code to be detected is encoded or obfuscated; if the code to be detected is encoded or obfuscated, decoding or restoring the code to be detected, and then extracting and generalizing the Webshell skeleton of the decoded or restored code; if the code to be detected is not encoded or obfuscated, directly extracting and generalizing the Webshell skeleton of the code to be detected; querying whether a Webshell skeleton matching the code to be detected exists in the constructed Webshell skeleton database; if such a matching Webshell skeleton exists in the constructed Webshell skeleton database, the code to be detected is a Webshell; if no such matching Webshell skeleton exists, the code to be detected is not a Webshell.
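The control flow of step S15 can be sketched as follows. Everything named here is hypothetical: the Detector type and its fields, the use of base64 as the example encoding (the embodiment does not name a specific encoding or obfuscation scheme), and toyDigest, which is only a placeholder standing in for the full parse, depict, generalize and digest pipeline of steps S11 to S14.

```go
package main

import (
	"crypto/md5"
	"encoding/base64"
	"encoding/hex"
	"fmt"
	"strings"
)

// Detector bundles the pluggable pieces of step S15.
type Detector struct {
	DB       map[string]bool                  // digests of known Webshell skeletons
	Restore  func(code string) (string, bool) // restored code, and whether code was encoded/obfuscated
	Skeleton func(code string) string         // digest of the generalized skeleton of code
}

// IsWebshell restores the code if needed, computes its skeleton digest
// and reports whether that digest is present in the database.
func (d Detector) IsWebshell(code string) bool {
	if restored, ok := d.Restore(code); ok {
		code = restored
	}
	return d.DB[d.Skeleton(code)]
}

// tryBase64 treats any input that decodes as standard base64 as encoded.
// This is a deliberately naive check used only for the demonstration.
func tryBase64(code string) (string, bool) {
	raw, err := base64.StdEncoding.DecodeString(strings.TrimSpace(code))
	if err != nil {
		return "", false
	}
	return string(raw), true
}

// toyDigest is a placeholder for steps S11 to S14: it only collapses
// whitespace and lowercases the code before hashing.
func toyDigest(code string) string {
	norm := strings.ToLower(strings.Join(strings.Fields(code), " "))
	sum := md5.Sum([]byte(norm))
	return hex.EncodeToString(sum[:])
}

func main() {
	known := `execute request("#")`
	d := Detector{
		DB:       map[string]bool{toyDigest(known): true},
		Restore:  tryBase64,
		Skeleton: toyDigest,
	}
	encoded := base64.StdEncoding.EncodeToString([]byte(known))
	fmt.Println(d.IsWebshell(known))
	fmt.Println(d.IsWebshell(encoded))
	fmt.Println(d.IsWebshell(`Response.Write("hello")`))
}
```

Passing the restoring and skeleton functions in as fields keeps the matching logic independent of the language being detected and of the decoding scheme in use.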
Further, extracting and generalizing the Webshell skeleton of the decoded or restored code to be detected, or directly extracting and generalizing the Webshell skeleton of the code to be detected, includes: parsing the decoded or restored code to be detected, or the code to be detected itself, into a fourth Token sequence; filtering out of the fourth Token sequence the third Tokens that have no substantial influence on the meaning of the Webshell; extracting the key functions in the fourth Token sequence as third Webshell skeleton key nodes; extracting the symbols with key meanings in the fourth Token sequence as fourth Webshell skeleton key nodes; forming a fifth Token sequence from the filtered fourth Token sequence and the third Webshell skeleton key nodes; and generalizing the fourth Tokens of the non-extracted, non-skeleton nodes to IDENTIFIER, STRINGLITERAL or INTEGERLITERAL accordingly to obtain a sixth Token sequence; wherein the fourth Tokens comprise: the variables, class names, functions, character strings and numbers of the non-skeleton nodes in the second Token sequence.
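The filtering, key-node extraction and generalization steps can be illustrated with the sketch below. The concrete sets `KEY_FUNCTIONS`, `KEY_SYMBOLS` and `IGNORED`, the token type names `NAME`, `STRING`, `NUMBER`, `WS` and `COMMENT`, and the choice to keep unclassified tokens verbatim are all assumptions for this example; the text does not list which functions or symbols count as key nodes.

```python
# Hypothetical configuration; the source does not enumerate these sets.
KEY_FUNCTIONS = {"execute", "eval", "request"}
KEY_SYMBOLS = {"(", ")", "="}
IGNORED = {"WS", "COMMENT"}  # tokens with no substantial influence on meaning
GENERALIZE = {
    "NAME": "IDENTIFIER",
    "STRING": "STRINGLITERAL",
    "NUMBER": "INTEGERLITERAL",
}


def skeletonize(tokens):
    """Turn (type_name, text) tokens into (kind, text) skeleton/generalized tokens."""
    result = []
    for type_name, text in tokens:
        if type_name in IGNORED:
            continue  # filtering step
        if text.lower() in KEY_FUNCTIONS or text in KEY_SYMBOLS:
            result.append(("SKELETON", text.lower()))  # key node, kept as-is
        elif type_name in GENERALIZE:
            result.append((GENERALIZE[type_name], text))  # generalization point
        else:
            result.append(("SKELETON", text))  # assumption: keep other tokens verbatim
    return result
```

Lower-casing key function names is an additional assumption that suits case-insensitive languages such as VBScript; a case-sensitive language would need a different rule.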
Further, parsing the decoded or restored code to be detected into the fourth Token sequence, or directly parsing the code to be detected into the fourth Token sequence, includes: defining a BaseToken structure, the BaseToken structure comprising Name and Text of string type and Type of int type; defining a Token array of BaseToken type, the Token array being used to store the Token sequence once parsing is complete; creating an input character stream according to the known Webshell file, and assigning a corresponding lexical analyzer to the input character stream; creating a lexical symbol stream and assigning the lexical symbol stream to the corresponding lexical analyzer; decomposing the character stream into a plurality of lexical symbol objects by the lexical analyzer; and collecting all Tokens in the lexical symbol stream into the previously defined Token array to form the fourth Token sequence.
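A minimal sketch of the BaseToken structure and the parsing step follows. The text describes a character stream handed to a lexical analyzer that emits a lexical symbol stream, but does not name the lexer; here a small regular-expression tokenizer is swapped in as a stand-in. The token type names, their numeric ids and the `tokenize` function are assumptions, and only the three BaseToken fields (Name, Text, Type) come from the text.

```python
import re
from dataclasses import dataclass


@dataclass
class BaseToken:
    name: str  # token type name (string)
    text: str  # matched source text (string)
    type: int  # numeric token type (int)


# Hypothetical token rules; order matters because earlier rules win.
TOKEN_SPEC = [
    ("STRING", r'"[^"]*"'),
    ("NUMBER", r"\d+"),
    ("NAME", r"[A-Za-z_]\w*"),
    ("WS", r"\s+"),
    ("SYMBOL", r"."),
]
TYPE_IDS = {name: i for i, (name, _) in enumerate(TOKEN_SPEC, start=1)}
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC), re.DOTALL)


def tokenize(source):
    """Split the input text into a list of BaseToken objects (the Token array)."""
    tokens = []
    for match in MASTER.finditer(source):
        name = match.lastgroup  # name of the rule that matched
        tokens.append(BaseToken(name, match.group(), TYPE_IDS[name]))
    return tokens
```

Since the final `SYMBOL` rule matches any single character, every character of the input ends up in exactly one token, so concatenating the token texts reproduces the source.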
Further, querying whether the constructed Webshell skeleton database contains a Webshell skeleton matching the code to be detected includes: marking the non-skeleton nodes generalized to IDENTIFIER, STRINGLITERAL and INTEGERLITERAL in the sixth Token sequence as a fourth generalization point, a fifth generalization point and a sixth generalization point, respectively; labeling the fourth, fifth and sixth generalization points in the forms IDE_x, STR_y and INT_z, respectively, to obtain the signature of the code to be detected, where x, y and z are integers giving the serial number within each generalization-point type; calculating a second Hash value of the signature of the code to be detected and taking it as a second digest value; querying whether a first digest value matching the second digest value exists in the constructed Webshell skeleton database; if a matching first digest value exists, the constructed Webshell skeleton database contains a Webshell skeleton matching the code to be detected; and if no matching first digest value exists, the constructed Webshell skeleton database does not contain a Webshell skeleton matching the code to be detected.
The following is a detection example. The code to be detected is:
<script language=vbs runat=server>execute(request("c"))</script>
the signature of the code to be detected (Signature) obtained after skeleton extraction and generalization is:
“ServerScript language=vbscript execute request ( STR_1 ) ”
A second Hash value of the signature of the code to be detected is then calculated with a hash function and taken as the second digest value (digest), that is, digest = md5(Signature), which gives:
"ad75ec4ca33996ee4f0ef70db4c2cceb".
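Because the two signatures above are identical, their digests match and the lookup succeeds. The storage and query side can be sketched with an SQLite table keyed by digest. This is one possible layout and not the one described in the source, which does not specify a storage engine; the table name `skeletons` and the helper names are assumptions, and the signature string used in the usage note is copied from the example with its spacing assumed.

```python
import hashlib
import sqlite3


def md5_hex(signature):
    """digest = md5(Signature), as a hex string."""
    return hashlib.md5(signature.encode("utf-8")).hexdigest()


def open_db(path=":memory:"):
    """Open (or create) a skeleton database indexed by digest."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS skeletons ("
        "digest TEXT PRIMARY KEY, signature TEXT NOT NULL)"
    )
    return conn


def add_skeleton(conn, signature):
    """Store the first digest value of a known Webshell signature."""
    conn.execute(
        "INSERT OR IGNORE INTO skeletons VALUES (?, ?)",
        (md5_hex(signature), signature),
    )
    conn.commit()


def has_skeleton(conn, signature):
    """Query whether a stored first digest matches the second digest."""
    row = conn.execute(
        "SELECT 1 FROM skeletons WHERE digest = ?", (md5_hex(signature),)
    ).fetchone()
    return row is not None
```

For instance, after `add_skeleton(conn, "ServerScript language=vbscript execute request ( STR_1 )")`, calling `has_skeleton` with the same signature derived from the code to be detected returns `True`. Using the digest as the primary key makes the lookup an exact-match index probe and deduplicates repeated skeletons.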
Compared with the prior art, the Webshell skeleton characterization and detection method provided by the embodiments of the present application performs lexical analysis on known Webshells and generalizes the result appropriately to construct a Webshell skeleton database; it then characterizes the skeleton of the code to be detected in the same way and queries whether a matching skeleton exists in the constructed Webshell skeleton database, thereby judging whether the code to be detected is a Webshell. The embodiments of the invention make Webshell detection efficient and accurate, are highly adaptable and extensible, can detect code written in multiple languages, and can achieve the desired detection effect in a real detection environment.
The embodiment of the invention also provides a device for describing and detecting the skeleton of the Webshell, which comprises: a processor and a memory; the memory is used for storing one or more program instructions; the processor is configured to execute one or more program instructions to perform the steps of a Webshell skeleton characterization and detection method as described in any one of the above.
In the embodiment of the invention, the processor may be an integrated circuit chip with signal processing capability. The processor may be a general purpose processor, a digital signal processor (Digital Signal Processor, DSP for short), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC for short), a field programmable gate array (Field Programmable Gate Array, FPGA for short) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
The processor may implement or perform the methods, steps and logic blocks disclosed in the embodiments of the present invention. A general purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the method disclosed in connection with the embodiments of the present invention may be executed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software modules may be located in a storage medium well known in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory or registers. The processor reads the information in the storage medium and, in combination with its hardware, performs the steps of the above method.
The storage medium may be memory, for example, may be volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory.
The nonvolatile memory may be a read-only memory (Read-Only Memory, ROM), a programmable ROM (Programmable ROM, PROM), an erasable PROM (Erasable PROM, EPROM), an electrically erasable PROM (Electrically EPROM, EEPROM), or a flash memory.
The volatile memory may be a random access memory (Random Access Memory, RAM for short), which acts as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (Double Data Rate SDRAM, DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct Rambus RAM (Direct Rambus RAM, DRRAM).
The storage media described in embodiments of the present invention are intended to comprise, without being limited to, these and any other suitable types of memory.
Those skilled in the art will appreciate that, in one or more of the examples described above, the functions described in the present invention may be implemented in a combination of hardware and software. When implemented in software, the functions may be stored on a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium. Computer-readable media include both computer storage media and communication media, the latter including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general purpose or special purpose computer.
While the invention has been described in detail in the foregoing general description and specific examples, it will be apparent to those skilled in the art that modifications and improvements can be made thereto. Accordingly, such modifications or improvements may be made without departing from the spirit of the invention and are intended to be within the scope of the invention as claimed.

Claims (8)

1. A method for skeleton characterization and detection of Webshell, characterized by comprising the following steps:
analyzing the known Webshell file into a first Token sequence;
performing skeleton depiction on the first Token sequence to obtain a second Token sequence;
generalizing the second Token sequence to obtain a third Token sequence;
constructing a Webshell skeleton database according to the first digest value of the signature of the third Token sequence;
extracting a skeleton of a code to be detected and judging whether the code to be detected is a Webshell or not based on the Webshell skeleton database;
parsing a known Webshell file into a corresponding first Token sequence includes:
defining a BaseToken structure, the BaseToken structure comprising: Name and Text of string type, and Type of int type;
defining a Token array of a BaseToken type, wherein the Token array is used for storing a Token sequence with analysis completed;
creating an input character stream according to the known Webshell file, and assigning a corresponding lexical analyzer to the input character stream;
creating a lexical symbol stream and assigning the lexical symbol stream to a corresponding lexical analyzer;
decomposing the character stream into a plurality of lexical symbol objects by a lexical analyzer;
obtaining all Token in the lexical symbol stream to a Token array defined previously to form a first Token sequence;
performing skeleton depiction on the first Token sequence to obtain a second Token sequence, wherein the skeleton depiction comprises the following steps:
filtering a first Token without substantial influence on Webshell meaning from the first Token sequence;
extracting key functions in the first Token sequence to serve as key nodes of a first Webshell skeleton;
extracting symbols with key meanings from the first Token sequence to serve as key nodes of a second Webshell skeleton;
and forming the second Token sequence by using the filtered first Token sequence and the first Webshell skeleton key node and the second Webshell skeleton key node.
2. The method for skeleton characterization and detection of Webshell of claim 1, wherein generalizing the second Token sequence to obtain a third Token sequence includes:
generalizing the second Token correspondence for non-skeleton nodes that are not extracted to IDENTIFIER, STRINGLITERAL, INTEGERLITERAL;
wherein the second Token includes: variables, class names, functions, character strings and numbers of non-skeleton nodes in the second Token sequence.
3. The method for skeleton characterization and detection of Webshell of claim 2, wherein constructing a Webshell skeleton database according to the first digest value of the third Token sequence signature includes:
respectively marking non-skeleton nodes generalized to IDENTIFIER, STRINGLITERAL, INTEGERLITERAL in the third Token sequence as a first generalization point, a second generalization point and a third generalization point;
marking the first generalization point, the second generalization point and the third generalization point in the forms of IDE_x, STR_y and INT_z respectively to be used as Webshell signatures corresponding to the Webshell files, wherein x, y and z are integer numbers and represent the serial numbers of the types of the generalization points;
calculating a first Hash value of the Webshell signature, and taking the first Hash value as a first digest value;
and using the first digest value to build and store an index of the Webshell for subsequent detection and query, thereby completing the construction of the Webshell skeleton database.
4. The method for skeleton characterization and detection of Webshell as claimed in claim 3, wherein extracting the skeleton of the code to be detected and judging whether the code to be detected is a Webshell based on the Webshell skeleton database comprises:
checking whether the code to be detected is encoded or obfuscated;
if the code to be detected is encoded or obfuscated, decoding or restoring the code to be detected; extracting and generalizing the Webshell skeleton of the decoded or restored code to be detected;
if the code to be detected is not encoded or obfuscated, directly extracting and generalizing the Webshell skeleton of the code to be detected;
inquiring whether a Webshell skeleton matched with the code to be detected exists in a constructed Webshell skeleton database;
if the constructed Webshell skeleton database has a Webshell skeleton matched with the code to be detected, the code to be detected is a Webshell;
if the constructed Webshell skeleton database does not have the Webshell skeleton matched with the code to be detected, the code to be detected is not the Webshell.
5. The method for describing and detecting the skeleton of the Webshell as claimed in claim 3, wherein the steps of extracting the Webshell skeleton of the decoded or restored code to be detected and generalizing, or directly extracting the Webshell skeleton of the code to be detected and generalizing, include:
analyzing the decoded or restored code to be detected into a fourth Token sequence, or directly analyzing the code to be detected into the fourth Token sequence;
filtering a third Token without substantial influence on Webshell meaning from the fourth Token sequence;
extracting key functions in the fourth Token sequence to serve as third Webshell skeleton key nodes;
extracting symbols with key meanings from the fourth Token sequence to serve as key nodes of a fourth Webshell skeleton;
forming a fifth Token sequence by using the filtered fourth Token sequence and the third Webshell skeleton key node;
generalizing the fourth Token correspondence of non-extracted non-skeleton nodes to IDENTIFIER, STRINGLITERAL, INTEGERLITERAL to obtain a sixth Token sequence;
wherein the fourth Token comprises: variables, class names, functions, character strings and numbers of non-skeleton nodes in the second Token sequence.
6. The method for describing and detecting the skeleton of the Webshell as in claim 5, wherein querying whether the Webshell skeleton matched with the code to be detected exists in the constructed Webshell skeleton database comprises the following steps:
marking non-skeleton nodes generalized to IDENTIFIER, STRINGLITERAL, INTEGERLITERAL in the sixth Token sequence as a fourth generalization point, a fifth generalization point and a sixth generalization point respectively;
marking the fourth generalization point, the fifth generalization point and the sixth generalization point in the forms of IDE_x, STR_y and INT_z respectively as code signatures to be detected, wherein x, y and z are integer numbers and represent serial numbers of the types of the generalization points;
calculating a second Hash value of the code signature to be detected, and taking the second Hash value as a second digest value;
querying whether a first digest value matching the second digest value exists in the constructed Webshell skeleton database;
if a first digest value matching the second digest value exists, a Webshell skeleton matching the code to be detected exists in the constructed Webshell skeleton database;
and if the first digest value matched with the second digest value does not exist, the constructed Webshell skeleton database does not exist the Webshell skeleton matched with the code to be detected.
7. A device for skeleton characterization and detection of Webshell, the device comprising:
the analysis module is used for analyzing the known Webshell file into a first Token sequence;
the depicting module is used for carrying out skeleton depicting on the first Token sequence to obtain a second Token sequence;
the generalization module is used for generalizing the second Token sequence to obtain a third Token sequence;
the storage module is used for constructing a Webshell skeleton database according to the first digest value of the signature of the third Token sequence;
the detection module is used for extracting the skeleton of the code to be detected and judging whether the code to be detected is a Webshell or not based on the Webshell skeleton database;
parsing a known Webshell file into a corresponding first Token sequence includes:
defining a BaseToken structure, the BaseToken structure comprising: Name and Text of string type, and Type of int type;
defining a Token array of a BaseToken type, wherein the Token array is used for storing a Token sequence with analysis completed;
creating an input character stream according to the known Webshell file, and assigning a corresponding lexical analyzer to the input character stream;
creating a lexical symbol stream and assigning the lexical symbol stream to a corresponding lexical analyzer;
decomposing the character stream into a plurality of lexical symbol objects by a lexical analyzer;
obtaining all Token in the lexical symbol stream to a Token array defined previously to form a first Token sequence;
performing skeleton depiction on the first Token sequence to obtain a second Token sequence, wherein the skeleton depiction comprises the following steps:
filtering a first Token without substantial influence on Webshell meaning from the first Token sequence;
extracting key functions in the first Token sequence to serve as key nodes of a first Webshell skeleton;
extracting symbols with key meanings from the first Token sequence to serve as key nodes of a second Webshell skeleton;
and forming the second Token sequence by using the filtered first Token sequence and the first Webshell skeleton key node and the second Webshell skeleton key node.
8. A Webshell's skeleton characterization and detection device, the device comprising: a processor and a memory;
the memory is used for storing one or more program instructions;
the processor is configured to execute one or more program instructions to perform the steps of a Webshell skeleton characterization and detection method according to any one of claims 1 to 7.
CN202210299498.0A 2022-03-25 2022-03-25 Framework depiction and detection method, device and equipment of Webshell Active CN114422148B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210299498.0A CN114422148B (en) 2022-03-25 2022-03-25 Framework depiction and detection method, device and equipment of Webshell

Publications (2)

Publication Number Publication Date
CN114422148A (en) 2022-04-29
CN114422148B (en) 2024-04-09

Family

ID=81264573

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210299498.0A Active CN114422148B (en) 2022-03-25 2022-03-25 Framework depiction and detection method, device and equipment of Webshell

Country Status (1)

Country Link
CN (1) CN114422148B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107341399A (en) * 2016-04-29 2017-11-10 阿里巴巴集团控股有限公司 Assess the method and device of code file security
CN108985059A (en) * 2018-06-29 2018-12-11 北京奇虎科技有限公司 A kind of webpage back door detection method, device, equipment and storage medium
CN109933977A (en) * 2019-03-12 2019-06-25 北京神州绿盟信息安全科技股份有限公司 A kind of method and device detecting webshell data
CN112052451A (en) * 2020-08-17 2020-12-08 北京兰云科技有限公司 Webshell detection method and device
CN112800427A (en) * 2021-04-08 2021-05-14 北京邮电大学 Webshell detection method and device, electronic equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9063710B2 (en) * 2013-06-21 2015-06-23 Sap Se Parallel programming of in memory database utilizing extensible skeletons

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Evil-hunter: a novel web shell detection system based on scoring scheme;Truong Dinh Tu等;《Journal of Southeast University》;278-284 *

Also Published As

Publication number Publication date
CN114422148A (en) 2022-04-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant