CN110795731A - Page detection method and device

Info

Publication number: CN110795731A
Application number: CN201910955399.1A
Authority: CN
Inventors: 马文强
Original assignee: New H3C Security Technologies Co Ltd
Current assignee: New H3C Security Technologies Co Ltd
Priority date: 2019-10-09
Filing date: 2019-10-09
Publication date: 2020-02-14
Anticipated expiration: 2039-10-09
Also published as: CN110795731B

Abstract

The application provides a page detection method and device. The scheme is as follows: acquiring the script language code of a page to be detected; dividing the script language code into a plurality of lexical fragments, and determining the fragment type corresponding to each lexical fragment; combining the fragment types corresponding to the plurality of lexical fragments according to the position of each lexical fragment in the script language code to obtain a target Token stream; performing expression simplification on the lexical fragments corresponding to the fragment types in the target Token stream to obtain simplified code corresponding to the script language code; and inputting the simplified code into a preset security engine to detect whether the page to be detected is a malicious page. By applying the technical scheme provided by the embodiments of the application, the accuracy of malicious code detection is improved, the probability of user information leakage is reduced, and network security is improved.

Description

Page detection method and device

Technical Field

The present application relates to the field of security technologies, and in particular, to a page detection method and apparatus.

Background

JavaScript (JS for short) is an interpreted scripting language. JS is widely used in HyperText Markup Language (HTML) pages to add dynamic functions to the HTML pages displayed by browsers.

At present, an illegal user can exploit a browser vulnerability to plant a Trojan in a web page. For example, an illegal user implants malicious JS code (hereinafter referred to as malicious code) targeting a browser vulnerability into an HTML page. When a legitimate user accesses the HTML page carrying the malicious code through a browser in which the vulnerability has not been patched, the malicious code is triggered, user information is leaked, and network security is poor.

Disclosure of Invention

In view of this, an object of the present application is to provide a page detection method and apparatus, so as to improve the accuracy of malicious code detection, reduce the probability of user information leakage, and improve network security. The specific technical scheme is as follows:

in a first aspect, the present application provides a page detection method, including:

acquiring a script language code of a page to be detected;

dividing the script language code into a plurality of lexical segments, and determining a segment type corresponding to each lexical segment;

combining the fragment types corresponding to the plurality of lexical fragments according to the position of each lexical fragment in the script language code to obtain a target Token stream;

and simplifying the expression of the lexical fragments corresponding to the fragment types in the target Token stream to obtain simplified codes corresponding to the script language codes.

With reference to the first aspect, in a first possible implementation manner, the method further includes:

receiving a plurality of script language code segments of the page to be detected, wherein the script language code segments form a script language code of the page to be detected;

caching the multiple script language code fragments into a preset cache space;

the acquiring of the script language code of the page to be detected comprises the following steps:

and reading the plurality of script language code segments from the preset cache space.

With reference to the first aspect or the first possible implementation manner of the first aspect, in a second possible implementation manner, the dividing the script language code into a plurality of lexical segments includes:

dividing each operator in the script language code into a lexical segment;

and dividing at least one continuous character except for an operator in the script language code into a lexical segment.

With reference to the second possible implementation manner of the first aspect, in a third possible implementation manner, the determining a fragment type corresponding to each lexical fragment includes:

for each lexical segment, if the lexical segment is an operator, taking the lexical segment as a segment type of the lexical segment;

if the lexical segment is at least one continuous character, detecting whether the lexical segment is matched with at least one continuous character in a preset syntactic operation function; if so, determining the fragment type of the lexical fragment as an identification type; if not, determining that the fragment type of the lexical fragment is the character type.

With reference to the third possible implementation manner of the first aspect, in a fourth possible implementation manner, the combining, according to the position of each lexical segment in the scripting language code, segment types corresponding to a plurality of lexical segments to obtain a target Token stream includes:

replacing each lexical segment in the script language code with a segment type corresponding to the lexical segment to obtain an initial Token stream;

if the initial Token stream comprises a plurality of continuous target segment types, replacing the plurality of continuous target segment types with an identification type to obtain a target Token stream, wherein the target segment types comprise identification types and preset operators, the first segment type and the last segment type in the plurality of continuous target segment types are identification types, and the plurality of continuous target segment types do not comprise character types;

and if the initial Token stream does not comprise a plurality of continuous target fragment types, taking the initial Token stream as a target Token stream.

With reference to the first aspect, in a fifth possible implementation manner, the preset security engine includes: a code fragment of a malicious page;

the inputting the simplified code into a preset security engine, and detecting whether the page to be detected is a malicious page, includes:

inputting the simplified code into a preset security engine, and detecting whether the simplified code is matched with the code segment;

if so, determining that the page to be detected is a malicious page;

if not, determining that the page to be detected is a normal page.

In a second aspect, the present application further provides a page detecting apparatus, including:

the acquisition module is used for acquiring the script language code of the page to be detected;

the dividing module is used for dividing the script language code into a plurality of lexical fragments and determining a fragment type corresponding to each lexical fragment;

the combination module is used for combining the fragment types corresponding to the plurality of lexical fragments according to the position of each lexical fragment in the script language code to obtain a target Token stream;

the simplification module is used for simplifying expressions of the lexical fragments corresponding to the fragment types in the target Token stream to obtain simplified codes corresponding to the script language codes;

and the detection module is used for inputting the simplified codes into a preset security engine and detecting whether the page to be detected is a malicious page.

With reference to the second aspect, in a first possible implementation manner, the apparatus further includes:

the receiving module is used for receiving a plurality of script language code segments of the page to be detected, and the script language code of the page to be detected is formed by the plurality of script language code segments;

the cache module is used for caching the script language code fragments into a preset cache space;

the obtaining module is specifically configured to read the multiple script language code segments from the preset cache space.

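With reference to the second aspect or the first possible implementation manner of the second aspect, in a second possible implementation manner, the dividing module is specifically configured to:

divide each operator in the script language code into a lexical segment;

and divide at least one continuous character except for an operator in the script language code into a lexical segment.

With reference to the second possible implementation manner of the second aspect, in a third possible implementation manner, the dividing module is specifically configured to:

for each lexical segment, if the lexical segment is an operator, take the lexical segment as the segment type of the lexical segment;

if the lexical segment is at least one continuous character, detect whether the lexical segment matches at least one continuous character in a preset syntactic operation function; if so, determine that the fragment type of the lexical fragment is the identification type; if not, determine that the fragment type of the lexical fragment is the character type.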

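With reference to the third possible implementation manner of the second aspect, in a fourth possible implementation manner, the combination module is specifically configured to:

replace each lexical segment in the script language code with the segment type corresponding to the lexical segment to obtain an initial Token stream;

if the initial Token stream includes a plurality of continuous target segment types, replace the plurality of continuous target segment types with one identification type to obtain the target Token stream, wherein the target segment types include identification types and preset operators, the first segment type and the last segment type in the plurality of continuous target segment types are identification types, and the plurality of continuous target segment types do not include character types;

and if the initial Token stream does not include a plurality of continuous target segment types, take the initial Token stream as the target Token stream.

The simplification module is specifically configured to: construct a syntax tree based on the target Token stream and a preset syntax rule; and simplify the syntax tree to obtain the simplified code.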

With reference to the second aspect, in a fifth possible implementation manner, the preset security engine includes: a code fragment of a malicious page;

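the detection module is specifically configured to:

input the simplified code into the preset security engine, and detect whether the simplified code matches the code fragment;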

if so, determining that the page to be detected is a malicious page;

if not, determining that the page to be detected is a normal page.

In a third aspect, the present application further provides an electronic device comprising a processor and a machine-readable storage medium storing machine-executable instructions executable by the processor, the machine-executable instructions causing the processor to implement the steps of any of the page detection methods described above.

In a fourth aspect, the present application also provides a machine-readable storage medium storing machine-executable instructions that, when invoked and executed by a processor, cause the processor to implement the steps of any of the page detection methods described above.

According to the page detection method and device, after the acquired script language code is divided into a plurality of lexical fragments, the fragment types corresponding to the lexical fragments are combined to obtain the target Token stream. The target Token stream includes the fragment type corresponding to each lexical fragment. The fragment types of different lexical fragments determine the expression relationships among those lexical fragments. Therefore, based on the fragment types included in the target Token stream, expression simplification can be performed on the lexical fragments in the script language code, eliminating the obfuscation or deformation in the script language code and yielding the simplified code corresponding to the script language code. Using the simplified code to detect whether the page to be detected is a malicious page improves the accuracy of malicious code detection, reduces the probability of user information leakage, and improves network security.

Of course, it is not necessary for any product or method of the present application to achieve all of the above-described advantages at the same time.

Drawings

In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings described below show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a first schematic flowchart of a page detection method according to an embodiment of the present application;

Fig. 2 is a second schematic flowchart of a page detection method according to an embodiment of the present application;

Fig. 3 is a third schematic flowchart of a page detection method according to an embodiment of the present application;

Fig. 4 is a schematic diagram of a Token stream according to an embodiment of the present application;

Fig. 5 is a schematic diagram of a syntax tree according to an embodiment of the present application;

Fig. 6 is a fourth schematic flowchart of a page detection method according to an embodiment of the present application;

Fig. 7 is a schematic structural diagram of a page detection apparatus according to an embodiment of the present application;

Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

At present, page detection is mainly performed by defining character rules. Specifically, the JS code of the page to be detected is matched against defined malicious code fragments. If the JS code of the page to be detected contains a code fragment that matches a defined malicious code fragment, the page is judged to have been implanted with malicious code, that is, to be a malicious page. If the JS code contains no code fragment that matches a defined malicious code fragment, the page is judged not to be a malicious page, that is, to be a normal page.

However, owing to the characteristics of the JS language, JS code can be obfuscated or deformed to some extent. Obfuscated or deformed malicious code is likely to fail to match the defined malicious code fragments, so a page implanted with malicious code is judged to be a normal page, user information is leaked, and network security is poor.
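The evasion described above can be illustrated with a minimal Python sketch. The signature list MALICIOUS_FRAGMENTS, the function name matches_character_rule, and both sample snippets are hypothetical and are not taken from the application; the sketch only shows why a verbatim character rule can miss deformed code.

```python
# Hypothetical character rule: one defined malicious code fragment.
MALICIOUS_FRAGMENTS = ["eval(payload)"]


def matches_character_rule(js_code: str) -> bool:
    """Return True if any defined malicious fragment occurs verbatim in the code."""
    return any(fragment in js_code for fragment in MALICIOUS_FRAGMENTS)


plain = "var payload = data; eval(payload);"
# The same call, reached through string concatenation instead of the literal name.
obfuscated = 'var payload = data; window["ev" + "al"](payload);'

print(matches_character_rule(plain))       # True
print(matches_character_rule(obfuscated))  # False
```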

In order to improve the accuracy of malicious code detection, reduce the probability of user information leakage, and improve network security, an embodiment of the application provides a page detection method. The method can be applied to any electronic device with a browser. In the page detection method provided by the embodiment of the application, the electronic device can acquire the script language code of the page to be detected, divide the script language code into a plurality of lexical segments, and determine the segment type corresponding to each lexical segment; combine the segment types corresponding to the plurality of lexical segments according to the position of each lexical segment in the script language code to obtain a target Token stream; perform expression simplification on the lexical segments corresponding to the segment types in the target Token stream to obtain simplified code corresponding to the script language code; and input the simplified code into a preset security engine to detect whether the page to be detected is a malicious page.

In the method provided by the embodiment of the application, the acquired script language code is divided into a plurality of lexical fragments, and the target Token stream is then obtained by combining the fragment types corresponding to the lexical fragments. The target Token stream includes the fragment type corresponding to each lexical fragment. The fragment types of different lexical fragments determine the expression relationships among those lexical fragments. Therefore, based on the fragment types included in the target Token stream, expression simplification can be performed on the lexical fragments in the script language code, eliminating the obfuscation or deformation in the script language code and yielding the simplified code corresponding to the script language code. Using the simplified code to detect whether the page to be detected is a malicious page improves the accuracy of malicious code detection, reduces the probability of user information leakage, and improves network security.

The following describes a page detection method provided in the embodiments of the present application with specific embodiments.

As shown in fig. 1, fig. 1 is a first flowchart of a page detection method provided in the embodiment of the present application. For convenience of description, the following description will be made with reference to an electronic device as an execution subject, and is not intended to be limiting. The method comprises the following steps.

Step S101, acquiring the script language code of the page to be detected.

In this step, the script language code may be a JS code or an HTML code. When the page detection is carried out, the electronic equipment acquires the script language code of the page to be detected.

In one example, the page to be detected may be an HTML page that the user needs to access. For example, when a user searches for a keyword in a browser of the electronic device, the browser displays links to a plurality of HTML pages corresponding to the keyword. The user accesses the HTML page corresponding to a link by clicking the link displayed in the browser. The electronic device can determine the HTML page that the user clicks to access as the page to be detected, and acquire the script language code of that HTML page.

The page to be detected may also be an HTML page stored in the electronic device in advance. The page to be detected is not specifically limited in the embodiment of the application.

In an optional embodiment, after acquiring the script language code of the page to be detected, the electronic device may preprocess the script language code, for example, by deleting comments, redundant line breaks, and the like. This can effectively improve the efficiency of page detection.
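The preprocessing described above might look like the following Python sketch. The function name preprocess is hypothetical, and the sketch assumes that only // and /* */ comments and single- or double-quoted string literals need to be recognized; regular-expression literals and template literals are deliberately not handled.

```python
def preprocess(js_code: str) -> str:
    """Drop // and /* */ comments and blank lines, leaving string literals intact."""
    out = []
    i, n = 0, len(js_code)
    quote = None  # quote character of the string literal currently being read
    while i < n:
        ch = js_code[i]
        if quote:
            out.append(ch)
            if ch == "\\" and i + 1 < n:   # copy an escaped character verbatim
                out.append(js_code[i + 1])
                i += 1
            elif ch == quote:              # end of the string literal
                quote = None
        elif ch in "\"'":
            quote = ch
            out.append(ch)
        elif js_code.startswith("//", i):  # line comment: skip to end of line
            while i < n and js_code[i] != "\n":
                i += 1
            continue
        elif js_code.startswith("/*", i):  # block comment: skip past closing */
            end = js_code.find("*/", i + 2)
            i = n if end == -1 else end + 2
            out.append(" ")                # a comment separates tokens like a space
            continue
        else:
            out.append(ch)
        i += 1
    lines = [line.rstrip() for line in "".join(out).splitlines()]
    return "\n".join(line for line in lines if line.strip())


sample = 'var u = "a//b"; // note\n\n/* block */\nalert(u);\n'
print(preprocess(sample))
```

The string literal "a//b" is kept as-is because the scan tracks whether it is inside quotes before looking for comment markers.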

Step S102, the script language code is divided into a plurality of lexical segments, and the segment type corresponding to each lexical segment is determined.

In this step, after acquiring the script language code of the page to be detected, the electronic device may perform lexical scanning analysis on the script language code of the page to be detected, divide the script language code into a plurality of lexical segments, and determine a segment type corresponding to each lexical segment.

In the embodiment of the application, the division rule for lexical segments and the rule for assigning segment types to lexical segments can be set according to actual requirements. For example, the division rule for lexical segments may be: dividing continuous letters into one lexical segment, dividing continuous digits into one lexical segment, dividing each operator into one lexical segment, and/or dividing each character into one lexical segment, and the like. The rule for assigning segment types may be: a lexical segment of continuous letters is of the character string type, a lexical segment of continuous digits is of the number type, and a lexical segment that is an operator has the operator itself as its type.

In an optional embodiment, the electronic device may perform lexical scanning analysis on the script language code of the page to be detected by using a lexical analyzer, and divide the script language code into a plurality of lexical segments. The lexical analyzer includes, but is not limited to, lex (LEXical analyzer), flex, and the like.

Step S103, combining the fragment types corresponding to the plurality of lexical fragments according to the position of each lexical fragment in the script language code to obtain the target Token stream.

In this step, the electronic device uses the fragment type corresponding to each lexical fragment as a Token, and combines the fragment types corresponding to the plurality of lexical fragments according to the position of each lexical fragment in the script language code to obtain a target Token stream.

For example, script language code 1 is ABCD, where A, B, C, and D each represent a lexical segment. The segment type corresponding to lexical segment A is A', the segment type corresponding to lexical segment B is B', the segment type corresponding to lexical segment C is C', and the segment type corresponding to lexical segment D is D'. The electronic device combines the segment types corresponding to lexical segments A, B, C, and D according to the position of each lexical segment in script language code 1, and obtains the target Token stream {A' B' C' D'}.

Step S104, performing expression simplification on the lexical fragments corresponding to the fragment types in the target Token stream to obtain simplified code corresponding to the script language code.

In this step, the electronic device performs expression simplification on the lexical segment corresponding to the segment type in the target Token stream based on the segment type in the target Token stream, so as to obtain a simplified code corresponding to the script language code.

The fragment types of different lexical fragments determine the expression relationship between different lexical fragments. Therefore, based on the fragment types included in the target Token stream, the expression simplification can be performed on the lexical fragments in the script language code, the confusion or deformation in the script language code is eliminated, and the simplified code corresponding to the script language code is obtained.

In an optional embodiment, the electronic device may perform syntax analysis on the lexical fragments corresponding to the fragment types in the target Token stream by using a syntax analyzer, and further perform expression simplification to obtain simplified codes corresponding to the script language codes.

The above syntax analyzer includes, but is not limited to, yacc (Yet Another Compiler-Compiler), bison, and the like.
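As an illustration of what expression simplification can achieve, the following Python sketch folds constant sub-expressions in a small expression tree. It is not the analyzer of the application: the tuple node shapes, the function name simplify, and the two folding rules (string concatenation and String.fromCharCode with numeric arguments) are assumptions chosen for this example, and chr is used as a stand-in for the JS function, which agrees with it for ordinary character codes.

```python
def simplify(node):
    """Fold constant sub-expressions of a tiny expression tree (illustrative only)."""
    kind = node[0]
    if kind in ("str", "num"):
        return node
    if kind == "+":
        left, right = simplify(node[1]), simplify(node[2])
        if left[0] == "str" and right[0] == "str":
            return ("str", left[1] + right[1])      # "ev" + "al" becomes "eval"
        return ("+", left, right)
    if kind == "call":
        name, args = node[1], [simplify(arg) for arg in node[2]]
        if name == "String.fromCharCode" and all(arg[0] == "num" for arg in args):
            return ("str", "".join(chr(arg[1]) for arg in args))
        return ("call", name, args)
    raise ValueError(f"unsupported node kind: {kind}")


# Tree for the obfuscated expression: "ev" + String.fromCharCode(97, 108)
tree = ("+", ("str", "ev"),
        ("call", "String.fromCharCode", [("num", 97), ("num", 108)]))
print(simplify(tree))  # ('str', 'eval')
```

After folding, the obfuscated expression is a plain string literal that a character rule can match directly.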

Step S105, inputting the simplified code into a preset security engine, and detecting whether the page to be detected is a malicious page.

In this step, the electronic device may input the simplified code into a preset security engine, and detect whether the page to be detected is a malicious page. The predetermined security engine includes, but is not limited to, a virus engine and an Intrusion Prevention System (IPS) engine.

In the embodiment of the application, the electronic device simplifies the script language code of the page to be detected, which eliminates the obfuscation or deformation of the script language code and yields the simplified code. Using the simplified code to detect whether the page to be detected is a malicious page improves the accuracy of malicious code detection, reduces the probability of user information leakage, and improves network security.

In an alternative embodiment, the pre-set security engine may include a code fragment of the malicious page. The electronic equipment inputs the simplified codes into a preset security engine and detects whether the simplified codes are matched with code segments of the malicious page or not; if so, determining that the page to be detected is a malicious page; if not, determining that the page to be detected is a normal page.
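The matching step in this embodiment can be sketched as follows. The SecurityEngine class, its detect method, and the sample signature are hypothetical stand-ins; a real virus or IPS engine would use its own rule format.

```python
from dataclasses import dataclass, field


@dataclass
class SecurityEngine:
    """Toy stand-in for a preset security engine holding malicious code fragments."""
    malicious_fragments: list = field(default_factory=list)

    def detect(self, simplified_code: str) -> str:
        # The page is malicious if the simplified code contains any stored fragment.
        matched = any(frag in simplified_code for frag in self.malicious_fragments)
        return "malicious page" if matched else "normal page"


engine = SecurityEngine(malicious_fragments=['eval("'])  # hypothetical signature
print(engine.detect('eval("payload")'))  # malicious page
print(engine.detect('alert("hello")'))   # normal page
```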

In an optional embodiment, after determining that the page to be detected is a malicious page, the electronic device may raise an alarm. Taking as an example the case where the page to be detected is an HTML page that the user clicks to access in the browser, in one example, the electronic device may mark the link of the malicious page displayed in the browser, for example, mark the link as risky or as not recommended for access. In another example, when the user clicks to access a malicious page, the electronic device may present confirmation information to the user, such as a notice that the page is risky and a prompt asking whether to continue accessing it.

In another optional embodiment, after determining that the page to be detected is a malicious page, the electronic device may perform security protection processing on the malicious code, for example, discard the malicious code in the script language code of the page to be detected.

In another optional embodiment, after determining that the page to be detected is a normal page, the electronic device may not process the normal page. For example, the electronic device allows the user to normally access the normal page.

In an alternative embodiment, referring to fig. 2, step S101, step S1023, and steps S103-S105 may refer to the description of fig. 1, and are not repeated here. The step S102 may specifically include the following steps.

Step S1021, each operator in the script language code is divided into a lexical segment.

Step S1022, divide at least one continuous character in the script language code except the operator into a lexical segment.

For example, the scripting language code 2 of the page to be detected is:

(("gUwIl"<"GjClm")?"MvOtt":"O")+String.fromCharCode(90,71,116,100)+unescape("wr")

The electronic device divides each operator in script language code 2 into a lexical segment, and divides at least one continuous character other than an operator in script language code 2 into a lexical segment. The lexical segments of script language code 2 obtained by the electronic device are then, in order and separated here by spaces:

( ( "gUwIl" < "GjClm" ) ? "MvOtt" : "O" ) + String . fromCharCode ( 90 , 71 , 116 , 100 ) + unescape ( "wr" )

The execution order of steps S1021 and S1022 is not limited in the embodiment of the present application. When obtaining the lexical segments, the electronic device may determine, following the reading order of the script language code, whether the lexical segment currently obtained is an operator or at least one continuous character other than an operator.
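Steps S1021 and S1022 can be sketched as a single left-to-right scan. The Python sketch below is illustrative only: the operator set OPERATORS, the function name split_lexical_segments, and the choice to keep a quoted literal as one lexical segment are assumptions, not details given in the application.

```python
OPERATORS = set("()<>?:+.,")  # assumed operator set, sufficient for script language code 2


def split_lexical_segments(code: str) -> list:
    """Split code into operator segments and runs of non-operator characters."""
    segments, run = [], ""
    i = 0
    while i < len(code):
        ch = code[i]
        if ch in "\"'":
            # Assumption: a quoted literal is kept whole as one lexical segment.
            end = code.find(ch, i + 1)
            end = len(code) - 1 if end == -1 else end
            if run:
                segments.append(run)
                run = ""
            segments.append(code[i:end + 1])
            i = end + 1
            continue
        if ch in OPERATORS or ch.isspace():
            if run:                      # close the current run of characters
                segments.append(run)
                run = ""
            if ch in OPERATORS:          # each operator is its own segment
                segments.append(ch)
        else:
            run += ch
        i += 1
    if run:
        segments.append(run)
    return segments


CODE_2 = '(("gUwIl"<"GjClm")?"MvOtt":"O")+String.fromCharCode(90,71,116,100)+unescape("wr")'
print(" ".join(split_lexical_segments(CODE_2)))
```

Because the scan follows the reading order of the code, operators and character runs are emitted in the order in which they occur, so the two steps do not need a fixed execution order.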

In step S1023, a fragment type corresponding to each lexical fragment is determined.

In an alternative embodiment, referring to fig. 3, step S101, step S1024, and steps S103 to S105 may refer to the description of the above-mentioned portion of fig. 1, and are not repeated here. The step S102 may specifically include the following steps.

Step S1024, the script language code is divided into a plurality of lexical segments.

In step S1025, for each lexical segment, if the lexical segment is an operator, the lexical segment is used as the segment type of the lexical segment.

Step S1026, if the lexical segment is at least one continuous character, detecting whether the lexical segment is matched with at least one continuous character in a preset grammar operation function; if so, determining the fragment type of the lexical fragment as an identification type; if not, determining that the fragment type of the lexical fragment is the character type.

Wherein, at least one continuous character in the preset grammar operation function does not comprise an operator.

In the embodiment of the application, in order to facilitate the expression simplification in the subsequent process, the character types can be further divided into a character string type and a number type. If the lexical segment corresponding to the character type is a number, the segment type of the lexical segment can be subdivided into a number type, otherwise, the segment type of the lexical segment can be subdivided into a character string type.

The example of step S1022 above is used again for explanation. The lexical segments of script language code 2 obtained by the electronic device are, in order and separated here by spaces:

( ( "gUwIl" < "GjClm" ) ? "MvOtt" : "O" ) + String . fromCharCode ( 90 , 71 , 116 , 100 ) + unescape ( "wr" )

For an operator, e.g., "(", the electronic device may take the operator "(" itself as its fragment type. For at least one continuous character, e.g., "gUwIl" and String: if the preset syntactic operation function is String.fromCharCode, the electronic device may determine that the lexical fragment "gUwIl" does not match at least one continuous character in the preset syntactic operation function, and determine that the fragment type of "gUwIl" is the string type; it may determine that the lexical fragment String matches at least one continuous character in the preset syntactic operation function, and determine that the fragment type of String is the identification type. By analogy, the electronic device may obtain the fragment type corresponding to each lexical fragment in script language code 2, as shown in Table 1.

TABLE 1

In Table 1, type indicates the fragment type corresponding to a lexical fragment, and value indicates the fragment value corresponding to the lexical fragment. The fragment value may be determined according to the fragment type. For example, if the fragment type is an operator, the fragment value is null; if the fragment type is the identification type or the string type, the fragment value is the lexical fragment itself. Recording the fragment value corresponding to each lexical fragment facilitates construction of the Token stream. Identity represents the identification type, string represents the string type, and number represents the number type.
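The type and value assignment described above can be sketched in Python as follows. The operator set, the list PRESET_FUNCTIONS (including unescape, which is assumed here to be a preset function), and the function name fragment_type are assumptions made for illustration; giving number fragments their own text as the value is likewise an assumption.

```python
OPERATORS = set("()<>?:+.,")                          # assumed operator set
PRESET_FUNCTIONS = ["String.fromCharCode", "unescape"]  # assumed preset functions
# Continuous character runs of the preset functions, with the operator "." removed.
PRESET_NAMES = {name for fn in PRESET_FUNCTIONS for name in fn.split(".")}


def fragment_type(segment: str) -> dict:
    """Return the type/value pair for one lexical segment."""
    if segment in OPERATORS:
        return {"type": segment, "value": None}       # operator: type is itself
    if segment in PRESET_NAMES:
        return {"type": "Identity", "value": segment}  # matches a preset function
    if segment.isdigit():
        return {"type": "number", "value": segment}    # character type, digits only
    return {"type": "string", "value": segment}        # character type, otherwise


for seg in ["(", '"gUwIl"', "String", "fromCharCode", "90"]:
    print(seg, fragment_type(seg))
```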

In an optional embodiment, after determining the fragment type corresponding to each lexical fragment, the electronic device may replace each lexical fragment in the script language code with the fragment type corresponding to the lexical fragment, so as to obtain an initial Token stream. If the initial Token stream comprises a plurality of continuous target fragment types, replacing the continuous target fragment types with an identification type to obtain a target Token stream, wherein the target fragment types comprise identification types and preset operators, the first fragment type and the last fragment type in the continuous target fragment types are the identification types, and the continuous target fragment types do not comprise character types; and if the initial Token stream does not comprise a plurality of continuous target fragment types, taking the initial Token stream as the target Token stream.

The description is made by taking the script language code 2 and table 1 as an example. The electronic device replaces each lexical segment in the scripting language code 2 with a segment type corresponding to the lexical segment to obtain an initial Token stream, as shown in fig. 4, the upper part in fig. 4 is the scripting language code 2, and the lower part is the initial Token stream. The preset operator is ". wherein String is the type of identification and fromcharrcode is the type of identification. A preset operator ". between String and fromcharrcode is included, so that the Identity. At this time, the electronic device obtains a target Token stream as:

((string<string)?string:string)+Identity(number,number,number,number)+Identity(string).
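
The replacement of continuous target fragment types can be sketched as follows. This is a minimal Python illustration, assuming that "." is the only preset operator and that a run always alternates between Identity and a preset operator; the function name `merge_identities` and the list form of the initial Token stream (inferred here from the target Token stream given in the text) are hypothetical.

```python
from typing import List

PRESET_OPERATORS = {"."}  # assumption: only "." is preset, as in the example


def merge_identities(types: List[str]) -> List[str]:
    """Collapse runs such as Identity . Identity into a single Identity."""
    out = []
    i = 0
    n = len(types)
    while i < n:
        if types[i] != "Identity":
            out.append(types[i])
            i += 1
            continue
        # Extend the run while it continues with <preset operator> Identity,
        # so the run always starts and ends with an Identity.
        j = i
        while j + 2 < n and types[j + 1] in PRESET_OPERATORS and types[j + 2] == "Identity":
            j += 2
        out.append("Identity")
        i = j + 1
    return out


initial = ["(", "(", "string", "<", "string", ")", "?", "string", ":", "string", ")",
           "+", "Identity", ".", "Identity", "(", "number", ",", "number", ",",
           "number", ",", "number", ")", "+", "Identity", "(", "string", ")"]
target = "".join(merge_identities(initial))
print(target)
```

A stream without such a run passes through unchanged, which corresponds to taking the initial Token stream as the target Token stream.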

In an optional embodiment, performing expression simplification in step S104 on the lexical fragments corresponding to the fragment types in the target Token stream, to obtain the simplified code corresponding to the script language code, may specifically include the following steps.

Step S1041, based on the target Token stream, building a syntax tree corresponding to the target Token stream according to a preset syntax rule.

In this step, the electronic device may construct, by using the syntax analyzer, a syntax tree corresponding to the target Token stream according to a preset syntax rule and the target Token stream. Each node in the syntax tree is a lexical segment corresponding to each segment type in the target Token stream.

For example, the target Token stream is x + y × s-m, and if the preset syntax rule is the mathematical operation rule, a syntax tree is constructed based on the target Token stream according to the preset syntax rule, as shown in fig. 5.
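
To make the example concrete, the following Python sketch builds a tree for x + y × s - m under the usual arithmetic precedence. It is an assumption-laden illustration: the nested-tuple node format and the `parse` function are hypothetical, "*" stands in for "×", and the resulting shape is what standard precedence rules yield, not necessarily the exact layout of fig. 5.

```python
from typing import List, Union

Node = Union[str, tuple]  # a leaf is a token; an inner node is (operator, left, right)


def parse(tokens: List[str]) -> Node:
    """Build a syntax tree with '*' and '/' binding tighter than '+' and '-'."""
    pos = 0

    def prod() -> Node:
        nonlocal pos
        node: Node = tokens[pos]
        pos += 1
        while pos < len(tokens) and tokens[pos] in ("*", "/"):
            op = tokens[pos]
            rhs = tokens[pos + 1]
            pos += 2
            node = (op, node, rhs)
        return node

    def addi() -> Node:
        nonlocal pos
        node = prod()
        while pos < len(tokens) and tokens[pos] in ("+", "-"):
            op = tokens[pos]
            pos += 1
            node = (op, node, prod())
        return node

    return addi()


tree = parse(["x", "+", "y", "*", "s", "-", "m"])
print(tree)
```

The multiplication ends up deepest in the tree, so it is evaluated before the addition and subtraction that enclose it.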

In an optional embodiment, the preset grammar rules used by the syntax analyzer may be all the grammar rules of the script language in which the code of the page to be detected is written. For example, if the script language code of the page to be detected is JS code, the preset grammar rules used by the syntax analyzer are all the grammar rules of JS. Because the electronic device performs expression simplification based on all the grammar rules, the accuracy of the simplified code is improved, the probability that malicious code escapes detection after being obfuscated or deformed is effectively reduced, the detection rate of malicious code is improved, the probability of user information leakage is reduced, and network security is improved.

In another optional embodiment, the preset grammar rules used by the syntax analyzer may be a subset of the grammar rules of the script language of the page to be detected. For example, the subset may consist of the more frequently used grammar rules, that is, all the grammar rules except those that are rarely used. As another example, the user may select some of the grammar rules according to actual requirements. When the electronic device performs expression simplification based on a subset of the grammar rules, the number of preset grammar rules is relatively small but still covers common syntax, so the time needed to construct the syntax tree is shortened while much of the obfuscated or deformed code can still be recovered, which improves both the detection rate of malicious code and the efficiency of page detection.

And step S1042, carrying out expression simplification on the syntax tree to obtain simplified codes corresponding to the script language codes.

For example, the preset grammar rules are as follows:

Rule one, the grammar rule of the primary expression, may be specifically expressed as: primary_expr := string | number | identity. That is, primary_expr consists of a string (string), a number (number), or an identifier (identity). primary_expr may correspond to a leaf node in the syntax tree.

Rule two, the grammar rule of the postfix expression, may be specifically expressed as: post_expr := post_expr[expr] | post_expr(expr.) | post_expr.identity | post_expr++ | post_expr-- | primary_expr. This grammar rule indicates that post_expr may be expressed recursively as post_expr[expr], post_expr(expr.), post_expr++, or post_expr--, as the structure post_expr.identity, or as primary_expr.

Rule three, the grammar rule of the unary expression, may be specifically expressed as: unary_expr := ~unary_expr | -unary_expr | post_expr. This grammar rule indicates that unary_expr may be expressed as the value obtained by inverting or negating unary_expr, or as the post_expr described above. Here, ~ is the inversion operation and - is the negation operation.

Rule three, the grammar rule of the multiplication, division, and remainder expression, may be specifically expressed as: prod_expr := prod_expr * unary_expr | prod_expr / unary_expr | prod_expr % unary_expr. This grammar rule indicates that prod_expr may be expressed as the product of prod_expr and unary_expr, the quotient of prod_expr and unary_expr, or the remainder of prod_expr divided by unary_expr. Here, % is the remainder operation.

Rule four, the grammar rule of the addition and subtraction expression, may be specifically expressed as: addi_expr := addi_expr + prod_expr | addi_expr - prod_expr | prod_expr. This grammar rule indicates that addi_expr may be expressed as the sum of addi_expr and prod_expr, or the difference of addi_expr and prod_expr. Here, + is the addition operation and - is the subtraction operation.

Rule five, the grammar rule of the shift expression, may be specifically expressed as: shift_expr := shift_expr << addi_expr | shift_expr >> addi_expr. This grammar rule indicates that shift_expr may be expressed as shift_expr shifted left by addi_expr bits, or shift_expr shifted right by addi_expr bits. Here, << is the left shift operation and >> is the right shift operation.

Rule six, the grammar rule of the relational expression, may be specifically expressed as: rel_expr := rel_expr > shift_expr | rel_expr < shift_expr | rel_expr >= shift_expr | rel_expr <= shift_expr. This grammar rule indicates that rel_expr may be expressed as rel_expr being greater than shift_expr, rel_expr being less than shift_expr, rel_expr being greater than or equal to shift_expr, or rel_expr being less than or equal to shift_expr. Here, > means greater than, < means less than, >= means greater than or equal to, and <= means less than or equal to.

Rule seven, the grammar rule of the equality expression, may be specifically expressed as: equal_expr := equal_expr == rel_expr | equal_expr != rel_expr. This grammar rule indicates that equal_expr may be expressed as equal_expr being equal to rel_expr, or equal_expr being not equal to rel_expr. Here, == means equal to and != means not equal to.

Rule eight, the grammar rule of the exclusive-OR expression, may be specifically expressed as: exclu_expr := exclu_expr ^ equal_expr | equal_expr. This grammar rule indicates that exclu_expr may be expressed as the exclusive OR of exclu_expr and equal_expr, or as equal_expr. Here, ^ is the exclusive-OR operation.

Rule nine, the grammar rule of the conditional expression, may be specifically expressed as: condi_expr := exclu_expr ? expr : condi_expr | exclu_expr. This grammar rule indicates that condi_expr may be expressed as follows: when exclu_expr is true (True), expr is output, and when exclu_expr is false (False), condi_expr is output; alternatively, condi_expr may be expressed as exclu_expr.

Rule ten, the grammar rule of the assignment expression, may be specifically expressed as: assign_expr := unary_expr = assign_expr | condi_expr. This grammar rule indicates that assign_expr may be obtained by assigning assign_expr to unary_expr, or may be expressed as condi_expr. Here, = is the assignment operation.

Rule eleven, the grammar rule of the general expression, may be specifically expressed as: expr := assign_expr | expr, assign_expr. This grammar rule indicates that expr may be expressed as assign_expr, or as expr, assign_expr.

In the embodiments of the present application, the preset grammar rules above are only commonly used JS grammar rules, not all grammar rules. In addition, the expressions in the preset grammar rules, such as post_expr, unary_expr, prod_expr, addi_expr, rel_expr, and shift_expr, have different practical meanings depending on the specific grammar rule, and are not specifically limited here.

The expression simplification performed by the electronic device may start from the grammar rule of the primary expression. That is, the primary expression may be: primary_expr := string | number | identity.

primary_expr may be composed of fragments of the string, number, or identity type in the Token stream. If the fragment type corresponding to "gUwIl" is string, the character string gUwIl is converted into a leaf node of the syntax tree when the syntax tree is constructed.

Alternatively, the expression simplification performed by the electronic device may rely on the terminal symbols in the other expressions. For example, assign_expr := unary_expr = assign_expr defines the structure of the equals sign in the syntax analysis.

When the electronic device constructs the corresponding syntax tree according to the preset syntax rules, the constants in the syntax tree can be simplified.

For ease of understanding, the process of simplifying the expressions of the syntax tree provided in the embodiments of the present application is described with reference to scripting language code 2 above, as shown in Table 2.

TABLE 2

In the expression simplification process shown in Table 2, when the electronic device constructs the nodes of the syntax tree according to the target Token stream, the constants in the syntax tree may be simplified. For example, for the part "gUwIl" < "GjClm" of the JS code, a subtree of a relational expression may be constructed according to rel_expr := rel_expr < shift_expr in the preset grammar rules. The rel_expr on the left ("gUwIl") and the shift_expr on the right ("GjClm") are both string constants in the acquired code, that is, they have specific values. Therefore, the comparison result of "gUwIl" < "GjClm" is determinate: it is either true, meaning that the string constant "gUwIl" is less than the string constant "GjClm", or false, meaning that the string constant "gUwIl" is not less than the string constant "GjClm".

In the first simplification shown in Table 2, based on the string constants "gUwIl" and "GjClm" and the preset grammar rules, the electronic device performs the simplification ("gUwIl" < "GjClm") -> 0, that is, the judgment result of "gUwIl" < "GjClm" is false.

For the character string obtained by the first simplification, i.e., (0; the judgment result of "gUwIl" < "GjClm" is false, so "O" is output. The electronic device may determine that the result of the second simplification is (0.

For the character string obtained by the second simplification, i.e., ("O") + String.fromCharCode(90,71,116,100) + unescape("wr"), the electronic device may, according to the preset grammar rules and the American Standard Code for Information Interchange (ASCII), convert the ASCII value 90 to the character Z, the ASCII value 71 to the character G, the ASCII value 116 to the character t, and the ASCII value 100 to the character d. The result of the third simplification performed by the electronic device is String.fromCharCode(90,71,116,100) -> "ZGtd".

For the character string obtained by the third simplification, i.e., ("O") + "ZGtd" + unescape("wr"), the result of the fourth simplification performed by the electronic device according to the preset grammar rules is unescape("wr") -> "wr".

For the character string obtained by the fourth simplification, i.e., ("O") + "ZGtd" + "wr", the result of the fifth simplification performed by the electronic device according to the preset grammar rules is ("O") + "ZGtd" + "wr" -> "OZGtdwr".

In this way, the electronic device simplifies the character string beginning with (("gUwIl" < "GjClm") to "OZGtdwr". That is, the character string "OZGtdwr" is the character string before the acquired scripting language code 2 was obfuscated or deformed.

In the simplification process shown in Table 2, the electronic device may use a recursive descent analysis method or a shift-reduce analysis method to traverse and simplify the character strings in the acquired JS code based on the preset grammar rules. The five simplifications above are steps of the electronic device's traversal of the constructed syntax tree.
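
The five simplifications can be reproduced with a small recursive traversal over a hand-built syntax tree. The Python sketch below is illustrative only and rests on several assumptions: the tuple node format and the `fold` function are hypothetical, the string "AAAA" in the true branch of the conditional is a placeholder because that literal is not recoverable from the text, comparison results are written as 0 and 1 to follow the notation of the description, and `urllib.parse.unquote` only approximates the JS `unescape` function.

```python
from urllib.parse import unquote


def fold(node):
    """Recursively simplify constant sub-expressions of a syntax tree."""
    kind = node[0]
    if kind in ("str", "num", "id"):
        return node  # leaf nodes are already as simple as possible
    if kind == "lt":
        a, b = fold(node[1]), fold(node[2])
        if a[0] == b[0] == "str":
            return ("num", 1 if a[1] < b[1] else 0)
        return ("lt", a, b)
    if kind == "cond":
        test = fold(node[1])
        if test[0] in ("num", "str"):
            # The condition is a constant, so only one branch can be taken.
            return fold(node[2] if test[1] else node[3])
        return ("cond", test, fold(node[2]), fold(node[3]))
    if kind == "add":
        a, b = fold(node[1]), fold(node[2])
        if a[0] == b[0] == "str":
            return ("str", a[1] + b[1])  # only string concatenation is folded
        return ("add", a, b)
    if kind == "call":
        args = [fold(arg) for arg in node[2]]
        if node[1] == "String.fromCharCode" and all(a[0] == "num" for a in args):
            return ("str", "".join(chr(a[1]) for a in args))
        if node[1] == "unescape" and len(args) == 1 and args[0][0] == "str":
            return ("str", unquote(args[0][1]))
        return ("call", node[1], args)
    raise ValueError(f"unknown node kind: {kind}")


# Hand-built tree for scripting language code 2; "AAAA" is a placeholder literal.
code2 = ("add",
         ("add",
          ("cond", ("lt", ("str", "gUwIl"), ("str", "GjClm")),
           ("str", "AAAA"), ("str", "O")),
          ("call", "String.fromCharCode",
           [("num", 90), ("num", 71), ("num", 116), ("num", 100)])),
         ("call", "unescape", [("str", "wr")]))

print(fold(code2))
```

Any sub-expression that contains a non-constant part, such as an identifier, is left in place, so only the parts whose value is fully determined are folded.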

In summary, with the method provided in the embodiments of the present application, after the acquired script language code is divided into a plurality of lexical fragments, the target Token stream is obtained by combining the fragment types corresponding to the plurality of lexical fragments. The target Token stream includes the fragment type corresponding to each lexical fragment, and the fragment types of different lexical fragments determine the expression relationships among those lexical fragments. Therefore, based on the fragment types included in the target Token stream, expression simplification can be performed on the lexical fragments in the script language code, the obfuscation or deformation in the script language code is eliminated, and the simplified code corresponding to the script language code is obtained. Using the simplified code to detect whether the page to be detected is a malicious page improves the accuracy of malicious code detection, reduces the probability of user information leakage, and improves network security.

In an optional embodiment, according to the page detection method shown in fig. 1, an embodiment of the present application further provides a page detection method. As shown in fig. 6, fig. 6 is a fourth flowchart illustrating a page detection method according to an embodiment of the present application. The method comprises the following steps.

Step S601, receiving a plurality of script language code segments of the page to be detected, wherein the plurality of script language code segments form the script language code of the page to be detected.

In this step, the scripting language code of the page to be detected is divided into a plurality of segments, which are respectively sent to the electronic device, and the electronic device receives the plurality of scripting language code segments of the page to be detected to obtain the scripting language code of the page to be detected.

It can be understood that, because a HyperText Transfer Protocol (HTTP) response message is relatively large, it is split into multiple sub-packets during Transmission Control Protocol (TCP) transmission, and each sub-packet carries a script language code segment of the page to be detected. The electronic device can obtain all the sub-packets corresponding to the page to be detected, obtain the multiple script language code segments from them, and thereby obtain the script language code of the page to be detected.

Step S602, caching a plurality of script language code fragments in a preset cache space.

In this step, the electronic device caches each received script language code segment in the preset cache space, which facilitates the subsequent analysis of the script language code of the page to be detected.

In an example, the scripting language code segments are carried in sub-packets of the HTTP response packet, and in order to facilitate subsequent analysis of the scripting language code of the page to be detected, the electronic device may sequentially cache the scripting language code segments in a preset cache space according to sequence numbers of the sub-packets.

Step S603, reading a plurality of script language code segments from the preset buffer space.

In this step, the electronic device may read a plurality of script language code segments from the preset cache space, where the plurality of script language code segments are the script language codes of the page to be detected.
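
Steps S601 to S603 can be sketched as a cache keyed by sub-packet sequence number. The following Python sketch is an illustration under assumptions: the `SegmentCache` class, its method names, and the use of an in-memory dictionary as the preset cache space are hypothetical, and the sequence numbers and code segments are made-up sample data.

```python
from typing import Dict


class SegmentCache:
    """Hypothetical preset cache space holding script language code segments."""

    def __init__(self) -> None:
        self._segments: Dict[int, str] = {}

    def store(self, seq: int, segment: str) -> None:
        # Segments may arrive in any order; the sequence number is the key.
        self._segments[seq] = segment

    def read_all(self) -> str:
        # Reading back in sequence-number order restores the complete code.
        return "".join(self._segments[seq] for seq in sorted(self._segments))


cache = SegmentCache()
cache.store(2, 'fromCharCode(90,71)')
cache.store(1, 'var s = String.')
cache.store(3, ';')
script_code = cache.read_all()
print(script_code)
```

Ordering by sequence number at read time means the later lexical analysis always sees the code in its original order, regardless of arrival order.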

Step S604, the script language code is divided into a plurality of lexical segments, and the segment type corresponding to each lexical segment is determined.

Step S605, combining the fragment types corresponding to the plurality of lexical fragments according to the position of each lexical fragment in the script language code to obtain a target Token stream.

Step S606, expression simplification is carried out on the lexical fragments corresponding to the fragment types in the target Token stream, and simplified codes corresponding to the script language codes are obtained.

Step S607, inputting the simplified code into a preset security engine, and detecting whether the page to be detected is a malicious page.

The above-described steps S604 to S607 are the same as the above-described steps S102 to S105.

In the page detection method shown in fig. 6, the electronic device caches the script language code segments of the page to be detected in the preset cache space. When performing page detection, the electronic device can therefore read the multiple script language code segments directly from the preset cache space, which makes the script language code easier to acquire, and having the complete script language code facilitates the subsequent expression simplification.

Based on the same inventive concept, according to the page detection method provided by the embodiment of the present application, the embodiment of the present application further provides a page detection device. As shown in fig. 7, fig. 7 is a schematic structural diagram of a page detection apparatus according to an embodiment of the present application. The apparatus includes the following modules.

An obtaining module 701, configured to obtain a scripting language code of a page to be detected;

a dividing module 702, configured to divide the scripting language code into a plurality of lexical segments, and determine a segment type corresponding to each lexical segment;

the combining module 703 is configured to combine the fragment types corresponding to the multiple lexical fragments according to the position of each lexical fragment in the script language code, so as to obtain a target Token stream;

a simplification module 704, configured to simplify an expression of the lexical segment corresponding to the segment type in the target Token stream, to obtain a simplified code corresponding to the script language code;

the detecting module 705 is configured to input the simplified code into a preset security engine, and detect whether the page to be detected is a malicious page.

In an optional embodiment, the page detecting apparatus may further include:

the receiving module is used for receiving a plurality of script language code segments of the page to be detected, and the plurality of script language code segments form script language codes of the page to be detected;

the cache module is used for caching a plurality of script language code fragments into a preset cache space;

the obtaining module 701 is specifically configured to read a plurality of script language code segments from a preset cache space.

In an optional embodiment, the dividing module is specifically configured to:

dividing each operator in the script language code into a lexical segment;

dividing at least one continuous character except operators in the script language code into a lexical segment.

In an alternative embodiment, the dividing module 702 may specifically be configured to:

if the initial Token stream comprises a plurality of continuous target fragment types, replacing the plurality of continuous target fragment types with an identification type to obtain a target Token stream, wherein the target fragment types comprise identification types and preset operators, the first fragment type and the last fragment type in the plurality of continuous target fragment types are the identification types, and the plurality of continuous target fragment types do not comprise character types;

and if the initial Token stream does not comprise a plurality of continuous target fragment types, taking the initial Token stream as the target Token stream.

In an alternative embodiment, the preset security engine may include a code fragment of a malicious page;

the detection module 705 is specifically configured to input the simplified code into a preset security engine, and detect whether the simplified code is matched with the code segment; if so, determining that the page to be detected is a malicious page; if not, determining that the page to be detected is a normal page.
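
The matching performed by the preset security engine can be sketched as a substring check against stored code fragments. This Python sketch is only an assumption about one simple form such matching could take: the function `is_malicious` and the sample fragment list are hypothetical, and a real security engine may match in a different way.

```python
from typing import Iterable


def is_malicious(simplified_code: str, malicious_fragments: Iterable[str]) -> bool:
    """Return True if the simplified code contains any stored malicious fragment."""
    return any(fragment in simplified_code for fragment in malicious_fragments)


# Hypothetical fragment list; "OZGtdwr" is reused from the example purely for illustration.
fragments = ["OZGtdwr"]
print(is_malicious('eval("OZGtdwr")', fragments))
print(is_malicious('alert(1)', fragments))
```

Matching against the simplified code rather than the original code is what lets a fixed fragment list still hit code that was obfuscated or deformed before delivery.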

With the apparatus provided in the embodiments of the present application, after the acquired script language code is divided into a plurality of lexical fragments, the target Token stream is obtained by combining the fragment types corresponding to the plurality of lexical fragments. The target Token stream includes the fragment type corresponding to each lexical fragment, and the fragment types of different lexical fragments determine the expression relationships among those lexical fragments. Therefore, based on the fragment types included in the target Token stream, expression simplification can be performed on the lexical fragments in the script language code, the obfuscation or deformation in the script language code is eliminated, and the simplified code corresponding to the script language code is obtained. Using the simplified code to detect whether the page to be detected is a malicious page improves the accuracy of malicious code detection, reduces the probability of user information leakage, and improves network security.

Based on the same inventive concept, according to the page detection method provided in the foregoing embodiment of the present application, an embodiment of the present application further provides an electronic device, as shown in fig. 8, including a processor 801 and a machine-readable storage medium 802, where the machine-readable storage medium 802 stores machine-executable instructions that can be executed by the processor 801.

In addition, as shown in fig. 8, the electronic device may further include: a communication interface 803 and a communication bus 804; the processor 801, the machine-readable storage medium 802, and the communication interface 803 complete communication with each other through the communication bus 804, and the communication interface 803 is used for communication between the electronic device and other devices.

The processor 801 is caused by machine executable instructions to implement the steps of:

acquiring a script language code of a page to be detected;

dividing the script language code into a plurality of lexical fragments, and determining a fragment type corresponding to each lexical fragment;

combining fragment types corresponding to the plurality of lexical fragments according to the position of each lexical fragment in the script language code to obtain a target Token stream;

carrying out expression simplification on lexical fragments corresponding to the fragment types in the target Token stream to obtain simplified codes corresponding to the script language codes;

and inputting the simplified code into a preset security engine, and detecting whether the page to be detected is a malicious page.

The communication bus 804 may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus. The communication bus 804 may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 8, but this is not intended to represent only one bus or type of bus.

The machine-readable storage medium 802 may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Additionally, the machine-readable storage medium 802 may also be at least one memory device located remotely from the aforementioned processor.

The processor 801 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

With the electronic device provided in the embodiments of the present application, after the acquired script language code is divided into a plurality of lexical fragments, the target Token stream is obtained by combining the fragment types corresponding to the plurality of lexical fragments. The target Token stream includes the fragment type corresponding to each lexical fragment, and the fragment types of different lexical fragments determine the expression relationships among those lexical fragments. Therefore, based on the fragment types included in the target Token stream, expression simplification can be performed on the lexical fragments in the script language code, the obfuscation or deformation in the script language code is eliminated, and the simplified code corresponding to the script language code is obtained. Using the simplified code to detect whether the page to be detected is a malicious page improves the accuracy of malicious code detection, reduces the probability of user information leakage, and improves network security.

Based on the same inventive concept, according to the page detection method provided in the embodiments of the present application, an embodiment of the present application further provides a machine-readable storage medium storing machine-executable instructions which, when invoked and executed by a processor, cause the processor to implement the steps of any one of the page detection methods described above.

It is noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.

All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for embodiments such as the apparatus, the electronic device, and the machine-readable storage medium, since they are substantially similar to the method embodiments, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiments.

The above description is only for the preferred embodiment of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application are included in the protection scope of the present application.

Claims

1. A page detection method, characterized in that the method comprises:

acquiring a script language code of a page to be detected;

carrying out expression simplification on the lexical fragments corresponding to the fragment types in the target Token stream to obtain simplified codes corresponding to script language codes;

2. The method of claim 1, further comprising:

caching the multiple script language code fragments into a preset cache space;

3. The method of claim 1 or 2, wherein said dividing said scripting language code into a plurality of lexical segments comprises:

dividing each operator in the script language code into a lexical segment;

4. The method of claim 3, wherein determining the fragment type corresponding to each lexical fragment comprises:

5. The method according to claim 4, wherein the combining the fragment types corresponding to the plurality of lexical fragments according to the position of each lexical fragment in the scripting language code to obtain the target Token stream comprises:

6. The method of claim 1, wherein the preset security engine comprises: a code fragment of a malicious page;

if so, determining that the page to be detected is a malicious page;

if not, determining that the page to be detected is a normal page.

7. A page detection apparatus, characterized in that the apparatus comprises:

8. The apparatus of claim 7, further comprising:

9. The apparatus according to claim 7 or 8, wherein the partitioning module is specifically configured to:

dividing each operator in the script language code into a lexical segment;

10. The apparatus according to claim 9, wherein the partitioning module is specifically configured to:

11. The apparatus according to claim 10, wherein the partitioning module is specifically configured to:

12. The apparatus of claim 7, wherein the preset security engine comprises: a code fragment of a malicious page;

the detection module is specifically configured to:

if so, determining that the page to be detected is a malicious page;

if not, determining that the page to be detected is a normal page.