CN110929257B

CN110929257B - Method and device for detecting malicious codes carried in webpage

Info

Publication number: CN110929257B
Application number: CN201911040978.XA
Authority: CN
Inventors: 侯贺明; 王赟; 黄华桥; 程波; 曾伟; 谭国权; 李明栋
Original assignee: Wuhan Greenet Information Service Co Ltd
Current assignee: Wuhan Greenet Information Service Co Ltd
Priority date: 2019-10-30
Filing date: 2019-10-30
Publication date: 2022-02-01
Anticipated expiration: 2039-10-30
Also published as: CN110929257A

Abstract

The invention relates to the technical field of the internet and provides a method and a device for detecting malicious code carried in a webpage. The method comprises: assigning the User_agent field of a first website request message using the User-agent of a search engine crawler, and sending the first website request message to the address of the webpage to be detected; assigning the User_agent field of a second website request message using the User-agent of an ordinary terminal user, and sending the second website request message to the same address; and matching the contents carried in the first response message and the second response message, and, if the difference between them exceeds a preset condition, marking the webpage to be detected as a website potentially carrying malicious code. The invention simulates a search engine crawler fetching the website home page, simulates a normal user fetching the same page, and compares the titles obtained in the two cases, which is effective at detecting Trojan injection ("horse hanging") carried out for black-hat SEO.

Description

Method and device for detecting malicious codes carried in webpage

[ technical field ]

The invention relates to the technical field of internet, in particular to a method and a device for detecting malicious codes carried in a webpage.

[ background of the invention ]

"Web page Trojan" is a colloquial term for malicious code delivered in the form of a web page. Web page malicious code falls into two types. In the first, an originally normal web page is modified into one with malicious functions; the modifications include, but are not limited to, adding or modifying the page title, meta fields, JavaScript code, iframe tags, and so on. The second type is the web page backdoor, also called a webshell; such a malicious page is not a modification of an original page but a single file consisting entirely of malicious code supplied by the hacker.

For the first type, there are two common ways of exploitation: black-hat SEO (SEO stands for Search Engine Optimization) and browser exploits. Black-hat SEO essentially deceives the search engine by means of hacking techniques, so that website addresses that should not appear are shown on the search engine's result pages, where users can then find the related illegal websites. Black-hat SEO can be implemented in many ways; common methods include hidden links (dark chains), keyword stacking, spider spoofing, parasite techniques, and so on. However, effective means of identifying and detecting them are scarce.

In view of the above, overcoming the drawbacks of the prior art is an urgent problem in the art.

[ summary of the invention ]

The technical problem to be solved by the invention is that the prior art lacks an effective means of detecting the spider spoofing and keyword stacking used in black-hat SEO.

The invention further aims to provide a method and a device for detecting malicious codes carried in a webpage.

The invention adopts the following technical scheme:

in a first aspect, the present invention provides a method for detecting malicious codes carried in a web page, including:

assigning a User_agent field in the first website request message by using a User-agent of a search engine crawler, and then sending the first website request message to the address of the webpage to be detected;

receiving a first response message returned by the webpage to be detected, and storing one or more of a webpage title, text content and an HTML (hypertext markup language) tag carried in the response message;

assigning a User_agent field in the second website request message by using a User-agent of an ordinary terminal user, and then sending the second website request message to the address of the webpage to be detected;

receiving a second response message returned by the webpage to be detected, and storing one or more of a webpage title, text content and an HTML (hypertext markup language) tag carried in the response message;

and matching one or more of the webpage title, the text content and the HTML label carried in the first response message and the second response message, and if the difference of the matching result is greater than a preset condition, calibrating the webpage to be detected as a website potentially carrying malicious codes.

Preferably, the matching of one or more of the web title, the text content, and the HTML tag carried in the first response message and the second response message, and if the difference of the matching result is greater than a preset condition, calibrating the web page to be detected as a website potentially carrying malicious codes specifically include:

matching the webpage titles carried in the first response message and the second response message, and if the matching results are not completely the same, calibrating the webpage to be detected as a website potentially carrying malicious codes;

and if the matching results are not completely the same, determining that the difference of the matching results is greater than a preset condition.

Preferably, after the web page to be detected is marked as a website potentially carrying malicious codes, the method further includes:

and matching the keyword libraries of the illegal websites according to the difference bytes in the webpage titles carried in the matched first response message and second response message, and marking the websites potentially carrying malicious codes as websites believed to carry the malicious codes if the matching is successful.

matching the keyword libraries of the illegal websites according to the difference bytes in the webpage titles carried in the matched first response message and second response message, and marking the websites potentially carrying malicious codes as normal websites if the matching is unsuccessful;

when the website is required to be detected again to determine whether the website carries malicious codes or not, directly matching the differential bytes matched in the current round with the differential bytes matched in the previous round, and if the differential bytes are matched, still marking the website as a normal website; and the previous round is the round marked as the normal website.

if the continuous appointed turns are normal websites according to the webpage title matching result, further performing the matching process of the text content and the illegal website keyword library to complete secondary verification;

if the secondary verification is determined to be a normal website, the subsequent processes are still normal websites according to the continuous appointed number of rounds and the webpage title matching result, and a round of periodic process of the secondary verification is carried out.

matching HTML tags carried in the first response message and the second response message, and if the matching result is that the HTML tag pairs are not completely consistent, calibrating the webpage to be detected as a website potentially carrying malicious codes;

and if the matching result is that the HTML label pair is not completely consistent, judging that the difference of the matching result is greater than a preset condition.

Preferably, the method further comprises: and matching the text contents carried in the first response message and the second response message by using an illegal website keyword library, and marking as a website which is believed to carry malicious codes if the matching result is that the number of successfully matched text contents and the illegal website keyword library exceeds a first preset threshold value.

Preferably, the User-agent type of the search engine crawler includes one or more of Baidu, Sogou, Google, and 360.

In a second aspect, the present invention further provides an apparatus for detecting malicious code carried in a web page, which is used to implement the method for detecting malicious code carried in a web page in the first aspect, and the apparatus includes:

at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are programmed to perform the method for detecting malicious code carried in a web page according to the first aspect.

In a third aspect, the present invention further provides a non-volatile computer storage medium, where the computer storage medium stores computer-executable instructions, and the computer-executable instructions are executed by one or more processors, so as to complete the method for detecting malicious codes carried in a webpage according to the first aspect.

The invention detects whether a website has been injected with a Trojan ("hung with a horse") by actively simulating a search engine crawler fetching the website home page: it fetches the home page content once as a search engine crawler and once as a normal user, and compares the titles obtained in the two cases. This is effective at detecting Trojan injection carried out for black-hat SEO.

In addition, the pairing of HTML tags in the HTML page is checked, and abnormalities in the HTML page structure are used to judge whether malicious code has been injected; the text content is matched against the keyword list to judge whether the web page is being used for black-hat SEO promotion or has been attacked with malicious code.

[ description of the drawings ]

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the embodiments of the present invention will be briefly described below. It is obvious that the drawings described below are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.

Fig. 1 is a schematic flowchart of a method for detecting malicious codes carried in a web page according to an embodiment of the present invention;

fig. 2 is a schematic flowchart illustrating an improved method for detecting malicious codes carried in a web page according to an embodiment of the present invention;

fig. 3 is a schematic flowchart of a method for detecting malicious codes carried in a web page based on text content according to an embodiment of the present invention;

fig. 4 is a flowchart illustrating a typical method for detecting malicious codes carried in a web page according to an embodiment of the present invention;

fig. 5 is a schematic structural diagram of a detection apparatus for detecting malicious codes carried in a web page according to an embodiment of the present invention.

[ detailed description ]

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

In the description of the present invention, the terms "inner", "outer", "longitudinal", "lateral", "upper", "lower", "top", "bottom", and the like indicate orientations or positional relationships based on those shown in the drawings, and are for convenience only to describe the present invention without requiring the present invention to be necessarily constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention.

When a hacker has compromised a website and wants to profit from it by means of black-hat SEO, a commonly used approach is spider spoofing combined with keyword stacking. Spider spoofing works as follows: the hacker modifies the source code of the website home page and adds JavaScript code that judges the identity of the visitor by checking the visitor's User-agent. If the visitor is found to be a search engine crawler, such as Baidu, Bing, or Google, a page title and page meta information set by the hacker are returned; if the visitor is found to be a normal user, the original page of the website is returned with its title and content unmodified. Keyword stacking means that, in order to increase the weight of a page for a certain keyword, the hacker piles that keyword or related keywords in large numbers into positions such as the page title, meta tags, and content.

The method for detecting a web page injected with a Trojan comprises the following steps:

Access to the home page of the target website is initiated once with the User-agent of a normal user and once with the User-agent of a search engine, and the contents of the TITLE tags in the HTML source obtained by the two accesses are compared. If the titles differ, the web page may have been injected with a Trojan using the spider spoofing technique of black-hat SEO.

A keyword list (also referred to as the illegal website keyword library in the embodiments of the invention) is prepared in advance and contains the target keywords commonly used by attackers. The source code of the target website's home page, as fetched with a search engine User-agent, is searched for these keywords; if the keywords can be matched in the home page source, the web page is judged to have been injected with a Trojan. The User-agents of normal users can be kept in a list covering typical browser User-agents; the User-agents of search engines can likewise be kept in a list covering several mainstream search engines, such as Baidu, Google, and Bing; one User-agent can be chosen at random from the list for each access. For the keyword list, to prevent false alarms during matching, a threshold can be set as the minimum number of matches, and a hit is reported only when that threshold is reached.
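The keyword threshold described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the keyword list, the sample pages, the function names, and the default threshold of 3 are all hypothetical, and the sketch assumes that "number of matches" means the number of distinct keywords found in the page source.

```python
def count_keyword_hits(page_source: str, keywords: list[str]) -> int:
    """Count how many distinct keywords from the list occur in the page source."""
    text = page_source.lower()
    return sum(1 for kw in keywords if kw and kw.lower() in text)


def is_keyword_hit(page_source: str, keywords: list[str], min_matches: int = 3) -> bool:
    """Report a hit only when at least `min_matches` keywords are found (assumed default)."""
    return count_keyword_hits(page_source, keywords) >= min_matches


# Hypothetical keyword library and page sources, for illustration only.
KEYWORDS = ["casino", "lottery", "betting"]
crawler_view = "<html><head><title>Casino lottery betting online</title></head></html>"
normal_view = "<html><head><title>City Library</title></head><body>Lottery of books</body></html>"

print(is_keyword_hit(crawler_view, KEYWORDS))  # True: three keywords matched
print(is_keyword_hit(normal_view, KEYWORDS))   # False: only one keyword matched
```

A single incidental keyword, as in the second page, stays below the threshold and is not reported, which is the false-alarm protection the text describes.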

In addition to the title-change detection described above, it has been found that a hacker often damages the original HTML structure when modifying the source code of an HTML page (because the hacker has to merge their own content into a normal page, the original format is difficult to preserve). Two symptoms are typical: first, multiple title tags are introduced into the HTML page; second, new data is introduced before the <html> tag or after the </html> tag. Because an HTML file starts with <html> and ends with </html>, data found outside these tags is likely to be malicious code injected by a hacker.
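The two structural symptoms can be checked with a few regular expressions, as in the sketch below. The function name and the returned labels are hypothetical. The sketch also assumes that a doctype declaration, comments, and whitespace may legitimately appear before <html>; the source text does not discuss that allowance.

```python
import re

TITLE_OPEN = re.compile(r"<title\b", re.IGNORECASE)
HTML_OPEN = re.compile(r"<html\b[^>]*>", re.IGNORECASE)
HTML_CLOSE = re.compile(r"</html\s*>", re.IGNORECASE)
# Assumption: whitespace, a doctype and comments are acceptable before <html>.
ALLOWED_PREFIX = re.compile(
    r"(?:\s|<!doctype[^>]*>|<!--.*?-->)*", re.IGNORECASE | re.DOTALL
)


def structure_anomalies(page_source: str) -> list[str]:
    """Return labels for the structural symptoms found in the page source."""
    anomalies = []
    if len(TITLE_OPEN.findall(page_source)) > 1:
        anomalies.append("multiple title tags")
    opening = HTML_OPEN.search(page_source)
    if opening and not ALLOWED_PREFIX.fullmatch(page_source[:opening.start()]):
        anomalies.append("data before <html>")
    closings = list(HTML_CLOSE.finditer(page_source))
    if closings and page_source[closings[-1].end():].strip():
        anomalies.append("data after </html>")
    return anomalies


clean = "<!DOCTYPE html>\n<html><head><title>A</title></head><body></body></html>\n"
tampered = (
    "<script>x()</script><html><head><title>A</title><title>B</title></head></html>"
    "<a href='http://spam.example/'>x</a>"
)
print(structure_anomalies(clean))
print(structure_anomalies(tampered))
```

Regular expressions are used here only to keep the sketch short; a production checker would probably rely on a tolerant HTML parser instead.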

The detection method described here applies to the first category of web page Trojan, namely black-hat SEO using spider spoofing and keyword stacking. In other words, by analyzing a large number of samples of this type of malicious web code, some common behaviors and features were identified, and the detection method in this document is designed specifically around them.

In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.

Example 1:

embodiment 1 of the present invention provides a method for detecting a malicious code carried in a web page, as shown in fig. 1, including:

in step 201, a User-agent of the search engine crawler is used to assign a User_agent field in the first website request message, and then the first website request message is sent to the address of the web page to be detected.

Wherein the User-agent type of the search engine crawler comprises one or more of Baidu, Sogou, Google, and 360.

In step 202, a first response message returned by the web page to be detected is received, and one or more items of a web page title, text content and an HTML tag carried in the response message are stored.

In step 203, the User-agent of an ordinary terminal user is used to assign a User_agent field in the second website request message, and then the second website request message is sent to the address of the webpage to be detected.

In step 204, a second response message returned by the web page to be detected is received, and one or more items of the web page title, the text content and the HTML tag carried in the response message are stored.

In step 205, one or more of the web title, the text content, and the HTML tag carried in the first response message and the second response message are matched, and if the difference of the matching result is greater than a preset condition, the web to be detected is marked as a website potentially carrying malicious codes.

The embodiment of the invention detects whether a website has been injected with a Trojan by actively simulating a search engine crawler fetching the website home page: it fetches the home page content once as a search engine crawler and once as a normal user, and compares the titles obtained in the two cases. This is effective at detecting Trojan injection carried out for black-hat SEO.
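Steps 201 to 205 can be sketched as below, restricted to the title comparison. This is a minimal illustration under stated assumptions: the User-Agent strings are examples, the function names are hypothetical, the page is assumed to be UTF-8, and the "preset condition" is taken to be "the two titles are not identical". The fetcher is a parameter so that the logic can be exercised without network access; the two fake sites stand in for a cloaking site and an honest one.

```python
import random
import re
import urllib.request

# Example User-Agent strings (assumptions); a deployment would keep fuller lists.
CRAWLER_UAS = [
    "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
]
BROWSER_UAS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0",
]
TITLE_RE = re.compile(r"<title\b[^>]*>(.*?)</title\s*>", re.IGNORECASE | re.DOTALL)


def fetch(url, user_agent, timeout=10):
    """Fetch a page while presenting the given User-Agent header."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def first_title(page):
    """Return the text of the first <title> element, or an empty string."""
    match = TITLE_RE.search(page)
    return match.group(1).strip() if match else ""


def is_potentially_malicious(url, fetcher=fetch):
    crawler_page = fetcher(url, random.choice(CRAWLER_UAS))  # steps 201-202
    browser_page = fetcher(url, random.choice(BROWSER_UAS))  # steps 203-204
    return first_title(crawler_page) != first_title(browser_page)  # step 205


def fake_cloaking_site(url, user_agent):
    """Stand-in for a site that serves crawlers a different title."""
    ua = user_agent.lower()
    if "spider" in ua or "bot" in ua:
        return "<html><head><title>Online casino bonus</title></head></html>"
    return "<html><head><title>City Library</title></head></html>"


def fake_honest_site(url, user_agent):
    return "<html><head><title>City Library</title></head></html>"


print(is_potentially_malicious("http://www.example.com/", fetcher=fake_cloaking_site))  # True
print(is_potentially_malicious("http://www.example.com/", fetcher=fake_honest_site))    # False
```

Calling `is_potentially_malicious(url)` without the `fetcher` argument would perform the two real HTTP requests.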

The embodiment of the present invention provides at least the following preferred modes for step 205, that is, for matching one or more of the web page title, the text content, and the HTML tags carried in the first response message and the second response message and, if the difference of the matching result is greater than a preset condition, marking the web page to be detected as a website potentially carrying malicious code:

The first mode is as follows:

In practice, the web page addresses in the list maintained by the server must be security-checked periodically, that is, steps 201 to 205 of the embodiment of the present invention are executed repeatedly to verify whether each address carries malicious code. In the first mode proposed here, once a round of verification has concluded that a site is a normal website, the next round or rounds can use the following preferred scheme to improve detection efficiency. The scheme is particularly suited to the case where the web page titles obtained in step 202 and step 204 differ and the web page to be detected has therefore been marked as a website potentially carrying malicious code. As shown in fig. 2, after the web page to be detected is marked as a website potentially carrying malicious code, the method further includes:

in step 301, the difference bytes between the web page titles carried in the first response message and the second response message are matched against the illegal website keyword library, and if the matching is unsuccessful, the website potentially carrying malicious code is marked as a normal website.

In step 302, when the website needs to be checked again for malicious code, the difference bytes obtained in the current round are compared directly with the difference bytes obtained in the previous round, and if they match, the website is still marked as a normal website; here the previous round is the round in which the website was marked as a normal website.
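Steps 301 and 302 can be sketched as follows. This is an illustration only: the text does not say how the "difference bytes" are computed, so the sketch assumes they are the parts of the crawler-side title that do not appear in the browser-side title, obtained with Python's difflib. The function names, the cache dictionary, and the sample keyword library are hypothetical.

```python
import difflib


def title_difference(crawler_title: str, browser_title: str) -> str:
    """Return the parts of the crawler title absent from the browser title."""
    matcher = difflib.SequenceMatcher(None, browser_title, crawler_title, autojunk=False)
    parts = [
        crawler_title[j1:j2]
        for tag, _i1, _i2, j1, j2 in matcher.get_opcodes()
        if tag in ("replace", "insert")
    ]
    return " ".join(p.strip() for p in parts if p.strip())


def classify_round(site, crawler_title, browser_title, keyword_library, benign_diffs):
    """Classify one round; `benign_diffs` remembers the last difference judged normal."""
    diff = title_difference(crawler_title, browser_title)
    # Step 302: same difference as in the previous round marked normal.
    if benign_diffs.get(site) == diff:
        return "normal"
    # Step 301: match the difference against the illegal website keyword library.
    if any(kw and kw.lower() in diff.lower() for kw in keyword_library):
        return "malicious"
    benign_diffs[site] = diff
    return "normal"


library = ["casino", "lottery"]  # hypothetical keyword library
cache = {}
print(classify_round("example.com", "City Library - members area", "City Library", library, cache))
print(classify_round("example.com", "City Library - members area", "City Library", library, cache))
print(classify_round("example.com", "City Library - casino bonus", "City Library", library, cache))
```

The second call takes the step 302 shortcut because the difference is unchanged; the third call sees a new difference, falls back to the keyword library, and is flagged.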

The rationale for the first mode is that the basic purpose of maliciously tampering with the code of a normal web page is to promote the attacker's own website content. Whether the normal website is modified directly or made to redirect through a link, the end result is a substantial difference in the web page title, and the corresponding difference bytes are very likely to be recorded in the illegal website keyword library. The method of the embodiment of the invention therefore applies the check directly to the web page title, which consumes few analysis resources and is efficient, and is suitable for cases where a large list contains many websites to be marked.

However, testing showed that some websites, although normal, return a different web page title to a request carrying a search engine crawler User-agent than to a request carrying an ordinary terminal user's User-agent, for example to protect privileges reserved for members or registered users. For such websites the web page titles carried in the first response message and the second response message are inconsistent, so the web page to be detected is first marked as a website potentially carrying malicious code, and the corresponding difference bytes are then matched against the illegal website keyword library to identify it.

If a website has been marked as a normal website by web page title matching many times in a row, a full match of the text content carried in the web page, which consumes more resources, is performed once to achieve comprehensive verification. This ensures the reliability of the final verification result and implements a hierarchical verification process. An embodiment of the present invention therefore further provides the following preferred scheme; as shown in fig. 3, after the web page to be detected is marked as a website potentially carrying malicious code, the method further includes:

in step 401, if the web page title matching result has been a normal website for a specified number of consecutive rounds, the text content is further matched against the illegal website keyword library to complete a secondary verification.

In step 402, if the secondary verification determines that the website is a normal website, then each subsequent time the web page title matching result is again a normal website for the specified number of consecutive rounds, another round of the secondary verification is performed, so that the secondary verification recurs periodically.
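The periodic behaviour of steps 401 and 402 amounts to a per-site counter, sketched below. The class name, the action labels, and the default of 3 consecutive rounds are hypothetical; the text only says that the number of rounds is specified.

```python
from dataclasses import dataclass


@dataclass
class SiteState:
    consecutive_normal: int = 0  # rounds in a row judged normal by title matching


def next_action(state: SiteState, title_result_normal: bool, rounds_required: int = 3) -> str:
    """Decide what to do after one round of title matching."""
    if not title_result_normal:
        state.consecutive_normal = 0
        return "flag"
    state.consecutive_normal += 1
    if state.consecutive_normal >= rounds_required:
        # Step 401: trigger the text-content check; step 402: start a new period.
        state.consecutive_normal = 0
        return "secondary_verification"
    return "title_check_only"


state = SiteState()
actions = [next_action(state, True) for _ in range(6)]
print(actions)
```

With the assumed default, every third consecutive normal round triggers the more expensive text-content check, and any abnormal round resets the count.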

In the preferred embodiment, in addition to the specified number of rounds described in steps 401 and 402, the load on the server's computing resources may be taken into consideration; that is, a website due for secondary verification is preferably scheduled for it when the server's computing resource usage is low.

The second mode is as follows:

As in the first mode, some platforms to which the web page to be detected belongs respond differently to access by ordinary users and access by search engine crawlers. In that case a conclusion cannot be obtained directly through the second mode, so websites marked as potentially carrying malicious code will exist.

In this case, to complete a rigorous verification, the method further includes: matching the text contents carried in the first response message and the second response message against the illegal website keyword library, and marking the website as one believed to carry malicious code if the number of successful matches between the text content and the illegal website keyword library exceeds a first preset threshold.

It should be noted that, in a specific implementation, the first mode and the second mode may also be used in combination, and the accuracy of the combination can be established through statistics. That is, in the embodiment of the present invention text content matching is the fallback analysis; if the conclusion that a website carries malicious code, reached when both the web page title comparison and the HTML tag pair comparison show a difference, proves to be accurate in 99% of cases, the combination can serve as an optional third implementation of the present invention, and the matching of text content against the illegal website keyword library can be skipped. Its feasibility requires further experimental demonstration, but it is a potentially viable implementation and should also fall within the scope of the embodiments of the present invention.

Example 2:

This embodiment explains, using the User-agent of the Baidu search engine crawler, how the home page of a website is fetched, how the page content is stored as Page_1, and how the web page title is obtained in steps 201 to 204 of embodiment 1. A mainstream search engine such as Baidu is selected, and the search engine crawler User-agent is set to that of the Baidu crawler, namely:

Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)

As shown in fig. 4, specifically:

in step 501, the home page of the website is crawled using the User-agent of the Baidu search engine crawler, and the page content is stored as Page_1. When crawling a website (i.e., sending the first website request message described in embodiment 1), only the home page content needs to be crawled, not all pages. Take www.example.com as an example: if a domain name is provided rather than a URL with a protocol field, it must first be determined whether the protocol of the target domain name is http or https. This is done by probing ports 80 and 443 of the IP address corresponding to the domain name; if the target provides both http and https service, the https service should be selected. If a complete URL is provided, such as http://www.example.com/, the website can be accessed directly to obtain the content of the home page file. After the home page is obtained, its content must be decoded properly; decoding here means converting the byte data into character string data in preparation for subsequent retrieval.
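The protocol selection and decoding in step 501 can be sketched as follows. The function names are hypothetical, and the fallback order of encodings (declared charset, then UTF-8, then GB18030) is an assumption; the text only requires that the bytes be decoded properly. The port probe is passed in as a parameter so that the selection logic can be tried without network access.

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def build_home_url(target: str, probe=port_open) -> str:
    """Turn a bare domain name into a home page URL, preferring https."""
    if target.startswith(("http://", "https://")):
        return target  # a complete URL was provided; use it directly
    if probe(target, 443):
        return f"https://{target}/"
    if probe(target, 80):
        return f"http://{target}/"
    raise ConnectionError(f"neither port 443 nor port 80 is open on {target}")


def decode_page(raw: bytes, declared_charset=None) -> str:
    """Decode page bytes into a string, trying several encodings in turn."""
    for encoding in (declared_charset, "utf-8", "gb18030"):
        if not encoding:
            continue
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue
    return raw.decode("utf-8", errors="replace")


# Fake probes standing in for real port checks.
print(build_home_url("www.example.com", probe=lambda host, port: True))
print(build_home_url("www.example.com", probe=lambda host, port: port == 80))
```

When both ports answer, the https URL is chosen, matching the preference stated in step 501.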

In step 502, the home page of the website is crawled using the User-agent of a mainstream browser, and the page content is stored as Page_2. The browser may be a mainstream browser such as Chrome, IE, Firefox, or Safari. As before, only the home page content needs to be crawled, not all pages, and the content obtained must be decoded from byte data into character string data in preparation for subsequent retrieval.

In step 503, a title is extracted from the page content. Titles, denoted Title_1 and Title_2, are extracted from Page_1 and Page_2 obtained in steps 501 and 502, respectively. The title here refers to the data between the <title> and </title> tags in the HTML. In a few cases several title tags appear in one HTML page, which in itself indicates that the structure of the page has been damaged; in that case the first title tag is taken as the title of the page.
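Step 503 can be sketched with the standard library HTML parser, as below. The class and function names are hypothetical. The function returns the first title together with a flag saying whether more than one title tag was seen, since the text treats that situation as a sign of a damaged page.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collect the text of every <title> element in document order."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.titles[-1] += data


def extract_title(page: str):
    """Return (first title text, whether more than one title tag was found)."""
    collector = TitleCollector()
    collector.feed(page)
    collector.close()
    first = collector.titles[0].strip() if collector.titles else ""
    return first, len(collector.titles) > 1


page_1 = "<html><head><title> Home </title><title>Spam</title></head></html>"
print(extract_title(page_1))
```

Title_1 and Title_2 would then be the first elements of `extract_title(Page_1)` and `extract_title(Page_2)`.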

Example 3:

fig. 5 is a schematic structural diagram of a detection apparatus for detecting malicious codes carried in a web page according to an embodiment of the present invention. The detection apparatus for detecting malicious codes carried in a web page of the present embodiment includes one or more processors 21 and a memory 22. In fig. 5, one processor 21 is taken as an example.

The processor 21 and the memory 22 may be connected by a bus or other means, and fig. 5 illustrates the connection by a bus as an example.

The memory 22, which is a non-volatile computer-readable storage medium, can be used to store non-volatile software programs and non-volatile computer-executable programs, such as the method for detecting malicious code carried in a web page in embodiment 1. The processor 21 executes the detection method for malicious code carried in the web page by executing the non-volatile software program and instructions stored in the memory 22.

The memory 22 may include high speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the memory 22 may optionally include memory located remotely from the processor 21, and these remote memories may be connected to the processor 21 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The program instructions/modules are stored in the memory 22, and when executed by the one or more processors 21, perform the method for detecting malicious code carried in a web page in the foregoing embodiment 1, for example, perform the above-described steps shown in fig. 1 to 4.

Those of ordinary skill in the art will appreciate that all or part of the steps of the various methods of the embodiments may be implemented by associated hardware as instructed by a program, which may be stored on a computer-readable storage medium, which may include: read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims

1. A method for detecting malicious codes carried in a webpage is characterized by comprising the following steps:

matching one or more of the webpage title, the text content and the HTML label carried in the first response message and the second response message, and if the difference of the matching result is greater than a preset condition, calibrating the webpage to be detected as a website potentially carrying malicious codes;

after the webpage to be detected is marked as a website potentially carrying malicious codes, the method further comprises the following steps:

2. The method according to claim 1, wherein the matching of the first response message and the second response message includes one or more of a web title, a text content, and an HTML tag, and if a difference of a matching result is greater than a preset condition, the method marks the web page to be detected as a website potentially carrying malicious codes, and specifically includes:

3. The method for detecting malicious codes carried in web pages according to claim 2, wherein after the web page to be detected is marked as a website potentially carrying the malicious codes, the method further comprises:

4. The method for detecting malicious codes carried in web pages according to claim 1, wherein after the web page to be detected is marked as a website potentially carrying the malicious codes, the method further comprises:

5. The method according to claim 1, wherein the matching of the first response message and the second response message includes one or more of a web title, a text content, and an HTML tag, and if a difference of a matching result is greater than a preset condition, the method marks the web page to be detected as a website potentially carrying malicious codes, and specifically includes:

6. The method for detecting malicious codes carried in a webpage according to claim 5, wherein the method further comprises: and matching the text contents carried in the first response message and the second response message by using an illegal website keyword library, and marking as a website which is believed to carry malicious codes if the matching result is that the number of successfully matched text contents and the illegal website keyword library exceeds a first preset threshold value.

7. The method for detecting malicious codes carried in a webpage according to any one of claims 1-6, wherein the User-agent type of the search engine crawler includes one or more of Baidu, Sogou, Google, and 360.

8. An apparatus for detecting malicious code carried in a web page, the apparatus comprising:

at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being programmed to perform a method for detecting malicious code carried in a web page as claimed in any one of claims 1 to 7.