CN110929257B - Method and device for detecting malicious codes carried in webpage - Google Patents

Info

Publication number
CN110929257B
Authority
CN
China
Prior art keywords
website
webpage
matching
response message
malicious codes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911040978.XA
Other languages
Chinese (zh)
Other versions
CN110929257A (en)
Inventor
侯贺明
王赟
黄华桥
程波
曾伟
谭国权
李明栋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Greenet Information Service Co Ltd
Original Assignee
Wuhan Greenet Information Service Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Greenet Information Service Co Ltd
Priority to CN201911040978.XA
Publication of CN110929257A
Application granted
Publication of CN110929257B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562 Static detection
    • G06F21/563 Static detection by source code analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Virology (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention relates to the technical field of the internet and provides a method and a device for detecting malicious code carried in a webpage. The method comprises: setting the User-Agent field of a first website request message to the User-Agent of a search engine crawler, and then sending the first website request message to the address of the webpage to be detected; setting the User-Agent field of a second website request message to the User-Agent of an ordinary terminal user, and then sending the second website request message to the address of the webpage to be detected; and matching the contents carried in the first response message and the second response message, and, if the difference in the matching result exceeds a preset condition, marking the webpage to be detected as a website potentially carrying malicious code. The invention fetches the website home page once while simulating a search engine crawler and once while simulating a normal user, compares the titles obtained in the two cases, and has a good detection effect on black-hat SEO Trojan injection.

Description

Method and device for detecting malicious codes carried in webpage
[ technical field ]
The invention relates to the technical field of internet, in particular to a method and a device for detecting malicious codes carried in a webpage.
[ background of the invention ]
"Web page Trojan" is a colloquial term for malicious code in the form of web pages. Malicious web page code falls into two types. The first turns an originally normal web page into one with malicious functions by modifying it; the modifications include, but are not limited to, adding or modifying the page title, meta fields, JavaScript code, iframe tags, and so on. The second is the web page backdoor, also called a webshell; a malicious page of this type is not a modification of an existing page but a single file whose entire content is malicious code supplied by the hacker.
For the first type, there are two common ways of exploitation: black-hat SEO (SEO stands for Search Engine Optimization) and browser exploits. Black-hat SEO essentially uses hacking techniques to deceive the search engine, so that website addresses that should not appear show up on the search engine's results page and users can find the related illegal website addresses through the search engine. Black-hat SEO is implemented in many ways; common methods include hidden links (dark links), keyword stacking, spider spoofing, parasite techniques, and so on. However, effective means of identifying and detecting them are very scarce.
In view of the above, overcoming the drawbacks of the prior art is an urgent problem in the art.
[ summary of the invention ]
The technical problem to be solved by the invention is that the prior art lacks an effective means of detecting the spider spoofing and keyword stacking used in black-hat SEO.
The invention further aims to provide a method and a device for detecting malicious codes carried in a webpage.
The invention adopts the following technical scheme:
in a first aspect, the present invention provides a method for detecting malicious codes carried in a web page, including:
setting the User-Agent field in a first website request message to the User-Agent of a search engine crawler, and then sending the first website request message to the address of the webpage to be detected;
receiving a first response message returned by the webpage to be detected, and storing one or more of the webpage title, text content, and HTML (hypertext markup language) tags carried in the response message;
setting the User-Agent field in a second website request message to the User-Agent of an ordinary terminal user, and then sending the second website request message to the address of the webpage to be detected;
receiving a second response message returned by the webpage to be detected, and storing one or more of the webpage title, text content, and HTML (hypertext markup language) tags carried in the response message;
and matching one or more of the webpage title, the text content, and the HTML tags carried in the first response message and the second response message, and, if the difference in the matching result exceeds a preset condition, marking the webpage to be detected as a website potentially carrying malicious code.
Preferably, matching one or more of the webpage title, the text content, and the HTML tags carried in the first response message and the second response message, and marking the webpage to be detected as a website potentially carrying malicious code if the difference in the matching result exceeds a preset condition, specifically includes:
matching the webpage titles carried in the first response message and the second response message, and, if they are not completely identical, marking the webpage to be detected as a website potentially carrying malicious code;
wherein titles that are not completely identical are deemed a difference in the matching result that exceeds the preset condition.
Preferably, after the web page to be detected is marked as a website potentially carrying malicious codes, the method further includes:
matching the differing bytes between the webpage titles carried in the first response message and the second response message against an illegal-website keyword library, and, if the matching succeeds, marking the website potentially carrying malicious code as a website believed to carry malicious code.
Preferably, after the web page to be detected is marked as a website potentially carrying malicious codes, the method further includes:
matching the differing bytes between the webpage titles carried in the first response message and the second response message against the illegal-website keyword library, and, if the matching fails, marking the website potentially carrying malicious code as a normal website;
when the website needs to be checked again for malicious code, directly comparing the differing bytes found in the current round with the differing bytes found in the previous round, and, if they match, still marking the website as a normal website; wherein the previous round is the round in which the website was marked as a normal website.
Preferably, after the web page to be detected is marked as a website potentially carrying malicious codes, the method further includes:
if the website is judged normal by the webpage title matching result for a specified number of consecutive rounds, further matching the text content against the illegal-website keyword library to complete a secondary verification;
if the secondary verification determines that the website is normal, then each time the website is subsequently judged normal by webpage title matching for the specified number of consecutive rounds, performing another round of the secondary verification, as a periodic process.
Preferably, matching one or more of the webpage title, the text content, and the HTML tags carried in the first response message and the second response message, and marking the webpage to be detected as a website potentially carrying malicious code if the difference in the matching result exceeds a preset condition, specifically includes:
matching the HTML tags carried in the first response message and the second response message, and, if the HTML tag pairs are not completely consistent, marking the webpage to be detected as a website potentially carrying malicious code;
wherein HTML tag pairs that are not completely consistent are deemed a difference in the matching result that exceeds the preset condition.
Preferably, the method further comprises: matching the text contents carried in the first response message and the second response message against an illegal-website keyword library, and marking the website as one believed to carry malicious code if the number of successful matches between the text content and the illegal-website keyword library exceeds a first preset threshold.
Preferably, the User-Agent type of the search engine crawler includes one or more of Baidu, Sogou, Google, and 360.
In a second aspect, the present invention further provides an apparatus for detecting malicious code carried in a web page, which is used to implement the method for detecting malicious code carried in a web page in the first aspect, and the apparatus includes:
at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are programmed to perform the method for detecting malicious code carried in a web page according to the first aspect.
In a third aspect, the present invention further provides a non-volatile computer storage medium, where the computer storage medium stores computer-executable instructions, and the computer-executable instructions are executed by one or more processors, so as to complete the method for detecting malicious codes carried in a webpage according to the first aspect.
The invention detects whether a website has had a Trojan planted by actively simulating a search engine crawler fetching the website home page: it fetches the home page content once as a search engine crawler and once as a normal user, and compares the titles obtained in the two cases, which has a good detection effect on black-hat SEO Trojan injection.
The invention also checks whether the HTML tag pairs in an HTML page match, and judges from anomalies in the HTML page structure whether malicious code has been injected; and it matches the text content against the keyword list to judge whether the webpage is being used for black-hat SEO promotion or has been attacked with malicious code.
[ description of the drawings ]
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the embodiments of the present invention will be briefly described below. It is obvious that the drawings described below are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
Fig. 1 is a schematic flowchart of a method for detecting malicious codes carried in a web page according to an embodiment of the present invention;
fig. 2 is a schematic flowchart illustrating an improved method for detecting malicious codes carried in a web page according to an embodiment of the present invention;
fig. 3 is a schematic flowchart of a method for detecting malicious codes carried in a web page based on text content according to an embodiment of the present invention;
fig. 4 is a flowchart illustrating a typical method for detecting malicious codes carried in a web page according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a detection apparatus for detecting malicious codes carried in a web page according to an embodiment of the present invention.
[ detailed description ]
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In the description of the present invention, the terms "inner", "outer", "longitudinal", "lateral", "upper", "lower", "top", "bottom", and the like indicate orientations or positional relationships based on those shown in the drawings, and are for convenience only to describe the present invention without requiring the present invention to be necessarily constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention.
When a hacker has compromised a website and wants to exploit it through black-hat SEO, a commonly used approach is spider spoofing combined with keyword stacking. Spider spoofing is typically done by modifying the source code of the website home page and adding JavaScript code that determines the visitor's identity by checking the visitor's User-Agent: if the visitor is found to be a search engine crawler, such as Baidu, Bing, or Google, the page title and page Meta information set by the hacker are returned; if the visitor is found to be a normal user, the original page of the website is returned with its title and content unmodified. Keyword stacking means that, to increase the weight of a page for a certain keyword, the hacker piles that keyword or related keywords into the page's title, Meta tags, content, and other positions.
The method for detecting web pages with injected Trojans comprises the following:
Access to the home page of the target website is initiated once with the User-Agent of a normal user and once with the User-Agent of a search engine, and the contents of the TITLE tags in the HTML source obtained by the two accesses are compared. If the titles differ, the web page may have had a Trojan planted using the spider spoofing technique of black-hat SEO.
A keyword list (also referred to as the illegal-website keyword library in the embodiments of the invention) is prepared in advance, containing the keywords such promotion commonly targets. The source code of the target website's home page, as fetched with the search engine User-Agent, is searched for these keywords, and if keywords are matched in the home page source code, the web page is judged to have had a Trojan planted. The User-Agents of normal users can be kept in a list covering typical browser User-Agents; the User-Agents of search engines can likewise be kept in a list containing several mainstream search engines, such as Baidu, Google, and Bing; one User-Agent can be chosen at random from the list each time. For the keyword list, to prevent false positives during matching, a threshold can be set as the minimum number of matches, and a match is counted as a hit only when the threshold is reached.
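The User-Agent lists and the minimum-match threshold described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the User-Agent strings, the function names, the default threshold of 3, and the choice to count total keyword occurrences (rather than distinct keywords) are all assumptions.

```python
import random

# Hypothetical lists; the strings are illustrative examples only.
CRAWLER_USER_AGENTS = [
    "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
]
BROWSER_USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0",
]


def pick_user_agent(candidates, rng=random):
    """Choose one User-Agent at random from a list."""
    return rng.choice(candidates)


def count_keyword_hits(source, keywords):
    """Total number of (non-overlapping) keyword occurrences in the page source."""
    return sum(source.count(keyword) for keyword in keywords)


def is_keyword_hit(source, keywords, min_matches=3):
    """Treat the page as a hit only when the minimum match count is reached."""
    return count_keyword_hits(source, keywords) >= min_matches
```

Raising `min_matches` trades missed detections for fewer false positives; a suitable value would have to be tuned against real data.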
In addition to detecting a changed title as described above, it was found that a hacker often breaks the original HTML format when modifying the source code of an HTML web page (because the hacker has to merge his own content into a normal web page, it is difficult to keep the original format consistent). This takes two forms: first, multiple title tags are introduced into the HTML page; second, new data is introduced before the <html> tag or after the </html> tag. Because an HTML file starts with <html> and ends with </html>, data introduced outside them is likely to be malicious code injected by a hacker.
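A rough sketch of these two structural checks is shown below. The function name and the regular expressions are assumptions made for illustration; in particular, tolerating a DOCTYPE declaration, comments, and a byte-order mark before the opening tag is a design choice not stated in the text, and the simple title pattern would also count title elements nested inside inline SVG.

```python
import re

TITLE_OPEN = re.compile(r"<title\b", re.IGNORECASE)
HTML_OPEN = re.compile(r"<html\b[^>]*>", re.IGNORECASE)
HTML_CLOSE = re.compile(r"</html\s*>", re.IGNORECASE)
DOCTYPE = re.compile(r"<!doctype[^>]*>", re.IGNORECASE)
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def structure_anomalies(source):
    """Return a list of structural anomalies found in an HTML source string."""
    anomalies = []
    # Form 1: more than one opening title tag.
    if len(TITLE_OPEN.findall(source)) > 1:
        anomalies.append("multiple title tags")
    # Form 2a: data before the opening html tag (DOCTYPE and comments are allowed).
    opening = HTML_OPEN.search(source)
    if opening:
        prefix = COMMENT.sub("", DOCTYPE.sub("", source[:opening.start()]))
        if prefix.lstrip("\ufeff").strip():
            anomalies.append("data before <html>")
    # Form 2b: data after the last closing html tag (comments are allowed).
    closings = list(HTML_CLOSE.finditer(source))
    if closings:
        suffix = COMMENT.sub("", source[closings[-1].end():])
        if suffix.strip():
            anomalies.append("data after </html>")
    return anomalies
```

An empty list means neither form was observed; it does not prove the page is clean.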
The detection methods described here apply to the first type of web page Trojan, namely black-hat SEO using spider spoofing and keyword stacking. In other words, by analyzing a large quantity of malicious web page code of this type, we found some behaviors and features they have in common, and the detection method in this document is designed specifically around those behaviors and features.
In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
Example 1:
embodiment 1 of the present invention provides a method for detecting a malicious code carried in a web page, as shown in fig. 1, including:
in step 201, the User-Agent field in the first website request message is set to the User-Agent of a search engine crawler, and the first website request message is then sent to the address of the web page to be detected.
Wherein the User-Agent type of the search engine crawler comprises one or more of Baidu, Sogou, Google, and 360.
In step 202, a first response message returned by the web page to be detected is received, and one or more items of a web page title, text content and an HTML tag carried in the response message are stored.
In step 203, the User-Agent field in the second website request message is set to the User-Agent of an ordinary terminal user, and the second website request message is then sent to the address of the web page to be detected.
In step 204, a second response message returned by the web page to be detected is received, and one or more items of the web page title, the text content and the HTML tag carried in the response message are stored.
In step 205, one or more of the web title, the text content, and the HTML tag carried in the first response message and the second response message are matched, and if the difference of the matching result is greater than a preset condition, the web to be detected is marked as a website potentially carrying malicious codes.
The embodiment of the invention detects whether a website has had a Trojan planted by actively simulating a search engine crawler fetching the website home page: it fetches the home page content once as a search engine crawler and once as a normal user, and compares the titles obtained in the two cases, which has a good detection effect on black-hat SEO Trojan injection.
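Steps 201 to 205, restricted to the title comparison, might look like the sketch below. It uses only the Python standard library; the two User-Agent constants, the function names, the fixed UTF-8 decoding, and the injectable `fetch_page` parameter (added so the logic can be exercised without network access) are assumptions, not details from the patent.

```python
import re
import urllib.request

# Illustrative User-Agent values; any crawler/browser pair could be substituted.
CRAWLER_UA = "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
BROWSER_UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36")

TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title\s*>", re.IGNORECASE | re.DOTALL)


def fetch(url, user_agent, timeout=10):
    """Send one request with the given User-Agent header and return the body as text."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def first_title(source):
    """Return the text of the first title element, or an empty string."""
    match = TITLE_RE.search(source)
    return match.group(1).strip() if match else ""


def check_page(url, fetch_page=fetch):
    """Fetch the page as a crawler and as a browser, then compare the titles."""
    crawler_view = fetch_page(url, CRAWLER_UA)   # steps 201 and 202
    browser_view = fetch_page(url, BROWSER_UA)   # steps 203 and 204
    title_1 = first_title(crawler_view)
    title_2 = first_title(browser_view)
    return {                                     # step 205, title comparison only
        "crawler_title": title_1,
        "browser_title": title_2,
        "potentially_malicious": title_1 != title_2,
    }
```

Calling `check_page("http://www.example.com/")` performs two real HTTP requests; passing a stub as `fetch_page` keeps the comparison logic testable offline.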
In the embodiment of the present invention, the step of matching one or more of the web page title, the text content, and the HTML tags carried in the first response message and the second response message, and marking the web page to be detected as a website potentially carrying malicious code if the difference in the matching result exceeds a preset condition, can be implemented in at least the following preferred modes:
the first method is as follows:
matching the webpage titles carried in the first response message and the second response message, and, if they are not completely identical, marking the webpage to be detected as a website potentially carrying malicious code;
wherein titles that are not completely identical are deemed a difference in the matching result that exceeds the preset condition.
In a specific implementation, after the web page to be detected is marked as a website potentially carrying malicious code, the method usually further includes:
and matching the keyword libraries of the illegal websites according to the difference bytes in the webpage titles carried in the matched first response message and second response message, and marking the websites potentially carrying malicious codes as websites believed to carry the malicious codes if the matching is successful.
In practice, the web page addresses in the list maintained by the server need periodic security verification, that is, the method of steps 201 to 205 of the embodiment of the present invention is executed repeatedly to verify whether each address carries malicious code. For the first mode, once a website has been verified as normal in one round, the next round or rounds can use the following preferred scheme to improve detection efficiency. The scheme is particularly suited to the case where the web page titles obtained in step 202 and step 204 differ and the web page to be detected has been marked as a website potentially carrying malicious code. As shown in fig. 2, the method further includes:
in step 301, matching is performed on the keyword libraries of the illegitimate web sites according to the difference bytes in the web page titles carried in the matched first response message and second response message, and if the matching is unsuccessful, the websites potentially carrying malicious codes are marked as normal websites.
In step 302, when the website needs to be detected again to detect whether the malicious code is carried, the differential bytes matched in the current round are directly matched with the differential bytes matched in the previous round, and if the differential bytes are matched, the website is still marked as a normal website; and the previous round is the round marked as the normal website.
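One possible reading of steps 301 and 302 is sketched below. The text does not specify how the differing bytes are computed, so the use of `difflib.SequenceMatcher` is an assumption, as are the function names and the status strings. A known limitation of this approach is that a keyword partly shared by both titles can be split across the equal and differing segments and so go unmatched.

```python
import difflib


def title_difference(title_1, title_2):
    """Return the differing segments as (crawler_side, browser_side) pairs."""
    matcher = difflib.SequenceMatcher(None, title_1, title_2, autojunk=False)
    return tuple(
        (title_1[i1:i2], title_2[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    )


def classify(title_1, title_2, keyword_library, previous_normal_diff=None):
    """Return (status, diff); pass the diff of the last 'normal' round to reuse it."""
    diff = title_difference(title_1, title_2)
    if not diff:
        return "normal", diff
    # Step 302: the same difference was already judged normal in the previous round.
    if previous_normal_diff is not None and diff == previous_normal_diff:
        return "normal", diff
    differing_text = " ".join(a + " " + b for a, b in diff)
    if any(keyword in differing_text for keyword in keyword_library):
        return "malicious", diff
    # Step 301: the difference matches nothing in the keyword library.
    return "normal", diff
```

The caller would store the returned `diff` whenever the status is "normal" and pass it back as `previous_normal_diff` in the next round, which skips the keyword lookup when the difference has not changed.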
The rationale for the first mode is that the basic purpose of maliciously tampering with the code of a normal web page is to promote the attacker's own website content. Therefore, whether the normal website is modified directly or redirected through a link, the end result is a substantial difference in the web page title, and the corresponding differing bytes are very likely to be recorded in the illegal-website keyword library. By applying the detection directly to the web page title, the method of the embodiment of the invention consumes few resources for analysis, is efficient, and suits cases where a large list contains many websites to be checked.
However, testing showed that some websites, although normal, return different web page titles to a User-Agent of the search engine crawler type than to the User-Agent of an ordinary terminal user, in order to protect the privileges of their members or directly registered users. In that case the web page titles carried in the first response message and the second response message are inconsistent, so the web page to be detected is first marked as a website potentially carrying malicious code, and the corresponding differing bytes are then matched against the illegal-website keyword library for identification.
Once a website has been marked as normal by web page title matching several times, one full match of the text content carried in the web page, which consumes more resources, is performed to achieve comprehensive verification. This ensures the reliability of the final verification result and implements a tiered verification process. An embodiment of the present invention therefore provides a further preferred scheme: after the web page to be detected is marked as a website potentially carrying malicious code, as shown in fig. 3, the method further includes:
in step 401, if the website is judged normal by the web page title matching result for a specified number of consecutive rounds, the text content is further matched against the illegal-website keyword library to complete a secondary verification.
In step 402, if the secondary verification determines that the website is normal, then each time the website is subsequently judged normal by title matching for the specified number of consecutive rounds, another round of secondary verification is performed, forming a periodic process.
In the preferred embodiment, in addition to the specified number of rounds described in steps 401 and 402, the load on the server's computing resources may be taken into account; that is, a website awaiting secondary verification is preferably picked up for secondary verification when the server's computing resource usage is low.
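The periodic trigger in steps 401 and 402, together with the load consideration, could be tracked with a small counter such as the one below. The class name, the default of five rounds, and the `server_busy` flag are hypothetical; how server load is measured is left open by the text.

```python
class SecondaryVerificationScheduler:
    """Track consecutive 'normal' title rounds for one website."""

    def __init__(self, rounds_required=5):
        self.rounds_required = rounds_required
        self.consecutive_normal = 0

    def record_round(self, title_check_normal, server_busy=False):
        """Record one title-matching round; return True when a secondary check is due."""
        if not title_check_normal:
            # Any abnormal round restarts the count.
            self.consecutive_normal = 0
            return False
        self.consecutive_normal += 1
        if self.consecutive_normal >= self.rounds_required and not server_busy:
            # Start a new period after each secondary verification.
            self.consecutive_normal = 0
            return True
        # Either not enough rounds yet, or the check is deferred until load drops.
        return False
```

When `record_round` returns True, the caller would run the more expensive text-content match against the keyword library.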
The second method comprises the following steps:
matching the HTML tags carried in the first response message and the second response message, and, if the HTML tag pairs are not completely consistent, marking the webpage to be detected as a website potentially carrying malicious code;
wherein HTML tag pairs that are not completely consistent are deemed a difference in the matching result that exceeds the preset condition.
The matching condition of HTML tag pairs in an HTML page is checked, and whether malicious codes are injected or not is judged through the abnormity of the HTML page structure; and matching the text content through the keyword list to judge whether the webpage is promoted by a black-hat SEO or is attacked by malicious codes.
As in the first mode, the platform to which the web page to be detected belongs may respond differently to ordinary users and to search engine crawlers. In that case no conclusion can be drawn directly from the second mode, which is why the website is only marked as potentially carrying malicious code.
In this case, in order to complete the rigorous verification process, the following is also included: and matching the text contents carried in the first response message and the second response message by using an illegal website keyword library, and marking as a website which is believed to carry malicious codes if the matching result is that the number of successfully matched text contents and the illegal website keyword library exceeds a first preset threshold value.
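The second mode and its follow-up text check can be sketched as follows. The text does not define exactly how HTML tag pairs are compared; this sketch assumes that comparing the counts of opening and closing tags in the two responses is an acceptable approximation. The names, the status strings, and the choice to count distinct matched keywords against the first threshold are also assumptions.

```python
from collections import Counter
from html.parser import HTMLParser


class TagProfile(HTMLParser):
    """Collect opening/closing tag counts and the text content of a page."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tags = Counter()
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags["<%s>" % tag] += 1

    def handle_endtag(self, tag):
        self.tags["</%s>" % tag] += 1

    def handle_data(self, data):
        self.text.append(data)


def profile(source):
    parser = TagProfile()
    parser.feed(source)
    parser.close()
    return parser.tags, " ".join(parser.text)


def second_mode(page_1, page_2, keyword_library, first_threshold=2):
    """Compare tag profiles, then confirm with a keyword match on the text."""
    tags_1, text_1 = profile(page_1)
    tags_2, text_2 = profile(page_2)
    if tags_1 == tags_2:
        return "normal"
    matched = {k for k in keyword_library if k in text_1 or k in text_2}
    if len(matched) > first_threshold:
        return "malicious"
    return "potentially malicious"
```

A count-based profile ignores tag order and nesting, so it is cheaper but coarser than a full tree comparison.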
It should be noted that, in a specific implementation, the first mode and the second mode may also be used in combination, and the accuracy of the combination may be established through statistics. That is, text content matching serves as the fallback analysis in the embodiments of the present invention; if a website is judged to carry malicious code whenever both the web page title match and the HTML tag pair match show differences, and the accuracy of that judgment reaches 99%, this can serve as an optional third mode of the present invention, in which matching the text content against the illegal-website keyword library is skipped. Its feasibility requires further experimental verification, but it is a potentially viable implementation and should also fall within the scope of the embodiments of the present invention.
Example 2:
This embodiment explains, based on the User-Agent of the Baidu search engine crawler, how the home page of the website is fetched, its content stored as Page_1, and the web page title obtained in steps 201 to 204 of embodiment 1 of the present invention. A mainstream search engine such as Baidu is selected, and the search engine crawler User-Agent is set to the User-Agent of the Baidu crawler, namely:
Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html). As shown in fig. 4, specifically:
in step 501, the User-Agent of the Baidu search engine crawler is used to fetch the home page of a website, and the page content is stored as Page_1. When crawling a website (i.e., sending the first website request message described in embodiment 1), only the home page content needs to be fetched, not all pages. Taking www.example.com as an example: if a domain name is provided rather than a URL with a protocol field, we first determine whether the protocol of the target domain name is http or https by probing ports 80 and 443 of the IP address corresponding to the domain name; if the target provides both http and https services, the https service should be selected. If a complete URL is provided, such as http://www.example.com/, we can access it directly to obtain the content of the home page file. After obtaining the home page, we need to decode its content properly; decoding here means converting the byte data into string data in preparation for subsequent searching.
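The protocol selection and decoding described in step 501 might be sketched as below. The function names, the timeout, the injectable `probe` parameter, and the fallback encoding order (declared charset, then UTF-8, then GB18030) are assumptions; the text only says that ports 80 and 443 are probed and that the bytes must be decoded properly.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def home_page_url(target, probe=port_open):
    """Build the home page URL, preferring https when both services respond."""
    if target.startswith(("http://", "https://")):
        return target  # a complete URL is used as given
    if probe(target, 443):
        return "https://%s/" % target
    if probe(target, 80):
        return "http://%s/" % target
    raise ValueError("no web service found on ports 80 or 443")


def decode_page(raw, declared_charset=None):
    """Decode response bytes to a string, trying a few encodings in order."""
    for encoding in (declared_charset, "utf-8", "gb18030"):
        if not encoding:
            continue
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue
    # Last resort: keep going with replacement characters.
    return raw.decode("utf-8", errors="replace")
```

An open port 443 does not guarantee that the site serves valid HTTPS, so a real crawler would likely fall back to http when the TLS request itself fails.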
In step 502, the User-agent of a mainstream browser is used to crawl the home page of the website, and the page content is stored as Page_2. Specifically, a mainstream browser such as Chrome, IE, Firefox, or Safari can be selected. As in step 501, only the home page content needs to be crawled, not all pages. After the home page is obtained, its content must be decoded correctly, i.e., the byte data is converted into character string data in preparation for subsequent retrieval.
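As a non-limiting illustration, steps 501 and 502 differ only in the User-Agent header of the request, which might be sketched with the Python standard library as follows. The browser User-agent string, the 10-second timeout, and the function names are assumptions chosen for this sketch; any mainstream browser string could be substituted.

```python
import urllib.request

# User-agent of the Baidu crawler, as given in this embodiment.
BAIDU_SPIDER_UA = (
    "Mozilla/5.0 (compatible; Baiduspider/2.0; "
    "+http://www.baidu.com/search/spider.html)"
)
# Illustrative desktop Chrome string (an assumption, not from the embodiment).
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36"
)


def build_request(url, user_agent):
    """Build a GET request for the home page with the given User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_home_page(url, user_agent, timeout=10.0):
    """Fetch only the home page and decode it into a string."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        raw = response.read()
        charset = response.headers.get_content_charset() or "utf-8"
    try:
        return raw.decode(charset, errors="replace")
    except LookupError:  # server declared an unknown charset
        return raw.decode("utf-8", errors="replace")
```

Under these assumptions, Page_1 would be obtained with `fetch_home_page(url, BAIDU_SPIDER_UA)` and Page_2 with `fetch_home_page(url, BROWSER_UA)`, using the same `url` for both requests.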
In step 503, a title is extracted from the page content. Titles, denoted Title_1 and Title_2, are extracted from Page_1 and Page_2 obtained in steps 501 and 502, respectively. Specifically, the title here refers to the data between the <title> and </title> tags in the HTML. In a few cases, multiple title tags appear in an HTML page, which itself indicates that the structure of the page is damaged; in that case, the first title tag is taken as the title of the page.
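As a non-limiting illustration, the title extraction of step 503 might be sketched as follows. A regular expression is used here because it tolerates structurally damaged pages; the function names are hypothetical, and stripping surrounding whitespace before comparison is an assumption of this sketch.

```python
import re

# Matches the first <title>...</title> pair, case-insensitively and across lines.
_TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title\s*>", re.IGNORECASE | re.DOTALL)


def extract_title(page):
    """Return the text of the first title tag, or None if there is none."""
    match = _TITLE_RE.search(page)
    if match is None:
        return None
    return match.group(1).strip()


def titles_differ(page_1, page_2):
    """Compare Title_1 and Title_2 extracted from Page_1 and Page_2."""
    return extract_title(page_1) != extract_title(page_2)
```

Because `search` returns the first match only, a damaged page with several title tags yields the first one, as the step requires.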
Example 3:
fig. 5 is a schematic structural diagram of a detection apparatus for detecting malicious codes carried in a web page according to an embodiment of the present invention. The detection apparatus for detecting malicious codes carried in a web page of the present embodiment includes one or more processors 21 and a memory 22. In fig. 5, one processor 21 is taken as an example.
The processor 21 and the memory 22 may be connected by a bus or other means, and fig. 5 illustrates the connection by a bus as an example.
The memory 22, which is a non-volatile computer-readable storage medium, can be used to store non-volatile software programs and non-volatile computer-executable programs, such as the method for detecting malicious code carried in a web page in embodiment 1. The processor 21 executes the detection method for malicious code carried in the web page by executing the non-volatile software program and instructions stored in the memory 22.
The memory 22 may include high speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the memory 22 may optionally include memory located remotely from the processor 21, and these remote memories may be connected to the processor 21 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The program instructions/modules are stored in the memory 22, and when executed by the one or more processors 21, perform the method for detecting malicious code carried in a web page in the foregoing embodiment 1, for example, perform the above-described steps shown in fig. 1 to 4.
Those of ordinary skill in the art will appreciate that all or part of the steps of the various methods of the embodiments may be implemented by associated hardware as instructed by a program, which may be stored on a computer-readable storage medium; the storage medium may include Read-Only Memory (ROM), Random Access Memory (RAM), magnetic disks, optical disks, and the like.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (8)

1. A method for detecting malicious codes carried in a webpage is characterized by comprising the following steps:
assigning a User_agent field in the first website request message by using a User-agent of a search engine crawler, and then sending the first website request message to the address of the webpage to be detected;
receiving a first response message returned by the webpage to be detected, and storing one or more of a webpage title, text content and an HTML (hypertext markup language) tag carried in the response message;
assigning a User_agent field in the second website request message by using a User-agent of a common terminal user, and then sending the second website request message to the address of the webpage to be detected;
receiving a second response message returned by the webpage to be detected, and storing one or more of a webpage title, text content and an HTML (hypertext markup language) tag carried in the response message;
matching one or more of the webpage title, the text content and the HTML label carried in the first response message and the second response message, and if the difference of the matching result is greater than a preset condition, calibrating the webpage to be detected as a website potentially carrying malicious codes;
after the webpage to be detected is marked as a website potentially carrying malicious codes, the method further comprises the following steps:
matching the keyword libraries of the illegal websites according to the difference bytes in the webpage titles carried in the matched first response message and second response message, and marking the websites potentially carrying malicious codes as normal websites if the matching is unsuccessful;
when the website is required to be detected again to determine whether the website carries malicious codes or not, directly matching the differential bytes matched in the current round with the differential bytes matched in the previous round, and if the differential bytes are matched, still marking the website as a normal website; and the previous round is the round marked as the normal website.
2. The method according to claim 1, wherein the matching of the first response message and the second response message includes one or more of a web title, a text content, and an HTML tag, and if a difference of a matching result is greater than a preset condition, the method marks the web page to be detected as a website potentially carrying malicious codes, and specifically includes:
matching the webpage titles carried in the first response message and the second response message, and if the matching results are not completely the same, calibrating the webpage to be detected as a website potentially carrying malicious codes;
and if the matching results are not completely the same, determining that the difference of the matching results is greater than a preset condition.
3. The method for detecting malicious codes carried in web pages according to claim 2, wherein after the web page to be detected is marked as a website potentially carrying the malicious codes, the method further comprises:
and matching the keyword libraries of the illegal websites according to the difference bytes in the webpage titles carried in the matched first response message and second response message, and marking the websites potentially carrying malicious codes as websites believed to carry the malicious codes if the matching is successful.
4. The method for detecting malicious codes carried in web pages according to claim 1, wherein after the web page to be detected is marked as a website potentially carrying the malicious codes, the method further comprises:
if the website is determined to be a normal website for a specified number of consecutive rounds according to the webpage title matching result, further performing the process of matching the text content against the illegal website keyword library to complete a secondary verification;
if the secondary verification determines that the website is a normal website, thereafter performing one round of the secondary verification periodically, each time the website is again determined to be a normal website for the specified number of consecutive rounds according to the webpage title matching result.
5. The method according to claim 1, wherein the matching of the first response message and the second response message includes one or more of a web title, a text content, and an HTML tag, and if a difference of a matching result is greater than a preset condition, the method marks the web page to be detected as a website potentially carrying malicious codes, and specifically includes:
matching HTML tags carried in the first response message and the second response message, and if the matching result is that the HTML tag pairs are not completely consistent, calibrating the webpage to be detected as a website potentially carrying malicious codes;
and if the matching result is that the HTML label pair is not completely consistent, judging that the difference of the matching result is greater than a preset condition.
6. The method for detecting malicious codes carried in a webpage according to claim 5, wherein the method further comprises: and matching the text contents carried in the first response message and the second response message by using an illegal website keyword library, and marking as a website which is believed to carry malicious codes if the matching result is that the number of successfully matched text contents and the illegal website keyword library exceeds a first preset threshold value.
7. The method for detecting malicious codes carried in a webpage according to any one of claims 1-6, wherein the User-agent type of the search engine crawler includes one or more of Baidu, Sogou, Google, and 360.
8. An apparatus for detecting malicious code carried in a web page, the apparatus comprising:
at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being programmed to perform a method for detecting malicious code carried in a web page as claimed in any one of claims 1 to 7.
CN201911040978.XA 2019-10-30 2019-10-30 Method and device for detecting malicious codes carried in webpage Active CN110929257B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911040978.XA CN110929257B (en) 2019-10-30 2019-10-30 Method and device for detecting malicious codes carried in webpage

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911040978.XA CN110929257B (en) 2019-10-30 2019-10-30 Method and device for detecting malicious codes carried in webpage

Publications (2)

Publication Number Publication Date
CN110929257A CN110929257A (en) 2020-03-27
CN110929257B true CN110929257B (en) 2022-02-01

Family

ID=69850004

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911040978.XA Active CN110929257B (en) 2019-10-30 2019-10-30 Method and device for detecting malicious codes carried in webpage

Country Status (1)

Country Link
CN (1) CN110929257B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111585975B (en) * 2020-04-17 2023-03-14 上海中通吉网络技术有限公司 Security vulnerability detection method, device and system and switch
CN112115266A (en) * 2020-09-25 2020-12-22 奇安信科技集团股份有限公司 Malicious website classification method and device, computer equipment and readable storage medium
CN113810400A (en) * 2021-09-13 2021-12-17 北京百度网讯科技有限公司 Website parasite detection method, device, equipment and medium

Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101089856A (en) * 2007-07-20 2007-12-19 李沫南 Method for abstracting network data and web reptile system
CN102016840A (en) * 2008-04-24 2011-04-13 摩维迪欧控股有限公司 System and method for tracking usage
CN103164446A (en) * 2011-12-14 2013-06-19 阿里巴巴集团控股有限公司 Webpage request information response method and webpage request information response device
CN102436564A (en) * 2011-12-30 2012-05-02 奇智软件(北京)有限公司 Method and device for identifying falsified webpage
CN103065089A (en) * 2012-12-11 2013-04-24 深信服网络科技(深圳)有限公司 Method and device for detecting webpage Trojan horses
CN105184159A (en) * 2015-08-27 2015-12-23 深圳市深信服电子科技有限公司 Web page falsification identification method and apparatus
CN105912693A (en) * 2016-04-22 2016-08-31 北京搜狗科技发展有限公司 Network request processing method and apparatus, network data acquisition method, and server
CN106844522A (en) * 2016-12-29 2017-06-13 北京市天元网络技术股份有限公司 A kind of network data crawling method and device
CN107273416A (en) * 2017-05-05 2017-10-20 深信服科技股份有限公司 The dark chain detection method of webpage, device and computer-readable recording medium
CN107784107A (en) * 2017-10-31 2018-03-09 杭州安恒信息技术有限公司 Dark chain detection method and device based on flight behavior analysis
CN108664559A (en) * 2018-03-30 2018-10-16 中山大学 A kind of automatic crawling method of website and webpage source code
CN109067716A (en) * 2018-07-18 2018-12-21 杭州安恒信息技术股份有限公司 A kind of method and system identifying dark chain
CN108984801A (en) * 2018-08-22 2018-12-11 百卓网络科技有限公司 A kind of search engine optimization method identifying asynchronous loading content based on html tag
CN109635203A (en) * 2018-12-19 2019-04-16 北京达佳互联信息技术有限公司 Webpage capture request processing method, device, server and storage medium
CN110263283A (en) * 2019-06-19 2019-09-20 郑州悉知信息科技股份有限公司 Website detection method and device

Also Published As

Publication number Publication date
CN110929257A (en) 2020-03-27

Similar Documents

Publication Publication Date Title
CN110929257B (en) Method and device for detecting malicious codes carried in webpage
CN104125209B (en) Malice website prompt method and router
CN101964025B (en) XSS detection method and equipment
CN102333122B (en) Downloaded resource provision method, device and system
CN105184159A (en) Web page falsification identification method and apparatus
CN101895516B (en) Method and device for positioning cross-site scripting attack source
WO2019109528A1 (en) Vulnerability detection method and apparatus, computer device and storage medium
CN103279710B (en) Method and system for detecting malicious codes of Internet information system
CN105760379B (en) Method and device for detecting webshell page based on intra-domain page association relation
CN107786537B (en) Isolated page implantation attack detection method based on Internet cross search
CN107992738B (en) Account login abnormity detection method and device and electronic equipment
CN108900554B (en) HTTP asset detection method, system, device and computer medium
CN101350822A (en) Method for discovering and tracing Internet malevolence code
WO2011116696A1 (en) Method and system for providing network resources
CN105376217B (en) A kind of malice jumps and the automatic judging method of malice nested class objectionable website
CN104881603A (en) Method and apparatus for detecting webpage redirection vulnerabilities
CN104168293A (en) Method and system for recognizing suspicious phishing web page in combination with local content rule base
US20200336498A1 (en) Method and apparatus for detecting hidden link in website
CN104679747B (en) Detection device and method for website redirection
CN113518077A (en) Malicious web crawler detection method, device, equipment and storage medium
CN107784107B (en) Dark chain detection method and device based on escape behavior analysis
CN110309667B (en) Website hidden link detection method and device
KR20180074774A (en) How to identify malicious websites, devices and computer storage media
CN104158828A (en) Method and system for identifying doubtful phishing webpage on basis of cloud content rule base
WO2020082763A1 (en) Decision trees-based method and apparatus for detecting phishing website, and computer device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: A method and device for detecting malicious code in web pages

Effective date of registration: 20220608

Granted publication date: 20220201

Pledgee: Hengfeng Bank Co.,Ltd. Wuhan Branch

Pledgor: WUHAN GREENET INFORMATION SERVICE Co.,Ltd.

Registration number: Y2022420000150

PC01 Cancellation of the registration of the contract for pledge of patent right

Date of cancellation: 20221121

Granted publication date: 20220201

Pledgee: Hengfeng Bank Co.,Ltd. Wuhan Branch

Pledgor: WUHAN GREENET INFORMATION SERVICE Co.,Ltd.

Registration number: Y2022420000150
