CN107786537B

CN107786537B - Isolated page implantation attack detection method based on Internet cross search

Info

Publication number: CN107786537B
Application number: CN201710845948.0A
Authority: CN
Inventors: 王方军; 范渊; 黄进
Original assignee: DBAPPSecurity Co Ltd
Current assignee: DBAPPSecurity Co Ltd
Priority date: 2017-09-19
Filing date: 2017-09-19
Publication date: 2020-04-07
Anticipated expiration: 2037-09-19
Also published as: CN107786537A

Abstract

The invention relates to information security technology and aims to provide an isolated page implantation attack detection method based on Internet cross search. The method comprises the following steps: compiling websites on the Internet into a website library, and performing hidden link and keyword retrieval on the home page of each website; analyzing, one by one, the risk links in the risk link modules of websites with a higher degree of suspicion; analyzing the source page and the pointed-to page of each illegal link in combination to further confirm the possibility of illegal tampering or implantation; and finding and confirming the isolated page in the WEB system where the page pointed to by the illegal link is located. In the invention, the probabilities assigned to illegal modules, illegal links and illegal content are adjustable and learnable; because the judgment draws on several aspects as a whole, it is more accurate and credible than a judgment based on a single piece of content.

Description

Isolated page implantation attack detection method based on Internet cross search

Technical Field

The invention relates to the technical field of information security, in particular to an isolated page implantation attack detection method based on internet cross search.

Background

Since the world's first WEB site was born in the early 1990s, WEB sites have continued to develop along with innovations in Internet technology. The most important change is the shift from the WEB 1.0 era, in which website information providers supplied content unidirectionally (static websites), to the widespread use of dynamic websites. With the boom of BBS forums and the arrival of the WEB 2.0 era, web page interactivity has increased along with the importance of users, and web page technologies, database technologies and WEB container technologies have developed rapidly. However, technology is a double-edged sword: while it improves the user experience, user input is an uncontrollable factor, and various injection attacks directly reduce the security of WEB sites. Some hackers even obtain system permissions directly through the web front end in order to alter and damage the back end, thereby achieving illegal access. These behaviors manifest in forms visible to ordinary users, namely tampering, Trojan implantation ("horse hanging"), implanted hidden links (dark chains), isolated pages, and so on.

Because of the particular nature of the isolated page implantation attack, no link within the site points to the implanted page, so the attack is difficult to discover directly from the site itself. The methods currently available are as follows: one is to log in to the operating system on which the WEB site is installed, scan the local file directory directly, and find the isolated page by means of sensitive text analysis, page addition history, operations recorded in system logs, and the like; the other is to search manually through an Internet content search engine by entering search instructions combined with specific keywords.

However, both of the above methods have their limitations:

The first method has the following disadvantages: 1. each system must be logged in to one by one, so large-batch checking is impossible; 2. the system cannot be entered without a user name and password, unless a local client has been installed inside it in advance; on the one hand this increases the client installation workload, and on the other hand, because an application is installed locally, an impact on the system cannot be ruled out; 3. the page addition history, system operation logs and the like may have been erased, in which case the search fails entirely.

The second method also has several drawbacks: 1. although search engines do present implanted isolated links, that information is buried in a large amount of information unrelated to network security and is not extracted in an automated way; 2. search engines cache too much content without being targeted, and their long update cycles reduce timeliness. In addition, the two methods share a common defect: when searching by sensitive text, they place high demands on the keyword hit rate, and the effectiveness of the keywords cannot be guaranteed. Both detection methods can nevertheless serve as corroborating means for the present invention.

Disclosure of Invention

The main purpose of the invention is to overcome the defects of the prior art and to provide a method capable of discovering isolated page implantation attacks. To solve this technical problem, the solution of the invention is as follows:

the isolated page implantation attack detection method based on Internet cross search specifically comprises the following steps:

(1) compiling websites on the Internet into a website library, and performing hidden link and keyword retrieval on the home page of each website; this specifically comprises the following substeps:

step A: collecting and classifying domain-name websites and IP-plus-port websites on the Internet, consolidating them into a website library (which is continuously expanded and improved), and storing in the website library the URL (uniform resource locator) used as each website's entrance;

step B: preliminarily analyzing the websites in the website library, and preprocessing two situations, namely URLs that differ but are actually the same website, and websites that are inaccessible, so as to obtain the URLs of websites whose home pages can be accessed. URLs that differ but are actually the same website arise from jumps (redirects) and from differing formats. For a jump: if URL1 jumps, the jumped-to URL2 is obtained and URL1 is associated under URL2; when there are several jumps, the final URL is taken as the standard, for example, if URL3 is obtained after two jumps, URL1 and URL2 are both associated with URL3 and URL3 is marked as the one to be analyzed. For differing formats: only the URL that is a domain name and has the shortest length is marked to be analyzed; for example, if URL4 is http://www.aaa.com, URL5 is http://www.aaa.com/index.html and URL6 is http://204.205.206.207:7788, and all three actually point to the same page of the same website, URL4 is marked to be analyzed according to this principle, and the other URLs are associated with URL4 and are not analyzed. For inaccessible websites: an inaccessible URL is first marked as temporarily inaccessible; if it recovers within a period of time (URL7), it is regarded as being in an intermittent state and is marked to be analyzed; if it is still inaccessible after a certain time limit, for example one day (URL8), it is deleted;
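The "domain name, shortest length" rule of step B can be illustrated with a minimal Python sketch. It assumes the aliases of one website have already been grouped together (how that grouping is detected is not specified in the text), and the function names `is_ip_host`, `pick_canonical` and `build_alias_map` are hypothetical. Redirect resolution and accessibility tracking are not covered here.

```python
import ipaddress
from urllib.parse import urlsplit


def is_ip_host(url: str) -> bool:
    """Return True when the URL's host is a literal IP address."""
    host = urlsplit(url).hostname or ""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def pick_canonical(aliases: list[str]) -> str:
    """Prefer domain-name URLs over IP URLs, then the shortest URL."""
    return min(aliases, key=lambda u: (is_ip_host(u), len(u)))


def build_alias_map(aliases: list[str]) -> dict[str, str]:
    """Map every alias to the one URL that is marked to be analyzed."""
    canonical = pick_canonical(aliases)
    return {url: canonical for url in aliases}
```

With the three example URLs from step B, `pick_canonical` selects http://www.aaa.com, and the other two URLs are associated with it through the alias map.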

step C: analyzing the home page content with hidden link structure feature analysis methods, which include: statically analyzing the text of the page code to search for hidden links in specific formats; dynamically rendering and executing js scripts and searching for the textual hidden links they generate; reading the characters in pictures and judging whether they are accessible links; and reading two-dimensional codes and converting them into textual links. If the website contains links that cannot be found with the naked eye, yet links in this form can still be indexed by search engines, the website is at risk of a hidden link (dark chain) attack. Such links are made invisible or inconspicuous to website users, and undiscoverable to website administrators browsing the page, by means such as hiding them in code, placing them beyond the height or width of the screen, giving the text the same color as the background, embedding them in a picture or flash file, or converting the text into a two-dimensional code;
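As an illustration of the static analysis branch only, the following Python sketch flags links that sit inside an element hidden by an inline style. The style patterns in `HIDDEN_STYLE` are assumed heuristics, not rules taken from the patent, and the class and function names are hypothetical. Script rendering, picture text and two-dimensional codes are outside this sketch, and well-formed HTML with closed tags is assumed.

```python
import re
from html.parser import HTMLParser

# Assumed heuristics for "invisible" inline styles; a real rule set would be larger.
HIDDEN_STYLE = re.compile(
    r"display\s*:\s*none|visibility\s*:\s*hidden|(?:left|top|text-indent)\s*:\s*-\d{3,}",
    re.IGNORECASE,
)
VOID_TAGS = {"br", "img", "input", "hr", "meta", "link"}


class HiddenLinkFinder(HTMLParser):
    """Collect href values of <a> tags that are hidden or inside a hidden element."""

    def __init__(self) -> None:
        super().__init__()
        self.stack: list[bool] = []        # one "is hidden" flag per open element
        self.hidden_links: list[str] = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        hidden = bool(HIDDEN_STYLE.search(attr.get("style") or ""))
        inherited = bool(self.stack and self.stack[-1])
        if tag == "a" and (hidden or inherited) and attr.get("href"):
            self.hidden_links.append(attr["href"])
        if tag not in VOID_TAGS:
            self.stack.append(hidden or inherited)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()


def find_hidden_links(html: str) -> list[str]:
    """Return the hrefs of links judged invisible by the inline-style heuristics."""
    finder = HiddenLinkFinder()
    finder.feed(html)
    return finder.hidden_links
```

A page containing `<div style="display:none"><a href="...">` would have that link reported, while a normally visible link on the same page would not.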

analyzing the home page content with a keyword lexicon; if a link text contains at least one keyword, the website is at risk of having sensitive information implanted. The lexicon is a keyword lexicon with a high hit rate, specifically a user-defined keyword set with a fixed upper limit, 500 words by default. Links carrying sensitive information are searched for according to the keywords; when a link hit by a keyword is verified to be a genuine risk, this is fed back to the lexicon; the lexicon is updated periodically, removing words whose hit rate has fallen and adding words with a high hit rate;
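The lexicon maintenance described above might be sketched as follows. The definition of hit rate used here (verified risks divided by matches) is an assumption, since the text does not define it, and the class `KeywordLexicon` and its methods are hypothetical names.

```python
from dataclasses import dataclass, field

MAX_WORDS = 500  # default upper limit stated in the text


@dataclass
class KeywordLexicon:
    # keyword -> [times matched, times verified as a genuine risk]
    stats: dict[str, list[int]] = field(default_factory=dict)

    def add(self, word: str) -> None:
        """Add a keyword unless the lexicon is already at its upper limit."""
        if word not in self.stats and len(self.stats) < MAX_WORDS:
            self.stats[word] = [0, 0]

    def match(self, link_text: str) -> list[str]:
        """Return the keywords contained in a link text and count each match."""
        hits = [w for w in self.stats if w in link_text]
        for w in hits:
            self.stats[w][0] += 1
        return hits

    def confirm(self, word: str) -> None:
        """Feedback: a link hit by this keyword was verified as a genuine risk."""
        self.stats[word][1] += 1

    def hit_rate(self, word: str) -> float:
        matched, confirmed = self.stats[word]
        return confirmed / matched if matched else 0.0

    def prune(self, min_rate: float, min_matches: int = 1) -> list[str]:
        """Periodic update: drop keywords whose hit rate fell below min_rate."""
        dropped = [w for w, (m, _) in self.stats.items()
                   if m >= min_matches and self.hit_rate(w) < min_rate]
        for w in dropped:
            del self.stats[w]
        return dropped
```

A link text that yields a non-empty result from `match` corresponds to the "at least one keyword" condition in step C.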

extracting risk link modules from websites with either type of risk (namely websites at risk of a hidden link attack and websites at risk of having sensitive information implanted); a risk link module is a tag module (usually the content of a <div>, <table>, <td> or <li> tag) containing one or more risk links; a risk link is a link that the program has judged to be at risk because its text was written into a page of website A, and it does not become a risk-free link by being written into a page of website B;
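Extraction of risk link modules could look roughly like the sketch below, which groups links by their nearest enclosing `div`, `table`, `td` or `li` element and keeps every module that contains at least one risk link. Treating the nearest enclosing tag as the module boundary is an assumption, as are the names `RiskModuleFinder` and `risk_link_modules`; well-formed HTML is assumed.

```python
from html.parser import HTMLParser

MODULE_TAGS = {"div", "table", "td", "li"}  # tag types named in the text


class RiskModuleFinder(HTMLParser):
    """Group links by nearest enclosing module tag and flag risky modules."""

    def __init__(self, risk_links: set[str]) -> None:
        super().__init__()
        self.risk_links = risk_links
        self.open_modules: list[int] = []        # ids of currently open module tags
        self.links: dict[int, list[str]] = {}    # module id -> all hrefs inside it
        self.risky: set[int] = set()             # ids of modules holding a risk link
        self.counter = 0

    def handle_starttag(self, tag, attrs):
        if tag in MODULE_TAGS:
            self.counter += 1
            self.open_modules.append(self.counter)
        elif tag == "a" and self.open_modules:
            href = dict(attrs).get("href")
            if href:
                module_id = self.open_modules[-1]
                self.links.setdefault(module_id, []).append(href)
                if href in self.risk_links:
                    self.risky.add(module_id)

    def handle_endtag(self, tag):
        if tag in MODULE_TAGS and self.open_modules:
            self.open_modules.pop()


def risk_link_modules(html: str, risk_links: set[str]) -> list[list[str]]:
    """Return the link list of every module that contains a risk link."""
    finder = RiskModuleFinder(risk_links)
    finder.feed(html)
    return [hrefs for mid, hrefs in finder.links.items() if mid in finder.risky]
```

Returning all links of a risky module, not only the flagged ones, lets the later per-link analysis also examine the neighbouring links of a known risk link.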

(2) for websites with a higher degree of suspicion, namely websites for which at least one risk link module was extracted in step C, analyzing the risk links in the risk link modules one by one, obtaining the content of the pages the links point to, and analyzing that content; this specifically comprises the following substeps:

step D: extracting, one by one, the links in the risk link modules of a website with a higher degree of suspicion, and obtaining the content of the page each link points to;

step E: performing text analysis on the content of the pointed-to page, and judging whether it contains sensitive text (namely keywords with a high hit rate) or illegal domain names (namely domain names whose content has been judged malicious and illegal, which are published on the Internet and stored in the blacklists of security companies);

if sensitive text or an illegal domain name exists, judging the link to the pointed-to page to be an illegal link, extracting the illegal link, and jumping to step G;

if neither sensitive text nor an illegal domain name exists, continuing with step F;

step F: when pictures exist in the pointed-to page, performing similarity matching on them (that is, analyzing the similarity between two pictures with an image similarity algorithm, the method holding precomputed values for a large number of sensitive pictures) and OCR character recognition (that is, converting the character shapes in a picture into computer text by a character recognition method); if the value calculated for a picture in the pointed-to page by the similarity matching algorithm is 1, judging the link to the pointed-to page to be an illegal link, and extracting and storing it; if the text extracted by OCR character recognition contains a keyword, judging the link to the pointed-to page to be an illegal link, and extracting and storing it; in all other cases, judging that the link to the pointed-to page is not an illegal link;

when no picture exists in the pointed-to page, proceeding directly to step G;

step G: executing step D, step E and step F in a loop until all illegal links in the risk link module have been extracted;
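The loop over steps D to G can be outlined as in the sketch below. Page fetching and the picture checks are passed in as callables so the sketch stays self-contained; `fetch_text` and `image_is_illegal` are hypothetical stand-ins for a crawler and for the similarity-matching and OCR components, which are not implemented here. Checking the blacklist against both the link host and the page text is an assumption about how "illegal domain names" are detected.

```python
from typing import Callable, Iterable
from urllib.parse import urlsplit


def classify_links(
    links: Iterable[str],
    fetch_text: Callable[[str], str],
    keywords: set[str],
    blacklist: set[str],
    image_is_illegal: Callable[[str], bool] = lambda url: False,
) -> list[str]:
    """Return the links whose pointed-to page is judged illegal (steps D to G)."""
    illegal = []
    for link in links:                          # step G: loop over every link
        text = fetch_text(link)                 # step D: content of the pointed-to page
        host = urlsplit(link).hostname or ""
        # step E: sensitive text, or a blacklisted domain in the link or the page
        has_keyword = any(k in text for k in keywords)
        has_bad_domain = host in blacklist or any(d in text for d in blacklist)
        if has_keyword or has_bad_domain:
            illegal.append(link)
            continue
        # step F: picture checks (similarity matching / OCR), supplied by the caller
        if image_is_illegal(link):
            illegal.append(link)
    return illegal
```

In this arrangement a link only reaches the picture checks when the cheaper text checks of step E found nothing, which mirrors the order of steps E and F.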

(3) analyzing the source page and the pointed-to page of each illegal link in combination, further confirming the possibility of illegal tampering or implantation, and storing the source page together with the corresponding link; this specifically comprises the following substeps:

step H: performing a combined weighted judgment on the illegal information in the source page and the illegal information in the pointed-to page, and, when the weighted probability reaches a preset threshold, determining that the "source page - pointed-to page" illegal link pair is established;

step I: executing step H repeatedly to obtain all illegal link pairs; recalculating (by an algorithm) the probability value α that the source page has been illegally tampered with or implanted, according to the final number of illegal link pairs; recalculating (by an algorithm) the probability β that illegal information exists in an illegal link pair, according to the specific "source page - pointed-to page" illegal link pair; and, when the probabilities α and β both exceed their respective preset thresholds, determining the association of illegal information created among multiple pages by URL links;

at this point the method has screened a "1 to many" relationship out of the link relationships among the countless pages of the Internet: "1" means that a source page containing illegal links has been found and confirmed, the source page being the home page of some website; "many" means that the multiple pages pointed to by the illegal link pairs have been found and confirmed, each of which may be the home page of a website or the "isolated page" that the method ultimately seeks. An isolated page is a page to which no page of the WEB system (website) publishing it contains a link; it can be accessed only by entering its complete URL (rather than through the website's entrance, namely the website home page), or by clicking a link that a hacker has implanted in pages of other websites; the method picks out isolated pages carrying illegal information from massive numbers of source web pages by the latter route;
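The text does not disclose the algorithms behind steps H and I, so the following Python sketch only illustrates one possible shape. The linear weighting, the default weights and threshold, and the form chosen for α (growing with the number of confirmed pairs) are all assumptions, and every function name is hypothetical. The pair score returned by `pair_probability` could stand in for β, which is not modeled separately here.

```python
def pair_probability(p_source: float, p_target: float,
                     w_source: float = 0.5, w_target: float = 0.5) -> float:
    """Step H (assumed form): weighted combination of the two illegality scores."""
    return w_source * p_source + w_target * p_target


def confirmed_pairs(candidates, threshold: float = 0.6):
    """Keep (source, target) pairs whose weighted score reaches the threshold.

    Each candidate is a tuple (source, target, p_source, p_target).
    """
    return [(s, t) for s, t, ps, pt in candidates
            if pair_probability(ps, pt) >= threshold]


def source_alpha(n_pairs: int, per_pair: float = 0.5) -> float:
    """Step I (assumed form): alpha rises with the number of confirmed pairs."""
    return 1.0 - (1.0 - per_pair) ** n_pairs
```

Under these assumptions a source page with no confirmed pair has α equal to 0, and each additional confirmed pair moves α closer to 1 without ever exceeding it.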

(4) finding and confirming the isolated page in the WEB system (website) where the page pointed to by the illegal link is located; this specifically comprises the following substeps:

step J: finding suspected isolated pages, namely pages pointed to by links that are off-site and contain a path (for example http://www.aaa.com/test.html); an isolated page link can only be an absolute URL;

step K: parsing all web pages of the website where the suspected isolated page is located, extracting all links of the website, converting relative URLs (namely addresses expressed relative to some absolute URL) into absolute URLs, and de-duplicating them;

step L: comparing the de-duplicated URLs one by one with the URL of the suspected isolated page;

if none of them matches, the suspected isolated page is confirmed to be a true isolated page, and the website has been attacked by isolated page implantation;

if one of them matches, the suspected isolated page is an ordinary illegal link (such an illegal link is not an isolated page, but it is a useful by-product of the method: it is most likely a tampered web page, so the probability that the website itself has been intruded is very high);
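Steps K and L can be sketched with the standard library's URL helpers. The sketch assumes the site has already been crawled into a mapping from page URL to the raw `href` values found on that page; the mapping and the names `site_link_set` and `is_isolated_page` are hypothetical, and dropping URL fragments before comparison is an added assumption.

```python
from urllib.parse import urldefrag, urljoin


def site_link_set(pages: dict[str, list[str]]) -> set[str]:
    """Step K: resolve every href against its page URL and de-duplicate."""
    links = set()
    for page_url, hrefs in pages.items():
        for href in hrefs:
            absolute, _fragment = urldefrag(urljoin(page_url, href))
            links.add(absolute)
    return links


def is_isolated_page(suspect_url: str, pages: dict[str, list[str]]) -> bool:
    """Step L: the suspect is isolated when no crawled page links to it."""
    return suspect_url not in site_link_set(pages)
```

A suspect URL that appears in the de-duplicated set is an ordinary illegal link in the sense of step L; one that does not appear is a confirmed isolated page.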

(5) after the illegal link is confirmed to be an isolated page, storing the isolated page link, and jumping to step C for repeated execution so as to find other websites attacked by isolated page implantation (a hacker may implant links to the isolated page in multiple websites that have been intruded and tampered with; if any web page of any website contains the isolated page link, the probability value α in step I that the source page has been illegally tampered with or implanted is 100%).
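The reverse search of step (5) might be sketched as below: a confirmed isolated page link is looked up in the home pages of the website library, and α is set to 100% for every site that carries it. The plain substring match and the in-memory `home_pages` mapping (entry URL to home page source) are simplifying assumptions, and the function names are hypothetical.

```python
def sites_containing(isolated_url: str, home_pages: dict[str, str]) -> list[str]:
    """Return the entry URLs whose home page source contains the isolated URL."""
    return [site for site, html in home_pages.items() if isolated_url in html]


def update_alpha(isolated_url: str, home_pages: dict[str, str],
                 alpha: dict[str, float]) -> dict[str, float]:
    """Return a copy of alpha with 1.0 (100%) for every site carrying the link."""
    updated = dict(alpha)
    for site in sites_containing(isolated_url, home_pages):
        updated[site] = 1.0
    return updated
```

Sites found this way become new source pages, so the analysis from step C onward can be repeated for them.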

The working principle of the invention is as follows: illegal link tag modules in the home pages of all websites in the website library are detected in real time; the judgment that the source page contains illegal links is consolidated by analyzing whether the content of the pages the links point to is illegal; the isolated pages of a website are found on the basis of the cross relationships of the Internet; and the links judged to be isolated pages are used as input conditions to verify that other systems contain those links as a result of intrusion and tampering.

Compared with the prior art, the invention has the beneficial effects that:

The invention does not need to enter the local system and does not depend on retrieving local files; it can carry out analysis and investigation in large batches, has no impact on the system, and does not depend on system accounts.

The invention is based on relationships: it finds B from A, rather than directly finding A from A or B from B. It relies on a mode similar to that of an Internet search engine, but is independent, more targeted and lightweight.

Through the association between illegal link modules and cross search, the invention can further raise or lower the probability that an illegal link pair is judged illegal, thereby identifying isolated page implantation attacks effectively and helping to discover attacked WEB sites quickly and accurately.

In the invention, the probabilities assigned to illegal modules, illegal links and illegal content are adjustable and learnable; because the judgment draws on several aspects as a whole, it is more accurate and credible than a judgment based on a single piece of content.

From an isolated page link, the systems that have been intruded and tampered with can be deduced in reverse, with high accuracy.

Drawings

Fig. 1 is a schematic diagram of the operation of the present invention.

Fig. 2 is a flow chart of the operation of the present invention.

Detailed Description

The invention relates to sensitive information retrieval and identification technology and is an application of computer technology in the field of information security. Its implementation involves the application of a number of software functional modules. The applicant believes that, after reading this application carefully and properly understanding the principles and objectives of the invention, a person skilled in the art can fully implement the invention by combining the prior art with his or her own software programming skills. The aforementioned software functional modules include but are not limited to: website availability judgment, website jump identification, hidden link module acquisition, automatic ranking and pruning of the keyword lexicon, aggregation and classification of the website library, and definition and storage of "illegal link pairs".

The invention is described in further detail below with reference to the accompanying drawings and specific embodiments:

As shown in fig. 1, the isolated page implantation attack detection method based on Internet cross search specifically comprises the following steps:

(1) Detecting illegal link tag modules in the home pages of all websites in the website library in real time: judging whether the website is accessible; analyzing the content of the website's home page; and searching for illegal link tag modules according to their features.

(2) Consolidating the judgment that the source page contains illegal links by analyzing whether the content the links point to is illegal: acquiring the links in the illegal link tag module; analyzing whether each link points to illegal content; if the pointed-to content is confirmed to be illegal, the probability that the module is an illegal link module is further consolidated, and the probability that its other links are illegal links is raised accordingly.

(3) Searching, on the basis of the cross relationships of the Internet, for pages that are illegal isolated pages of their own site, that is, pages whose link does not exist in any page of the web system that can be crawled.

(4) Using the links judged to be isolated pages as input conditions to verify that other systems contain those links as a result of intrusion and tampering: searching whether link text pointing to the isolated page exists in the pages of other websites; for a website containing an isolated page link, the probability that it has been intruded and tampered with is greatly increased.

The following examples are presented to enable those skilled in the art to more fully understand the present invention and are not intended to limit the invention in any way.

As shown in fig. 2, detection of the isolated page implantation attack comprises the following steps:

Step A: first, a website library of a certain scale is compiled; the larger the number of websites, the higher the probability of finding problems; the website entries of the library can be gradually expanded and improved.

Step B: illegal link modules are discovered by hidden link detection and by keyword-based detection of sensitive text.

Step C: the links are analyzed one by one; if illegal information also exists in the pointed-to page, the relationship pair is saved, for example http://www.aaa.com -> http://www.bbb.com/c/d/index.

Step D: http://www.bbb.com/c/d/index.html is verified to be an isolated page on the WEB system http://www.bbb.com, i.e. no link on any page of that system points to it.

Step E: the isolated page link confirmed in step D is used as an input condition to search the home pages of the websites in the website library; every website in which the link is found may be at risk of having been intruded.

Finally, it should be noted that the above is only a specific embodiment of the invention. Obviously, the invention is not limited to the above embodiment, and many variations are possible. All variations that a person skilled in the art can derive directly or conceive from the disclosure of the invention shall be considered within the scope of protection of the invention.

Claims

1. An isolated page implantation attack detection method based on internet cross search is characterized by comprising the following steps:

step A: collecting and classifying domain-name websites and IP-plus-port websites on the Internet, consolidating them into a website library, and storing in the website library the URL (uniform resource locator) used as each website's entrance;

step B: preliminarily analyzing the websites in the website library, preprocessing the cases in which URLs differ but are actually the same website and in which URLs are inaccessible, and obtaining the URLs of websites whose home pages can be accessed;

step C: analyzing the home page content with a hidden link structure feature analysis method; if the website contains links that cannot be found with the naked eye, yet links in this form can still be indexed by search engines, the website is at risk of a hidden link (dark chain) attack;

analyzing the home page content with a keyword lexicon; if a link text contains at least one keyword, the website is at risk of having sensitive information implanted;

extracting risk link modules from websites with either type of risk; a risk link module is a tag module containing one or more risk links;

step E: performing text analysis on the content of the pointed-to page, and judging whether sensitive text or illegal domain names exist;

step F: when pictures exist in the pointed-to page, performing similarity matching and OCR character recognition on them; if the value calculated for a picture in the pointed-to page by the similarity matching algorithm is 1, judging the link to the pointed-to page to be an illegal link, and extracting and storing it; if the text extracted by OCR character recognition contains a keyword, judging the link to the pointed-to page to be an illegal link, and extracting and storing it; in all other cases, judging that the link to the pointed-to page is not an illegal link;

step H: performing a combined weighted judgment on the illegal information in the source page and the illegal information in the pointed-to page, and, when the weighted probability reaches a preset threshold, determining that the illegal link pair of the source page and its pointed-to page, namely the "source page - pointed-to page" illegal link pair, is established;

step I: executing step H repeatedly to obtain all illegal link pairs; recalculating the probability value α that the source page has been illegally tampered with or implanted, according to the final number of illegal link pairs; recalculating the probability β that illegal information exists in an illegal link pair, according to the specific "source page - pointed-to page" illegal link pair; and, when the probabilities α and β both exceed their respective preset thresholds, determining the association of illegal information created among multiple pages by URL links;

(4) finding and confirming the isolated page in the WEB system where the page pointed to by the illegal link is located; this specifically comprises the following substeps:

step J: finding suspected isolated pages, namely pages pointed to by links not contained in the site, wherein an isolated page link can only be an absolute URL;

step K: parsing all web pages of the website where the suspected isolated page is located, extracting all links of the website, converting relative URLs into absolute URLs, and de-duplicating them;

if the two match, the suspected isolated page is an ordinary illegal link;

(5) after the illegal link is confirmed to be an isolated page, storing the isolated page link, and jumping to step C for repeated execution so as to find other websites attacked by isolated page implantation.