CN108494728B - Method, device, equipment and medium for creating blacklist library for preventing traffic hijacking

Info

Publication number: CN108494728B
Application number: CN201810122846.0A
Authority: CN
Inventors: 林泽全
Original assignee: Ping An Puhui Enterprise Management Co Ltd
Current assignee: Ping An Puhui Enterprise Management Co Ltd
Priority date: 2018-02-07
Filing date: 2018-02-07
Publication date: 2021-01-26
Anticipated expiration: 2038-02-07
Also published as: CN108494728A

Abstract

The invention discloses a method, a device, equipment and a medium for creating a blacklist library for preventing traffic hijacking. The method comprises the following steps: acquiring an HTTP access request sent by a client, wherein the HTTP access request comprises a URL to be accessed; acquiring a corresponding original webpage based on the URL to be accessed, wherein the original webpage corresponds to an original DOM tree; scanning the original DOM tree with an anti-hijacking software development kit and judging whether a suspected advertisement URL exists in the original DOM tree; if a suspected advertisement URL exists in the original DOM tree, storing the suspected advertisement URL in a cache library; and determining a blacklist domain name based on the suspected advertisement URLs in the cache library, and storing the blacklist domain name in a blacklist library. The method improves the accuracy with which blacklist domain names are obtained, speeds up the identification of network advertisement resource information in the original webpage corresponding to the URL to be accessed, and makes the identification of network advertisement resource information more comprehensive.

Description

Method, device, equipment and medium for creating blacklist library for preventing traffic hijacking

Technical Field

The invention relates to the field of network security, in particular to a blacklist library creation method, a device, equipment and a medium for preventing traffic hijacking.

Background

When a user requests a webpage, an advertisement operator may insert network advertisement resource information into the webpage resource information of that webpage, so that the client (usually a browser) displays resource information unrelated to the webpage. In this way the advertisement operator hijacks the traffic. The network advertisement resource information is usually a pop-up window, a promotional advertisement, or the directly displayed content of another webpage. At present, most methods for dealing with traffic hijacking by advertisement operators rely on creating a blacklist. However, the blacklist is usually written manually by developers, so whether network advertisement resource information exists in a webpage cannot be identified and judged accurately, and the network advertisement resource information is not identified comprehensively.

Disclosure of Invention

The embodiment of the invention provides a method, a device, equipment and a medium for creating a blacklist library for preventing traffic hijacking, which aim to solve the problem that the current blacklist for preventing traffic hijacking cannot comprehensively identify network advertisement resource information.

In a first aspect, an embodiment of the present invention provides a method for creating a blacklist library to prevent traffic hijacking, including:

acquiring an HTTP access request sent by a client, wherein the HTTP access request comprises a URL to be accessed;

acquiring a corresponding original webpage based on the URL to be accessed, wherein the original webpage corresponds to an original DOM tree;

scanning the original DOM tree by adopting an anti-hijack software development kit, and judging whether suspected advertisement URLs exist in the original DOM tree or not;

if the suspected advertisement URL exists in the original DOM tree, storing the suspected advertisement URL in a cache library;

and determining a blacklist domain name based on the suspected advertisement URL in the cache library, and storing the blacklist domain name in a blacklist library.

In a second aspect, an embodiment of the present invention provides a device for creating a blacklist library for preventing traffic hijacking, including:

the access request acquisition module is used for acquiring an HTTP access request sent by a client, wherein the HTTP access request comprises a URL to be accessed;

the original webpage obtaining module is used for obtaining a corresponding original webpage based on the URL to be accessed, and the original webpage corresponds to an original DOM tree;

a suspected advertisement URL judging module, configured to scan the original DOM tree by using an anti-hijack software development kit, and judge whether a suspected advertisement URL exists in the original DOM tree;

a cache library storage module, configured to store the suspected advertisement URL in a cache library when the suspected advertisement URL exists in the original DOM tree;

and the blacklist domain name acquisition module is used for determining a blacklist domain name based on the suspected advertisement URL in the cache library and storing the blacklist domain name in a blacklist library.

In a third aspect, an embodiment of the present invention provides a terminal device, which includes a memory, a processor, and a computer program that is stored in the memory and is executable on the processor, where the processor implements the steps of the method for creating a blacklist library for preventing traffic hijacking when executing the computer program.

In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the steps of the blacklist library creation method for preventing traffic hijacking are implemented.

According to the method, device, equipment and medium for creating a blacklist library for preventing traffic hijacking provided by the embodiments of the present invention, the URL to be accessed is obtained from the HTTP access request sent by the client, and the original DOM tree of the corresponding original webpage is obtained based on that URL. The original DOM tree is then scanned with the anti-hijacking software development kit, and the suspected advertisement URLs found in it are stored in a cache library, which helps improve the efficiency of the subsequent blacklist domain name extraction. Domain name extraction is performed on the suspected advertisement URLs in the cache library to obtain the blacklist domain names, which helps improve the accuracy with which blacklist domain names are confirmed. The blacklist domain names are stored in the blacklist library, which improves the accuracy of subsequent blacklist domain name identification on the original webpage corresponding to a URL to be accessed, speeds up the identification of network advertisement resource information in that webpage, and makes the identification of network advertisement resource information more comprehensive.

Drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive labor.

Fig. 1 is a flowchart of a method for creating a blacklist library for preventing traffic hijacking according to embodiment 1 of the present invention.

Fig. 2 is a specific diagram of step S30 in fig. 1.

Fig. 3 is a specific diagram of step S50 in fig. 1.

Fig. 4 is another flowchart of a method for creating a blacklist library for preventing traffic hijacking according to embodiment 1 of the present invention.

Fig. 5 is a schematic block diagram of a device for creating a blacklist library for preventing traffic hijacking according to embodiment 2 of the present invention.

Fig. 6 is a schematic diagram of a terminal device in embodiment 4 of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Example 1

Fig. 1 shows a flowchart of the method for creating a blacklist library for preventing traffic hijacking in this embodiment. The method is applied to a server that exchanges information with a client over a network. It prevents an advertisement operator from inserting network advertisement resource information into normal webpage resource information, and thereby prevents traffic hijacking by the advertisement operator. As shown in fig. 1, the method includes the following steps:

S10: Acquire an HTTP access request sent by the client, wherein the HTTP access request comprises a URL to be accessed.

The URL to be accessed refers to the address of the webpage that the user needs to access. Specifically, a server communicatively connected to the client receives the HTTP access request sent by the client. The HTTP access request generally carries a webpage address (URL), which is the address of the webpage that the client asks the server for.

S20: Acquire a corresponding original webpage based on the URL to be accessed, wherein the original webpage corresponds to an original DOM tree.

Specifically, the original webpage refers to the webpage corresponding to the URL to be accessed. The server acquires the original webpage according to the URL to be accessed in the HTTP access request. Each original webpage corresponds to one DOM tree, which is the original DOM tree of that webpage. The original DOM tree is the DOM tree covering all webpage resource information loaded by the original webpage corresponding to the URL to be accessed.

The DOM (Document Object Model) tree is the document object model of an HTML document. HTML (HyperText Markup Language) is a markup language designed for creating webpages and other information that can be viewed in a web browser. A webpage is essentially an HTML document, and the DOM tree is the document object model corresponding to that webpage. In the DOM tree, each element of the webpage is treated as an individual object, so that the elements can be read or edited by a programming language. A webpage contains at least one element, and each element corresponds to one DOM tag in the DOM tree; that is, a DOM tree contains at least one DOM tag.
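As a rough illustration of how a webpage maps onto a DOM tree, the following Python sketch builds a simplified tree with the standard-library html.parser module. The Node and TreeBuilder classes and the build_dom function are hypothetical names introduced only for this example. A real browser DOM also contains text nodes and applies error-recovery rules that are omitted here.

```python
from html.parser import HTMLParser

# Void elements never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """One DOM tag: its name, its attributes and its child tags."""

    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM tree (elements only, no text nodes)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self._stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self._stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i].tag == tag:
                del self._stack[i:]
                break


def build_dom(html_text):
    """Parse an HTML string and return the root of the simplified tree."""
    builder = TreeBuilder()
    builder.feed(html_text)
    builder.close()
    return builder.root
```

Each element of the page becomes one Node, so a page with at least one element always yields a tree with at least one tag below the root.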

S30: Scan the original DOM tree with an anti-hijacking software development kit, and judge whether a suspected advertisement URL exists in the original DOM tree.

The anti-hijacking software development kit consists of a set of JavaScript code for detecting whether a suspected advertisement URL exists. The software development kit is introduced into the page in the browser by means of a script tag, for example <script src="a.js">, where src is followed by the address of the software development kit. A Software Development Kit (SDK) is a toolkit provided for software development, generally a collection of development tools for building application software for a specific software package, software framework, hardware platform, operating system, and the like.

The suspected advertisement URL is the URL corresponding to a DOM tag that matches preset features. The preset features are the features of DOM tags corresponding to advertisement code implanted by an advertisement operator. They include, but are not limited to, the advertisement code integrity feature, the URL jump feature, and the absolute positioning feature that is needed to display the advertisement at a specific position on the webpage. The advertisement code integrity feature means that the advertisement information an advertisement operator needs to display is complete, so the corresponding advertisement code is one complete piece of code. That is, the advertisement code appears in the DOM tree as a whole, for example as code that begins with <div> and ends with </div>. The URL jump feature means that an advertisement picture is inserted together with an <a> URL link, where a is a character string indicating the storage location of the picture. The absolute positioning feature means that many iframes and divs embedded with advertisement code are added at the tail of the DOM tree of the original webpage corresponding to the URL to be accessed. For example, if the last element of the original webpage is <div id='last-div'>, the illegally inserted code is </div><script src=a.js>.

Specifically, the original webpage corresponding to the URL to be accessed loads webpage resource information, which can be displayed on the webpage in various forms, including but not limited to pictures, text, web addresses and videos. This webpage resource information constitutes the elements of the webpage, and the software development kit sees every such element as a DOM tag.

Further, after the HTTP access request sent by the client is obtained, the server obtains the anti-hijacking software development kit based on the HTTP access request. After the original webpage corresponding to the URL to be accessed has loaded all of its webpage resource information, an onload state event occurs on the original webpage. The state event is a request event that invokes the anti-hijacking software development kit to process the webpage resource information loaded on the original webpage. The state event provides an interface through which the anti-hijacking software development kit can be called to scan the DOM tree.

Based on the state event, the anti-hijacking software development kit scans the DOM tree of the original webpage corresponding to the URL to be accessed in a breadth-first manner. Scanning starts from the html tag at the outermost layer of the DOM tree and proceeds through the DOM tags layer by layer, checking whether any DOM tag matching the preset features carries a suspected advertisement URL. A breadth-first scan traverses all DOM tags in the DOM tree level by level: all DOM tags of one level are scanned before the DOM tags of the next level, so that adjacent DOM tags of the same level are visited in the order in which they are dequeued. This makes the scan suitable for covering the DOM tree comprehensively. If any one of the three preset features (the advertisement code integrity feature, the URL jump feature or the absolute positioning feature) is found in the original DOM tree, it can be determined that a suspected advertisement URL exists in the original DOM tree. A suspected advertisement URL is a URL preliminarily judged to be a possible advertisement. Identifying suspected advertisement URLs supports the determination of blacklist domain names, so that in step S50 the domain names extracted from the suspected advertisement URLs can be stored in the blacklist library as blacklist domain names.
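The breadth-first scan described above can be sketched as follows. This is a minimal illustration in Python rather than the JavaScript of the anti-hijacking software development kit. The Node class, the scan_breadth_first function and the choice of src and href as the URL-carrying attributes are assumptions made for the example, and the check of the three preset features is left out.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a DOM tag."""
    tag: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


# Assumption: URLs are read from these attributes only.
URL_ATTRS = ("src", "href")


def scan_breadth_first(root):
    """Visit every tag level by level; return (visit order, URLs found)."""
    order, urls = [], []
    queue = deque([root])
    while queue:
        node = queue.popleft()          # tags are handled in dequeue order
        order.append(node.tag)
        for attr in URL_ATTRS:
            value = node.attrs.get(attr)
            if value:
                urls.append(value)
        queue.extend(node.children)     # next level waits behind this level
    return order, urls
```

Because children are appended to the end of the queue, every tag of one level is dequeued before any tag of the next level, which is the property the scan relies on.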

S40: If a suspected advertisement URL exists in the original DOM tree, store the suspected advertisement URL in a cache library.

Specifically, it is judged whether a DOM tag matching the preset features exists in the original DOM tree. If so, a suspected advertisement URL exists in the DOM tree, and the URL of that DOM tag, namely the suspected advertisement URL, is stored in the cache library. It can be understood that storing the suspected advertisement URLs in the cache library allows the data in the cache library to be processed quickly (including but not limited to being queried), without sending a request to the server and waiting for a processing instruction from the server.

The cache library in this embodiment may be a MySQL relational database. MySQL is an open-source relational database management system that provides application programming interfaces (APIs) for multiple programming languages, supports multiple field types, and provides a full set of operators for the SELECT and WHERE clauses of queries. MySQL is fast, reliable and adaptable. Using a MySQL relational database to store the suspected advertisement URLs makes master-slave configuration and read-write separation possible and provides efficient data storage.
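A minimal sketch of such a cache library is shown below. It uses Python's built-in sqlite3 module as a stand-in for MySQL so that the example runs without a database server. The table name suspected_url and the helper functions are hypothetical.

```python
import sqlite3


def open_cache(path=":memory:"):
    """Open (or create) the cache library; sqlite3 stands in for MySQL."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS suspected_url ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " url TEXT NOT NULL)"
    )
    return conn


def store_suspected_url(conn, url):
    """Step S40: record one suspected advertisement URL."""
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT INTO suspected_url (url) VALUES (?)", (url,))


def find_suspected_urls(conn, pattern):
    """Query the cache with SELECT ... WHERE, in insertion order."""
    rows = conn.execute(
        "SELECT url FROM suspected_url WHERE url LIKE ? ORDER BY id",
        (pattern,),
    )
    return [row[0] for row in rows]
```

The same URL may be stored several times on purpose: the later count per domain name depends on repeated sightings being kept.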

S50: Determine a blacklist domain name based on the suspected advertisement URLs in the cache library, and store the blacklist domain name in a blacklist library.

The blacklist domain name is a domain name obtained by performing domain name extraction on a suspected advertisement URL. The blacklist library is a database that stores blacklist domain names. Specifically, domain name extraction is performed on the suspected advertisement URLs stored in the cache library, and if a domain name extracted from a suspected advertisement URL satisfies a preset blacklist judgment method, that domain name is determined to be a blacklist domain name. The blacklist domain name is then stored in a pre-established blacklist library, to serve as a reference for subsequent blacklist domain name identification.

In steps S10 to S50, the URL to be accessed is obtained from the HTTP access request sent by the client, and the original DOM tree of the corresponding original webpage is obtained based on that URL. The original DOM tree is scanned with the anti-hijacking software development kit to obtain the suspected advertisement URLs in it, domain name extraction is performed on the suspected advertisement URLs to obtain blacklist domain names, and the blacklist domain names are stored in the blacklist library. The resulting blacklist library helps improve the accuracy of subsequent blacklist domain name identification on the original webpage corresponding to a URL to be accessed, and makes the identification of network advertisement resource information more comprehensive.

In one embodiment, as shown in fig. 2, in step S30, scanning the original DOM tree with the anti-hijacking software development kit, and determining whether there is a suspected advertisement URL in the original DOM tree, specifically includes the following steps:

S31: Scan the original DOM tree with the anti-hijacking software development kit to obtain the original URLs contained in the original DOM tree.

Specifically, a DOM tree contains at least one DOM tag. The anti-hijacking software development kit scans the original DOM tree of the original webpage corresponding to the URL to be accessed in a breadth-first manner, starting from the html tag at the outermost layer of the original DOM tree and scanning the DOM tags of every level layer by layer. It identifies the DOM tags that carry a URL and reads the URL contained in each of them. These URLs are the original URLs to be acquired.

S32: If the domain name of an original URL does not match the domain name of the URL to be accessed, determine that a suspected advertisement URL exists in the original DOM tree.

The domain name of the original URL is the Internet address obtained by performing domain name extraction on the original URL, and the domain name of the URL to be accessed is the Internet address obtained by performing domain name extraction on the URL to be accessed. The extraction process is described in detail in step S51 and is not repeated here.

Specifically, the domain name of each original URL contained in the original DOM tree and the domain name of the URL to be accessed are determined, and the two are compared. If they match, the original URL belongs to the original webpage resource information of the webpage that the user needs to access. If they do not match, a suspected advertisement URL exists in the original DOM tree, and the original URL does not belong to the original webpage resource information of the webpage that the user needs to access. Matching the domain name of the original URL against the domain name of the URL to be accessed makes it possible to determine quickly and effectively whether a suspected advertisement URL exists in the original webpage.
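Steps S31 and S32 could be sketched as follows, assuming that "matching" means exact equality of host names and that relative URLs are resolved against the URL to be accessed. The patent performs the extraction with a regular expression in the software development kit; Python's urllib.parse is used here only for brevity, and the function names are hypothetical.

```python
from urllib.parse import urljoin, urlsplit


def domain_of(url):
    """Return the lower-cased host name of a URL ('' if it has none)."""
    return (urlsplit(url).hostname or "").lower()


def find_suspected_urls(page_url, original_urls):
    """Return the original URLs whose domain differs from the page's domain."""
    page_domain = domain_of(page_url)
    suspected = []
    for url in original_urls:
        # Assumption: a relative URL belongs to the page itself.
        absolute = urljoin(page_url, url)
        if domain_of(absolute) != page_domain:
            suspected.append(url)
    return suspected
```

Under this exact-match assumption a legitimate resource served from another host is also reported as suspected, which is why a suspected URL is only a preliminary finding at this stage.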

In one embodiment, as shown in fig. 3, in step S50, the determining the blacklisted domain name based on the suspected advertisement URL in the cache library specifically includes the following steps:

S51: Perform domain name extraction on each suspected advertisement URL in the cache library to obtain a corresponding suspected domain name.

Once a suspected advertisement URL is determined, it is stored in the cache library, so the cache library holds at least one suspected advertisement URL. Domain name extraction is performed on each suspected advertisement URL in the cache library, and the extracted domain name is a suspected domain name.

Further, a regular expression in the anti-hijacking software development kit is called to extract the domain name of each suspected advertisement URL in the cache library and obtain the corresponding suspected domain name.

A regular expression (often abbreviated as regex, regexp or RE in code) is a logical formula that operates on character strings. In the present embodiment, the regular expression expresses a filtering logic applied to a character string. The character string includes normal characters (e.g., the letters a to z) and special characters (also known as "metacharacters", e.g., $, &, #, +).

Specifically, the anti-hijacking software development kit contains a packaged regular expression. Step S51 specifically includes: splitting each suspected advertisement URL in the cache library with the packaged regular expression into three parts, namely a protocol name, a domain name and parameters; then discarding the protocol name and the parameter part that follows the domain name, so that only the domain name is kept as the corresponding suspected domain name. Suppose the suspected advertisement URL is: http://pos.baidu.com/she is 250& wid is 250& di is u3031286& ltu is lV-RgLBX E5wJyFr & r is 35d363d1cad5eabfcd131082d275f954#. Here "http" is the protocol name, "pos.baidu.com" is the domain name, and everything after the domain name is collectively referred to as the parameters. When the regular expression is used to extract the domain name of this suspected advertisement URL, only the domain name part is kept.
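The split into protocol name, domain name and parameters might look like the following Python sketch. The pattern URL_PARTS is an assumption, not the packaged regular expression of the software development kit, which the patent does not disclose, and the URLs in the usage note are hypothetical. A port number, if present, is treated as part of the remainder.

```python
import re

# Hypothetical pattern: protocol name, "://", domain name, then everything else.
URL_PARTS = re.compile(
    r"^(?P<protocol>[a-zA-Z][a-zA-Z0-9+.-]*)://"
    r"(?P<domain>[^/?#:]+)"
    r"(?P<rest>.*)$"
)


def extract_domain(url):
    """Return the lower-cased domain name of a URL, or None if it does not parse."""
    match = URL_PARTS.match(url.strip())
    if match is None:
        return None
    return match.group("domain").lower()
```

For example, extract_domain("http://ads.example.com/s?wid=250#") keeps only "ads.example.com" and drops both the "http" protocol name and the "/s?wid=250#" remainder. URLs that embed user credentials before the host are not handled by this simplified pattern.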

S52: Determine the suspected domain names whose number in the cache library reaches a preset value as blacklist domain names.

A suspected domain name is determined to be a blacklist domain name when the number of times the same suspected domain name is stored in the cache library reaches (i.e., is greater than or equal to) a preset value. The preset value is a preset number of occurrences of a suspected domain name in the cache library, and is used to judge whether the suspected domain name is a blacklist domain name.

If a suspected domain name appears only once in the cache library and has not reached the preset value, it cannot be determined to be a blacklist domain name; it may merely be a domain name that does not match the domain name of the URL to be accessed. Only when the number of occurrences of the suspected domain name in the cache library reaches the preset value can it be determined to be a blacklist domain name. It can be understood that determining blacklist domain names only when this number reaches the preset value reduces misjudgment and improves the accuracy with which blacklist domain names are determined.
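The threshold rule of step S52 can be illustrated with a short Python sketch. The function name and the sample domain names are hypothetical, and the patent does not fix a particular preset value.

```python
from collections import Counter


def blacklist_domains(suspected_domains, preset_value):
    """Return the suspected domain names stored at least preset_value times."""
    counts = Counter(suspected_domains)
    return {domain for domain, count in counts.items() if count >= preset_value}


# One entry per suspected advertisement URL stored in the cache library.
cache_domains = ["ads.example.net", "cdn.example.org",
                 "ads.example.net", "ads.example.net"]
blacklist = blacklist_domains(cache_domains, preset_value=3)
```

With a preset value of 3, only "ads.example.net" is promoted to the blacklist, while "cdn.example.org", seen once, stays a mere suspect.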

In one embodiment, as described above, a suspected domain name is determined to be a blacklist domain name once its number in the cache library reaches the preset value. Misjudgment is still possible, in which case a wrongly judged domain name enters the blacklist library and the corresponding content can no longer be accessed or otherwise operated on. As shown in fig. 4, after the step of storing the blacklist domain name in the blacklist library, the method for creating a blacklist library for preventing traffic hijacking further includes:

S61: Acquire a misjudgment recovery request, wherein the misjudgment recovery request comprises a target URL.

The misjudgment recovery request is a request, received by the server, in which a user asks to restore and view hidden content. Hidden content is the webpage resource information displayed by a URL whose domain name has been added to the blacklist library. The target URL is the URL corresponding to the hidden content to be restored. Specifically, misjudgment may occur when blacklist domain names are confirmed. When a user accesses a webpage, the server judges the domain name of a suspected advertisement URL that is inconsistent with the domain name of the webpage to be accessed to be a blacklist domain name and adds it to the blacklist library. The webpage therefore displays only the content whose domain name has not been added to the blacklist library, while the content whose domain name has been added is hidden. When the browser displays the webpage content, the webpage can show a notification asking whether to view the hidden content. If the user clicks to restore the hidden content, the server receives a recovery request, which is the misjudgment recovery request. The misjudgment recovery request contains the URL corresponding to the hidden content to be restored, which is the target URL. Acquiring misjudgment recovery requests reduces the number of domain names wrongly added to the blacklist library and helps the user browse the complete webpage content.

S62: Call a regular expression in the anti-hijacking software development kit to extract the domain name of the target URL and obtain the target domain name.

When the server receives the misjudgment recovery request sent by the user, it calls the regular expression in the anti-hijacking software development kit to extract the domain name of the target URL and obtain the corresponding target domain name. The domain name extraction process is described in step S51 and is not repeated here.

S63: Delete the blacklist domain name that is stored in the blacklist library and is consistent with the target domain name, and update the blacklist library.

Based on the acquired target domain name, the server compares the target domain name with the blacklist domain names stored in the blacklist library, deletes the blacklist domain name that is consistent with the target domain name, and updates the blacklist library. Step S63 ensures that the blacklist domain names stored in the blacklist library can be adjusted continuously according to actual conditions, which reduces the misjudgment rate and keeps the blacklist library accurate.
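Steps S61 to S63 can be sketched in Python as follows. The blacklist library is modelled as an in-memory set, the function name is hypothetical, and urllib.parse stands in for the regular expression of the software development kit.

```python
from urllib.parse import urlsplit


def handle_recovery_request(target_url, blacklist):
    """Process a misjudgment recovery request against the blacklist library.

    Returns the target domain name and whether it was removed.
    """
    # S62: extract the target domain name from the target URL.
    target_domain = (urlsplit(target_url).hostname or "").lower()
    # S63: delete the matching blacklist domain name, if there is one.
    removed = target_domain in blacklist
    blacklist.discard(target_domain)
    return target_domain, removed
```

A request for a URL whose domain name is not in the blacklist leaves the blacklist library unchanged, so repeated recovery requests are harmless in this sketch.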

In this embodiment, after step S63, that is, after the step of deleting the blacklist domain name that is stored in the blacklist repository and is consistent with the target domain name, the method for creating a blacklist repository for preventing traffic hijacking may further include:

S64: Take the blacklist domain name that was stored in the blacklist library and is consistent with the target domain name as a white list domain name, and store the white list domain name in a white list library.

A white list library is created together with the blacklist library. The white list library is a database that stores the target domain names corresponding to the URLs of webpages that the user is allowed to access. The blacklist domain names stored in the blacklist library are compared with the target domain name, and the blacklist domain name consistent with the target domain name is taken as a white list domain name and stored in the white list library.

In this embodiment, the white list library further includes pre-stored white list domain names. Some webpages to be accessed are allowed to carry network advertisement resource information that is not part of their normal webpage resource information. In that case, a regular expression can be used to extract the domain name of the URL corresponding to the network advertisement resource information, and the extracted domain name is stored in the white list library in advance as a pre-stored white list domain name.
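The interplay of the blacklist library and the white list library might be sketched as follows, with both libraries modelled as Python sets. The function names, the sample domain names and the rule that a white list entry takes precedence when deciding what to display are assumptions made for this illustration.

```python
def restore_to_whitelist(target_domain, blacklist, whitelist):
    """Move a misjudged domain name from the blacklist to the white list.

    Combines the deletion of step S63 with the storage of step S64.
    Returns True if the domain name was in the blacklist.
    """
    if target_domain not in blacklist:
        return False
    blacklist.remove(target_domain)
    whitelist.add(target_domain)
    return True


def is_displayable(domain, blacklist, whitelist):
    """Assumed rule: white list entries are shown, blacklist entries hidden."""
    return domain in whitelist or domain not in blacklist


# The white list may start out with pre-stored domain names.
whitelist = {"partner.example.com"}
blacklist = {"ads.example.net", "promo.example.org"}
restore_to_whitelist("promo.example.org", blacklist, whitelist)
```

After the call, "promo.example.org" is displayable again, while "ads.example.net" remains hidden.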

When the anti-hijacking software development kit scans all DOM tags in the original DOM tree of a webpage accessed by a user and determines a suspected advertisement URL, the suspected advertisement URL is stored in the cache library, and domain name extraction is performed on it to determine the corresponding domain name (namely the suspected domain name in step S51). If the suspected domain name is consistent with a white list domain name in the white list library, the webpage resource information corresponding to the suspected advertisement URL is displayed. For example, a Baidu promotion advertisement that is allowed to be inserted in a Baidu webpage may be determined to be a suspected advertisement URL by the scan of the anti-hijacking software development kit. If its domain name is found in the white list library after domain name extraction, the webpage resource information of the URL corresponding to the Baidu promotion advertisement can be displayed. This prevents webpage content that a webpage allows the user to access from being wrongly added to the blacklist library and lost unnecessarily, so that the webpage content is presented more completely.

In one embodiment, after step S40, that is, after the step of storing the suspected advertisement URL in the cache library, the method for creating the blacklist library for preventing traffic hijacking further includes: if the domain name corresponding to the suspected advertisement URL is stored in the white list library, deleting the suspected advertisement URL from the cache library.

It can be understood that, after the suspected advertisement URL is stored in the cache library, domain name extraction needs to be performed on it to determine the corresponding domain name (i.e., the suspected domain name in step S51). When that domain name is found to be stored in the white list library, the domain name belongs to the white list, and the content of the URL corresponding to the domain name is web page content that needs to be displayed. If only the domain name stored in the blacklist library were deleted while the suspected advertisement URL remained in the cache library, the web page content corresponding to the suspected advertisement URL might still fail to display normally. Therefore, after confirming that the domain name corresponding to the suspected advertisement URL is stored in the white list library, the suspected advertisement URL needs to be deleted from the cache library.
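As a hedged illustration of this clean-up step, the following Python sketch removes from the cache library every suspected advertisement URL whose domain name is stored in the white list library. The list-based cache, the function name `purge_whitelisted`, and the simplified regular expression are assumptions for illustration only; the embodiment does not specify how the cache library is stored.

```python
import re

# Hypothetical host-extraction pattern (simplified; no userinfo or IPv6).
DOMAIN_RE = re.compile(r"^(?:[A-Za-z][A-Za-z0-9+.-]*:)?//([^/:?#]+)")


def purge_whitelisted(cache_library, white_list_library):
    """Return the cache entries whose domain is NOT in the white list."""
    kept = []
    for url in cache_library:
        match = DOMAIN_RE.match(url)
        domain = match.group(1).lower() if match else None
        if domain in white_list_library:
            continue  # whitelisted: delete this URL from the cache
        kept.append(url)
    return kept


# Placeholder data for illustration.
cache_library = [
    "https://promo.example.com/ad.js",
    "http://ads.hijacker.example/pop.js",
]
cache_library = purge_whitelisted(cache_library, {"promo.example.com"})
```

Returning a new list rather than deleting in place keeps the sketch free of mutation while iterating; a real cache library could equally delete entries by key.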

According to the method for creating the blacklist library for preventing traffic hijacking provided by this embodiment of the invention, the domain name of each suspected advertisement URL is extracted by the regular expression in the anti-hijacking software development kit and stored in the cache library, which improves the efficiency of subsequent blacklist domain name extraction. When the number of suspected advertisement URLs in the cache library that correspond to the same domain name reaches a preset value, that domain name is stored in the blacklist library. The confirmed blacklist library thus improves the accuracy of subsequent blacklist domain name identification on the original web page corresponding to the URL to be accessed, and makes the identification of network advertisement resource information more comprehensive.

It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.

Example 2

Fig. 5 is a schematic block diagram of a blacklist library creation apparatus for preventing traffic hijacking, which corresponds one-to-one to the blacklist library creation method for preventing traffic hijacking in embodiment 1. As shown in fig. 5, the apparatus includes an access request obtaining module 10, an original webpage obtaining module 20, a suspected advertisement URL judging module 30, a cache library storage module 40, and a blacklist domain name acquisition module 50. The functions implemented by these modules correspond one-to-one to the steps of the blacklist library creation method for preventing traffic hijacking in embodiment 1; to avoid repetition, they are not described in detail in this embodiment.

The access request obtaining module 10 is configured to obtain an HTTP access request sent by a client, where the HTTP access request includes a URL to be accessed.

The original webpage obtaining module 20 is configured to obtain a corresponding original webpage based on the URL to be accessed, where the original webpage corresponds to an original DOM tree.

The suspected advertisement URL judging module 30 is configured to scan the original DOM tree by using the anti-hijacking software development kit, and judge whether a suspected advertisement URL exists in the original DOM tree.

The cache library storage module 40 is configured to store the suspected advertisement URL in the cache library when the suspected advertisement URL exists in the original DOM tree.

The blacklist domain name acquisition module 50 is configured to determine a blacklist domain name based on the suspected advertisement URL in the cache library, and store the blacklist domain name in the blacklist library.

Preferably, the suspected advertisement URL judging module 30 includes an original URL obtaining unit 31 and a suspected advertisement URL confirming unit 32.

The original URL obtaining unit 31 is configured to scan the original DOM tree by using the anti-hijacking software development kit, and obtain an original URL included in the original DOM tree.

The suspected advertisement URL confirming unit 32 is configured to determine that a suspected advertisement URL exists in the original DOM tree when the domain name of the original URL does not match the domain name of the URL to be accessed.
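The cooperation of units 31 and 32 can be sketched as follows, under stated assumptions. The sketch uses Python's standard `html.parser` and `urllib.parse` modules as a stand-in for the JavaScript software development kit scanning a live DOM tree. The class name `UrlCollector`, the function name `find_suspected_urls`, and the choice to inspect only `src` and `href` attributes are hypothetical. Only the domain-mismatch test is modelled; other features of suspected advertisement URLs are not covered.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class UrlCollector(HTMLParser):
    """Collects `src` and `href` attribute values from every start tag."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("src", "href") and value:
                self.urls.append(value)


def find_suspected_urls(url_to_access, original_html):
    """Return original URLs whose domain differs from the page's domain."""
    page_domain = urlsplit(url_to_access).hostname
    collector = UrlCollector()
    collector.feed(original_html)
    suspected = []
    for original_url in collector.urls:
        domain = urlsplit(original_url).hostname
        # Relative URLs have no host and are treated as same-site here.
        if domain is not None and domain != page_domain:
            suspected.append(original_url)
    return suspected


# Placeholder page and domains for illustration.
page = (
    '<html><body><img src="/logo.png">'
    '<script src="http://ads.other.example/pop.js"></script>'
    '<a href="http://www.site.example/about">About</a></body></html>'
)
suspected_urls = find_suspected_urls("http://www.site.example/index.html", page)
```

Exact host comparison is the simplest reading of "does not match"; a deployment might instead compare registrable domains so that sibling subdomains of the same site are not flagged.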

Preferably, the blacklist domain name acquisition module 50 includes a suspected domain name obtaining unit 51 and a blacklist domain name obtaining unit 52.

The suspected domain name obtaining unit 51 is configured to perform domain name extraction on each suspected advertisement URL in the cache library, and obtain a corresponding suspected domain name.

The blacklist domain name obtaining unit 52 is configured to determine the suspected domain names whose number in the cache library reaches a preset value as blacklist domain names.

Preferably, the suspected domain name obtaining unit 51 is configured to call a regular expression in the anti-hijacking software development kit to perform domain name extraction on each suspected advertisement URL in the cache library, so as to obtain a corresponding suspected domain name.
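The behaviour of units 51 and 52 can be sketched in Python as counting how often each suspected domain name occurs in the cache library and promoting those that reach the preset value. The function name `blacklist_domains`, the regular expression, the preset value of 3, and the sample URLs are all assumptions for illustration; the embodiment does not fix a particular pattern or threshold.

```python
import re
from collections import Counter

# Hypothetical host-extraction pattern (simplified; no userinfo or IPv6).
DOMAIN_RE = re.compile(r"^(?:[A-Za-z][A-Za-z0-9+.-]*:)?//([^/:?#]+)")


def blacklist_domains(cache_library, preset_value):
    """Return the suspected domains seen at least `preset_value` times."""
    counts = Counter()
    for url in cache_library:
        match = DOMAIN_RE.match(url)
        if match:
            counts[match.group(1).lower()] += 1
    return {domain for domain, n in counts.items() if n >= preset_value}


# Placeholder cache contents for illustration.
cache_library = [
    "http://ads.hijacker.example/a.js",
    "http://ads.hijacker.example/b.js",
    "http://ads.hijacker.example/c.js",
    "http://cdn.partner.example/lib.js",
]
blacklist_library = blacklist_domains(cache_library, preset_value=3)
```

Requiring repeated sightings before blacklisting is what keeps a single unusual third-party resource from being treated as hijacked advertising.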

Preferably, the apparatus for creating a blacklist library for preventing traffic hijacking further includes a misjudgment recovery request obtaining unit 61, a target domain name obtaining unit 62, a blacklist library updating unit 63, and a white list domain name obtaining unit 64.

The misjudgment recovery request obtaining unit 61 is configured to obtain a misjudgment recovery request, where the misjudgment recovery request includes a target URL.

The target domain name obtaining unit 62 is configured to call a regular expression in the anti-hijacking software development kit to perform domain name extraction on the target URL, so as to obtain the target domain name.

The blacklist library updating unit 63 is configured to delete the blacklist domain name that is stored in the blacklist library and is consistent with the target domain name, and update the blacklist library.

The white list domain name obtaining unit 64 is configured to use the blacklist domain name that is stored in the blacklist library and is consistent with the target domain name as a white list domain name, and store the white list domain name in the white list library.
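The misjudgment recovery flow of units 61 to 64 can be sketched as follows, assuming that both libraries are held as in-memory sets. The function name `recover_misjudged`, the regular expression, and the sample domains are hypothetical and serve only to illustrate the order of operations.

```python
import re

# Hypothetical host-extraction pattern (simplified; no userinfo or IPv6).
DOMAIN_RE = re.compile(r"^(?:[A-Za-z][A-Za-z0-9+.-]*:)?//([^/:?#]+)")


def recover_misjudged(target_url, blacklist_library, white_list_library):
    """Move the target URL's domain from the blacklist to the white list.

    Returns True if a blacklist entry was recovered, False otherwise.
    """
    match = DOMAIN_RE.match(target_url)
    if match is None:
        return False  # no domain could be extracted from the target URL
    target_domain = match.group(1).lower()
    if target_domain not in blacklist_library:
        return False  # nothing to recover
    blacklist_library.remove(target_domain)  # update the blacklist library
    white_list_library.add(target_domain)    # store as a white list domain
    return True


# Placeholder libraries for illustration.
blacklist_library = {"promo.example.com", "ads.hijacker.example"}
white_list_library = set()
recovered = recover_misjudged(
    "https://promo.example.com/ad.js", blacklist_library, white_list_library
)
```

Performing the deletion and the white list insertion in one function keeps the two libraries from disagreeing about the same domain name.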

Preferably, the blacklist library creation apparatus for preventing traffic hijacking further includes a suspected advertisement URL deleting module 70, configured to delete the suspected advertisement URL from the cache library when the domain name corresponding to the suspected advertisement URL is stored in the white list library.

Example 3

This embodiment provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the method for creating a blacklist library for preventing traffic hijacking in embodiment 1 is implemented, and details are not described here again to avoid repetition. Alternatively, the computer program, when executed by the processor, implements the functions of each module/unit in the apparatus for creating a blacklist library for preventing traffic hijacking in embodiment 2, and is not described herein again to avoid repetition.

Example 4

Fig. 6 is a schematic diagram of a terminal device according to an embodiment of the present invention. As shown in fig. 6, the terminal device 80 of this embodiment includes: a processor 81, a memory 82, and a computer program 83 stored in the memory 82 and operable on the processor 81, such as a blacklist library creation program for preventing traffic hijacking. The processor 81 executes the computer program 83 to implement the steps in the embodiments of the above-described blacklist library creation method for preventing traffic hijacking, such as steps S10 to S50 shown in fig. 1. Alternatively, the processor 81 executes the computer program 83 to implement the functions of the modules/units in the above-mentioned apparatus embodiments, such as the functions of the access request obtaining module 10, the original webpage obtaining module 20, the suspected advertisement URL judging module 30, the cache library storage module 40 and the blacklist domain name acquisition module 50 shown in fig. 5.

Illustratively, the computer program 83 may be divided into one or more modules/units, which are stored in the memory 82 and executed by the processor 81 to carry out the invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, and these segments describe the execution of the computer program 83 in the terminal device 80. For example, the computer program 83 may be divided into the access request obtaining module 10, the original webpage obtaining module 20, the suspected advertisement URL judging module 30, the cache library storage module 40 and the blacklist domain name acquisition module 50.

The terminal device 80 may be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server. The terminal device may include, but is not limited to, the processor 81 and the memory 82. Those skilled in the art will appreciate that fig. 6 is merely an example of the terminal device 80 and does not constitute a limitation on it; the terminal device 80 may include more or fewer components than shown, may combine some components, or may use different components. For example, the terminal device may also include input-output devices, network access devices, buses, and the like.

The processor 81 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.

The memory 82 may be an internal storage unit of the terminal device 80, such as a hard disk or a memory of the terminal device 80. The memory 82 may also be an external storage device of the terminal device 80, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash memory card (Flash Card) provided on the terminal device 80. Further, the memory 82 may include both an internal storage unit of the terminal device 80 and an external storage device. The memory 82 is used for storing the computer program and other programs and data required by the terminal device. The memory 82 may also be used to temporarily store data that has been output or is to be output.

It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer-readable storage medium. Based on such understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium; when the computer program is executed by a processor, the steps of the method embodiments are implemented. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content of the computer-readable medium may be suitably increased or decreased as required by legislation and patent practice in a jurisdiction; for example, in some jurisdictions, in accordance with legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunications signals.

The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims

1. A blacklist base creation method for preventing traffic hijacking is characterized by comprising the following steps:

acquiring a corresponding original webpage based on the URL to be accessed, wherein the original webpage corresponds to an original DOM tree, and the original DOM tree comprises at least one DOM tag;

scanning the original DOM tree by adopting an anti-hijack software development kit, and judging whether an original URL corresponding to at least one DOM tag in the original DOM tree has a suspected advertisement URL or not, wherein the anti-hijack software development kit is a software development kit which is composed of JavaScript codes and is used for detecting whether the suspected advertisement URL exists or not; the suspected advertisement URL is a URL corresponding to a DOM tag containing at least one of an advertisement code integral characteristic, a URL skipping characteristic and an absolute positioning characteristic required to be displayed at a specific position of a webpage;

and determining a blacklist domain name according with a blacklist judgment method based on the suspected advertisement URL in the cache library, and storing the blacklist domain name in a blacklist library.

2. The method for creating the blacklist base for preventing traffic hijacking according to claim 1, wherein the scanning the original DOM tree by the anti-hijacking software development kit to determine whether the suspected advertisement URL exists in the original DOM tree comprises:

scanning the original DOM tree by adopting an anti-hijack software development kit to acquire an original URL contained in the original DOM tree;

and if the domain name of the original URL is not matched with the domain name of the URL to be visited, determining that the suspected advertisement URL exists in the original DOM tree.

3. The method of creating a blacklist bank to prevent traffic hijacking according to claim 1, wherein said determining a blacklist domain name based on said suspected advertisement URL in said cache bank comprises:

performing domain name extraction on each suspected advertisement URL in the cache library to obtain a corresponding suspected domain name;

and determining the suspected domain names with the number reaching a preset value in the cache library as blacklist domain names.

4. The method for creating a blacklist base of preventing traffic hijacking according to claim 3, wherein said performing domain name extraction on each suspected advertisement URL in the cache base to obtain a corresponding suspected domain name comprises:

calling a regular expression in the anti-hijack software development kit to extract the domain name of each suspected advertisement URL in the cache library to obtain the corresponding suspected domain name.

5. The method of creating a blacklist bank to prevent traffic hijacking according to claim 1, wherein after said step of storing said blacklist domain name in a blacklist bank, said method of creating a blacklist bank to prevent traffic hijacking further comprises:

acquiring a misjudgment recovery request, wherein the misjudgment recovery request comprises a target URL;

calling a regular expression in the anti-hijack software development kit to extract the domain name of the target URL to obtain a target domain name;

deleting the blacklist domain name which is stored in the blacklist library and is consistent with the target domain name, and updating the blacklist library.

6. The method of creating a blacklist bank for preventing traffic hijacking according to claim 5, wherein after the step of deleting the blacklist domain name consistent with the target domain name stored in the blacklist bank, the method of creating a blacklist bank for preventing traffic hijacking further comprises:

taking the blacklist domain name which is stored in the blacklist library and is consistent with the target domain name as a white list domain name, and storing the white list domain name in a white list library;

after the step of storing the suspected advertisement URL in a cache repository, the method for creating a blacklist repository for preventing traffic hijacking further comprises: and if the domain name corresponding to the suspected advertisement URL is stored in the white list library, deleting the suspected advertisement URL from the cache library.

7. A blacklist repository creation apparatus for preventing traffic hijacking, comprising:

the original webpage obtaining module is used for obtaining a corresponding original webpage based on the URL to be accessed, wherein the original webpage corresponds to an original DOM tree, and the original DOM tree comprises at least one DOM tag;

a suspected advertisement URL judging module, configured to scan the original DOM tree by using an anti-hijack software development kit, and judge whether an original URL corresponding to at least one DOM tag in the original DOM tree has a suspected advertisement URL, where the anti-hijack software development kit is a software development kit composed of JavaScript codes and used for detecting whether the suspected advertisement URL exists; the suspected advertisement URL is a URL corresponding to a DOM tag containing at least one of an advertisement code integral characteristic, a URL skipping characteristic and an absolute positioning characteristic required to be displayed at a specific position of a webpage;

and the blacklist domain name acquisition module is used for determining a blacklist domain name conforming to a blacklist judgment method based on the suspected advertisement URL in the cache library and storing the blacklist domain name in a blacklist library.

8. The apparatus of claim 7, wherein the blacklist domain name acquisition module comprises:

a suspected domain name obtaining unit, configured to perform domain name extraction on each suspected advertisement URL in the cache library, and obtain a corresponding suspected domain name;

and the blacklist domain name acquisition unit is used for determining the suspected domain names of which the number reaches a preset value in the cache library as the blacklist domain names.

9. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method for creating a blacklist library for preventing traffic hijacking according to any one of claims 1 to 6 when executing the computer program.

10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method for creating a blacklist bank for preventing traffic hijacking as claimed in any one of claims 1 to 6.