CN112231578A

CN112231578A - Advertisement blocking system and method based on graph and machine learning

Info

Publication number: CN112231578A
Application number: CN202011233201.8A
Authority: CN
Inventors: 潘晓光; 王小华; 王宇琦; 潘晓辉; 董虎弟
Original assignee: Shanxi Sanyouhe Smart Information Technology Co Ltd
Current assignee: Shanxi Sanyouhe Smart Information Technology Co Ltd
Priority date: 2020-11-06
Filing date: 2020-11-06
Publication date: 2021-01-15

Abstract

The invention relates to an advertisement blocking system and method based on graph and machine learning. The system includes a tracing graph construction module, a feature extraction module and a classifier module. The tracing graph construction module collects page resource loading information in the browser's page rendering pipeline, builds a tracing graph, and maps each resource in a page to its unique source. The learning module trains on the data obtained by the marking module to produce a classifier that identifies advertisement resources. The classifier module classifies the network resource nodes of the tracing graph received from the feature extraction module, identifies the advertisement resources among them, and extracts the corresponding urls. By constructing the tracing graph and using machine learning to identify and intercept advertisement resources in the page, the invention speeds up page loading and achieves high interception accuracy; the blacklist can also be expanded automatically from the advertisement resources found in the tracing graph. The invention relates to the technical field of network security.

Description

Advertisement blocking system and method based on graph and machine learning

Technical Field

The invention relates to the technical field of network security, in particular to an advertisement blocking system and method based on graph and machine learning.

Background

The currently known effective interception method uses a blacklist and a browser extension to intercept advertisements appearing on a webpage. This method has proved effective, but advertisements on webpages are updated continuously while the blacklist must be updated manually, which greatly increases labor cost. Blacklist-based interception also produces interception errors and slows webpage loading, so that normal page content may fail to display, which degrades the user's browsing experience.

Therefore, there is a need for improvements in the prior art.

Disclosure of Invention

To overcome the defects of the prior art, the invention provides an advertisement blocking method and system based on graph and machine learning that achieve high interception accuracy and fast page loading and can automatically expand the blacklist.

To solve the above technical problem, the invention adopts the following technical solution:

An advertisement blocking system based on graph and machine learning comprises a tracing graph construction module, a feature extraction module and a classifier module, which are connected in sequence;

the tracing graph construction module collects page resource loading information in the browser's page rendering pipeline, builds a tracing graph, and maps each resource in the page to its unique source;

the feature extraction module receives the tracing graph generated by the tracing graph construction module, extracts content features and structural features of each node in the graph (that is, each page resource), and generates a multi-dimensional feature vector for each node;

the classifier module classifies the multi-dimensional feature vectors of the nodes produced by the feature extraction module and identifies the advertisement resources among the nodes for interception.

The system further comprises a blacklist module, a marking module and a learning module, wherein the marking module is respectively connected with the feature extraction module and the blacklist module, and the marking module marks and stores the multidimensional feature vector generated by the feature extraction module according to the existing data in the blacklist module; the learning module is respectively connected with the marking module and the classifier module, and the learning module performs learning training and updates the classifier module according to the data of the marking module.

The system further comprises a feedback module, which is connected with the classifier module and the blacklist module respectively. The feedback module further processes the urls of the advertisement resources obtained by the classifier module, generates filtering rules that are not yet present in the blacklist module, and thereby expands the blacklist module.

Furthermore, the tracing graph construction module, the feature extraction module and the classifier module are arranged inside the rendering engine of the browser;

the marking module, the learning module, the feedback module and the blacklist module are deployed offline.

An advertisement blocking method based on graph and machine learning comprises the following steps:

S1, after the browser receives the webpage html document, the rendering engine parses it into a dom tree; the tracing graph construction module captures the dom tree information, monitors javascript execution, and maps each page resource to its source;

S2, the feature extraction module receives the tracing graph generated by the tracing graph construction module and extracts content features and structural features of each node in the graph, that is, each page resource;

S3, the feature extraction module generates a multi-dimensional feature vector for each node in the tracing graph and inputs it to the classifier module; the classifier module contains a trained classifier model, uses it to classify the feature vectors of the nodes extracted by the feature extraction module, and identifies the advertisement resources among the nodes for interception.
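The classification in step S3 can be illustrated with a small sketch. The source does not name a classifier model, so the nearest-centroid classifier below is only a stand-in chosen for brevity; the function names, the labels "ad" and "content", and the three-element feature vectors are assumptions made for this example.

```python
from math import dist


def fit_centroids(vectors, labels):
    """Return {label: centroid}, the per-dimension mean of each class."""
    sums, counts = {}, {}
    for vec, label in zip(vectors, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def classify(centroids, vec):
    """Assign the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: dist(centroids[label], vec))


def find_ad_urls(centroids, nodes):
    """nodes: iterable of (url, feature_vector); return urls classified as 'ad'."""
    return [url for url, vec in nodes if classify(centroids, vec) == "ad"]


# Hypothetical training vectors: [url length, has ad word, third party].
model = fit_centroids(
    [[60, 1, 1], [80, 1, 1], [20, 0, 0], [30, 0, 0]],
    ["ad", "ad", "content", "content"],
)
```

Any trained model that maps a node's feature vector to an advertisement or non-advertisement label could take the place of `classify` here.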

Further, the method also comprises the following step:

S4, the feature vectors extracted by the feature extraction module are input to the marking module, which labels them according to the existing blacklist and stores them;

every 12 h or 24 h, the learning module trains the classifier model on the data newly added by the marking module and updates the classifier module.
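The labeling part of step S4 can be sketched as follows. The source does not say how blacklist entries are matched against resources, so this example assumes, purely for illustration, that the blacklist is a set of host names and that a resource is labeled as an advertisement when its host equals a blacklisted host or is a subdomain of one. The name `label_samples` is hypothetical.

```python
from urllib.parse import urlsplit


def label_samples(samples, blacklist_hosts):
    """samples: iterable of (url, feature_vector).

    Returns a list of (feature_vector, label), where label 1 marks an
    advertisement according to the existing blacklist and 0 marks anything else.
    """
    labeled = []
    for url, vector in samples:
        host = urlsplit(url).hostname or ""
        # Assumed matching rule: exact host match or subdomain of a listed host.
        is_ad = any(host == b or host.endswith("." + b) for b in blacklist_hosts)
        labeled.append((vector, 1 if is_ad else 0))
    return labeled
```

The stored pairs then serve as training data for the learning module.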

Further, the method also comprises the following step: S5, the feedback module receives the advertisement resources identified by the classifier module, automatically generates filtering rules that conform to the blacklist syntax, and expands the blacklist.
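A minimal sketch of step S5 follows. The source does not specify the blacklist syntax; this example assumes an Adblock Plus style domain-anchor rule of the form `||host^`, and the function name `generate_rules` is hypothetical. Only rules not already in the blacklist are returned.

```python
from urllib.parse import urlsplit


def generate_rules(ad_urls, existing_rules):
    """Return new rules for ad hosts not yet covered, in first-seen order."""
    known = set(existing_rules)
    new_rules = []
    for url in ad_urls:
        host = urlsplit(url).hostname
        if not host:
            continue  # skip urls without a host, such as data: urls
        rule = f"||{host}^"  # assumed rule format (Adblock Plus style)
        if rule not in known:
            known.add(rule)
            new_rules.append(rule)
    return new_rules
```

A host-level rule is the coarsest choice; a real feedback step might prefer narrower path-based rules to limit the risk of blocking normal content from the same host.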

Further, the content features comprise the length of the resource url, whether the url contains advertisement vocabulary, and whether the resource comes from a third party;

the structural features comprise the in-degree and out-degree of the node in the tracing graph, the in-degree and out-degree of its parent node, the number of child nodes and the number of sibling nodes.
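The features listed above can be computed roughly as in the sketch below. Several simplifications are assumptions of this example and not part of the source: the graph is a plain list of (source, target) edges, the parent of a node is taken to be its first predecessor, "third party" means a host name different from the page's host, and the vocabulary set adds "ads" to the words "ad" and "banner". In this simple representation the number of child nodes equals the out-degree, so it appears once in the vector.

```python
import re
from urllib.parse import urlsplit

AD_WORDS = {"ad", "ads", "banner"}  # assumed advertisement vocabulary


def content_features(url, page_url):
    """[url length, contains ad word (0/1), third party (0/1)]."""
    tokens = re.split(r"[^a-z0-9]+", url.lower())
    has_ad_word = int(any(t in AD_WORDS for t in tokens))
    # Simplification: compare host names only, not registrable domains.
    third_party = int(urlsplit(url).hostname != urlsplit(page_url).hostname)
    return [len(url), has_ad_word, third_party]


def structural_features(node, edges):
    """[in-degree, out-degree, parent in-degree, parent out-degree, siblings]."""
    preds = [s for s, t in edges if t == node]
    succs = [t for s, t in edges if s == node]
    parent = preds[0] if preds else None  # assumption: first predecessor
    if parent is None:
        return [len(preds), len(succs), 0, 0, 0]
    parent_in = sum(1 for _, t in edges if t == parent)
    parent_children = [t for s, t in edges if s == parent]
    siblings = [t for t in parent_children if t != node]
    return [len(preds), len(succs), parent_in, len(parent_children), len(siblings)]


def feature_vector(node_url, page_url, edges):
    """Concatenate content and structural features for one graph node."""
    return content_features(node_url, page_url) + structural_features(node_url, edges)
```

Matching whole tokens instead of substrings keeps a url such as `download.js` from being flagged merely because it contains the letters "ad".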

Compared with the prior art, the invention has the following beneficial effects:

1. By constructing a tracing graph and using machine learning to identify and intercept advertisement resources in the page, page loading is accelerated and high interception accuracy is achieved.

2. The blacklist can be expanded automatically from the advertisement resources found in the tracing graph, which greatly reduces labor cost.

3. The learning module can automatically update and refine the classifier model according to the data of the marking module, so the interception capability improves continuously.

Drawings

Embodiments of the present invention are explained in further detail below with reference to the accompanying drawings.

FIG. 1 is a schematic block diagram of an advertisement blocking system of the present invention;

FIG. 2 is a flowchart of an advertisement blocking method according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Example 1:

As shown in FIG. 1 and FIG. 2, in an advertisement blocking system and method based on graph and machine learning, the system includes a tracing graph construction module, a feature extraction module and a classifier module, which are connected in sequence.

The method comprises the following steps:

For example, if an <img> tag in the dom tree is first generated by the parser and then modified by the code of a.js, the tracing graph can trace the entire change history of that <img> tag. The tracing graph construction module generates one complete tracing graph for a page, in which each node represents a page resource.
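The <img> example can be made concrete with a small data-structure sketch. The source does not describe how the tracing graph is stored, so the class `TracingGraph`, its method names, and the "create" and "modify" action labels are all assumptions made for illustration.

```python
from collections import defaultdict


class TracingGraph:
    """Directed graph; an edge (actor -> resource) records who created or modified it."""

    def __init__(self):
        self.edges = []                    # (actor, resource, action) in event order
        self.history = defaultdict(list)   # resource -> [(actor, action), ...]

    def record(self, actor, resource, action):
        """Store one event observed in the rendering pipeline."""
        self.edges.append((actor, resource, action))
        self.history[resource].append((actor, action))

    def source_of(self, resource):
        """Return the actor that first created the resource, or None if unknown."""
        for actor, action in self.history.get(resource, []):
            if action == "create":
                return actor
        return None


# The <img> tag is created by the parser and later modified by a.js.
graph = TracingGraph()
graph.record("html-parser", "img#1", "create")
graph.record("a.js", "img#1", "modify")
```

Here `graph.history["img#1"]` holds the full change history of the tag, while `graph.source_of("img#1")` returns the single source that introduced it.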

The content features include the length of the resource url, whether it contains advertisement vocabulary (e.g., ad, banner), whether it comes from a third party, and so on. The structural features include the in-degree and out-degree of the node in the tracing graph, the in-degree and out-degree of its parent node, the number of child nodes, the number of sibling nodes, and so on.

S3, the feature extraction module generates a multi-dimensional feature vector for each node in the tracing graph and inputs it to the classifier module; the classifier module contains a trained classifier model, uses it to classify the feature vectors of the nodes extracted by the feature extraction module, identifies the advertisement resources among the nodes, and intercepts them.

Example 2:

On the basis of Example 1, the system further comprises a blacklist module, a marking module, a learning module and a feedback module. The marking module is connected with the feature extraction module and the blacklist module respectively, and labels and stores the multi-dimensional feature vectors generated by the feature extraction module according to the existing data in the blacklist module; the learning module is connected with the marking module and the classifier module respectively;

the feedback module is connected with the classifier module and the blacklist module respectively, and is used for further processing the urls of the advertisement resources obtained by the classifier module, generating filtering rules that are not yet present in the blacklist module, and expanding the blacklist module.

At a set time interval, the learning module trains the classifier model on the data newly added to the marking module and updates the classifier module.

The time interval is set in the range of 12 hours to 24 hours.
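The retraining schedule can be expressed as a simple check. This is only a sketch: the function name `retrain_due` and the choice to reject intervals outside the 12 to 24 hour range are assumptions of this example.

```python
from datetime import datetime, timedelta


def retrain_due(last_trained, now, interval_hours=12):
    """True when at least interval_hours have passed since the last training run."""
    if not 12 <= interval_hours <= 24:
        raise ValueError("interval_hours must be between 12 and 24")
    return now - last_trained >= timedelta(hours=interval_hours)
```

A scheduler would call this periodically and, when it returns true, train the classifier model on the newly labeled data and swap the updated model into the classifier module.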

The division into modules, units or components herein is merely a division of logical functions, and other divisions are possible in practice; for example, a plurality of modules and/or units may be combined or integrated into another system. Modules, units, or components described as separate parts may or may not be physically separate. Components shown as units may or may not be physical units, and may be located in one place or distributed over multiple network units. Therefore, some or all of the units can be selected according to actual needs to implement the scheme of the embodiment.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media having computer-usable program code embodied in the medium, including but not limited to disk storage, CD-ROM, optical storage, and the like.

Although only the preferred embodiments of the present invention have been described in detail, the present invention is not limited to the above embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art, and all changes are encompassed in the scope of the present invention.

Claims

1. An advertisement blocking system based on graph and machine learning, characterized by comprising a tracing graph construction module, a feature extraction module and a classifier module, which are connected in sequence;

2. The advertisement blocking system based on graph and machine learning of claim 1, wherein: the system also comprises a blacklist module, a marking module and a learning module, wherein the marking module is respectively connected with the feature extraction module and the blacklist module, and the marking module marks and stores the multidimensional feature vector generated by the feature extraction module according to the existing data in the blacklist module; the learning module is respectively connected with the marking module and the classifier module, and the learning module performs learning training and updates the classifier module according to the data of the marking module.

3. The advertisement blocking system based on graph and machine learning according to claim 2, wherein: the system further comprises a feedback module, wherein the feedback module is respectively connected with the classifier module and the blacklist module, and is used for further processing the urls of the advertisement resources obtained by the classifier module, generating filtering rules that are not yet present in the blacklist module, and expanding the blacklist module.

4. The advertisement blocking system based on graph and machine learning of claim 1, wherein: the tracing graph construction module, the feature extraction module and the classifier module are arranged in the rendering engine of the browser;

5. An advertisement blocking method based on graph and machine learning is characterized in that:

6. The advertisement blocking method based on graph and machine learning of claim 5, further comprising the following steps:

7. The advertisement blocking method based on graph and machine learning according to claim 6, characterized in that:

S5, the feedback module receives the advertisement resources identified by the classifier module, automatically generates filtering rules that conform to the blacklist syntax, and expands the blacklist.

8. The advertisement blocking method based on graph and machine learning according to claim 5, characterized in that: the content features comprise the length of the resource url, whether the url contains advertisement vocabulary, and whether the resource comes from a third party;