CN112231578A - Advertisement blocking system and method based on graph and machine learning - Google Patents
Advertisement blocking system and method based on graph and machine learning
- Publication number
- CN112231578A (application CN202011233201.8A)
- Authority
- CN
- China
- Prior art keywords
- module
- graph
- classifier
- advertisement
- tracing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/9566—URL specific, e.g. using aliases, detecting broken or misspelled links
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Abstract
The invention relates to an advertisement blocking system and method based on graph and machine learning. The system includes a tracing graph construction module, a feature extraction module and a classifier module. The tracing graph construction module collects page resource loading information in the browser's page rendering pipeline, builds a tracing graph, and maps each resource in the page to its unique source. A learning module trains on the training data obtained by a marking module to produce a classifier that identifies advertisement resources. The classifier module classifies the network resource nodes of the tracing graph using the output of the feature extraction module, finds the advertisement resources among them, and extracts the URLs corresponding to those resources. By constructing the tracing graph and using machine learning to identify and intercept advertisement resources in the page, the invention speeds up page loading and achieves high interception accuracy; the advertisement resources found in the tracing graph can also be used to expand the blacklist automatically. The invention belongs to the technical field of network security.
Description
Technical Field
The invention relates to the technical field of network security, in particular to an advertisement blocking system and method based on graph and machine learning.
Background
The interception method currently known to be effective uses a blacklist together with a browser extension to block advertisements that appear on a web page. The method works, but advertisements on web pages are updated continuously while the blacklist has to be updated manually, which greatly increases labor cost. The blacklist can also block content by mistake and slow down page loading, so that normal page content fails to display and the user's browsing experience suffers.
Therefore, there is a need for improvements in the prior art.
Disclosure of Invention
In order to overcome the defects in the prior art, the invention provides an advertisement blocking method and system based on graph and machine learning that offer high interception accuracy and fast page loading and can automatically expand the blacklist.
In order to solve the technical problems, the invention adopts the technical scheme that:
An advertisement blocking system based on graph and machine learning comprises a tracing graph construction module, a feature extraction module and a classifier module, which are connected in sequence;
the tracing graph construction module is used for collecting page resource loading information in the browser's page rendering pipeline, building a tracing graph, and mapping each resource in the page to its unique source;
the feature extraction module is used for receiving the tracing graph generated by the tracing graph construction module, extracting the content features and structural features of each node in the graph, namely each page resource, and generating a multi-dimensional feature vector for each node;
the classifier module is used for classifying the multi-dimensional feature vectors of the nodes produced by the feature extraction module, and finding the advertisement resources among them for interception.
The system further comprises a blacklist module, a marking module and a learning module, wherein the marking module is respectively connected with the feature extraction module and the blacklist module, and the marking module marks and stores the multidimensional feature vector generated by the feature extraction module according to the existing data in the blacklist module; the learning module is respectively connected with the marking module and the classifier module, and the learning module performs learning training and updates the classifier module according to the data of the marking module.
The system further comprises a feedback module, wherein the feedback module is respectively connected with the classifier module and the blacklist module, and the feedback module is used for further processing the url of the advertisement resource obtained in the classifier module, generating a filtering rule which is not included in the blacklist module, and expanding the blacklist module.
Furthermore, the tracing graph construction module, the feature extraction module and the classifier module are arranged inside the rendering engine of the browser;
the marking module, the learning module, the feedback module and the blacklist module are deployed offline.
An advertisement blocking method based on graph and machine learning comprises the following steps:
S1, after the browser receives the HTML document of a web page, the rendering engine parses the HTML document into a DOM tree; the tracing graph construction module captures the DOM tree information, monitors JavaScript execution, and maps each page resource to its source;
S2, the feature extraction module receives the tracing graph generated by the tracing graph construction module, and extracts the content features and structural features of each node in the graph, namely each page resource;
S3, the feature extraction module generates a multi-dimensional feature vector for each node in the tracing graph and inputs it to the classifier module; the classifier module contains a trained classifier model, uses it to classify the feature vectors of the nodes extracted by the feature extraction module, and finds the advertisement resources among them for interception.
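The patent does not name a specific classifier model. As a minimal sketch of step S3, assuming a nearest-centroid classifier stands in for the trained model, the classification and interception decision might look like the following. The function names, the labels "ad" and "content", and the feature values are all hypothetical and chosen only for illustration.

```python
def train_centroids(vectors, labels):
    """Average the feature vectors of each label (stand-in for the trained model)."""
    sums, counts = {}, {}
    for vec, label in zip(vectors, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [x / counts[label] for x in acc] for label, acc in sums.items()}


def classify(vec, centroids):
    """Return the label whose centroid is closest to vec (squared Euclidean distance)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: dist2(vec, centroids[label]))


# Made-up training data: [url length, has ad word, is third party].
train_vectors = [[34, 1, 1], [40, 1, 1], [20, 0, 0], [25, 0, 0]]
train_labels = ["ad", "ad", "content", "content"]
centroids = train_centroids(train_vectors, train_labels)

# Nodes of a hypothetical tracing graph, keyed by resource URL.
nodes = {
    "https://ads.example.net/banner.gif": [38, 1, 1],
    "https://news.example.com/logo.png": [21, 0, 0],
}
# Resources classified as advertisements are the ones to intercept.
blocked = [url for url, vec in nodes.items() if classify(vec, centroids) == "ad"]
```

Any classifier that maps a feature vector to an advertisement or non-advertisement label could replace the centroid model here; the surrounding flow stays the same.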
Further, the method also comprises the following steps:
S4, the feature vectors extracted by the feature extraction module are used as the input of the marking module, and the marking module labels the feature vectors using the existing blacklist and stores them;
every 12 h or 24 h, the learning module trains the classifier model on the data newly added by the marking module, in order to update the classifier module.
Further, the method also comprises the following step: S5, the feedback module receives the advertisement resources identified by the classifier module, automatically generates filtering rules that conform to the syntax of the blacklist, and expands the blacklist.
Further, the content features comprise the length of the resource URL, whether the URL contains advertisement-related words, and whether the resource comes from a third party;
the structural features comprise the in-degree and out-degree of the node in the tracing graph, the in-degree and out-degree of its parent node, the number of its child nodes and the number of its sibling nodes.
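The features listed above can be sketched as follows. This is an illustrative sketch only: the edge-list representation of the tracing graph, the keyword list, the exact-host test for "third party", and the choice of the first predecessor as "the parent" are all assumptions, since the patent does not specify them.

```python
from urllib.parse import urlparse

# Hypothetical keyword list; the patent gives only "ad" and "banner" as examples.
AD_KEYWORDS = ("ad", "banner")


def content_features(url, page_host):
    """Return [url length, contains ad word, is third party] for one resource URL."""
    host = urlparse(url).hostname or ""
    # Tokenize on non-alphanumeric characters so "ad" does not match inside "download".
    tokens = "".join(c if c.isalnum() else " " for c in url.lower()).split()
    has_ad_word = any(t in AD_KEYWORDS for t in tokens)
    # Simplified third-party test: the resource host differs from the page host.
    third_party = host != page_host
    return [len(url), int(has_ad_word), int(third_party)]


def structural_features(node, edges):
    """Return [in-degree, out-degree, parent in-degree, parent out-degree,
    number of children, number of siblings] for a node.

    edges is a list of (source, target) pairs; the first predecessor is
    treated as the parent (an assumption for nodes with several predecessors).
    """
    preds = [s for s, d in edges if d == node]
    succs = [d for s, d in edges if s == node]
    parent = preds[0] if preds else None
    if parent is None:
        parent_in = parent_out = 0
        siblings = set()
    else:
        parent_in = sum(1 for s, d in edges if d == parent)
        parent_out = sum(1 for s, d in edges if s == parent)
        siblings = {d for s, d in edges if s == parent and d != node}
    return [len(preds), len(succs), parent_in, parent_out, len(set(succs)), len(siblings)]


# Hypothetical page: the document loads a.js and logo.png; a.js injects a banner.
banner = "https://ads.example.net/banner.gif"
edges = [("page.html", "a.js"), ("page.html", "logo.png"), ("a.js", banner)]
vector = content_features(banner, "news.example.com") + structural_features(banner, edges)
```

Concatenating the two lists gives one multi-dimensional feature vector per node, which is the form the classifier module consumes.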
Compared with the prior art, the invention has the following beneficial effects:
1. By constructing a tracing graph and using machine learning to identify and intercept advertisement resources in the page, the invention speeds up page loading and achieves high interception accuracy.
2. The blacklist can be expanded automatically from the advertisement resources found in the tracing graph, which greatly reduces labor cost.
3. The learning module automatically updates and refines the classifier model from the data of the marking module, so that the interception capability improves continuously.
Drawings
The following will explain embodiments of the present invention in further detail through the accompanying drawings.
FIG. 1 is a schematic block diagram of an advertisement blocking system of the present invention;
FIG. 2 is a flowchart of an advertisement blocking method according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1:
As shown in FIGS. 1 and 2, an advertisement blocking system and method based on graph and machine learning includes a tracing graph construction module, a feature extraction module and a classifier module, which are connected in sequence;
the tracing graph construction module is used for collecting page resource loading information in the browser's page rendering pipeline, building a tracing graph, and mapping each resource in the page to its unique source;
the feature extraction module is used for receiving the tracing graph generated by the tracing graph construction module, extracting the content features and structural features of each node in the graph, namely each page resource, and generating a multi-dimensional feature vector for each node;
the classifier module is used for classifying the multi-dimensional feature vectors of the nodes produced by the feature extraction module, and finding the advertisement resources among them for interception.
The method comprises the following steps:
S1, after the browser receives the HTML document of a web page, the rendering engine parses the HTML document into a DOM tree; the tracing graph construction module captures the DOM tree information, monitors JavaScript execution, and maps each page resource to its source;
for example, an <img> tag in the DOM tree may first be generated by the parser and then modified by the code of a.js, and the tracing graph records the entire change history of that <img> tag. The tracing graph construction module generates one complete tracing graph for a page, in which each node represents a page resource;
S2, the feature extraction module receives the tracing graph generated by the tracing graph construction module, and extracts the content features and structural features of each node in the graph, namely each page resource;
the content features include the length of the resource URL, whether the URL contains advertisement-related words (e.g., ad, banner), whether the resource comes from a third party, and the like. The structural features include the in-degree and out-degree of the node in the tracing graph, the in-degree and out-degree of its parent node, the number of its child nodes, the number of its sibling nodes, and the like;
S3, the feature extraction module generates a multi-dimensional feature vector for each node in the tracing graph and inputs it to the classifier module; the classifier module contains a trained classifier model, uses it to classify the feature vectors of the nodes extracted by the feature extraction module, finds the advertisement resources among them, and intercepts them.
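The <img> example in step S1 can be illustrated with a small sketch of a tracing graph that records which actor created or modified each page resource. The class name, the method names, the action strings and the node identifiers are hypothetical; the patent does not describe the data structure, and a real implementation would hook into the rendering engine rather than receive events by hand.

```python
class TracingGraph:
    """Minimal provenance record: edges of the form (actor, target, action)."""

    def __init__(self):
        self.edges = []   # every recorded event, in order
        self.origin = {}  # target -> the actor that first created it

    def record(self, actor, target, action):
        """Record that actor performed action ("create" or "modify") on target."""
        self.edges.append((actor, target, action))
        if action == "create":
            # Keep only the first creator so each resource has a unique source.
            self.origin.setdefault(target, actor)

    def history(self, target):
        """Return the ordered list of (actor, action) events that touched target."""
        return [(actor, action) for actor, t, action in self.edges if t == target]


# The scenario from the text: the parser creates the <img>, then a.js modifies it.
graph = TracingGraph()
graph.record("html_parser", "a.js", "create")
graph.record("html_parser", "img#hero", "create")
graph.record("a.js", "img#hero", "modify")
```

Here `graph.origin["img#hero"]` identifies the parser as the unique source of the tag, while `graph.history("img#hero")` keeps the later modification by a.js.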
Example 2:
On the basis of Example 1, the system further comprises a blacklist module, a marking module, a learning module and a feedback module. The marking module is connected with the feature extraction module and the blacklist module respectively, and labels and stores the multi-dimensional feature vectors generated by the feature extraction module according to the existing data in the blacklist module; the learning module is connected with the marking module and the classifier module respectively;
the feedback module is connected with the classifier module and the blacklist module respectively, and is used for further processing the URLs of the advertisement resources obtained by the classifier module, generating filtering rules not yet present in the blacklist module, and expanding the blacklist module.
At a set time interval, the learning module trains the classifier model on the data newly added to the marking module, in order to update the classifier module.
The time interval is set in the range of 12 hours to 24 hours.
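The offline loop of Example 2 can be sketched as three small helpers for labeling, retraining and blacklist expansion. This is a sketch under stated assumptions: the blacklist is assumed to hold host names for labeling, and the generated rules are assumed to use the Adblock Plus style `||host^` syntax, whereas the patent only says the rules follow the grammar of the blacklist. All function names are hypothetical.

```python
from urllib.parse import urlparse


def label_with_blacklist(samples, blacklist_hosts):
    """Marking module sketch: label each (url, vector) pair 1 if its host is blacklisted."""
    labeled = []
    for url, vec in samples:
        host = urlparse(url).hostname or ""
        labeled.append((vec, int(host in blacklist_hosts)))
    return labeled


def retraining_due(last_trained_hour, now_hour, interval_hours=12):
    """Learning module sketch: retrain once the set interval (12 to 24 h) has elapsed."""
    return now_hour - last_trained_hour >= interval_hours


def expand_blacklist(ad_urls, existing_rules):
    """Feedback module sketch: build host-level rules not yet present in the blacklist."""
    new_rules = []
    for url in ad_urls:
        host = urlparse(url).hostname
        if not host:
            continue
        rule = f"||{host}^"  # assumed Adblock Plus style rule syntax
        if rule not in existing_rules and rule not in new_rules:
            new_rules.append(rule)
    return new_rules
```

For example, passing the URLs that the classifier flagged as advertisements to `expand_blacklist` together with the current rule set returns only the rules that are new, which can then be appended to the blacklist.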
The division into modules, units or components herein is merely a division by logical function, and other divisions are possible in practice; for example, several modules and/or units may be combined or integrated into another system. Modules, units or components described as separate parts may or may not be physically separate. Components shown as units may or may not be physical units, and may be located in one place or distributed over network units. Therefore, some or all of the units can be selected according to actual needs to implement the scheme of the embodiment.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media having computer-usable program code embodied in the medium, including but not limited to disk storage, CD-ROM, optical storage, and the like.
Although only the preferred embodiments of the present invention have been described in detail, the present invention is not limited to the above embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art, and all changes are encompassed in the scope of the present invention.
Claims (8)
1. An advertisement blocking system based on graph and machine learning, characterized in that it comprises a tracing graph construction module, a feature extraction module and a classifier module, which are connected in sequence;
the tracing graph construction module is used for collecting page resource loading information in the browser's page rendering pipeline, building a tracing graph, and mapping each resource in the page to its unique source;
the feature extraction module is used for receiving the tracing graph generated by the tracing graph construction module, extracting the content features and structural features of each node in the graph, namely each page resource, and generating a multi-dimensional feature vector for each node;
the classifier module is used for classifying the multi-dimensional feature vectors of the nodes produced by the feature extraction module, and finding the advertisement resources among them for interception.
2. The advertisement blocking system based on graph and machine learning of claim 1, wherein: the system also comprises a blacklist module, a marking module and a learning module, wherein the marking module is respectively connected with the feature extraction module and the blacklist module, and the marking module marks and stores the multidimensional feature vector generated by the feature extraction module according to the existing data in the blacklist module; the learning module is respectively connected with the marking module and the classifier module, and the learning module performs learning training and updates the classifier module according to the data of the marking module.
3. The advertisement blocking system based on graph and machine learning according to claim 2, wherein: the system further comprises a feedback module, wherein the feedback module is respectively connected with the classifier module and the blacklist module, and is used for further processing the URLs of the advertisement resources obtained by the classifier module, generating filtering rules not yet contained in the blacklist module, and expanding the blacklist module.
4. The advertisement blocking system based on graph and machine learning of claim 1, wherein: the tracing graph construction module, the feature extraction module and the classifier module are arranged inside the rendering engine of the browser;
the marking module, the learning module, the feedback module and the blacklist module are deployed offline.
5. An advertisement blocking method based on graph and machine learning, characterized by comprising the following steps:
S1, after the browser receives the HTML document of a web page, the rendering engine parses the HTML document into a DOM tree; the tracing graph construction module captures the DOM tree information, monitors JavaScript execution, and maps each page resource to its source;
S2, the feature extraction module receives the tracing graph generated by the tracing graph construction module, and extracts the content features and structural features of each node in the graph, namely each page resource;
S3, the feature extraction module generates a multi-dimensional feature vector for each node in the tracing graph and inputs it to the classifier module; the classifier module contains a trained classifier model, uses it to classify the feature vectors of the nodes extracted by the feature extraction module, and finds the advertisement resources among them for interception.
6. The advertisement blocking method based on graph and machine learning of claim 5, further comprising the following steps:
S4, the feature vectors extracted by the feature extraction module are used as the input of the marking module, and the marking module labels the feature vectors using the existing blacklist and stores them;
every 12 h or 24 h, the learning module trains the classifier model on the data newly added by the marking module, in order to update the classifier module.
7. The advertisement blocking method based on graph and machine learning according to claim 6, characterized in that:
S5, the feedback module receives the advertisement resources identified by the classifier module, automatically generates filtering rules that conform to the syntax of the blacklist, and expands the blacklist.
8. The advertisement blocking method based on graph and machine learning according to claim 5, characterized in that: the content features comprise the length of the resource URL, whether the URL contains advertisement-related words, and whether the resource comes from a third party;
the structural features comprise the in-degree and out-degree of the node in the tracing graph, the in-degree and out-degree of its parent node, the number of its child nodes and the number of its sibling nodes.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011233201.8A CN112231578A (en) | 2020-11-06 | 2020-11-06 | Advertisement blocking system and method based on graph and machine learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112231578A true CN112231578A (en) | 2021-01-15 |
Family
ID=74122436
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011233201.8A Pending CN112231578A (en) | 2020-11-06 | 2020-11-06 | Advertisement blocking system and method based on graph and machine learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112231578A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113343233A (en) * | 2021-05-08 | 2021-09-03 | 山西三友和智慧信息技术股份有限公司 | Interval security monitoring system and monitoring method based on big data |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105279215A (en) * | 2014-06-10 | 2016-01-27 | 中兴通讯股份有限公司 | Resource downloading method and apparatus |
CN105956026A (en) * | 2016-04-22 | 2016-09-21 | 北京小米移动软件有限公司 | Webpage rendering method and apparatus |
CN108733764A (en) * | 2018-04-16 | 2018-11-02 | 优视科技有限公司 | Advertisement filter rule generating method based on machine learning and advertisement filtering system |
CN109948080A (en) * | 2019-03-18 | 2019-06-28 | 西安电子科技大学 | A kind of counteradvertising based on machine learning intercepts the application method of detection system |
CN110866162A (en) * | 2019-10-10 | 2020-03-06 | 西安交通大学 | Causal relationship mining method based on conjugate behaviors in MOOC data |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9223815B2 (en) | Method, apparatus, and program for supporting creation and management of metadata for correcting problem in dynamic web application | |
CN101251855B (en) | Equipment, system and method for cleaning internet web page | |
CN107943838B (en) | Method and system for automatically acquiring xpath generated crawler script | |
CN107145556B (en) | Universal distributed acquisition system | |
WO2015074503A1 (en) | Statistical method and apparatus for webpage access data | |
BR112014028739B1 (en) | SYSTEM AND METHOD TO CREATE STRUCTURED EVENT OBJECTS | |
CN104504150A (en) | News public opinion monitoring system | |
CN107729475A (en) | Web page element acquisition method, device, terminal and computer-readable recording medium | |
CN108664635B (en) | Method, device, equipment and storage medium for acquiring database statistical information | |
CN101976241B (en) | Method and system for generating identification code | |
CN106294535A (en) | The recognition methods of website and device | |
CN104615765A (en) | Data processing method and data processing device for browsing internet records of mobile subscribers | |
CN103164423A (en) | Method and device for confirming browser inner core type rendering web pages | |
CN105528416A (en) | Method and system for monitoring update contents of website | |
CN112818200A (en) | Data crawling and event analyzing method and system based on static website | |
CN106227770A (en) | A kind of intelligentized news web page information extraction method | |
CN112231578A (en) | Advertisement blocking system and method based on graph and machine learning | |
Yu et al. | Brain: Log parsing with bidirectional parallel tree | |
CN111190873A (en) | Log mode extraction method and system for log training of cloud native system | |
CN103870495A (en) | Method and device for extracting information from website | |
CN114398315A (en) | Data storage method, system, storage medium and electronic equipment | |
CN104462095A (en) | Extraction method and device of common pars of query statements | |
CN109213833A (en) | Two disaggregated model training methods, data classification method and corresponding intrument | |
CN105119910A (en) | Template-based online social network rubbish information real-time detecting method | |
CN104636324B (en) | Topic source tracing method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20210115 |