CN112231578A - Advertisement blocking system and method based on graph and machine learning - Google Patents

Advertisement blocking system and method based on graph and machine learning

Info

Publication number
CN112231578A
Authority
CN
China
Prior art keywords
module
graph
classifier
advertisement
tracing
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011233201.8A
Other languages
Chinese (zh)
Inventor
潘晓光
王小华
王宇琦
潘晓辉
董虎弟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanxi Sanyouhe Smart Information Technology Co Ltd
Original Assignee
Shanxi Sanyouhe Smart Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2020-11-06
Filing date
2020-11-06
Publication date
2021-01-15

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Abstract

The invention relates to an advertisement blocking system and method based on graph and machine learning. The system includes a tracing graph construction module, a feature extraction module and a classifier module. The tracing graph construction module collects page resource loading information in the browser's page rendering pipeline, builds a tracing graph, and maps each resource in the page to its unique source. A learning module trains on the training data produced by a marking module to obtain a classifier that identifies advertisement resources. The classifier module classifies the network resource nodes of the tracing graph using the features obtained by the feature extraction module, finds the advertisement resources among them, and extracts the URLs corresponding to those resources. By constructing the tracing graph and using machine learning to identify and intercept advertisement resources in the page, the invention speeds up page loading and achieves high interception accuracy; the blacklist can also be expanded automatically from the advertisement resources found in the tracing graph. The invention relates to the technical field of network security.

Description

Advertisement blocking system and method based on graph and machine learning
Technical Field
The invention relates to the technical field of network security, in particular to an advertisement blocking system and method based on graph and machine learning.
Background
The known effective approach is to intercept advertisements on a web page using a blacklist together with a browser extension. The approach has proved effective, but advertisements on web pages change continuously and the blacklist must be updated manually, which greatly increases labor cost. Blacklists also cause interception errors and slow down page loading, so that normal page content may fail to display, which degrades the user's browsing experience.
Therefore, there is a need for improvements in the prior art.
Disclosure of Invention
To overcome the defects of the prior art, the invention provides an advertisement blocking system and method based on graph and machine learning that achieve high interception accuracy and fast page loading and can automatically expand the blacklist.
To solve the above technical problems, the invention adopts the following technical solution:
An advertisement blocking system based on graph and machine learning comprises a tracing graph construction module, a feature extraction module and a classifier module, which are connected in sequence;
the tracing graph construction module is used for collecting page resource loading information in the browser's page rendering pipeline, building a tracing graph, and mapping each resource in the page to its unique source;
the feature extraction module is used for receiving the tracing graph generated by the tracing graph construction module, extracting the content features and structural features of each node in the graph (each node being a page resource), and generating a multi-dimensional feature vector for each node;
the classifier module is used for classifying the multi-dimensional feature vectors of the nodes produced by the feature extraction module, and finding the advertisement resources among the nodes for interception.
The system further comprises a blacklist module, a marking module and a learning module, wherein the marking module is respectively connected with the feature extraction module and the blacklist module, and the marking module marks and stores the multidimensional feature vector generated by the feature extraction module according to the existing data in the blacklist module; the learning module is respectively connected with the marking module and the classifier module, and the learning module performs learning training and updates the classifier module according to the data of the marking module.
The system further comprises a feedback module, wherein the feedback module is connected with the classifier module and the blacklist module respectively, and is used for further processing the URLs of the advertisement resources obtained by the classifier module, generating filtering rules not yet contained in the blacklist module, and expanding the blacklist module.
Furthermore, the tracing graph construction module, the feature extraction module and the classifier module are arranged inside the rendering engine of the browser;
the marking module, the learning module, the feedback module and the blacklist module are deployed offline.
An advertisement blocking method based on graph and machine learning comprises the following steps:
S1, after the browser receives the HTML document of a web page, the rendering engine parses the HTML document into a DOM tree; the tracing graph construction module captures the DOM tree information, monitors JavaScript execution, and maps each page resource to its source;
S2, the feature extraction module receives the tracing graph generated by the tracing graph construction module, and extracts the content features and structural features of each node in the graph, each node being a page resource;
S3, the feature extraction module generates a multi-dimensional feature vector for each node in the tracing graph and inputs it to the classifier module; the classifier module contains a trained classifier model, uses the model to classify the feature vectors of the nodes extracted by the feature extraction module, and finds the advertisement resources among the nodes for interception.
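Step S1 can be pictured as replaying rendering events into a directed graph. The following is a minimal sketch only: the event tuple format, the `TraceGraph` class and the node names are hypothetical and do not come from the patent, which does not specify a data structure.

```python
from dataclasses import dataclass, field


@dataclass
class TraceGraph:
    """Directed graph; an edge (source, target, action) means source acted on target."""
    kinds: dict = field(default_factory=dict)  # node id -> kind label
    edges: list = field(default_factory=list)  # (source id, target id, action)

    def add_node(self, node_id, kind):
        # Keep the kind recorded the first time a node is seen.
        self.kinds.setdefault(node_id, kind)

    def add_edge(self, source, target, action):
        self.edges.append((source, target, action))

    def history_of(self, node_id):
        """Every (actor, action) that touched node_id, in recorded order."""
        return [(s, a) for s, t, a in self.edges if t == node_id]

    def origin_of(self, node_id):
        """The actor that first created node_id, or None if unknown."""
        for s, t, a in self.edges:
            if t == node_id and a == "create":
                return s
        return None


def build_trace_graph(events):
    """Build a graph from (actor, actor_kind, target, target_kind, action) events."""
    graph = TraceGraph()
    for actor, actor_kind, target, target_kind, action in events:
        graph.add_node(actor, actor_kind)
        graph.add_node(target, target_kind)
        graph.add_edge(actor, target, action)
    return graph


# Hypothetical event stream: the parser creates a script and an image element,
# the script later modifies the element, and the element triggers a request.
events = [
    ("html-parser", "parser", "a.js", "script", "create"),
    ("html-parser", "parser", "img#1", "element", "create"),
    ("a.js", "script", "img#1", "element", "modify"),
    ("img#1", "element", "https://ads.example.net/banner.png", "request", "create"),
]
graph = build_trace_graph(events)
print(graph.origin_of("img#1"))
print(graph.history_of("img#1"))
```

Storing the action on each edge keeps the full change history of a node available, while `origin_of` answers the narrower question of which actor a resource should be attributed to.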
Further, the method also comprises the following steps:
S4, the feature vectors extracted by the feature extraction module are used as the input of the marking module, and the marking module labels the feature vectors according to the existing blacklist and stores them;
every 12 h or 24 h, the learning module trains the classifier model on the data newly added by the marking module, and the trained model is used to update the classifier module.
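The labeling in step S4 might look like the sketch below. It assumes, purely for illustration, that the blacklist has been reduced to a set of blocked domains; real filter lists use a richer rule syntax, and the function names here are hypothetical.

```python
from urllib.parse import urlsplit


def host_matches(host, blocked_domain):
    """True when host equals blocked_domain or is a subdomain of it."""
    return host == blocked_domain or host.endswith("." + blocked_domain)


def label_samples(samples, blocked_domains):
    """Turn (url, feature_vector) pairs into (feature_vector, label) pairs.

    Label 1 marks an advertisement resource, label 0 anything else.
    """
    labeled = []
    for url, features in samples:
        host = urlsplit(url).hostname or ""
        is_ad = any(host_matches(host, domain) for domain in blocked_domains)
        labeled.append((features, 1 if is_ad else 0))
    return labeled


samples = [
    ("https://cdn.ads.example.net/x.js", [32, 1, 1]),
    ("https://news.example.org/app.js", [31, 0, 0]),
]
training_data = label_samples(samples, {"ads.example.net"})
```

Matching on a dot boundary avoids labeling an unrelated host that merely ends with the same characters as a blocked domain.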
Further, the method also comprises the following step: S5, the feedback module receives the advertisement resources identified by the classifier module, automatically generates filtering rules that conform to the syntax of the blacklist, and expands the blacklist.
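A sketch of step S5 follows. The patent does not name the blacklist syntax, so this example assumes Adblock Plus style domain rules of the form `||domain^`; the helper names are hypothetical, and a domain-wide rule is only one possible granularity (it may block more than the single resource that was classified).

```python
from urllib.parse import urlsplit


def rule_for(url):
    """Return a domain-anchored blocking rule, or None if the URL has no host."""
    host = urlsplit(url).hostname
    return f"||{host}^" if host else None


def expand_blacklist(existing_rules, ad_urls):
    """Return rules derived from ad_urls that existing_rules lacks, in first-seen order."""
    known = set(existing_rules)
    new_rules = []
    for url in ad_urls:
        rule = rule_for(url)
        if rule and rule not in known:
            known.add(rule)
            new_rules.append(rule)
    return new_rules


blacklist = ["||ads.example.net^"]
found = [
    "https://ads.example.net/a.png",
    "https://track.example.com/p.gif?id=1",
    "https://track.example.com/q.gif",
]
blacklist += expand_blacklist(blacklist, found)
```

Deduplicating against both the existing list and the rules generated in the same pass keeps the blacklist from growing with repeated entries.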
Further, the content features comprise the length of the resource URL, whether the URL contains advertisement-related words, and whether the resource comes from a third party;
the structural features comprise the in-degree and out-degree of the node in the tracing graph, the in-degree and out-degree of its parent node, the number of its child nodes and the number of its sibling nodes.
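The listed features can be computed as in the following sketch. Several choices are assumptions rather than part of the patent: the keyword list, the simplified third-party test (a production system would more likely compare registrable domains), the use of the first recorded parent when a node has several, and the representation of the graph as a list of `(parent, child)` pairs.

```python
import re
from urllib.parse import urlsplit

AD_WORDS = {"ad", "ads", "advert", "banner"}  # hypothetical keyword list


def content_features(url, page_host):
    """Return [url_length, has_ad_word, is_third_party] for one resource URL."""
    host = urlsplit(url).hostname or ""
    # Split on non-alphanumerics so "ad" does not match inside "download".
    tokens = set(re.split(r"[^a-z0-9]+", url.lower()))
    has_ad_word = int(bool(tokens & AD_WORDS))
    same_site = host == page_host or host.endswith("." + page_host)
    return [len(url), has_ad_word, int(not same_site)]


def structural_features(node, edges):
    """Return [in_deg, out_deg, parent_in_deg, parent_out_deg, n_children, n_siblings]."""
    parents = [s for s, t in edges if t == node]
    children = [t for s, t in edges if s == node]
    parent_in = parent_out = n_siblings = 0
    if parents:
        parent = parents[0]  # assumption: use the first recorded parent
        parent_in = sum(1 for _, t in edges if t == parent)
        parent_out = sum(1 for s, _ in edges if s == parent)
        n_siblings = len({t for s, t in edges if s == parent and t != node})
    return [len(parents), len(children), parent_in, parent_out,
            len(set(children)), n_siblings]


def feature_vector(node, edges, page_host):
    """Concatenate content and structural features for one node."""
    return content_features(node, page_host) + structural_features(node, edges)


PAGE = "https://news.example.org/"
SCRIPT = "https://news.example.org/a.js"
LOGO = "https://news.example.org/logo.png"
AD = "https://ads.example.net/banner.png"
EDGES = [(PAGE, SCRIPT), (PAGE, LOGO), (SCRIPT, AD)]

ad_vector = feature_vector(AD, EDGES, "news.example.org")
logo_vector = feature_vector(LOGO, EDGES, "news.example.org")
```

In this toy graph the advertisement and the logo differ in both groups of features: the advertisement URL contains ad-related words and is third party, and it hangs off a script rather than the page itself.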
Compared with the prior art, the invention has the following beneficial effects:
1. By constructing a tracing graph and using machine learning to identify and intercept advertisement resources in the page, the invention speeds up page loading and achieves high interception accuracy.
2. The blacklist can be expanded automatically from the advertisement resources found in the tracing graph, which greatly reduces labor cost.
3. The learning module automatically updates and refines the classifier model according to the data of the marking module, so that the interception capability improves continuously.
Drawings
The following will explain embodiments of the present invention in further detail through the accompanying drawings.
FIG. 1 is a schematic block diagram of an advertisement blocking system of the present invention;
FIG. 2 is a flowchart of an advertisement blocking method according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1:
As shown in FIG. 1 and FIG. 2, in an advertisement blocking system and method based on graph and machine learning, the system comprises a tracing graph construction module, a feature extraction module and a classifier module, which are connected in sequence;
the tracing graph construction module is used for collecting page resource loading information in the browser's page rendering pipeline, building a tracing graph, and mapping each resource in the page to its unique source;
the feature extraction module is used for receiving the tracing graph generated by the tracing graph construction module, extracting the content features and structural features of each node in the graph (each node being a page resource), and generating a multi-dimensional feature vector for each node;
the classifier module is used for classifying the multi-dimensional feature vectors of the nodes produced by the feature extraction module, and finding the advertisement resources among the nodes for interception.
The method comprises the following steps:
S1, after the browser receives the HTML document of a web page, the rendering engine parses the HTML document into a DOM tree; the tracing graph construction module captures the DOM tree information, monitors JavaScript execution, and maps each page resource to its source;
for example, if an <img> tag in the DOM tree is first generated by the parser and then modified by the code in a.js, the tracing graph records the entire change history of that <img> tag. The tracing graph construction module generates one complete tracing graph for a page, in which each node represents a page resource;
S2, the feature extraction module receives the tracing graph generated by the tracing graph construction module, and extracts the content features and structural features of each node in the graph, each node being a page resource;
the content features include the length of the resource URL, whether the URL contains advertisement-related words (e.g., ad, banner), whether the resource comes from a third party, and the like. The structural features include the in-degree and out-degree of the node in the tracing graph, the in-degree and out-degree of its parent node, the number of child nodes, the number of sibling nodes, and the like;
S3, the feature extraction module generates a multi-dimensional feature vector for each node in the tracing graph and inputs it to the classifier module; the classifier module contains a trained classifier model, uses the model to classify the feature vectors of the nodes extracted by the feature extraction module, finds the advertisement resources among the nodes, and intercepts them.
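The classification in step S3 can be sketched with a stand-in model. The patent does not say which classifier is used, so the example below assumes a simple logistic model; the weights, bias, threshold and two-feature vectors are hand-picked for illustration and are not trained values.

```python
import math


def ad_probability(weights, bias, features):
    """Logistic score in (0, 1) for one feature vector."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def find_ad_urls(nodes, weights, bias, threshold=0.5):
    """Return the URLs of nodes whose score reaches the threshold."""
    return [url for url, features in nodes
            if ad_probability(weights, bias, features) >= threshold]


# Hypothetical features per node: [has_ad_word, is_third_party]
nodes = [
    ("https://ads.example.net/banner.png", [1, 1]),
    ("https://cdn.example.com/lib.js", [0, 1]),
    ("https://news.example.org/logo.png", [0, 0]),
]
WEIGHTS = [2.0, 1.5]  # illustrative values, not learned from data
BIAS = -2.5

to_block = find_ad_urls(nodes, WEIGHTS, BIAS)
```

The returned URLs are what the rendering engine would refuse to load, and they are also the input that the feedback module of Example 2 turns into new filtering rules.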
Example 2:
On the basis of Example 1, the system further comprises a blacklist module, a marking module, a learning module and a feedback module, wherein the marking module is connected with the feature extraction module and the blacklist module respectively, and labels and stores the multi-dimensional feature vectors generated by the feature extraction module according to the existing data in the blacklist module; the learning module is connected with the marking module and the classifier module respectively;
the feedback module is connected with the classifier module and the blacklist module respectively, and is used for further processing the URLs of the advertisement resources obtained by the classifier module, generating filtering rules not yet present in the blacklist module, and expanding the blacklist module.
At a set time interval, the learning module trains the classifier model on the data newly added to the marking module, and the trained model is used to update the classifier module.
The time interval is set in the range of 12 hours to 24 hours.
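The retraining schedule can be expressed as a small check, sketched below. The function name and the rule that retraining is skipped when no new labeled samples exist are assumptions made for illustration; the patent states only the 12 to 24 hour range.

```python
from datetime import datetime, timedelta

MIN_INTERVAL = timedelta(hours=12)
MAX_INTERVAL = timedelta(hours=24)


def retraining_due(last_trained, now, interval, new_samples):
    """Return True when the interval has elapsed and new labeled data exists."""
    if not MIN_INTERVAL <= interval <= MAX_INTERVAL:
        raise ValueError("interval must be between 12 and 24 hours")
    # Assumption: with no newly added samples there is nothing to train on.
    return new_samples > 0 and now - last_trained >= interval


last = datetime(2020, 11, 6, 0, 0)
due = retraining_due(last, datetime(2020, 11, 6, 12, 0), timedelta(hours=12), 10)
```

Validating the interval up front keeps a misconfigured deployment from silently retraining more or less often than the stated range.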
The division into modules, units or components herein is merely a division by logical function; other divisions are possible in practice, for example, a plurality of modules and/or units may be combined or integrated into another system. Modules, units, or components described as separate parts may or may not be physically separate. Components shown as units may or may not be physical units, and may be located in one place or distributed over network units. Therefore, some or all of the units can be selected according to actual needs to implement the solution of the embodiment.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media having computer-usable program code embodied in the medium, including but not limited to disk storage, CD-ROM, optical storage, and the like.
Although only the preferred embodiments of the present invention have been described in detail, the present invention is not limited to the above embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art, and all changes are encompassed in the scope of the present invention.

Claims (8)

1. An advertisement blocking system based on graph and machine learning, characterized in that: the system comprises a tracing graph construction module, a feature extraction module and a classifier module, which are connected in sequence;
the tracing graph construction module is used for collecting page resource loading information in the browser's page rendering pipeline, building a tracing graph, and mapping each resource in the page to its unique source;
the feature extraction module is used for receiving the tracing graph generated by the tracing graph construction module, extracting the content features and structural features of each node in the graph (each node being a page resource), and generating a multi-dimensional feature vector for each node;
the classifier module is used for classifying the multi-dimensional feature vectors of the nodes produced by the feature extraction module, and finding the advertisement resources among the nodes for interception.
2. The advertisement blocking system based on graph and machine learning of claim 1, wherein: the system also comprises a blacklist module, a marking module and a learning module, wherein the marking module is respectively connected with the feature extraction module and the blacklist module, and the marking module marks and stores the multidimensional feature vector generated by the feature extraction module according to the existing data in the blacklist module; the learning module is respectively connected with the marking module and the classifier module, and the learning module performs learning training and updates the classifier module according to the data of the marking module.
3. The advertisement blocking system based on graph and machine learning according to claim 2, wherein: the system further comprises a feedback module, wherein the feedback module is connected with the classifier module and the blacklist module respectively, and is used for further processing the URLs of the advertisement resources obtained by the classifier module, generating filtering rules not yet contained in the blacklist module, and expanding the blacklist module.
4. The advertisement blocking system based on graph and machine learning of claim 1, wherein: the tracing graph construction module, the feature extraction module and the classifier module are arranged in the rendering engine of the browser;
the marking module, the learning module, the feedback module and the blacklist module are deployed offline.
5. An advertisement blocking method based on graph and machine learning, characterized by comprising the following steps:
S1, after the browser receives the HTML document of a web page, the rendering engine parses the HTML document into a DOM tree; the tracing graph construction module captures the DOM tree information, monitors JavaScript execution, and maps each page resource to its source;
S2, the feature extraction module receives the tracing graph generated by the tracing graph construction module, and extracts the content features and structural features of each node in the graph, each node being a page resource;
S3, the feature extraction module generates a multi-dimensional feature vector for each node in the tracing graph and inputs it to the classifier module; the classifier module contains a trained classifier model, uses the model to classify the feature vectors of the nodes extracted by the feature extraction module, and finds the advertisement resources among the nodes for interception.
6. The advertisement blocking method based on graph and machine learning of claim 5, further comprising the following steps:
S4, the feature vectors extracted by the feature extraction module are used as the input of the marking module, and the marking module labels the feature vectors according to the existing blacklist and stores them;
every 12 h or 24 h, the learning module trains the classifier model on the data newly added by the marking module, and the trained model is used to update the classifier module.
7. The advertisement blocking method based on graph and machine learning according to claim 6, further comprising the following step:
S5, the feedback module receives the advertisement resources identified by the classifier module, automatically generates filtering rules that conform to the syntax of the blacklist, and expands the blacklist.
8. The advertisement blocking method based on graph and machine learning according to claim 5, characterized in that: the content features comprise the length of the resource URL, whether the URL contains advertisement-related words, and whether the resource comes from a third party;
the structural features comprise the in-degree and out-degree of the node in the tracing graph, the in-degree and out-degree of its parent node, the number of its child nodes and the number of its sibling nodes.
CN202011233201.8A 2020-11-06 2020-11-06 Advertisement blocking system and method based on graph and machine learning Pending CN112231578A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011233201.8A CN112231578A (en) 2020-11-06 2020-11-06 Advertisement blocking system and method based on graph and machine learning

Publications (1)

Publication Number Publication Date
CN112231578A true CN112231578A (en) 2021-01-15

Family

ID=74122436

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011233201.8A Pending CN112231578A (en) 2020-11-06 2020-11-06 Advertisement blocking system and method based on graph and machine learning

Country Status (1)

Country Link
CN (1) CN112231578A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113343233A (en) * 2021-05-08 2021-09-03 山西三友和智慧信息技术股份有限公司 Interval security monitoring system and monitoring method based on big data

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105279215A (en) * 2014-06-10 2016-01-27 中兴通讯股份有限公司 Resource downloading method and apparatus
CN105956026A (en) * 2016-04-22 2016-09-21 北京小米移动软件有限公司 Webpage rendering method and apparatus
CN108733764A (en) * 2018-04-16 2018-11-02 优视科技有限公司 Advertisement filter rule generating method based on machine learning and advertisement filtering system
CN109948080A (en) * 2019-03-18 2019-06-28 西安电子科技大学 A kind of counteradvertising based on machine learning intercepts the application method of detection system
CN110866162A (en) * 2019-10-10 2020-03-06 西安交通大学 Causal relationship mining method based on conjugate behaviors in MOOC data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210115