CN111177614A - Source tracking method and device for third-party content injected into a webpage

Info

Publication number
CN111177614A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911155077.5A
Other languages
Chinese (zh)
Inventor
林军
麦松涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Electronic Product Reliability and Environmental Testing Research Institute
Original Assignee
China Electronic Product Reliability and Environmental Testing Research Institute

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/972 Access to data in other repository systems, e.g. legacy data or dynamic Web page generation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation
    • G06F16/9574 Browsing optimisation, e.g. caching or content distillation of access to content, e.g. by caching
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/10 Protecting distributed programs or content, e.g. vending or licensing of copyrighted material; Digital rights management [DRM]
    • G06F21/16 Program or content traceability, e.g. by watermarking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Technology Law (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The application relates to a method and device for tracking the source of third-party content injected into a webpage. The method comprises the following steps: parsing the DOM elements loaded by the webpage and acquiring, for the DOM elements, a tag set that includes all publisher addresses; judging from the tag set whether a third-party publisher address exists; and, if a third-party publisher address exists, highlighting on the webpage the content published by the third-party publisher or its root tag. The content published by the third-party publisher is a DOM element added or modified in the webpage, and the root tag is the tag of the root element of the added or modified DOM element. With this method, content injected by a third party can be identified and highlighted.

Description

Source tracking method and device for third-party content injected into a webpage
Technical Field
The application relates to the technical field of computer web development, and in particular to a method and device for tracking the source of third-party content injected into a webpage.
Background
Browser extensions add practical functions to a browser. Their code is usually written by third parties, and they can accomplish tasks ranging from simply modifying the appearance of a web page to complex ones such as fine-grained content filtering. Extension code therefore gives third parties additional privileges, and the browser vendor does not necessarily maintain or support third-party extensions. These privileges include, for example, accessing cross-origin content and issuing network requests that are not constrained by the same-origin policy. Because extensions carry such privileges, they also give third parties the opportunity to attack users and their data, the underlying system, and even the Internet as a whole.
As web advertising has become more popular, parties able to modify web content, such as ISPs (Internet service providers) and browser extension authors, have found that they can earn revenue by injecting or replacing advertisements in web pages. For example, some ISPs have begun tampering with HTTP traffic in transit, injecting DOM elements into HTML documents so that the ISP's advertisements are added to the pages a client visits. Similarly, browser extensions modify pages by injecting DOM elements so that advertisements are shown to users without their prior consent. Ad injection has become a common form of unauthorized third-party content injection on today's web.
These actions affect both web page publishers and the users who browse their pages. First, ad injection diverts the publisher's revenue to the third party responsible for the injection, which may significantly reduce the publisher's net revenue if advertising is its primary source of income. Second, if an injected ad contains or references undesirable subject matter, such as adult or political content, it may damage the publisher's reputation in the eyes of users. Third, if the injected content is malicious, the publisher's reputation may be damaged further, and users are exposed to security risks from malware, phishing, and other threats. Previous research has also shown that users who are frequently exposed to ads injected into web pages are more vulnerable to malicious ads and traditional malware.
To prevent third parties from abusing extension functions, the extension security frameworks of some newer browsers combine a fine-grained permission system with isolation and least-privilege separation to restrict what extensions may do. However, such frameworks are not universal. Moreover, because extensions are often over-privileged and the threats that arise from granting them permissions are poorly understood, a security framework alone cannot effectively prevent third-party attacks on users. Even where an extension security framework exists, recent research has shown that extension-based advertisement injection (ad injection) has become a popular and profitable technique that lets fraudsters earn revenue from the advertisements users view. When a user visits a website, a third party can use a browser extension to inject advertisements, or to replace the original advertiser's advertisements, and thereby divert the page's advertising revenue to itself. Users commonly fail to notice that an advertisement has been injected or replaced while they browse, and even when they do notice, it is difficult to determine who injected or replaced it.
Because of the problem of malicious third-party content injection, some browser vendors have begun removing ad-injection extensions. Although the extension security framework, with its isolation, fine-grained permission system, and least-privilege separation, has been able to identify ad-injection extensions, fraudulent extensions can hide their ad injection behavior during the short analysis window and thus evade this kind of semi-automated centralized detection; such detection may also simply miss some ad-injection extensions. In addition, some browser extensions are not distributed through an online store but are obtained locally by the user, and these are likely to escape semi-automated centralized detection altogether.
Although ad injection is not necessarily classified as an outright malicious activity, third-party ad injection can have a significant impact on the security and privacy of users and website publishers. For example, recent research has shown that extensions not only inject advertisements obtained from the web but may also insert malicious components that trick users into installing malware. Meanwhile, injected advertisements divert revenue away from website publishers, and users may be exposed to malware without being aware of the danger, because they generally assume that the modifications to the page were made by the website publisher. In addition, whether an advertisement injected through a browser extension is malicious is generally left to the user's judgment; since users' needs differ, this is difficult to decide in general, and so the extension function cannot simply be removed.
Researchers have proposed several methods for automatically detecting malicious extension behavior such as ad injection, for example by examining extensions at centralized distribution points, such as the Chrome Web Store and Mozilla Add-ons, to detect misuse of browser extension functions. Because this approach involves a certain lag, malicious behavior cannot be analyzed before a third party has already injected advertisements through an extension. Researchers have also proposed a client-side detection method that reports any potential ad injection that deviates from the legitimate DOM (Document Object Model) structure, but this method requires prior knowledge of the legitimate DOM structure and cooperation between the user and the website publisher.
In summary, existing browser extension functions meet users' needs to a certain extent but are easily abused by third parties, so that users may encounter maliciously inserted content when browsing web pages, while website publishers may lose users or revenue because of the content inserted by third parties.
Disclosure of Invention
Accordingly, it is desirable to provide a method and a device for tracking the source of third-party content injected into a webpage, which can highlight the injected content.
A source tracking method for third-party content injected into a webpage, the method comprising:
analyzing DOM elements loaded by the webpage, and acquiring a tag set of the DOM elements including all publisher addresses;
judging whether a third party publisher address exists according to the tag set;
if the address of the publisher of the third party exists, highlighting the content or the root tag published by the publisher of the third party on the webpage; the content published by the publisher of the third party is a DOM element added or modified in the webpage, and the root tag is a tag of the root element of the added or modified DOM element.
In one embodiment, parsing the DOM elements loaded by the webpage and acquiring the tag set including all publisher addresses comprises: parsing the annotations of the DOM elements loaded by the webpage; and acquiring, according to the annotations, the tag set of the DOM elements that includes all publisher addresses, wherein the tag set includes the original webpage publisher address.
In one embodiment, parsing the annotations of the DOM elements loaded by the webpage comprises: acquiring the annotations of the DOM elements loaded by the webpage; and executing the annotations of the DOM elements incrementally.
In one embodiment, judging whether a third-party publisher address exists according to the tag set comprises: judging, according to the tag set, whether a third-party publisher address other than the original webpage publisher address exists.
In one embodiment, the tag set further includes extension tags, which are used in tag sets whose source is an extension; judging whether a third-party publisher address exists according to the tag set comprises: judging whether a third-party publisher address exists according to whether the tag set includes an extension tag.
In one embodiment, highlighting on the webpage the content published by the third-party publisher or its root tag, if a third-party publisher address exists, comprises: if a third-party publisher address exists, acquiring the DOM element corresponding to the third-party publisher address; and setting a visual-class tag on the DOM element, the visual-class tag being used to highlight the display characteristics of the DOM element on the page; or setting the root tag of the DOM element to be displayed visually when the cursor hovers over the position of the DOM element on the webpage.
In one embodiment, after the step of setting a visual-class tag on the DOM element, the method further comprises: setting a delayed-display tag on the DOM element, the delayed-display tag being used to highlight the display characteristics of the DOM element on the page once the DOM element is entirely within the visible area of the webpage.
A source tracking device for third-party content injected into a webpage, the device comprising:
a tag set acquisition module, configured to parse the DOM elements loaded by the webpage and acquire a tag set of the DOM elements including all publisher addresses;
a judging module, configured to judge whether a third-party publisher address exists according to the tag set;
a highlighting module, configured to highlight on the webpage, if a third-party publisher address exists, the content published by the third-party publisher or its root tag; the content published by the third-party publisher is a DOM element added or modified in the webpage, and the root tag is the tag of the root element of the added or modified DOM element.
In one embodiment, the tag set acquisition module 210 includes: a parsing unit, configured to parse the annotations of the DOM elements loaded by the webpage; and a tag set acquisition unit, configured to acquire, according to the annotations of the DOM elements, a tag set of the DOM elements including all publisher addresses, wherein the tag set includes the original webpage publisher address.
In one embodiment, the parsing unit includes: an annotation acquisition subunit, configured to acquire the annotations of the DOM elements loaded by the webpage; and an execution subunit, configured to execute the annotations of the DOM elements incrementally.
In one embodiment, the judging module 220 is further configured to judge, according to the tag set, whether a third-party publisher address other than the original webpage publisher address exists.
In one embodiment, the tag set further includes extension tags, which are used in tag sets whose source is an extension; the judging module 220 is further configured to judge whether a third-party publisher address exists according to whether the tag set includes an extension tag.
In one embodiment, the highlighting module 230 includes: a DOM element acquisition unit, configured to acquire, if a third-party publisher address exists, the DOM element corresponding to the third-party publisher address; and a visual tag setting unit, configured to set a visual-class tag on the DOM element, the visual-class tag being used to highlight the display characteristics of the DOM element on the page; or a root tag setting unit, configured to set the root tag of the DOM element to be displayed visually when the cursor hovers over the position of the DOM element on the webpage.
In one embodiment, the highlighting module 230 further includes: a delayed-display tag setting unit, configured to set a delayed-display tag on the DOM element, the delayed-display tag being used to highlight the display characteristics of the DOM element on the page once the DOM element is entirely within the visible area of the webpage.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
analyzing DOM elements loaded by the webpage, and acquiring a tag set of the DOM elements including all publisher addresses;
judging whether a third party publisher address exists according to the tag set;
if the address of the publisher of the third party exists, highlighting the content or the root tag published by the publisher of the third party on the webpage; the content published by the publisher of the third party is a DOM element added or modified in the webpage, and the root tag is a tag of the root element of the added or modified DOM element.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
analyzing DOM elements loaded by the webpage, and acquiring a tag set of the DOM elements including all publisher addresses;
judging whether a third party publisher address exists according to the tag set;
if the address of the publisher of the third party exists, highlighting the content or the root tag published by the publisher of the third party on the webpage; the content published by the publisher of the third party is a DOM element added or modified in the webpage, and the root tag is a tag of the root element of the added or modified DOM element.
According to the above source tracking method, device, computer equipment, and storage medium for third-party content injected into a webpage, third-party publisher addresses are identified from the tag set of all publisher addresses, and the content published by the third-party publisher, or its root tag, is obtained according to the third-party publisher address and highlighted on the webpage. This visually informs the user that an extension has modified the content and tells the user the source of the modification. With the highlighted information, the user can make an informed decision about whether the content modification made by a particular extension is actually wanted, or whether to uninstall an extension that violates the user's expectations.
Drawings
FIG. 1 is a schematic representation of the structure of the DOM in one embodiment;
FIG. 2 is a flowchart of a source tracking method for third-party content injected into a webpage in one embodiment;
FIG. 3 is a block diagram of a source tracking device for third-party content injected into a webpage in one embodiment;
FIG. 4 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
A browser extension is a program that extends the functionality of a web browser. Browser extensions are typically implemented with a combination of HTML, CSS, and JavaScript and are written against browser-specific extension APIs, which provide three capabilities: (1) modifying the browser user interface in a controlled manner; (2) controlling HTTP headers; and (3) modifying web page content through the DOM (Document Object Model) API. The present application identifies and flags third-party injected content in light of these three capabilities.
The DOM is the standard programming interface recommended by the World Wide Web Consortium (W3C) for processing extensible markup languages. It is a platform- and language-independent application programming interface (API) through which programs and scripts can dynamically access and update the content, structure, and style of web documents (currently HTML and XML documents, as defined through their declarations); a document can be processed further, and the results can be added back to the current page. As shown in FIG. 1, which depicts the DOM structure of one embodiment, the DOM is a tree-based API: it requires the entire document to be represented in memory during processing and provides an access model for the whole document. The document is treated as a tree in which each node represents an HTML tag or a text item within a tag, so the DOM tree accurately describes the relationships between the tags in the HTML document. The process of converting an HTML or XML document into a DOM tree is called parsing. Once an HTML document has been parsed into a DOM tree, it can be processed by operating on that tree. The DOM not only describes the structure of the document but also defines the behavior of node objects, whose methods and attributes make it easy to access, modify, add, and delete the nodes and content of the tree. The DOM tree comprises multiple elements, each with a corresponding tag; the tag determines properties of the element such as the document type (for example, HTML), title and summary information, page layout, display characteristics, hyperlinks, and content type (image, text, Flash).
The tag categories are: the title class (TITLE), which covers tags describing the title and page summary information of an HTML document, such as <title> and <meta>; the content class (CONTENT), which covers tags containing the text content of a web page, such as a <td> tag containing text; the visual class (VISION), which covers tags describing the display characteristics of a page, such as <b> and <strong>; the block class (BLOCK), which covers tags for blocks of web page content, such as <table> and <tr>; the hyperlink class (LINK), which covers tags containing hyperlinks, such as <a>; and the other class (OTHER), which covers tags that do not fall into the above five categories.
As shown in FIG. 1, the tag <html> of the root element indicates that the document is an HTML document. The tag <head> marks the head portion of the web page, which generally declares the scripting language used, how the page is to be transmitted, and so on. The tag <body> defines the body of the document, and the tag <title> gives the title of the document. The tag <a> represents a hyperlink, and its href (Hypertext Reference) attribute specifies the URL (uniform resource locator) of the hyperlink target; in this embodiment the tag <a> may also carry a size attribute that defines the size of its content. The tag <h1> is used to emphasize content. The content of the <title> element is the text "document management", the content of the <a> element is the text "my link", and the content of the <h1> element is the text "my title".
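As a purely illustrative sketch of the six tag categories, the following TypeScript maps a tag name to a category. It covers only the example tags named in the description; the helper name classifyTag and its hasText parameter are hypothetical and are not part of the application.

```typescript
// Tag categories named in the description.
type TagCategory = "TITLE" | "CONTENT" | "VISION" | "BLOCK" | "LINK" | "OTHER";

// Assumption: only the example tags given in the text are listed here;
// a practical classifier would need a fuller table.
const TITLE_TAGS = new Set<string>(["title", "meta"]);
const VISION_TAGS = new Set<string>(["b", "strong"]);
const BLOCK_TAGS = new Set<string>(["table", "tr"]);
const LINK_TAGS = new Set<string>(["a"]);

// Hypothetical helper: classify a tag by name. `hasText` states whether the
// element directly contains page text (for example, a <td> holding text).
function classifyTag(tagName: string, hasText: boolean = false): TagCategory {
  const name = tagName.toLowerCase();
  if (TITLE_TAGS.has(name)) return "TITLE";
  if (LINK_TAGS.has(name)) return "LINK";
  if (VISION_TAGS.has(name)) return "VISION";
  if (BLOCK_TAGS.has(name)) return "BLOCK";
  if (name === "td" && hasText) return "CONTENT";
  return "OTHER";
}
```

In this sketch a <td> without text falls into OTHER, which is a simplification chosen for brevity rather than something the description specifies.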
In one embodiment, as shown in fig. 2, a source tracking method for injecting content to a third party on a web page is provided, which includes the following steps:
Step S110, parsing the DOM elements loaded by the webpage, and acquiring the tag set of the DOM elements including all publisher addresses.
A web page may reference various resources (such as style sheets, scripts, images, and Flash), and even resources of other web pages. After parsing, the page and its resources are represented as a DOM (Document Object Model). The DOM is a natural structure for representing a web page, can be manipulated through a standard API (application programming interface), and is therefore well suited as the basic unit for content source tracking. Parsing the DOM yields the DOM elements.
The source (publisher address) of a DOM element is recorded as a tag set l ∈ 𝒫(L), where 𝒫(L) denotes the power set of L and L is the set of all tags, structured as follows:
L = ⟨S, I, P, X⟩
S = {scheme} ∪ {"extension"}
I = {host} ∪ {"extension-identifier"}
P = {port} ∪ {null}
X = {0, 1, 2, …}
That is, each tag in L is a 4-tuple whose components are drawn from S, I, P, and X. S is a URL scheme or the special value "extension"; I is a host or a unique extension identifier; P is a port or the special value null; and X is an index that imposes a global total order on tags.
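The 4-tuple structure can be illustrated with a minimal TypeScript sketch. The field names, the helpers labelFromUrl and labelFromExtension, and the use of a simple counter to supply X are assumptions made for illustration; the application does not prescribe them.

```typescript
// One tag <S, I, P, X>, following the structure given in the text.
interface Label {
  scheme: string;      // S: URL scheme, or the special value "extension"
  id: string;          // I: host, or a unique extension identifier
  port: number | null; // P: port, or null
  index: number;       // X: position in the global total order
}

// Assumption: a plain counter is enough to supply the ordering index X.
let nextIndex = 0;

// Hypothetical helper: build a tag from an absolute publisher URL.
function labelFromUrl(url: string): Label {
  const m = /^([a-z][a-z0-9+.-]*):\/\/([^\/:?#]+)(?::(\d+))?/i.exec(url);
  if (m === null) throw new Error(`not an absolute URL: ${url}`);
  return {
    scheme: (m[1] ?? "").toLowerCase(),
    id: (m[2] ?? "").toLowerCase(),
    port: m[3] !== undefined ? Number(m[3]) : null,
    index: nextIndex++,
  };
}

// Hypothetical helper: build a tag for a browser extension.
function labelFromExtension(extensionId: string): Label {
  return { scheme: "extension", id: extensionId, port: null, index: nextIndex++ };
}
```

A tag set for an element is then simply a collection of such values, for example the page's own tag together with the tag of an extension that modified the element.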
Step S120, judging whether a third-party publisher address exists according to the tag set.
For content that a web page publishes statically, content source tracking starts when the page loads: as the browser parses the DOM, each element is labeled with the tag set {l0} containing the publisher address. Tracking statically published content is therefore straightforward: the source of the content can be tracked by reading the tag set {l0} containing the publisher address from the browser.
To track the source of content that a web page publishes dynamically, the source tracking system must monitor the execution of JavaScript (an interpreted scripting language). When a script tag is encountered while parsing a page, the Blink engine in the browser (Blink is a browser layout engine developed by Google and Opera Software) creates a new element and inserts it into the DOM. Here the script tag is a script anchor, the visible representation of a script on a web page opened in a Microsoft Office program; script anchors are not displayed by default, and different script anchors represent scripts written in different scripting languages. The Blink engine then fetches the JavaScript code (from the network, in the case of a remote script reference), submits it to the V8 JavaScript engine for execution, and pauses parsing until script execution completes. During script execution, a third-party publisher may dynamically create new elements and insert them into the DOM; by the definition of source, these new elements inherit the tag set of the script. JavaScript can create new elements in two ways: (1) creating a new element with the DOM API and inserting it into the DOM of the webpage; or (2) writing HTML markup directly into the page.
In method (1), creating a new element object comprises:
a. passing the tag name to the createElement function in the browser;
b. setting the other attributes of the new element; for example, after a new element is created for an <a> tag, its href attribute must be given an address;
c. inserting the new element into the DOM as a child element using the appendChild function or the insertBefore function in the browser.
In method (2), new HTML is inserted directly into the web page using browser functions such as document.write and document.writeln, or by modifying the innerHTML attribute of an element. When an attribute of an existing element is modified (for example, changing the src attribute of an image), the element inherits the tag set of the currently executing script.
For both method (1) and method (2), in order to record all DOM modifications made to the web page, the Node class in the browser's Blink engine is used to assign a source tag set to each DOM element that is newly created or modified, based on the script that is currently executing and causing the modification.
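The inheritance rule can be sketched outside the browser with a simplified node model. This is not Blink code: the TrackedNode type, the module-level currentScriptSources variable, and the helpers runScript, createTracked, appendTracked, and markModified are all hypothetical stand-ins for what the engine would do internally, and each source is reduced to a plain string.

```typescript
// Simplified stand-in for a DOM node that carries a source tag set.
interface TrackedNode {
  tag: string;
  sources: Set<string>;
  children: TrackedNode[];
}

// Assumption: the engine knows which script is executing; here that is
// modeled as a module-level variable holding the script's source tag set.
let currentScriptSources: Set<string> = new Set<string>();

// Run `body` as if it were a script with the given source tag set.
function runScript<T>(sources: Set<string>, body: () => T): T {
  const saved = currentScriptSources;
  currentScriptSources = sources;
  try {
    return body();
  } finally {
    currentScriptSources = saved;
  }
}

// Method (1): a newly created element inherits the running script's tag set.
function createTracked(tag: string): TrackedNode {
  return { tag, sources: new Set<string>(currentScriptSources), children: [] };
}

function appendTracked(parent: TrackedNode, child: TrackedNode): void {
  parent.children.push(child);
}

// Modifying an existing element adds the running script's tags to it.
function markModified(node: TrackedNode): void {
  currentScriptSources.forEach((s) => node.sources.add(s));
}
```

Keeping the active tag set in one place and restoring it in a finally clause is a design choice of this sketch; it ensures that an exception thrown by one script does not leak its tag set into the next.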
In this embodiment, whether the publisher address of the third party exists can be determined through the source tag set.
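A minimal sketch of this judgment, assuming each tag is reduced to its scheme, identifier, and port, might look as follows. The names SourceLabel, sameSource, thirdPartySources, and hasThirdParty are hypothetical, and treating every "extension" tag as third-party follows the embodiment that checks for extension tags.

```typescript
// Reduced form of a tag: the ordering index X is not needed for comparison.
interface SourceLabel {
  scheme: string;
  id: string;
  port: number | null;
}

function sameSource(a: SourceLabel, b: SourceLabel): boolean {
  return a.scheme === b.scheme && a.id === b.id && a.port === b.port;
}

// Return every tag that is an extension tag or differs from the original
// webpage publisher's tag.
function thirdPartySources(tagSet: SourceLabel[], original: SourceLabel): SourceLabel[] {
  return tagSet.filter((l) => l.scheme === "extension" || !sameSource(l, original));
}

// Step S120 in miniature: does the tag set contain a third-party publisher?
function hasThirdParty(tagSet: SourceLabel[], original: SourceLabel): boolean {
  return thirdPartySources(tagSet, original).length > 0;
}
```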
In one embodiment, event handler and timer registration may also need to be modified so that the source tag set can be analyzed. Specifically, callback handlers for events are registered through the addEventListener API, and callback handlers for timers are registered through setTimeout and setInterval. To this end, the EventTarget and DOMTimer classes in the browser's Blink engine are modified, respectively, to implement the modified event handler and timer registration. The program-specific details are not described further here.
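Since the application leaves the details open, the following is only one plausible reading: the tag set active at registration time is captured and restored when the callback later runs, so that elements created inside the callback are attributed to whoever registered it. The names activeSources and bindSources are hypothetical, and the sketch works on plain functions rather than on Blink's EventTarget or DOMTimer classes.

```typescript
// Assumption: the tag set of the currently executing code is kept here.
let activeSources: ReadonlySet<string> = new Set<string>();

// Wrap a callback so that, whenever it runs, the tag set that was active at
// registration time is active again for the duration of the call.
function bindSources<A extends unknown[]>(
  callback: (...args: A) => void
): (...args: A) => void {
  const captured = activeSources;
  return (...args: A): void => {
    const saved = activeSources;
    activeSources = captured;
    try {
      callback(...args);
    } finally {
      activeSources = saved;
    }
  };
}
```

Under this reading, a registration such as setTimeout(bindSources(fn), 0) or target.addEventListener("click", bindSources(fn)) would carry the registering script's tag set into the deferred call.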
Step S130, if the address of the third party publisher exists, highlighting the content or the root tag published by the third party publisher on the web page. The content published by the publisher of the third party is a DOM element added or modified in the webpage, and the root tag is a tag of the root element of the added or modified DOM element.
The content published by a third-party publisher may be, for example, malicious advertisements, which are displayed in the browser page through DOM elements; identifying the DOM elements inserted or modified by the third-party publisher is therefore essentially enough to identify the content that publisher has published. For the DOM elements inserted or modified by a third-party publisher, setting a highlighting attribute on the DOM elements highlights the published content on the webpage; likewise, setting the display attribute of the root tag corresponding to the DOM elements highlights the root tag of the published content on the webpage. For example, tags with display characteristics, such as <b>, <i>, <strong>, and <h1> through <h6>, can be used to emphasize important content and attract attention.
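Locating the root elements to highlight can be sketched on a simplified tree. The PageNode type, the highlighted flag (standing in for setting a visual-class tag or style), and the function markInjectedRoots are assumptions made for illustration; a root is taken here to be a node carrying a non-original source whose parent carries none.

```typescript
// Simplified stand-in for a DOM node with its recorded sources.
interface PageNode {
  tag: string;
  sources: string[];
  children: PageNode[];
  highlighted?: boolean;
}

// Mark and return the roots of injected subtrees: nodes that carry a source
// other than the original publisher while their parent does not.
function markInjectedRoots(root: PageNode, original: string): PageNode[] {
  const roots: PageNode[] = [];
  const visit = (node: PageNode, parentInjected: boolean): void => {
    const injected = node.sources.some((s) => s !== original);
    if (injected && !parentInjected) {
      node.highlighted = true; // stands in for setting a visual-class tag
      roots.push(node);
    }
    node.children.forEach((child) => visit(child, injected));
  };
  visit(root, false);
  return roots;
}
```

Highlighting only the root of each injected subtree keeps the visual indication to one marker per injected block instead of one per nested element.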
in the source tracking method for injecting the content to the third party of the webpage, the address of the publisher of the third party is identified through the label set of all the addresses of the publishers, the content published by the publisher of the third party and the root label are acquired according to the address of the publisher of the third party and are highlighted in the webpage in a visual mode, and visual effects of ① informing a user of the behavior of modifying the content in expansion, ① informing the user of the source of modification and making a correct decision by the user with the highlighted information to determine whether the user really wants the content modification from specific expansion or not or whether the user wants to uninstall the expansion which violates the expectation of the user can be achieved.
In one embodiment, parsing the DOM elements loaded by the webpage and obtaining the tag set of the DOM elements that includes all publisher addresses comprises: parsing the annotations of the DOM elements loaded by the webpage; and obtaining, according to the annotations of the DOM elements, the tag set of the DOM elements that includes all publisher addresses, wherein the tag set includes the original webpage publisher address.
In the case of static content publishing in a webpage, content source tracking starts when the webpage is loaded. When the browser parses the DOM, a tag set {l0} containing publisher addresses can be obtained from the annotations of the DOM elements; this tag set contains all content publisher addresses, including the original webpage publisher address.
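Obtaining the tag set from annotated DOM elements can be illustrated with a small sketch. It assumes that each parsed element already carries an annotation listing its publisher addresses; the `Node` class, the `collect_tag_set` helper, and the example addresses are hypothetical and only model the idea.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List


@dataclass
class Node:
    """Hypothetical stand-in for a parsed DOM element."""
    name: str
    publishers: FrozenSet[str]  # annotation: publisher addresses of this element
    children: List["Node"] = field(default_factory=list)


def collect_tag_set(root: Node) -> FrozenSet[str]:
    """Union of the publisher annotations of `root` and all its descendants."""
    tags = set(root.publishers)
    for child in root.children:
        tags |= collect_tag_set(child)
    return frozenset(tags)


# Example page: one element was touched by a second publisher.
page = Node("html", frozenset({"https://news.example"}), [
    Node("p", frozenset({"https://news.example"})),
    Node("div", frozenset({"https://news.example", "https://ads.example"})),
])
```

Calling `collect_tag_set(page)` yields a set containing both addresses, which is the input the judging step needs.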
In one embodiment, parsing the annotations of the DOM elements loaded by the webpage comprises: obtaining the annotations of the DOM elements loaded by the webpage; and executing the annotations of the DOM elements in an incremental manner.
The annotations of a DOM element may comprise multiple layers: similar to replies to a post in a forum, further annotations may keep being appended after an existing annotation, so the layers of annotations need to be executed sequentially in an incremental manner.
In one embodiment, the step of determining whether the publisher address of the third party exists according to the tag set includes: and judging whether a third-party publisher address except the original webpage publisher address exists or not according to the label set.
In one embodiment, the tag set further includes an extension tag used by the extension source tag set. The step of judging whether a third-party publisher address exists according to the tag set comprises: judging whether the third-party publisher address exists according to whether the tag set includes the extension tag.
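The two judging variants above (any address other than the original publisher, or the presence of an extension tag) can be sketched as simple set checks. The `ext:` prefix used to mark extension tags is an assumption made for illustration; the source does not specify how extension tags are encoded.

```python
from typing import FrozenSet

EXTENSION_PREFIX = "ext:"  # hypothetical marker for extension tags


def third_party_addresses(tag_set: FrozenSet[str], original: str) -> FrozenSet[str]:
    """Every publisher address in the tag set other than the original publisher."""
    return frozenset(tag for tag in tag_set if tag != original)


def has_third_party(tag_set: FrozenSet[str], original: str) -> bool:
    """True if any address besides the original publisher is present."""
    return bool(third_party_addresses(tag_set, original))


def has_extension_tag(tag_set: FrozenSet[str]) -> bool:
    """True if the tag set contains at least one extension tag."""
    return any(tag.startswith(EXTENSION_PREFIX) for tag in tag_set)
```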
In the case of dynamic content publishing in a webpage, the sources of content become more complicated once dynamic script code is executed. For example, as a typical means of dynamic content publishing, JavaScript can add, modify, and delete DOM elements in any manner using the DOM API exposed by the browser. Such additions, modifications, and deletions can be tracked to their source by means of tag sets, as follows:
First, when publishing the original content, the original webpage publisher (Publisher) creates an initial tag set {l0}; during dynamic content publishing, new source tag sets are created on the basis of this initial tag set. Then, when an external script (a script added by a third party) is loaded into the DOM of the webpage, a new tag li, i ∈ {1, 2, …}, is generated according to the origin of the script, and the tag set is then denoted {l0, li}. Subsequently, each time the DOM of the webpage loads a further external script, the same generation process is repeated on the basis of the tag set corresponding to the previously loaded external scripts. For example, if three external scripts are loaded, the tag set is {l0, li, lj, lk}, i, j, k ∈ {1, 2, …}.
The elements contained in a tag set are unique. The original webpage publisher publishes the initial content under the tag set {l0}, whereas content published by third-party publishers carries the tag sets {l0, li}, {l0, lj} or {l0, lk}. Consequently, successive loads of external scripts from other origins are not reflected in the initial tag set, which can continue to be used for subsequent DOM modifications by the original webpage publisher. For example, if the original webpage publisher loads a script from the original origin, any DOM modification that script generates has the source tag set {l0} rather than {l0, lj}.
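The tag generation rule for external scripts can be modeled as below. This is a sketch under the assumption that tags are allocated per script origin and numbered in load order; `LabelAllocator` and its methods are hypothetical names.

```python
from typing import Dict, FrozenSet


class LabelAllocator:
    """Assigns source tag sets to scripts by origin (illustrative model)."""

    def __init__(self, original_origin: str) -> None:
        self.original_origin = original_origin
        self._index: Dict[str, int] = {}  # external origin -> i in l_i

    def label_set_for(self, script_origin: str) -> FrozenSet[str]:
        # Scripts from the original publisher keep the initial tag set {l0}.
        if script_origin == self.original_origin:
            return frozenset({"l0"})
        # A new external origin receives the next unused tag l_i.
        if script_origin not in self._index:
            self._index[script_origin] = len(self._index) + 1
        return frozenset({"l0", f"l{self._index[script_origin]}"})

    def page_label_set(self) -> FrozenSet[str]:
        # All tags seen on the page so far, e.g. {l0, l1, l2}.
        return frozenset({"l0"} | {f"l{i}" for i in self._index.values()})
```

Because the original publisher's scripts always map back to {l0}, loading third-party scripts never changes the tag set attached to the publisher's own later modifications.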
When a third party adds content to a webpage, the DOM can be operated on in three ways: element insertion, element modification, and element deletion. For element insertion, the current DOM is updated by inserting a new element, which is marked with the publisher address of the party inserting it; if the inserted element is a script, a tag set for the script's source is generated from the script. For element modification, the current DOM is updated by modifying an existing DOM element, and the tag set of the modified element is merged with the current tag set. For element deletion, the corresponding element is deleted from the current DOM.
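The three DOM operations and their effect on tag sets can be sketched with a dictionary from element identifiers to tag sets. `TrackedDom` is a hypothetical model; in particular, treating "the current tag set" as the tag set of the acting script is an interpretation of the text above.

```python
from typing import Dict, FrozenSet, Iterable


class TrackedDom:
    """Maps element ids to source tag sets (illustrative model only)."""

    def __init__(self) -> None:
        self.elements: Dict[str, FrozenSet[str]] = {}

    def insert(self, element_id: str, actor: Iterable[str]) -> None:
        # A new element is marked with the tag set of whoever inserted it.
        self.elements[element_id] = frozenset(actor)

    def modify(self, element_id: str, actor: Iterable[str]) -> None:
        # A modified element's tag set is merged with the actor's tag set.
        self.elements[element_id] = self.elements[element_id] | frozenset(actor)

    def delete(self, element_id: str) -> None:
        # Deleting an element removes it, and its tags, from the DOM.
        del self.elements[element_id]
```

For example, a publisher element with {l0} that is later modified by a script tagged {l0, l1} ends up with {l0, l1}, so the modification remains attributable.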
Publishing web content through an extension is similar to dynamic content publishing in a webpage, with one difference in how the source tag set is initialized: content published by a third-party publisher through dynamic publishing carries a source tag set that also includes the initial tag l0 of the original webpage publisher, whereas content published by an extension carries the extension source tag set, which uses the extension tag.
In one embodiment, the step of highlighting the content or the root tag published by the publisher of the third party on the webpage if the address of the publisher of the third party exists comprises: if the publisher address of the third party exists, acquiring a DOM element corresponding to the publisher address of the third party; setting a visual tag for the DOM element, wherein the visual tag is used for highlighting the display characteristic of the DOM element on a page; or setting the root tag of the DOM element to be visually displayed when the cursor stays at the position of the webpage where the DOM element is located.
The visual tags include <b>, <i>, <strong>, and <h1> to <h6>. Root tags are tags such as <html> and </html>. Because content may be injected by several third-party publishers, displaying the root tags of all of their content sources could put too much information on the webpage and reduce its tidiness; therefore, the content injected by the third-party publishers may be highlighted, or only several root tags of the third-party content sources may be selectively displayed.
For example, the visual type tag is configured with a border color to prompt the user about the content injected by the third-party publisher, and the border color should be selected to be significantly different from the existing color combination of the web page. And when the user hovers the cursor of the mouse over the DOM element corresponding to the publisher address of the third party, displaying the root tag of the content injected by the publisher of the third party.
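One way to choose a border color that differs clearly from the page's existing colors is to pick, from a fixed palette, the color whose nearest page color is farthest away. The sketch below does this with squared RGB distance; the palette, the distance measure, and the returned attribute names are illustrative assumptions, not details given in the source.

```python
from typing import Dict, Iterable, List, Sequence, Tuple

RGB = Tuple[int, int, int]

# Hypothetical candidate palette for highlight borders.
CANDIDATES: List[RGB] = [
    (255, 0, 0), (0, 200, 0), (0, 0, 255), (255, 0, 255), (255, 165, 0),
]


def _distance_sq(a: RGB, b: RGB) -> int:
    """Squared Euclidean distance between two RGB colors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pick_border_color(page_colors: Sequence[RGB],
                      candidates: Sequence[RGB] = CANDIDATES) -> RGB:
    """Candidate whose closest page color is as far away as possible."""
    if not page_colors:
        return candidates[0]
    return max(candidates,
               key=lambda c: min(_distance_sq(c, p) for p in page_colors))


def highlight_attributes(origins: Iterable[str],
                         page_colors: Sequence[RGB]) -> Dict[str, str]:
    """Inline style for the border plus hover text naming the injecting origins."""
    r, g, b = pick_border_color(page_colors)
    return {
        "style": f"border: 2px solid rgb({r}, {g}, {b});",
        "title": "Injected by: " + ", ".join(sorted(origins)),
    }
```

The `title` text models what is shown when the cursor hovers over the element; a real implementation would set these attributes on the DOM element itself.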
In one embodiment, the source tracking method for third-party content injected into a webpage further includes: setting specific DOM elements as non-modifiable, deleting content injected by known third-party publishers, or sending information about the content injected by the third-party publisher to a webpage management center.
In one embodiment, after the step of setting a visual tag for the DOM element, the visual tag being used to highlight the display characteristics of the DOM element on the page, the method further comprises: setting a delayed display tag for the DOM element, wherein the delayed display tag is used to highlight the display characteristics of the DOM element on the page once the DOM element is completely within the visible area of the webpage.
The delayed display tag can be set by modifying the ContainerNode class.
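The condition "the DOM element is completely within the visible area" amounts to a rectangle containment test. The following sketch shows that test in isolation; `Rect`, `fully_visible`, and `should_highlight_now` are hypothetical names, and coordinates are assumed to be in the same page coordinate system for both rectangles.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    left: float
    top: float
    right: float
    bottom: float


def fully_visible(element: Rect, viewport: Rect) -> bool:
    """True only if the element lies entirely inside the viewport."""
    return (element.left >= viewport.left
            and element.top >= viewport.top
            and element.right <= viewport.right
            and element.bottom <= viewport.bottom)


def should_highlight_now(is_third_party: bool, element: Rect, viewport: Rect) -> bool:
    """Delay highlighting until a third-party element is fully on screen."""
    return is_third_party and fully_visible(element, viewport)
```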
In one embodiment, the extension engine of the browser (Chrome) checks an extension's permissions against the declarations in its manifest file, injects content scripts, and dispatches background scripts and content scripts to the V8 JavaScript engine for execution. A Chrome extension can modify the content of a webpage by injecting a content script into it or by using the webRequest function. A content script is a JavaScript program that can communicate with the extension's server through XMLHttpRequest, manipulate the web page through the shared DOM, and call a set of chrome APIs to interact with the corresponding extension background page. By using the webRequest function, an extension can also modify and block HTTP requests and responses in order to modify the DOM of a webpage. In this embodiment, only content modifications made by content scripts are tracked; identifying ad injection through webRequest is left for future work. How the injection and execution of content scripts are tracked is described below:
A1. To track elements created or modified during the execution of a content script, the extension engine of the browser is modified to hook the events corresponding to content script injection and execution. A content script can be injected into a webpage in several ways. One way is declarative: if content scripts are to be injected into every matching webpage, they must be registered in the extension's manifest file using the content_scripts field, and the options of this field control when and where the content scripts are injected. Another way is programmatic injection, for example injecting a content script in response to a particular event (e.g., the user clicking the extension's browser action); in this case, the content script is injected using the tabs API. In either approach, at the time of injection the content script has an initial source tag set to which the extension tag is attached.
Injection and execution of content scripts can be tracked by analyzing callback functions. The browser's extension engine provides a messaging API as a communication channel between the background page and content scripts: each side can receive messages by providing a callback function for the onMessage or onRequest event, and can send messages by calling sendMessage or sendRequest. To track the registration and execution of callback functions, code is added that maps each registered callback to the corresponding content script, so that the extension responsible for a DOM modification can be found; the specific code is not described here.
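The mapping from registered message callbacks to their content scripts can be illustrated with a toy message channel. This is not the Chrome messaging API: `MessageChannel`, `on_message`, `send_message`, and `make_injector` are hypothetical Python stand-ins that only show how remembering the registering script lets a later DOM modification be attributed.

```python
from typing import Callable, List, Tuple

Callback = Callable[[str, str], None]  # (owning content script, message)


class MessageChannel:
    """Toy model of callback registration and dispatch (an assumption)."""

    def __init__(self) -> None:
        self._listeners: List[Tuple[str, Callback]] = []

    def on_message(self, content_script: str, callback: Callback) -> None:
        # Map the registered callback to the content script that owns it.
        self._listeners.append((content_script, callback))

    def send_message(self, message: str) -> None:
        # Dispatch, telling each callback which script it runs on behalf of.
        for content_script, callback in self._listeners:
            callback(content_script, message)


# Log of (responsible content script, modified element id).
modifications: List[Tuple[str, str]] = []


def make_injector(element_id: str) -> Callback:
    """Build a callback that records a DOM modification with its owner."""
    def callback(owner: str, message: str) -> None:
        modifications.append((owner, element_id))
    return callback
```

After a message is dispatched, `modifications` lists which content script, and hence which extension, was responsible for each recorded change.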
In the above embodiments, the user is helped to detect content injection by distinguishing the DOM elements injected or modified by third-party publishers from those injected or modified by the original webpage publisher. The source tracking method can readily be applied to any browser to inform the user of extension-based content modifications. Because the method can identify all types of content injection, it can highlight the injected third-party content regardless of the injection type (such as flash advertisements, banner advertisements, and text advertisements). The method can supplement the prior art, such as the centralized detection methods used by browser vendors, and can also complement advertisement blocking systems such as Adblock Plus and Ghostery.
It should be understood that although the steps in the flowchart of fig. 2 are shown sequentially as indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in fig. 2 may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 3, there is provided a source tracking apparatus for injecting content to a third party on a web page, including: a tag set obtaining module 210, a judging module 220 and a highlighting module 230, wherein:
The tag set obtaining module 210 is configured to parse the DOM elements loaded by the webpage and obtain the tag set of the DOM elements, which includes all publisher addresses.
The judging module 220 is configured to judge whether a publisher address of a third party exists according to the tag set.
A highlighting module 230, configured to highlight, on the web page, content or a root tag published by the publisher of the third party if the publisher address of the third party exists. The content published by the publisher of the third party is a DOM element added or modified in the webpage, and the root tag is a tag of the root element of the added or modified DOM element.
In one embodiment, the tag set obtaining module 210 includes: the analysis unit is used for analyzing the annotation of the DOM element loaded by the webpage; and the tag set acquisition unit is used for acquiring a tag set comprising all publisher addresses in the DOM element according to the annotation of the DOM element, wherein the tag set comprises the original webpage publisher address.
In one embodiment, the parsing unit includes: an annotation obtaining subunit configured to obtain the annotations of the DOM elements loaded by the webpage; and an execution subunit configured to execute the annotations of the DOM elements in an incremental manner.
In one embodiment, the determining module 220 is further configured to determine whether a third-party publisher address other than the original webpage publisher address exists according to the tag set.
In one embodiment, the tag set further includes an extension tag used by the extension source tag set; the judging module 220 is further configured to judge whether a third-party publisher address exists according to whether the tag set includes the extension tag.
In one embodiment, the highlighting module 230 includes: the DOM element obtaining unit is used for obtaining a DOM element corresponding to the publisher address of the third party if the publisher address of the third party exists; the visual tag setting unit is used for setting a visual tag for the DOM element, and the visual tag is used for highlighting the display characteristic of the DOM element on a page; or the root tag setting unit is used for setting the root tag of the DOM element to be visually displayed when the cursor stays at the position of the webpage where the DOM element is located.
In one embodiment, the highlighting module 230 further comprises: a delayed display tag setting unit configured to set a delayed display tag to the DOM element; and the delayed display tag is used for highlighting the display characteristic of the DOM element in the page when the DOM element is completely in the visible area of the webpage.
For specific limitations of the source tracking apparatus, reference may be made to the limitations of the source tracking method described above, which are not repeated here. The modules in the source tracking apparatus may be implemented wholly or partially in software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call them and execute the operations corresponding to each module.
In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 4. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a source tracking method for injecting content to a third party on a web page. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the architecture shown in fig. 4 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program:
analyzing DOM elements loaded by the webpage, and acquiring a tag set of the DOM elements including all publisher addresses;
judging whether a third party publisher address exists according to the tag set;
if the address of the publisher of the third party exists, highlighting the content or the root tag published by the publisher of the third party on the webpage; the content published by the publisher of the third party is a DOM element added or modified in the webpage, and the root tag is a tag of the root element of the added or modified DOM element.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:
analyzing DOM elements loaded by the webpage, and acquiring a tag set of the DOM elements including all publisher addresses;
judging whether a third party publisher address exists according to the tag set;
if the address of the publisher of the third party exists, highlighting the content or the root tag published by the publisher of the third party on the webpage; the content published by the publisher of the third party is a DOM element added or modified in the webpage, and the root tag is a tag of the root element of the added or modified DOM element.
Those skilled in the art will understand that all or part of the processes of the methods in the above embodiments can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above embodiments express only several implementations of the present application, and although their description is relatively specific and detailed, they should not be construed as limiting the scope of the invention patent. It should be noted that a person skilled in the art can make several variations and improvements without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the scope of protection of this patent shall be subject to the appended claims.

Claims (10)

1. A source tracking method for injecting content to a webpage third party is characterized by comprising the following steps:
analyzing DOM elements loaded by the webpage, and acquiring a tag set of the DOM elements including all publisher addresses;
judging whether a third party publisher address exists according to the tag set;
if the address of the publisher of the third party exists, highlighting the content or the root tag published by the publisher of the third party on the webpage; the content published by the publisher of the third party is a DOM element added or modified in the webpage, and the root tag is a tag of the root element of the added or modified DOM element.
2. The method of claim 1, wherein the parsing the DOM elements loaded by the web page and obtaining the set of tags for DOM elements that include all publisher addresses comprises:
analyzing the annotation of the DOM element loaded by the webpage;
and according to the annotation of the DOM element, acquiring a tag set of the DOM element, which comprises all publisher addresses, wherein the tag set comprises the original webpage publisher address.
3. The method of claim 2, wherein parsing annotations of a webpage loaded DOM element comprises:
obtaining annotations of DOM elements loaded through a webpage;
performing annotation of the DOM elements in an incremental manner.
4. The method of claim 2, wherein the step of determining whether a publisher address of a third party exists according to the tag set comprises:
and judging whether a third-party publisher address except the original webpage publisher address exists or not according to the label set.
5. The method of claim 1, wherein the tag set further comprises an extension tag used by an extension source tag set;
the step of judging whether the third party publisher address exists according to the tag set comprises the following steps:
and judging whether the publisher address of the third party exists according to whether the tag set comprises the expansion tag.
6. The method of claim 1, wherein the step of highlighting the content or root tag published by the publisher of the third party on the web page if the publisher address of the third party exists comprises:
if the publisher address of the third party exists, acquiring a DOM element corresponding to the publisher address of the third party;
setting a visual tag for the DOM element, wherein the visual tag is used for highlighting the display characteristic of the DOM element on a page; or,
and setting the root tag of the DOM element to be displayed in a visual mode when the cursor stays at the position of the webpage where the DOM element is located.
7. The method of claim 6, wherein after the step of setting a visual tag for the DOM element, the visual tag being used for highlighting the display characteristic of the DOM element on a page, the method further comprises:
setting a delayed display tag for the DOM element; and the delayed display tag is used for highlighting the display characteristic of the DOM element in the page when the DOM element is completely in the visible area of the webpage.
8. A source tracking apparatus for injecting content to a third party on a web page, the apparatus comprising:
the tag set acquisition module is used for analyzing DOM elements loaded by a webpage and acquiring a tag set of the DOM elements including all publisher addresses;
the judging module is used for judging whether the third party publisher address exists according to the label set;
the highlighting module is used for highlighting the content or the root tag published by the publisher of the third party on the webpage if the address of the publisher of the third party exists; the content published by the publisher of the third party is a DOM element added or modified in the webpage, and the root tag is a tag of the root element of the added or modified DOM element.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN201911155077.5A 2019-11-22 2019-11-22 Source tracking method and device for injecting content to third party of webpage Pending CN111177614A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911155077.5A CN111177614A (en) 2019-11-22 2019-11-22 Source tracking method and device for injecting content to third party of webpage

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911155077.5A CN111177614A (en) 2019-11-22 2019-11-22 Source tracking method and device for injecting content to third party of webpage

Publications (1)

Publication Number Publication Date
CN111177614A true CN111177614A (en) 2020-05-19

Family

ID=70647270

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911155077.5A Pending CN111177614A (en) 2019-11-22 2019-11-22 Source tracking method and device for injecting content to third party of webpage

Country Status (1)

Country Link
CN (1) CN111177614A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112769792A (en) * 2020-12-30 2021-05-07 绿盟科技集团股份有限公司 ISP attack detection method and device, electronic equipment and storage medium

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101894139A (en) * 2010-06-25 2010-11-24 优视科技有限公司 Mobile Internet webpage information data interaction processing method
CN103793498A (en) * 2014-01-22 2014-05-14 百度在线网络技术(北京)有限公司 Picture searching method and device and searching engine
CN103761326A (en) * 2014-01-29 2014-04-30 百度在线网络技术(北京)有限公司 Image search method and search engine
CN104468546A (en) * 2014-11-27 2015-03-25 微梦创科网络科技(中国)有限公司 Network information processing method and firewall device and system
CN106022132A (en) * 2016-05-30 2016-10-12 南京邮电大学 Real-time webpage Trojan detection method based on dynamic content analysis
CN108614849A (en) * 2017-01-13 2018-10-02 南京邮电大学盐城大数据研究院有限公司 A kind of web advertisement detection method based on dynamic pitching pile and static more script page feature extractions
CN107729385A (en) * 2017-09-19 2018-02-23 杭州安恒信息技术有限公司 A kind of method for gathering dynamic web page partial data content
CN108664303A (en) * 2018-04-28 2018-10-16 北京小米移动软件有限公司 The display methods and device of web page contents
CN109684570A (en) * 2018-12-27 2019-04-26 北京字节跳动网络技术有限公司 Web information processing method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SAJJAD ARSHAD等: "Identifying Extension-Based Ad Injection via Fine-Grained Web Content Provenance", 《SPRINGER LINK》, 7 September 2016 (2016-09-07), pages 415 - 436, XP047355688, DOI: 10.1007/978-3-319-45719-2_19 *

Legal Events

Date Code Title Description
PB01 Publication
CB02 Change of applicant information

Address after: 511300 No.78, west of Zhucun Avenue, Zhucun street, Zengcheng District, Guangzhou City, Guangdong Province

Applicant after: CHINA ELECTRONIC PRODUCT RELIABILITY AND ENVIRONMENTAL TESTING RESEARCH INSTITUTE ((THE FIFTH ELECTRONIC RESEARCH INSTITUTE OF MIIT)(CEPREI LABORATORY))

Address before: 510610 No. 110 Zhuang Road, Tianhe District, Guangdong, Guangzhou, Dongguan

Applicant before: CHINA ELECTRONIC PRODUCT RELIABILITY AND ENVIRONMENTAL TESTING RESEARCH INSTITUTE ((THE FIFTH ELECTRONIC RESEARCH INSTITUTE OF MIIT)(CEPREI LABORATORY))

SE01 Entry into force of request for substantive examination