WO2019020812A1

WO2019020812A1 - Cloud-based method, system and computer product for testing web domains for behavioral targeting in online advertising

Info

Publication number: WO2019020812A1
Application number: PCT/EP2018/070480
Authority: WO
Inventors: Nikolaos Laoutaris; Spyridon SAKELLARIOU; Juan Miguel CARRASCOSA
Original assignee: Lstech Ltd
Priority date: 2017-07-28
Filing date: 2018-07-27
Publication date: 2019-01-31

Abstract

The present invention relates to a cloud-based method for testing web domains, comprising: -a setup step including selecting a first end-user demographic type and a web domain target that comprises predefined web-sites and portals, and defining visiting pattern parameters based on the selected first end-user demographic type and web domain target, -a data collecting step that inputs the visiting pattern parameters into the web domain targets and obtains test results from the output information rendered by the web domain target, -the step of automatically identifying advertising-related results from the obtained test results, and -the step of detecting online behavioral targeting by analyzing the advertising-related results based on predetermined threshold combinations or metrics based on domain match, topic match and frequency counts. The present invention also relates to a system and a computer program product adapted to implement the steps of the method of the invention.

Description

Cloud-based method, system and computer product for testing web domains for behavioral targeting in online advertising

Field of the invention

The present invention relates to the field of testing web-domains, web-pages and/or web-sites for any online behavioral targeting of end-users. The testing or auditing of web-pages hosted in uncontrolled computer systems and networks is a fundamental aspect of data protection and online privacy, and results in increased security for the end-user computer equipment and its networks.

Background of the invention

In defence of an individual's privacy, legislators internationally are starting to impose legal frameworks to limit web domains' legitimate abilities to track and target the visiting patterns of end-users, users or customers alike. A couple of examples can be found internationally:

Test for Children's Online Privacy Protection Act (FTC COPPA) compliance, or

Test for Audiovisual Media Services Directive (EC AVMSD) compliance, or

Test for EU General Data Protection Regulation (EU GDPR) compliance.

In the prior art, web domains and pages are tested by analyzing scripts in marked-up languages. Information is parsed and extracted and conclusions on end-user tracking are arrived at, mainly by source code or code analysis.

An object of the present invention is to improve the security of end-user equipment or of a network system. Beyond the invasion of privacy, web domains, web-pages and web-sites that track the behavior of the visiting end-users can pose a security threat, as they collect information that can be used to exploit security faults of computer systems.

Other disclosures in the prior art only propose partial solutions by PC-related limited testing of web-pages or web-sites. The scope of that teaching is narrow at the level of processing power of personal computers and cannot be implemented in bigger networks. In this respect, another object of the present invention is to propose methods and systems that allow a wider test of any web-page and web domain, by means of cloud-based computers and network architectures.

Yet another object of the present invention is to identify advertising-related information, delivered by web-domains and web-servers, without having to click the buttons that lead to the rendering of advertising contents.

Further prior art references are listed below.

Prior art references:

[1] J. M. Carrascosa, J. Mikians, R. Cuevas, V. Erramilli, N. Laoutaris, "I Always Feel Like Somebody's Watching Me. Measuring Online Behavioural Advertising," ACM CoNEXT'15. Presents a methodology and a straw man implementation for desktop/laptop computers. The article is focused primarily on measurement methodology, findings, and analysis. Implementation issues and details are kept to a bare minimum and refer to an implementation for research purposes on a personal computer (desktop/laptop).

[2] US20130212638A1. Systems and methods for testing online systems and content. This disclosure depends on (passive) web-page code analysis to detect the presence of tracking code (cookies etc.) as well as implementation of different opt-out functions required by regulatory and self-regulatory programs. The method has no means of detecting whether specific demographics are targeted or not. This requires conducting active, as opposed to passive, experiments.

[3] US8613051B2. System and method for COPPA compliance for online education: Access control for children to allow parents to grant them access to educational content.

[4] US20100330543A1. Method and system for a child review process within a networked community. It describes a monitoring system for the interactions of children with a networked system and community. Includes notifications to parents when certain conditions are met. It assumes that the conditions and functions that are prohibited for the kid are clearly defined.

[5] US20150096052A1. Children's Online Personal Info Privacy Protection Service. It describes a Firewall/Personal Data Bank service for protecting the data of children from external services that wish to access them.

[6] US20150161672A1. Preventing Display of Age Inappropriate Advertising, is an automatic detection of inappropriate content for children in online services.

Summary of the invention

The present invention provides a method, a system and a computer product for improving security and privacy of an end-user's equipment by means of cloud-based testing of web-pages.

The testing of web pages can be used as an auditing tool for automating compliance tests for regulation that bans tracking and targeting of sensitive demographic groups. These demographic groups are categorized in a set of end-user profiles. These profiles are stored in a computer memory, preferably cloud-based.

The invention discloses a cloud-based method for testing web domains, comprising:

-a setup step including selecting a first end-user demographic type from a cloud storage repository that comprises multiple end-user demographic types relating to predetermined online behavioral parameters; and

-the setup step further including selecting a web domain target (or more than one) that comprises predefined web-sites and portals; and

-the setup step further including defining visiting pattern parameters based on said selected first end-user demographic type and said selected web domain target(s), and

-a data-collecting step that inputs said visiting pattern parameters into said web domain target(s) and obtains test results from the output information rendered by said web domain target(s); and

-the step of automatically identifying advertising-related results from said obtained test results at the previous data collecting step; and

-the step of detecting online behavioral targeting by analyzing said advertising-related results based on predetermined threshold combinations or metrics based on domain match, topic match and frequency counts.

The invention also relates to a cloud-based method for testing web domains, which differs from the method described above in that the data collecting step comprises automatically visiting the selected web domain target(s), following the visiting pattern parameters, and rendering part or preferably the entire visited page by scrolling all the way to the bottom in an automated way (mimicking the scrolling of a real human). The rendered data constitutes the test results referred to above, from which the advertising-related results are automatically identified as described above, and the online behavioral targeting is detected based on the above-mentioned thresholds or metrics. Optionally, a "snapshot" of the partially or preferably fully rendered page is acquired to be used as proof/evidence of findings in terms of the metrics stated above.

The method optionally includes, at the setup step, sending a request to the tested web-domain target for disabling end-user online behavioural targeting. Similarly, optionally, the method may include sending a request to the tested web-domain target for disabling end-user tracking.

To identify advertising-related information without committing click-fraud, certain algorithms can be used. Preferably, the step of automatically identifying advertising-related results includes analyzing images, selecting the images related to advertising and analyzing tags around the image to identify landing pages. If needed, it may also include further identifying and parsing code injected by the web domain target or even further identifying an end-user clickable button or event attached to the advertising-related information.
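As a minimal sketch of the tag-analysis idea above, the following Python example parses rendered HTML and reads the landing URL from the anchor tag enclosing each image, without issuing any click. The class name `AdLandingExtractor`, the function `extract_landing_pages`, the sample markup, and the assumption that an advertisement appears as an image wrapped in a link are illustrative and are not taken from the disclosure. Deciding which images are advertising-related, and parsing injected code, are outside this sketch.

```python
from html.parser import HTMLParser


class AdLandingExtractor(HTMLParser):
    """Collects (image source, landing URL) pairs for images wrapped in links."""

    def __init__(self):
        super().__init__()
        self._href_stack = []  # href of each currently open <a> tag
        self.candidates = []   # (img src, enclosing href) pairs

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self._href_stack.append(attrs.get("href"))
        elif tag == "img" and self._href_stack and self._href_stack[-1]:
            # The image sits inside a link: treat the href as its landing page.
            self.candidates.append((attrs.get("src"), self._href_stack[-1]))

    def handle_endtag(self, tag):
        if tag == "a" and self._href_stack:
            self._href_stack.pop()


def extract_landing_pages(html):
    """Return candidate (image, landing URL) pairs found in the markup."""
    parser = AdLandingExtractor()
    parser.feed(html)
    return parser.candidates


# Hypothetical rendered markup: one linked banner and one plain image.
sample = (
    '<div><a href="https://shop.example/toys">'
    '<img src="https://ads.example/banner.png"></a>'
    '<img src="https://site.example/logo.png"></div>'
)
print(extract_landing_pages(sample))
```

Because the landing URL is read from the markup rather than obtained by following the link, no click is registered with the advertiser.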

Non-limiting embodiments of the multiple end-user demographic types relating to predetermined online behavioral parameters may include predetermined demographic groups, based on age, gender, race, or ethnic parameters, in particular, children-related profiles, women-related profiles.

The method finds an optimized cloud-based implementation if it includes discovering overlap of topics from the output information rendered by said web domain targets, wherein a specific topic L may be maintained and kept in memory, when it is encountered more than K times in the L subtree, by transforming a set of h-long branches of a category tree into a flat list of terms and computing threshold comparisons. The testing may also be carried out during predetermined time patterns so as to monitor said web domain targets.
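One possible reading of the flattening and threshold step can be sketched as follows. The sketch assumes that each branch is a tuple of category terms and that a term is retained when it occurs more than K times across all branches; the function names `flatten_branches` and `retained_topics` and the sample branches are hypothetical.

```python
from collections import Counter


def flatten_branches(branches):
    """Turn a set of h-long category branches into one flat list of terms."""
    return [term for branch in branches for term in branch]


def retained_topics(branches, k):
    """Keep the terms that occur more than k times across all branches."""
    counts = Counter(flatten_branches(branches))
    return {term for term, n in counts.items() if n > k}


# Hypothetical 3-level branches assigned to the pages of one web domain.
branches = [
    ("Shopping", "Toys", "Dolls"),
    ("Shopping", "Toys", "Board Games"),
    ("Shopping", "Books", "Children"),
]
print(sorted(retained_topics(branches, k=1)))
```

Working on a flat list keeps the comparison to a simple count per term, which is cheap to repeat across many pages.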

For an embodiment of the method of the present invention, the above mentioned multiple end-user demographic types follow different standardized taxonomies.

For another embodiment, alternative or complementary to the one described in the paragraph just above, the method comprises defining, by an operator or automatically, one or more of the multiple end-user demographic types by providing a list of URLs that each of the one or more of said multiple end-user demographic types visits.

For an implementation of said embodiment, the method comprises providing said list of URLs automatically by importing them from a real-world user web browser (from recent history and cookies), and selecting as said web domain target a web domain target that comprises said list of imported URLs.

The present invention also proposes a system for cloud-based testing of web-domains. The system comprises a server having a computer processing unit for selecting a first end-user demographic type from a set of end-user profiles held in a cloud storage repository, and having an engine to generate adequate visiting pattern parameters for each targeted web domain or web page and for the general implementation of the method described above. The engine is preferably part of a cloud-based container.

Specifically, the engine is for:

-selecting a first end-user demographic type from a cloud storage repository that comprises multiple end-user demographic types relating to predetermined online behavioral parameters, and

-selecting a web domain target that comprises predefined web-sites and portals, and

-defining visiting pattern parameters based on said selected first end-user demographic type and said selected web domain target, and

-inputting said visiting pattern parameters into said web domain target and obtaining test results from the output information rendered by said web domain target, and

-automatically identifying advertising-related results from said obtained test results, and

-detecting online behavioral targeting by analyzing said advertising-related results based on predetermined threshold combinations or metrics based on domain match, topic match and frequency counts.

For an embodiment, the above mentioned server comprises a web container for controlling test input times and visiting pattern parameters so that the test results from the tested web-domain targets achieve the same rendering of information as end-user manual input would achieve.

For another embodiment, the engine comprises a behavior-targeting mode filter, said behavior-targeting mode filter comprising output means for sending a request to the tested web-page for disabling end-user online behavioral targeting and/or disabling tracking.

A further aspect of the present invention relates to a computer product for carrying out the computer-implemented steps of the method of the present invention, for example via an algorithm.

End-user demographic types can be referred to as "personas". The methodology advantageously includes creating different "child personas" so that it can attest, at unprecedented scale and fidelity, whether particular online Publishers, namely websites, portals, or content sites, target minors. In addition, unlike manual methods, it can continuously keep a watchful eye on all the above by running automated tests at regular intervals in the background. In case of specific complaints against specific Publishers, the method and system can be used to launch a large barrage of tests that will unequivocally verify whether a complaint is valid or not. Direct applicability in the regulation programs mentioned above is therefore possible. The several use cases and figures below constitute embodiments that can be grouped in various ways. For example:

A. Monitoring and certification of compliance with regulation for protecting sensitive demographic groups, e.g., children

The method and system can be used as a tool for automating compliance tests for regulation that bans tracking and targeting of sensitive demographic groups. For example, by creating different "child personas", the system can attest, at unprecedented scale and fidelity, whether particular online Publishers (web-sites, portals, content sites) target minors. In addition, unlike manual methods, it can continuously keep a watchful eye on all the above by running automated tests at regular intervals in the background. In case of specific complaints against specific Publishers, the present system and method can be used to launch a large barrage of tests that will unequivocally verify whether a complaint is valid or not.

B. Monitoring and certification of compliance with regulation for protecting sensitive personal data, e.g., related to health, religion, political beliefs, etc.

In addition to examples of regulation aimed at protecting specific demographic groups, other regulation protects sensitive data, independently of specific demographics. For example the European General Data Protection Regulation (GDPR) defines several categories of sensitive personal data such as:

(a) the racial or ethnic origin of the data subject;

(b) his political opinions;

(c) his religious beliefs or other beliefs of a similar nature;

(d) whether he is a member of a trade union;

(e) his physical or mental health or condition;

(f) his sexual life;

(g) the commission or alleged commission by him of any offence; or

(h) any proceedings for any offence committed or alleged to have been committed by him, the disposal of such proceedings or the sentence of any court in such proceedings.

The present system can be used to test for compliance with all the above. An end-user profile, e.g. a Persona, can be constructed to behave like, and leak information of, the above types. This can be achieved through an automated script that controls a browser. The script uses the browser to visit specific sites that pertain to the above-mentioned sensitive topics. Trackers present on these sites observe the visits of the browser and thus start profiling it as interested in these categories and topics. The present system can then scrutinize Publishers to detect whether they display advertisements pertinent to the above categories, thereby indicating collection and processing of such protected data.

C. Monitoring and certification of compliance with anti-discrimination and anti-algorithmic bias laws and best practices

Algorithms can discriminate in ways that are illegal or ethically provoking for consumers. For example, an online store that geolocates the IP address of a visitor and, based on that, displays a price which is higher or lower than the one for other customers because of the store's perception about the willingness of customers from a particular country to pay, violates the Service Directive (Art 20.2) of the European Union's Digital Single Market which dictates that there should not be price discrimination or geoblocking based on country of residence or origin within the EU.

Similarly if, during the check-out process, the e-commerce site notices a delivery address in a part of the city where fraud has been committed in the past (e.g. purchases with stolen or cloned credit cards), then the site may decide to deny selling the article. This practice may be justifiable from the stores' point of view, but it is clearly unfair and discriminatory against law abiding citizens that live in the same area.

Such discrimination can take place explicitly, i.e. be ingrained explicitly in the algorithm, or can be the result of implicit algorithmic discrimination taking place when an artificial intelligence/machine learning is left unsupervised to make arbitrary inferences based on observed past data. Examples of such incidents abound in the news and go way beyond e-commerce, touching virtually most online services such as news recommendation, content recommendation, employment search sites, sharing economy communities for housing, sharing cars and other goods.

Note:

The system and method of the present invention in use cases A-C can be used by both regulators examining specific complaints (or proactively monitoring the market) and online services themselves to audit their own algorithms for accidental implicit discriminatory behavior that is hard to detect otherwise.

For example, an employment site can define a "Woman" end-user profile, e.g. Persona, and then test whether it displays to this persona similar jobs with similar salaries as to Men Personas of similar qualifications. A housing short-term leasing site can define Personas of different ethnic or racial backgrounds and then monitor whether it targets such personas in a discriminatory manner, such as showing ads to one type and avoiding others, or steering specific types towards more or less expensive placements.

D. Use by the AdTech sector for increasing the credibility and effectiveness of own self-regulation programs

The online advertising sector, through its multiple trade bodies (IAB, DAA, EDAA, IAPP, FEDMA), has been very proactive in describing and publicizing self-regulation and best-practice programs around protection of data and protection of sensitive demographic groups. Undersigning companies vouch to respect the codes of such programs, which are often even more restrictive than the government-imposed regulation programs mentioned previously. One key challenge related to such self-regulatory programs is convincing the regulators, the consumers and their representatives that the undersigning companies indeed implement the strict restrictions prescribed by these programs. With the present system and method, an AdTech representative body that launches such a program can automate compliance tests on data and demographic-related targeting and thus ensure that all undersigning companies indeed implement its provisions. In this way the body will be in a better position to defend the effectiveness of its programs in front of regulators and consumers. For example, the present system can be used with or without opt-out from AdChoices and Do-Not-Track to verify whether opting out indeed stops the delivery of targeted offers. Examples of such programs:

- AdChoices of DAA: http://youradchoices.com/

- Privacy protections outlined within the Code of Ethics of the Digital Analytics Association (DAA)

- Privacy protections by standards organizations e.g. "Do-Not-Track" of the World Wide Web Consortium (W3C)

- Opt-out function defined by self-regulatory bodies, e.g. DAA, of the Network Advertising Initiative (NAI) and the Internet Advertising Bureau (IAB).

E. Use by Brands and Publishers to independently audit that advertising partners and ad delivery channels do not violate data protection regulation or put the reputation of the Brand/Publisher at risk with consumers.

Brands hire advertising partners (Ad desks, Demand Side Platforms, etc.) to implement advertising campaigns that will deliver their message to the intended audience. Brands, on the one hand want to reach the "right" audience but without risking breaking regulation or provoking their customers with questionable practices. For example, a pharmaceutical company or an insurance company wants to reach people that need their products and services but do so in a way that does not violate data protection regulation and self-regulation around sensitive personal data. At the same time ad delivery publishers have the incentive to present such ads to consumers so that they most likely act upon them: either click on the ad or purchase an item or service.

Despite the fact that Brands usually include clauses in their contracts with ad delivery partners to explicitly ban unlawful targeting, the incentive to engage in this practice, or just the accidental occurrence of such targeting, remains a danger. Therefore a Brand could use the method and system of the present invention to create an end-user profile, i.e. a Persona, with characteristics matching its advertised product and then verify whether the persona is indeed targeted using protected data. For example, a pharmaceutical company selling HIV-related drugs can build a persona that is "trained" by visiting discussion forums and information pages about HIV treatment. Then it could test different outlets, where its ads appear, to verify that its HIV persona does not see its ads more frequently than any other visitor, including a clean persona without history executed in parallel, as a placebo test, by the present method and system.

Brands have a "top-down" use case for the method and system of the present invention. Publishers get paid by ad delivery channels to "lease" space on the Publisher pages to display ads paid by Brands and their campaign. A Publisher can be held legally accountable for ads that break data protection regulation and get to appear on its site. Even if no legal repercussions are incurred, a Publisher can suffer from severe brand damage if found to be hosting illegal or offending targeted ads. Therefore, as with the use case with Brands, a Publisher can use the present method and system to proactively test and keep monitoring their advertising partners for compliance with regulatory and consumer expectations. For example, a health-related portal can create an HIV related Persona and keep monitoring to make sure that it does not get shown more HIV content than other visitors of the site.

In all the previous use cases, the present invention not only identifies advertisements that are suspected of breaching regulatory or self-regulatory restrictions but also identifies and reveals the chain of AdTech companies involved in the delivery of said ads. This allows regulators, self-regulation organizations, or contracting businesses (whether Brands or Publishers) to know which of their AdTech partners are responsible for each incident.

Block diagram:

The present invention can be broken down into three main functional blocks shown below:

1. Setup:

An operator gives inputs to the system of the present invention and selects the demographic types ("Personas") for which he wishes to test a number of Audited domains (news portals, kids-related web-sites, etc.) to verify whether the domains target said personas or not. The operator selects from predefined Personas that follow different standardized taxonomies of the AdTech sector (e.g., IAB taxonomy). Such taxonomies are used in the actual definition of advertising campaigns by brands and their ad delivery partners. In addition, the invention allows the operator to define his own Personas by providing a list of URLs that this Persona visits. This allows for extra flexibility in testing for online behavioral advertising (OBA) upon Personas that are not included in the standardised taxonomies or ones for which the operator wants to provide an alternate definition. A third option that the system offers for defining a persona is to import it from a real-world user browser. In this case the system imports the recent history and the cookies found in a real-world browser. The system then visits the web-sites found in the recent history during the subsequent Collection (also referred to as Training) phase. This allows the system to check automatically whether a real-world user would be targeted or not.

The operator specifies additional parameters that govern the operation of the system during the next functional block. Such parameters include "how many times to visit each page in the definition of a persona", "how many times to visit each Audit Page" in order to collect advertisements, etc.

Within the additional parameters, two optional cases must be highlighted: Do Not Track (DNT) and AdChoices Opt-Out. On the one hand, DNT is an HTTP header field that requests all the visited web sites to disable tracking for this browser. On the other hand, AdChoices Opt-Out allows disabling online behavioral advertising by any company included in the AdChoices list (http://youradchoices.com/). AdChoices is implemented through an appropriate cookie which, when detected, indicates the explicit wish of the user to be excluded from data collection and targeting. In actual browsers the cookie is set by the user by clicking on the AdChoices icon that accompanies an advertisement. In the present invention, the cookie is preferably set programmatically when the user selects AdChoices opt-out during the setup phase. The present invention makes use of the features mentioned above to offer the operator the possibility to perform more complex experiments and, thus, to compare results from the same "Personas" under different countermeasures against OBA. For example, once the algorithm detects OBA toward a certain persona, it can launch, either in parallel or in tandem, a replica of the experiment with DNT and AdChoices Opt-Out set, collect the results to be compared against the original experiment and thus reveal whether the involved companies truly implement these opt-out initiatives or not (see the Self-Regulation use case).
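The setup parameters and the opt-out replica described above can be sketched as a small configuration object. This is only an illustration: the class `ExperimentConfig`, its field names and the helper `opt_out_replica` are hypothetical. The `DNT: 1` request header is the standard way a browser signals Do Not Track; the AdChoices cookie is represented only by a flag because its name and value are not given in the text.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ExperimentConfig:
    """Operator-supplied parameters for one experiment (illustrative fields)."""
    persona: str
    visits_per_training_page: int
    visits_per_audit_page: int
    do_not_track: bool = False
    adchoices_opt_out: bool = False  # cookie details are not modelled here

    def http_headers(self):
        """Extra request headers implied by this configuration."""
        return {"DNT": "1"} if self.do_not_track else {}


def opt_out_replica(config):
    """Copy of an experiment with both countermeasures switched on."""
    return replace(config, do_not_track=True, adchoices_opt_out=True)


original = ExperimentConfig("child", visits_per_training_page=5, visits_per_audit_page=10)
replica = opt_out_replica(original)
print(original.http_headers(), replica.http_headers())
```

Running the original and the replica against the same Audited domains gives two result sets that differ only in the opt-out signals, which is what the comparison requires.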

Once all these parameters have been defined and the user launches the experiment, the present method automatically creates on the cloud a software configuration that faithfully imitates a real human user sitting behind his personal computer and surfing the web, with interests similar to the chosen Persona. The details of this "Cloud Implementation" are described in the paragraphs on the extraction of advertising landing pages without clicking on links. This method is an integral contribution of this patent. We will refer to cloud implementation as the "Container" of the Persona.

2. Collection or Training:

During the collection or training phase, the Container starts visiting the web-pages in the definition of the Persona. For example, for a Persona corresponding to an underage kid, the Container will visit web-sites of popular children's TV shows, computer games, video distribution sites, etc. During each of these visits the Container fully renders each page and executes all of the code in it, including tracking and advertising code. In this way, advertisers and trackers start "seeing" the Container visiting children-related content and therefore start building a corresponding profile using cookies and other tracking mechanisms that are opaque to our method.

After a number of visits, governed by the input parameters, the Container starts visiting also the Audited domains (in tandem or interchangeably). Special care is taken to make sure that the Container looks like a human user instead of an automated bot/script. For example, inter-visit times are matched to human inter-visit time scales, pages are fully rendered, and "user-agent" is set to values indicating popular real world configurations of browsers and operating systems. During each visit to an Audited Page, the Container identifies all advertisements included in the page as well as the URL of the advertised product or service. The details of this complex operation are defined in "Extraction of advertising landing pages without clicking on links".
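A visiting plan with human-like pacing might be sketched as follows. The function `plan_visits`, the 5 to 60 second gap range, the warm-up scheme and the placeholder user-agent strings are assumptions made for illustration; the text does not specify concrete values.

```python
import random

# Placeholder strings; a real deployment would use user-agent values of
# popular browser and operating system combinations.
USER_AGENTS = [
    "ExampleBrowser/1.0 (ExampleOS 10)",
    "OtherBrowser/2.0 (OtherOS 11)",
]


def plan_visits(training_urls, audit_urls, warmup_rounds, rng,
                min_gap=5.0, max_gap=60.0):
    """Return (url, wait_seconds, user_agent) tuples for one Container.

    Training pages are visited for `warmup_rounds` rounds first; after that,
    each Audited page is followed by a further training page.
    """
    user_agent = rng.choice(USER_AGENTS)  # one browser identity per Container
    sequence = list(training_urls) * warmup_rounds
    for i, audit_url in enumerate(audit_urls):
        sequence.append(audit_url)
        sequence.append(training_urls[i % len(training_urls)])
    # Randomised gaps stand in for human inter-visit times.
    return [(url, rng.uniform(min_gap, max_gap), user_agent) for url in sequence]


plan = plan_visits(
    ["https://kids.example/a", "https://kids.example/b"],
    ["https://news.example/"],
    warmup_rounds=2,
    rng=random.Random(0),
)
for url, wait, _ in plan:
    print(f"wait {wait:5.1f}s then visit {url}")
```

Passing the random generator in as a parameter keeps a run reproducible, which helps when a replica experiment must follow the same plan.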

Once an advertisement has been detected and its URL extracted, Topics are assigned. These Topics are the means by which we compare different similarity metrics between the collected advertisements and the web-sites visited initially by the Container. The amount of the said overlap is a prime indicator of OBA as described next. Topic extraction includes several innovative features that are described in "Advancement over prior art".

3. Detection:

Detection of OBA is achieved by means of evaluating various metrics such as Domain Match, Topic Match, and Frequency counts.

Domain Match: For each persona this metric indicates the number of times that the domain of a web-page visited during a training phase is re-encountered in the URL of an advertisement collected at an Audited page. Domain Match captures "Retargeted" advertisements as well as other types of behavioral targeting.
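The Domain Match count can be sketched in a few lines of Python. The sketch compares host names after removing a leading "www."; this normalisation, the function names and the sample URLs are assumptions, and matching on registrable domains (for example treating `shop.toys.example` and `toys.example` as the same) is not handled.

```python
from urllib.parse import urlparse


def domain_of(url):
    """Host name of a URL, lower-cased and without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def domain_match(training_urls, ad_landing_urls):
    """Count collected ads whose landing domain was visited during training."""
    trained = {domain_of(u) for u in training_urls}
    trained.discard("")  # ignore URLs without a host
    return sum(1 for ad in ad_landing_urls if domain_of(ad) in trained)


training = ["https://www.toys.example/catalog", "https://cartoons.example/"]
ads = [
    "https://toys.example/offer?id=1",   # retargeted: domain seen in training
    "https://bank.example/loans",        # unrelated domain
    "https://www.toys.example/sale",     # retargeted again
]
print(domain_match(training, ads))
```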

Topic Match: This metric captures behavioral targeting that goes beyond an exact match between the domain of a page visited in a training phase and the domain of a collected ad. For example, if a Container pretends to be a child and visits children-related sites, then under various types of behavioral targeting the Container may collect children-related advertisements from domains that do not belong to any of the domains visited during training (ads from the visited domains are captured by the Domain Match metric). Such ads do not increase Domain Match, yet they are still targeted. We capture such ads through Topic Match. Topic Match is calculated by listing all the Topics obtained during the visits to different pages during the training phase and then looking for recurrence of the same Topics in the landing URLs of collected advertisements. Coming back to the previous example, there may be no Domain Match for the children-related ads collected, but there will be Topic Match, since the content is children-related in both training and collected ads and thus there will be a lot of overlap in terms of assigned Topics.
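A minimal sketch of the Topic Match computation is given below. It assumes that Topics have already been assigned to each training page and to each ad landing URL, and it represents them as sets of strings; the function name `topic_match` and the sample data are hypothetical.

```python
from collections import Counter


def topic_match(training_topics, ad_topics):
    """Count, per Topic, the collected ads in which a training Topic recurs.

    training_topics: mapping of training page URL -> set of Topics
    ad_topics:       mapping of ad landing URL    -> set of Topics
    """
    trained = set().union(*training_topics.values())
    recurrences = Counter()
    for topics in ad_topics.values():
        recurrences.update(topics & trained)
    return recurrences


training_topics = {
    "https://cartoons.example/": {"Children", "TV"},
    "https://games.example/": {"Children", "Games"},
}
ad_topics = {
    "https://toyshop.example/": {"Children", "Shopping"},  # new domain, same Topic
    "https://bank.example/": {"Finance"},
}
print(topic_match(training_topics, ad_topics))
```

In this sample the toy shop ad shares no domain with the training pages, so it would not raise Domain Match, but it does recur on the "Children" Topic.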

Frequency Counts: This metric sums the number of appearances of a certain Topic across all the ads seen by a specific Persona in the different Audited domains. It then presents the most popular Topics graphically to the user, in decreasing order of count. The highest counts indicate the most common Topics of advertisements collected for a Persona. When the frequent Topics discovered seem relevant to the Persona, this is an initial indication of targeting. When these counts are much higher than the counts of a reference "Clean" Persona that we execute alongside the Persona Container, the probability of OBA is even higher. The Clean Container visits the same Audited Pages and collects ads. Contrary to the main Container, though, the Clean Container has an empty history, i.e. it does not visit any training pages and thus corresponds to a Clean Persona (a new browser in a new personal computer). When a user-defined threshold of difference in frequency count on a Topic between a Persona and a Clean Persona is exceeded, the system informs the operator so that he can investigate the Topic more closely through the Topic Match and Domain Match metrics as well as by observing the stored rendered pages.

Automating the detection process: If there are more than X Domain Matches and more than Y Topic Matches, and the Frequency difference between Persona and Clean is more than Z%, then flag an Audited domain as suspicious for targeting on the Topic upon which the three metrics have been evaluated.
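The Frequency Counts metric and the automated X/Y/Z rule can be sketched as follows. The text does not define how the percentage difference is computed; this sketch assumes it is measured relative to the Clean Persona count, and all function names, sample data and threshold values are hypothetical.

```python
from collections import Counter


def frequency_counts(ads_topics):
    """Sum Topic appearances over all ads (one set of Topics per ad)."""
    counts = Counter()
    for topics in ads_topics:
        counts.update(topics)
    return counts


def frequency_difference_pct(persona_count, clean_count):
    """Percentage by which the Persona count exceeds the Clean count (assumed definition)."""
    if clean_count == 0:
        return float("inf") if persona_count > 0 else 0.0
    return 100.0 * (persona_count - clean_count) / clean_count


def is_suspicious(domain_matches, topic_matches, persona_count, clean_count,
                  x, y, z_pct):
    """Flag when all three thresholds X, Y and Z% are exceeded."""
    return (domain_matches > x
            and topic_matches > y
            and frequency_difference_pct(persona_count, clean_count) > z_pct)


persona_ads = [{"Children", "Toys"}, {"Children"}, {"Finance"}]
clean_ads = [{"Children"}, {"Finance"}, {"Travel"}]
persona_freq = frequency_counts(persona_ads)
clean_freq = frequency_counts(clean_ads)

print(persona_freq.most_common())  # Topics in decreasing order of count
print(is_suspicious(3, 2, persona_freq["Children"], clean_freq["Children"],
                    x=2, y=1, z_pct=50))
```

Requiring all three conditions together makes the flag conservative: a high frequency difference alone is reported to the operator but does not mark the domain as suspicious.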

Detection of the involved AdTech companies: For each collected and analysed ad, the system can also reveal the chain of AdTech companies involved in its delivery. This is a very important function, since it makes it possible to know exactly which one of the customer's AdTech partners/contracts is responsible for each incident. This allows the customer to act upon the findings of the present system and attribute them to specific perpetrators.

In contrast to [2], the method of the present invention conducts active experiments to detect whether specific demographics are targeted or not. This is carried out, as explained above, by training "personas". In addition, the method of the present invention is generic, in the sense that it considers the web-site and its advertising partners as a black box and tests only for correlations between input and output. Therefore, it can detect targeting independently of the specific mechanisms implemented to drive it, which is what [2] is trying to locate through code analysis.

As stated above, [4] describes a monitoring system that assumes that the conditions and functions that are prohibited for the kid are clearly defined. The present invention is rather about detecting the presence of such conditions for the specific case of targeted advertising.

Brief description of the figures

In the following, some preferred embodiments of the invention will be described with reference to the enclosed figures. They are provided only for illustration purposes without however limiting the scope of the invention.

Figure 1. A non-limiting example of the three phases according to another preferred embodiment of the invention.

Figure 2. A block diagram of embodiments allowing identification of advertising-related information without committing click fraud.

Figure 3. Example of set of h-level topic branches for www.amazon.com.

Figure 4. An h-level term hierarchy used by our method to produce a flat list of terms that characterise a page.

Figure 5. Example of set of m-level (m=2) topic branches for www.amazon.com.

Description of embodiments

Figure 1 describes a non-limiting example of the three phases of the method of the present invention, according to a preferred embodiment, i.e. of a setup, a collection and a detection phase, which are briefly described below.

SETUP. Initialization and configuration of the experiment.

- Configuration setup: technical/algorithmic configuration parameters. They affect the accuracy and the duration of the experiment.

- Advanced parameters: application parameters to check the visiting domains regarding the advertisements they serve and the way the personas will be built.

COLLECTION. Build the data corpus needed for the experiment: extract landing pages, and collect and characterize the visited web pages by topic.

- Extraction of landing pages: find the web pages that an advertisement is pointing to.

- DDBB Topics: The database of topics found on various taxonomy services describing a hierarchical tree of interests of variable depth classified into categories and subcategories.

- Topic assignment: Based on the taxonomies, the algorithm assigns topics to both advertisement landing pages and advertisement training pages.

DETECTION. Analyze the collected data and detect online behavioral targeting and the agents related to it, based on the predefined parameters.

- OBA detection:

The algorithm detects online behavioral targeting by analyzing the advertising-related results based on predetermined threshold combinations or metrics based on domain match, topic match and frequency counts.

- Detection of companies involved. The algorithm builds the chain of intermediaries involved to serve an advertisement by analyzing the HTML elements surrounding the advertisement on the web page.

Cloud-based embodiments: Advancement over prior art by a cloud-based implementation

Unlike the prior art, which is mainly based on a desktop application, the present invention is preferably built for operations in a cloud-based virtualized infrastructure.

The present method and system are architected with cloud operations in mind: instead of a monolithic deployment onto one or more virtual or physical servers, each component is built as a Docker (https://www.docker.com) container, that is, an isolated, self-contained unit guaranteed to work independently of the host server's configuration. More importantly, Docker containers can be dynamically instantiated and destroyed in just a few seconds, allowing the system to consume minimal computing resources when demand is low while expanding almost instantly to multiple replicas to support sudden load increases. Thus, the present invention is preferably designed to maintain the highest quality of service level while minimizing operational costs.

Furthermore, the present invention is preferably designed to work with a so-called "headless" browser (i.e. one lacking a graphical user interface) in a cloud environment using the PhantomJS framework. Its host servers advantageously have no screen or graphics cards, and yet the system fully simulates the behaviour of a browser as if operated by a human user on her laptop or desktop computer. The system, using the headless browser, preferably "imitates" real operating systems/browsers as well as the behaviour of human users. This is achieved by visiting different web-sites at carefully selected time intervals that are chosen to match the statistical behaviour of the inter-visit times of real human users. To the same end, the system also issues "Page down" commands to scroll further down a page, which renders all of its content and also increases the perception that it is a human user and not a bot.
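One way to picture the human-like visiting pattern is a visit plan whose waiting times are drawn from inter-visit times observed for real users. The sketch below only builds such a plan; it does not drive a browser. The sampling strategy, the `plan_visits` helper, the range of "Page down" presses and the sample values are all assumptions made for illustration, since the text does not specify them.

```python
import random


def plan_visits(urls, observed_gaps_s, page_downs=(2, 6), seed=None):
    """Build a visit plan with human-like pauses and scroll counts.

    observed_gaps_s: inter-visit times (seconds) measured for real users
                     (hypothetical input); sampling from them preserves
                     their empirical distribution.
    page_downs: inclusive range for the number of "Page down" presses.
    """
    rng = random.Random(seed)
    plan = []
    for url in urls:
        plan.append({
            "url": url,
            "wait_before_s": rng.choice(observed_gaps_s),
            "page_downs": rng.randint(*page_downs),
        })
    return plan


# Hypothetical measurements and training pages.
gaps = [4.0, 7.5, 12.0, 31.0, 95.0]
plan = plan_visits(
    ["https://kids-games.example", "https://cartoons.example"],
    gaps,
    seed=1,
)
for step in plan:
    print(step)
```

A headless-browser driver would consume this plan by sleeping for `wait_before_s`, loading the URL and then sending the planned number of "Page down" key presses.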

No-ad-clicking embodiments: Extraction of advertising landing pages without clicking on links

For an advertisement auditing system according to the present invention to work efficiently, it must be able to visit tested or audited pages and extract advertisement-related information. Advertisements are easy for a human user to spot, but their automated extraction is far from trivial. Indeed, advertising companies have multiple ways to embed an advertisement banner in a web-page. They also have incentives to make the process complex and dynamic in order to evade ad blocking software. In [1], advertisement landing pages were identified by simply clicking on images. This, however, can be perceived as click fraud if done at large scale. In the present system, advertisement landing pages are detected without having to click on any ads. This is achieved by the algorithm shown in Figure 2, which implements the method of the present invention. In Figure 2, the box names refer to HTML elements whose meaning can be understood by the skilled person with basic knowledge of web development. In particular, Figure 2 depicts an algorithm to detect web advertisements and their corresponding landing pages (the destination URL when a user clicks on the advertisement). The algorithm is split into two main sections depending on the placement of the advertisement. The first section, marked as shaded area (A), describes the steps involved when the advertisement is rendered alongside the content of the visited URL. The second section, marked as shaded area (B), describes the steps involved when the advertisement is rendered in an isolated environment called an iFrame.

The algorithm starts as soon as the visited website is fully rendered in the browser and the "window.onload" event is triggered. First, the algorithm identifies the context in which it is executed. When the code is executed within the context of the visited website, the (A) block of the algorithm is executed. In short, the (A) part of the algorithm collects all image objects of the website and filters out those that are not advertisements, based on a set of predefined attributes. The images that are classified as advertisements are then analyzed further to extract the landing page, which is usually available in an HTML <a> tag surrounding the image.
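The (A) part can be approximated offline on static HTML, as in the sketch below. It is not the in-browser code of Figure 2. The text does not list the "predefined attributes" used for filtering, so this sketch assumes a small set of common banner sizes; `AD_SIZES`, `AdImageParser` and the sample markup are hypothetical.

```python
from html.parser import HTMLParser

# Assumption: images with one of these (width, height) pairs are treated as ads.
AD_SIZES = {("300", "250"), ("728", "90"), ("160", "600")}


class AdImageParser(HTMLParser):
    """Collect the href of the <a> tag surrounding each ad-like image."""

    def __init__(self):
        super().__init__()
        self._anchor_stack = []   # hrefs of the currently open <a> tags
        self.landing_pages = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self._anchor_stack.append(attrs.get("href"))
        elif tag == "img":
            size = (attrs.get("width"), attrs.get("height"))
            # Keep only ad-like images that sit inside a link.
            if size in AD_SIZES and self._anchor_stack and self._anchor_stack[-1]:
                self.landing_pages.append(self._anchor_stack[-1])

    def handle_endtag(self, tag):
        if tag == "a" and self._anchor_stack:
            self._anchor_stack.pop()


page = (
    '<div>'
    '<a href="https://shop.example/offer">'
    '<img src="https://ads.example/banner.png" width="300" height="250"></a>'
    '<a href="/about"><img src="/logo.png" width="120" height="40"></a>'
    '</div>'
)
parser = AdImageParser()
parser.feed(page)
print(parser.landing_pages)
```

The landing page is read from the markup alone, so no advertisement is ever clicked.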

When the code is executed inside an iFrame, the (B) part of the algorithm is executed. This part involves multiple tests to identify advertisements. This is necessary since advertisements rendered in iFrames are usually loaded into the website dynamically using different techniques, depending on the advertising network. To cover the different rendering techniques, the algorithm checks whether the iFrame contains any images, canvas elements or visible HTML <div> elements. If any of them exists and belongs to an advertisement (based on filtering element features), the algorithm tries to extract the landing page of the advertisement. Since iFrame ads are more complex, the algorithm uses several techniques to detect the landing page.

Overall, the algorithm uses four different steps that are executed in a specific order, as presented in Figure 2. First, the algorithm checks whether the landing page exists in a surrounding HTML <a> tag. If no landing page is detected, the algorithm moves to the second step, which detects any "onClick" event that may be attached to the advertisement. Again, if no landing page is detected, the algorithm moves to the third and fourth steps. The third step parses any injected JavaScript code to detect whether the landing page is available there. The fourth step stores the data when an advertisement has been detected; otherwise the algorithm exits directly.
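The ordered fallback of the first three steps can be sketched as below, with the fourth step (storage) left to the caller. The sketch assumes the ad element has already been reduced to a plain dictionary; the keys `anchor_href`, `onclick` and `injected_script`, the URL pattern and the helper names are hypothetical.

```python
import re

# Assumption: a landing page appears as an absolute http(s) URL in the text.
URL_PATTERN = re.compile(r"https?://[^\s'\"<>)]+")


def first_url(text):
    """Return the first absolute URL found in a piece of text, or None."""
    match = URL_PATTERN.search(text or "")
    return match.group(0) if match else None


def extract_landing_page(ad):
    """Try the three extraction steps in order and stop at the first hit."""
    steps = (
        lambda: ad.get("anchor_href"),                 # step 1: surrounding <a> tag
        lambda: first_url(ad.get("onclick")),          # step 2: onClick handler
        lambda: first_url(ad.get("injected_script")),  # step 3: injected JavaScript
    )
    for step in steps:
        url = step()
        if url:
            return url
    return None  # nothing found: the caller skips the storage step


# Hypothetical ad elements.
ad_with_anchor = {"anchor_href": "https://shop.example/offer"}
ad_with_onclick = {"onclick": "window.open('https://brand.example/promo')"}
ad_with_script = {"injected_script": 'var clickTag = "https://travel.example/deal";'}
ad_without_link = {"injected_script": "var x = 1;"}

for ad in (ad_with_anchor, ad_with_onclick, ad_with_script, ad_without_link):
    print(extract_landing_page(ad))
```

Keeping the steps in a fixed order means the cheapest and most reliable source, the surrounding link, always wins when it is present.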

Embodiments that allow for scalable topic assignment

Topic assignment is a preferred implementation step in the operation of the present invention. It refers to the ability of the system to assign topics to both advertisement landing pages and advertisement training pages. The amount of topic overlap between the two is a direct measure of online behavioral targeting. In [1], topics were assigned by querying in real-time each one of the multiple taxonomy services mentioned in the article (Google AdWords, McAfee, Cyren) and downloading from each one a set of h-level topic branches (e.g. Computers & Electronics > Consumer Electronics > TV & Video Equipment > Televisions > LCD TVs) assigned to a particular web-page (see table below). Most taxonomy services include a large number of topics to cover the broad spectrum of users' interests. This high granularity of interests is classified into categories and subcategories in a hierarchical tree of variable depth, from 5 to 8 depth levels. Below we give an example of the Topic branches returned for domain www.amazon.com.

Figure 3 shows an example of set of h-level topic branches for www.amazon.com.

The above method was adequate for a research prototype like the one in [1] but is not scalable for a commercial service used by hundreds, if not thousands, of concurrent users: first, because the method had to query the online taxonomy services again and again for the same pages, and secondly because the comparison of topics involved complex level-by-level processing across all h levels of the branch.

Figure 4 shows an h-level term hierarchy used by the method of the present invention to produce a flat list of terms that characterise a page. Specifically, Figure 4 depicts the tree branch that is up to the h level of a topic hierarchy, starting from the level 1 (L1) topic. Each level describes topics at a higher level of abstraction than the topics below it. The algorithm decides if a topic at a specific level will be maintained and kept in memory for the rest of the process.

Additionally, having a too fine-grained list of terms, i.e. a high value for the parameter "h" indicating the depth of the taxonomy, makes it unlikely to discover a Topic overlap, even if one exists. The reason is that Level-h leaf topics are too specific whereas L1 or L2 topics might be too generic. Therefore pages with similar content may have completely different h-level leaf topics. Conversely, completely irrelevant pages may share L1 and L2 Topics along their h-level branch.

The present embodiment solves both the scalability and the accuracy problems in Topic overlap detection through an optimised algorithm.

The algorithm effectively transforms a set of long h-branches of a category tree into a flat list of terms upon which our metrics are computed. It also "caches" the association of a particular URL with its produced flat list of topics. This permits the system to retrieve the terms again and again without having to communicate with the online taxonomy service.

The algorithm that implements the method of the present invention keeps in the Topics list of a URL a subset of its m-level topics. The selection is performed as follows. For each m-level topic the algorithm scans its m-level subtree (i.e., the tree rooted at this topic) and counts the number of times that the m-level Topic (the actual string) re-appears as a sub-string in other lower layer Topics. This happens naturally since the strings of lower layers tend to be longer than strings of higher layer topics. The reason is that lower layer topics are more specific and thus involve longer strings. If the m-level topic is encountered more than "k" times then the topic is added to the list of topics used by the present embodiment to compute metrics for a particular URL. The philosophy of the algorithm is that if an m-level topic re-appears frequently in its subtree then it is indeed representative enough of the subtree and can thus be used instead of very low or very high level topics that are either too specific or too generic. In the present implementation and embodiment of the system we use m=2 or m=3 depending on the particular taxonomy following a set of experiments run by the operator to fine-tune the selection of "m". The topics that are selected by the algorithm that implements the method of the present invention are cached for future use as described next.
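The selection rule can be sketched as follows, assuming each taxonomy branch is available as a list of topic strings ordered from level 1 downwards. The branches below are invented for illustration and are not the content of Figure 5. The function names are hypothetical, and the sketch assumes that m-level labels are unique and that each distinct node of the subtree is counted once.

```python
from collections import defaultdict


def topic_weights(branches, m=2):
    """For each m-level topic, count the nodes of its subtree whose label
    contains the m-level topic string as a sub-string."""
    subtree_nodes = defaultdict(set)
    for branch in branches:
        if len(branch) < m:
            continue
        root = tuple(branch[:m])
        subtree_nodes[root]  # register the topic even if it has no descendants
        # Every longer prefix of the branch is a distinct node below the root.
        for depth in range(m + 1, len(branch) + 1):
            subtree_nodes[root].add(tuple(branch[:depth]))
    return {
        root[-1]: sum(1 for node in nodes if root[-1] in node[-1])
        for root, nodes in subtree_nodes.items()
    }


def select_topics(branches, m=2, k=3):
    """Keep the m-level topics encountered more than k times in their subtree."""
    return [t for t, w in topic_weights(branches, m).items() if w > k]


# Hypothetical branches returned by a taxonomy service for one URL.
branches = [
    ["Shopping", "Consumer Electronics", "Consumer Electronics Accessories"],
    ["Shopping", "Consumer Electronics", "Consumer Electronics Accessories",
     "Consumer Electronics Cables"],
    ["Shopping", "Consumer Electronics", "Consumer Electronics Repair"],
    ["Shopping", "Consumer Electronics", "Consumer Electronics Rental"],
    ["Shopping", "Apparel", "Footwear"],
    ["Shopping", "Apparel", "Apparel Services"],
]
weights = topic_weights(branches, m=2)
selected = select_topics(branches, m=2, k=3)
print(weights, selected)
```

With k=3, only the m-level topic that recurs often enough in its own subtree ends up in the flat list that is later cached for the URL.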

Below we use an example of the weights of L2 topics computed for www.amazon.com. In this example if we set the threshold "k" to value 3 then the list of Topics that would be cached would be "Consumer Electronics" and "Consumer Resources".

Figure 5 shows an example of set of m-level (m=2) topic branches for www.amazon.com.

Embodiment allowing scalability: Caching instead of online querying of taxonomy services

Scalability is one of the main goals or objects of the present invention. Each experiment consists of dozens of different web-sites, and performing online queries to the taxonomy services for each of them is inefficient in terms of both time and resources. Therefore, the system includes an incremental database which contains the flat lists of topics mentioned above, to reduce the processing time and increase the scalability of the system. The cached database is re-computed in the background to guarantee that it always reflects an up-to-date compressed version of the online taxonomies used as its source. When a new URL is involved in the processing steps of the present method, the page is first looked up in the cached version of the compressed taxonomy, and only if there is no record for the URL is the online taxonomy accessed. Following the access, a record is added to the cached taxonomy as described by the algorithm above.
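The lookup order described above (cache first, online taxonomy only on a miss, then store the new record) can be sketched as follows. An in-memory dictionary stands in for the incremental database, and `fetch_online` stands in for the taxonomy service plus the flattening algorithm; both, like the class name `TopicCache`, are assumptions for illustration. The background re-computation of the cache is not shown.

```python
class TopicCache:
    """Cache of the flat Topic list produced for each URL."""

    def __init__(self, fetch_online):
        self._fetch_online = fetch_online  # called only on a cache miss
        self._store = {}                   # stand-in for the incremental database
        self.online_queries = 0

    def topics_for(self, url):
        if url not in self._store:
            # No record for this URL: access the online taxonomy once
            # and add the resulting record to the cache.
            self.online_queries += 1
            self._store[url] = list(self._fetch_online(url))
        return self._store[url]


def fake_taxonomy_service(url):
    """Hypothetical replacement for an online taxonomy query."""
    return ["Consumer Electronics"] if "shop" in url else ["News"]


cache = TopicCache(fake_taxonomy_service)
first = cache.topics_for("https://shop.example")
second = cache.topics_for("https://shop.example")   # served from the cache
other = cache.topics_for("https://daily.example")
print(first, second, other, cache.online_queries)
```

Repeated experiments over the same pages then cost one online query per URL rather than one per visit.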

Preferred embodiment: Extraction of advertisement companies involved in the delivery of an advertising-related information

To build the chain of intermediaries involved in serving an advertisement, the algorithm starts from the HTML element that is identified as an advertisement at the end of the advertisement detection algorithm explained above. The first ad domain in the intermediary chain is the domain that is serving the actual advertisement element (e.g. when the advertisement is an image, the domain can be extracted from the image source attribute <img src=...>). The second domain in the chain is the owner domain of the iFrame that the advertisement is enclosed in, which can be extracted from the "window.url" attribute of the iFrame. When there are multiple nested iFrames, we aggregate all "window.url" values in the chain recursively. Finally, the last domain in the chain is the destination URL (landing page) that the user will end up visiting if he clicks on the advertisement. The landing page can be detected during the execution of the ad detection algorithm explained above.

A more concrete embodiment is illustrated by an example in which the list of intermediaries is obtained for an advertisement served on www.economist.com. The list of intermediaries includes the following domains:

1. http://www.economist.com

2. https://s0.2mdn.net

3. https://tpc.googlesyndication.com

4. https://www.ca.com

The first domain, www.economist.com, is the publisher, which is an electronic newspaper. The second domain, s0.2mdn.net, is a domain owned by Google which is used for loading ad content for http://doubleclick.com from the Google CDN (content delivery network). The third domain, tpc.googlesyndication.com, is also owned by Google and is again used to load advertisement content, similarly to the s0.2mdn.net domain above. Finally, the fourth domain, www.ca.com, is the destination URL that the user will end up visiting if he clicks on the advertisement. In more complex examples, more than one ad network may appear in the chain of URLs that we collect recursively, as the networks re-auction iFrames among themselves.
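A short sketch of how such a chain could be assembled once the individual URLs are known. The text does not say which of the two Google domains served the image and which owned the iFrame, so the assignment below is an assumption, as are the URL paths and the helper names `origin` and `build_chain`.

```python
from urllib.parse import urlsplit


def origin(url):
    """Reduce a full URL to scheme://host."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"


def build_chain(publisher_url, ad_src, iframe_urls, landing_page):
    """Order the domains involved in delivering one advertisement.

    iframe_urls: URLs of the enclosing iFrames, collected recursively
                 from the innermost frame outwards.
    """
    chain = []
    for url in (publisher_url, ad_src, *iframe_urls, landing_page):
        domain = origin(url)
        # Skip consecutive repeats, e.g. nested iFrames of the same network.
        if not chain or chain[-1] != domain:
            chain.append(domain)
    return chain


chain = build_chain(
    publisher_url="http://www.economist.com/news",
    ad_src="https://s0.2mdn.net/ads/banner.png",                      # assumed image source
    iframe_urls=["https://tpc.googlesyndication.com/frame.html"],     # assumed iFrame owner
    landing_page="https://www.ca.com/offer",
)
print(chain)
```

Under these assumptions the result reproduces the four-entry list of the example, from the publisher to the landing page.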

The ability to reveal the companies involved in the delivery of a particular ad, especially when it is offending or appears to break regulation/self-regulation, is another very important advancement over [1], which limited itself to identifying potentially targeted ads but not the AdTech companies involved in their delivery.

Alternative embodiments include a method for cloud-based testing of web-sites by:

-selecting a set of end-user profiles from a cloud storage repository; and

-generating a set of test variables or parameters corresponding to compliance rules based on said selected end-user profiles; and

-generating a set of test inputs for predetermined web-pages based on said test variables or parameters; and

-inputting said generated test inputs into predetermined web-domain targets; and

-obtaining a set of web-page test results from the outputs of said generated test inputs;

-automatically extracting end-user online behavioural-targeting variables by identifying advertising-related information delivered from said web-page as test results; and

-detecting end-user online behavioural targeting by analyzing said end-user behavioural-targeting variables from said web-domain page test results.

Preferably, the method also sends a request to the tested web-page for disabling end-user online behavioural targeting. It may also send a request to the tested web-domain or page for disabling end-user tracking.

The method preferably detects end-user online behavioural targeting by analyzing metrics based on domain match, topic match or frequency counts.

Preferably, identifying advertising-related information delivered from said web-page as test results includes analyzing images, selecting the images related to advertising and analyzing tags around the image to identify landing pages. It may also include identifying an end-user clickable button or event attached to the advertising-related information, or even identifying and parsing code injected by the tested web-page.

The set of end-user profiles from a computer memory or cloud-based repository is preferably selected from predetermined demographic groups, based on age, race, or ethnic parameters, in particular children-related profiles or women-related profiles. The method advantageously includes discovering topic overlap of topics from tested web-pages, wherein a specific topic L is maintained and kept in memory when it is encountered more than K times in the L subtree, by transforming a set of h-long branches of a category tree into a flat list of terms and computing threshold comparisons. The system for cloud-based testing of web-pages preferably comprises a computer processing unit for selecting a set of end-user profiles from a cloud storage repository, and an engine. Cloud-based containers represent a useful embodiment.

Claims

1. - A cloud-based method for testing web domains, comprising:

-a setup step including selecting a first end-user demographic type from a cloud storage repository that comprises multiple end-user demographic types relating to predetermined online behavioral parameters, and

-said setup step further including selecting a web domain target that comprises predefined web-sites and portals,

-said setup step further including defining visiting pattern parameters based on said selected first end-user demographic type and said selected web domain target, and

-a data collecting step that inputs said visiting pattern parameters into said web domain target and obtains test results from the output information rendered by said web domain target, and

- the step of automatically identifying advertising-related results from said obtained test results at the previous data collecting step, and

- the step of detecting online behavioral targeting by analyzing said advertising-related results based on predetermined threshold combinations or metrics based on domain match, topic match and frequency counts.

2. - The method according to claim 1, wherein the setup stage includes sending a request to the tested web-domain target for disabling end-user online behavioural targeting.

3. - The method according to claim 1, wherein the setup stage includes sending a request to the tested web-domain target for disabling end-user tracking.

4. - The method according to claim 1, wherein the step of automatically identifying advertising-related results includes analyzing images, selecting the images related to advertising and analyzing tags around the image to identify landing pages.

5. - The method according to claim 4, further identifying and parsing code injected by the web domain target.

6. - The method according to claims 4 or 5, further identifying an end-user clickable button or event attached to the advertising-related information.

7.- The method according to claim 1, wherein said multiple end-user demographic types relating to predetermined online behavioral parameters include predetermined demographic groups, based on age, race, or ethnic parameters, in particular, children-related profiles, women-related profiles.

8.- The method according to claim 1, including discovering topic overlap of topics from the output information rendered by said web domain targets, wherein a specific topic L is maintained and kept in memory, when it is encountered more than K times in the L subtree, by transforming a set of h-long branches of a category tree into a flat term list of terms and computing threshold comparisons.

9.- The method according to any of the preceding claims, wherein the testing is carried out during predetermined time patterns so as to monitor said web domain targets.

10.- The method according to any of the previous claims, wherein said multiple end-user demographic types follow different standardized taxonomies.

11.- The method according to any of the previous claims, comprising defining, by an operator or automatically, one or more of said multiple end-user demographic types by providing a list of URLs that each of the one or more of said multiple end-user demographic types visits.

12. - The method according to claim 11, comprising providing said list of URLs automatically by importing them from a real world user web browser, and selecting as said web domain target a web domain target that comprises said list of imported URLs.

13. - A system for cloud-based testing of web-domains ; said system comprising a server having a computer processing unit for selecting a first end-user demographic type a set of end-user profiles from a cloud storage repository ; and characterized by an engine for:

- selecting a first end-user demographic type from a cloud storage repository that comprises multiple end-user demographic types relating to predetermined online behavioral parameters, and

-selecting a web domain target that comprises predefined web-sites and portals, and

-defining visiting pattern parameters based on said selected first end-user demographic type and said selected web domain target, and

14. - The system according to claim 13, wherein said server comprises a web container for controlling test input times and visiting pattern parameters so that the test results from the tested web-domain targets achieve the same rendering of information as end-user manual input would achieve.

15. - The system according to claim 13, wherein the engine comprises a behavior-targeting mode filter, said tracking-mode filter comprising output means for sending a request to the tested web-page for disabling end-user online behavioral targeting and/or disabling tracking.

16. - A computer product for carrying out the computer-implemented steps of any of claims 1 to 12.