CN108171074B

CN108171074B - Web tracking automatic detection method based on content association

Info

Publication number: CN108171074B
Application number: CN201711282970.5A
Authority: CN
Inventors: 杨明; 周佳欢; 罗军舟; 吴文甲; 凌振
Original assignee: Southeast University
Current assignee: Southeast University
Priority date: 2017-12-07
Filing date: 2017-12-07
Publication date: 2021-03-26
Anticipated expiration: 2037-12-07
Also published as: CN108171074A

Abstract

The invention discloses a content-association-based method for automatically detecting Web tracking. It relates to the field of Web user privacy protection and mainly addresses the problem that some Web sites collect and leak users' sensitive information without the users' knowledge. Implemented as a browser extension, the invention collects the user's operations on a Web page together with the page element information, then uses techniques such as text analysis and image recognition to compare how strongly the page content of an earlier and a later visit is associated with the user's operations, and thereby judges whether the Web site collects user information. Because ever-evolving Web tracking techniques can evade traditional detection methods, the invention starts from the effect of Web tracking; it can therefore not only effectively detect leakage of user privacy but also help researchers discover novel tracking techniques.

Description

Web tracking automatic detection method based on content association

Technical Field

The invention relates to a method for protecting privacy of Web users, in particular to a Web tracking automatic detection method based on page content relevance.

Background

With the rapid popularization of Web technologies and services, more and more users have come to depend on the Web. Meanwhile, Web sites and advertising service providers hope to use device identification to recommend content effectively and deliver advertisements more accurately. Some advertisers, however, "cooperate" with one another to sell users' private information, thereby associating users across domains and analyzing their behavior, habits and preferences, which largely violates users' wishes regarding privacy protection. At present, Web-based device identification mainly relies on Cookies and browser fingerprints. A Cookie is text information that a Web server stores in the user's browser and that can contain information about the user and the device; when the user accesses a Web site, the server can read the Cookie and thus obtain the user's browsing records and behavior. A browser fingerprint is composed of various attributes related to the browser, operating system and device hardware, such as the user agent, fonts and plug-ins; because it does not depend on any single feature, it is more robust.

In response to the privacy leakage threat posed by Web tracking, researchers have proposed corresponding detection and defense methods. For Cookies, a user can avoid tracking by disabling Cookies outright or deleting them regularly. Browser fingerprinting, however, collects user information entirely without the user's knowledge. Currently it can only be detected by monitoring calls to sensitive JavaScript APIs, but this approach presupposes comprehensive knowledge of the attack techniques and can be evaded if a Web site uses a new attribute that has not yet been discovered.

Disclosure of Invention

Purpose of the invention: to address the shortcomings of the prior art, the invention provides a content-association-based method for automatically detecting Web tracking. It makes full use of the correlation between a Web site's intelligent recommendations and the user's operations, so that whether a user is being tracked can be detected from the effects of tracking.

The technical scheme is as follows: the invention relates to a Web tracking automatic detection method based on content association, which sequentially comprises the following steps:

1) Collecting page elements and user operation information: when a user accesses a Web site, page element information (including the text description information and the picture link URLs corresponding to all links) and user operation information (including the search content entered, the text description information of clicked links, and the link URLs of clicked pictures) are obtained through a browser extension and written into files and a database.

2) Analyzing the relevance of the page content: the page content association comprises a text association and a picture association. Text association: keywords are extracted from the text description information in the page element information and in the user operation information, and the degree of association between the two is analyzed using text matching. Picture association: the pictures in the page element information and in the user operations are downloaded, and the degree of association between the two is analyzed using image recognition.

3) Realizing the automated process: a browser automated testing tool is used to start and configure the browser and to simulate user operations, and scripts automate the whole flow, thereby realizing automatic detection of Web tracking.

Advantageous effects: compared with the prior art, the invention has the following advantages:

1. Starting from the effect of Web tracking, the method judges whether a Web site uses tracking technology to collect user information by analyzing the relevance between the user's operations and the content of the Web site in two successive visits. Even as Web tracking techniques are continuously updated, they can be detected by the invention as long as the Web site uses them to recommend advertisements relevant to the user. This removes the need, present in the prior art, to keep updating prior knowledge of Web tracking techniques, and, combined with manual code analysis, it also helps to discover novel Web tracking techniques.

2. The invention utilizes the automatic testing tool of the browser and the automatic script to automate the whole process (including starting and configuring the browser, accessing the Web site, simulating the user operation, collecting the page and the user operation information), realizes the automatic detection of Web tracking without manual participation, and is beneficial to carrying out large-scale Web tracking detection experiments and analyzing the application condition of the Web tracking technology in real life.

Drawings

FIG. 1 is a flow chart of the method of the present invention.

Detailed Description

The technical scheme of the invention is further explained by combining the attached drawings.

As shown in FIG. 1, the content-association-based automatic Web tracking detection process is divided into three main steps: collecting page element information and user operation information, analyzing the page content association, and realizing the automated process. Investigation showed that, when a user accesses a Web site, a browser extension can record both the page elements and the user operation information. The concrete implementation is as follows:

Step 1, collecting page element information and user operation information

11) Acquisition of page element information

The page element information comprises the text description information and the picture link URLs corresponding to all links. Acquiring the page element information means acquiring the page's HTML source code, which can be obtained through the JavaScript API document.getElementsByTagName('html')[0].innerHTML. Because some Web sites use dynamic loading, the complete HTML source code cannot be obtained when the Web page has only just been opened. The invention therefore uses JavaScript (window.scrollTo) to simulate mouse wheel scrolling so that the page is loaded completely.
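As an illustration of what the recorded page element information might look like once the HTML source has been captured, the following minimal sketch parses a source string offline and collects the text of every link and the src, alt and title attributes of every picture. It is written in Python with the standard library only; the class and function names (PageElementCollector, collect_page_elements) are hypothetical, and the patent itself performs the collection inside the browser extension rather than with this code.

```python
from html.parser import HTMLParser


class PageElementCollector(HTMLParser):
    """Collects link text and image attributes from an HTML source string."""

    def __init__(self):
        super().__init__()
        self.link_texts = []  # text found inside <a> ... </a>
        self.images = []      # src / alt / title of each <img>
        self._a_depth = 0     # greater than 0 while inside an <a> element
        self._buffer = []     # text fragments of the link being read

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._a_depth += 1
        elif tag == "img":
            attr = dict(attrs)
            # Missing or valueless attributes are stored as empty strings.
            self.images.append(
                {k: attr.get(k) or "" for k in ("src", "alt", "title")}
            )

    def handle_endtag(self, tag):
        if tag == "a" and self._a_depth > 0:
            self._a_depth -= 1
            if self._a_depth == 0:
                # Collapse whitespace so nested markup yields clean text.
                text = " ".join("".join(self._buffer).split())
                if text:
                    self.link_texts.append(text)
                self._buffer = []

    def handle_data(self, data):
        if self._a_depth > 0:
            self._buffer.append(data)


def collect_page_elements(html_source):
    """Return the link texts and image attributes found in html_source."""
    parser = PageElementCollector()
    parser.feed(html_source)
    parser.close()
    return {"link_texts": parser.link_texts, "images": parser.images}
```

Calling collect_page_elements on the recorded source of a page returns a dictionary that could then be written to a file or database for the later relevance analysis.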

12) Acquisition of user operation information

The user operation information comprises the search content entered by the user, the text description information of clicked links, and the text description information and link URLs of clicked pictures.

The search content entered by the user is obtained by adding a listener that dynamically monitors real-time changes to the <input> tag.

The clicked links, the text description information of the pictures and the picture URLs are obtained by listening for the user's click behavior and reading the link of the clicked object together with its contextual text description. Since the clicked object usually corresponds to an <img> or <a> tag, the invention obtains only the useful attributes (src, alt, title) of the <img> tag and the text information of the <a> tag (obtained through innerText).

Step 2, analyzing the relevance of the page content

21) Calculating the relevance of the page content based on the page element information and the user operation information.

The page content relevance analysis comprises two parts: text association and picture association. The text relevance is expressed by a text matching value, which is calculated as follows: a text analysis tool performs keyword extraction and word segmentation on the user operation information obtained in step 1; the number of occurrences of each keyword in the page element information is then counted, and the sum of these counts is the text matching value. When extracting keywords, the invention considers only words with substantive meaning, such as nouns, verbs and adjectives, and ignores unimportant words such as prepositions, numerals and measure words. Word segmentation is used to split the extracted long Chinese words again; for example, the compound word for 'jeans' is further split into 'denim' and 'trousers', which improves the matching accuracy.
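The text matching value described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function names and the stop-word list are hypothetical, a simple regular-expression tokenizer stands in for a real text analysis tool with part-of-speech filtering and Chinese word segmentation, and each distinct keyword is counted once.

```python
import re

# Hypothetical stop-word list standing in for part-of-speech filtering.
STOPWORDS = {"a", "an", "the", "of", "for", "in", "on", "to", "and", "with"}


def extract_keywords(operation_text):
    """Return distinct lowercase keywords, skipping stop words and numerals."""
    words = re.findall(r"[a-z]+", operation_text.lower())
    keywords = []
    for word in words:
        if word not in STOPWORDS and word not in keywords:
            keywords.append(word)
    return keywords


def text_match_value(keywords, page_text):
    """Sum the occurrence counts of every distinct keyword in page_text.

    Counting is by substring, so a short keyword also matches inside
    longer words, similar in spirit to re-segmenting long words.
    """
    text = page_text.lower()
    return sum(text.count(k.lower()) for k in set(keywords) if k)


user_operation = "the blue jeans for 2 men"
page_text = "Jeans sale: blue jeans, black jeans, kids shoes"
match_text = text_match_value(extract_keywords(user_operation), page_text)
```

Here the keywords are blue, jeans and men; they occur 1, 3 and 0 times in the page text, so match_text is 4.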

The picture relevance is expressed by a picture matching value, which is calculated as follows: the pictures clicked by the user and all the pictures on the page are recognized using techniques such as image recognition algorithms and machine learning algorithms, giving two sets of picture categories S₁ and S₂; the number of times each element of S₁ occurs in S₂ is then counted, and the sum of these counts is the picture matching value. The final content relevance is the sum of the text matching value MatchText_US and the picture matching value MatchImage_US:

Match_US＝MatchText_US+MatchImage_US
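The picture matching value and the final sum can be sketched in the same way. The sketch assumes that image recognition has already been run and has produced one category label per picture; the recognition step itself is outside its scope. It also assumes that S₂ keeps one label per page picture, so that a category can occur several times. The function names are hypothetical.

```python
from collections import Counter


def image_match_value(clicked_categories, page_categories):
    """Sum, over the distinct clicked categories, their counts on the page."""
    page_counts = Counter(page_categories)
    return sum(page_counts[c] for c in set(clicked_categories))


def content_match(match_text, match_image):
    """Match_US = MatchText_US + MatchImage_US."""
    return match_text + match_image


s1 = ["jeans", "shoes"]                 # categories of the clicked pictures
s2 = ["jeans", "jeans", "hat", "shoes"]  # categories of all page pictures
match_image = image_match_value(s1, s2)
match_us = content_match(4, match_image)
```

With these inputs, jeans occurs twice and shoes once on the page, so match_image is 3 and, for a text matching value of 4, match_us is 7.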

22) Judging whether the Web site tracks the user based on the difference in relevance between the user's two visits to the Web site.

Specifically, when the user accesses Web site A for the first time, the page element information S₁ of Web site A and the user operation information U are recorded; when the user accesses Web site A for the second time, the page element information S₂ of Web site A is recorded again. The page content relevance is calculated for each of the two visits. When the relevance between the page information and the user operations in the second visit exceeds that in the first visit by more than a certain threshold, the Web site is considered to be recommending specific advertisements to the user, and therefore to be tracking the user, namely:

Match_US₂ − Match_US₁ > threshold

where Match_US₂ is the relevance between the page information of the second visit and the user operations, Match_US₁ is the relevance between the page information of the first visit and the user operations, and threshold is a specified threshold, taken as 5 in the invention.
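The decision rule of step 22) reduces to a single comparison: the Web site is flagged when the relevance of the second visit exceeds that of the first by more than the threshold. A minimal sketch, with a hypothetical function name and the threshold of 5 stated in the description as the default:

```python
THRESHOLD = 5  # value given in the description


def is_tracking(match_first, match_second, threshold=THRESHOLD):
    """Return True when the relevance gain between visits exceeds threshold."""
    return (match_second - match_first) > threshold


flagged = is_tracking(match_first=2, match_second=9)      # gain of 7
not_flagged = is_tracking(match_first=2, match_second=7)  # gain of exactly 5
```

The comparison is strict, so a gain exactly equal to the threshold does not flag the site.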

Step 3, realizing the automatic process

The invention uses a browser automated testing tool to start the browser, install the browser extension and simulate user operations, and combines it with a multi-process automation script to realize the automated process. As shown in FIG. 1, the task manager controls the number of concurrent processes in the whole flow and assigns tasks (specifying the URL of the Web site to visit, configuring the browser, and so on) to each process, that is, to each instance of the browser automated testing tool; each process is responsible for configuring and starting a browser and simulating user operations. For the set S of URLs of the Web sites to be detected, the steps are as follows:

(1) Select a URL from S and access the Web site, simulate a mouse click on a blank area of the page (to dismiss any floating login window), simulate mouse wheel scrolling to the bottom of the page (to load the page completely), and record the page source code.

(2) Extract the search box (that is, the <input> tag) on the home page, simulate the user entering a search item (the set of item categories is extensible), simulate pressing the Enter key, randomly select 3 picture links on the resulting page and click them, and record the content of the search item.

(3) Extract the text links and picture links on the home page, randomly click 3 of each, and record the content related to the clicked links.

(4) Close all windows; if S still contains URLs of Web sites to be detected, repeat from (1); otherwise proceed to (5).

(5) For the Web site data recorded in the above steps, that is, the simulated user operations and the page information, apply the relevance analysis method of step 2 to obtain the set of Web sites that track the user.
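The task manager's role of bounding concurrency and handing one URL to each worker can be sketched as below. This is only an outline under stated assumptions: a thread pool from the Python standard library stands in for the multi-process script so that the sketch stays self-contained, and visit_twice is a hypothetical callback that would drive the browser automated testing tool for one site and return the relevance values of the first and second visit.

```python
from concurrent.futures import ThreadPoolExecutor


def run_detection(site_urls, visit_twice, max_workers=4, threshold=5):
    """Return the URLs whose relevance gain between visits exceeds threshold.

    visit_twice(url) is a hypothetical callback returning the pair
    (match_first, match_second) for that Web site.
    """

    def check(url):
        match_first, match_second = visit_twice(url)
        return url, (match_second - match_first) > threshold

    # max_workers plays the part of the concurrency limit set by the
    # task manager; pool.map keeps results in the order of site_urls.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(check, site_urls))
    return [url for url, tracked in results if tracked]


# Stand-in data instead of real browser sessions.
fake_scores = {"https://a.example": (1, 10), "https://b.example": (3, 4)}
tracking_sites = run_detection(list(fake_scores), lambda url: fake_scores[url])
```

With the stand-in data, only https://a.example shows a gain above 5, so it is the only entry in tracking_sites.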

The above embodiment realizes a content-association-based method for automatically detecting Web tracking, which can effectively guard against leakage of user privacy. Because the invention starts from the effect of Web tracking, it avoids the need of existing methods to keep updating prior knowledge of tracking techniques. In addition, the invention realizes an automated detection process, which facilitates large-scale analysis of how Web tracking is applied in practice and also helps to discover novel Web tracking techniques.

Claims

1. A Web tracking automatic detection method based on content association is characterized by comprising the following steps:

(1) collecting Web page elements and user operation information in a browser extension mode, wherein the page elements comprise all text description information and picture links in a page; the user operation information comprises search content input by a user, text type description information of clicking a link, text type description information of clicking a picture and a link URL;

(2) analyzing page content relevance including text relevance and picture relevance based on Web page elements and user operation information, and judging whether the Web site tracks the user or not, wherein,

the text relevance is expressed by a text matching value, and the calculation method comprises the following steps: extracting keywords and segmenting the user operation information obtained in the step (1) by using a text analysis tool, then matching the occurrence times of each keyword in the page element information and solving the sum of the occurrence times of each keyword in the page element information to obtain a text matching value MatchText_US；

the picture relevance is expressed by a picture matching value, and the calculation method comprises the following steps: identifying the pictures clicked by the user and all the pictures on the page by using an image recognition algorithm and a machine learning algorithm to obtain two sets of picture categories S₁ and S₂, then matching the number of times each element of S₁ occurs in S₂ and summing these counts to obtain the picture matching value MatchImage_US；

the page content association degree is: Match_US＝MatchText_US+MatchImage_US；

(3) realizing automatic Web tracking detection by using a browser automated testing tool.

2. The automatic detection method for Web tracking based on content association as claimed in claim 1, wherein said step (2) comprises:

recording the page element information and the user operation information of a user's two visits to a Web site, and respectively calculating the page content association degrees Match_US₁ and Match_US₂ of the two visits; when the difference between the association degrees of the later visit and the earlier visit is greater than a specified threshold, the Web site is considered to track the user.

3. The automatic detection method for Web tracking based on content association as claimed in claim 1, wherein said step (3) comprises: using a browser automated testing tool to start and configure the browser and access the Web site, writing an automation script to simulate the user's clicking and text input operations, obtaining the simulated user operations and the page information, and obtaining the set of Web sites that track the user by means of the analysis in step (2).