CN108171074B - Web tracking automatic detection method based on content association - Google Patents

Publication number
CN108171074B
Authority
CN
China
Prior art keywords
user
web
page
information
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711282970.5A
Other languages
Chinese (zh)
Other versions
CN108171074A (en)
Inventor
杨明
周佳欢
罗军舟
吴文甲
凌振
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University
Priority to CN201711282970.5A
Publication of CN108171074A
Application granted
Publication of CN108171074B
Status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G06F21/6263 Protecting personal data, e.g. for financial or medical purposes during internet communication, e.g. revealing personal data from cookies

Abstract

The invention discloses an automatic Web tracking detection method based on content association. It relates to the field of Web user privacy protection and mainly addresses the problem that some Web sites collect and leak users' sensitive information without the users' knowledge. The invention uses a browser extension to collect the user's operations on a Web page together with the page element information, and uses techniques such as text analysis and image recognition to analyze and compare how strongly the page content of successive visits is associated with the user's operations, thereby judging whether the Web site collects user information. Because increasingly sophisticated Web tracking techniques can evade traditional detection methods, the invention starts from the effect of Web tracking: it can effectively detect leakage of user privacy and can also help researchers discover novel tracking techniques.

Description

Web tracking automatic detection method based on content association
Technical Field
The invention relates to a method for protecting the privacy of Web users, and in particular to an automatic Web tracking detection method based on the relevance of page content.
Background
With the rapid spread of Web technologies and services, more and more users have come to depend on the Web. Web sites and advertising service providers want to use device identification to recommend content effectively and deliver advertisements more accurately. Some advertisers, however, "cooperate" by selling users' private information to one another, which enables cross-domain linking of users and analysis of their habits and preferences, and largely violates users' wishes regarding privacy. At present, Web-based device identification relies mainly on Cookies and browser fingerprints. A Cookie is text information that a Web server stores in the user's browser; it can contain information about the user and the device, and when the user visits the Web site the server can read the Cookie and thereby obtain the user's browsing records and behavior. A browser fingerprint is composed of many attributes related to the browser, operating system and device hardware, such as the user agent, fonts and plug-ins; because it does not depend on any single attribute, it is more robust.
In response to the privacy threat posed by Web tracking, researchers have proposed detection and defense methods. Cookies can be avoided because the user can disable them or delete them regularly. Browser fingerprinting, however, collects user information entirely without the user's knowledge. At present it can only be detected by monitoring calls to sensitive JavaScript APIs, but that approach presumes comprehensive knowledge of the attack techniques and can be evaded if a Web site uses a new, not yet discovered attribute.
Disclosure of Invention
Purpose of the invention: to address the shortcomings of the prior art, the invention provides an automatic Web tracking detection method based on content association. It exploits the correlation between a Web site's intelligent recommendations and the user's operations, and detects whether a user is being tracked by starting from the effect of tracking.
Technical solution: the Web tracking automatic detection method based on content association according to the invention comprises the following steps in sequence:
1) Collecting page elements and user operation information: when a user visits a Web site, a browser extension obtains the page element information (the text description corresponding to every link, and the picture link URLs) and the user operation information (the search content entered, the text description corresponding to each clicked link, and the link URL corresponding to each clicked picture) and writes them to files and a database.
2) Analyzing the relevance of the page content: page content association comprises text association and picture association. Text association: keywords are extracted from the text descriptions in the page element information and in the user operation information, and the degree of association between the two is analyzed with a text matching technique. Picture association: the pictures in the page element information and in the user operation information are downloaded, and the degree of association between the two is analyzed with an image recognition technique.
3) Automating the process: a browser automation testing tool starts and configures the browser and simulates user operations, and a script automates the whole flow, achieving automatic Web tracking detection.
Advantageous effects: compared with the prior art, the invention has the following advantages:
1. Starting from the effect of Web tracking, the method judges whether a Web site uses tracking technology to collect user information by analyzing how the content shown on two visits to the Web site relates to the user's operations. Even as Web tracking technology keeps evolving, the invention can detect it as long as the Web site uses it to recommend advertisements relevant to the user. This removes the need, present in existing methods, to keep prior knowledge of Web tracking techniques up to date; combined with manual code analysis, it also helps to discover novel Web tracking techniques.
2. The invention uses a browser automation testing tool and automation scripts to automate the whole process (starting and configuring the browser, visiting the Web site, simulating user operations, and collecting page and user operation information). Web tracking is thus detected automatically without manual involvement, which facilitates large-scale Web tracking detection experiments and analysis of how Web tracking technology is applied in practice.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The technical solution of the invention is further explained below with reference to the drawing.
As shown in FIG. 1, the content-association-based automatic Web tracking detection process is divided into three main steps: collecting page element information and user operation information, analyzing page content relevance, and automating the process. Investigation shows that, when a user visits a Web site, a browser extension can record the page elements and the user operation information. The concrete implementation is as follows:
Step 1, collecting page element information and user operation information
11) Acquisition of page element information
The page element information comprises the text description and picture link URL corresponding to every link. Acquiring the page element information means acquiring the page's HTML source code, which can be obtained through the JavaScript API document.getElementsByTagName('html')[0].innerHTML. Because some Web sites use dynamic loading, the complete HTML source code is not available when the Web page has only just been opened. The invention therefore uses JavaScript (window.scrollTo) to simulate scroll-wheel operation so that the page loads completely.
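The acquisition described above might be sketched as follows. This is an illustrative sketch only, not the patent's own listing: the function names `collectPageElements` and `scrollToBottom` are hypothetical, and `win` and `doc` stand for the browser's `window` and `document` objects as seen from an extension content script.

```javascript
// Sketch (assumed helper names): collect the text of every link and the URL
// of every picture from a document-like object.
function collectPageElements(doc) {
  const linkTexts = Array.from(doc.getElementsByTagName('a'))
    .map((a) => (a.innerText || a.textContent || '').trim())
    .filter((text) => text.length > 0);
  const imageUrls = Array.from(doc.getElementsByTagName('img'))
    .map((img) => img.src)
    .filter((src) => Boolean(src));
  return { linkTexts, imageUrls };
}

// Sketch: scroll to the current bottom of the page with window.scrollTo so
// that dynamically loaded content is requested. Returns the height used.
function scrollToBottom(win, doc) {
  const height = doc.body.scrollHeight;
  win.scrollTo(0, height);
  return height;
}
```

On pages that keep loading content as the user scrolls, a single call may not be enough; repeating the scroll after a short delay until the page height stops growing is one possible refinement.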
12) Acquisition of user operation information
The user operation information comprises the search content entered by the user, the text description of each clicked link, and the text description and link URL of each clicked picture.
First, a listener is added to the input tag to watch its changes in real time and obtain the search content entered by the user, specifically:
Figure BDA0001497843860000031
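Since the listing itself is only available as an image, the following is a minimal sketch of how such a listener could look, not a reproduction of the original code. The function name `watchSearchInputs` and the callback `onSearchText` are hypothetical; `doc` stands for the page's `document`.

```javascript
// Sketch (assumed names): attach an 'input' listener to every <input> element
// so that each change to the typed text is reported to the callback.
function watchSearchInputs(doc, onSearchText) {
  const inputs = Array.from(doc.getElementsByTagName('input'));
  for (const input of inputs) {
    input.addEventListener('input', (event) => {
      onSearchText(event.target.value);
    });
  }
  return inputs.length; // number of inputs being watched
}
```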
The text description of a clicked link, and the text description and URL of a clicked picture, are obtained by listening for the user's click events and reading the link of the clicked object and its surrounding text description. Because a clicked object usually corresponds to an <img> or <a> tag, the invention obtains only the useful attributes (src, alt, title) of the <img> tag and the text (obtained through innerText) of the <a> tag. The specific method is as follows:
Figure BDA0001497843860000032
Figure BDA0001497843860000041
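As the original listing is only available as an image, the click handling can be illustrated with the hedged sketch below. The names `describeClickTarget` and `watchClicks` and the shape of the returned records are assumptions made for illustration; only the attributes src, alt, title and innerText come from the description.

```javascript
// Sketch (assumed names): turn a clicked element into a small record.
// Only <img> and <a> elements are of interest; anything else yields null.
function describeClickTarget(el) {
  const tag = (el.tagName || '').toLowerCase();
  if (tag === 'img') {
    return { type: 'image', src: el.src || '', alt: el.alt || '', title: el.title || '' };
  }
  if (tag === 'a') {
    return { type: 'link', text: (el.innerText || '').trim() };
  }
  return null;
}

// Sketch: listen for clicks on the whole document and report the ones that
// hit an <img> or <a> element.
function watchClicks(doc, onClick) {
  doc.addEventListener('click', (event) => {
    const info = describeClickTarget(event.target);
    if (info !== null) {
      onClick(info);
    }
  });
}
```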
Step 2, analyzing the relevance of the page content
21) Calculating the page content relevance based on the page element information and the user operation information.
The page content relevance analysis has two parts: text association and picture association. Text relevance is expressed by a text matching value, calculated as follows: a text analysis tool performs keyword extraction and word segmentation on the user operation information obtained in step 1; the number of occurrences of each keyword in the page element information is then counted, and the counts are summed to give the text matching value. When extracting keywords, the invention considers only words with substantive meaning, such as nouns, verbs and adjectives, and ignores unimportant words such as prepositions, numerals and measure words. Word segmentation splits the extracted long Chinese words again; for example, "jeans" is split again into "jeans" and "trousers", which improves matching accuracy. The method is as follows:
Figure BDA0001497843860000042
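The counting part of this calculation can be sketched as below. The sketch assumes that keyword extraction and word segmentation have already been done by a separate text analysis tool, which is not shown; the function names `countOccurrences` and `matchText` are hypothetical, and counting non-overlapping substring occurrences is an assumed interpretation of "number of occurrences".

```javascript
// Sketch (assumed names): count non-overlapping occurrences of a keyword.
function countOccurrences(text, keyword) {
  if (keyword.length === 0) {
    return 0;
  }
  let count = 0;
  let pos = text.indexOf(keyword);
  while (pos !== -1) {
    count += 1;
    pos = text.indexOf(keyword, pos + keyword.length);
  }
  return count;
}

// Sketch: the text matching value is the sum, over all keywords taken from
// the user operation information, of their occurrence counts in the page text.
function matchText(keywords, pageText) {
  return keywords.reduce((sum, kw) => sum + countOccurrences(pageText, kw), 0);
}
```

For instance, with the keywords `['jeans', 'trousers']` and the page text `'jeans and trousers, more jeans'`, the sketch counts two occurrences of the first keyword and one of the second.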
Picture relevance is expressed by a picture matching value, calculated as follows: techniques such as an image recognition algorithm and a machine learning algorithm recognize the picture clicked by the user and all the pictures on the page, giving two sets of picture categories, $S_1$ and $S_2$; the number of occurrences in $S_2$ of each element of $S_1$ is then counted, and the counts are summed to give the picture matching value. The final content relevance is the sum of the text matching value $MatchText_{US}$ and the picture matching value $MatchImage_{US}$:
$Match_{US} = MatchText_{US} + MatchImage_{US}$
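A minimal sketch of the picture matching value and the final sum follows. It assumes that the image recognition step, which is not shown, has already produced category labels as strings, and that the page-side collection may contain the same category more than once (one label per picture). The names `matchImage` and `contentRelevance` are hypothetical.

```javascript
// Sketch (assumed names): for each category of a clicked picture (s1), count
// how many page pictures (s2) were recognized as the same category, and sum.
function matchImage(s1, s2) {
  const counts = new Map();
  for (const category of s2) {
    counts.set(category, (counts.get(category) || 0) + 1);
  }
  let total = 0;
  for (const category of s1) {
    total += counts.get(category) || 0;
  }
  return total;
}

// Sketch: the content relevance is the text matching value plus the picture
// matching value.
function contentRelevance(textMatch, imageMatch) {
  return textMatch + imageMatch;
}
```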
22) Judging whether the Web site tracks the user based on the difference in relevance between two visits to the Web site.
Specifically, when the user first visits Web site A, the page element information $S_1$ of Web site A and the user operation information $U$ are recorded; when the user visits Web site A a second time, its page element information $S_2$ is recorded again. The page content relevance is calculated for each of the two visits. When the relevance between page information and user operations on the second visit exceeds that on the first visit by more than a certain threshold, the Web site is considered to be recommending specific advertisements to the user, and therefore to be tracking the user, that is:
$Match_{US_2} - Match_{US_1} > threshold$
where $Match_{US_2}$ is the relevance between the page information of the second visit and the user operations, $Match_{US_1}$ is the relevance between the page information of the first visit and the user operations, and $threshold$ is a specified threshold, taken as 5 in the invention.
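The decision rule can be illustrated with the short sketch below. It assumes the comparison is the second-visit relevance minus the first-visit relevance, which is how the surrounding text and claim 2 read; the function name `isTracking` is hypothetical, and the default of 5 is the threshold value stated in the description.

```javascript
// Threshold value stated in the description.
const DEFAULT_THRESHOLD = 5;

// Sketch (assumed name and direction of subtraction): a site is flagged when
// relevance on the second visit exceeds relevance on the first visit by more
// than the threshold.
function isTracking(firstVisitMatch, secondVisitMatch, threshold = DEFAULT_THRESHOLD) {
  return secondVisitMatch - firstVisitMatch > threshold;
}
```

With a first-visit relevance of 2, a second-visit relevance of 8 would flag the site under this sketch, while a second-visit relevance of 7 would not, because the difference must be strictly greater than the threshold.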
Step 3, realizing the automatic process
The invention uses a browser automation testing tool to start the browser, install the browser extension and simulate user operations, and combines it with a multi-process automation script to automate the process. As shown in FIG. 1, the task manager controls the number of concurrent processes and assigns tasks (specifying the Web site URL to visit, configuring the browser, and so on) to each process, that is, to each instance of the browser automation testing tool; each process is responsible for configuring and starting a browser and simulating user operations. For the set S of Web site URLs to be tested, the steps are:
(1) Select a URL from S and visit the Web site; simulate a mouse click on a blank area of the page (to dismiss any floating login window); simulate scroll-wheel operation to scroll to the bottom of the page (to load the page completely); and record the page source code.
(2) Extract the search box (the <input> tag) on the home page, simulate the user entering an item to search for (the item categories are extensible), simulate pressing the Enter key, randomly select 3 picture links on the resulting page and click them, and record the content of the searched item.
(3) Extract the text links and picture links on the home page, randomly click 3 of each, and record the content related to the clicked links.
(4) Close all windows. If S still contains Web site URLs to be tested, repeat from step (1); otherwise go to step (5).
(5) Apply the relevance analysis method of step 2 to the Web site data recorded in the steps above, that is, the simulated user operations and page information, to obtain the set of Web sites that track the user.
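Two small pieces of this flow lend themselves to a sketch: how a task manager might split the URL set among a fixed number of worker processes, and how a worker might pick the 3 links to click at random. Both functions below are illustrative assumptions, not the patent's script; the names `assignTasks` and `pickRandom`, the round-robin split, and the injectable random source `rng` are all hypothetical. Driving a real browser is left to the automation tool and is not shown.

```javascript
// Sketch (assumed name): distribute URLs to workers in round-robin order.
function assignTasks(urls, workerCount) {
  const queues = Array.from({ length: workerCount }, () => []);
  urls.forEach((url, index) => {
    queues[index % workerCount].push(url);
  });
  return queues;
}

// Sketch (assumed name): pick up to k distinct items at random using a
// partial Fisher-Yates shuffle. rng must return a number in [0, 1).
function pickRandom(items, k, rng = Math.random) {
  const pool = items.slice();
  const n = Math.min(k, pool.length);
  for (let i = 0; i < n; i += 1) {
    const j = i + Math.floor(rng() * (pool.length - i));
    const tmp = pool[i];
    pool[i] = pool[j];
    pool[j] = tmp;
  }
  return pool.slice(0, n);
}
```

Passing the random source as a parameter keeps the selection reproducible in experiments, which can be useful when the same site has to be revisited with the same simulated behavior.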
The embodiment above implements the content-association-based automatic Web tracking detection method and can effectively guard against leakage of user privacy. Because the invention starts from the effect of Web tracking, it avoids the need, present in existing methods, to keep prior knowledge of tracking techniques up to date. In addition, the invention implements an automatic detection process, which facilitates large-scale analysis of how Web tracking is applied in practice and also helps to discover novel Web tracking techniques.

Claims (3)

1. A Web tracking automatic detection method based on content association is characterized by comprising the following steps:
(1) collecting Web page elements and user operation information in a browser extension mode, wherein the page elements comprise all text description information and picture links in a page; the user operation information comprises search content input by a user, text type description information of clicking a link, text type description information of clicking a picture and a link URL;
(2) analyzing page content relevance including text relevance and picture relevance based on Web page elements and user operation information, and judging whether the Web site tracks the user or not, wherein,
the text relevance is expressed by a text matching value, calculated as follows: a text analysis tool performs keyword extraction and word segmentation on the user operation information obtained in step (1); the number of occurrences of each keyword in the page element information is then counted, and the counts are summed to obtain the text matching value $MatchText_{US}$;
the picture relevance is expressed by a picture matching value, calculated as follows: an image recognition algorithm and a machine learning algorithm recognize the picture clicked by the user and all the pictures on the page, giving two sets of picture categories, $S_1$ and $S_2$; the number of occurrences in $S_2$ of each element of $S_1$ is then counted, and the counts are summed to obtain the picture matching value $MatchImage_{US}$;
the page content relevance is: $Match_{US} = MatchText_{US} + MatchImage_{US}$;
(3) realizing automatic Web tracking detection by using a browser automation testing tool.
2. The automatic detection method for Web tracking based on content association as claimed in claim 1, wherein said step (2) comprises:
recording the page element information and the user operation information for two visits by a user to a Web site, and calculating the page content relevance $Match_{US_1}$ and $Match_{US_2}$ for the two visits respectively; when the difference between the relevance of the later visit and that of the earlier visit is greater than a specified threshold, the Web site is considered to track the user.
3. The automatic detection method for Web tracking based on content association as claimed in claim 1, wherein said step (3) comprises: using a browser automation testing tool to start and configure the browser and visit the Web site, writing an automation script to simulate the user's clicking and text-entry operations, obtaining the simulated user operations and page information, and obtaining the set of Web sites that track the user by means of the analysis in step (2).
CN201711282970.5A 2017-12-07 2017-12-07 Web tracking automatic detection method based on content association Active CN108171074B (en)

Publications (2)

Publication Number Publication Date
CN108171074A CN108171074A (en) 2018-06-15
CN108171074B true CN108171074B (en) 2021-03-26

Family

ID=62524462

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711282970.5A Active CN108171074B (en) 2017-12-07 2017-12-07 Web tracking automatic detection method based on content association

Country Status (1)

Country Link
CN (1) CN108171074B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109309664B (en) * 2018-08-14 2021-03-23 中国科学院数据与通信保护研究教育中心 Browser fingerprint detection behavior monitoring method
US11093644B2 (en) * 2019-05-14 2021-08-17 Google Llc Automatically detecting unauthorized re-identification

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106650382A (en) * 2016-12-30 2017-05-10 北京工业大学 Browser-based high-performance user tracking method
CN107239491A (en) * 2017-04-25 2017-10-10 广州阿里巴巴文学信息技术有限公司 For realizing method, equipment, browser and electronic equipment that user behavior is followed the trail of

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8225401B2 (en) * 2008-12-18 2012-07-17 Symantec Corporation Methods and systems for detecting man-in-the-browser attacks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on browser fingerprint detection and identification technology (浏览器指纹探测识别技术研究); 江军 et al.; 《保密科学技术》; 2017-01-31; full text *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant