CN108171074B - Web tracking automatic detection method based on content association - Google Patents
- Publication number: CN108171074B
- Application number: CN201711282970.5A
- Authority: CN (China)
- Prior art keywords: user, web, page, information, text
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
- G06F21/6263—Protecting personal data, e.g. for financial or medical purposes during internet communication, e.g. revealing personal data from cookies
Abstract
The invention discloses an automatic Web tracking detection method based on content association. It relates to the field of Web user privacy protection and mainly addresses the problem that some Web sites collect and leak users' sensitive information without the users' knowledge. Using a browser extension, the invention collects the user's operations on a Web page together with the page element information, then uses techniques such as text analysis and image recognition to compare how strongly the page content of successive visits is associated with the user's operations, and thereby judges whether the Web site collects user information. Because increasingly sophisticated Web tracking techniques can evade traditional detection methods, the invention starts from the effect of Web tracking; it can therefore detect leakage of user privacy effectively and can also help researchers discover new tracking techniques.
Description
Technical Field
The invention relates to a method for protecting privacy of Web users, in particular to a Web tracking automatic detection method based on page content relevance.
Background
With the rapid spread of Web technologies and services, more and more users depend on the Web. Web sites and advertising service providers want to use device identification to recommend content effectively and deliver advertisements more precisely. Some advertisers, however, "cooperate" with one another to sell users' private information, which enables cross-domain association of users and analysis of their habits and preferences, and largely violates users' wishes regarding privacy. At present, Web-based device identification relies mainly on Cookies and browser fingerprints. A Cookie is text information that a Web server stores in the user's browser; it can contain information about the user and the device, and when the user visits a Web site the server can read the Cookie to obtain the user's browsing records and behavior. A browser fingerprint is composed of attributes of the browser, operating system, and device hardware, such as the user agent, fonts, and plug-ins; because it does not depend on any single feature, it is more robust.
To counter the privacy threat posed by Web tracking, researchers have proposed detection and defense methods. For Cookies, a user can simply disable them or delete them regularly. Browser fingerprinting, however, collects user information entirely without the user's knowledge, and at present it can be detected only by monitoring calls to sensitive JavaScript APIs. That approach presupposes comprehensive knowledge of the tracking techniques in use, and a Web site can evade it by using a new attribute that has not yet been identified.
Disclosure of Invention
Purpose of the invention: to address the shortcomings of the prior art, the invention provides an automatic Web tracking detection method based on content association. It exploits the correlation between a Web site's intelligent recommendations and the user's operations, and detects whether a user is being tracked by starting from the effect of tracking.
Technical scheme: the automatic Web tracking detection method based on content association comprises the following steps in sequence:
1) Collecting page elements and user operation information: when a user visits a Web site, a browser extension obtains the page element information (the text description information and picture link URLs corresponding to all links) and the user operation information (the search content entered, the text description information of clicked links, and the link URLs of clicked pictures) and writes them to files and databases.
2) Analyzing page content association: page content association comprises text association and picture association. Text association: keywords are extracted from the text description information in the page element information and in the user operation information, and the degree of association between the two is analyzed with a text matching technique. Picture association: the pictures in the page element information and in the user operations are downloaded, and the degree of association between them is analyzed with an image recognition technique.
3) Implementing the automated process: a browser automation testing tool starts and configures the browser and simulates user operations, and scripts automate the overall flow, so that Web tracking is detected automatically.
Advantageous effects: compared with the prior art, the invention has the following advantages:
1. Starting from the effect of Web tracking, the method judges whether a Web site uses tracking techniques to collect user information by analyzing the association between the user's operations and the content of the Web site on two successive visits. Even as Web tracking techniques continue to evolve, the invention can detect them as long as the Web site uses them to recommend advertisements relevant to the user. This removes the need, present in the prior art, to keep updating prior knowledge of Web tracking techniques; combined with manual code analysis, the method also helps discover new Web tracking techniques.
2. The invention uses a browser automation testing tool and automation scripts to automate the whole process (starting and configuring the browser, visiting the Web site, simulating user operations, and collecting page and user operation information). Web tracking is thus detected automatically without manual participation, which facilitates large-scale Web tracking detection experiments and analysis of how Web tracking techniques are applied in practice.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The technical scheme of the invention is further explained by combining the attached drawings.
As shown in FIG. 1, the automatic Web tracking detection process based on content association is divided into three main steps: collecting page element information and user operation information, analyzing page content association, and implementing the automated process. Investigation shows that when a user visits a Web site, a browser extension can record the page elements and the user's operations. The concrete implementation is as follows:
Step 1, collecting page element information and user operation information
11) Acquisition of page element information
The page element information comprises the text description information and picture link URLs corresponding to all links. Acquiring it means acquiring the page's HTML source code, which can be done through a JavaScript API: document.getElementsByTagName('html')[0].innerHTML. Because some Web sites load content dynamically, the complete HTML source code is not available when the page has just been opened. The invention therefore uses JavaScript (window.scrollTo) to simulate scroll-wheel operation so that the page loads completely.
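The scrolling and capture described above can be sketched as follows. This is a minimal illustration and not the patent's own code: only window.scrollTo and the innerHTML lookup come from the text, while the function names, the stop condition (the page height no longer grows), and the values of maxRounds and delayMs are assumptions.

```javascript
// Sketch only: names other than window.scrollTo and the innerHTML lookup
// are illustrative assumptions, not taken from the patent.

// Scroll to the current bottom of the page once. Returns the height seen
// and whether the page has stopped growing since the previous round.
function scrollOnce(win, doc, lastHeight) {
  const height = doc.documentElement.scrollHeight;
  if (height === lastHeight) {
    return { done: true, height };
  }
  win.scrollTo(0, height);
  return { done: false, height };
}

// Serialize the page with the API named in the text.
function pageSource(doc) {
  return doc.getElementsByTagName('html')[0].innerHTML;
}

// Drive the scrolling with a pause between rounds so lazy content can load.
// maxRounds and delayMs are arbitrary illustrative values.
async function captureFullPage(win, doc, maxRounds = 20, delayMs = 500) {
  let lastHeight = -1;
  for (let round = 0; round < maxRounds; round++) {
    const step = scrollOnce(win, doc, lastHeight);
    if (step.done) break;
    lastHeight = step.height;
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  return pageSource(doc);
}
```

In an extension content script this would be called as captureFullPage(window, document). Passing the window and document in as parameters is a design choice that lets the logic be exercised with stand-in objects outside a browser.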
12) Acquisition of user operation information
The user operation information comprises the search content entered by the user, the text description information of clicked links, and the text description information and link URLs of clicked pictures.
The search content entered by the user is obtained by adding a listener that dynamically monitors real-time changes of the input tag, specifically:
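The patent's own listing is not reproduced in this text. A minimal sketch of such a listener, assuming the standard DOM "input" event and a hypothetical record callback that stores each entry, might look like this:

```javascript
// Sketch only: watchSearchInputs and the record callback are hypothetical
// names; the patent text does not show its listener code.

// Attach an "input" listener to every <input> element so that each change
// to the typed text is reported to `record`. Returns the number of
// elements being watched.
function watchSearchInputs(doc, record) {
  const inputs = doc.getElementsByTagName('input');
  for (const input of inputs) {
    input.addEventListener('input', (event) => {
      record({ type: 'search', value: event.target.value });
    });
  }
  return inputs.length;
}
```

The listener fires on every keystroke, so a consumer that only needs the final query would keep the last recorded value.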
The clicked links, the text description information of pictures, and the picture URLs are obtained by listening for the user's click behavior and reading the link of the clicked object and its surrounding text description information. Since the clicked object usually corresponds to an <img> or <a> tag, the invention obtains only the useful attributes (src, alt, title) of the <img> tag and the text information of the <a> tag (obtained through innerText). The specific method is as follows:
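The patent's listing for this step is also absent from the text. The sketch below shows one way to extract the attributes named above; the function names, the shape of the returned objects, and the inclusion of href for links are assumptions made for illustration.

```javascript
// Sketch only: describeClickTarget, watchClicks, and the returned object
// shapes are hypothetical; the attribute names come from the text.

// Reduce a clicked element to the attributes the method cares about.
// Returns null for anything that is not an <img> or <a>.
function describeClickTarget(el) {
  const tag = (el.tagName || '').toLowerCase();
  if (tag === 'img') {
    return { kind: 'image', src: el.src || '', alt: el.alt || '', title: el.title || '' };
  }
  if (tag === 'a') {
    return { kind: 'link', text: (el.innerText || '').trim(), href: el.href || '' };
  }
  return null;
}

// Listen for clicks anywhere in the document and report the nearest
// enclosing <img> or <a> through the `record` callback.
function watchClicks(doc, record) {
  doc.addEventListener('click', (event) => {
    const el = event.target.closest ? event.target.closest('img, a') : null;
    const info = el ? describeClickTarget(el) : null;
    if (info) record(info);
  }, true);
}
```

Registering the listener in the capture phase (the final true argument) is a design choice: it lets the extension observe the click even if the page's own handlers stop propagation.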
Step 2, analyzing the relevance of the page content
21) Calculating the page content relevance based on the page element information and the user operation information.
The page content relevance analysis comprises two parts: text association and picture association. Text relevance is expressed by a text matching value, calculated as follows: a text analysis tool performs keyword extraction and word segmentation on the user operation information obtained in step 1; the number of occurrences of each keyword in the page element information is then counted, and the sum of these counts is the text matching value. When extracting keywords, the invention considers only words with substantive meaning, such as nouns, verbs, and adjectives, and ignores unimportant words such as prepositions, numerals, and quantifiers. Word segmentation further splits long Chinese words that have been extracted; for example, the compound word for "jeans" is split again into "jeans" and "trousers", which improves matching accuracy. The method is as follows:
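The counting part of this calculation can be sketched as below. The sketch assumes keyword extraction and word segmentation have already been done by some text analysis tool (not shown, and not named in the patent), so it starts from a list of keywords. The function names, the removal of duplicate keywords, and the case-sensitive, non-overlapping matching are assumptions.

```javascript
// Sketch only: function names and matching details are assumptions.

// Count non-overlapping occurrences of `keyword` in `text`.
function countOccurrences(text, keyword) {
  if (!keyword) return 0;
  let count = 0;
  let pos = text.indexOf(keyword);
  while (pos !== -1) {
    count += 1;
    pos = text.indexOf(keyword, pos + keyword.length);
  }
  return count;
}

// Text matching value: sum over distinct keywords of their occurrence
// counts in the page text.
function matchText(keywords, pageText) {
  const distinct = [...new Set(keywords)];
  return distinct.reduce((sum, kw) => sum + countOccurrences(pageText, kw), 0);
}
```

For instance, with the keywords "jeans" and "blue" and the page text "blue jeans and more jeans", "jeans" is found twice and "blue" once, so the text matching value is 3.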
Picture relevance is expressed by a picture matching value, calculated as follows: the pictures clicked by the user and all pictures on the page are recognized with techniques such as image recognition and machine learning algorithms, yielding two sets of picture categories, $S_1$ and $S_2$. The number of times each element of $S_1$ occurs in $S_2$ is then counted, and the sum of these counts is the picture matching value. The final content relevance is the sum of the text matching value $MatchText_{US}$ and the picture matching value $MatchImage_{US}$:
$Match_{US} = MatchText_{US} + MatchImage_{US}$
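A minimal sketch of the picture matching value and the combined relevance follows. It assumes an image classifier (not shown) has already produced one category label per picture, and it treats the page-side collection as a list in which a category may repeat, since the text counts occurrences; both points, and the function names, are assumptions.

```javascript
// Sketch only: category labels are assumed to come from an external
// image recognition step that is not shown here.

// Picture matching value: for each category of the clicked pictures,
// count how often it occurs among the page's picture categories.
function matchImage(clickedCategories, pageCategories) {
  const clicked = new Set(clickedCategories);
  let total = 0;
  for (const category of pageCategories) {
    if (clicked.has(category)) total += 1;
  }
  return total;
}

// Overall content relevance: text matching value plus picture matching value.
function matchTotal(textValue, imageValue) {
  return textValue + imageValue;
}
```

With clicked categories "jeans" and "shoes" and page categories "jeans", "jeans", "hat", "shoes", the picture matching value is 3; added to a text matching value of 3 it gives a content relevance of 6.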
22) Judging whether the Web site tracks the user based on the difference in relevance between the user's two visits to the Web site.
Specifically, when the user visits Web site A for the first time, the page element information $S_1$ of Web site A and the user operation information $U$ are recorded; when the user visits Web site A a second time, the page element information $S_2$ is recorded again. The page content relevance is calculated for each of the two visits. When the relevance between the page information and the user operations in the second visit exceeds that in the first visit by more than a certain threshold, the Web site is considered to be recommending specific advertisements to the user, and therefore to be tracking the user, namely:
$Match_{US_2} - Match_{US_1} > threshold$
where $Match_{US_2}$ is the relevance between the page information of the second visit and the user operations, $Match_{US_1}$ is the relevance between the page information of the first visit and the user operations, and $threshold$ is a specified threshold, taken as 5 in the invention.
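The decision rule reduces to a single comparison. The sketch below uses the threshold of 5 stated in the text; the function and constant names are illustrative.

```javascript
// Threshold value taken from the text; the names are illustrative.
const DEFAULT_THRESHOLD = 5;

// A site is flagged as tracking when the relevance on the second visit
// exceeds the relevance on the first visit by more than the threshold.
function isTracking(matchFirstVisit, matchSecondVisit, threshold = DEFAULT_THRESHOLD) {
  return matchSecondVisit - matchFirstVisit > threshold;
}
```

For example, relevance values of 2 on the first visit and 9 on the second differ by 7 and the site is flagged, whereas values of 2 and 7 differ by exactly 5 and the site is not flagged, because the comparison is strict.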
Step 3, realizing the automatic process
The invention uses a browser automation testing tool to start the browser, install the browser extension, and simulate user operations, and combines it with a multi-process automation script to implement the automated process. As shown in FIG. 1, the task manager controls the number of concurrent processes and assigns tasks (the Web site URL to visit, the browser configuration, and so on) to each process, that is, to each instance of the browser automation testing tool; each process configures and starts a browser and simulates user operations. For the set S of Web site URLs to be detected, the steps are:
(1) Select a URL from S and visit the Web site; simulate a mouse click on a blank area of the page (to dismiss any floating login window), simulate scroll-wheel operation to scroll to the bottom of the page (to load the page completely), and record the page source code.
(2) Locate the search box (that is, the <input> tag) on the home page, simulate the user entering a search item (the set of item categories is extensible), simulate pressing Enter, randomly select 3 picture links on the resulting page and click them, and record the search item.
(3) Extract the text links and picture links on the home page, randomly click 3 of each, and record the content associated with the clicked links.
(4) Close all windows. If URLs of Web sites still to be detected remain in S, repeat from step (1); otherwise go to step (5).
(5) Apply the relevance analysis method of step 2 to the recorded Web site data, that is, the simulated user operations and page information, to obtain the set of Web sites that track users.
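Steps (2) and (3) both pick 3 links at random. One way to do this without clicking the same link twice is a partial Fisher-Yates shuffle, sketched below. The function name and the injectable random source are assumptions, and the patent does not say whether repeated picks are allowed; the browser automation calls themselves are not shown because the patent does not name a specific tool.

```javascript
// Sketch only: sampleLinks is a hypothetical helper; the patent states
// only that 3 links are chosen at random.

// Return up to k distinct items chosen uniformly at random.
// `rng` must return a number in [0, 1); it defaults to Math.random and
// can be replaced to make runs reproducible.
function sampleLinks(items, k = 3, rng = Math.random) {
  const pool = items.slice();
  const n = Math.min(k, pool.length);
  for (let i = 0; i < n; i++) {
    // Swap a random element from the not-yet-chosen tail into position i.
    const j = i + Math.floor(rng() * (pool.length - i));
    const tmp = pool[i];
    pool[i] = pool[j];
    pool[j] = tmp;
  }
  return pool.slice(0, n);
}
```

If a page has fewer than 3 candidate links, the helper simply returns all of them, so the automation script does not need a separate check.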
This embodiment implements the automatic Web tracking detection method based on content association and helps guard against leakage of user privacy. Because the invention starts from the effect of Web tracking, it avoids the need in existing methods to keep updating prior knowledge of tracking techniques. In addition, the automated detection process facilitates large-scale analysis of how Web tracking is applied in practice and the discovery of new Web tracking techniques.
Claims (3)
1. An automatic Web tracking detection method based on content association, characterized by comprising the following steps:
(1) collecting Web page elements and user operation information by means of a browser extension, wherein the page elements comprise all text description information and picture links in a page; the user operation information comprises search content input by the user, text description information of clicked links, and text description information and link URLs of clicked pictures;
(2) analyzing page content relevance, including text relevance and picture relevance, based on the Web page elements and the user operation information, and judging whether the Web site tracks the user, wherein
the text relevance is expressed by a text matching value, calculated as follows: a text analysis tool performs keyword extraction and word segmentation on the user operation information obtained in step (1); the number of occurrences of each keyword in the page element information is then counted, and the sum of these counts is the text matching value $MatchText_{US}$;
the picture relevance is expressed by a picture matching value, calculated as follows: the pictures clicked by the user and all pictures on the page are recognized with an image recognition algorithm and a machine learning algorithm to obtain two sets of picture categories, $S_1$ and $S_2$; the number of times each element of $S_1$ occurs in $S_2$ is then counted, and the sum of these counts is the picture matching value $MatchImage_{US}$;
the page content relevance is: $Match_{US} = MatchText_{US} + MatchImage_{US}$;
(3) implementing automatic Web tracking detection by using a browser automation testing tool.
2. The automatic Web tracking detection method based on content association according to claim 1, wherein step (2) comprises:
recording the page element information and user operation information of two visits by a user to a Web site, and calculating the page content relevance of each visit, $Match_{US_1}$ and $Match_{US_2}$; when the difference between the relevance of the later visit and that of the earlier visit is greater than a specified threshold, the Web site is considered to track the user.
3. The automatic Web tracking detection method based on content association according to claim 1, wherein step (3) comprises: using a browser automation testing tool to start and configure the browser and visit the Web site, and writing automation scripts to simulate the user's clicking and text input operations, so as to obtain the simulated user operations and page information; and obtaining the set of Web sites that track users by means of the analysis in step (2).
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711282970.5A CN108171074B (en) | 2017-12-07 | 2017-12-07 | Web tracking automatic detection method based on content association |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108171074A | 2018-06-15 |
CN108171074B | 2021-03-26 |
Family ID: 62524462
Family Applications (1)
Application Number | Status | Priority Date | Filing Date | Title |
---|---|---|---|---|
CN201711282970.5A | Active | 2017-12-07 | 2017-12-07 | Web tracking automatic detection method based on content association |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108171074B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109309664B (en) * | 2018-08-14 | 2021-03-23 | 中国科学院数据与通信保护研究教育中心 | Browser fingerprint detection behavior monitoring method |
US11093644B2 (en) * | 2019-05-14 | 2021-08-17 | Google Llc | Automatically detecting unauthorized re-identification |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106650382A (en) * | 2016-12-30 | 2017-05-10 | 北京工业大学 | Browser-based high-performance user tracking method |
CN107239491A (en) * | 2017-04-25 | 2017-10-10 | 广州阿里巴巴文学信息技术有限公司 | Method, device, browser and electronic device for implementing user behavior tracking |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8225401B2 (en) * | 2008-12-18 | 2012-07-17 | Symantec Corporation | Methods and systems for detecting man-in-the-browser attacks |
Non-Patent Citations (1)
Title |
---|
Research on browser fingerprint detection and identification technology (浏览器指纹探测识别技术研究); Jiang Jun et al.; 《保密科学技术》; 2017-01-31; full text *
Also Published As
Publication number | Publication date |
---|---|
CN108171074A (en) | 2018-06-15 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||