CN108171074A - Web tracking automatic detection method based on content association - Google Patents

Info

Publication number
CN108171074A
Authority
CN
China
Prior art keywords
user
web
content
page
tracking
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711282970.5A
Other languages
Chinese (zh)
Other versions
CN108171074B (en)
Inventor
Yang Ming (杨明)
Zhou Jiahuan (周佳欢)
Luo Junzhou (罗军舟)
Wu Wenjia (吴文甲)
Ling Zhen (凌振)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2017-12-07
Filing date
2017-12-07
Publication date
2018-06-15
Application filed by Southeast University
Priority to CN201711282970.5A
Publication of CN108171074A
Application granted
Publication of CN108171074B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G06F21/6263 Protecting personal data, e.g. for financial or medical purposes during internet communication, e.g. revealing personal data from cookies

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Databases & Information Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a Web tracking automatic detection method based on content association. It relates to the field of Web user privacy protection and mainly addresses the problem that some Web sites collect and leak sensitive user information without the user's knowledge. Implemented as a browser extension, the invention collects the user's operations on Web pages together with page element information, and uses techniques such as text analysis and image recognition to compare the page content of two successive visits and analyze its association with the user's operations, thereby judging whether the Web site is collecting user information. Because ever-evolving Web tracking techniques can evade traditional detection methods, the invention starts from the effect of Web tracking. It can not only effectively detect user privacy leakage but also help researchers discover novel tracking techniques.

Description

Web tracking automatic detection method based on content association
Technical field
The present invention relates to methods for protecting the privacy of Web users, and in particular to a Web tracking automatic detection method based on page content association.
Background art
With the rapid spread of Web technologies and services, more and more users cannot do without the Web. At the same time, Web sites and advertising service providers want to use device identification to deliver effective content recommendations and more precisely targeted advertising. However, some advertisers "cooperate" with one another and trade users' private information in order to associate users across domains and then analyze their habits and preferences, which largely violates users' wishes for privacy protection. At present, Web-based device identification relies mainly on cookies and browser fingerprints. A cookie is text information stored in the user's browser by a Web server. It can contain user- and device-related information, and when the user visits the Web site, the server can read the cookie to obtain the user's browsing history and behavior. A browser fingerprint is composed of many attributes of the browser, operating system, and device hardware, such as the UserAgent, fonts, and plug-ins. Because it does not depend on any single feature, it is fairly robust.
To counter the privacy leakage threat posed by Web tracking, researchers have proposed detection and defense methods. Cookies can be evaded by disabling them in the browser or deleting them periodically. Browser fingerprinting, however, collects user information entirely without the user's knowledge, and at present it can only be detected by monitoring calls to sensitive JavaScript APIs. That approach presupposes a comprehensive understanding of the attack techniques: if a Web site uses a new attribute that has not yet been discovered, it can evade this detection.
Summary of the invention
Purpose of the invention: To address the deficiencies of the prior art, the present invention exploits the correlation between a Web site's intelligent recommendations and the user's operations and proposes a Web tracking automatic detection method based on content association, which detects from the effect of tracking whether the user is being tracked.
Technical solution: The Web tracking automatic detection method based on content association of the present invention comprises the following steps in order:
1) Collection of page elements and user operation information: When the user visits a Web site, a browser extension obtains page element information (including the text description of every link and the URLs of image links) and user operation information (including the search content entered, the text description of each clicked link, and the link URL of each clicked picture), and writes them to a file and a database.
2) Analysis of page content association: Page content association comprises text association and picture association. Text association: keywords are extracted from the text descriptions in the page element information and in the user operation information, and text matching is used to analyze the degree of association between the two. Picture association: the pictures in the page element information and in the user operation information are downloaded, and image recognition is used to analyze the degree of association between the two.
3) Implementation of the automated workflow: A browser automation testing tool starts and configures the browser and simulates user operations, and scripts automate the workflow, achieving automatic Web tracking detection.
Beneficial effects: Compared with the prior art, the present invention has the following advantages:
1. The present invention starts from the effect of Web tracking: by analyzing the association between the content of a Web site on two successive visits and the user's operations, it judges whether the Web site uses tracking techniques to collect user information. Even as Web tracking techniques continue to evolve, as long as a Web site uses them to recommend advertisements related to the user, the tracking can be detected by the present invention. This avoids the prior art's need to constantly update prior knowledge of Web tracking techniques and, combined with manual code analysis, also helps to discover novel Web tracking techniques.
2. The present invention uses a browser automation testing tool and automation scripts to automate the whole workflow (starting and configuring the browser, visiting the Web site, simulating user operations, and collecting page and user operation information), achieving automatic Web tracking detection without manual involvement. It therefore facilitates large-scale Web tracking detection for examining how Web tracking techniques are applied in practice.
Brief description of the drawings
Fig. 1 is a flow chart of the method of the present invention.
Detailed description
The technical solution of the present invention is further described below with reference to the accompanying drawing.
As shown in Fig. 1, the content-association-based Web tracking automatic detection workflow is divided into three main steps: collection of page element information and user operation information, analysis of page content association, and implementation of the automated workflow. Investigation shows that when a user visits a Web site, a browser extension can record page elements and user operation information. By comparing this information, the present invention analyzes the association between page content and user operations to judge whether the Web site is tracking the user. This not only avoids existing methods' dependence on prior knowledge of tracking techniques but also helps to discover novel Web tracking techniques. The specific implementation is as follows:
Step 1: Collection of page element information and user operation information
11) Acquisition of page element information
Here, page element information includes the text description of every link and the URLs of image links. Acquiring the page element information means obtaining the page's HTML source, which can be done with the JavaScript API document.getElementsByTagName('html')[0].innerHTML. Because some Web sites load content dynamically, the complete HTML source cannot be obtained when the user has only just opened the page. The present invention uses JavaScript (window.scrollTo) to simulate mouse-wheel scrolling so that the page loads completely.
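The scroll-then-read idea can be illustrated with a minimal sketch. It is not the patent's implementation: the function name, the step size, and the step limit are assumptions, and `win` and `doc` stand for the browser's `window` and `document` objects so that the function can also be exercised with stubs. A real page would additionally need a pause between scroll steps so that lazily loaded content has time to arrive; that is omitted here.

```javascript
// Sketch only: scroll down in steps, then read the page source.
// stepPx and maxSteps are assumed values, not taken from the patent.
function collectPageSource(win, doc, stepPx = 1000, maxSteps = 50) {
  let y = 0;
  for (let i = 0; i < maxSteps; i++) {
    // Re-read the height on every step because it may grow as content loads.
    const height = doc.documentElement.scrollHeight;
    if (y >= height) break;            // reached the current bottom of the page
    y = Math.min(y + stepPx, height);
    win.scrollTo(0, y);                // simulated wheel scrolling
  }
  // Same expression as in the text: the HTML source of the whole page.
  return doc.getElementsByTagName('html')[0].innerHTML;
}
```

In a browser extension content script this would be called as `collectPageSource(window, document)`.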
12) Acquisition of user operation information
User operation information includes the search content entered by the user, the text description of each clicked link, and the text description and link URL of each clicked picture.
The search content entered by the user is obtained by adding a listener that dynamically checks input tags for real-time changes. The specific method is as follows:
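The patent's own listing is not reproduced in this text. As an illustration only, a listener of this kind might look like the sketch below. The function name and the `record` callback (which would write to the extension's storage) are hypothetical; `doc` stands for the page's `document`, and the standard `input` event is assumed to be what "real-time changes" refers to.

```javascript
// Sketch only: report the current value of every <input> whenever it changes.
// `record` is a hypothetical callback supplied by the caller.
function watchSearchInputs(doc, record) {
  const inputs = Array.from(doc.getElementsByTagName('input'));
  for (const input of inputs) {
    // The 'input' event fires on every change to the field's value.
    input.addEventListener('input', (event) => {
      record({ type: 'search', value: event.target.value });
    });
  }
  return inputs.length; // number of inputs being watched
}
```

Inputs added to the page after this call would not be covered; a fuller version might re-scan the page or observe DOM mutations.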
The text description of a clicked link or picture and the URL of a clicked picture are obtained by listening for the user's click behavior and reading the link and the surrounding text description of the clicked object. Because the clicked object usually corresponds to an <img> or <a> tag, the present invention obtains only the useful attributes of an <img> tag (src, alt, title) and the text of an <a> tag (obtained via innerText). The specific method is as follows:
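The patent's own listing is not reproduced in this text. The sketch below shows one way such a click listener could be written, under the assumption that only the attributes named above are needed. Both function names and the `record` callback are hypothetical.

```javascript
// Sketch only: describe a clicked element using the attributes named in the text.
// Returns null for elements that are neither <img> nor <a>.
function describeClickTarget(el) {
  const tag = (el.tagName || '').toLowerCase();
  if (tag === 'img') {
    return { type: 'image', src: el.src || '', alt: el.alt || '', title: el.title || '' };
  }
  if (tag === 'a') {
    return { type: 'link', text: (el.innerText || '').trim() };
  }
  return null;
}

// Registers one click listener on the document; `record` is a hypothetical callback.
function watchClicks(doc, record) {
  doc.addEventListener('click', (event) => {
    const info = describeClickTarget(event.target);
    if (info) record(info);
  }, true); // capture phase, so the click is seen before page handlers run
}
```

Listening once on the document rather than on each element also covers links and pictures that are added to the page later.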
Step 2: Page content association analysis
21) Compute the page content association from the page element information and the user operation information.
Page content association analysis has two parts: text association and picture association. Text association is expressed as a text match value, computed as follows: a text analysis tool performs keyword extraction and word segmentation on the user operation information obtained in Step 1; the number of occurrences of each keyword in the page element information is then counted, and the sum of these counts is the text match value. When extracting keywords, the present invention considers only words that carry real meaning, such as nouns, verbs, and adjectives, and ignores unimportant words such as prepositions, numerals, and measure words. Word segmentation further splits the long Chinese words that were extracted, for example splitting the Chinese word for "jeans" into the words for "cowboy" and "trousers", which improves matching accuracy. The specific practice is as follows:
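The patent's own listing is not reproduced in this text. The counting step can be sketched as follows, with two loudly stated simplifications: a small stop-word list stands in for part-of-speech filtering, and splitting on non-letter characters stands in for a real (Chinese) word segmenter. All names are illustrative.

```javascript
// Sketch only. STOP_WORDS is an assumed stand-in for part-of-speech filtering.
const STOP_WORDS = new Set(['a', 'an', 'the', 'of', 'for', 'in', 'on', 'to', 'and', 'with']);

// Split text into distinct lower-case keywords, dropping stop words and pure numbers.
function extractKeywords(text) {
  const words = text.toLowerCase().split(/[^\p{L}\p{N}]+/u).filter(Boolean);
  return [...new Set(words.filter((w) => !STOP_WORDS.has(w) && !/^\d+$/.test(w)))];
}

// Count non-overlapping occurrences of `needle` in `haystack` (substring match).
function countOccurrences(haystack, needle) {
  let count = 0;
  let pos = haystack.indexOf(needle);
  while (pos !== -1) {
    count += 1;
    pos = haystack.indexOf(needle, pos + needle.length);
  }
  return count;
}

// Text match value: sum over keywords of their occurrence counts in the page text.
function textMatchValue(userOperationText, pageText) {
  const page = pageText.toLowerCase();
  return extractKeywords(userOperationText)
    .reduce((sum, kw) => sum + countOccurrences(page, kw), 0);
}
```

Because matching is by substring, a keyword such as "men" would also be counted inside "women"; a production version would likely match on segmented tokens instead.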
Picture association is expressed as a picture match value, computed as follows: techniques such as image recognition and machine learning algorithms identify the pictures the user clicked and all pictures on the page, yielding two sets of picture categories, $S_1$ and $S_2$; the number of times each element of $S_1$ appears in $S_2$ is then counted, and the sum of these counts is the picture match value. The final content association is the sum of the text match value $MatchText_{US}$ and the picture match value $MatchImage_{US}$:
$Match_{US} = MatchText_{US} + MatchImage_{US}$
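As a small illustration of the picture match value and the final sum, the sketch below assumes that some image classifier has already produced category labels for the clicked pictures and for the pictures on the page. The classifier itself is out of scope, the function names are illustrative, and the page labels are treated as a list that may contain repeats so that occurrences can be counted.

```javascript
// Sketch only: labels are assumed to come from an image classifier.
// For each distinct category among the clicked pictures, count how many
// page pictures carry the same category, and add the counts up.
function pictureMatchValue(clickedLabels, pageLabels) {
  let total = 0;
  for (const label of new Set(clickedLabels)) {
    total += pageLabels.filter((l) => l === label).length;
  }
  return total;
}

// Content association: text match value plus picture match value.
function contentAssociation(textMatch, pictureMatch) {
  return textMatch + pictureMatch;
}
```

For example, clicked categories `['jeans', 'shoes']` against page categories `['jeans', 'jeans', 'hat', 'shoes']` give a picture match value of 3.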
22) Judge whether the Web site is tracking the user from the difference in association between two successive visits to the Web site.
The specific steps are as follows: when the user visits Web site A, the page element information $S_1$ of site A and the user operation information $U$ are recorded; when the user visits Web site A a second time, its page element information $S_2$ is recorded. The page content association is computed for each of the two visits. When the difference between the two visits in the association between the site's page information and the user's operations exceeds a threshold, the Web site is considered able to recommend targeted advertisements to the user, and is therefore considered to be tracking the user, i.e.:
$Match_{US_2} - Match_{US_1} > threshold$
where $Match_{US_2}$ is the association between the page information of the second visit and the user's operations, $Match_{US_1}$ is the association between the page information of the first visit and the user's operations, and $threshold$ is the specified threshold; in the present invention, $threshold$ is set to 5.
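The decision rule reduces to a single comparison. The sketch below reads the rule as "second-visit association minus first-visit association is greater than the threshold", which is how the surrounding text and claim 4 describe it. The function and constant names are illustrative.

```javascript
// Threshold value stated in the text.
const DEFAULT_THRESHOLD = 5;

// Sketch only: a site is flagged when the association grows by more than
// the threshold between the first and the second visit.
function isTracking(matchFirstVisit, matchSecondVisit, threshold = DEFAULT_THRESHOLD) {
  return matchSecondVisit - matchFirstVisit > threshold;
}
```

With the default threshold, association values of 2 and 9 flag the site (difference 7), while 2 and 7 do not (difference 5 is not greater than 5).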
Step 3: Implementation of the automated workflow
The present invention uses a browser automation testing tool to start the browser, install the browser extension, and simulate user operations, and uses multi-process automation scripts to automate the workflow. As shown in Fig. 1, the task manager controls the number of concurrent processes for the whole run and distributes tasks (the URL of the Web site to visit, the browser configuration, and so on) to each process, that is, to each browser automation testing tool instance; each process configures and starts a browser and simulates user operations. For the set S of URLs of Web sites to be detected, the steps are:
(1) Choose a URL from S and visit the Web site; simulate a mouse click on a blank area of the page (to dismiss any floating login window); simulate mouse-wheel scrolling to the bottom of the page (so that the page loads completely); and record the page source.
(2) Extract the search box on the home page (that is, the <input> tag); simulate the user entering a search item (the items cover multiple categories and the list is extensible); simulate pressing Enter; randomly select and click 3 picture links on the resulting page; and record the search item.
(3) Extract the text links and image links on the home page, click 3 of each at random, and record the content related to the clicked links.
(4) Close all windows; if S still contains Web site URLs to be detected, repeat from step (1); otherwise go to step (5).
(5) From the Web site data recorded in the steps above, that is, the simulated user operations and page information, use the association analysis method of Step 2 to obtain the set of Web sites that track users.
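The per-site loop of steps (1) to (4) can be sketched as below. This is a single-process illustration only: the task manager and multi-process scheduling are left out, and `driver` is a hypothetical wrapper around whichever browser automation tool is used. None of its method names come from the patent or from a real library. Returning to the home page before step (3) is also an assumption, since the text does not say how the home page is reached again after the search.

```javascript
// Sketch only: `driver` is a hypothetical interface with the methods used below.
function detectSites(urls, driver, searchTerm) {
  const records = [];
  for (const url of urls) {
    const record = { url, searchTerm, clicked: [] };

    // (1) Visit the site, dismiss floating windows, scroll, record the source.
    driver.open(url);
    driver.clickBlankArea();
    driver.scrollToBottom();
    record.pageSource = driver.pageSource();

    // (2) Enter the search term, then click 3 random picture links in the results.
    driver.search(searchTerm);
    record.clicked.push(...driver.clickRandom('image', 3));

    // (3) Back on the home page (assumed), click 3 text links and 3 image links.
    driver.open(url);
    record.clicked.push(...driver.clickRandom('text', 3));
    record.clicked.push(...driver.clickRandom('image', 3));

    // (4) Close all windows before moving on to the next URL.
    driver.closeAll();
    records.push(record);
  }
  // (5) The records are the input for the association analysis of Step 2.
  return records;
}
```

A second pass over the same URLs would supply the second-visit page information that the association comparison needs.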
As the above embodiment shows, the present invention implements a Web tracking automatic detection method based on content association that can effectively guard against user privacy leakage. The invention starts from the effect of Web tracking and avoids existing methods' need to constantly update prior knowledge of tracking techniques. In addition, the invention automates the detection workflow, which facilitates large-scale analysis of how Web tracking is applied in practice and also helps to discover novel Web tracking techniques.

Claims (5)

1. A Web tracking automatic detection method based on content association, characterized by comprising the following steps:
(1) collecting Web page elements and user operation information by means of a browser extension;
(2) analyzing the page content association based on the Web page elements and user operation information, and judging whether the Web site is tracking the user;
(3) implementing automatic Web tracking detection using a browser automation testing tool.
2. The Web tracking automatic detection method based on content association according to claim 1, characterized in that in step (1) the page elements include all text description information and image links in the page, and the user operation information includes the search content entered by the user, the text description information of clicked links, and the text description information and link URL of clicked pictures.
3. The Web tracking automatic detection method based on content association according to claim 2, characterized in that in step (2) the page content association includes text association and picture association, wherein:
text association is expressed as a text match value, computed as follows: a text analysis tool performs keyword extraction and word segmentation on the user operation information obtained in step (1); the number of occurrences of each keyword in the page element information is then counted, and the sum of these counts is the text match value $MatchText_{US}$;
picture association is expressed as a picture match value, computed as follows: image recognition algorithms and machine learning algorithms identify the pictures the user clicked and all pictures on the page, yielding two sets of picture categories, $S_1$ and $S_2$; the number of times each element of $S_1$ appears in $S_2$ is then counted, and the sum of these counts is the picture match value $MatchImage_{US}$;
the page content association degree is: $Match_{US} = MatchText_{US} + MatchImage_{US}$.
4. The Web tracking automatic detection method based on content association according to claim 3, characterized in that step (2) includes:
recording the page element information and user operation information for two successive visits by the user to the Web site, and computing the page content association degree for each visit, $Match_{US_1}$ and $Match_{US_2}$; when the difference between the page association degrees of the two visits is greater than the specified threshold $threshold$, the Web site is considered to be tracking the user.
5. The Web tracking automatic detection method based on content association according to claim 1, characterized in that step (3) includes: using a browser automation testing tool to start and configure the browser and visit Web sites, and writing automation scripts that simulate user operations such as clicking and entering text; obtaining the simulated user operations and page information; and obtaining, through the analysis in step (2), the set of Web sites that track users.
CN201711282970.5A 2017-12-07 2017-12-07 Web tracking automatic detection method based on content association Active CN108171074B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711282970.5A CN108171074B (en) 2017-12-07 2017-12-07 Web tracking automatic detection method based on content association

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711282970.5A CN108171074B (en) 2017-12-07 2017-12-07 Web tracking automatic detection method based on content association

Publications (2)

Publication Number Publication Date
CN108171074A 2018-06-15
CN108171074B 2021-03-26

Family

ID=62524462

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711282970.5A Active CN108171074B (en) 2017-12-07 2017-12-07 Web tracking automatic detection method based on content association

Country Status (1)

Country Link
CN (1) CN108171074B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100162393A1 (en) * 2008-12-18 2010-06-24 Symantec Corporation Methods and Systems for Detecting Man-in-the-Browser Attacks
CN106650382A (en) * 2016-12-30 2017-05-10 北京工业大学 Browser-based high-performance user tracking method
CN107239491A (en) * 2017-04-25 2017-10-10 广州阿里巴巴文学信息技术有限公司 Method, device, browser and electronic device for implementing user behavior tracking

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
NATALIIA BIELOVA et al.: 《24th ACM-SIGSAC Conference on Computer and Communications Security (ACM CCS)》, 3 November 2017 *
XIAOFENG LIU et al.: 《1st IEEE International Conference on Data Science in Cyberspace (DSC)》, 16 June 2016 *
JIANG JUN (江军) et al.: "Research on browser fingerprint detection and identification techniques" (浏览器指纹探测识别技术研究), 《保密科学技术》 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109309664A (en) * 2018-08-14 2019-02-05 中国科学院数据与通信保护研究教育中心 Browser fingerprint detection behavior monitoring method
CN109309664B (en) * 2018-08-14 2021-03-23 中国科学院数据与通信保护研究教育中心 Browser fingerprint detection behavior monitoring method
WO2020231988A1 (en) * 2019-05-14 2020-11-19 Google Llc Automatically detecting unauthorized re-identification
US11093644B2 (en) 2019-05-14 2021-08-17 Google Llc Automatically detecting unauthorized re-identification
CN113287143A (en) * 2019-05-14 2021-08-20 谷歌有限责任公司 Automatic detection of unauthorized re-identification
CN113287143B (en) * 2019-05-14 2022-12-16 谷歌有限责任公司 Automatic detection of unauthorized re-identification
US11720710B2 (en) 2019-05-14 2023-08-08 Google Llc Automatically detecting unauthorized re-identification

Also Published As

Publication number Publication date
CN108171074B (en) 2021-03-26

Similar Documents

Publication Publication Date Title
Vishwakarma et al. Detection and veracity analysis of fake news via scrapping and authenticating the web search
US10032081B2 (en) Content-based video representation
CN103559235B Method for detecting and identifying malicious web pages in online social networks
CN109299258B Public opinion event detection method, device and equipment
CN109815386B User portrait-based construction method and device and storage medium
CN101534306A Method and device for detecting phishing websites
CN102436564A Method and device for identifying falsified webpage
CN105824822A Method for clustering phishing pages to locate target pages
WO2017084205A1 Network user identity authentication method and system
CN103020123A Method for searching bad video website
CN108694325B Method and device for identifying specified type of website
CN104036190A Method and device for detecting page tampering
CN108171074A Web tracking automatic detection method based on content association
CN105447148B Cookie identifier association method and device
CN104036189A Page distortion detecting method and black link database generating method
CN111199172A Terminal screen recording-based processing method and device and storage medium
CN103729354B Web information processing method and device
CN108595453B URL (Uniform resource locator) identifier mapping obtaining method and device
CN108614849A Web advertisement detection method based on dynamic instrumentation and static multi-script page feature extraction
CN109165264B Webpage analysis method and device based on diversified heat maps
CN110866170A Importance evaluation method, search method and system for Tor darknet service based on site quality
CN104978431B Web data fusion method and device
CN114780891A Website key resource analysis method and device based on page rendering contribution degree
CN104063491B Method and device for detecting page tampering
KR101277300B1 (en) Method and apparatus for presenting personalized advertisements

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant