CN114662145A - Access tracking detection method and device, readable storage medium and electronic equipment

Info

Publication number
CN114662145A
Authority
CN
China
Prior art keywords
information
access
element information
time
difference
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210281302.5A
Other languages
Chinese (zh)
Inventor
严梦嘉
皇甫敏娜
黄磊
邓建锋
周旭华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Telecom Corp Ltd
Original Assignee
China Telecom Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Telecom Corp Ltd
Priority to CN202210281302.5A
Publication of CN114662145A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60: Protecting data
    • G06F21/62: Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218: Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245: Protecting personal data, e.g. for financial or medical purposes
    • G06F21/6263: Protecting personal data, e.g. for financial or medical purposes during internet communication, e.g. revealing personal data from cookies

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Databases & Information Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The disclosure provides an access tracking detection method, an access tracking detection device, a computer-readable storage medium and electronic equipment, and relates to the technical field of information security. The method comprises: acquiring page element information of a site on a first visit; acquiring page element information of the site on a second visit; determining page element difference information between the first visit and the second visit; calculating a correlation difference from the page element difference information and user operation element information; and determining an access tracking detection result from the correlation difference. By analyzing the relevance between the content of the site on the two visits and the user's input operations, and by removing the page element information that is unchanged between the two visits, the method comprehensively judges whether the site performs access tracking, improving the accuracy and efficiency of access tracking detection.

Description

Access tracking detection method and device, readable storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of information security technologies, and in particular, to an access tracking detection method and apparatus, a computer-readable storage medium, and an electronic device.
Background
With the rapid popularization of Web (World Wide Web) technology and services, a growing number of users depend on the Web. Meanwhile, Web sites and advertising service providers want to use device identification to recommend content effectively and deliver advertisements more accurately. Some advertisers, however, "cooperate" with one another to sell users' private information, achieving cross-domain user association and analyzing users' behavior habits and preferences, which seriously violates users' wishes regarding privacy protection.
Currently, Web-based device identification relies mainly on cookies (data stored on the user's local terminal) and browser fingerprints. A cookie is text information stored in the user's browser by a Web server and can contain information about the user and the device; when the user visits a Web site, the server can read the cookie and so obtain the user's browsing records and behavior. A browser fingerprint is composed of multiple attributes related to the browser, the operating system and the device hardware, such as the user agent, fonts and plug-ins; because it does not depend on any single feature, it is more robust. As for the privacy leakage threat posed by Web tracking, a user can avoid cookie-based tracking by disabling cookies or deleting them regularly through the browser. Browser fingerprinting, however, collects user information entirely without the user's knowledge. At present it can be detected only by monitoring calls to sensitive JavaScript APIs, so new attack techniques used for Web tracking still carry a risk of privacy leakage.
It is noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure and therefore may include information that does not constitute prior art that is already known to a person of ordinary skill in the art.
Disclosure of Invention
The present disclosure provides an access tracking detection method, an access tracking detection device, a computer-readable storage medium, and an electronic device, which at least to some extent overcome the problem of low efficiency and accuracy of access tracking detection in the related art.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.
According to an aspect of the present disclosure, there is provided an access tracking detection method including:
acquiring page element information of a site visited for the first time;
acquiring page element information of the site visited for the second time;
determining page element difference information between the first visit and the second visit;
calculating a correlation difference according to the page element difference information and user operation element information;
and determining an access tracking detection result according to the correlation difference.
In one embodiment of the present disclosure, further comprising:
and determining the access tracking detection result according to the correlation difference and the access interval coefficient.
In one embodiment of the present disclosure, further comprising:
acquiring the access time of the first visit to the site;
acquiring the access time of the second visit to the site;
determining the interval time difference and the step number difference between the first visit and the second visit;
and determining the access interval coefficient according to the interval time difference and the step number difference.
In one embodiment of the present disclosure, further comprising:
acquiring user operation information;
extracting keywords and performing word segmentation processing on the user operation information;
calculating the word frequency of each keyword in the user operation information;
and determining the user operation element information according to the total number of the web pages, the number of the web pages containing the keywords and the word frequency of the keywords.
In one embodiment of the present disclosure, further comprising:
acquiring page content information of an access site;
extracting keywords and performing word segmentation processing on the page content information;
calculating the word frequency of each keyword in the page content information;
and determining the page element information according to the total number of the web pages, the number of the web pages containing the keywords and the word frequency of the keywords.
In one embodiment of the present disclosure, the page content information includes text information and picture information;
the process of extracting the keywords and segmenting the words of the page content information comprises the following steps:
extracting keywords and performing word segmentation processing on the character information of the text information;
and converting the picture information into character information, and then extracting keywords and performing word segmentation processing.
In one embodiment of the present disclosure, the user operation information includes browser side information and operating system side information.
In one embodiment of the present disclosure, further comprising:
and when the number of components of the page element information difference information differs from that of the user operation element information, adjusting the page element information difference information to have the same number of components as the user operation element information by a repeated filling method.
According to another aspect of the present disclosure, there is also provided an access tracking detection apparatus including:
the first site acquisition module is used for acquiring page element information of a site visited for the first time;
the second site acquisition module is used for acquiring page element information of a site visited for the second time;
the differential information determining module is used for determining page element information differential information of the first-time visited site and the second-time visited site;
the correlation difference value calculation module is used for calculating a correlation difference value according to the page element information difference information and the user operation element information;
and the detection result determining module is used for determining the access tracking detection result according to the correlation difference.
According to another aspect of the present disclosure, there is also provided an electronic device including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform any of the above access tracking detection methods via execution of the executable instructions.
According to another aspect of the present disclosure, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements any one of the above-described access tracking detection methods.
The access tracking detection method, device, computer-readable storage medium and electronic equipment provided by the embodiments of the disclosure acquire page element information of a site on a first visit and on a second visit, remove the page element information that is unchanged between the two visits, determine the page element difference information between the first visit and the second visit, and analyze the relevance between that difference information and the user operation element information. Whether the site performs access tracking is thereby judged comprehensively, improving the accuracy and efficiency of access tracking detection.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty.
FIG. 1 is a flow chart illustrating an access tracking detection method in an embodiment of the present disclosure;
FIG. 2 is a flow chart illustrating a method for determining an access interval coefficient in an embodiment of the present disclosure;
FIG. 3 is a flowchart illustrating a method for obtaining page element information of a visited site according to an embodiment of the present disclosure;
FIG. 4 is a flow chart illustrating a method for obtaining information of a user operation element according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of an access tracking detection apparatus according to an embodiment of the disclosure;
FIG. 6 is a flow chart illustrating a further method of access tracking detection in an embodiment of the present disclosure;
FIG. 7 is a schematic diagram illustrating a method for obtaining search content according to an embodiment of the disclosure;
FIG. 8 is a schematic diagram illustrating obtaining text information in an embodiment of the disclosure;
FIG. 9 is a diagram illustrating reading clipboard information in an embodiment of the present disclosure;
FIG. 10 illustrates a plot of the access interval coefficient function with t = 2 in an embodiment of the present disclosure;
FIG. 11 shows a plot of the access interval coefficient function with t = 1 in an embodiment of the present disclosure;
FIG. 12 shows a plot of the access interval coefficient function with t = 0.2 in an embodiment of the present disclosure;
FIG. 13 illustrates a step differentiation diagram in an embodiment of the present disclosure;
FIG. 14 shows a further step differentiation diagram in an embodiment of the present disclosure;
FIG. 15 illustrates an access tracking automated detection mechanism in an embodiment of the present disclosure; and
fig. 16 shows a block diagram of an electronic device in an embodiment of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The present exemplary embodiment will be described in detail below with reference to the drawings and examples.
First, an embodiment of the present disclosure provides an access tracking detection method, which may be executed by any electronic device with computing processing capability.
Fig. 1 shows a flowchart of an access tracking detection method in an embodiment of the present disclosure, and as shown in fig. 1, the access tracking detection method provided in the embodiment of the present disclosure includes the following steps:
S102, acquiring page element information of the site visited for the first time;
S104, acquiring page element information of the site visited for the second time.
In one embodiment, the page content information of the first visit and of the second visit is obtained through a browser extension. The page content information includes text information and picture information. Keyword extraction and word segmentation are performed on the characters of the text information; the picture information is first converted into character information and then undergoes the same keyword extraction and word segmentation. The word frequency of each keyword in the page content information is calculated, and the page element information is determined from the total number of web pages, the number of web pages containing the keyword, and the word frequency of the keyword. It should be noted that the page content information includes, but is not limited to, text information and picture information, for example the text-type description information and the picture link URLs corresponding to all links. For example, the page content information of the first visit is obtained through the browser extension and includes text such as "jeans". The keyword "jeans" is extracted and segmented into "jeans" and "trousers"; the word frequency of "jeans" among all keywords is calculated; the total number of web pages and the number of web pages containing the keyword "jeans" are counted; and the first-visit page element information, which includes "jeans", is determined from these three quantities.
As a further example, the page content information of the second visit is obtained through the browser extension and includes a picture of "jeans". The keyword "jeans" is identified by image recognition, extracted, and segmented into "jeans" and "trousers"; the word frequency of "jeans" among all keywords is calculated; the total number of web pages and the number of web pages containing the keyword "jeans" are counted; and the second-visit page element information, which includes "jeans", is determined from these three quantities.
S106, determining page element difference information between the first visit and the second visit.
In one embodiment, the page element difference information $\delta_1$ and the page element difference information $\delta_2$ are calculated as follows:
$$\delta_1 = S_1 - (S_1 \cap S_2) \tag{1}$$
$$\delta_2 = S_2 - (S_1 \cap S_2) \tag{2}$$
where $\delta_1$ is the information set obtained by removing, from the first-visit page element information $S_1$, the page information common to the two successive visits to the site, and $\delta_2$ is the information set obtained by removing the same common page information from the second-visit page element information $S_2$. For example, if $S_1$ includes "jeans" but not "pants", and $S_2$ includes "pants" but not "jeans", then $\delta_1$ includes "jeans" and $\delta_2$ includes "pants".
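The removal of common information can be sketched with plain set operations. This is only an illustration under a simplifying assumption: the page element information of each visit is represented as a set of keywords (the patent text treats it as weighted information), and the function name `page_element_differences` is hypothetical.

```python
def page_element_differences(s1, s2):
    """Return (delta1, delta2): each visit's keywords with the shared ones removed.

    Assumption: page element information is modelled as a set of keywords.
    """
    s1, s2 = set(s1), set(s2)
    common = s1 & s2                     # information present on both visits
    return s1 - common, s2 - common      # what is specific to each visit


# Example mirroring the text: "jeans" only on the first visit,
# "pants" only on the second, and shared page furniture on both.
first_visit = {"jeans", "login", "sale"}
second_visit = {"pants", "login", "sale"}
delta1, delta2 = page_element_differences(first_visit, second_visit)
print(delta1, delta2)
```

Dropping the shared keywords first keeps navigation text and other static page content from dominating the later correlation step.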
S108, calculating a correlation difference according to the page element difference information and the user operation element information.
In one embodiment, the correlation between the user operation element information U and the page element difference information $\delta$ is calculated: the word frequency and importance of each keyword are computed with the TF-IDF algorithm, and the correlation is computed with cosine similarity.
In one embodiment, the correlations between the user operation element information U and the page element difference information $\delta_1$ and $\delta_2$ are calculated separately, using the following formula:
$$\gamma_j = \frac{\sum_i U_i \, \delta_{j,i}}{\sqrt{\sum_i U_i^2} \, \sqrt{\sum_i \delta_{j,i}^2}}, \quad j = 1, 2 \tag{3}$$
where $U_i$ and $\delta_{j,i}$ denote the components of the vectors $U$ and $\delta_j$ respectively; when the numbers of components of $U$ and $\delta_j$ differ, the vectors are padded using the repeated filling method.
$\gamma_1$ is the correlation between the first-visit page element information S1 of the twice-visited site and the user operation element information U; $\gamma_2$ is the correlation between the second-visit page element information S2 of the twice-visited site and the user operation element information U; the correlation difference is $\gamma_2 - \gamma_1$.
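A minimal sketch of the cosine-similarity step follows. The text does not spell out the "repeated filling method", so the padding here (cycling through the shorter vector's own components until the lengths match) is an assumed reading; the function names are hypothetical as well.

```python
import math
from itertools import cycle, islice


def pad_by_repetition(vec, length):
    """Repeat the components of vec cyclically until it has `length` entries.

    Assumption: this is one plausible reading of "repeated filling".
    """
    if not vec:
        return [0.0] * length
    return list(islice(cycle(vec), length))


def cosine_correlation(u, delta):
    """Cosine similarity between two weight vectors, padded to equal length."""
    n = max(len(u), len(delta))
    if n == 0:
        return 0.0
    u = pad_by_repetition(u, n)
    delta = pad_by_repetition(delta, n)
    dot = sum(a * b for a, b in zip(u, delta))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in delta))
    return dot / norm if norm else 0.0


def correlation_difference(u, delta1, delta2):
    """gamma2 - gamma1: how much closer the second visit is to the user's operations."""
    return cosine_correlation(u, delta2) - cosine_correlation(u, delta1)
```

A positive difference means the content that is new on the second visit resembles the user's operations more than the content that was specific to the first visit.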
S110, determining an access tracking detection result according to the correlation difference.
In one embodiment, the access tracking detection result is determined based on the correlation difference and the access interval coefficient.
It should be noted that the access interval coefficient function decreases exponentially as the parameter t increases and also decreases exponentially as the integer n increases; that is, the longer the interval time or the more interval steps between the two visits, the lower the correlation.
In this embodiment, the large amount of information common to the two successive visits is removed as data noise, so that the mass of information on large pages of large sites does not dilute the correlation. This improves the effectiveness of access tracking detection, particularly for large web pages.
Fig. 2 is a flowchart illustrating a method for determining an access interval coefficient according to an embodiment of the present disclosure. As shown in fig. 2, the method includes: S202, acquiring the access time of the first visit to the site; S204, acquiring the access time of the second visit to the site; S206, determining the interval time difference and the step number difference between the first visit and the second visit; and S208, determining the access interval coefficient according to the interval time difference and the step number difference.
In one embodiment, the relevance coefficient $\mu$ is obtained from the correlation difference of the page element difference information and the access interval coefficient; when $\mu$ exceeds a certain threshold, it can be determined that the site is tracking the user. The relevance coefficient $\mu$ is calculated as follows:
[Formula (4): $\mu$ expressed in terms of the access interval coefficient and the correlation difference $\gamma_2 - \gamma_1$; formula image not recovered]
where the access interval coefficient of the two visits is a function of t and n [expression image not recovered], and $\gamma_2 - \gamma_1$ is the correlation difference.
Here t is the time difference, in seconds, between the two visits to Web site A. With $T_1$ the access time of the first visit and $T_2$ the access time of the second visit, the difference is
$$t = T_2 - T_1 \tag{5}$$
and n is the step difference between the two visits, a positive integer greater than or equal to 1. The step difference n can be calculated from the time-ordered page access record. For example, if the user opens page A in the first step and then refreshes the page, n = 1; if the user opens page A in the first step, clicks a link to visit page B, and then clicks a return link to visit page A again, n = 2; the step difference for other cases is obtained by analogy.
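The time difference t and the step difference n can be read off a chronological access log, as in the sketch below. The log format (a list of `(timestamp_seconds, page)` pairs) and the function name are assumptions made for illustration.

```python
def interval_between_visits(log, page):
    """Return (t, n) for the first two visits to `page`, or None if there are fewer.

    log: chronological list of (timestamp_seconds, page) tuples (assumed format).
    t is the time difference in seconds; n is the number of navigation steps
    separating the two visits.
    """
    hits = [(index, ts) for index, (ts, visited) in enumerate(log) if visited == page]
    if len(hits) < 2:
        return None
    (i1, t1), (i2, t2) = hits[0], hits[1]
    return t2 - t1, i2 - i1


# Open A, then refresh it: one step.
print(interval_between_visits([(0, "A"), (30, "A")], "A"))              # (30, 1)
# Open A, follow a link to B, then return to A: two steps.
print(interval_between_visits([(0, "A"), (20, "B"), (45, "A")], "A"))   # (45, 2)
```

Both calls reproduce the two cases described in the text (a refresh gives n = 1, going through page B gives n = 2).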
In one embodiment, the relevance coefficient μ takes the value 0.5.
The above embodiment takes into account the effect of the interval between the user's two successive visits. Because operation information closer to the second visit is more likely to be tracked and to appear on the second-visit page, an access interval coefficient is designed for the two visits; it considers both the interval time difference and the step number difference, and so quantifies how the relevance depends on the interval.
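The decision step can be sketched as follows, with heavy caveats. The explicit access interval coefficient is not recoverable from this text, so the sketch takes it as a caller-supplied function; `example_coefficient` is a purely hypothetical stand-in that merely decays exponentially in both t and n, as the text says the real coefficient does. Combining the coefficient and the correlation difference by multiplication is also an assumption, as are the `time_scale` parameter and the threshold used in the example.

```python
import math


def relevance_coefficient(gamma1, gamma2, t, n, interval_coefficient):
    """mu from the correlation difference and the access interval coefficient.

    Assumption: the two are combined by multiplication.
    """
    return interval_coefficient(t, n) * (gamma2 - gamma1)


def is_tracking(mu, threshold):
    """Flag the site when mu exceeds the chosen threshold."""
    return mu > threshold


def example_coefficient(t, n, time_scale=3600.0):
    """Hypothetical stand-in, NOT the patent's formula.

    Decays exponentially as the time difference t (seconds) and the
    step difference n (n >= 1) grow.
    """
    return math.exp(-t / time_scale) * math.exp(-(n - 1))


mu = relevance_coefficient(0.1, 0.9, t=60, n=1, interval_coefficient=example_coefficient)
print(mu, is_tracking(mu, threshold=0.5))
```

Passing the coefficient as a parameter keeps the sketch usable once the actual expression is known.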
Fig. 3 is a flowchart illustrating a method for obtaining page element information of a visited site in an embodiment of the present disclosure. As shown in fig. 3, the method includes: S302, acquiring page content information of the visited site; S304, performing keyword extraction and word segmentation on the page content information; S306, calculating the word frequency of each keyword in the page content information; and S308, determining the page element information according to the total number of web pages, the number of web pages containing the keyword, and the word frequency of the keyword.
In one embodiment, keyword extraction and word segmentation may be performed on the text information with a text analysis tool. When extracting keywords, attention may be limited to words with substantive meaning, such as nouns, verbs and adjectives, while words carrying little information, such as prepositions, numerals and quantifiers, are ignored and deleted. The image clicked by the user and all pictures on the page can be recognized using techniques such as image recognition and machine learning algorithms; the information in the pictures is converted into text information, and keyword extraction and word segmentation are then performed in the same way as for text information.
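The part-of-speech filtering described above might look like the sketch below. It assumes the tokens have already been segmented and tagged by some text analysis tool; the tag names and the function name `keep_content_words` are illustrative choices, not something the text specifies.

```python
# Tags treated as carrying substantive meaning (illustrative tag names).
CONTENT_TAGS = {"NOUN", "VERB", "ADJ"}


def keep_content_words(tagged_tokens):
    """Keep nouns, verbs and adjectives; drop prepositions, numerals and the like.

    tagged_tokens: list of (word, tag) pairs produced by an upstream tagger.
    """
    return [word for word, tag in tagged_tokens if tag in CONTENT_TAGS]


tokens = [("buy", "VERB"), ("two", "NUM"), ("blue", "ADJ"),
          ("jeans", "NOUN"), ("on", "ADP"), ("sale", "NOUN")]
print(keep_content_words(tokens))
```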
In the embodiment, the keyword extraction and the word segmentation processing are respectively performed on the text information and the picture information in the page content information, so that the timeliness of the access tracking detection is improved.
In one embodiment, for a keyword $t_i$ in the page content information, the word frequency of $t_i$ may be expressed as:
$$TF_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}} \tag{6}$$
where $n_{i,j}$ is the number of occurrences of the keyword $t_i$ in the page content information, and $\sum_k n_{k,j}$ is the sum of the numbers of occurrences of all keywords in the page content information.
It should be noted that the inverse document frequency (IDF) measures the general importance of a word, that is, its importance to the entire corpus. The IDF of a specific keyword in the page content information is obtained by dividing the total number of detected web pages by the number of web pages containing the keyword and taking the logarithm of the quotient:
$$IDF_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|} \tag{7}$$
where $|D|$ is the total number of detected web pages and the denominator $|\{j : t_i \in d_j\}|$ is the number of web pages containing the keyword $t_i$. If the word is not in the corpus, the denominator is zero, so $1 + |\{j : t_i \in d_j\}|$ is generally used instead.
For the keyword $t_i$ in the page element information $S_j$, the importance (TF-IDF) is calculated as follows:
$$S_j = TF_{i,j} \times IDF_i \tag{8}$$
This yields the page element information S1 of the first visit and the page element information S2 of the second visit.
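Formulas (6) to (8) can be sketched in a few lines of Python. The sketch uses the smoothed denominator mentioned in the text (one plus the number of pages containing the keyword) and assumes a natural logarithm, since the base is not stated; the function names and the sample corpus are hypothetical.

```python
import math
from collections import Counter


def term_frequencies(keywords):
    """Formula (6): occurrences of each keyword divided by all keyword occurrences."""
    counts = Counter(keywords)
    total = sum(counts.values())
    return {keyword: count / total for keyword, count in counts.items()}


def inverse_document_frequency(keyword, pages):
    """Formula (7) with the 1 + count denominator; requires a non-empty corpus."""
    containing = sum(1 for page in pages if keyword in page)
    return math.log(len(pages) / (1 + containing))


def page_element_info(page_keywords, pages):
    """Formula (8): TF-IDF weight for every keyword on one page."""
    tf = term_frequencies(page_keywords)
    return {k: tf[k] * inverse_document_frequency(k, pages) for k in tf}


# Hypothetical corpus of detected pages, each given as a list of keywords.
corpus = [["jeans", "sale"], ["sale"], ["news"], ["sale", "news"]]
weights = page_element_info(["jeans", "jeans", "sale", "price"], corpus)
print(weights)
```

With the smoothed denominator, a keyword that appears on almost every page (here "sale") gets a weight of zero or below, so only distinctive keywords contribute to the page element information.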
In the embodiment, the keyword extraction and word segmentation processing can be rapidly carried out on the page content information of the obtained access site, the page element information is determined according to the total number of the pages, the number of the pages containing the keywords and the word frequency of the keywords, and the operation efficiency and accuracy are improved.
Fig. 4 is a flowchart illustrating a method for acquiring user operation element information in an embodiment of the present disclosure. As shown in fig. 4, the method includes: S402, acquiring user operation information; S404, performing keyword extraction and word segmentation on the user operation information; S406, calculating the word frequency of each keyword in the user operation information; and S408, determining the user operation element information according to the total number of web pages, the number of web pages containing the keyword, and the word frequency of the keyword.
The user operation information includes browser-side information and operating-system-side information. Keyword extraction and word segmentation are performed on the character information in the user operation information with a text analysis tool. When extracting keywords, attention may be limited to words with substantive meaning, such as nouns, verbs and adjectives, while words carrying little information, such as prepositions, numerals and quantifiers, are ignored and deleted. Picture information in the user operation information is recognized using techniques such as image recognition and machine learning algorithms and converted into text information, and keyword extraction and word segmentation are then performed in the same way as for text information.
In this embodiment, keyword extraction and word segmentation can be performed quickly on the user operation information, and the user operation element information is determined according to the total number of web pages, the number of web pages containing the keyword, and the word frequency of the keyword, which improves operating efficiency and accuracy.
Based on the same inventive concept, an access tracking detection apparatus is also provided in the embodiments of the present disclosure, as in the following embodiments. Because the principle of the embodiment of the apparatus for solving the problem is similar to that of the embodiment of the method, the embodiment of the apparatus can be implemented by referring to the implementation of the embodiment of the method, and repeated details are not described again.
Fig. 5 is a schematic diagram of an access tracking detection apparatus according to an embodiment of the disclosure, and as shown in fig. 5, the access tracking detection apparatus 5 includes: a first site acquisition module 501, a second site acquisition module 502, a difference information determination module 503, a correlation difference calculation module 504 and a detection result determination module 505; the first site acquisition module 501, the second site acquisition module 502, the difference information determination module 503, the correlation difference calculation module 504, and the detection result determination module 505 are connected in sequence.
A first site obtaining module 501, configured to obtain page element information of a first visited site;
a second site obtaining module 502, configured to obtain page element information of a second site visited;
a difference information determining module 503, configured to determine page element information difference information of a first-time visited site and a second-time visited site;
a correlation difference calculation module 504, which calculates a correlation difference according to the page element information difference information and the user operation element information;
and a detection result determining module 505, which determines the access tracking detection result according to the correlation difference.
In the embodiment, by analyzing the relevance between the page element information and the user operation element information of the sites visited by the user twice, the time difference of the two operations is considered while eliminating the unchanged information of the two operations, and whether the sites collect the user information by using a tracking technology is comprehensively judged; even if the tracking technology is continuously updated, the website can be detected by the method as long as the website recommends the advertisements related to the user, so that the problem that the prior knowledge of the tracking technology needs to be continuously updated in the prior art is solved, and the method is also favorable for discovering a novel tracking technology by combining with artificial code analysis.
Fig. 6 is a flowchart illustrating another access tracking detection method in an embodiment of the present disclosure, and as shown in fig. 6, the access tracking detection method provided in the embodiment of the present disclosure includes the following steps:
S602, collecting page content information and user operation information;
In one embodiment, when a user accesses a Web site for the first time, page content information is obtained through a browser extension, and user operation information is collected through the browser during the interval before the next visit; when the user accesses the Web site for the second time, the page content information is obtained through the browser extension again.
It should be noted that obtaining the page content information includes, but is not limited to, obtaining the HTML source code of the page, which may be obtained through a JavaScript API. The web page of the Web site can be loaded completely by simulating scroll-wheel operation through JavaScript (window.scrollTo); this avoids the situation in which the complete HTML source code cannot be obtained when the user has just opened the page because some Web sites use dynamic loading.
It should be noted that the user operation information includes, but is not limited to, browser side information and operating system side information, which are taken as examples in the embodiments of the present disclosure. The browser side information includes, but is not limited to, input search content, text description information corresponding to clicked links, and link URLs corresponding to clicked pictures.
Fig. 7 is a schematic diagram illustrating a method for obtaining search content in the embodiment of the present disclosure. As shown in fig. 7, the input search content is obtained by adding a listener that dynamically monitors real-time changes of the input tag.
Fig. 8 is a schematic diagram illustrating obtaining text information in the embodiment of the present disclosure. As shown in fig. 8, the text description information corresponding to a clicked link and the link URL corresponding to a clicked picture are obtained by monitoring the user's click behavior and obtaining the link of the clicked object and its contextual text description. Because the clicked object usually corresponds to a tag, only the useful attributes of the tag and the text under the tag are obtained. The useful attributes include src, alt and title; the text under the tag is obtained through innerText.
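As an offline illustration of keeping only the src, alt and title attributes and the text under a clicked tag, the following minimal Python sketch parses an HTML fragment with the standard-library html.parser module. The embodiment itself works inside the browser; the class name ClickedElementParser, the function describe_clicked and the sample fragment are assumptions made for illustration only.

```python
from html.parser import HTMLParser

# Attributes named as useful in the description.
USEFUL_ATTRS = ("src", "alt", "title")


class ClickedElementParser(HTMLParser):
    """Collects src/alt/title attributes and visible text from an HTML fragment."""

    def __init__(self) -> None:
        super().__init__()
        self.found_attrs = {}   # attribute name -> first value seen
        self.text_parts = []    # non-empty text nodes, in document order

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in USEFUL_ATTRS and value:
                self.found_attrs.setdefault(name, value)

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.text_parts.append(stripped)


def describe_clicked(fragment: str) -> dict:
    """Return the useful attributes and the text of a clicked element (hypothetical helper)."""
    parser = ClickedElementParser()
    parser.feed(fragment)
    parser.close()
    return {"attrs": parser.found_attrs, "text": " ".join(parser.text_parts)}


# Hypothetical clicked element: a link wrapping a picture and a caption.
info = describe_clicked(
    '<a href="/item/1" title="Red camera"><img src="/c.png" alt="camera">Buy now</a>'
)
```

The href attribute is ignored here because only src, alt and title are listed as useful attributes; a real collector would record the link URL separately.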
The operating system side information includes, but is not limited to, clipboard information. Fig. 9 shows a schematic diagram of reading clipboard information in the embodiment of the present disclosure. As shown in fig. 9, for security and privacy reasons, data in the clipboard cannot be read directly in a browser and can only be read when a paste operation is initiated; the paste event is triggered whenever Ctrl+V is pressed in the page, even if the page contains no input box.
S604, extracting the page content information and the user operation information, and generating page element information and user operation element information.
In one embodiment, the page content information is the sum of the text information T and the picture information P.
In one embodiment, extracting the text information T includes: performing keyword extraction and word segmentation on the text information by a text analysis tool. When extracting keywords, only words with substantive meaning, such as nouns, verbs and adjectives, are considered; unimportant words such as prepositions, numerals and quantifiers are ignored and deleted.
In one embodiment, extracting the picture information P includes: recognizing the picture clicked by the user and all pictures on the page using techniques such as image recognition algorithms and machine learning algorithms, converting the information in the pictures into text information, and then performing keyword extraction and word segmentation in the same way as for the text information T to obtain the picture information P.
In one embodiment, keyword extraction and word segmentation are performed on the text in the user operation information by a text analysis tool, with the same rule for substantive and unimportant words. Picture information in the user operation information is recognized using techniques such as image recognition algorithms and machine learning algorithms and converted into text information, after which keyword extraction and word segmentation are performed in the same way as for text information.
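The part-of-speech rule for keyword extraction can be sketched as follows. This is a minimal Python illustration that assumes the text analysis tool has already produced (token, part-of-speech) pairs; the tag names NOUN, VERB, ADJ, ADP and NUM, the function extract_keywords and the sample tokens are assumptions, not part of the disclosure.

```python
from typing import Iterable, List, Tuple

# Parts of speech treated as carrying substantive meaning (assumed tag names).
CONTENT_POS = {"NOUN", "VERB", "ADJ"}


def extract_keywords(tagged: Iterable[Tuple[str, str]]) -> List[str]:
    """Keep nouns, verbs and adjectives; drop prepositions, numerals and the like."""
    return [token.lower() for token, pos in tagged if pos in CONTENT_POS]


# Hypothetical output of a segmentation and tagging tool for "cheap cameras under 200 dollars".
tagged_tokens = [
    ("cheap", "ADJ"),
    ("cameras", "NOUN"),
    ("under", "ADP"),   # preposition: ignored
    ("200", "NUM"),     # numeral: ignored
    ("dollars", "NOUN"),
]
keywords = extract_keywords(tagged_tokens)
```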
In one embodiment, when a user accesses Web site A for the first time, the page content information and the user operation information of Web site A are recorded; when the user accesses Web site A for the second time, the page content information is recorded again. The page element information $S_j$ is calculated as follows.
For a keyword $t_i$ in the page content information, the word frequency of $t_i$ may be expressed as:
$$TF_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$
where $n_{i,j}$ is the number of occurrences of the keyword $t_i$ in the page content information, and the denominator is the sum of the numbers of occurrences of all keywords in the page content information.
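The word frequency defined above can be computed directly from a list of extracted keywords. The following is a minimal Python sketch; the function name term_frequency and the sample keywords are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List


def term_frequency(keywords: List[str]) -> Dict[str, float]:
    """Occurrences of each keyword divided by the total occurrences of all keywords."""
    counts = Counter(keywords)
    total = sum(counts.values())
    if total == 0:
        return {}  # no keywords were extracted from the page
    return {term: n / total for term, n in counts.items()}


# Hypothetical keywords extracted from one page.
tf = term_frequency(["camera", "lens", "camera", "tripod"])
```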
It should be noted that the inverse document frequency (IDF) measures the general importance of a word, that is, its importance to the entire corpus. The IDF of a specific keyword in the page content information is obtained by dividing the total number of detected web pages by the number of web pages containing the keyword and taking the logarithm of the quotient:
$$IDF_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|}$$
where $|D|$ is the total number of detected web pages, and the denominator $|\{j : t_i \in d_j\}|$ is the number of web pages containing the keyword $t_i$. If the keyword does not appear in the corpus, the denominator is zero, so $1 + |\{j : t_i \in d_j\}|$ is generally used as the denominator instead.
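A minimal Python sketch of the inverse document frequency with the smoothed denominator follows. The base of the logarithm is not stated in the description, so the natural logarithm is an assumption here, as are the function name inverse_document_frequency and the sample pages.

```python
import math
from typing import List, Set


def inverse_document_frequency(term: str, pages: List[Set[str]]) -> float:
    """log(|D| / (1 + number of pages containing the term)), natural log assumed."""
    containing = sum(1 for page in pages if term in page)
    return math.log(len(pages) / (1 + containing))


# Hypothetical corpus of four detected pages, each reduced to its keyword set.
pages = [{"camera", "lens"}, {"news"}, {"weather"}, {"sports", "news"}]
idf_camera = inverse_document_frequency("camera", pages)   # log(4 / 2)
idf_unseen = inverse_document_frequency("tripod", pages)   # log(4 / 1), no division by zero
```

With this smoothing, a keyword that appears on every page receives a slightly negative value, which is a known side effect of adding 1 to the denominator.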
For the keyword $t_i$ in the page element information $S_j$, its importance is calculated by TF-IDF as follows:
$$S_j = TF_{i,j} \times IDF_i \quad (8)$$
The page element information $S_1$ and the page element information $S_2$ are obtained by this calculation, and the user operation element information U is obtained in the same way.
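As a worked illustration with assumed numbers (not taken from the disclosure), suppose a keyword $t_i$ occurs 3 times among 100 keyword occurrences on the page, 1000 web pages have been detected, 9 of them contain $t_i$, the denominator with the added 1 is used, and the logarithm is taken as the natural logarithm:

$$TF_{i,j} = \frac{3}{100} = 0.03, \qquad IDF_i = \ln\frac{1000}{1 + 9} = \ln 100 \approx 4.605, \qquad TF_{i,j} \times IDF_i \approx 0.138$$

A keyword that is frequent on the page but rare across the detected pages therefore receives a comparatively large weight.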
And S606, performing Web tracking detection judgment according to the page element information and the user operation element information.
In one embodiment, whether the Web site tracks the user is judged from the correlation between the difference in page element information of the two visits and the user operation element information.
In one embodiment, the page element information difference values $\Delta_1$ and $\Delta_2$ are calculated from the page element information $S_1$ and $S_2$ of the two visits, and the correlation between the user operation element information U and each page element information difference value $\Delta$ is calculated. The TF-IDF algorithm is used to calculate the word frequency and importance of the keywords, and cosine similarity is used to calculate the correlation.
$$\Delta_1 = S_1 - (S_1 \cap S_2)$$
$$\Delta_2 = S_2 - (S_1 \cap S_2)$$
The correlation $\gamma_1$ between the user operation element information U and the page element information difference value $\Delta_1$, and the correlation $\gamma_2$ between U and the page element information difference value $\Delta_2$, are calculated respectively by the following formula:
$$\gamma_j = \frac{\sum_i U_i \, \Delta_{j,i}}{\sqrt{\sum_i U_i^2} \, \sqrt{\sum_i \Delta_{j,i}^2}}, \quad j = 1, 2$$
where $U_i$ and $\Delta_{j,i}$ denote the components of the vectors U and $\Delta_j$, respectively. When the numbers of components of U and $\Delta_j$ are inconsistent, they are made consistent by a repeated filling method.
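The cosine similarity with repeated filling can be sketched in Python as below. The description does not spell out the filling rule, so this sketch assumes that the shorter vector is extended by cycling through its own components; the function names repeat_pad and correlation and the sample vectors are likewise assumptions.

```python
import math
from itertools import cycle, islice
from typing import List, Sequence


def repeat_pad(vec: Sequence[float], length: int) -> List[float]:
    """Extend vec to `length` by cycling through its own components (assumed reading)."""
    if not vec:
        return [0.0] * length
    return list(islice(cycle(vec), length))


def correlation(u: Sequence[float], delta: Sequence[float]) -> float:
    """Cosine similarity between U and a difference vector after repeated filling."""
    n = max(len(u), len(delta))
    u_p, d_p = repeat_pad(u, n), repeat_pad(delta, n)
    dot = sum(a * b for a, b in zip(u_p, d_p))
    norm = math.sqrt(sum(a * a for a in u_p)) * math.sqrt(sum(b * b for b in d_p))
    return dot / norm if norm else 0.0  # zero vectors are treated as uncorrelated


# Hypothetical weights: delta has two components and is filled to match U's four.
gamma = correlation([1.0, 0.0, 1.0, 0.0], [1.0, 0.0])
```

In this example the filled difference vector equals U, so the correlation is approximately 1.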
Here, $\Delta_1$ denotes the information set obtained from the first page element information $S_1$ by removing the page information common to the two visits to Web site A, and $\gamma_1$ is the correlation between this set and the user operation element information U; $\Delta_2$ denotes the information set obtained from the second page element information $S_2$ by removing the page information common to the two visits to Web site A, and $\gamma_2$ is the correlation between this set and the user operation element information U.
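Removing the information common to the two visits can be sketched as follows. This minimal Python illustration assumes that page element information is held as a mapping from keyword to weight and that "common" means the keyword appears in both visits regardless of weight; the function name page_differences and the sample data are assumptions.

```python
from typing import Dict, Tuple

Weights = Dict[str, float]  # keyword -> weight (assumed representation)


def page_differences(s1: Weights, s2: Weights) -> Tuple[Weights, Weights]:
    """Return the two difference sets: each visit with the shared keywords removed."""
    common = s1.keys() & s2.keys()
    delta1 = {k: w for k, w in s1.items() if k not in common}
    delta2 = {k: w for k, w in s2.items() if k not in common}
    return delta1, delta2


# Hypothetical visits: "news" is unchanged, "camera" appears only on the second visit.
first_visit = {"news": 0.4, "weather": 0.2}
second_visit = {"news": 0.4, "camera": 0.5}
delta1, delta2 = page_differences(first_visit, second_visit)
```

Content that is new on the second visit, such as an advertisement related to what the user did in between, ends up in the second difference set.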
The relevance coefficient μ is obtained by multiplying the difference between the correlations of the two page element information difference values by the access interval coefficient of the two visits:
[Equation image BDA0003557040400000134: relevance coefficient μ]
where $\gamma_2 - \gamma_1$ is the difference between the correlations of the two page element information difference values, and
[Equation image BDA0003557040400000135: access interval coefficient]
is the access interval coefficient of the two visits. The access interval coefficient function decreases exponentially as the parameter t increases and also decreases exponentially as the integer n increases; that is, the longer the interval between the two visits or the more steps between them, the lower the correlation. Figs. 10 to 12 show images of the access interval coefficient function: fig. 10 shows the function with t equal to 2, fig. 11 shows the function with t equal to 1, and fig. 12 shows the function with t equal to 0.2 in the embodiment of the present disclosure.
Here t is the time difference, in seconds, between the two visits to Web site A. With $T_1$ the access time of the first visit to Web site A and $T_2$ the access time of the second visit to Web site A, the time difference t is calculated as
$$t = T_2 - T_1 \quad (5)$$
Here n is the step difference between the two visits to Web site A, and n is a positive integer greater than or equal to 1. The step difference n can be calculated from the time-ordered page access record.
Fig. 13 shows a schematic diagram of a step difference in the embodiment of the present disclosure. As shown in fig. 13: S1302, the user opens page A in a first step; S1304, the user then refreshes page A, in which case n is 1. Fig. 14 shows a schematic diagram of another step difference in the embodiment of the present disclosure. As shown in fig. 14: S1402, the user opens page A in a first step; S1404, the user then clicks a link to access page B; S1406, the user then clicks the return link to go back to page A, in which case n is 2. The step difference n is calculated in the same manner for other cases.
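The two cases in figs. 13 and 14 can be reproduced from a time-ordered access record. The following minimal Python sketch assumes that each page load, including a refresh, adds one entry to the record and that the two most recent visits to the page are compared; the function name step_difference and the sample records are assumptions.

```python
from typing import List, Optional


def step_difference(history: List[str], url: str) -> Optional[int]:
    """Number of steps between the last two visits to `url` in a time-ordered record."""
    positions = [i for i, visited in enumerate(history) if visited == url]
    if len(positions) < 2:
        return None  # the page has not been visited twice yet
    return positions[-1] - positions[-2]


# Fig. 13: open page A, then refresh it.
n_refresh = step_difference(["A", "A"], "A")
# Fig. 14: open page A, follow a link to page B, then return to page A.
n_return = step_difference(["A", "B", "A"], "A")
```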
When the relevance coefficient μ exceeds a certain threshold, it can be determined that there is a situation where the Web site a tracks the user.
In one embodiment, the threshold for the relevance coefficient μ is 0.5.
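The final judgment can be illustrated with a short Python sketch. The exact form of the access interval coefficient is given only as an equation image in the publication, so it is passed in here as an already computed number; reading 0.5 as the threshold is also an assumption, as are the function names and the sample values.

```python
def relevance_coefficient(gamma1: float, gamma2: float, interval_coefficient: float) -> float:
    """Correlation difference (gamma2 - gamma1) scaled by the access interval coefficient."""
    return (gamma2 - gamma1) * interval_coefficient


def is_tracking(mu: float, threshold: float = 0.5) -> bool:
    """Judge that the site tracks the user when mu exceeds the threshold (0.5 assumed)."""
    return mu > threshold


# Hypothetical values: the second visit correlates far more strongly with the user's
# operations than the first, and the two visits were close together.
mu = relevance_coefficient(gamma1=0.1, gamma2=0.9, interval_coefficient=0.8)
tracked = is_tracking(mu)
```

With these assumed values μ is about 0.64, which exceeds 0.5, so the site would be judged to be tracking the user; a small interval coefficient (a long gap or many steps) would pull the same correlation difference below the threshold.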
In the embodiment, the relevance between the page element information and the user operation element information of the sites visited by the user twice before and after is analyzed, the invariant information of the two operations is removed, the relevance is quantified by considering the time difference and the step difference of the two operations, and whether the sites collect the user information by using a tracking technology is comprehensively judged; even if the tracking technology is continuously updated, the website can be detected by the method as long as the website recommends the advertisements related to the user, so that the problem that the prior knowledge of the tracking technology needs to be continuously updated in the prior art is solved, and the method is also favorable for discovering a novel tracking technology by combining with artificial code analysis.
Fig. 15 illustrates an access tracking automated detection device in an embodiment of the disclosure. As shown in fig. 15, the access tracking automated detection device includes a browser 1501, a browser extension 1502, a data storage analysis module 1503, a Web site 1504, and a terminal peripheral 1505. A task manager is responsible for controlling the number of concurrent processes and for distributing tasks to each process, that is, to each browser 1501 driven by an automated testing tool. Tasks include, but are not limited to, specifying the URL of the Web site 1504 to visit and configuring the browser 1501. Each process is responsible for configuring and starting a browser 1501, installing the browser extension 1502 and simulating user operations, and works with a multi-process automation script to automate the procedure.
For the set S of URLs of Web sites 1504 to be detected, the steps include:
(1) The browser 1501 selects a URL from the URL set S and accesses the Web site 1504; a mouse click is simulated on a blank area of the page to dismiss any floating login window, scroll-wheel operation is simulated to scroll to the bottom of the page so that the page loads completely, and the page source code is recorded.
(2) The search box in the home page, that is, the <input> tag, is extracted; the user's input of a search term is simulated, followed by pressing the Enter key; 3 picture links in the resulting page are randomly selected and clicked; and the content of the search term is recorded. The search terms may be of different types and may be extended.
(3) The text links and picture links in the home page are extracted, each kind is clicked randomly 3 times, and the content related to the clicked links is recorded.
(4) All windows are closed. If the URL set S still contains URLs of Web sites 1504 to be detected, step (1) is repeated; otherwise, step (5) is performed.
(5) From the Web site 1504 data recorded in the above steps, the simulated user operation element information and the page element information are obtained.
The data storage analysis module 1503 stores and loads the tracking detection method, receives peripheral operation information transmitted through the terminal peripheral 1505, and receives the user operation element information and the page element information, to obtain the set of Web sites 1504 that are tracking users.
In the above embodiment, because the data storage analysis module 1503 stores and loads the tracking detection method, receives the peripheral operation information transmitted through the terminal peripheral 1505, and receives the user operation element information and the page element information, the set of Web sites 1504 that track users can be determined more effectively, which improves the efficiency of access tracking detection.
As will be appreciated by one skilled in the art, aspects of the present disclosure may be embodied as a system, method or program product. Accordingly, various aspects of the present disclosure may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, all of which may generally be referred to herein as a "circuit," "module," or "system."
An electronic device 1600 according to such an embodiment of the disclosure is described below with reference to fig. 16. The electronic device 1600 shown in fig. 16 is only an example and should not bring any limitations to the functionality and scope of use of the embodiments of the present disclosure.
As shown in fig. 16, electronic device 1600 is in the form of a general purpose computing device. Components of electronic device 1600 may include, but are not limited to: the at least one processing unit 1610, the at least one memory unit 1620, and a bus 1630 that couples various system components including the memory unit 1620 and the processing unit 1610.
The memory unit stores program code that may be executed by the processing unit 1610 to cause the processing unit 1610 to perform steps according to various exemplary embodiments of the present disclosure described in the above section "exemplary method" of this specification. For example, the processing unit 1610 may perform the following steps of the above method embodiments: acquiring page element information of a site visited for the first time; acquiring page element information of the site visited for the second time; determining page element information difference information of the first-time visited site and the second-time visited site; calculating a correlation difference value according to the page element information difference information and the user operation element information; and determining an access tracking detection result according to the correlation difference.
For example, the processing unit 1610 may perform the following steps of the above method embodiment: acquiring the access time of the first visit to the site; acquiring the access time of the second visit to the site; determining the interval time difference and the step number difference between the first visit and the second visit; and determining the access interval coefficient according to the interval time difference and the step number difference.
For example, the processing unit 1610 may perform the following steps of the above method embodiment: acquiring user operation information; extracting keywords and performing word segmentation processing on the user operation information; calculating the word frequency of each keyword in the user operation information; and determining the user operation element information according to the total number of the web pages, the number of the web pages containing the keywords and the word frequency of the keywords.
For example, the processing unit 1610 may perform the following steps of the above method embodiments: acquiring page content information of an access site; extracting keywords and performing word segmentation processing on the page content information; calculating the word frequency of each keyword in the page content information; and determining the page element information according to the total number of the web pages, the number of the web pages containing the keywords and the word frequency of the keywords.
The memory unit 1620 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM)16201 and/or a cache memory unit 16202, and may further include a read only memory unit (ROM) 16203.
The storage unit 1620 may also include a program/utility 16204 having a set (at least one) of program modules 16205, such program modules 16205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which or some combination thereof may comprise an implementation of a network environment.
Bus 1630 may be any of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 1600 can also communicate with one or more external devices 1640 (e.g., keyboard, pointing device, bluetooth device, etc.), and also with one or more devices that enable a user to interact with the electronic device 1600, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 1600 to communicate with one or more other computing devices. Such communication may occur through input/output (I/O) interface 1650. Also, the electronic device 1600 can communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the internet) via the network adapter 1660. As shown, the network adapter 1660 communicates with the other modules of the electronic device 1600 over the bus 1630. It should be appreciated that although not shown, other hardware and/or software modules may be used in conjunction with electronic device 1600, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
In an exemplary embodiment of the present disclosure, there is also provided a computer-readable storage medium, which may be a readable signal medium or a readable storage medium. On which a program product capable of implementing the above-described method of the present disclosure is stored. In some possible embodiments, various aspects of the disclosure may also be implemented in the form of a program product comprising program code for causing a terminal device to perform the steps according to various exemplary embodiments of the disclosure described in the "exemplary methods" section above of this specification, when the program product is run on the terminal device.
More specific examples of the computer-readable storage medium in the present disclosure may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In the present disclosure, a computer readable storage medium may include a propagated data signal with readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Alternatively, program code embodied on a computer readable storage medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
In particular implementations, program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++, or the like, as well as conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
It should be noted that although in the above detailed description several modules or units of the device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit, according to embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into embodiments by a plurality of modules or units.
Moreover, although the steps of the methods of the present disclosure are depicted in the drawings in a particular order, this does not require or imply that the steps must be performed in this particular order, or that all of the depicted steps must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions, etc.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a mobile terminal, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

Claims (11)

1. An access tracking detection method, comprising:
acquiring page element information of a site visited for the first time;
acquiring page element information of the site visited for the second time;
determining page element information difference information of the first-time visited site and the second-time visited site;
calculating a correlation difference value according to the page element information difference information and the user operation element information;
and determining an access tracking detection result according to the correlation difference.
2. The access tracking detection method of claim 1, further comprising:
and determining the access tracking detection result according to the correlation difference and the access interval coefficient.
3. The access tracking detection method of claim 2, further comprising:
acquiring the access time of the first visit to the site;
acquiring the access time of the second visit to the site;
determining the interval time difference and the step number difference between the first visit and the second visit to the site;
and determining the access interval coefficient according to the interval time difference and the step number difference.
4. The access tracking detection method of claim 1, further comprising:
acquiring page content information of an access site;
extracting keywords and performing word segmentation processing on the page content information;
calculating the word frequency of each keyword in the page content information;
and determining the page element information according to the total number of the web pages, the number of the web pages containing the keywords and the word frequency of the keywords.
5. The access tracking detection method of claim 4, wherein the page content information includes text information and picture information;
the process of extracting the keywords and segmenting the words of the page content information comprises the following steps:
extracting keywords and performing word segmentation on the character information of the text information;
and converting the picture information into character information, and then extracting keywords and performing word segmentation processing.
6. The access tracking detection method of claim 1, further comprising:
acquiring user operation information;
extracting keywords and performing word segmentation processing on the user operation information;
calculating the word frequency of each keyword in the user operation information;
and determining the user operation element information according to the total number of the web pages, the number of the web pages containing the keywords and the word frequency of the keywords.
7. The access tracking detection method of claim 6, wherein the user operation information includes browser side information and operating system side information.
8. The access tracking detection method of claim 1, further comprising:
and when the page element information difference information is inconsistent with the component number of the user operation element information, adjusting the page element information difference information to be consistent with the component number of the user operation element information based on a repeated filling method.
9. An access tracking detection device, comprising:
a first site acquisition module, configured to acquire page element information of a site visited for the first time;
a second site acquisition module, configured to acquire page element information of a site visited for the second time;
a difference information determining module, configured to determine page element information difference information between the first-time visited site and the second-time visited site;
a correlation difference value calculation module, configured to calculate a correlation difference value according to the page element information difference information and user operation element information; and
a detection result determining module, configured to determine an access tracking detection result according to the correlation difference value.
10. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the access tracking detection method of any one of claims 1 to 8 via execution of the executable instructions.
11. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, carries out the access tracking detection method according to any one of claims 1 to 8.
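As a rough illustration of claims 6 and 8 only, the following Python sketch shows one way the keyword weighting and the repeated filling step could be realized. Claim 6 names three inputs (total number of web pages, number of web pages containing a keyword, and keyword word frequency), which matches a TF-IDF style weight; the smoothed inverse-document-frequency formula used here is an assumption, as are the function names `keyword_weights` and `repeat_fill`, the truncation behavior when the input is longer than the target, and the sample data. None of these details are stated in the claims.

```python
import math
from collections import Counter
from itertools import cycle, islice


def keyword_weights(tokens, pages):
    """Weight each keyword by word frequency and page rarity (TF-IDF style).

    tokens: segmented keywords from the user operation information.
    pages:  one set of keywords per web page (hypothetical corpus).
    Assumed formula: tf * (log((1 + N) / (1 + df)) + 1), where N is the
    total number of pages and df the number of pages containing the keyword.
    """
    if not tokens:
        return {}
    total_pages = len(pages)
    counts = Counter(tokens)
    weights = {}
    for word, count in counts.items():
        tf = count / len(tokens)                      # word frequency
        df = sum(1 for page in pages if word in page)  # pages containing word
        idf = math.log((1 + total_pages) / (1 + df)) + 1
        weights[word] = tf * idf
    return weights


def repeat_fill(values, target_len):
    """Repeat the components of `values` cyclically up to `target_len`.

    Assumption: if `values` is already longer than `target_len`, it is
    truncated; the claims do not say how that case is handled.
    """
    if target_len < 0:
        raise ValueError("target_len must be non-negative")
    if not values and target_len > 0:
        raise ValueError("cannot fill from an empty sequence")
    return list(islice(cycle(values), target_len))


# Hypothetical sample data for illustration.
pages = [{"click", "login"}, {"click"}, {"scroll"}]
operation_tokens = ["click", "click", "login", "scroll"]

user_vector = list(keyword_weights(operation_tokens, pages).values())
difference_vector = [0.4, 0.1]  # stand-in for page element difference info

# Align the difference vector with the number of user operation components.
aligned = repeat_fill(difference_vector, len(user_vector))
print(aligned)  # [0.4, 0.1, 0.4]
```

Cyclic repetition keeps every original component and adds no new values, which is one plausible reading of "repeated filling"; other padding schemes would fit the wording equally well.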
CN202210281302.5A 2022-03-21 2022-03-21 Access tracking detection method and device, readable storage medium and electronic equipment Pending CN114662145A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210281302.5A CN114662145A (en) 2022-03-21 2022-03-21 Access tracking detection method and device, readable storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210281302.5A CN114662145A (en) 2022-03-21 2022-03-21 Access tracking detection method and device, readable storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN114662145A (en) 2022-06-24

Family

ID=82030588

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210281302.5A Pending CN114662145A (en) 2022-03-21 2022-03-21 Access tracking detection method and device, readable storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN114662145A (en)

Similar Documents

Publication Publication Date Title
US8103599B2 (en) Calculating web page importance based on web behavior model
US8639687B2 (en) User-customized content providing device, method and recorded medium
US20090006995A1 (en) Associating Website Clicks With Links On A Web Page
CN107483443B (en) Advertisement information processing method, client, storage medium and electronic device
WO2015114753A1 (en) Analysis device and analysis method
CN106339380A (en) Method and device for recommending frequently asked question information
US20120166428A1 (en) Method and system for improving quality of web content
CN105718533A (en) Information pushing method and device
CN105243058A (en) Webpage content translation method and electronic apparatus
US20110029559A1 (en) Method, apparatus, and program for extracting relativity of web pages
Ghasemisharif et al. Speedreader: Reader mode made fast and private
WO2021098242A1 (en) Page processing method and apparatus, electronic device and computer readable medium
Van Nortwick et al. Setting the Bar Low: Are Websites Complying With the Minimum Requirements of the CCPA?
CN116015842A (en) Network attack detection method based on user access behaviors
US11657228B2 (en) Recording and analyzing user interactions for collaboration and consumption
CN109002550B (en) Test method and device for reduction equipment
US9152948B2 (en) Method and system for providing a structured topic drift for a displayed set of user comments on an article
CN108171074B (en) Web tracking automatic detection method based on content association
CN111127057B (en) Multi-dimensional user portrait recovery method
US20130230248A1 (en) Ensuring validity of the bookmark reference in a collaborative bookmarking system
CN111400575A (en) User identification generation method, user identification method and device
CN112384940A (en) Mechanism for WEB crawling of electronic business resource page
CN113806667B (en) Method and system for supporting webpage classification
CN114662145A (en) Access tracking detection method and device, readable storage medium and electronic equipment
US20210264059A1 (en) System and method of detecting hacking activities during the interaction of users with banking services

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination