CN113158061A - Data processing method and device - Google Patents


Info

Publication number
CN113158061A
Authority
CN
China
Prior art keywords
information
page
browser
recommendation
dispersion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110501807.3A
Other languages
Chinese (zh)
Inventor
王云森
苏家进
谷丽芳
吴小平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN202110501807.3A
Publication of CN113158061A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/42 Syntactic analysis
    • G06F8/427 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition

Abstract

The invention discloses a data processing method and a data processing device. The data processing method comprises the following steps: 1) obtaining a first target page that contains information of at least one recommendation component used for displaying recommendation information, and obtaining the dispersion of the recommended values in the recommendation component; 2) if the proportion of the first mark attribute is higher than the expected value R, loading at least one second page without the first mark attribute; 3) recalculating the dispersion of the recommended values on a third target page, and if the dispersion is higher than the expected value R, executing step 2), otherwise exiting. The first target page and the third target page contain personalized recommendation information. The dispersion of the recommended values is obtained by capturing the content of the visible part of the recommendation component, parsing that content to obtain text-label information, counting the text labels, and determining their dispersion. The method can be used to avoid privacy disclosure based on browsing information.

Description

Data processing method and device
Technical Field
The invention belongs to the field of sensitive information protection, and particularly relates to obfuscating personalized targeted data push and avoiding targeted push.
Background
At present, collecting customer preferences has become a basic and common means by which commercial websites push information. Each operation a customer performs on a page may be recorded and analyzed: an access record is formed in the background, the user is analyzed on the basis of that record to form a user portrait, and information is pushed according to the portrait. In the big-data era, once formed, such a user portrait can be used to screen the information shown to the user.
However, although this push mode is popular with merchants, it is unpopular with users because it invades privacy and puts users in an unequal position. Information push is often built on a variety of factors, so a rough portrait of the user can be inferred from partial information even without login, or within a specific session in which a specific token or cookie is held, which is obviously disadvantageous to the user. Moreover, some service providers do not provide steps for cancelling tracking or make those steps overly cumbersome, do not follow the Do Not Track (DNT) convention, or actively obtain information beyond their scope by probing the user's information (for example, by accessing the user's stored browsing records). Technical means are therefore needed to prevent this risk of leakage.
Disclosure of Invention
In view of the risk of user-information leakage in the prior art, the invention provides a data processing method that avoids or delays the leakage of real user information by supplying information that deviates from the user portrait.
The data processing method provided by the invention comprises the following steps:
1) obtaining a first target page, wherein the first target page comprises information of at least one recommendation component used for displaying recommendation information, and obtaining the dispersion or distribution of the recommended values in the recommendation component;
2) if the proportion of the first mark attribute in the recommended values is higher than the expected value R, loading at least one second page without the first mark attribute;
3) recalculating the dispersion of the recommended values on a third target page, and if the proportion of the first mark attribute is higher than the expected value R, executing step 2), otherwise exiting;
the first target page and the third target page contain personalized recommendation information, and the recommendation information is mapped to a tag set containing the first mark attribute, wherein the first mark attribute is the attribute whose proportion is expected to be reduced;
and the dispersion of the recommended values is obtained by capturing the content of the visible part of the recommendation component, parsing the content to obtain text-label information, counting the text labels, and determining the dispersion of the text labels.
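The loop formed by steps 1) to 3) can be sketched roughly as follows. This is only a minimal illustration under stated assumptions: `fetch_tags`, `load_decoy_page` and the `max_rounds` guard are hypothetical names introduced here, and the dispersion is simplified to the proportion of tags carrying the first mark attribute.

```python
from typing import Callable, Iterable, Set


def first_attribute_ratio(tags: Iterable[str], first_attr: Set[str]) -> float:
    """Share of recommendation tags that carry the first mark attribute."""
    tags = list(tags)
    if not tags:
        return 0.0
    return sum(1 for t in tags if t in first_attr) / len(tags)


def dilute_recommendations(
    fetch_tags: Callable[[], Iterable[str]],  # hypothetical: tags read from the target page
    load_decoy_page: Callable[[], None],      # hypothetical: loads a page without the first attribute
    first_attr: Set[str],
    expected_r: float,
    max_rounds: int = 20,                     # assumption: safety bound, not part of the source
) -> float:
    """Repeat step 2) until the ratio measured in step 3) is no higher than R."""
    ratio = first_attribute_ratio(fetch_tags(), first_attr)      # step 1)
    rounds = 0
    while ratio > expected_r and rounds < max_rounds:
        load_decoy_page()                                         # step 2)
        ratio = first_attribute_ratio(fetch_tags(), first_attr)  # step 3)
        rounds += 1
    return ratio
```

The bound on the number of rounds is a design choice of this sketch: a site may never change its recommendations, and the loop should then stop rather than load pages indefinitely.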
In an embodiment of the present invention, the information of the text label includes content and position of the text label, and the position of the text label is a relative position or an absolute position in the page.
In another embodiment of the invention, the mark attribute of the text label is determined according to a rule, and the dispersion of the recommended values in the page is calculated according to the mark attribute.
The text label is mapped to at least two types of mark attributes, namely at least a first mark attribute and a second mark attribute. The first mark attribute covers content whose appearance in information recommendation the user expects to reduce, and the second mark attribute covers content that the user expects to appear in information recommendation with higher priority than the first mark attribute.
In an embodiment of the present invention, the first page is obtained through a first configuration file, window enumeration, or a selected active window;
when a page is loaded by using a first configuration file, the first configuration file at least comprises one or more of a browser type, a process module name and an address;
when the first configuration file comprises the browser type, the program can obtain the process ID of the current browser in a process searching mode, further obtain the handle of the browser and obtain the window information according to the handle of the browser;
the process ID of the current browser can be obtained by searching the process module, the handle of the browser is further obtained, and the window information is obtained according to the handle of the browser;
the process ID of the current browser can also be obtained by searching the address of the process module, the handle of the browser is further obtained, and the information of the window is obtained according to the handle of the browser.
When information is obtained by enumerating windows, matching against the information component is performed at least by obtaining window information and process information. Taking desktop browsers as an example, a set of common browsers can be defined, and the running window processes are obtained by enumerating window information and matching it against the browser modules.
In one embodiment of the invention, when the target page is obtained by using the selected active window, it is obtained by setting a floating ball, by providing a list of current windows, or by manual selection.
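The matching step can be illustrated with a small sketch. It assumes that a platform-specific enumeration (window handle to process ID to module name) has already produced a list of entries; that enumeration is not shown, and the browser set and the function name are hypothetical.

```python
from typing import Iterable, List, Tuple

# Assumption: an illustrative set of browser module names, not taken from the source.
KNOWN_BROWSERS = {"firefox.exe", "msedge.exe", "chrome.exe"}

WindowEntry = Tuple[int, str, str]  # (process ID, process module name, window title)


def match_browser_windows(
    windows: Iterable[WindowEntry],
    known: Iterable[str] = KNOWN_BROWSERS,
) -> List[WindowEntry]:
    """Keep the enumerated windows whose process module is a known browser."""
    known_lower = {name.lower() for name in known}
    return [w for w in windows if w[1].lower() in known_lower]
```

Comparing names case-insensitively is a choice of this sketch, made because module names are often reported with inconsistent casing.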
The visual content is obtained through remote control, a screenshot function, or page parsing;
when it is obtained through remote control, at least a framework such as selenium, script and the like can be used. Common or mainstream browsers provide support for remote debugging, although some browsers that provide this support lack rendering capability.
Screenshots outside the browser, for example on a mobile device, may be taken in ways known in the art, for example using a screenshot activity or other open-source projects. Screen capture on a desktop client can also be implemented without a standalone browser: for example, several browsers or communities provide suites in Java, C# and other languages, and it can even be implemented by embedding a traditional browser.
If page parsing is selected, any of the foregoing clients or frameworks may be used: for example, conventional page access to obtain the webpage source code, a debugging framework to obtain the page source code, or a customized browser to obtain the source code. For mobile applications other than browsers, however, privacy and security restrictions make it impractical to obtain the source code of another browser or app by page parsing, so screenshots are suggested instead of direct parsing. In a customized information presentation device, by contrast, obtaining the page source code directly works essentially as it does on a desktop.
The obtaining of the page recommendation content is realized through predefined rules or keywords.
Obtaining the recommended content according to a predefined rule means matching a node against the preset website and its recommendation rule, and taking a screenshot of the corresponding node. Taking an information recommendation page of a certain website as an example, the XPath corresponding to the recommendation content is "/html/body/div[1]/div/main/div/div[2]/div[2]/div/div[3]/div[2]". For a stable commercial website, this position is generally fixed within a version cycle; for an information platform whose layout changes frequently, the rule corresponding to the XPath must be updated before it can be used. When the XPath is invalid, a keyword rule can additionally be set, for example searching for the class attribute of the corresponding div, or for the text value (textContent) of the div node, to determine the corresponding node. Some pages use font obfuscation, in which case the actual text can only be obtained by restoring the font mapping or by optical character recognition.
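The XPath-first, keyword-fallback lookup can be sketched as follows. The sketch uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath and requires well-formed markup, so it stands in for a real browser driver; the function name and parameters are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def find_recommendation_node(
    root: ET.Element, xpath: str, class_keyword: str, text_keyword: str
) -> Optional[ET.Element]:
    """Try the configured XPath first, then fall back to keyword rules."""
    # ElementTree only accepts relative paths, so "/html/body/..." becomes "./body/...".
    prefix = "/" + root.tag
    relative = "." + xpath[len(prefix):] if xpath.startswith(prefix + "/") else xpath
    try:
        node = root.find(relative)
    except SyntaxError:  # path is outside ElementTree's XPath subset
        node = None
    if node is not None:
        return node
    # Keyword rule 1: a div whose class attribute contains the keyword.
    for div in root.iter("div"):
        if class_keyword in (div.get("class") or ""):
            return div
    # Keyword rule 2: the smallest div whose text contains the keyword.
    candidates = [d for d in root.iter("div") if text_keyword in "".join(d.itertext())]
    if candidates:
        return min(candidates, key=lambda d: len("".join(d.itertext())))
    return None
```

Choosing the smallest matching div in the text fallback avoids returning an outer container, since every ancestor of the target node also contains the keyword text.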
When the recommended content can be obtained in the first two ways (XPath and keywords), the minimum image-capture area or the content of the text node can be obtained by reading the corresponding attributes of the WebElement; otherwise, full-text analysis should be performed to obtain all keywords, or the distribution of the corresponding nodes should be determined by screen capture.
For content parsed from the DOM tree or the source code, all tags can be obtained by selecting the corresponding nodes and extracting their text, and the attribution of each tag can be determined with a dictionary. In particular, some websites use different style sheets to distinguish the topics they expect users to notice; appropriate spaces can be inserted when the text is extracted and the string then split, which makes it easier to locate the tag attribution. For a long character string, a greedy mode can be selected, in which sliding windows of 2, 3, 4 or 5 characters are used for word segmentation. This approach is more efficient on some commercial websites, such as shopping websites, because listings often contain miswritten words (for example, a CPU product name written with a wrong character) that cannot be confirmed by a dictionary or a word-segmentation tool; even when some words are wrong, a certain hit rate can still be achieved from the correctly written words. A long character string can also be segmented by natural language processing, using tools such as HanLP and jieba or a natural language processing API provided by a service platform.
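A minimal sketch of the greedy sliding-window segmentation, assuming a local dictionary that maps keywords to tags. The dictionary contents, the function name and the longest-window-first order are assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def greedy_window_tags(
    text: str, dictionary: Dict[str, str], sizes: Sequence[int] = (5, 4, 3, 2)
) -> List[str]:
    """Scan the string and emit the tag of each dictionary keyword that is found."""
    hits: List[str] = []
    pos = 0
    while pos < len(text):
        for size in sizes:                   # try the longest window first
            piece = text[pos:pos + size]
            if len(piece) == size and piece in dictionary:
                hits.append(dictionary[piece])
                pos += size                  # consume the matched window
                break
        else:
            pos += 1                         # no keyword starts here; slide by one
    return hits


# Hypothetical dictionary: keyword -> tag
sample_dictionary = {"路由器": "router", "千兆": "network"}
sample_tags = greedy_window_tags("全新路由器支持千兆", sample_dictionary)
# sample_tags == ["router", "network"]
```

Because unmatched characters are simply skipped, a miswritten word elsewhere in the string does not prevent the correctly written keywords from being hit.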
For information obtained by OCR, topics are distinguished in the same way, and the marks corresponding to the nodes of the page are obtained. For example, when the offset (x, y) of the "recommended content" text in the OCR result is (800, 200), text in the OCR result with an x value smaller than 800 should not be considered. The lower limit of the relevant area is determined by a stop-word rule or by critical inference. With the stop-word rule, for example, if the offset of "recommended content" is (800, 200) and the text at offset (860, 3200) is recognized as "more content", that position should be selected as the end point of the recommendation area. With critical inference, all text positions in the trusted area are determined starting from the start point of the recommendation area, the intervals between the text areas are counted, and the nodes that occur regularly are selected as recommendation information nodes.
Another way is to obtain the range of the nodes from DOM information. For example, when the obtained mark offset (x, y) is (800, 200), the DOM tree is traversed and an auxiliary map can be constructed; in Java, for example, a HashMap<WebElement, Point> is constructed, the DOM is parsed layer by layer, and the nodes satisfying the offset are added to the map. Several pieces of recommendation information can be verified in this process. When the page or a rule indicates that there is only one node, the corresponding WebElement can be determined directly by enumeration without constructing a map. Once one or more web page elements have been determined, their specific sizes are obtained, and the OCR recognition results are then matched against the sizes of those elements.
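The stop-word rule for bounding the recommendation area in OCR output can be sketched as below. The tuple layout `(x, y, text)` and the function name are assumptions; a real OCR engine returns richer structures.

```python
from typing import List, Tuple

OcrItem = Tuple[int, int, str]  # (x offset, y offset, recognized text)


def recommendation_region(
    items: List[OcrItem], start_text: str, stop_text: str
) -> List[OcrItem]:
    """Keep items at or right of the start marker and between start and stop markers."""
    start = next((it for it in items if it[2] == start_text), None)
    if start is None:
        return []
    x0, y0 = start[0], start[1]
    # The first stop word below the start marker closes the region.
    stops = [it[1] for it in items if it[2] == stop_text and it[1] > y0]
    y1 = min(stops) if stops else float("inf")
    return [it for it in items if it[0] >= x0 and y0 < it[1] < y1]
```

For instance, with a start marker at (800, 200) and a stop marker at (860, 3200), an item at (100, 300) is dropped because its x value is below 800, and an item at (820, 3300) is dropped because it lies below the stop marker.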
The tag dispersion can be calculated simply as the variance of the non-zero weight values of the corresponding fields, or as the variance of the non-zero weight values of a specific field.
For the calculation of the tag dispersion, taking a shopping website as an example, a local dictionary may be set up with first-level, second-level and third-level classifications by category, and one or more keywords may be assigned to each second-level or third-level classification. For a router, for example, keywords may include brand (hua shi, pu lian, xinhua shi), usage environment (large house type, medium house type, small house type), features (Wifi6, Wi-Fi6, ax), and the like. When a recognized text contains at least one keyword, the weight of the corresponding tag is increased by 1 or by a specific value; for a single text, either only the first matching tag or all matching tags may be counted.
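The weighting and dispersion calculation can be sketched as follows. The keyword dictionary is hypothetical, and the use of population variance (rather than sample variance) is an assumption, since the text only says "variance".

```python
from statistics import pvariance
from typing import Dict, Iterable, List


def tag_weights(
    texts: Iterable[str],
    keywords: Dict[str, List[str]],
    first_match_only: bool = False,
) -> Dict[str, int]:
    """Add 1 to a tag's weight for every text containing at least one of its keywords."""
    weights = {tag: 0 for tag in keywords}
    for text in texts:
        for tag, words in keywords.items():
            if any(word in text for word in words):
                weights[tag] += 1
                if first_match_only:
                    break  # count only the first matching tag for this text
    return weights


def dispersion(weights: Dict[str, int]) -> float:
    """Variance of the non-zero weights (population variance is an assumption)."""
    nonzero = [w for w in weights.values() if w != 0]
    return pvariance(nonzero) if nonzero else 0.0
```

A low dispersion means the recommended tags are spread evenly; a high value means a few tags dominate the recommendation component.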
The node position obtained by screenshot is then determined. When the node position cannot be determined by parsing, the element node position can be determined manually or by a preset rule, for example by selecting a specific area so that it is associated with the node. For a node position determined by a preset rule, the position of the corresponding part in the page can further be obtained from the resolution information and scaling ratio of the page.
The first page and the third page may be pages of the same address, and when the second page includes the recommendation information, the second page may be the third page.
After step 2) has been executed several times, the distribution of the recommended content in the first page or the third page is expected to change. To avoid stale data caused by caching during this process, appropriate settings may be made in the request header.
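The text does not name the header fields. As one possible reading, the sketch below builds a request with the standard `Cache-Control: no-cache` directive and the legacy `Pragma: no-cache` field; the choice of these two fields and the function name are assumptions.

```python
import urllib.request


def no_cache_request(url: str) -> urllib.request.Request:
    """Build a request that asks caches not to answer from a stored copy."""
    headers = {
        "Cache-Control": "no-cache",  # standard HTTP/1.1 request directive
        "Pragma": "no-cache",         # legacy HTTP/1.0 equivalent
    }
    return urllib.request.Request(url, headers=headers)
```

The request object is only constructed here; sending it (for example with `urllib.request.urlopen`) is left to the caller.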
With this method, the prediction of information that is made from browsing records can be influenced so that the pushed information deviates, thereby protecting user privacy.
Drawings
FIG. 1 is a flow chart of a data processing method of the present invention;
FIG. 2 is a flow chart of selective page access in an embodiment of the invention, by which pages without the first attribute are accessed.
Detailed Description
The following are specific examples of the invention, which are intended to be illustrative of the invention only and not limiting.
Taking website 1 as an example: after the Hyper-V virtual machine is installed on a system and website 1 is accessed through a proxy without logging in to an account, the recommended page shows furniture, women's clothing, jewelry and mobile phones. When the proxy is disabled and the Hyper-V virtual switch is configured as the external network type, website 1 pushes advertisements for "mother and baby, home clothing, memory modules and hard disks" according to the source, which is consistent with the content pushed to the external physical computer. After the image has been restored several times and website 1 is accessed through the proxy, the recommended page shows content such as "Bluetooth, women's clothing, sports shoes, mobile phones, leather bags, western-style trousers, sleeping bags, refrigerators, books, storage boxes, laundry detergents, snacks". But when the proxy is disabled, the IP address used by the computer is changed, and website 1 is accessed again, the push results include "mother and baby, crib, bed guard" and computer hardware. In particular, to make the test data reliable, a specific CPU model such as ROME was searched for within the local network before the test; after the proxy was disabled, the content pushed by the website inside the virtual machine included at least associated search results such as "strong platinum" and "strong gold", which indicates that website 1 performs relevance-based recommendation in the background. Tests on a knowledge question-and-answer website, a news website and a self-media website also show that these websites actually track browsing history. User information is processed on this basis, so that significant leakage of information is avoided.
Two types of labels with different mark attributes can be constructed. The first type is the sensitive-information type and includes vocabulary related to mother-and-baby products and computers; the second type consists of vocabulary other than the sensitive-information type. The label set with the first mark attribute contains labels whose appearance in information recommendation the user expects to reduce, and the label set with the second mark attribute contains labels that the user expects to appear in information recommendation with higher priority than the first. In addition, a third type can be subdivided within the second type of mark attribute, so as to increase the probability of targeted recommendation.
When constructing the first-type or second-type keyword vocabulary, one set of top-level keywords may be defined, and further sub-keywords may be derived from them as query keywords or stop words for subsequent use.
Referring to fig. 1-2, the data processing method provided by the present invention includes:
1) obtaining a first target page that contains information of at least one recommendation component used for displaying recommendation information, and obtaining the dispersion or distribution of the recommended values in the recommendation component;
2) if, in the proportion or dispersion of the recommended values, the proportion of the first mark attribute is higher than the expected value R, loading at least one second page without the first mark attribute;
3) recalculating the dispersion of the recommended value on a third target page, if the dispersion is higher than the expected value R, executing the step 2), and if not, exiting;
the first target page and the third target page contain personalized recommendation information;
the dispersion of the recommended numerical value is to obtain the content of the visual part of the recommended component, analyze the content to obtain the information of the text label, count the text label and determine the dispersion of the text label;
the recommendation information of the first target page, the second target page and the third target page is mapped to a tag set at least comprising a first tag attribute, wherein the first tag attribute is a tag attribute expected to be reduced.
The mapping process is to obtain corresponding mark attributes by matching with a dictionary after word segmentation or keyword extraction.
In an embodiment of the present invention, the first page is obtained through a first configuration file, window enumeration, or a selected active window;
when a page is loaded by using a first configuration file, the first configuration file at least comprises one or more of a browser type, a process module name and an address;
when the first configuration file comprises the browser type, the program can obtain the process ID of the current browser in a process searching mode, further obtain the handle of the browser and obtain the window information according to the handle of the browser;
the process ID of the current browser can be obtained by searching the process module, the handle of the browser is further obtained, and the window information is obtained according to the handle of the browser;
the process ID of the current browser can also be obtained by searching the address of the process module, the handle of the browser is further obtained, and the information of the window is obtained according to the handle of the browser.
When information is obtained by enumerating windows, matching against the information component is performed at least by obtaining window information and process information. Taking desktop browsers as an example, a set of common browsers can be defined, and the running window processes are obtained by enumerating window information and matching it against browser process names (such as firefox.exe and msedge.exe).
Correspondingly, it may be provided that the configuration file config.ini comprises at least:
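[Configuration file listing: reproduced as images in the original publication (BDA0003056682410000071, BDA0003056682410000081); the text is not available here.]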
The browser configuration in the configuration file can be obtained by scanning or specified manually by the user. The configuration file also includes the target website and the XPath of the corresponding information recommendation area in the page; the file can be set to update automatically, or the XPath of the information area can be confirmed manually.
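Since the original listing is not available in this text, the following is only an illustrative guess at what such a config.ini might contain, based on the fields named in the description (browser type, process module name, address, target website, XPath). Every section name, key name and value is hypothetical.

```python
import configparser

# Hypothetical contents: all section names, keys and values are assumptions for illustration.
SAMPLE_CONFIG = """
[browser]
type = firefox
process_module = firefox.exe
address = C:\\Program Files\\Mozilla Firefox\\firefox.exe

[target]
url = https://example.com/
recommend_xpath = /html/body/div[1]/div/main/div/div[2]/div[2]/div/div[3]/div[2]
"""


def load_config(text: str) -> configparser.ConfigParser:
    """Parse the INI text; interpolation is disabled so '%' in values is kept literally."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return parser


config = load_config(SAMPLE_CONFIG)
browser_module = config["browser"]["process_module"]
recommend_xpath = config["target"]["recommend_xpath"]
```

Keeping the XPath in the file rather than in code matches the description's point that the rule must be updated when the page layout changes.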
When the XPath of the target page is confirmed manually using the selected active window, the target page is obtained by setting a floating ball, by providing a list of current windows, or by manual selection, and the position of the information recommendation area is then obtained in this way.
In an embodiment of the present invention, the information of the text label includes content and position of the text label, and the position of the text label is a relative position or an absolute position in the page.
In another embodiment of the invention, the marking attribute of the text label is determined according to the rule, and the dispersion of the recommended value in the page is calculated according to the marking attribute.
The visual content is obtained through remote control, a screenshot function, or page parsing;
when the information is obtained through remote control, at least a framework such as selenium, script and the like can be used. Common or mainstream browsers provide support for remote debugging; however, some browsers, especially headless browsers that lack reliable rendering, do not support rendering a page and capturing an image of it, so the corresponding rendering information cannot be obtained. To prevent scraping, or for flow-control reasons, rendering is in practice often carried out in several passes, and part of it is triggered only by specific events, so processing information based only on the structure of the DOM tree may miss information; and for some rendering results, if visibility is not taken into account, information containing honeypots is very likely to be obtained. Browser plug-ins, particularly those with content-filtering capability, may add to or remove from the DOM tree during and after page loading, so such plug-ins should be disabled during this process.
Taking the selenium suite as an example, for a page with recommendation information, image information of analytical value can be obtained by setting the resolution. After the page is loaded, if the implementation is in Java, screen information can be obtained by calling an interface that implements TakesScreenshot. An ordinary page usually cannot be captured in a single screenshot, so the complete information is obtained by scrolling the page and capturing several times. When capturing, image loading can be disabled (this can be set in the browser's configuration file) to avoid stitching difficulties caused by pixel changes in some images. For cases in which images must be loaded, the captures can instead be stitched at the position of maximum matching. For example, after a 1200 x 800 picture is captured, further content is loaded by scrolling down with selenium. In this process, the RGB values of the last 2 lines of the previous image can be saved and reduced to C(x, y) = (R + G + B)/3, where x is the horizontal offset in the image and y is the vertical offset; a pixel with C greater than 128 is marked 0 and a pixel with C less than 128 is marked 1. The top 1 pixel row, or the top 25 rows, of the new image are processed in the same way, and after processing, the position with the maximum matching value is selected for stitching.
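The maximum-matching stitch can be sketched on plain nested lists of RGB tuples. This is an assumption-laden illustration: the function names are hypothetical, a pixel with C exactly 128 is marked 1 (the text leaves that case open), and a real implementation would read pixels from the captured screenshots.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]  # rows of (R, G, B) pixels


def binarize(rows: Image) -> List[List[int]]:
    """Mark a pixel 0 when C = (R + G + B) / 3 exceeds 128, otherwise 1."""
    return [[0 if (r + g + b) / 3 > 128 else 1 for (r, g, b) in row] for row in rows]


def best_overlap(prev: Image, new: Image, tail_rows: int = 2, search_rows: int = 25) -> int:
    """Return the row index in `new` where the last rows of `prev` match best."""
    tail = binarize(prev[-tail_rows:])
    head = binarize(new[:search_rows + tail_rows])
    best_offset, best_score = 0, -1
    for offset in range(len(head) - tail_rows + 1):
        score = sum(
            a == b
            for t_row, h_row in zip(tail, head[offset:offset + tail_rows])
            for a, b in zip(t_row, h_row)
        )
        if score > best_score:  # keep the first offset with the highest match count
            best_offset, best_score = offset, score
    return best_offset


def stitch(prev: Image, new: Image, tail_rows: int = 2, search_rows: int = 25) -> Image:
    """Append the part of `new` that lies below the matched overlap."""
    offset = best_overlap(prev, new, tail_rows, search_rows)
    return prev + new[offset + tail_rows:]
```

Binarizing before comparison makes the match tolerant of small colour changes between captures, which is the motivation the text gives for not comparing raw pixels.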
Screenshots outside the browser, for example on a mobile device, may be taken in ways known in the art, for example using a screenshot activity or other open-source projects. Screen capture on a desktop client can also be implemented without a standalone browser: for example, several browsers or communities provide suites in Java, C# and other languages, and it can even be implemented by embedding a traditional browser. Most of these suites implement the webkit kernel, support the execution of script languages, and provide more convenient interfaces for the corresponding processing, for example, after the WebKitBaser. In cases where the foregoing example is not used, an API provided by the system or the programming language may be used to capture the image of a particular region.
If page parsing is selected, any of the aforementioned clients or frameworks may be used: for example, conventional page access to obtain the webpage source code, a debugging framework to obtain the page source code, or a customized browser to obtain the source code. For mobile applications other than browsers, however, privacy and security restrictions make it impractical to obtain the source code of another browser or app by page parsing, so screenshots are suggested instead of direct parsing.
The obtaining of the page recommendation content is realized through predefined rules or keywords.
Obtaining the recommended content according to a predefined rule means matching a node against the preset website and its recommendation rule, and taking a screenshot of the corresponding node. Taking an information recommendation page of a certain website as an example, the XPath corresponding to the recommendation content is "/html/body/div[1]/div/main/div/div[2]/div[2]/div/div[3]/div[2]". For a stable commercial website, this position is generally fixed; when the webpage is modified, the rule corresponding to the XPath must be updated before it can be used. When the XPath is invalid, a keyword rule can additionally be set, for example searching for the class attribute of the corresponding div, or for the text value (textContent) of the div node, to determine the corresponding node. Some pages use font obfuscation, in which case the actual text can only be obtained by restoring the font mapping or by optical character recognition.
When the recommended content can be obtained in the first two ways (XPath and keywords), the minimum image-capture area or the content of the text node can be obtained by reading the corresponding attributes of the WebElement; otherwise, full-text analysis should be performed to obtain all keywords, or the distribution of the corresponding nodes should be determined by screen capture.
For content parsed from the DOM tree or the source code, all tags can be obtained by selecting the corresponding nodes and extracting their text, and the attribution of each tag can be determined with a dictionary. In particular, some websites use different style sheets to distinguish the topics they expect users to notice; appropriate spaces can be inserted when the text is extracted, which reduces the matching workload, and the string is then split, which makes it easier to locate the tag attribution. For a long character string, a greedy mode can be selected, in which sliding windows of 2, 3, 4 or 5 characters are used for word segmentation. This approach is more efficient on some commercial websites, such as shopping websites, because listings often contain miswritten words (for example, a CPU product name written with a wrong character) that cannot be confirmed by a dictionary or a word-segmentation tool; even when some words are wrong, a certain hit rate can still be achieved from the correctly written words. A long character string can also be segmented by natural language processing, using tools such as HanLP and jieba or a natural language processing API provided by a service platform.
For information obtained by OCR, the topics are distinguished in the manner described above, and the marks corresponding to the nodes of the page are obtained. For example, when the offset (x, y) of "recommended content" in the OCR result is (800, 200), text in the OCR result whose x value is smaller than 800 should not be considered. The lower limit of the relevant area is determined by a stop-word rule or by critical inference. With the stop-word rule, for example, if the offset (x, y) of "recommended content" is (800, 200) and the text at offset (860, 3200) is recognized as "more content", that position should be selected as the lower limit, i.e. the end point of the recommended area.
With critical inference, all text positions in the trusted region are determined starting from the start point of the recommended area, the intervals at which the text regions appear are counted, and the nodes that show regularity are selected as recommended information nodes. For example, "30", "2999.0", "rival I7", "1799", "E5 king", "transparent", "second I7 eight core host", "1488.00", "2021", "30" and "199.00" are recognized at (810,220), (815,282), (850,320), (820,360), (830,420), (830,480), (1024,520), (810,560), (850,720), (810,840) and (890,840) respectively, and the corresponding text distribution is determined at intervals of 280. Specifically, the calculation proceeds in reverse order from the maximum vertical offset, for example y = 840. Assuming that the total number of valid information nodes is I, each number i from 1 to I is tried in turn: it is judged whether recognizable characters exist at the pixel points at intervals of y/i, with a spacing of 5; a match is recorded as 1 and a failed match as -1, and the actual matching total is counted. The maximum total of matching nodes determined in this way gives the actual rendering unit, and the clickable query unit is selected on that basis.
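The stop-word rule can be sketched as a simple filter over OCR results. The example below is a minimal sketch under stated assumptions: the `OcrText` holder, the `between` method, and the sample items are hypothetical, and the region is taken to be everything at or to the right of the start marker and above the stop word. It does not implement the critical-inference variant.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: bounds the recommended area by a start marker and a stop word.
final class OcrRegion {

    /** One recognized text fragment with its page offset. */
    static final class OcrText {
        final int x;
        final int y;
        final String text;

        OcrText(int x, int y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /** Returns the texts lying between the start marker and the nearest stop word below it. */
    static List<String> between(List<OcrText> items, String startMarker, String stopWord) {
        List<String> kept = new ArrayList<>();
        OcrText start = null;
        for (OcrText t : items) {
            if (t.text.equals(startMarker)) {
                start = t;
                break;
            }
        }
        if (start == null) {
            return kept;
        }
        OcrText stop = null;
        for (OcrText t : items) {
            boolean below = t.text.equals(stopWord) && t.y > start.y;
            if (below && (stop == null || t.y < stop.y)) {
                stop = t;
            }
        }
        if (stop == null) {
            return kept;
        }
        for (OcrText t : items) {
            // Skip text left of the marker and text at or below the stop word.
            if (t != start && t.x >= start.x && t.y >= start.y && t.y < stop.y) {
                kept.add(t.text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<OcrText> items = Arrays.asList(
            new OcrText(800, 200, "recommended content"),
            new OcrText(810, 220, "30"),
            new OcrText(100, 300, "sidebar"),
            new OcrText(860, 3200, "more content"));
        System.out.println(between(items, "recommended content", "more content"));
    }
}
```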
Another way is to obtain the range of nodes from DOM information. For example, when the obtained mark offset (x, y) is (800, 200), the DOM tree is traversed and an auxiliary map can be constructed; taking Java as an example, a map HashMap<WebElement, Point> is constructed, the DOM is analyzed layer by layer, and the nodes that satisfy the offset are added to the map. In this process several pieces of recommended information can be verified. When the page or the rule indicates that there is only one node, the corresponding web page element (WebElement) can be determined directly by enumeration, without constructing the map. Once one or more web page elements have been determined, their specific sizes are obtained and the OCR recognition results are matched against the sizes of the web page elements; the clickable query area is determined on that basis.
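The layer-by-layer traversal can be sketched without a browser by using a stand-in node type. In the sketch below, `PageNode` is a hypothetical substitute for a WebElement together with its page offset, an `int[]` pair stands in for Point, and "satisfying the offset" is assumed to mean lying at or beyond the mark in both x and y; none of these choices is stated in the original text.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: builds the auxiliary map of nodes at or beyond a mark offset.
final class OffsetIndex {

    /** Stand-in for a web page element and its offset in the page. */
    static final class PageNode {
        final String name;
        final int x;
        final int y;
        final List<PageNode> children = new ArrayList<>();

        PageNode(String name, int x, int y) {
            this.name = name;
            this.x = x;
            this.y = y;
        }

        PageNode add(PageNode child) {
            children.add(child);
            return this;
        }
    }

    /** Breadth-first walk; maps each qualifying node to its {x, y} offset in visit order. */
    static Map<PageNode, int[]> collect(PageNode root, int minX, int minY) {
        Map<PageNode, int[]> result = new LinkedHashMap<>();
        Deque<PageNode> queue = new ArrayDeque<>();
        queue.add(root);
        while (!queue.isEmpty()) {
            PageNode node = queue.poll();
            if (node.x >= minX && node.y >= minY) {
                result.put(node, new int[] {node.x, node.y});
            }
            queue.addAll(node.children);
        }
        return result;
    }

    public static void main(String[] args) {
        PageNode root = new PageNode("body", 0, 0)
            .add(new PageNode("sidebar", 100, 250))
            .add(new PageNode("rec", 800, 200).add(new PageNode("item1", 810, 220)));
        for (PageNode node : collect(root, 800, 200).keySet()) {
            System.out.println(node.name);
        }
    }
}
```

With a real browser driver the same loop would read each element's location and size from the driver instead of from fields.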
A simple method for calculating the label dispersion is to calculate the variance of the non-zero weight values of the corresponding field, and to calculate the ratio R_exp of the first-type labels that are expected to be reduced.
In one embodiment of the invention, the proportions of the attributes corresponding to all labels that appear are calculated, giving n labels and their proportions, of which m labels belong to the first class. The square of the proportion of each label i is calculated, and the sum of the squares over all labels belonging to the first class is taken as R_exp, specifically by the following formula:
[Figure BDA0003056682410000111: formula for R_exp]
In one embodiment of the present invention, the label dispersion is calculated as the variance of the corresponding weight values in the field to which the label belongs.
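The two quantities can be sketched as follows. This is one reading of the description, not the original formula, which is only available as an image: the proportion of each label is assumed to be its weight divided by the total weight, R_exp is taken as the sum of squared proportions over the first-class labels, and the variance is the population variance of the non-zero weights. The class name `LabelDispersion` and the sample labels are assumptions.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: dispersion and first-class ratio from label weights.
final class LabelDispersion {

    /** Population variance of the non-zero weights; 0 when every weight is zero. */
    static double nonZeroVariance(double[] weights) {
        int count = 0;
        double sum = 0;
        for (double w : weights) {
            if (w != 0) {
                count++;
                sum += w;
            }
        }
        if (count == 0) {
            return 0;
        }
        double mean = sum / count;
        double squared = 0;
        for (double w : weights) {
            if (w != 0) {
                squared += (w - mean) * (w - mean);
            }
        }
        return squared / count;
    }

    /** Sum of squared proportions of the labels that belong to the first class (assumed reading). */
    static double rExp(Map<String, Double> weights, Set<String> firstClass) {
        double total = 0;
        for (double w : weights.values()) {
            total += w;
        }
        if (total == 0) {
            return 0;
        }
        double r = 0;
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            if (firstClass.contains(e.getKey())) {
                double p = e.getValue() / total;
                r += p * p;
            }
        }
        return r;
    }

    public static void main(String[] args) {
        Map<String, Double> weights = new HashMap<>();
        weights.put("mother and baby", 4.0);
        weights.put("computer hardware", 4.0);
        weights.put("snacks", 2.0);
        Set<String> firstClass = new HashSet<>();
        firstClass.add("mother and baby");
        firstClass.add("computer hardware");
        System.out.println(rExp(weights, firstClass));
    }
}
```

In this sample the two first-class labels each hold a proportion of 0.4, so the ratio is 0.4² + 0.4² = 0.32.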
For the calculation of the label dispersion, taking a shopping site as an example, a local dictionary may be set up according to first-level categories, including "women's clothing/underwear/home, women's shoes/men's shoes/bags, mother and baby/children's clothes/toys, men's clothing/sports outdoors, make-up/color make-up/personal care, mobile phones/digital/enterprises, large appliances/household appliances, snacks/fresh/tea wine, kitchen ware/storage/cleaning, home textiles/home decorations/flowers, books audio/stationery, healthcare/import medicines, automobiles/second-hand vehicles/supplies, home products/decoration furniture/building materials, watches/glasses/jewelry ornaments".
A first-level category such as "women's clothing/underwear/home" may further include second-level or third-level entries such as "fashion trends, sweater, coat, small blackskirt, robustum, cotton wool, suit coat, babyb's trousers, steel-ring-free bra, brassiere for beauty, top dress, sweater, shirt, T-shirt, vest, chiffon, dress, short coat, sweater, fur coat, windbreaker, suit, fur coat, dress, panty, casual pants, leggings, wadded down pants, distinctive dress, suit-size dress, suit-in-the-old, fashion suit, sport suit, apparel service, laundry service, socks, silk stockings, panty-hose, boatswaps, home suit, thermal undergarment, pajamas, socks, camisole, braces, gown, pajamas, underpants, ladies' underpants, men's underpants, brassiere patch, bra, pad, buckle, body shaping pad, body shaping, coverlet", and so on.
One or more keywords are set for each second-level or third-level category. For example, for a router the keywords include brands (Huashi, Puyi, Xinhua), environments (large house type, medium house type and small house type) and characteristics (Wifi6, Wi-Fi6, ax). The weight value can be increased by 1 or by a specific value for each matching text; either only the first matched label or all matched labels can be counted, and the dispersion can then be calculated from the weight values by the method mentioned above.
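The keyword weighting can be sketched as a small scoring routine. The class name `CategoryWeights`, its parameters, and the sample categories below are assumptions for illustration; matching is done by case-insensitive substring search, which is only one possible rule and can over-match short keywords such as "ax".

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: adds a weight to each category whose keywords occur in a text.
final class CategoryWeights {

    /**
     * Scores every text against the keyword lists. With firstMatchOnly set, a text
     * counts only for the first category (in map order) that it matches.
     */
    static Map<String, Integer> score(List<String> texts,
                                      Map<String, List<String>> keywordsByCategory,
                                      int increment,
                                      boolean firstMatchOnly) {
        Map<String, Integer> weights = new LinkedHashMap<>();
        for (String category : keywordsByCategory.keySet()) {
            weights.put(category, 0);
        }
        for (String text : texts) {
            String lower = text.toLowerCase(Locale.ROOT);
            for (Map.Entry<String, List<String>> entry : keywordsByCategory.entrySet()) {
                boolean matched = false;
                for (String keyword : entry.getValue()) {
                    if (lower.contains(keyword.toLowerCase(Locale.ROOT))) {
                        matched = true;
                        break;
                    }
                }
                if (matched) {
                    weights.merge(entry.getKey(), increment, Integer::sum);
                    if (firstMatchOnly) {
                        break;
                    }
                }
            }
        }
        return weights;
    }

    public static void main(String[] args) {
        Map<String, List<String>> keywords = new LinkedHashMap<>();
        keywords.put("router", Arrays.asList("Wifi6", "Wi-Fi6", "ax"));
        keywords.put("women's clothing", Arrays.asList("dress", "sweater"));
        List<String> texts = Arrays.asList("Wi-Fi6 router for large house", "Sweater dress");
        System.out.println(score(texts, keywords, 1, false));
    }
}
```

The resulting weights are the input for the variance and ratio calculations described earlier.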
Referring to FIG. 2, after the clickable or accessible click query regions are obtained, a set may be constructed and a time interval set, for example (5000 + new Date().getTime() % 12000) ms in a Java implementation, after which it is tested whether the next clickable region in the set belongs to the target page.
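The interval expression above yields a delay between 5000 ms and 16999 ms that depends on the current clock value. A minimal sketch follows; the class name `ClickPacer` and the method name are assumptions, and the clock value is passed in as a parameter so that the calculation can be checked without waiting.

```java
import java.util.Date;

// Hypothetical helper: computes the pause before the next click query.
final class ClickPacer {

    /** Returns 5000 plus the clock value modulo 12000, i.e. a delay in [5000, 16999] ms. */
    static long nextDelayMillis(long nowMillis) {
        return 5000 + nowMillis % 12000;
    }

    public static void main(String[] args) {
        long delay = nextDelayMillis(new Date().getTime());
        System.out.println("next click after " + delay + " ms");
    }
}
```

A caller would pass the result to `Thread.sleep` before testing the next clickable region in the set.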
In another embodiment of the invention, a limit is set on the number of queries made through the click query areas of a single page source, that is, the number of queries of clickable areas based on a single page source does not exceed this limit, so as to avoid being intercepted.
In another embodiment of the present invention, the number of queries through the click query areas of a single page source is set to 1, that is, the next page is always obtained from a new page source.
However, a clickable or accessible unit determined in the above manner may not belong to the target page (i.e. it contains the first label attribute). Therefore, when it is found after this determination that no clickable area remains, the unit corresponding to queryField in the configuration file can be obtained and a value assigned to it by calling ajax or by modifying the DOM; alternatively, the query position is obtained so that the input focus is acquired, and the input is then entered with an automatic click tool.
In this embodiment, the positions of elements or nodes are determined from the node positions obtained by screenshot. Where the position of an element or node cannot be determined by analysis, it can be determined manually, for example by selecting a specific area so that the area is associated with the node. For a node position determined by a preset rule, the position of the corresponding part in the page can further be obtained from the resolution information and the scaling ratio of the page.
Referring to FIG. 2, after a cycle is completed, for example after at least 60 pages have been accessed, the first page is accessed again (or the second page, when it contains recommended content, or a third page designated as a test page), and the recommendation dispersion is recalculated in the foregoing manner. It is found that the occurrence rate of the phrases to which the target keywords belong is significantly reduced; for example, the recommendations related to "mother and infant" and "computer hardware" have disappeared. When tested in a virtual machine environment, the recommendations likewise no longer include those related to "mother and infant" and "computer hardware", which shows that the operations change the website's portrait of the user to a certain extent.
In one example, the above operations are performed after logging in to the account, and after the network environment and the type of client are changed (PC → Android phone), the content of the push area shows approximately the same change.
Similar operations are applicable to knowledge question-and-answer platforms and self-media platforms that use intelligent recommendation, thereby protecting personal preference information.
In some cases, for network reasons, locally acquired information may be cached, which affects the calculation of the dispersion. This can be avoided by setting an HTTP header that includes a no-cache setting.
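One way to apply a no-cache setting from Java is shown below as a sketch. It prepares, but does not send, a request with the standard `Cache-Control: no-cache` header and the older `Pragma: no-cache` header; the class name `NoCacheRequest` and the example URL are assumptions, and a browser-driven implementation would set the equivalent option in the browser instead.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper: prepares a request that asks caches to revalidate.
final class NoCacheRequest {

    /** Opens (without connecting) an HTTP connection configured to bypass cached copies. */
    static HttpURLConnection prepare(String url) {
        try {
            HttpURLConnection connection =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
            connection.setUseCaches(false);                           // skip any local cache
            connection.setRequestProperty("Cache-Control", "no-cache");
            connection.setRequestProperty("Pragma", "no-cache");      // for HTTP/1.0 caches
            return connection;
        } catch (IOException e) {
            throw new IllegalArgumentException("cannot open " + url, e);
        }
    }

    public static void main(String[] args) {
        HttpURLConnection connection = prepare("https://example.com/");
        System.out.println(connection.getRequestProperty("Cache-Control"));
    }
}
```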
In some cases, a website is not suited to protecting sensitive information in this way. In that case a threshold may be set, for example step 2) is executed at most 3 times and the number of second pages loaded in each execution of step 2) does not exceed 100, so as to avoid being misjudged by the service platform as abnormal access.
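The bounded repetition of step 2) can be sketched as a loop with two limits. In the sketch below, the class name `BoundedDilutionLoop`, the functional parameters, and the way the ratio is supplied are assumptions; measuring the ratio and loading pages are left to the caller, so only the control flow is illustrated.

```java
import java.util.function.DoubleSupplier;
import java.util.function.IntConsumer;

// Hypothetical helper: repeats step 2) while the first-label ratio stays above R.
final class BoundedDilutionLoop {

    static final int MAX_PAGES_PER_ROUND = 100;

    /**
     * Measures the ratio, and while it exceeds expectedR loads second pages again,
     * for at most maxRounds rounds. Returns the number of rounds executed.
     */
    static int run(DoubleSupplier firstLabelRatio,
                   IntConsumer loadSecondPages,
                   double expectedR,
                   int maxRounds,
                   int pagesPerRound) {
        int pages = Math.min(pagesPerRound, MAX_PAGES_PER_ROUND);
        int rounds = 0;
        while (rounds < maxRounds && firstLabelRatio.getAsDouble() > expectedR) {
            loadSecondPages.accept(pages);
            rounds++;
        }
        return rounds;
    }

    public static void main(String[] args) {
        double[] measured = {0.5, 0.3, 0.1};   // simulated ratios after each round
        int[] index = {0};
        int rounds = run(() -> measured[Math.min(index[0]++, measured.length - 1)],
                         pages -> System.out.println("loading " + pages + " second pages"),
                         0.2, 3, 60);
        System.out.println("rounds executed: " + rounds);
    }
}
```

The limit of 3 rounds and the cap of 100 pages mirror the example thresholds given above; other values can be supplied by the caller.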
By this method, the information prediction that a platform performs from browsing records is made to deviate, so that the pushed information is shifted and the privacy of the user is protected.

Claims (6)

1. A method of data processing, comprising:
1) obtaining a first target page, wherein the first target page comprises information of at least one recommendation component used for displaying recommendation information, and obtaining the dispersion of recommended values in the recommendation component;
2) if the proportion of the first label attribute in the recommended values is higher than the expected value R, at least loading a second page without the first label attribute;
3) recalculating the dispersion of the recommended values on a third target page, and if the proportion of the first label attribute is higher than the expected value R, executing step 2), otherwise exiting;
wherein the first target page and the third target page contain personalized recommendation information, and the recommendation information is mapped to a label set containing the first label attribute, the first label attribute being a label attribute expected to be reduced;
and wherein the dispersion of the recommended values is obtained by acquiring the content of the visual part of the recommendation component, analyzing the content to obtain the information of text labels, counting the text labels, and determining the dispersion of the text labels.
2. The method of claim 1, wherein the information of the text label comprises content and position of the text label, and the position of the text label is a relative position or an absolute position in a page.
3. The method of claim 1, wherein the label attributes of the text labels are determined according to rules, and the dispersion of the recommended values in the page is calculated according to the label attributes.
4. The method of claim 1, wherein the first target page is obtained via a first configuration file, a browser window, or a selected active window;
when a page is loaded by using a first configuration file, the first configuration file at least comprises one or more of a browser type, a process module name and an address;
when the first configuration file comprises the browser type, acquiring a process ID of the current browser in a process searching mode, further acquiring a handle of the browser, and acquiring window information according to the handle of the browser;
or acquiring the process ID of the current browser in a mode of searching a process module, further acquiring a handle of the browser, and acquiring window information according to the handle of the browser;
or obtaining the process ID of the current browser by searching the process module address, further obtaining the handle of the browser, and obtaining the window information according to the handle of the browser.
5. The method of claim 1, wherein the visual content is obtained by remote control, screenshot functionality or parsing a DOM tree of the page.
6. A data processing apparatus comprising storage means having stored therein instructions, characterized in that said instructions, when executed, implement the functionality of any of claims 1-5.
CN202110501807.3A 2021-05-08 2021-05-08 Data processing method and device Pending CN113158061A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110501807.3A CN113158061A (en) 2021-05-08 2021-05-08 Data processing method and device

Publications (1)

Publication Number Publication Date
CN113158061A true CN113158061A (en) 2021-07-23

Family

ID=76873854

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110501807.3A Pending CN113158061A (en) 2021-05-08 2021-05-08 Data processing method and device

Country Status (1)

Country Link
CN (1) CN113158061A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070067305A1 (en) * 2005-09-21 2007-03-22 Stephen Ives Display of search results on mobile device browser with background process
US20080040653A1 (en) * 2006-08-14 2008-02-14 Christopher Levine System and methods for managing presentation and behavioral use of web display content
US20140201620A1 (en) * 2013-01-15 2014-07-17 Webezo Inc. Method and system for intelligent web site information aggregation with concurrent web site access
CN105045864A (en) * 2015-07-10 2015-11-11 浙江工商大学 Personalized recommendation method of digital resources
CN107305557A (en) * 2016-04-20 2017-10-31 北京陌上花科技有限公司 Content recommendation method and device
US20210103632A1 (en) * 2019-10-08 2021-04-08 Adobe Inc. Content aware font recommendation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
胡州明;彭柏;赵永彬;杨帆;金成明;: "基于页面交互机制的浏览器整体架构设计", 现代电子技术, no. 15 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination