CN111723263A - Webpage data processing method, device, equipment and storage medium - Google Patents

Publication number
CN111723263A
Authority
CN
China
Prior art keywords
character
webpage data
web page
character conversion
original
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010567841.6A
Other languages
Chinese (zh)
Other versions
CN111723263B (en)
Inventor
王亚森
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Tongbang Zhuoyi Technology Co ltd
Original Assignee
Beijing Tongbang Zhuoyi Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Tongbang Zhuoyi Technology Co ltd
Priority to CN202010567841.6A
Publication of CN111723263A
Application granted
Publication of CN111723263B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The embodiment of the invention provides a method, a device, equipment and a storage medium for processing webpage data. The method comprises the following steps: receiving first intermediate webpage data sent by a server, wherein the first intermediate webpage data are generated after the server converts characters in original webpage data according to a first character conversion relation; and converting the first intermediate webpage data according to a second character conversion relation to generate original webpage data, and rendering the original webpage data to a webpage, wherein the first character conversion relation and the second character conversion relation are both the relation of conversion between characters of at least two sets of character sets. According to the technical scheme of the embodiment of the invention, the webpage data can be simply and effectively prevented from being captured by the web crawler, the complexity of front-end page development is not increased, and the monitoring is not required at the server end, so that the development and operation cost can be reduced.

Description

Webpage data processing method, device, equipment and storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a storage medium for processing web page data.
Background
With the development of internet technology, the World Wide Web has become a carrier of vast amounts of information, and web crawler technology capable of capturing web page resources has emerged. However, the owners of some web page resources or web page data do not want them to be crawled, so how to keep web page data from being captured by web crawlers has become a focus of attention.
In one technical solution, the important data display elements of a page are split up, useless junk obfuscation elements are inserted, and the important data display elements are hidden using Cascading Style Sheets (CSS), so that a crawler parsing the page source code captures the useless obfuscation data as well. However, in this solution, obfuscating page elements increases development complexity, and the obfuscation is easy to crack, which causes the anti-crawler measure to fail.
Therefore, how to simply and effectively prevent web page data from being captured by web crawlers has become a technical problem that urgently needs to be solved.
Disclosure of Invention
The embodiment of the invention provides a webpage data processing method, a webpage data processing device, equipment and a storage medium, which are used for solving the problem of simply and effectively preventing webpage data from being captured by a web crawler.
According to a first aspect of the embodiments of the present invention, there is provided a method for processing web page data, including:
receiving first intermediate webpage data sent by a server, wherein the first intermediate webpage data are generated after the server converts characters in original webpage data according to a first character conversion relation;
converting the first intermediate webpage data according to a second character conversion relation to generate original webpage data, rendering the original webpage data to a webpage,
the first character conversion relationship and the second character conversion relationship are both the relationships of conversion between characters of at least two sets of character sets.
In some example embodiments, the second character conversion relationship comprises a first character mapping relationship and a second character mapping relationship, and the converting the first intermediate web page data according to the second character conversion relationship comprises:
converting the first intermediate webpage data according to the first character mapping relationship to generate second intermediate webpage data, wherein the second intermediate webpage data can be converted into the original webpage data through the second character mapping relationship;
and converting the second intermediate webpage data into the original webpage data according to the second character mapping relation, and rendering the original webpage data to a webpage.
In some example embodiments, the second character mapping relationship is a font file containing a character mapping relationship table, and the converting the second intermediate web page data into the original web page data according to the second character mapping relationship includes:
and converting the characters in the second intermediate webpage data into characters corresponding to the original webpage data according to the character mapping relation table in the font file.
In some example embodiments, the first character conversion relationship comprises a third character mapping relationship, the second character conversion relationship comprises a fourth character mapping relationship, and the third character mapping relationship is an inverse mapping of the fourth character mapping relationship.
In some example embodiments, the first character conversion relationship and the second character conversion relationship belong to a set of character conversion relationships of a predetermined number of sets of character conversion relationships, the method further comprising:
and selecting a set of character conversion relations from the preset number of sets of character conversion relations, wherein each set of character conversion relations comprises at least two character conversion relations, and the character conversion relations are used for converting characters in the webpage data.
In some example embodiments, said selecting a set of character conversion relationships from said predetermined number of sets of character conversion relationships comprises:
taking the current date modulo the predetermined number to obtain the corresponding remainder;
and selecting a set of character conversion relationships from the predetermined number of sets of character conversion relationships according to the value of the remainder.
In some example embodiments, said selecting a set of character conversion relationships from said predetermined number of sets of character conversion relationships comprises:
randomly selecting a set of character conversion relationships from the predetermined number of sets of character conversion relationships.
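The two selection strategies above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the set names, the helper names `select_by_date` and `select_randomly`, and the choice of reducing the date to an integer with `date.toordinal()` are assumptions, since the text does not say how the date is turned into a number. The pool size of four mirrors the four sets mentioned for FIG. 5.

```python
import datetime
import random

# Hypothetical pool of character conversion relationship sets (placeholders).
RELATION_SETS = ["set_0", "set_1", "set_2", "set_3"]


def select_by_date(today, sets=RELATION_SETS):
    """Pick a set using the current date modulo the number of sets."""
    # Assumption: the date is reduced to an integer via its ordinal day count.
    remainder = today.toordinal() % len(sets)
    return sets[remainder]


def select_randomly(sets=RELATION_SETS, rng=random):
    """Pick a set uniformly at random."""
    return rng.choice(sets)


chosen = select_by_date(datetime.date.today())
```

One plausible reason to prefer the date-based rule is that both sides can derive the same set independently, provided they agree on the current date; with a random choice, the selected set would presumably have to be communicated, which the text does not specify.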
In a second aspect of the embodiments of the present invention, there is provided a web page data processing apparatus, including:
the receiving module is used for receiving first intermediate webpage data sent by a server side, wherein the first intermediate webpage data are generated after the server side converts characters in original webpage data according to a first character conversion relation;
a conversion module for converting the first intermediate web page data according to a second character conversion relationship to generate the original web page data and rendering the original web page data to a web page,
the first character conversion relationship and the second character conversion relationship are both the relationships of conversion between characters of at least two sets of character sets.
In some example embodiments, the second character conversion relationship comprises a first character mapping relationship and a second character mapping relationship, the conversion module comprises:
the first conversion unit is used for converting the first intermediate webpage data according to the first character mapping relationship to generate second intermediate webpage data, wherein the second intermediate webpage data can be converted into the original webpage data through the second character mapping relationship;
and the second conversion unit is used for converting the second intermediate webpage data into the original webpage data according to the second character mapping relation and rendering the original webpage data to a webpage.
In some example embodiments, the second character mapping relationship is a font file including a character mapping relationship table, and the second conversion unit is further specifically configured to:
and converting the characters in the second intermediate webpage data into characters corresponding to the original webpage data according to the character mapping relation table in the font file.
In some example embodiments, the first character conversion relationship comprises a third character mapping relationship, the second character conversion relationship comprises a fourth character mapping relationship, and the third character mapping relationship is an inverse mapping of the fourth character mapping relationship.
In some example embodiments, the first character conversion relationship and the second character conversion relationship belong to one of a predetermined number of sets of character conversion relationships, the apparatus further comprising:
and the rule selection module is used for selecting a set of character conversion relations from the preset number of sets of character conversion relations, wherein each set of character conversion relations comprises at least two character conversion relations, and the character conversion relations are used for converting characters in the webpage data.
In some example embodiments, the rule selection module is further specifically configured to:
taking the current date modulo the predetermined number to obtain the corresponding remainder;
and selecting a set of character conversion relationships from the predetermined number of sets of character conversion relationships according to the value of the remainder.
In some example embodiments, the rule selection module is further specifically configured to:
randomly selecting a set of character conversion relationships from the predetermined number of sets of character conversion relationships.
In a third aspect of the embodiments of the present invention, there is provided an electronic device, including:
a memory, a processor; wherein,
a memory for storing the processor-executable instructions;
the processor is configured to implement the web page data processing method according to the first aspect.
In a fourth aspect of the embodiments of the present invention, a computer-readable storage medium is provided, in which computer-executable instructions are stored, and when the computer-executable instructions are executed by a processor, the computer-readable storage medium is configured to implement the method for processing web page data according to the first aspect.
According to the web page data processing method, apparatus, device, and storage medium provided by the embodiments of the present invention, on the one hand, one or more character conversion relationships are configured in advance, and the server converts the original web page data according to those relationships to generate intermediate web page data, which it returns to the client. Because the data returned by the server to the client is obfuscated, a web crawler cannot capture the original web page data, so crawling is prevented simply and effectively. On the other hand, the browser converts the intermediate web page data back into the original web page data according to the character conversion relationships and displays it, so the page displays normally, the complexity of front-end page development does not increase, and no monitoring is needed on the server, which reduces development and operating costs.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a schematic block diagram of an application scenario of a web page data processing method of some embodiments of the present invention;
FIG. 2 is a flowchart illustrating a method for processing web page data according to some embodiments of the invention;
FIG. 3 is a flowchart illustrating a method for processing web page data according to still other embodiments of the present invention;
FIG. 4 is a flowchart illustrating a method for processing web page data according to another embodiment of the present invention;
FIG. 5 is a diagram illustrating 4 sets of character conversion relationships according to some embodiments of the invention;
FIG. 6 is a block diagram illustrating a first embodiment of a web page data processing apparatus according to some embodiments of the present invention;
FIG. 7 is a schematic block diagram of a conversion module provided by some embodiments of the present invention;
FIG. 8 is a schematic block diagram of a second embodiment of a web page data processing apparatus according to some embodiments of the present invention;
FIG. 9 is a schematic block diagram of an electronic device provided by some embodiments of the invention.
With the foregoing drawings in mind, certain embodiments of the disclosure have been shown and described in more detail below. These drawings and written description are not intended to limit the scope of the disclosed concepts in any way, but rather to illustrate the concepts of the disclosure to those skilled in the art by reference to specific embodiments.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
First, the terms related to the present invention are explained:
Web crawler: a program or script that automatically captures web information according to certain rules.
Font-face: a method of defining custom fonts with CSS in web page development.
Node.js: a JavaScript runtime environment based on the Chrome V8 engine.
Character conversion relationship: a rule for converting between characters in at least two character sets. Through a character conversion relationship, characters in one character set can be converted into characters in another character set, for example converting character 1 into 0xefab and character 2 into 0xeba2.
Character mapping relationship: the mapping between characters in two character sets, for example 0xefab is set to map to character 1, and 0xeba2 is set to map to character 2.
Modulo: a calculation that yields the remainder of dividing one number by another.
Front end: runs on the browser side and renders page content for display.
Back end: runs on the server side, is responsible for data processing, and passes the data to the front end to be displayed as a page.
The technical concept of the present invention will be explained below.
At present, anti-crawler technologies exist on both the server side and the client side. For example, in one technical solution, the server determines whether a request comes from a crawler program by monitoring the User-Agent field in the header of the network request. In another technical solution, at the client, part of the data content in the web page is replaced with pictures, or a custom font is used so that the data in the web page source code differs from the data displayed on the page.
However, in the first technical solution, the server must monitor whether network requests include crawler requests, and this check is easily bypassed, causing the anti-crawler measure to fail. In the second technical solution, replacing content with pictures hampers page layout, the complexity of front-end page development increases when there are many types of page data content, and a simple font replacement rule can be cracked, again causing the anti-crawler measure to fail.
Based on the above, the basic idea of the invention is as follows: one or more character mapping relationships are configured in advance; the original web page data is converted according to the character mapping relationships at the server side and/or the client side to generate intermediate web page data; and the intermediate web page data is converted back into the original web page data according to the character mapping relationships at the browser side and displayed. According to the technical solution of the embodiments of the present invention, on the one hand, because the data held at the server side and/or the client side is not the original web page data, a web crawler cannot capture the original web page data, so crawling is prevented simply and effectively. On the other hand, because the browser converts the intermediate web page data back into the original web page data before display, the page displays normally, the complexity of front-end page development does not increase, and no monitoring is needed on the server, which reduces development and operating costs.
Fig. 1 is a schematic block diagram of an application scenario of a web page data processing method according to some embodiments of the present invention. Referring to Fig. 1, the application scenario includes a server 110 and a client 120. The client 120 is provided with a database 122 in which a plurality of character conversion relationships are configured in advance; the character conversion relationships are used to convert characters in web page data. In some example embodiments, the client 120 sends a web page acquisition request to the server 110. After receiving the original web page data returned by the server 110, the client 120 converts it according to a preconfigured character conversion relationship to generate intermediate web page data. When displaying the page, the client 120 converts the intermediate web page data back into the original web page data according to the corresponding character conversion relationship, and renders and displays the original web page data on the web page.
In other embodiments, a plurality of character conversion relationships are also preconfigured in the server 110. The client 120 sends a web page acquisition request to the server 110. After receiving the request, the server 110 obtains the corresponding original web page data, converts it according to a preconfigured character conversion relationship to generate first intermediate web page data, and returns the first intermediate web page data to the client 120. After receiving the first intermediate web page data, the client 120 converts it according to a preconfigured character conversion relationship to generate second intermediate web page data. When displaying the page, the client 120 converts the second intermediate web page data into the original web page data according to a preconfigured character conversion relationship, and renders and displays the original web page data on the web page.
It should be noted that the server 110 may be a physical server including an independent host, or a virtual server carried by a host cluster, or a cloud server. The client 120 may be a desktop computer, a portable notebook computer, a tablet computer, a mobile phone, or a suitable computer terminal, and the invention is not limited thereto.
A web page data processing method according to an exemplary embodiment of the present invention is described below with reference to the accompanying drawings in conjunction with an application scenario of fig. 1. It should be noted that the above application scenarios are merely illustrative for the convenience of understanding the spirit and principles of the present invention, and the embodiments of the present invention are not limited in this respect. Rather, embodiments of the present invention may be applied to any scenario where applicable.
Fig. 2 is a flowchart illustrating a web page data processing method according to some embodiments of the present invention. The web page data processing method can be applied to the client 120 in fig. 1, and the web page data processing method in the example embodiment is described in detail below with reference to the accompanying drawings.
Referring to fig. 2, in step S210, first intermediate web page data sent by the server is received, where the first intermediate web page data is generated after the server converts characters in the original web page data according to a first character conversion relationship.
In an example embodiment, a client sends a web page acquisition request to a server. After receiving the request, the server looks up the original web page file corresponding to the request, converts the characters in the original web page data according to a first character conversion relationship, and generates first intermediate web page data.
It should be noted that the first character conversion relationship is a mapping relationship between characters of at least two character sets; for example, character 1 may be mapped to 0xefab, character 2 to 0xeba2, and so on.
Further, according to the character conversion relationship, the server converts character 1 in the original web page data into 0xefab, converts character 2 into 0xeba2, and converts other characters into their corresponding character strings, thereby generating the first intermediate web page data, which it sends to the client. The client receives the first intermediate web page data sent by the server.
In step S220, the first intermediate web page data is converted according to the second character conversion relationship to generate original web page data, and the original web page data is rendered on the web page.
In an example embodiment, the second character conversion relationship is a mapping relationship between characters of at least two character sets. For example, the second character conversion relationship is the reverse of the first character conversion relationship: if the first character conversion relationship includes a first character mapping relationship and the second character conversion relationship includes a second character mapping relationship, the second character mapping relationship is the inverse mapping of the first character mapping relationship. For example, if the first character conversion relationship maps character 1 to 0xefab and character 2 to 0xeba2, and likewise for other characters, then the second character conversion relationship maps 0xefab to character 1 and 0xeba2 to character 2.
Further, in an example embodiment, the browser on the client converts the first intermediate web page data according to the second character conversion relationship, generates original web page data, and renders the original web page data onto the web page.
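The round trip of steps S210 and S220 can be sketched as below. This is a simplified illustration under stated assumptions, not the patented implementation: the helper names `invert` and `convert`, the sample string, and the decision to leave unlisted characters unchanged are all hypothetical; only the pairs 1 → 0xefab and 2 → 0xeba2 come from the text.

```python
from typing import Dict

# First character conversion relationship (server side), using the example
# pairs from the text: "1" -> U+EFAB, "2" -> U+EBA2. Leaving every other
# character unchanged is a simplification made for this sketch.
FIRST_RELATION: Dict[str, str] = {"1": "\uefab", "2": "\ueba2"}


def invert(relation: Dict[str, str]) -> Dict[str, str]:
    """Build the reverse relationship; fail if two sources share a target."""
    inverse = {target: source for source, target in relation.items()}
    if len(inverse) != len(relation):
        raise ValueError("relation is not one-to-one and cannot be inverted")
    return inverse


def convert(text: str, relation: Dict[str, str]) -> str:
    """Replace, in one pass, each character that has an entry in the relation."""
    return "".join(relation.get(ch, ch) for ch in text)


# Second character conversion relationship (client side): the inverse mapping.
SECOND_RELATION = invert(FIRST_RELATION)

original = "Price: 121"
first_intermediate = convert(original, FIRST_RELATION)   # sent by the server
restored = convert(first_intermediate, SECOND_RELATION)  # rendered by the browser
```

The one-to-one check matters because a relationship that sends two different characters to the same target could not be reversed unambiguously on the client.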
According to the technical solution in the example embodiment of Fig. 2, on the one hand, one or more character mapping relationships are configured in advance, and the server converts the original web page data according to the character mapping relationships to generate intermediate web page data, which it returns to the client. Because the data returned by the server to the client is obfuscated, a web crawler cannot capture the original web page data, so crawling is prevented simply and effectively. On the other hand, the browser converts the intermediate web page data back into the original web page data according to the character mapping relationships and displays it, so the page displays normally, the complexity of front-end page development does not increase, and no monitoring is needed on the server, which reduces development and operating costs.
Further, in an example embodiment, the second character conversion relationship may be a font file containing a character mapping relationship table, and converting the second intermediate web page data into the original web page data according to the second character conversion relationship includes: and converting the characters in the second intermediate webpage data into characters corresponding to the original webpage data according to the character mapping relation table in the font file.
For example, the font file may be a Font-face font file. A character set is defined in the Font-face font file, with its characters encoded in Unicode, and a character mapping table can be defined over that character set; for example, 0xefab is set to map to character 1, 0xeba2 is set to map to character 2, and likewise for other characters. When character 1 is to be displayed on the web page, the page source contains 0xefab, so the source collected by a web crawler is also 0xefab rather than character 1. The web page nevertheless displays the original character 1, because the browser loads the font file referenced by the CSS to perform the mapping and rendering. Normal users are therefore unaffected.
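To make the effect concrete, the sketch below models the font file's character mapping table as a plain dictionary from Unicode code point to displayed character. It does not parse or generate a real font; `FONT_TABLE`, `source_view`, and `rendered_view` are hypothetical names, and only the two example code points from the text are included. Both code points fall in the Unicode Private Use Area (U+E000 to U+F8FF), so they carry no standard meaning without the font.

```python
# Simplified stand-in for the mapping table inside the font file:
# code point found in the page source -> character whose glyph is drawn.
FONT_TABLE = {0xEFAB: "1", 0xEBA2: "2"}


def source_view(page_text):
    """What a crawler reading the page source sees: raw code points as hex."""
    return [hex(ord(ch)) for ch in page_text]


def rendered_view(page_text):
    """What a user sees once the browser applies the font's mapping table."""
    return "".join(FONT_TABLE.get(ord(ch), ch) for ch in page_text)


page_text = chr(0xEFAB) + chr(0xEBA2)
print(source_view(page_text))    # ['0xefab', '0xeba2']
print(rendered_view(page_text))  # 12
```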
Fig. 3 is a flowchart illustrating a web page data processing method according to still other embodiments of the present invention.
Referring to fig. 3, in step S310, first intermediate web page data sent by the server is received, where the first intermediate web page data is generated by the server converting characters in original web page data according to a first character conversion relationship.
In an exemplary embodiment, a processing interface, for example, a Node layer interface, is provided at the server side, and the Node layer is an execution environment in which JavaScript is executed at the server side. The Node layer interface is used for receiving a webpage acquisition request sent by a client, acquiring corresponding original webpage data according to the webpage acquisition request, converting characters in the original webpage data according to a first character conversion relation, generating first intermediate webpage data, and returning the generated first intermediate webpage data to the client. The client receives the first intermediate webpage data returned by the server through the Node layer interface.
For example, if the original web page data includes characters 2 and 4, and in the first character conversion relationship, the characters 2 and 4 in the original web page data are mapped to characters 5 and 7, i.e., 2 → 5, 4 → 7, respectively, then in the first intermediate web page data, the characters 5 and 7 represent the characters 2 and 4 in the original web page data, respectively.
It should be noted that the processing interface at the server end may also be an interface implemented by other programming languages, for example, an interface implemented by a server end language such as PHP, Python, Perl, Ruby, and the like, which is not particularly limited in the present invention.
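A server-side processing interface of this kind might look like the sketch below, written in Python since the text allows server-side languages other than Node.js. It is illustrative only: `ORIGINAL_PAGES`, `handle_page_request`, and the `/price` path are hypothetical, there is no real network layer, and the table holds just the two example pairs 2 → 5 and 4 → 7.

```python
# First character conversion relationship from the example: 2 -> 5, 4 -> 7.
FIRST_RELATION = str.maketrans({"2": "5", "4": "7"})

# Hypothetical store of original web page data, keyed by request path.
ORIGINAL_PAGES = {"/price": "24"}


def handle_page_request(path: str) -> str:
    """Look up the original data and return the first intermediate data."""
    original = ORIGINAL_PAGES[path]
    # translate() substitutes every character in a single pass.
    return original.translate(FIRST_RELATION)


response_body = handle_page_request("/price")
```

With only two pairs in the table, an original page that already contained 5 or 7 would be ambiguous after conversion; a complete relationship would presumably need to be one-to-one over every character it may encounter.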
In step S320, the first intermediate webpage data is converted according to a second character conversion relationship to generate original webpage data, and the original webpage data is rendered on the webpage, where the second character conversion relationship includes a first character mapping relationship and a second character mapping relationship.
In an example embodiment, the first intermediate web page data is converted according to the first character mapping relationship to generate second intermediate web page data, where the second intermediate web page data can be converted into the original web page data through the second character mapping relationship. For example, suppose that in the first intermediate web page data the characters 5 and 7 represent the characters 2 and 4 of the original web page data, respectively. In the first character mapping relationship, the characters 5 and 7 in the first intermediate web page data are mapped to the characters 7 and 3 in the second intermediate web page data, respectively, i.e., 5 → 7 and 7 → 3. In the second character mapping relationship, the characters 7 and 3 in the second intermediate web page data are mapped to the characters 2 and 4 in the original web page data, respectively, i.e., 7 → 2 and 3 → 4. Other characters are converted on the same principle, so the second intermediate web page data can be converted into the original web page data through the second character mapping relationship.
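The chain of conversions in this example can be traced with a short sketch. The three tables contain only the example pairs from the text (characters 2 and 4 become 5 and 7, then 7 and 3, then 2 and 4 again); the `convert` helper and the variable names are hypothetical, and a full table would cover every character.

```python
# Example pairs only; each stage is a separate lookup table.
FIRST_RELATION = {"2": "5", "4": "7"}  # server: original -> first intermediate
FIRST_MAPPING = {"5": "7", "7": "3"}   # client: first -> second intermediate
SECOND_MAPPING = {"7": "2", "3": "4"}  # client/font: second intermediate -> original


def convert(text, table):
    """Look up each character once; results are never fed back into the table."""
    return "".join(table.get(ch, ch) for ch in text)


original = "24"
first_intermediate = convert(original, FIRST_RELATION)            # "57"
second_intermediate = convert(first_intermediate, FIRST_MAPPING)  # "73"
restored = convert(second_intermediate, SECOND_MAPPING)           # "24"
```

The single-pass lookup is a deliberate choice: applying the pairs as sequential string replacements would turn 5 into 7 and then that same 7 into 3, corrupting the data.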
Further, the browser of the client converts the second intermediate webpage data into the original webpage data according to the second character mapping relationship and renders the original webpage data onto the webpage. For example, the second character mapping relationship is a font file containing a character mapping relationship table, and converting the second intermediate webpage data into the original webpage data according to the second character mapping relationship includes converting the characters in the second intermediate webpage data into the corresponding characters of the original webpage data according to the character mapping relationship table in the font file.
For example, the font file may be a Font-face font file, that is, a font loaded through the CSS @font-face rule. A character set is defined in the font file and its characters are encoded in Unicode, so a character mapping relationship table can be defined over that character set; for example, 0xefab is mapped to the character 1, 0xeba2 is mapped to the character 2, and the other characters are handled in the same way. When the character 1 is to be displayed on the webpage, the page source contains 0xefab, so the source collected by a web crawler is also 0xefab rather than the character 1. Because the browser loads the font file referenced by the CSS to perform the mapping and rendering, the original character 1 is still displayed on the webpage, and normal users are not affected.
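To make the font-based step concrete, the sketch below simulates it in Python. It is only an illustration under stated assumptions: in a real page the lookup is performed by the browser's font rendering, not by application code, and the names `GLYPH_TABLE`, `to_page_source`, and `as_rendered` are hypothetical. The two code points are the ones from the example; both lie in Unicode's Private Use Area (U+E000 to U+F8FF), so they have no standard meaning outside the custom font.

```python
# Code point table from the example in the text; a real font would define
# one glyph for every protected character.
GLYPH_TABLE = {0xEFAB: "1", 0xEBA2: "2"}
ENCODE = {shown: chr(code_point) for code_point, shown in GLYPH_TABLE.items()}


def to_page_source(text: str) -> str:
    """Replace protected characters with their private-use code points."""
    return "".join(ENCODE.get(ch, ch) for ch in text)


def as_rendered(source: str) -> str:
    """Simulate what a reader sees once the custom font is applied."""
    return "".join(GLYPH_TABLE.get(ord(ch), ch) for ch in source)


source = to_page_source("12")
print([hex(ord(ch)) for ch in source])  # what a crawler would collect
print(as_rendered(source))              # what the user sees
```

A crawler that reads only the page source obtains the private-use code points; recovering the digits would require it to also fetch and interpret the font file.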
According to the technical solution in the example embodiment of fig. 3, on the one hand, one or more character mapping relationships are configured in advance, and the original webpage data is converted according to these relationships at the server side and/or the client side to generate intermediate webpage data. Because the data at the server side and/or the client side is obfuscated, a web crawler cannot capture the original webpage data, so crawling of the webpage data is prevented simply and effectively. On the other hand, the intermediate webpage data is converted back into the original webpage data according to the character mapping relationship and displayed at the browser side, so the page is displayed normally, the complexity of front-end page development is not increased, and no monitoring is needed at the server side, which reduces development and operating costs.
Fig. 4 is a flowchart illustrating a web page data processing method according to another embodiment of the present invention.
Referring to fig. 4, in step S410, a set of character conversion relationships is selected from a predetermined number of sets of character conversion relationships, wherein each set of character conversion relationships includes at least two character conversion relationships, and the character conversion relationships are used for converting characters in web page data.
In some example embodiments, a set of character conversion relationships is randomly selected from a predetermined number of sets of character conversion relationships. For example, a set of character conversion relationships may be randomly selected from a predetermined number of sets of character conversion relationships by a random operation.
In other example embodiments, a set of character conversion relationships is selected from the predetermined number of sets of character conversion relationships based on the current date. For example, the current date is taken modulo the predetermined number to obtain a remainder, and a set of character conversion relationships is selected from the predetermined number of sets according to the value of the remainder. For example, if the predetermined number is 4, the current date is taken modulo 4, which yields one of four possible remainders (0, 1, 2, or 3), and one of the 4 sets of character conversion relationships is selected according to that remainder. FIG. 5 is a diagram illustrating 4 sets of character conversion relationships according to some embodiments of the invention.
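The date-based selection can be sketched as follows. The text does not say how the date is turned into a number, so reducing it to its ordinal day count is an assumption, as are the names `RULE_SET_COUNT` and `select_rule_index`. Python's `%` operator returns a remainder from 0 to count − 1, so the sketch adds 1 to obtain a rule number from 1 to 4; that offset is also an assumption.

```python
from datetime import date

RULE_SET_COUNT = 4  # the predetermined number used in the example


def select_rule_index(today: date, count: int = RULE_SET_COUNT) -> int:
    """Return a 1-based rule-set number derived from the date."""
    # Assumption: the date is reduced to its proleptic Gregorian ordinal.
    remainder = today.toordinal() % count
    return remainder + 1


print(select_rule_index(date(2020, 6, 19)))
```

If the server and the client each derive the index this way, they would need to agree on the calendar date, for example by using the same time zone; the text does not describe how that is handled.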
Further, each set of character conversion relationships includes at least two character conversion relationships. Referring to rule 1 of fig. 5, the set of character conversion relationships includes three character conversion relationships, which are respectively a character conversion relationship configured at the server side, a character conversion relationship configured at the front end, and a character conversion relationship configured at the web page side.
In step S420, first intermediate web page data sent by the server is received, where the first intermediate web page data is generated after the server converts characters in the original web page data according to the first character conversion relationship.
In an exemplary embodiment, a processing interface, for example, a Node layer interface, is provided at the server side, and the Node layer is an execution environment in which JavaScript is executed at the server side. The Node layer interface is used for receiving a webpage acquisition request sent by a client, acquiring corresponding original webpage data according to the webpage acquisition request, converting characters in the original webpage data according to a first character conversion relation, generating first intermediate webpage data, and returning the generated first intermediate webpage data to the client. The client receives the first intermediate webpage data returned by the server through the Node layer interface.
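The request flow described above can be sketched independently of Node. The example below uses Python purely to illustrate the data flow; the in-memory `PAGES` store, the path `/price`, and the function `handle_page_request` are hypothetical, and only the character pairs 2 → 5 and 4 → 7 from the worked example are mapped.

```python
# Hypothetical in-memory page store standing in for the real page source.
PAGES = {"/price": "<span>Price: 24</span>"}

# Partial first character conversion relationship from the worked example.
FIRST_CONVERSION = str.maketrans({"2": "5", "4": "7"})


def handle_page_request(path: str) -> str:
    """Look up the original page data and return its obfuscated form."""
    original = PAGES[path]
    return original.translate(FIRST_CONVERSION)


print(handle_page_request("/price"))  # prints "<span>Price: 57</span>"
```

This sketch converts the whole string, markup included. A practical implementation would presumably restrict the conversion to text content, since tag or attribute names that contain mapped characters would otherwise be corrupted.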
Referring to relation 1 of fig. 5, assuming that the original web page data includes characters 2 and 4, in the first character conversion relation, characters 2 and 4 in the original web page data are mapped to characters 5 and 7, i.e., 2 → 5, 4 → 7, respectively, and then characters 5 and 7 in the first intermediate web page data represent characters 2 and 4 in the original web page data, respectively.
In step S430, the first intermediate web page data is converted according to the second character conversion relationship to generate second intermediate web page data, wherein the second intermediate web page data can be converted into the original web page data through the third character conversion relationship.
In an example embodiment, the first intermediate web page data is converted according to a second character conversion relationship to generate second intermediate web page data, wherein the second intermediate web page data can be converted into the original web page data through a third character conversion relationship. Referring to rule 1 of fig. 5, assume that in the first intermediate web page data, the characters 5 and 7 represent the characters 2 and 4 in the original web page data, respectively. In the second character conversion relationship, the characters 5 and 7 in the first intermediate web page data are mapped to the characters 7 and 3 in the second intermediate web page data, respectively, i.e., 5 → 7 and 7 → 3. In the third character conversion relationship, the characters 7 and 3 in the second intermediate web page data are mapped to the characters 2 and 4 in the original web page data, respectively, i.e., 7 → 2 and 3 → 4. The other characters are converted on the same principle, so the second intermediate web page data can be converted into the original web page data through the third character conversion relationship.
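For the third conversion to restore the original data, it has to undo the first two conversions applied in sequence. The sketch below derives such a table in Python. It assumes that each conversion is a one-to-one mapping over the digits; the two full tables are hypothetical and agree with the text only on the stated pairs (2 → 5, 4 → 7 and 5 → 7, 7 → 3), and `derive_third` is an illustrative helper.

```python
DIGITS = "0123456789"

# Hypothetical full permutations that agree with the pairs in the text;
# every other pair is made up for illustration.
FIRST = str.maketrans(DIGITS, "8150732694")   # server side
SECOND = str.maketrans(DIGITS, "4209178365")  # front end


def derive_third(first: dict, second: dict) -> dict:
    """Build the table that undoes `first` followed by `second`."""
    return {second[first[ord(ch)]]: ord(ch) for ch in DIGITS}


THIRD = derive_third(FIRST, SECOND)

original = "2024"
second_intermediate = original.translate(FIRST).translate(SECOND)
print(second_intermediate, second_intermediate.translate(THIRD))  # prints "7673 2024"
```

The derived table contains 7 → 2 and 3 → 4, matching the third character conversion relationship in the example.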
In step S440, the second intermediate web page data is converted into the original web page data according to the third character conversion relationship, and the original web page data is rendered onto the web page.
In an example embodiment, the browser of the client converts the second intermediate web page data into the original web page data according to the third character conversion relationship and renders the original web page data onto the web page. For example, the third character conversion relationship may be a font file containing a character mapping relationship table, and converting the second intermediate web page data into the original web page data according to the third character conversion relationship includes converting the characters in the second intermediate web page data into the corresponding characters of the original web page data according to the character mapping relationship table in the font file.
For example, the font file may be a Font-face font file, that is, a font loaded through the CSS @font-face rule. A character set is defined in the font file and its characters are encoded in Unicode, so a character mapping relationship table can be defined over that character set; for example, 0xefab is mapped to the character 1, 0xeba2 is mapped to the character 2, and the other characters are handled in the same way. When the character 1 is to be displayed on the webpage, the page source contains 0xefab, so the source collected by a web crawler is also 0xefab rather than the character 1. Because the browser loads the font file referenced by the CSS to perform the mapping and rendering, the original character 1 is still displayed on the webpage, and normal users are not affected.
The conversion process of the other character conversion relationships is similar to the conversion process from step S420 to step S440, and is not described herein again.
According to the technical solution in the example embodiment of fig. 4, on the one hand, because the data at the server side and/or the client side is obfuscated, a web crawler cannot capture the original webpage data, so crawling of the webpage data is prevented simply and effectively. On the other hand, because the character conversion relationship is adjusted dynamically, the obfuscation rules applied to the webpage data that a crawler captures differ from day to day, which further prevents crawling and improves the security and confidentiality of the webpage data.
Fig. 6 is a schematic block diagram of a first embodiment of a web page data processing apparatus according to some embodiments of the present invention. Referring to fig. 6, the web page data processing apparatus 600 includes:
a receiving module 610, configured to receive first intermediate web page data sent by a server, where the first intermediate web page data is generated by converting characters in original web page data by the server according to a first character conversion relationship;
a conversion module 620, configured to convert the first intermediate web page data according to a second character conversion relationship, generate the original web page data, and render the original web page data to a web page, where the first character conversion relationship and the second character conversion relationship are both relationships for converting between characters of at least two character sets.
According to the technical solution in the example embodiment of fig. 6, on the one hand, one or more character mapping relationships are configured in advance, the original webpage data is converted at the server side according to these relationships, and the resulting intermediate webpage data is returned to the client side. Because the data that the server side returns to the client side is not the original webpage data, a web crawler cannot capture the original webpage data, so crawling of the webpage data is prevented simply and effectively. On the other hand, the intermediate webpage data is converted back into the original webpage data according to the character mapping relationship and displayed at the browser side, so the page is displayed normally, the complexity of front-end page development is not increased, and no monitoring is needed at the server side, which reduces development and operating costs.
Fig. 7 is a schematic block diagram of a conversion module provided by some embodiments of the present invention. Referring to fig. 7, in some example embodiments, the second character conversion relationship includes a first character mapping relationship and a second character mapping relationship, and the conversion module 620 includes:
a first conversion unit 710, configured to convert the first intermediate web page data according to the first character mapping relationship to generate second intermediate web page data, where the second intermediate web page data can be converted into the original web page data through the second character mapping relationship;
a second converting unit 720, configured to convert the second intermediate web page data into the original web page data according to the second character mapping relationship, and render the original web page data to a web page.
In some example embodiments, the second character mapping relationship is a font file containing a character mapping relationship table, and the second converting unit 720 is further specifically configured to:
and converting the characters in the second intermediate webpage data into characters corresponding to the original webpage data according to the character mapping relation table in the font file.
In some example embodiments, the first character conversion relationship comprises a third character mapping relationship, the second character conversion relationship comprises a fourth character mapping relationship, and the third character mapping relationship is an inverse mapping of the fourth character mapping relationship.
Fig. 8 is a schematic block diagram of a second embodiment of a web page data processing apparatus according to some embodiments of the present invention. Referring to fig. 8, in some example embodiments, the first character conversion relationship and the second character conversion relationship belong to one of a predetermined number of sets of character conversion relationships, and the apparatus 600 further comprises:
a rule selecting module 810, configured to select a set of character conversion relationships from the predetermined number of sets of character conversion relationships, where each set of character conversion relationships includes at least two character conversion relationships, and the character conversion relationships are used to convert characters in the web page data.
In some example embodiments, the rule selection module 810 is further specifically configured to:
taking the current date modulo the predetermined number to obtain a corresponding remainder;
and selecting a set of character conversion relationships from the predetermined number of sets of character conversion relationships according to the value of the remainder.
In some example embodiments, the rule selection module 810 is further specifically configured to:
randomly selecting a set of character conversion relationships from the predetermined number of sets of character conversion relationships.
The webpage data processing device provided by the embodiment of the invention can realize each process in the method embodiment and achieve the same function and effect, and the process is not repeated.
In addition, an embodiment of the present application further provides an electronic device, which is configured to execute the webpage data processing method described in the foregoing embodiment. Fig. 9 is a schematic block diagram of an electronic device provided by some embodiments of the invention. As shown in fig. 9, the electronic device 900 includes: at least one processor 902, memory 904, bus 906, and communication interface 908.
Wherein: the processor 902, communication interface 908, and memory 904 communicate with one another via a bus 906.
The communication interface 908 is configured to communicate with other devices.
The processor 902 is configured to execute the program 910, and may specifically execute the relevant steps in the method described in the foregoing embodiment. For example, the processor 902 may perform the following steps: S210, receiving first intermediate webpage data sent by a server, wherein the first intermediate webpage data is generated after the server converts characters in original webpage data according to a first character conversion relationship; and S220, converting the first intermediate webpage data according to a second character conversion relationship to generate the original webpage data, and rendering the original webpage data to a webpage.
In particular, the program 910 may include program code that includes computer operating instructions.
The processor 902 may be a central processing unit, or an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement an embodiment of the present invention. The electronic device comprises one or more processors, which can be the same type of processor, such as one or more CPUs; or may be different types of processors such as one or more CPUs and one or more ASICs.
The memory 904 is configured to store the program 910. The memory 904 may comprise high-speed RAM, and may also include non-volatile memory, such as at least one disk memory.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions is also provided, such as the memory 904 comprising instructions, and the instructions are executable by the processor 902 of the electronic device 900 to perform the above-described method. For example, the non-transitory computer-readable storage medium may be a read-only memory (ROM), a random-access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (12)

1. A method for processing web page data, comprising:
receiving first intermediate webpage data sent by a server, wherein the first intermediate webpage data are generated after the server converts characters in original webpage data according to a first character conversion relation;
converting the first intermediate webpage data according to a second character conversion relation to generate original webpage data, rendering the original webpage data to a webpage,
wherein the first character conversion relationship and the second character conversion relationship are both relationships for converting between characters of at least two character sets.
2. The method of claim 1, wherein the second character conversion relationship comprises a first character mapping relationship and a second character mapping relationship, and wherein converting the first intermediate web page data according to the second character conversion relationship comprises:
converting the first intermediate webpage data according to the first character mapping relationship to generate second intermediate webpage data, wherein the second intermediate webpage data can be converted into the original webpage data through the second character mapping relationship;
and converting the second intermediate webpage data into the original webpage data according to the second character mapping relation, and rendering the original webpage data to a webpage.
3. The method of claim 2, wherein the second character mapping relationship is a font file containing a character mapping relationship table, and the converting the second intermediate web page data into the original web page data according to the second character mapping relationship comprises:
and converting the characters in the second intermediate webpage data into characters corresponding to the original webpage data according to the character mapping relation table in the font file.
4. The method of claim 1, wherein the first character conversion relationship comprises a third character mapping relationship, wherein the second character conversion relationship comprises a fourth character mapping relationship, and wherein the third character mapping relationship is an inverse mapping of the fourth character mapping relationship.
5. The method of any of claims 1-4, wherein the first character conversion relationship and the second character conversion relationship belong to one of a predetermined number of sets of character conversion relationships, the method further comprising:
and selecting a set of character conversion relations from the preset number of sets of character conversion relations, wherein each set of character conversion relations comprises at least two character conversion relations, and the character conversion relations are used for converting characters in the webpage data.
6. The method of claim 5, wherein selecting a set of character conversion relationships from the predetermined number of sets of character conversion relationships comprises:
taking the current date modulo the predetermined number to obtain a corresponding remainder;
and selecting a set of character conversion relationships from the predetermined number of sets of character conversion relationships according to the value of the remainder.
7. The method of claim 5, wherein selecting a set of character conversion relationships from the predetermined number of sets of character conversion relationships comprises:
randomly selecting a set of character conversion relationships from the predetermined number of sets of character conversion relationships.
8. A web page data processing apparatus characterized by comprising:
the receiving module is used for receiving first intermediate webpage data sent by a server side, wherein the first intermediate webpage data are generated after the server side converts characters in original webpage data according to a first character conversion relation;
a conversion module for converting the first intermediate web page data according to a second character conversion relationship to generate the original web page data and rendering the original web page data to a web page,
wherein the first character conversion relationship and the second character conversion relationship are both relationships for converting between characters of at least two character sets.
9. The apparatus of claim 8, wherein the second character conversion relationship comprises a first character mapping relationship and a second character mapping relationship, and wherein the conversion module comprises:
the first conversion unit is used for converting the first intermediate webpage data according to the first character mapping relationship to generate second intermediate webpage data, wherein the second intermediate webpage data can be converted into the original webpage data through the second character mapping relationship;
and the second conversion unit is used for converting the second intermediate webpage data into the original webpage data according to the second character mapping relation and rendering the original webpage data to a webpage.
10. The apparatus according to claim 9, wherein the second character mapping relationship is a font file containing a character mapping relationship table, and the second conversion unit is further configured to:
and converting the characters in the second intermediate webpage data into characters corresponding to the original webpage data according to the character mapping relation table in the font file.
11. An electronic device, comprising: a memory, a processor;
a memory for storing the processor-executable instructions;
the processor is configured to implement the web page data processing method of any one of claims 1 to 7.
12. A computer-readable storage medium having stored therein computer-executable instructions for implementing the web page data processing method according to any one of claims 1 to 7 when executed by a processor.
CN202010567841.6A 2020-06-19 2020-06-19 Webpage data processing method, device, equipment and storage medium Active CN111723263B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010567841.6A CN111723263B (en) 2020-06-19 2020-06-19 Webpage data processing method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111723263A true CN111723263A (en) 2020-09-29
CN111723263B CN111723263B (en) 2024-04-05

Family

ID=72568159

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010567841.6A Active CN111723263B (en) 2020-06-19 2020-06-19 Webpage data processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111723263B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050256924A1 (en) * 2004-05-14 2005-11-17 Microsoft Corporation Systems and methods for persisting data between web pages
US20130073745A1 (en) * 2011-09-19 2013-03-21 Google Inc. Context-Specific Unicode Characters In Shortened URLs
CN109543454A (en) * 2019-01-25 2019-03-29 腾讯科技(深圳)有限公司 A kind of anti-crawler method and relevant device
CN110069688A (en) * 2019-03-16 2019-07-30 平安城市建设科技(深圳)有限公司 Page display method, server, storage medium and the device of anti-crawler
CN110083751A (en) * 2019-03-18 2019-08-02 平安科技(深圳)有限公司 The anti-crawler grasping means of web data and device, storage medium, electronic equipment
CN110166465A (en) * 2019-05-27 2019-08-23 北京达佳互联信息技术有限公司 Processing method, device, server and the storage medium of access request
CN110990746A (en) * 2019-12-06 2020-04-10 北京同邦卓益科技有限公司 Page loading method, device, system, storage medium and electronic equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
BART KAHLER et al.: "Language translation of web-based content", 2012 IEEE NATIONAL AEROSPACE AND ELECTRONICS CONFERENCE (NAECON), pages 40 - 45 *
Wen Yana et al.: "Research on web page parsing and data acquisition based on Python crawler technology", Modern Information Technology, no. 01, pages 20 - 21 *

Also Published As

Publication number Publication date
CN111723263B (en) 2024-04-05

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant