CN110020343B - Method and device for determining webpage coding format - Google Patents


Info

Publication number
CN110020343B
Authority
CN
China
Prior art keywords
format
determining
target webpage
character string
url
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710784883.3A
Other languages
Chinese (zh)
Other versions
CN110020343A (en
Inventor
张野
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd filed Critical Beijing Gridsum Technology Co Ltd
Priority to CN201710784883.3A priority Critical patent/CN110020343B/en
Publication of CN110020343A publication Critical patent/CN110020343A/en
Application granted granted Critical
Publication of CN110020343B publication Critical patent/CN110020343B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/955Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/958Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Abstract

The application discloses a method and a device for determining a webpage encoding format. The method comprises the following steps: acquiring a Uniform Resource Locator (URL), wherein the webpage corresponding to the URL is a target webpage; determining the encoding format of the target webpage according to the URL and preset field content; determining the encoding format of the target webpage according to the URL and a character string conversion mode; judging whether the encoding format determined according to the URL and the preset field content is the same as the encoding format determined according to the URL and the character string conversion mode; and determining the encoding format of the target webpage according to the judgment result. The method and the device solve the technical problem in the related art of low efficiency in determining the encoding format of a webpage.

Description

Method and device for determining webpage coding format
Technical Field
The application relates to the technical field of web pages, in particular to a method and a device for determining a web page coding format.
Background
In the related art, the encoding format of a web page is generally judged by clicking a plug-in in the web page with a mouse, choosing through the plug-in to view the code of the web page, and then carefully reading that code to determine its encoding format. However, this approach requires the user to view the web page code line by line, which takes a long time and is inefficient.
No effective solution has yet been proposed for the problem of low efficiency in determining the encoding format of a webpage in the related art.
Disclosure of Invention
The main objective of the present application is to provide a method for determining a webpage encoding format, so as to solve the problem of low efficiency in determining the encoding format of a webpage in the related art.
In order to achieve the above object, according to one aspect of the present application, a method for determining a web page encoding format is provided. The method comprises the following steps: acquiring a Uniform Resource Locator (URL), wherein a webpage corresponding to the URL is a target webpage; determining the encoding format of the target webpage according to the URL and preset field content; determining the coding format of the target webpage according to the URL and the character string conversion mode; judging whether the encoding format of the target webpage determined according to the URL and preset field content is the same as the encoding format of the target webpage determined according to the URL and the character string conversion mode; and determining the coding format of the target webpage according to the judgment result.
Further, determining the encoding format of the target webpage according to the URL and the character string conversion manner includes: converting the target webpage into a page in a character string format; converting the page in the character string format into a byte stream by adopting a first preset encoding format; converting the byte stream into a target character string by adopting a second preset encoding format; and judging the encoding format of the target webpage according to whether the target character string comprises characters of a preset format type.
Further, the characters of the preset format type are Chinese characters, and judging the encoding format of the target webpage according to whether the target character string includes characters of the preset format type includes: if the target character string includes Chinese characters, determining that the encoding format of the target webpage is UTF-8; and if the target character string does not include Chinese characters, determining that the encoding format of the target webpage is GBK or GB2312.
Further, determining the encoding format of the target webpage according to the judgment result includes: if the judgment result is that the two are the same, taking the encoding format determined according to the URL and the preset field content, or the encoding format determined according to the URL and the preset character string conversion mode, as the encoding format of the target webpage; and if the judgment result is that the two are different, taking the encoding format determined according to the URL and the preset character string conversion mode as the encoding format of the target webpage.
Further, determining the encoding format of the target webpage according to the URL and the preset field content includes: extracting a preset target character string in the preset field content; and determining the coding format of the target webpage according to the extracted preset target character string and the URL.
In order to achieve the above object, according to another aspect of the present application, a device for determining a webpage encoding format is provided. The device includes: an acquisition unit, configured to acquire a Uniform Resource Locator (URL), wherein the webpage corresponding to the URL is a target webpage; a first determining unit, configured to determine the encoding format of the target webpage according to the URL and preset field content; a second determining unit, configured to determine the encoding format of the target webpage according to the URL and a character string conversion mode; a judging unit, configured to judge whether the encoding format determined according to the URL and the preset field content is the same as the encoding format determined according to the URL and the character string conversion mode; and a third determining unit, configured to determine the encoding format of the target webpage according to the judgment result.
Further, the second determination unit includes: the conversion module is used for converting the target webpage into a page in a character string format; the first conversion module is used for converting the page in the character string format into a byte stream by adopting a first preset coding format; the second conversion module is used for converting the byte stream into a target character string by adopting a second preset coding format; and the judging module is used for judging the coding format of the target webpage according to whether the target character string comprises characters of a preset format type.
Further, the characters of the preset format type are Chinese characters, and the judging module includes: a first determining submodule, configured to determine that the encoding format of the target webpage is UTF-8 if the target character string includes Chinese characters; and a second determining submodule, configured to determine that the encoding format of the target webpage is GBK or GB2312 if the target character string does not include Chinese characters.
In order to achieve the above object, according to another aspect of the present application, there is provided a storage medium including a stored program, wherein the program performs the method for determining a web page encoding format according to any one of the above.
In order to achieve the above object, according to another aspect of the present application, there is provided a processor for running a program, wherein the program, when run, executes the method for determining a webpage encoding format according to any one of the above.
According to the method and the device, the coding format of the target webpage can be determined according to the URL and the preset field content, the coding format of the target webpage can also be determined according to the URL and the character string conversion mode, then whether the coding format of the target webpage determined according to the URL and the preset field content is the same as the coding format of the target webpage determined according to the URL and the character string conversion mode is judged, and the coding format of the target webpage is determined according to the judgment result. By the two methods for determining the coding formats, the coding format of the target webpage can be determined more accurately, the efficiency for determining the coding format of the target webpage can be improved correspondingly, and the problem of low efficiency in determining the coding format of the webpage in the related art is solved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application. In the drawings:
FIG. 1 is a flow chart of a method for determining a web page encoding format according to an embodiment of the present application; and
FIG. 2 is a schematic diagram of an apparatus for determining a web page encoding format according to an embodiment of the present application.
Detailed Description
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Some terms used in the embodiments of the present application are explained as follows:
URL (Uniform Resource Locator): a compact representation of the location and access method of a resource available on the internet; it is the address of a standard resource on the internet. Each file on the internet has a unique URL that contains information indicating the location of the file.
GB2312: the Chinese character coded character set for information interchange, suitable for information exchange between Chinese character processing systems and Chinese character communication systems.
UTF-8 (8-bit Unicode Transformation Format): a variable-length character encoding for Unicode. The original design allowed 1 to 6 bytes per character; the current standard (RFC 3629) uses 1 to 4 bytes. When UTF-8 is used on a webpage, simplified and traditional Chinese characters and other languages can be displayed on the same page in a unified way.
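As a small illustration of the variable-length property described above, the following Python sketch counts the bytes that UTF-8 and GB2312 use for a few sample characters. The helper name `utf8_length` and the sample characters are illustrative assumptions, not part of the application.

```python
# Sample characters: ASCII letter, accented Latin letter, Chinese character, emoji.
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]


def utf8_length(ch: str) -> int:
    """Return the number of bytes UTF-8 uses to encode one character."""
    return len(ch.encode("utf-8"))


for ch in samples:
    print(repr(ch), utf8_length(ch))

# GB2312 stores a Chinese character in exactly two bytes.
print(len("\u4e2d".encode("gb2312")))
```

The same Chinese character therefore produces different byte sequences under the two encodings, which is what the comparison described later relies on.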
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances such that embodiments of the application described herein may be used. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The following embodiments may be applied to determining the encoding format of a web page. The encoding format is generally the encoding adopted when the web page or website was created, and in the related art the set of encoding format types is relatively fixed. After the encoding format of the web page is determined, the position of each target element of the web page on the display screen, and its displayed height and width, may be determined according to user requirements.
According to an embodiment of the application, a method for determining a webpage encoding format is provided.
Fig. 1 is a flowchart of a method for determining a web page encoding format according to an embodiment of the present application. As shown in fig. 1, the method comprises the steps of:
Step S101: acquire a Uniform Resource Locator (URL), wherein the webpage corresponding to the URL is a target webpage.
The present application judges the encoding format of a target webpage. The target webpage may be a webpage designated by a user, and each target webpage may include different file resources. The path pointing to the webpage is its Uniform Resource Locator (URL), and the URL links directly to the target webpage.
Different target webpages may use different encoding formats, and the encoding format of each target webpage can be determined through the implementations in the present application.
Step S102: determine the encoding format of the target webpage according to the URL and the preset field content.
Through this step, one candidate encoding format of the target webpage can be determined: the target webpage is located according to the URL, and its encoding format is determined according to the file content of the target webpage and the field content preset in that file. The preset field content may be the charset field, which is commonly used to declare the encoding format of a web page. For example, if the code of the page pointed to by the URL includes <meta http-equiv="content-type" content="text/html; charset=GBK" />, the encoding format of the target webpage can be determined to be GBK according to the content of that field (charset=GBK).
Determining the encoding format of the target webpage according to the URL and the preset field content comprises the following steps: extracting a preset target character string in preset field content; and determining the coding format of the target webpage according to the extracted preset target character string and URL. Wherein the preset target character string may be "charset". The encoding format of the webpage can be extracted through the preset target character string.
The preset target character string in the above embodiment may also be a field written by the user, for example, bm-gbk. The user can define, in self-written code, the form of the field from which the encoding format can be judged.
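The extraction of the preset target character string "charset" can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function name `charset_from_field` and the regular expression are hypothetical, and the page head is assumed to be already available as text (the charset declaration itself is plain ASCII).

```python
import re
from typing import Optional

# Matches charset=VALUE both in <meta charset="..."> and in
# <meta http-equiv="content-type" content="text/html; charset=...">.
CHARSET_RE = re.compile(r"charset\s*=\s*[\"']?\s*([A-Za-z0-9_\-]+)", re.IGNORECASE)


def charset_from_field(html_text: str) -> Optional[str]:
    """Return the declared charset in upper case, or None if none is declared."""
    match = CHARSET_RE.search(html_text)
    return match.group(1).upper() if match else None


page_head = '<meta http-equiv="content-type" content="text/html; charset=GBK" />'
print(charset_from_field(page_head))
```

A user-defined field such as the bm-gbk example would need its own pattern; the sketch covers only the charset case.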
Step S103: determine the encoding format of the target webpage according to the URL and the character string conversion mode.
In the above embodiment, the character string conversion manner may include multiple manners, and since the web page content corresponding to each URL is different, when determining the encoding format of the target web page, the URL character or the encoded file content of the target web page corresponding to the URL may be converted into a corresponding character string, for example, the encoded file content of the target web page is converted into a binary form.
Optionally, determining the encoding format of the target webpage according to the URL and the character string conversion method includes: converting the target webpage into a page in a character string format; converting the page in the character string format into a byte stream by adopting a first preset encoding format; converting the byte stream into a target character string by adopting a second preset encoding format; and judging the encoding format of the target webpage according to whether the target character string comprises characters of a preset format type.
Optionally, the first preset encoding format may be the GB2312 encoding format. GB2312 is divided into 94 zones, each zone has 94 positions, and each position holds only one character, so the zone and position of a character can be used to encode the content of the target web page. The page in the character string format can be converted into a byte stream through the first preset encoding format; where the character string represents the target webpage content in binary form, the page can be converted into a file in byte stream format through GB2312 encoding.
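The zone and position layout of GB2312 can be illustrated with a short Python sketch. In the common EUC-CN byte form of GB2312 (the form Python's `gb2312` codec produces), each of the two bytes is the zone or position number plus 0xA0. The function name `zone_and_position` is an illustrative assumption.

```python
from typing import Tuple


def zone_and_position(ch: str) -> Tuple[int, int]:
    """Return the GB2312 (zone, position) pair of a single Chinese character."""
    first, second = ch.encode("gb2312")  # two bytes in EUC-CN form
    # Each byte is the zone/position number offset by 0xA0.
    return first - 0xA0, second - 0xA0


print(zone_and_position("\u4e2d"))  # the character for "middle"
```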
The second preset encoding format may be any of several encoding formats, such as the UTF-8 encoding format. The file of the target web page represented by the byte stream may be converted into a target character string through the second preset encoding format, and the target character string may include specific characters, for example, Chinese characters.
In another optional implementation, the characters of the preset format type are Chinese characters, and judging the encoding format of the target web page according to whether the target character string includes characters of the preset format type includes: if the target character string includes Chinese characters, determining that the encoding format of the target webpage is UTF-8; and if the target character string does not include Chinese characters, determining that the encoding format of the target webpage is GBK or GB2312.
Through the implementation, the encoding format of the target webpage can be determined. The encoding format may be the same as or different from the encoding format determined using the preset field content.
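The character string conversion steps above can be sketched in Python as follows. This is a sketch under stated assumptions rather than the applicant's implementation: the function name `encoding_by_conversion`, the use of the basic CJK Unified Ideographs range to recognise Chinese characters, and the `errors="replace"` handling are all illustrative choices. It also assumes the page string was obtained by decoding the raw page bytes as GB2312, which the text does not specify.

```python
import re

# Basic CJK Unified Ideographs block, used here as the test for "Chinese character".
CHINESE_RE = re.compile(r"[\u4e00-\u9fff]")


def encoding_by_conversion(page: str, first: str = "gb2312", second: str = "utf-8") -> str:
    """Judge the page encoding by a string -> bytes -> string round trip."""
    # Convert the page string into a byte stream with the first preset encoding.
    byte_stream = page.encode(first, errors="replace")
    # Convert the byte stream into the target string with the second preset encoding.
    target = byte_stream.decode(second, errors="replace")
    # Chinese characters surviving the round trip indicate UTF-8 page bytes.
    return "UTF-8" if CHINESE_RE.search(target) else "GBK/GB2312"


# A page whose bytes really are GB2312: the round trip yields no Chinese characters.
print(encoding_by_conversion("\u4e2d\u6587"))
```

Note that a page with no Chinese text at all also falls into the GBK/GB2312 branch under this rule, which is one reason the result is cross-checked against the charset field.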
Step S104: judge whether the encoding format of the target webpage determined according to the URL and the preset field content is the same as the encoding format of the target webpage determined according to the URL and the character string conversion mode.
Step S105: determine the encoding format of the target webpage according to the judgment result.
Through the above steps, the encoding format of the target webpage is determined in two ways and the results are compared. If the encoding format determined according to the URL and the character string conversion mode is the same as the encoding format determined according to the URL and the preset field content, that format is taken as the encoding format of the target webpage.
In the above embodiment, if the encoding format determined according to the URL and the preset field content differs from the encoding format determined according to the URL and the character string conversion mode, the encoding format of the target webpage needs to be determined again. Optionally, in that case, the encoding format determined according to the URL and the preset field content may also be taken as the encoding format of the target webpage.
Optionally, determining the encoding format of the target webpage according to the judgment result includes: if the judgment result is that the two are the same, taking the encoding format determined according to the URL and the preset field content, or the encoding format determined according to the URL and the preset character string conversion mode, as the encoding format of the target webpage; and if the judgment result is that the two are different, taking the encoding format determined according to the URL and the preset character string conversion mode as the encoding format of the target webpage.
In the embodiment, when the determination result is different, the encoding format of the target webpage determined according to the URL and the preset character string conversion manner may be used as the encoding format of the target webpage.
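The comparison and final decision of steps S104 and S105 can be sketched as below. The helper names `family` and `final_encoding` are hypothetical, and grouping GBK with GB2312 into one family is an assumption made so that a charset field of "GBK" or "GB2312" can be compared with the "GBK or GB2312" outcome of the conversion mode.

```python
from typing import Optional


def family(encoding: Optional[str]) -> Optional[str]:
    """Map an encoding label to a coarse family so the two results can be compared."""
    if encoding is None:
        return None
    name = encoding.strip().upper().replace("-", "").replace("_", "")
    if name in {"GBK", "GB2312", "GBK/GB2312"}:
        return "GBK/GB2312"
    if name == "UTF8":
        return "UTF-8"
    return name


def final_encoding(by_field: Optional[str], by_conversion: str) -> Optional[str]:
    """Pick the final encoding from the two independently determined results."""
    if family(by_field) == family(by_conversion):
        # Same result: either may be used; the field value is the more specific label.
        return by_field
    # Different results: prefer the character string conversion result.
    return by_conversion


print(final_encoding("GBK", "GBK/GB2312"))
print(final_encoding("GBK", "UTF-8"))
```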
By the embodiment, the encoding format of the target webpage can be determined according to the URL and the preset field content, the encoding format of the target webpage can also be determined according to the URL and the character string conversion mode, then whether the encoding format of the target webpage determined according to the URL and the preset field content is the same as the encoding format of the target webpage determined according to the URL and the character string conversion mode is judged, and the encoding format of the target webpage is determined according to the judgment result. By the two methods for determining the coding formats, the coding format of the target webpage can be determined more accurately, the efficiency for determining the coding format of the target webpage can be improved correspondingly, and the problem of low efficiency in determining the coding format of the webpage in the related art is solved.
The following are specific examples according to the present application.
In this embodiment, UTF-8 is used as the reference encoding format. Optionally, in this embodiment, the format of the web page code is determined by a character string mode, the web page obtained according to the URL is converted into a character string mode, and then the character string is converted into a byte stream by a GB2312 coding mode; the converted byte stream is then restored to a string-formatted file in UTF-8. And judging whether the Chinese characters exist in the file in the character string format, if so, determining that the encoding format of the webpage is a UTF-8 encoding format, otherwise, determining that the webpage is GBK or GB2312 encoding.
Optionally, the encoding format of the web page is also determined from the charset field; when the encoding format determined from the charset field is consistent with the encoding format determined through the character string mode, the encoding format of the web page can be confirmed.
Repeated tests show that the method has a high judgment accuracy, reaching roughly 98%. It is a reverse judgment approach that can be used to judge the encoding format of a web page before a web crawler crawls it, reducing the probability of garbled text.
In the related art, when the encoding format of a web page is judged, often only a small part of the byte stream of the web page is examined, rather than the full text of the file. The embodiments of the present application can improve the accuracy of judging the webpage encoding format; the encoding format is generated dynamically during crawling, and a CSN file of the encoding format does not need to be configured separately. The present application judges on the basis of the entire byte stream of the page and confirms the result in combination with the charset field of the webpage, thereby improving the efficiency of judging the webpage encoding.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer-executable instructions and that, although a logical order is illustrated in the flowcharts, in some cases, the steps illustrated or described may be performed in an order different than presented herein.
The embodiment of the present application further provides a device for determining a web page encoding format, and it should be noted that the device for determining a web page encoding format of the embodiment of the present application may be used to execute the method for determining a web page encoding format provided by the embodiment of the present application. The following describes a device for determining a web page encoding format according to an embodiment of the present application.
Fig. 2 is a schematic diagram of an apparatus for determining a web page encoding format according to an embodiment of the present application, as shown in fig. 2, the apparatus includes: the acquiring unit 21 is configured to acquire a uniform resource locator URL, where a webpage corresponding to the URL is a target webpage; the first determining unit 23 is configured to determine an encoding format of the target webpage according to the URL and preset field content; a second determining unit 25, configured to determine an encoding format of the target webpage according to the URL and the character string conversion manner; a judging unit 27, configured to judge whether the encoding format of the target webpage determined according to the URL and the preset field content is the same as the encoding format of the target webpage determined according to the URL and the character string conversion manner; and a third determining unit 29, configured to determine the encoding format of the target webpage according to the determination result.
With the above embodiment, the encoding format of the target web page may be determined by the first determining unit 23 according to the URL and the preset field content, or the encoding format of the target web page may be determined by the second determining unit 25 according to the URL and the character string conversion method, then the determining unit 27 determines whether the encoding format of the target web page determined according to the URL and the preset field content is the same as the encoding format of the target web page determined according to the URL and the character string conversion method, and the third determining unit 29 determines the encoding format of the target web page according to the determination result. By the two methods for determining the coding formats, the coding format of the target webpage can be determined more accurately, the efficiency for determining the coding format of the target webpage can be correspondingly improved, and the problem of low efficiency when the coding format of the webpage is determined in the related art is solved.
Optionally, the second determining unit 25 includes: the conversion module is used for converting the target webpage into a page in a character string format; the first conversion module is used for converting the page in the character string format into a byte stream by adopting a first preset coding format; the second conversion module is used for converting the byte stream into a target character string by adopting a second preset coding format; and the judging module is used for judging the coding format of the target webpage according to whether the target character string comprises characters of a preset format type.
Wherein, the characters of the preset format type are Chinese characters, and the judging module comprises: the first determining submodule is used for determining that the encoding format of the target webpage is UTF-8 if the target character string comprises Chinese characters; and the second determining submodule is used for determining that the coding format of the target webpage is GBK or GB2312 if the target character string does not comprise Chinese characters.
In the above embodiment, the third determining unit 29 includes: a third determining submodule, configured to, if the judgment result is that the two are the same, take the encoding format of the target webpage determined according to the URL and the preset field content, or the encoding format determined according to the URL and the preset character string conversion mode, as the encoding format of the target webpage; and a fourth determining submodule, configured to, if the judgment result is that the two are different, take the encoding format determined according to the URL and the preset character string conversion mode as the encoding format of the target webpage.
The device for determining the webpage encoding format comprises a processor and a memory, wherein the acquiring unit 21, the first determining unit 23, the second determining unit 25, the judging unit 27, the third determining unit 29 and the like are stored in the memory as program units, and the processor executes the program units stored in the memory to realize corresponding functions.
The processor comprises a kernel, and the kernel calls the corresponding program unit from the memory. The kernel can be set to be one or more, and the efficiency of determining the coding format of the target webpage is improved by adjusting the kernel parameters.
The memory may include volatile memory in a computer readable medium, Random Access Memory (RAM) and/or nonvolatile memory such as Read Only Memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip.
An embodiment of the present invention provides a storage medium, on which a program is stored, and the program, when executed by a processor, implements the method for determining the encoding format of the web page.
The embodiment of the invention provides a processor, which is used for running a program, wherein the method for determining the webpage coding format is executed when the program runs.
An embodiment of the invention provides a device comprising a processor, a memory, and a program that is stored in the memory and can run on the processor, wherein the processor, when executing the program, implements the following steps: acquiring a Uniform Resource Locator (URL), wherein a webpage corresponding to the URL is a target webpage; determining the encoding format of the target webpage according to the URL and the preset field content; determining the encoding format of the target webpage according to the URL and the character string conversion mode; judging whether the encoding format of the target webpage determined according to the URL and the preset field content is the same as the encoding format of the target webpage determined according to the URL and the character string conversion mode; and determining the encoding format of the target webpage according to the judgment result.
Determining the encoding format of the target webpage according to the URL and the character string conversion mode comprises the following steps: converting the target webpage into a page in a character string format; converting the page in the character string format into a byte stream by adopting a first preset encoding format; converting the byte stream into a target character string by adopting a second preset encoding format; and judging the encoding format of the target webpage according to whether the target character string comprises characters of a preset format type.
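The first three conversion steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `page_to_string` and `reconvert` are hypothetical, the text does not say how the page is first turned into a string (decoding the raw bytes with GB2312 is assumed here), and GB2312 and UTF-8 are used as the first and second preset encoding formats because the claims name them as examples.

```python
def page_to_string(raw: bytes, encoding: str = "gb2312") -> str:
    """Step 1 (assumed): represent the fetched page bytes as a string."""
    # Bytes that are not valid in the assumed encoding become U+FFFD.
    return raw.decode(encoding, errors="replace")


def reconvert(page_str: str,
              first_encoding: str = "gb2312",
              second_encoding: str = "utf-8") -> str:
    """Steps 2 and 3: string -> byte stream -> target character string."""
    # Step 2: convert the string-format page into a byte stream
    # using the first preset encoding format.
    byte_stream = page_str.encode(first_encoding, errors="replace")
    # Step 3: convert the byte stream into the target character string
    # using the second preset encoding format.
    return byte_stream.decode(second_encoding, errors="replace")


if __name__ == "__main__":
    utf8_page = "为主".encode("utf-8")      # page actually served as UTF-8
    print(reconvert(page_to_string(utf8_page)))
    gb_page = "中文".encode("gb2312")       # page actually served as GB2312
    print(reconvert(page_to_string(gb_page)))
```

Note that the round trip through GB2312 only restores the original bytes when every byte pair maps to an assigned GB2312 code point; otherwise some bytes are replaced and the readable text may be lost, so a production implementation would likely need a byte-preserving string representation.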
The characters of the preset format type are Chinese characters, and judging the encoding format of the target webpage according to whether the target character string comprises characters of the preset format type comprises: if the target character string comprises Chinese characters, determining that the encoding format of the target webpage is UTF-8; and if the target character string does not comprise Chinese characters, determining that the encoding format of the target webpage is GBK or GB2312.
Determining the encoding format of the target webpage according to the judgment result comprises the following steps: if the two encoding formats are judged to be the same, taking either the encoding format of the target webpage determined according to the URL and the preset field content or the encoding format of the target webpage determined according to the URL and the preset character string conversion mode as the encoding format of the target webpage; and if the two encoding formats are judged to be different, taking the encoding format of the target webpage determined according to the URL and the preset character string conversion mode as the encoding format of the target webpage.
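The Chinese-character test described above could be written along the following lines. This is only a sketch: the function name `judge_encoding` is hypothetical, and treating "Chinese character" as the basic CJK Unified Ideographs range U+4E00 to U+9FFF is an assumption, since the text does not define the character range.

```python
import re

# Assumed definition of "Chinese character": the basic CJK Unified
# Ideographs block. The patent text does not specify a range.
_CHINESE = re.compile(r"[\u4e00-\u9fff]")


def judge_encoding(target_string: str) -> str:
    """Classify the page from the target character string."""
    if _CHINESE.search(target_string):
        # Chinese characters survived decoding with UTF-8.
        return "UTF-8"
    # No Chinese characters: the text leaves GBK and GB2312 undistinguished.
    return "GBK/GB2312"
```

A page with no Chinese text at all would also fall into the second branch, which is one reason the result is cross-checked against the preset field content.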
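The decision rule above amounts to preferring the character string conversion result whenever the two determinations disagree. A minimal sketch follows; the function name `choose_encoding` is hypothetical, and comparing the two labels case-insensitively is an assumption that the text does not spell out.

```python
def choose_encoding(from_field: str, from_conversion: str) -> str:
    """Pick the final encoding from the two candidate determinations."""
    # Assumed normalisation: ignore case and surrounding whitespace.
    if from_field.strip().lower() == from_conversion.strip().lower():
        # Same result: either determination may be used.
        return from_field
    # Different results: trust the character string conversion mode.
    return from_conversion
```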
Determining the encoding format of the target webpage according to the URL and the preset field content comprises the following steps: extracting a preset target character string in preset field content; and determining the coding format of the target webpage according to the extracted preset target character string and URL. The device herein may be a server, a PC, a PAD, a mobile phone, etc.
The present application further provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the following method steps: acquiring a Uniform Resource Locator (URL), wherein a webpage corresponding to the URL is a target webpage; determining the encoding format of the target webpage according to the URL and the preset field content; determining the encoding format of the target webpage according to the URL and the character string conversion mode; judging whether the encoding format of the target webpage determined according to the URL and the preset field content is the same as the encoding format of the target webpage determined according to the URL and the character string conversion mode; and determining the encoding format of the target webpage according to the judgment result.
Determining the encoding format of the target webpage according to the URL and the character string conversion mode comprises the following steps: converting the target webpage into a page in a character string format; converting the page in the character string format into a byte stream by adopting a first preset encoding format; converting the byte stream into a target character string by adopting a second preset encoding format; and judging the encoding format of the target webpage according to whether the target character string comprises characters of a preset format type.
The characters of the preset format type are Chinese characters, and judging the encoding format of the target webpage according to whether the target character string comprises characters of the preset format type comprises: if the target character string comprises Chinese characters, determining that the encoding format of the target webpage is UTF-8; and if the target character string does not comprise Chinese characters, determining that the encoding format of the target webpage is GBK or GB2312.
Determining the encoding format of the target webpage according to the judgment result comprises the following steps: if the two encoding formats are judged to be the same, taking either the encoding format of the target webpage determined according to the URL and the preset field content or the encoding format of the target webpage determined according to the URL and the preset character string conversion mode as the encoding format of the target webpage; and if the two encoding formats are judged to be different, taking the encoding format of the target webpage determined according to the URL and the preset character string conversion mode as the encoding format of the target webpage.
Determining the encoding format of the target webpage according to the URL and the preset field content comprises the following steps: extracting a preset target character string in preset field content; and determining the coding format of the target webpage according to the extracted preset target character string and URL.
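As an illustration of the extraction step above, the sketch below assumes that the preset field content is an HTTP `Content-Type` value or an HTML `meta` tag and that the preset target character string is the `charset` token. Both are assumptions made for the example; this passage does not name the field, and the function name `extract_charset` is hypothetical.

```python
import re
from typing import Optional

# Matches charset=<token>, with optional quotes around the token.
_CHARSET = re.compile(r"charset\s*=\s*[\"']?([A-Za-z0-9._:-]+)", re.IGNORECASE)


def extract_charset(field_content: str) -> Optional[str]:
    """Return the declared charset in lower case, or None if absent."""
    match = _CHARSET.search(field_content)
    return match.group(1).lower() if match else None
```

A declared charset can be missing or wrong, which is why the method compares it with the result of the character string conversion mode rather than relying on it alone.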
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include computer-readable media in the form of volatile memory, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (8)

1. A method for determining a web page encoding format, comprising:
acquiring a Uniform Resource Locator (URL), wherein a webpage corresponding to the URL is a target webpage;
determining the encoding format of the target webpage according to the URL and preset field content;
determining the coding format of the target webpage according to the URL and the character string conversion mode, wherein the target character string of the target webpage is determined through the character string conversion mode;
judging whether the encoding format of the target webpage determined according to the URL and preset field content is the same as the encoding format of the target webpage determined according to the URL and the character string conversion mode;
determining the coding format of the target webpage according to the judgment result;
determining the encoding format of the target webpage according to the URL and the character string conversion mode comprises the following steps: converting the target webpage into a page in a character string format; converting the page in the character string format into a byte stream by adopting a first preset encoding format; converting the byte stream into a target character string by adopting a second preset encoding format; judging the coding format of the target webpage according to whether the target character string comprises characters of a preset format type; wherein, under the condition that the character string format is the target webpage content represented by the binary system, the first preset encoding format at least comprises a GB2312 encoding, and the second preset encoding format at least comprises a UTF-8 encoding format.
2. The method of claim 1, wherein the characters of the preset format type are Chinese characters, and determining the encoding format of the target webpage according to whether the characters of the preset format type are included in the target character string comprises:
if the target character string comprises Chinese characters, determining that the encoding format of the target webpage is UTF-8;
and if the target character string does not comprise Chinese characters, determining that the coding format of the target webpage is GBK or GB2312.
3. The method of claim 1, wherein determining the encoding format of the target webpage according to the determination result comprises:
if the judgment results are the same, taking the coding format of the target webpage determined according to the URL and preset field contents or the coding format of the target webpage determined according to the URL and a preset character string conversion mode as the coding format of the target webpage;
and if the judgment results are different, taking the coding format of the target webpage determined according to the URL and a preset character string conversion mode as the coding format of the target webpage.
4. The method of claim 1, wherein determining the encoding format of the target webpage according to the URL and the preset field content comprises:
extracting a preset target character string in the preset field content;
and determining the coding format of the target webpage according to the extracted preset target character string and the URL.
5. An apparatus for determining a web page encoding format, comprising:
the acquisition unit is used for acquiring a Uniform Resource Locator (URL), wherein a webpage corresponding to the URL is a target webpage;
the first determining unit is used for determining the coding format of the target webpage according to the URL and preset field content;
the second determining unit is used for determining the coding format of the target webpage according to the URL and the character string conversion mode;
the judging unit is used for judging whether the encoding format of the target webpage determined according to the URL and the preset field content is the same as the encoding format of the target webpage determined according to the URL and the character string conversion mode;
the third determining unit is used for determining the coding format of the target webpage according to the judgment result;
the second determination unit includes: the conversion module is used for converting the target webpage into a page in a character string format; the first conversion module is used for converting the page in the character string format into a byte stream by adopting a first preset coding format; the second conversion module is used for converting the byte stream into a target character string by adopting a second preset coding format; the judging module is used for judging the coding format of the target webpage according to whether the target character string comprises characters of a preset format type; wherein, under the condition that the character string format is the target webpage content represented by the binary system, the first preset encoding format at least comprises a GB2312 encoding, and the second preset encoding format at least comprises a UTF-8 encoding format.
6. The apparatus of claim 5, wherein the characters of the predetermined format type are Chinese characters, and the determining module comprises:
the first determining submodule is used for determining that the encoding format of the target webpage is UTF-8 if the target character string comprises Chinese characters;
and the second determining submodule is used for determining that the coding format of the target webpage is GBK or GB2312 if the target character string does not comprise Chinese characters.
7. A storage medium characterized by comprising a stored program, wherein the program executes the method for determining a web page encoding format according to any one of claims 1 to 4.
8. A processor, configured to execute a program, wherein the program executes the method for determining the encoding format of the web page according to any one of claims 1 to 4.
CN201710784883.3A 2017-09-01 2017-09-01 Method and device for determining webpage coding format Active CN110020343B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710784883.3A CN110020343B (en) 2017-09-01 2017-09-01 Method and device for determining webpage coding format

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710784883.3A CN110020343B (en) 2017-09-01 2017-09-01 Method and device for determining webpage coding format

Publications (2)

Publication Number Publication Date
CN110020343A CN110020343A (en) 2019-07-16
CN110020343B true CN110020343B (en) 2021-03-30

Family

ID=67186195

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710784883.3A Active CN110020343B (en) 2017-09-01 2017-09-01 Method and device for determining webpage coding format

Country Status (1)

Country Link
CN (1) CN110020343B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113595683A (en) * 2021-07-07 2021-11-02 西安震有信通科技有限公司 Conversion processing method, device, terminal and medium based on various encoding files
CN114615074A (en) * 2022-03-25 2022-06-10 山石网科通信技术股份有限公司 Network message decoding method, network attack detection method, device and storage medium
CN114827113B (en) * 2022-04-18 2024-04-16 阿里巴巴(中国)有限公司 Webpage access method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101101606A (en) * 2007-08-03 2008-01-09 中兴通讯股份有限公司 Web page coding language automatic identification method and device for embedded type browser
CN101526963A (en) * 2009-04-17 2009-09-09 深圳华为通信技术有限公司 Method for identifying web page coding, device and terminal equipment
CN102360392A (en) * 2011-10-24 2012-02-22 青岛海信移动通信技术股份有限公司 Method and device for determining webpage encoding mode
CN104361021A (en) * 2014-10-21 2015-02-18 小米科技有限责任公司 Webpage encoding identifying method and device
CN104391993A (en) * 2014-12-15 2015-03-04 浪潮(北京)电子信息产业有限公司 Method and system for recognizing webpage codes
CN106570044A (en) * 2015-10-13 2017-04-19 北京国双科技有限公司 Method and device for analyzing webpage code


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"基于内容单元的网页解析与内容提取";王璟琦;《万方》;20130305;全文 *

Also Published As

Publication number Publication date
CN110020343A (en) 2019-07-16

Similar Documents

Publication Publication Date Title
CN102567516B (en) Script loading method and device
CN105824830B (en) Method, client and equipment for displaying page
CN110020343B (en) Method and device for determining webpage coding format
CN110020353B (en) Method and device for constructing webpage form
CN107943465B (en) Method and device for generating HTML (Hypertext markup language) form
TW201800962A (en) Webpage file sending method, webpage rendering method and device and webpage rendering system
CN107147645B (en) Method and device for acquiring network security data
CN109062906B (en) Translation method and device for program language resources
CN104268229A (en) Resource obtaining method and device based on multi-process browser
CN105022619A (en) Code processing method and device
CN104636396A (en) Page positioning method and device
CN115544304A (en) File analysis method and device, readable storage medium and file analysis equipment
CN108874379B (en) Page processing method and device
CN104899203B (en) Webpage generation method and device and terminal equipment
CN104978325A (en) Webpage processing method and device, and user terminal
CN109558548B (en) Method for eliminating CSS style redundancy and related product
CN111460348B (en) File processing method and device
CN109240660B (en) Access method of advertisement data, storage medium, electronic device and system
CN111209009A (en) Content distribution method and device, storage medium and electronic equipment
CN110968810A (en) Webpage data processing method and device
CN110929188A (en) Method and device for rendering server page
CN115297042A (en) Method for detecting consistency of web pages under different networks and related equipment
CN116136757A (en) Log output method and device and electronic equipment
CN113377376A (en) Data packet generation method, data packet generation device, electronic device, and storage medium
CN110851746B (en) Crawler seed generation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Applicant after: Beijing Guoshuang Technology Co.,Ltd.

Address before: 100086 Beijing city Haidian District Shuangyushu Area No. 76 Zhichun Road cuigongfandian 8 layer A

Applicant before: Beijing Guoshuang Technology Co.,Ltd.

GR01 Patent grant