CN110020343A - The determination method and apparatus of web page coding format - Google Patents
Method and apparatus for determining the encoding format of a web page
- Publication number
- CN110020343A CN201710784883.3A
- Authority
- CN
- China
- Prior art keywords
- format
- target webpage
- url
- target
- coded format
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
This application discloses a method and an apparatus for determining the encoding format of a web page. The method comprises: obtaining a uniform resource locator (URL), where the web page corresponding to the URL is the target web page; determining the encoding format of the target web page according to the URL and preset field content; determining the encoding format of the target web page according to the URL and a string conversion method; judging whether the encoding format determined according to the URL and the preset field content is the same as the encoding format determined according to the URL and the string conversion method; and determining the encoding format of the target web page according to the judgment result. The application addresses the technical problem in the related art that determining the encoding format of a web page is inefficient.
Description
Technical field
This application relates to the field of web technologies, and in particular to a method and an apparatus for determining the encoding format of a web page.
Background
In the related art, the encoding format of a web page is generally determined by clicking a plug-in in the web page with the mouse, using the plug-in to view the page's code, and then having the user read through that code to work out which encoding format it uses. This approach requires the user to inspect the page code line by line, which takes a long time and is inefficient.
No effective solution has yet been proposed for the low efficiency of determining the encoding format of a web page in the related art.
Summary of the invention
The main purpose of this application is to provide a method for determining the encoding format of a web page, so as to solve the problem in the related art that determining the encoding format of a web page is inefficient.
To achieve the above purpose, according to one aspect of this application, a method for determining the encoding format of a web page is provided. The method comprises: obtaining a uniform resource locator (URL), where the web page corresponding to the URL is the target web page; determining the encoding format of the target web page according to the URL and preset field content; determining the encoding format of the target web page according to the URL and a string conversion method; judging whether the encoding format determined according to the URL and the preset field content is the same as the encoding format determined according to the URL and the string conversion method; and determining the encoding format of the target web page according to the judgment result.
Further, determining the encoding format of the target web page according to the URL and the string conversion method comprises: converting the target web page into a page in string format; converting the page in string format into a byte stream using a first preset encoding format; converting the byte stream into a target string using a second preset encoding format; and determining the encoding format of the target web page according to whether the target string contains characters of a preset format type.
Further, the characters of the preset format type are Chinese characters, and determining the encoding format of the target web page according to whether the target string contains characters of the preset format type comprises: if the target string contains Chinese characters, determining that the encoding format of the target web page is UTF-8; if the target string does not contain Chinese characters, determining that the encoding format of the target web page is GBK or GB2312.
Further, determining the encoding format of the target web page according to the judgment result comprises: if the judgment result is that the two are the same, taking either the encoding format determined according to the URL and the preset field content or the encoding format determined according to the URL and the preset string conversion method as the encoding format of the target web page; if the judgment result is that the two differ, taking the encoding format determined according to the URL and the preset string conversion method as the encoding format of the target web page.
Further, determining the encoding format of the target web page according to the URL and the preset field content comprises: extracting a preset target string from the preset field content; and determining the encoding format of the target web page according to the extracted preset target string and the URL.
To achieve the above purpose, according to another aspect of this application, an apparatus for determining the encoding format of a web page is provided. The apparatus comprises: an acquiring unit, configured to obtain a uniform resource locator (URL), where the web page corresponding to the URL is the target web page; a first determination unit, configured to determine the encoding format of the target web page according to the URL and preset field content; a second determination unit, configured to determine the encoding format of the target web page according to the URL and a string conversion method; a judging unit, configured to judge whether the encoding format determined according to the URL and the preset field content is the same as the encoding format determined according to the URL and the string conversion method; and a third determination unit, configured to determine the encoding format of the target web page according to the judgment result.
Further, the second determination unit comprises: a conversion module, configured to convert the target web page into a page in string format; a first conversion module, configured to convert the page in string format into a byte stream using a first preset encoding format; a second conversion module, configured to convert the byte stream into a target string using a second preset encoding format; and a determination module, configured to determine the encoding format of the target web page according to whether the target string contains characters of a preset format type.
Further, the characters of the preset format type are Chinese characters, and the determination module comprises: a first determination submodule, configured to determine that the encoding format of the target web page is UTF-8 if the target string contains Chinese characters; and a second determination submodule, configured to determine that the encoding format of the target web page is GBK or GB2312 if the target string does not contain Chinese characters.
To achieve the above purpose, according to another aspect of this application, a storage medium is provided. The storage medium includes a stored program, where the program executes the method for determining the encoding format of a web page described in any one of the above.
To achieve the above purpose, according to another aspect of this application, a processor is provided. The processor is configured to run a program, where the program, when run, executes the method for determining the encoding format of a web page described in any one of the above.
With this application, the encoding format of the target web page can be determined according to the URL and the preset field content, and can also be determined according to the URL and the string conversion method. The two results are then compared, and the encoding format of the target web page is determined according to the judgment result. Using two determination methods allows the encoding format of the target web page to be determined more accurately and correspondingly improves the efficiency of determining it, which solves the problem in the related art that determining the encoding format of a web page is inefficient.
Brief description of the drawings
The accompanying drawings, which form a part of this application, are provided for further understanding of the application. The illustrative embodiments of the application and their descriptions are used to explain the application and do not unduly limit it. In the drawings:
Fig. 1 is a flowchart of the method for determining the encoding format of a web page according to an embodiment of this application; and
Fig. 2 is a schematic diagram of the apparatus for determining the encoding format of a web page according to an embodiment of this application.
Detailed description of embodiments
It should be noted that, where there is no conflict, the embodiments of this application and the features in them may be combined with one another. The application is described in detail below with reference to the drawings and in conjunction with the embodiments.
To help those skilled in the art better understand the solution of this application, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are evidently only some of the embodiments of this application, not all of them. All other embodiments obtained by a person of ordinary skill in the art from the embodiments in this application without creative effort shall fall within the scope of protection of this application.
Some of the terms used in the embodiments of this application are explained below:
URL (Uniform Resource Locator): a concise representation of the location of, and the method of access to, a resource available on the Internet; it is the standard address of a resource on the Internet. Each file on the Internet has a unique URL, and the information it contains indicates the location of the file.
GB2312 encoding: the Chinese character set code for information interchange, suitable for exchanging information between systems such as Chinese text processing and Chinese communication systems.
UTF-8 (8-bit Unicode Transformation Format): a variable-length character encoding for Unicode that can encode a Unicode character in 1 to 6 bytes (the current standard, RFC 3629, limits this to 4). It allows a single web page to display simplified and traditional Chinese together with other languages.
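As a small illustration of the variable-length property described above, the following Python snippet prints how many bytes UTF-8 uses for a few sample characters. The sample characters are assumptions chosen here for illustration and do not come from the application; they happen to span one to four bytes.

```python
# Illustrative only: the sample characters are assumptions, not from the application.
samples = ["A", "é", "中", "😀"]

# Map each character to the number of bytes its UTF-8 encoding occupies.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in utf8_lengths.items():
    print(f"U+{ord(ch):04X} -> {n} byte(s) in UTF-8")

# For comparison, GB2312 stores a Chinese character in exactly two bytes.
gb2312_length = len("中".encode("gb2312"))
print(f"U+4E2D -> {gb2312_length} byte(s) in GB2312")
```

The different byte lengths for the same Chinese character are what make a page decoded with the wrong encoding appear garbled.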
It should be noted that the terms "first", "second", and so on in the description, claims, and drawings of this application are used to distinguish similar objects and are not used to describe a particular order or sequence. It should be understood that data used in this way are interchangeable where appropriate, so that the embodiments of the application described herein can be implemented. In addition, the terms "include" and "have" and any variations of them are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.
The following embodiments can be applied to determining the encoding format of a web page. The encoding format is usually the encoding scheme adopted when the web page or website is created; in the related art, the set of encoding format types is relatively fixed. Once the encoding format of a web page has been determined, the position of each target element of the page on the display screen, and the height and width at which it is displayed, can be determined according to user requirements.
According to an embodiment of this application, a method for determining the encoding format of a web page is provided.
Fig. 1 is a flowchart of the method for determining the encoding format of a web page according to an embodiment of this application. As shown in Fig. 1, the method comprises the following steps:
Step S101: obtain a uniform resource locator (URL), where the web page corresponding to the URL is the target web page.
In this application, the encoding format of a target web page is judged. The target web page may be a web page specified by the user, and each target web page contains various file resources. The path that points to the web page is taken as its uniform resource locator (URL), and the URL links directly to the target web page.
Different target web pages may have different encoding formats, and the embodiments of this application can determine the encoding format of each target web page.
Step S102: determine the encoding format of the target web page according to the URL and preset field content.
This step is one way of determining the encoding format of the target web page. The target web page is located from the URL, and its encoding format is determined from the file content of the target web page and the preset field content in that file. The preset field content may be the charset field commonly used to declare the encoding format of a web page. For example, if the code of the page that a URL points to includes <meta http-equiv="content-type" content="text/html; charset=gbk" />, then the content of the field (charset=gbk) indicates that the encoding format of the target web page is GBK.
Determining the encoding format of the target web page according to the URL and the preset field content comprises: extracting a preset target string from the preset field content; and determining the encoding format of the target web page according to the extracted preset target string and the URL. The preset target string may be "charset", and the encoding format of the web page can be extracted by means of this preset target string.
The preset target string in the above embodiment may also be a field written by the user, for example bm=gbk. The user can decide, based on the code they have written, which field is used to indicate the encoding format.
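The field-based determination can be sketched in Python roughly as follows. This is a minimal illustration, not the implementation from the application: the function name `extract_declared_encoding`, the regular expression, and the sample inputs are all assumptions, and the sketch works on page source that has already been fetched.

```python
import re
from typing import Optional


def extract_declared_encoding(page_source: str, field: str = "charset") -> Optional[str]:
    """Return the value that follows `field=` in the page source, lower-cased.

    `field` plays the role of the preset target string: "charset" by default,
    or a user-defined name such as "bm". Returns None when the field is absent.
    """
    pattern = re.compile(
        r"\b" + re.escape(field) + r"""\s*=\s*["']?\s*([\w-]+)""",
        re.IGNORECASE,
    )
    match = pattern.search(page_source)
    return match.group(1).lower() if match else None


# Hypothetical sample inputs for illustration.
meta_tag = '<meta http-equiv="content-type" content="text/html; charset=gbk" />'
print(extract_declared_encoding(meta_tag))                 # gbk
print(extract_declared_encoding("bm=gbk", field="bm"))     # gbk
print(extract_declared_encoding("<p>no declaration</p>"))  # None
```

Passing the field name as a parameter mirrors the idea that the preset target string may be either the standard charset field or a field the user defines.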
Step S103: determine the encoding format of the target web page according to the URL and a string conversion method.
In the above embodiment, the string conversion method may take various forms. Because the content of the web page corresponding to each URL differs, when the encoding format of the target web page is determined, the URL characters or the code file content of the target web page corresponding to the URL can be converted into a corresponding string; for example, the code file content of the target web page can be converted into binary form.
Optionally, determining the encoding format of the target web page according to the URL and the string conversion method comprises: converting the target web page into a page in string format; converting the page in string format into a byte stream using a first preset encoding format; converting the byte stream into a target string using a second preset encoding format; and determining the encoding format of the target web page according to whether the target string contains characters of a preset format type.
Optionally, the first preset encoding format may be GB2312. GB2312 divides its code space into 94 areas of 94 positions each, with exactly one character at each position, so the content of the target web page can be encoded by the area and position of each character. The first preset encoding format is used to convert the page in string format into a byte stream; where the target web page content represented in string format is binary, GB2312 encoding can convert the page in string format into a file in byte-stream format.
The second preset encoding format may be any of several encoding formats, for example UTF-8. The second preset encoding format is used to convert the file of the target web page, represented as a byte stream, into the target string. The target string may contain specific characters, for example Chinese characters.
In another optional embodiment, the characters of the preset format type are Chinese characters, and determining the encoding format of the target web page according to whether the target string contains characters of the preset format type comprises: if the target string contains Chinese characters, determining that the encoding format of the target web page is UTF-8; if the target string does not contain Chinese characters, determining that the encoding format of the target web page is GBK or GB2312.
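The string conversion steps above can be sketched in Python as follows. This is a hedged illustration under several assumptions that are not in the application: "Chinese character" is approximated by the basic CJK Unified Ideographs block, a failed conversion is treated as "no Chinese characters recovered" because Python's codecs are strict by default, the single label "gbk" stands for "GBK or GB2312" (GBK is a superset of GB2312), and the sample text is chosen so that its UTF-8 bytes also form valid GB2312 byte pairs.

```python
import re

# Assumed definition of "Chinese character": the basic CJK Unified Ideographs block.
CHINESE_CHAR = re.compile(r"[\u4e00-\u9fff]")


def detect_by_string_conversion(page_text: str,
                                first_encoding: str = "gb2312",
                                second_encoding: str = "utf-8") -> str:
    """Return "utf-8" if the round trip yields Chinese characters, else "gbk"."""
    try:
        byte_stream = page_text.encode(first_encoding)   # string -> byte stream
        target = byte_stream.decode(second_encoding)     # byte stream -> target string
    except UnicodeError:
        # Assumption of this sketch: a conversion that fails recovers no Chinese text.
        return "gbk"
    return "utf-8" if CHINESE_CHAR.search(target) else "gbk"


# A UTF-8 page that was read as GB2312 appears as garbled text; the round trip restores it.
misread_utf8_page = "空中".encode("utf-8").decode("gb2312")
print(detect_by_string_conversion(misread_utf8_page))  # utf-8

# A page that really is GB2312/GBK was read correctly; its bytes are not valid UTF-8.
print(detect_by_string_conversion("空中"))              # gbk
```

In practice many UTF-8 pages will not decode cleanly as GB2312 under strict error handling, so a real crawler would need to decide how to treat undecodable bytes; that choice is outside what the application specifies.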
The above embodiment determines an encoding format for the target web page. This encoding format may be the same as, or different from, the encoding format determined using the preset field content.
Step S104: judge whether the encoding format of the target web page determined according to the URL and the preset field content is the same as the encoding format of the target web page determined according to the URL and the string conversion method.
Step S105: determine the encoding format of the target web page according to the judgment result.
The above steps establish two ways of judging the encoding format of the target web page. By comparison, when the encoding format determined according to the URL and the preset field content is the same as the encoding format determined according to the URL and the string conversion method, the encoding format of the target web page is determined to be the one determined according to the URL and the preset field content.
In the above embodiment, if the encoding format determined according to the URL and the preset field content differs from the encoding format determined according to the URL and the string conversion method, the encoding format of the target web page needs to be determined again. Optionally, when the two differ, the encoding format determined according to the URL and the preset field content may also be taken as the encoding format of the target web page.
Optionally, determining the encoding format of the target web page according to the judgment result comprises: if the judgment result is that the two are the same, taking either the encoding format determined according to the URL and the preset field content or the encoding format determined according to the URL and the preset string conversion method as the encoding format of the target web page; if the judgment result is that the two differ, taking the encoding format determined according to the URL and the preset string conversion method as the encoding format of the target web page.
In this embodiment, therefore, when the judgment result is that the two differ, the encoding format determined according to the URL and the preset string conversion method is taken as the encoding format of the target web page.
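The decision rule can be sketched in Python as follows. The names `normalize` and `decide_encoding`, the alias table, and the `prefer_conversion` flag are assumptions made for illustration; the flag covers both options mentioned in the text for the case where the two results differ, with the string conversion result as the default.

```python
from typing import Optional

# Hypothetical alias table: GBK is a superset of GB2312, so both count as one family here.
_ALIASES = {"gb2312": "gbk", "utf8": "utf-8"}


def normalize(name: str) -> str:
    """Lower-case an encoding label and fold known aliases together."""
    key = name.strip().lower()
    return _ALIASES.get(key, key)


def decide_encoding(declared: Optional[str], converted: str,
                    prefer_conversion: bool = True) -> str:
    """Combine the field-based result (`declared`) with the conversion-based one."""
    if declared is None:
        # No preset field was found, so only the conversion result is available.
        return normalize(converted)
    if normalize(declared) == normalize(converted):
        return normalize(declared)          # same result: either one may be used
    # Different results: by default the string conversion result wins.
    return normalize(converted) if prefer_conversion else normalize(declared)


print(decide_encoding("GB2312", "gbk"))                          # gbk
print(decide_encoding("gbk", "utf-8"))                           # utf-8
print(decide_encoding("gbk", "utf-8", prefer_conversion=False))  # gbk
```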
Through the above embodiment, the encoding format of the target web page can be determined according to the URL and the preset field content, and can also be determined according to the URL and the string conversion method. The two results are then compared, and the encoding format of the target web page is determined according to the judgment result. Using two determination methods allows the encoding format of the target web page to be determined accurately and correspondingly improves the efficiency of determining it, which solves the problem in the related art that determining the encoding format of a web page is inefficient.
A specific embodiment of this application is described below.
In this embodiment, UTF-8 is used as the reference encoding format. Optionally, the encoding format of the web page is judged by means of strings: the web page obtained from the URL is converted into string form, the string is converted into a byte stream using GB2312 encoding, and the byte stream is then restored to a file in string format using UTF-8. The file in string format is checked for Chinese characters. If any are present, the encoding format of the web page is determined to be UTF-8; otherwise it is GBK or GB2312.
Optionally, the encoding format of the web page is also determined from the charset field. When the encoding format determined from the charset field is consistent with the encoding format determined above, the encoding format of the web page is confirmed.
Repeated testing shows that this method is highly accurate, reaching roughly 98%. It is a reverse-judgment approach that a web crawler can apply before crawling a page to judge its encoding format, which reduces the probability of garbled text.
In the related art, judging the encoding format of a web page often examines only a small portion of the page's byte stream rather than the full text. In this application, the encoding format of the web page is obtained by judging the full-text file. The above embodiments improve the accuracy of judging the encoding format of a web page; the encoding format is generated dynamically during crawling, so there is no need to separately configure a CSN file for the encoding format. In this application, the judgment is made over the entire byte stream of the page and confirmed in combination with the page's charset field, which improves the efficiency of judging the encoding of a web page.
It should be noted that the steps shown in the flowchart of the drawings may be executed in a computer system, for example as a set of computer-executable instructions. Although a logical order is shown in the flowchart, in some cases the steps shown or described may be executed in a different order.
An embodiment of this application also provides an apparatus for determining the encoding format of a web page. It should be noted that the apparatus of this embodiment can be used to execute the method for determining the encoding format of a web page provided by the embodiments of this application. The apparatus is described below.
Fig. 2 is a schematic diagram of the apparatus for determining the encoding format of a web page according to an embodiment of this application. As shown in Fig. 2, the apparatus comprises: an acquiring unit 21, configured to obtain a uniform resource locator (URL), where the web page corresponding to the URL is the target web page; a first determination unit 23, configured to determine the encoding format of the target web page according to the URL and preset field content; a second determination unit 25, configured to determine the encoding format of the target web page according to the URL and a string conversion method; a judging unit 27, configured to judge whether the encoding format determined according to the URL and the preset field content is the same as the encoding format determined according to the URL and the string conversion method; and a third determination unit 29, configured to determine the encoding format of the target web page according to the judgment result.
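One way to picture how the units of Fig. 2 fit together is the following Python sketch. The class name `EncodingDeterminationDevice` and the idea of passing each unit in as a callable are assumptions made for illustration; the application describes the units functionally and does not prescribe this structure. The stand-in callables in the usage example are placeholders, not real fetching or detection logic.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class EncodingDeterminationDevice:
    acquire: Callable[[str], str]              # acquiring unit 21: URL -> page text
    by_field: Callable[[str], Optional[str]]   # first determination unit 23
    by_conversion: Callable[[str], str]        # second determination unit 25

    def determine(self, url: str) -> str:
        page = self.acquire(url)
        declared = self.by_field(page)
        converted = self.by_conversion(page)
        same = declared == converted            # judging unit 27
        # Third determination unit 29: keep the shared result, or fall back
        # to the string conversion result when the two differ.
        return declared if same and declared is not None else converted


# Usage with placeholder callables (hypothetical URL and fixed results).
device = EncodingDeterminationDevice(
    acquire=lambda url: "<meta charset=gbk>",
    by_field=lambda page: "gbk",
    by_conversion=lambda page: "utf-8",
)
print(device.determine("http://example.invalid/"))  # utf-8
```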
Through the above embodiment, the first determination unit 23 can determine the encoding format of the target web page according to the URL and the preset field content, and the second determination unit 25 can determine the encoding format of the target web page according to the URL and the string conversion method. The judging unit 27 then judges whether the two results are the same, and the third determination unit 29 determines the encoding format of the target web page according to the judgment result. Using two determination methods allows the encoding format of the target web page to be determined more accurately and correspondingly improves the efficiency of determining it, which solves the problem in the related art that determining the encoding format of a web page is inefficient.
Optionally, the second determination unit 25 comprises: a conversion module, configured to convert the target web page into a page in string format; a first conversion module, configured to convert the page in string format into a byte stream using a first preset encoding format; a second conversion module, configured to convert the byte stream into a target string using a second preset encoding format; and a determination module, configured to determine the encoding format of the target web page according to whether the target string contains characters of a preset format type.
The characters of the preset format type are Chinese characters, and the determination module comprises: a first determination submodule, configured to determine that the encoding format of the target web page is UTF-8 if the target string contains Chinese characters; and a second determination submodule, configured to determine that the encoding format of the target web page is GBK or GB2312 if the target string does not contain Chinese characters.
For the above embodiment, the third determination unit 29 comprises: a third determination submodule, configured to, if the judgment result is that the two are the same, take either the encoding format determined according to the URL and the preset field content or the encoding format determined according to the URL and the preset string conversion method as the encoding format of the target web page; and a fourth determination submodule, configured to, if the judgment result is that the two differ, take the encoding format determined according to the URL and the preset string conversion method as the encoding format of the target web page.
The apparatus for determining the encoding format of a web page includes a processor and a memory. The acquiring unit 21, the first determination unit 23, the second determination unit 25, the judging unit 27, the third determination unit 29, and so on are stored in the memory as program units, and the processor executes these program units stored in the memory to implement the corresponding functions.
The processor contains a kernel, and the kernel retrieves the corresponding program unit from the memory. One or more kernels may be provided, and the efficiency of determining the encoding format of the target web page is improved by adjusting kernel parameters.
The memory may include volatile memory in a computer-readable medium, random access memory (RAM), and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM). The memory includes at least one storage chip.
An embodiment of the invention provides a storage medium on which a program is stored; when the program is executed by a processor, it implements the method for determining the encoding format of a web page.
An embodiment of the invention provides a processor configured to run a program, where the program, when run, executes the method for determining the encoding format of a web page.
An embodiment of the invention provides a device comprising a processor, a memory, and a program stored in the memory and runnable on the processor. When executing the program, the processor performs the following steps: obtaining a uniform resource locator (URL), where the web page corresponding to the URL is the target web page; determining the encoding format of the target web page according to the URL and preset field content; determining the encoding format of the target web page according to the URL and a string conversion method; judging whether the encoding format determined according to the URL and the preset field content is the same as the encoding format determined according to the URL and the string conversion method; and determining the encoding format of the target web page according to the judgment result.
The coded format that target webpage is determined according to URL and character string transform mode includes: to convert word for target webpage
Accord with the page of string format;Use the first pre-arranged code format by the conversion of page of string format for byte stream;It is pre- using second
If byte stream is converted to target string by coded format;According in target string whether include preset format type character
Determine the coded format of target webpage.
The character of preset format type be Chinese character, according in target string whether include preset format type word
If symbol determines that the coded format of target webpage includes: to determine the coding lattice of target webpage including Chinese character in target string
Formula is UTF-8;If in target string not including Chinese character, determine that the coded format of target webpage is GBK or GB2312.
Determining the encoding format of the target webpage according to the judgment result includes: if the judgment result is that the two encoding formats are the same, taking either the encoding format determined according to the URL and the preset field contents or the encoding format determined according to the URL and the preset character string conversion mode as the encoding format of the target webpage; if the judgment result is that they differ, taking the encoding format determined according to the URL and the preset character string conversion mode as the encoding format of the target webpage.
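The decision rule above can be illustrated with a short Python sketch. The function names and the normalisation step are assumptions made for illustration: because the conversion-based check only distinguishes UTF-8 from "GBK or GB2312", the sketch treats GBK and GB2312 as one family when comparing the two results, which the source text does not spell out.

```python
def _family(name: str) -> str:
    """Normalise an encoding label for comparison (hypothetical helper)."""
    normalised = name.strip().lower()
    # Assumption: GBK and GB2312 are compared as a single family.
    if normalised in {"gbk", "gb2312", "gbk/gb2312"}:
        return "gb"
    return normalised


def resolve_encoding(by_field: str, by_conversion: str) -> str:
    """Combine the two detection results according to the judgment result.

    by_field:      result derived from the preset field contents.
    by_conversion: result derived from the character string conversion check.
    """
    if _family(by_field) == _family(by_conversion):
        # Same result: either may be used; the field-based label is
        # returned here because it is usually the more specific one.
        return by_field
    # Different results: the conversion-based result takes precedence.
    return by_conversion
```

For example, `resolve_encoding("gbk", "UTF-8")` returns the conversion-based result, reflecting the rule that a declared encoding is overridden when it disagrees with what the page bytes suggest.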
Determining the encoding format of the target webpage according to the URL and the preset field contents includes: extracting a preset target character string from the preset field contents; and determining the encoding format of the target webpage according to the extracted preset target character string and the URL. The device described herein may be a server, a PC, a tablet (PAD), a mobile phone, or the like.
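As a rough illustration of the field-based determination, the following Python sketch extracts a declared character set from a piece of field content. The passage does not say which field or which target character string is used, so this sketch assumes the field is an HTTP `Content-Type` header value or an HTML `<meta>` tag and that the target string is the `charset` declaration; the function name `charset_from_field` is hypothetical.

```python
import re
from typing import Optional

# Assumption: the "preset target character string" is a charset declaration
# such as `charset=gbk` or `charset="utf-8"`.
_CHARSET = re.compile(r"""charset\s*=\s*["']?([A-Za-z0-9._-]+)""", re.IGNORECASE)


def charset_from_field(field_content: str) -> Optional[str]:
    """Return the declared charset in lower case, or None if none is found."""
    match = _CHARSET.search(field_content)
    return match.group(1).lower() if match else None
```

The same function handles a header value such as `text/html; charset=GBK` and a tag such as `<meta charset="gb2312">`. A page can declare one encoding and be served in another, which is why the declared value is cross-checked against the conversion-based result.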
The present application also provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the following method steps: obtaining a uniform resource locator (URL), where the webpage corresponding to the URL is the target webpage; determining the encoding format of the target webpage according to the URL and preset field contents; determining the encoding format of the target webpage according to the URL and a character string conversion mode; judging whether the encoding format determined according to the URL and the preset field contents is the same as the encoding format determined according to the URL and the character string conversion mode; and determining the encoding format of the target webpage according to the judgment result.
Determining the encoding format of the target webpage according to the URL and the character string conversion mode includes: converting the target webpage into a page in character string format; converting the page in character string format into a byte stream using a first preset encoding format; converting the byte stream into a target character string using a second preset encoding format; and determining the encoding format of the target webpage according to whether the target character string contains characters of a preset format type.
The characters of the preset format type are Chinese characters. Determining the encoding format of the target webpage according to whether the target character string contains characters of the preset format type includes: if the target character string contains Chinese characters, determining that the encoding format of the target webpage is UTF-8; if the target character string does not contain Chinese characters, determining that the encoding format of the target webpage is GBK or GB2312.
Determining the encoding format of the target webpage according to the judgment result includes: if the judgment result is that the two encoding formats are the same, taking either the encoding format determined according to the URL and the preset field contents or the encoding format determined according to the URL and the preset character string conversion mode as the encoding format of the target webpage; if the judgment result is that they differ, taking the encoding format determined according to the URL and the preset character string conversion mode as the encoding format of the target webpage.
Determining the encoding format of the target webpage according to the URL and the preset field contents includes: extracting a preset target character string from the preset field contents; and determining the encoding format of the target webpage according to the extracted preset target character string and the URL.
Those skilled in the art should understand that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
Memory may include volatile memory in a computer-readable medium, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape and magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "include", "comprise", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, commodity, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of additional identical elements in the process, method, commodity, or device that includes the element.
The above are merely embodiments of the present application and are not intended to limit it. For those skilled in the art, various modifications and variations of the present application are possible. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present application shall fall within the scope of the claims of the present application.
Claims (10)
1. A method for determining a webpage encoding format, characterized by comprising:
obtaining a uniform resource locator (URL), wherein the webpage corresponding to the URL is a target webpage;
determining the encoding format of the target webpage according to the URL and preset field contents;
determining the encoding format of the target webpage according to the URL and a character string conversion mode, wherein a target character string of the target webpage is determined through the character string conversion mode;
judging whether the encoding format of the target webpage determined according to the URL and the preset field contents is the same as the encoding format of the target webpage determined according to the URL and the character string conversion mode;
determining the encoding format of the target webpage according to the judgment result.
2. The method according to claim 1, characterized in that determining the encoding format of the target webpage according to the URL and the character string conversion mode comprises:
converting the target webpage into a page in character string format;
converting the page in character string format into a byte stream using a first preset encoding format;
converting the byte stream into a target character string using a second preset encoding format;
determining the encoding format of the target webpage according to whether the target character string contains characters of a preset format type.
3. The method according to claim 2, characterized in that the characters of the preset format type are Chinese characters, and determining the encoding format of the target webpage according to whether the target character string contains characters of the preset format type comprises:
if the target character string contains Chinese characters, determining that the encoding format of the target webpage is UTF-8;
if the target character string does not contain Chinese characters, determining that the encoding format of the target webpage is GBK or GB2312.
4. The method according to claim 1, characterized in that determining the encoding format of the target webpage according to the judgment result comprises:
if the judgment result is that the two are the same, taking either the encoding format of the target webpage determined according to the URL and the preset field contents or the encoding format of the target webpage determined according to the URL and the preset character string conversion mode as the encoding format of the target webpage;
if the judgment result is that the two differ, taking the encoding format of the target webpage determined according to the URL and the preset character string conversion mode as the encoding format of the target webpage.
5. The method according to claim 1, characterized in that determining the encoding format of the target webpage according to the URL and the preset field contents comprises:
extracting a preset target character string from the preset field contents;
determining the encoding format of the target webpage according to the extracted preset target character string and the URL.
6. A device for determining a webpage encoding format, characterized by comprising:
an acquiring unit, configured to obtain a uniform resource locator (URL), wherein the webpage corresponding to the URL is a target webpage;
a first determining unit, configured to determine the encoding format of the target webpage according to the URL and preset field contents;
a second determining unit, configured to determine the encoding format of the target webpage according to the URL and a character string conversion mode;
a judging unit, configured to judge whether the encoding format of the target webpage determined according to the URL and the preset field contents is the same as the encoding format of the target webpage determined according to the URL and the character string conversion mode;
a third determining unit, configured to determine the encoding format of the target webpage according to the judgment result.
7. The device according to claim 6, characterized in that the second determining unit comprises:
a conversion module, configured to convert the target webpage into a page in character string format;
a first conversion module, configured to convert the page in character string format into a byte stream using a first preset encoding format;
a second conversion module, configured to convert the byte stream into a target character string using a second preset encoding format;
a determination module, configured to determine the encoding format of the target webpage according to whether the target character string contains characters of a preset format type.
8. The device according to claim 7, characterized in that the characters of the preset format type are Chinese characters, and the determination module comprises:
a first determining submodule, configured to determine that the encoding format of the target webpage is UTF-8 if the target character string contains Chinese characters;
a second determining submodule, configured to determine that the encoding format of the target webpage is GBK or GB2312 if the target character string does not contain Chinese characters.
9. A storage medium, characterized in that the storage medium includes a stored program, wherein the program executes the method for determining a webpage encoding format according to any one of claims 1 to 5.
10. A processor, characterized in that the processor is configured to run a program, wherein the program, when run, executes the method for determining a webpage encoding format according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710784883.3A CN110020343B (en) | 2017-09-01 | 2017-09-01 | Method and device for determining webpage coding format |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110020343A (en) | 2019-07-16 |
CN110020343B (en) | 2021-03-30 |
Family
ID=67186195
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710784883.3A Active CN110020343B (en) | 2017-09-01 | 2017-09-01 | Method and device for determining webpage coding format |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110020343B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101101606A (en) * | 2007-08-03 | 2008-01-09 | 中兴通讯股份有限公司 | Web page coding language automatic identification method and device for embedded type browser |
CN101526963A (en) * | 2009-04-17 | 2009-09-09 | 深圳华为通信技术有限公司 | Method for identifying web page coding, device and terminal equipment |
CN102360392A (en) * | 2011-10-24 | 2012-02-22 | 青岛海信移动通信技术股份有限公司 | Method and device for determining webpage encoding mode |
CN104361021A (en) * | 2014-10-21 | 2015-02-18 | 小米科技有限责任公司 | Webpage encoding identifying method and device |
CN104391993A (en) * | 2014-12-15 | 2015-03-04 | 浪潮(北京)电子信息产业有限公司 | Method and system for recognizing webpage codes |
CN106570044A (en) * | 2015-10-13 | 2017-04-19 | 北京国双科技有限公司 | Method and device for analyzing webpage code |
Non-Patent Citations (1)
Title |
---|
王璟琦: ""基于内容单元的网页解析与内容提取"", 《万方》 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113595683A (en) * | 2021-07-07 | 2021-11-02 | 西安震有信通科技有限公司 | Conversion processing method, device, terminal and medium based on various encoding files |
CN114615074A (en) * | 2022-03-25 | 2022-06-10 | 山石网科通信技术股份有限公司 | Network message decoding method, network attack detection method, device and storage medium |
CN114827113A (en) * | 2022-04-18 | 2022-07-29 | 阿里巴巴(中国)有限公司 | Webpage access method and device |
CN114827113B (en) * | 2022-04-18 | 2024-04-16 | 阿里巴巴(中国)有限公司 | Webpage access method and device |
Also Published As
Publication number | Publication date |
---|---|
CN110020343B (en) | 2021-03-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105574049B (en) | Page processing method, device and system for mobile application | |
CN110020353B (en) | Method and device for constructing webpage form | |
CN110020343A (en) | The determination method and apparatus of web page coding format | |
CN109086126B (en) | Task scheduling processing method and device, server, client and electronic equipment | |
CN109582948A (en) | The method and device that evaluated views extract | |
CN110264361A (en) | A kind of data analysis method and device of block chain | |
CN110321675A (en) | Generation, source tracing method and device based on webpage watermark | |
CN106909567B (en) | Data processing method and device | |
CN109683773A (en) | Corpus labeling method and device | |
CN110110198B (en) | Webpage information extraction method and device | |
CN105989126B (en) | A kind of Webpage display process and device | |
CN109062906A (en) | The interpretation method and device of program language resource | |
CN104978325B (en) | A kind of web page processing method, device and user terminal | |
CN105405002A (en) | Formula data configuration method and system based on SAP platform | |
CN108874379B (en) | Page processing method and device | |
CN109558548A (en) | A kind of method and Related product for eliminating CSS style redundancy | |
CN109582188A (en) | A kind of method, apparatus and relevant device for realizing the positioning of pop-up interior element | |
CN104346174A (en) | Method for describing and reproducing on-line vector diagram modeling process | |
CN112560403A (en) | Text processing method and device and electronic equipment | |
CN110232155A (en) | The information recommendation method and electronic equipment of browser interface | |
CN111209009A (en) | Content distribution method and device, storage medium and electronic equipment | |
CN107368557B (en) | Page editing method and device | |
CN115297042A (en) | Method for detecting consistency of web pages under different networks and related equipment | |
CN108228145A (en) | Data processing method, system and the mobile equipment of mixed type application program | |
CN109542401A (en) | A kind of Web development approach, device, storage medium and processor |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
CB02 | Change of applicant information | Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing; Applicant after: Beijing Guoshuang Technology Co.,Ltd.; Address before: 100086 Beijing city Haidian District Shuangyushu Area No. 76 Zhichun Road cuigongfandian 8 layer A; Applicant before: Beijing Guoshuang Technology Co.,Ltd. |
GR01 | Patent grant | |