CN106570044A - Method and device for analyzing webpage code - Google Patents

Publication number: CN106570044A
Authority: CN (China)
Application number: CN201510670507.2A (also the priority application)
Other versions: CN106570044B (en)
Legal status: Granted, active (the legal status is an assumption and is not a legal conclusion)
Other languages: Chinese (zh)
Inventor: 李可欣
Current Assignee: Beijing Gridsum Technology Co Ltd (the listed assignees may be inaccurate)
Original Assignee: Beijing Gridsum Technology Co Ltd
Application filed by: Beijing Gridsum Technology Co Ltd
Prior art keywords: coding information, webpage, web page, decoded, data

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method and a device for analyzing webpage encoding and relates to the field of Internet technology. It addresses the low efficiency of webpage information acquisition that results when a crawler system, during webpage parsing, must run complex statistical calculations on webpage data to guess the encoding the webpage actually uses. The method comprises: reading webpage response data from a webpage response packet; decoding the webpage response data segment by segment with preset encoding information, and judging whether webpage encoding information is recorded in the current data segment; if so, decoding the current data segment with the webpage encoding information and, when the current data segment is decoded completely, decoding the webpage response data with the webpage encoding information; if not, decoding another data segment with the preset encoding information and judging whether webpage encoding information is recorded in that data segment. The method and the device are mainly used by crawler systems to acquire webpage information in real time.

Description

Method and device for analyzing webpage encoding
Technical field
The present invention relates to the field of Internet technology, and in particular to a method and device for analyzing webpage encoding.
Background
A web crawler is a program or script that automatically captures webpage information according to certain rules. Webpage data is exchanged in binary form during network transmission, and the party that obtains the data must decode it with a specific encoding rule into a form that humans can read. Mainstream network transmission encapsulates webpages with the Hypertext Transfer Protocol (HTTP), and webpage data is organized and described with Hypertext Markup Language (HTML). The HTTP protocol provides an encoding field for the server to configure, but the protocol places no strict requirement on this field, and the server-side developers of some websites do not keep the encoding in this field consistent with the encoding used in the webpage. Likewise, the charset attribute of the meta tag in the HTML structure usually identifies the encoding the webpage uses, but some web developers do not fill in this attribute, and sometimes the value filled in does not match the encoding the webpage actually uses.
Although most encodings can correctly decode the encoded values of English characters, decoding the encoded values of Chinese characters places strict requirements on the encoding: only a specific encoding can decode them. For these reasons, a web crawler cannot determine which encoding to use to decode webpage data when it obtains the text of a webpage. At present, browsers on the market mainly rely on complex statistical algorithms over the webpage data to guess the encoding actually used, but in web crawler experiments the efficiency of these algorithms is insufficient for a crawler that must obtain webpage information with high real-time performance.
Summary of the invention
In view of this, the present invention proposes a method and device for analyzing webpage encoding. Its main purpose is to solve the problem that, when a crawler system parses webpages, it must perform complex statistical calculations on webpage data to guess the encoding the webpage actually uses, which makes webpage information acquisition inefficient.
According to a first aspect of the invention, the present invention proposes a method for analyzing webpage encoding, including:
reading webpage response data from a webpage response packet;
decoding the webpage response data segment by segment with preset encoding information, and judging whether webpage encoding information is recorded in the current data segment;
if the judgment result is yes, decoding the current data segment with the webpage encoding information and, when the current data segment is decoded completely, decoding the webpage response data with the webpage encoding information;
if the judgment result is no, decoding another data segment with the preset encoding information and judging whether webpage encoding information is recorded in it.
According to a second aspect of the invention, the present invention proposes a device for analyzing webpage encoding, including:
an acquiring unit, for reading webpage response data from a webpage response packet;
a judging unit, for decoding, segment by segment with preset encoding information, the webpage response data read by the acquiring unit, and judging whether webpage encoding information is recorded in the current data segment;
a processing unit, for decoding the current data segment with the webpage encoding information when the judgment result of the judging unit is yes and, when the current data segment is decoded completely, decoding the webpage response data with the webpage encoding information;
the processing unit is further used, when the judgment result of the judging unit is no, to decode another data segment with the preset encoding information, after which the judging unit judges whether webpage encoding information is recorded in it.
With the above technical solution, the method and device for analyzing webpage encoding provided by the embodiments of the present invention can read webpage response data from a webpage response packet, decode the webpage response data segment by segment with preset encoding information, and judge whether webpage encoding information is recorded in the current data segment. When the judgment result is yes, the current data segment is decoded with the webpage encoding information; if the current data segment is decoded completely, the webpage encoding information is the encoding the webpage actually uses, so the webpage response data is decoded with it and the webpage is saved. When the judgment result is no, another data segment is decoded with the preset encoding information and the judgment is repeated on it. Compared with the prior art, which guesses the encoding a webpage actually uses by running complex statistical algorithms on the webpage data, the present invention increases the speed at which a crawler system parses webpages and meets the requirement that a crawler system obtain webpage information with high real-time performance.
The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and practiced according to the content of the specification, and so that the above and other objects, features, and advantages of the invention become more apparent, specific embodiments of the invention are set out below.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art from the detailed description of the preferred embodiments below. The drawings serve only to illustrate the preferred embodiments and are not to be considered a limitation of the invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:
Fig. 1 shows a flow chart of a method for analyzing webpage encoding provided by an embodiment of the present invention;
Fig. 2 shows a block diagram of a device for analyzing webpage encoding provided by an embodiment of the present invention;
Fig. 3 shows a block diagram of another device for analyzing webpage encoding provided by an embodiment of the present invention.
Detailed description
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and so that its scope can be fully conveyed to those skilled in the art.
When a web crawler obtains webpage information, the encoding information recorded in the encoding field of the HTTP packet and the webpage encoding information identified by the meta tag in the HTML data structure may not match the encoding the webpage actually uses. In that case a crawler system usually cannot determine which encoding to use to decode the webpage data and can only guess the actual encoding by running complex statistical algorithms on the webpage data. The efficiency of these algorithms is insufficient for a web crawler that must obtain webpage information with high real-time performance.
To solve the problem that a crawler system parsing webpages must perform complex statistical calculations on webpage data to guess the encoding the webpage actually uses, which makes webpage information acquisition inefficient, an embodiment of the present invention provides a method for analyzing webpage encoding. As shown in Fig. 1, the method includes:
101. Read webpage response data from a webpage response packet.
Generally, when a web crawler obtains page information, it starts downloading webpage data from a specified Uniform Resource Locator (URL) address. During transmission, webpages are encapsulated with the Hypertext Transfer Protocol (HTTP), and webpage data is organized and described with Hypertext Markup Language (HTML). The web crawler sends a webpage request packet (HTTP request) to the server and, after the server responds to the request, receives the webpage response packet (HTTP response) that the server returns. The webpage response packet contains a status line, message headers, and response content. The status line reflects the server's response to the request sent by the web crawler. The response content is the response body, which is usually the HTML page data returned by the server, that is, the webpage response data of step 101.
102. Decode the webpage response data segment by segment with preset encoding information, and judge whether webpage encoding information is recorded in the current data segment.
Information is exchanged as a binary stream during network transmission, and the party that obtains the data must decode it with an encoding rule into a form that humans can read. Therefore, after step 101 reads the webpage response data from the webpage response packet, the webpage response data needs to be decoded with the preset encoding information. Decoding yields either part of the webpage response information (incomplete decoding) or all of it (complete decoding). Under normal circumstances any encoding can correctly decode English characters, so the English encoding field that records the webpage encoding information can be decoded, and it can thus be determined whether webpage encoding information is recorded in the webpage response data. If the encoding the webpage actually uses still cannot be determined after this processing, the webpage response data would have to be decoded again. To avoid the loss of processing resources caused by decoding the webpage response data repeatedly, the embodiment of the present invention performs step 102: the webpage response data is decoded segment by segment with the preset encoding information, and each current data segment is checked for webpage encoding information. If the encoding actually used still cannot be determined after the current data segment has been processed, processing continues with another data segment, which avoids the waste of processing resources caused by processing the whole webpage response data again.
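The observation that English markup survives decoding with a mismatched encoding can be checked with a short Python sketch. The sample bytes are made up for illustration. The property holds for ASCII-compatible encodings such as UTF-8, GB2312, and Latin-1; it does not hold for encodings such as UTF-16 or UTF-32, which is a limitation worth keeping in mind.

```python
# Illustrative bytes: an HTML meta tag followed by Chinese text, encoded as GB2312.
raw = '<meta charset="gb2312"><p>网页编码</p>'.encode("gb2312")

# Decode with "wrong" but ASCII-compatible preset encodings.
# errors="replace" keeps decoding going when a byte sequence is invalid.
as_utf8 = raw.decode("utf-8", errors="replace")
as_latin1 = raw.decode("latin-1")

# The ASCII markup is still readable in both results,
# while the Chinese text is damaged.
print('charset="gb2312"' in as_utf8)
print('charset="gb2312"' in as_latin1)
print("网页编码" in as_utf8)
```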
103. If the judgment result is yes, decode the current data segment with the webpage encoding information; when the current data segment is decoded completely, decode the webpage response data with the webpage encoding information.
After step 102 determines that webpage encoding information is recorded in the current data segment, the current data segment is decoded with that webpage encoding information to determine whether it can decode the segment completely. If it can, the webpage encoding information recorded in the current data segment can be taken as the encoding the webpage actually uses. Once the actual encoding is known, the webpage response data can be decoded with it to obtain webpage information that humans can read, and the webpage information can then be saved. If the segment cannot be decoded completely, the webpage encoding information recorded in the current data segment is not the encoding the webpage actually uses, and another data segment of the webpage response data must be decoded and analyzed with the preset encoding information.
104. If the judgment result is no, decode another data segment with the preset encoding information and judge whether webpage encoding information is recorded in it.
After step 102 determines that no webpage encoding information is recorded in the current data segment, another data segment is decoded with the preset encoding information and checked for webpage encoding information. If webpage encoding information is recorded, step 103 can be performed; if not, step 104 is performed again.
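Steps 101 to 104 can be sketched end to end in Python as follows. This is an illustration under stated assumptions, not the patented implementation: the names `detect_and_decode` and `_decodes_completely`, the regular expression, the split into five equal parts, and the reading of "decoded completely" as "a strict decode raises no error" are all choices made for this sketch. It also ignores the case where a meta tag straddles a segment boundary.

```python
import codecs
import re

# Hypothetical pattern for a charset declaration inside a <meta> tag.
META_CHARSET_RE = re.compile(
    r"""<meta\b[^>]*?\bcharset\s*=\s*["']?\s*([\w.:-]+)""", re.IGNORECASE)


def _decodes_completely(chunk: bytes, encoding: str) -> bool:
    """Assumption: 'decoded completely' means a strict decode raises no error."""
    try:
        decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
        # final=False tolerates a multi-byte character cut off at the chunk end.
        decoder.decode(chunk, final=False)
        return True
    except (LookupError, UnicodeDecodeError):
        return False


def detect_and_decode(body: bytes, preset: str = "utf-8", parts: int = 5):
    """Return (encoding, text), or (None, None) if no usable declaration is found."""
    size = max(1, -(-len(body) // parts))  # ceiling division
    for start in range(0, len(body), size):
        segment = body[start:start + size]
        # Step 102: decode the segment with the preset encoding.
        text = segment.decode(preset, errors="replace")
        match = META_CHARSET_RE.search(text)
        if match is None:
            continue  # Step 104: move on to another segment.
        declared = match.group(1).lower()
        # Step 103: verify the declared encoding on the current segment.
        if _decodes_completely(segment, declared):
            return declared, body.decode(declared, errors="replace")
    return None, None


page = ('<html><head><meta charset="gb2312"></head><body>'
        + "网页编码分析" * 20 + "</body></html>").encode("gb2312")
encoding, text = detect_and_decode(page, preset="utf-8")
print(encoding)
```

Returning `(None, None)` leaves the fallback policy to the caller, for example decoding with the preset encoding or handing the page to a statistical detector.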
The method for analyzing webpage encoding provided by the embodiment of the present invention can read webpage response data from a webpage response packet, decode the webpage response data segment by segment with preset encoding information, and judge whether webpage encoding information is recorded in the current data segment. When the judgment result is yes, the current data segment is decoded with the webpage encoding information; if the current data segment is decoded completely, the webpage encoding information is the encoding the webpage actually uses, so the webpage response data is decoded with it and the webpage is saved. When the judgment result is no, another data segment is decoded with the preset encoding information and the judgment is repeated on it. Compared with the prior art, which guesses the encoding a webpage actually uses by running complex statistical algorithms on the webpage data, the present invention increases the speed at which a crawler system parses webpages and meets the requirement that a crawler system obtain webpage information with high real-time performance.
The webpage response data contained in a webpage response packet, that is, the response body, is usually HTML structure data. To make the method shown in Fig. 1 easier to understand, the embodiment of the present invention takes the HTTP response as the webpage response packet and HTML structure data as the webpage response data, and describes each step of Fig. 1 in detail.
Reading webpage response data from the webpage response packet, that is, reading HTML structure data from the HTTP response, is the process in which the web crawler captures the HTTP response of a webpage and obtains the entire HTML data of the webpage from it. The binary data of the webpage then needs to be decoded into character strings for display.
Before the HTML structure data is decoded segment by segment with the preset encoding information, the preset encoding information must be obtained. Specifically, the encoding field is looked up and read from the HTTP response, and it is judged whether response packet encoding information is recorded in that field. If it is, the response packet encoding information is used as the preset encoding information. The response packet encoding information recorded in the encoding field of the HTTP response is the character set used by the HTML structure data; a character set usually contains the commonly used characters and specifies the encoding rules for them. For example, for a Chinese webpage, the encoding field "charset=GB2312" in the HTTP response shows that the character set used by the webpage data is GB2312. GB2312 is the Chinese coded character set for information interchange, a specification for encoding Chinese characters. With the encoding information GB2312 recorded in the encoding field of the HTTP response, the binary webpage data can be decoded into Chinese character strings. In addition, the field "Content-Type: text/html" in the HTTP response shows that the webpage data is plain-text HTML structure data. If no response packet encoding information is recorded in the encoding field of the HTTP response, the party obtaining the HTML structure data can use default encoding information as the preset encoding information, for example the Unicode encoding UTF-8.
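Choosing the preset encoding from the response header can be sketched in Python with the standard library's header parser. The function name `preset_encoding` and the constant `DEFAULT_ENCODING` are hypothetical; the UTF-8 fallback follows the example given in the text.

```python
from email.message import Message

DEFAULT_ENCODING = "utf-8"  # default encoding information, as in the text's example


def preset_encoding(content_type_header):
    """Return the charset declared in a Content-Type header value, or the default."""
    if not content_type_header:
        return DEFAULT_ENCODING
    msg = Message()
    msg["Content-Type"] = content_type_header
    # get_content_charset() returns the lowercased charset parameter, or None.
    return msg.get_content_charset() or DEFAULT_ENCODING


print(preset_encoding("text/html; charset=GB2312"))
print(preset_encoding("text/html"))
```

A charset name returned this way is only a candidate: a server may declare a name that Python's codec registry does not know, so callers may want to guard the later decode against `LookupError`.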
After the HTML structure data and the preset encoding information have been obtained, the encoding information declared by the webpage file must be obtained from the HTML structure data, so the HTML structure data must be decoded with the preset encoding information. To improve decoding efficiency, and to avoid the loss of processing resources that would result from decoding the whole HTML structure data again whenever decoding has to be repeated, the embodiment of the present invention can first divide the HTML structure data into segments according to a preset segmentation rule and then decode the webpage response data segment by segment from the beginning with the preset encoding information. As an optional implementation, the HTML structure data can be segmented under a preset rule that makes every 20% of the data one data segment, and decoding with the preset encoding information starts from the first 20% segment and proceeds segment by segment. The HTML structure data can of course also be segmented in other proportions.
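The 20% segmentation rule can be sketched in Python as a split of the response bytes into five roughly equal parts. The function name `split_segments` and the `parts` parameter are hypothetical. Because the split is by byte count, a boundary may fall inside a multi-byte character, so a caller would need a tolerant decode (for example `errors="replace"`) or an incremental decoder; the source text does not discuss this detail.

```python
def split_segments(data: bytes, parts: int = 5):
    """Split data into `parts` roughly equal byte segments (20% each by default)."""
    if not data:
        return []
    size = max(1, -(-len(data) // parts))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


html = b"<html><head><title>t</title></head><body>0123456789</body></html>"
segments = split_segments(html)
print(len(segments), [len(s) for s in segments])
```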
After the HTML structure data has been segmented, it is decoded segment by segment from the beginning with the preset encoding information, and each current data segment is checked for webpage encoding information. This process includes reading the encoding field in the current data segment and judging whether webpage encoding information is recorded in that field.
In practice, the encoding information declared by the webpage file is usually expressed with a meta tag in the HTML structure data, and this encoding information is generally recorded in the first half of the data. Therefore, after the HTML structure data has been segmented according to the preset segmentation rule, it is decoded segment by segment from the beginning with the preset encoding information, and the current data segment is searched for a meta tag; if one is found, it is read to see whether it identifies webpage encoding information. For example, <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/> is a meta tag with the http-equiv attribute in HTML structure data, in which the encoding field charset records the webpage encoding information utf-8. Because the tags and attribute information of HTML structure data are English text, the tags and attributes can be decoded and analyzed correctly whatever type of encoding the preset encoding information is, and the webpage encoding information can be obtained.
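Looking for the charset declaration in a decoded segment can be sketched in Python with a regular expression. The pattern and the name `find_declared_encoding` are assumptions made for this illustration; a production crawler might use an HTML parser, and this pattern only covers the common `charset=...` spellings inside a meta tag.

```python
import re

# Matches both <meta charset="..."> and the http-equiv form with
# content="text/html; charset=...". Hypothetical pattern for illustration.
META_CHARSET_RE = re.compile(
    r"""<meta\b[^>]*?\bcharset\s*=\s*["']?\s*([\w.:-]+)""", re.IGNORECASE)


def find_declared_encoding(segment_text: str):
    """Return the lowercased charset named in a meta tag, or None if absent."""
    match = META_CHARSET_RE.search(segment_text)
    return match.group(1).lower() if match else None


sample = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>'
print(find_declared_encoding(sample))
```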
After it has been determined in the above way that webpage encoding information is recorded in the current data segment, two cases arise. If the webpage encoding information differs from the preset encoding information, the current data segment must be decoded again with the webpage encoding information. If the segment can be decoded completely, the webpage encoding information can be taken as the encoding the webpage actually uses; the other data segments can then be decoded with it, and the webpage information obtained after decoding is saved. If the webpage encoding information is the same as the preset encoding information, this alone does not show that it is the encoding the webpage actually uses. To obtain the correct webpage encoding information, the current data segment likewise needs to be decoded again with the webpage encoding information; if the segment can be decoded completely, the webpage encoding information can be taken as the encoding the webpage actually uses.
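The text does not define "decoded completely" operationally. One plausible reading, used in the Python sketch below, is that a strict decode of the segment raises no error. The function name `decodes_completely` is hypothetical; the incremental decoder with `final=False` is used so that a multi-byte character cut off at the end of the segment is not counted as a failure.

```python
import codecs


def decodes_completely(segment: bytes, encoding: str) -> bool:
    """Return True if `segment` decodes under `encoding` without any error.

    Assumption: this strict check is one interpretation of 'decoded completely'.
    """
    try:
        decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
        # final=False: an incomplete trailing byte sequence is buffered, not rejected.
        decoder.decode(segment, final=False)
    except (LookupError, UnicodeDecodeError):
        # LookupError: the declared name is not a known codec.
        return False
    return True


data = "编码".encode("gb2312")
print(decodes_completely(data, "gb2312"), decodes_completely(data, "utf-8"))
```

A segment that does not start at the beginning of the data may begin in the middle of a multi-byte character, in which case this strict check can fail even for the correct encoding; the check is most reliable on the first segment.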
After it has been determined in the above way that no webpage encoding information is recorded in the current data segment, the next data segment adjacent to the current one is decoded with the preset encoding information and checked for webpage encoding information. If webpage encoding information is found, it is handled in the same way as described above for the case in which webpage encoding information is recorded in the current data segment, until the encoding the webpage actually uses has been determined.
In addition, if the preset encoding information decodes the current data segment completely and the webpage encoding information recorded in the current data segment is the same as the preset encoding information, there is no need to decode the current data segment again with the recorded webpage encoding information. The preset encoding information is determined to be the encoding the webpage actually uses, the other data segments of the webpage response data can be decoded with it, and the webpage information obtained after decoding is saved.
The embodiment of the present invention decodes the webpage response data segment by segment with the preset encoding information, obtains the webpage encoding information recorded in the encoding field of the webpage data, and compares the preset encoding information with the webpage encoding information. This finds a balance point between exact matching and rough matching of encoding information: by sacrificing some matching accuracy, the crawler system gains the efficiency needed to obtain webpage information with high real-time performance. It also avoids the loss of processing resources caused by repeatedly decoding the whole webpage data when the encoding information in the webpage response packet is inconsistent with the encoding information declared in the webpage data.
As an implementation of the method shown in Fig. 1, an embodiment of the present invention provides a device for analyzing webpage encoding. As shown in Fig. 2, the device includes an acquiring unit 21, a judging unit 22, and a processing unit 23, wherein:
the acquiring unit 21 is used to read webpage response data from a webpage response packet;
the judging unit 22 is used to decode, segment by segment with preset encoding information, the webpage response data read by the acquiring unit 21, and to judge whether webpage encoding information is recorded in the current data segment;
the processing unit 23 is used, when the judgment result of the judging unit 22 is yes, to decode the current data segment with the webpage encoding information and, when the current data segment is decoded completely, to decode the webpage response data with the webpage encoding information;
the processing unit 23 is further used, when the judgment result of the judging unit 22 is no, to decode another data segment with the preset encoding information, after which the judging unit 22 judges whether webpage encoding information is recorded in it.
Further, as shown in Fig. 3, the acquiring unit 21 is also used to obtain the preset encoding information. The acquiring unit 21 includes:
a read module 211, for reading the encoding field in the webpage response packet;
a judge module 212, for judging whether response packet encoding information is recorded in the encoding field read by the read module 211;
a determining module 213, for determining the response packet encoding information to be the preset encoding information when the judge module 212 judges that response packet encoding information is recorded in the encoding field;
the determining module 213 is further used to determine default encoding information to be the preset encoding information when the judge module 212 judges that no response packet encoding information is recorded in the encoding field.
Further, the judging unit 22 includes:
a segmentation module 221, for segmenting the webpage response data according to a preset segmentation rule;
a decoder module 222, for decoding the webpage response data segment by segment from the beginning with the preset encoding information.
Further, the judging unit 22 also includes:
a read module 223, for reading the encoding field in the current data segment and judging whether webpage encoding information is recorded in the encoding field.
The device for analyzing webpage encoding provided by the embodiment of the present invention can read webpage response data from a webpage response packet, decode the webpage response data segment by segment with preset encoding information, and judge whether webpage encoding information is recorded in the current data segment. When the judgment result is yes, the current data segment is decoded with the webpage encoding information; if the current data segment is decoded completely, the webpage encoding information is the encoding the webpage actually uses, so the webpage response data is decoded with it and the webpage is saved. When the judgment result is no, another data segment is decoded with the preset encoding information and the judgment is repeated on it. Compared with the prior art, which guesses the encoding a webpage actually uses by running complex statistical algorithms on the webpage data, the present invention increases the speed at which a crawler system parses webpages and meets the requirement that a crawler system obtain webpage information with high real-time performance.
In addition, the embodiment of the present invention decodes the webpage response data segment by segment with the preset encoding information, obtains the webpage encoding information recorded in the encoding field of the webpage data, and compares the preset encoding information with the webpage encoding information. This finds a balance point between exact matching and rough matching of encoding information: by sacrificing some matching accuracy, the crawler system gains the efficiency needed to obtain webpage information with high real-time performance. It also avoids the loss of processing resources caused by repeatedly decoding the whole webpage data when the encoding information in the webpage response packet is inconsistent with the encoding information declared in the webpage data.
In the above-described embodiments, the description to each embodiment all emphasizes particularly on different fields, and does not have in certain embodiment The part being described in detail, may refer to the associated description of other embodiment.
It is understood that said method and the correlated characteristic in device mutually can be referred to.In addition, " first ", " second " in above-described embodiment etc. is, for distinguishing each embodiment, and not represent The quality of each embodiment.
Those skilled in the art can be understood that, for convenience and simplicity of description, above-mentioned The specific work process of the system, apparatus, and unit of description, may be referred in preceding method embodiment Corresponding process, will not be described here.
Provided herein algorithm and show not with any certain computer, virtual system or miscellaneous equipment It is intrinsic related.Various general-purpose systems can also be used together based on teaching in this.According to above Description, the structure constructed required by this kind of system is obvious.Additionally, the present invention is also not for Any certain programmed language.It is understood that, it is possible to use various programming languages realize described here The content of invention, and the description done to language-specific above is for the optimal reality for disclosing the present invention Apply mode.
In description mentioned herein, a large amount of details are illustrated.It is to be appreciated, however, that Embodiments of the invention can be put into practice in the case where not having these details.In some instances, Known method, structure and technology are not been shown in detail, so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid understanding of one or more of the various inventive aspects, features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments. This method of disclosure, however, should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single embodiment disclosed above. Thus, the claims following the detailed description are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will appreciate that the modules in the device of an embodiment may be changed adaptively and arranged in one or more devices different from that embodiment. The modules, units, or components of the embodiments may be combined into one module, unit, or component, and may additionally be divided into multiple sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings), and all processes or units of any method or device so disclosed, may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.
Furthermore, those skilled in the art will understand that although some embodiments described herein include some features that are included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the title of the invention (for example, a device for determining the level of links within a website) according to embodiments of the invention. The invention may also be implemented as device or apparatus programs (for example, computer programs and computer program products) for performing part or all of the method described herein. Such a program implementing the invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices may be embodied by one and the same item of hardware. The use of the words first, second, and third does not indicate any order; these words may be interpreted as names.

Claims (10)

1. A method for parsing webpage encoding, characterized in that the method comprises:
reading webpage response data from a webpage response packet;
decoding the webpage response data segment by segment using preset encoding information, and judging whether webpage encoding information is recorded in the current data segment;
if the judgment result is yes, decoding the current data segment using the webpage encoding information and, when the current data segment has been completely decoded, decoding the webpage response data using the webpage encoding information;
if the judgment result is no, decoding another data segment using the preset encoding information, and judging whether webpage encoding information is recorded therein.
2. The method according to claim 1, characterized in that, before decoding the webpage response data segment by segment using the preset encoding information, the method further comprises:
obtaining the preset encoding information, wherein obtaining the preset encoding information comprises:
reading the encoding field in the webpage response packet;
judging whether response packet encoding information is recorded in the encoding field;
if response packet encoding information is recorded, using the response packet encoding information as the preset encoding information;
if no response packet encoding information is recorded, using default encoding information as the preset encoding information.
3. The method according to claim 1, characterized in that decoding the webpage response data segment by segment using the preset encoding information comprises:
segmenting the webpage response data according to a preset segmentation rule, and decoding the webpage response data segment by segment from the beginning using the preset encoding information.
4. The method according to claim 1, characterized in that judging whether webpage encoding information is recorded in the current data segment comprises:
reading the encoding field in the current data segment;
judging whether webpage encoding information is recorded in the encoding field.
5. The method according to claim 1 or 4, characterized in that decoding the webpage response data using the webpage encoding information when the current data segment has been completely decoded comprises:
decoding the other data segments using the webpage encoding information.
6. The method according to claim 1, characterized in that, after decoding the webpage response data segment by segment using the preset encoding information and judging whether webpage encoding information is recorded in the current data segment, the method further comprises:
if the current data segment has been completely decoded using the preset encoding information, and the webpage encoding information recorded in the current data segment is the same as the preset encoding information, decoding the webpage response data using the preset encoding information.
7. A device for parsing webpage encoding, characterized in that the device comprises:
an acquiring unit, configured to read webpage response data from a webpage response packet;
a judging unit, configured to decode, segment by segment using preset encoding information, the webpage response data read by the acquiring unit, and to judge whether webpage encoding information is recorded in the current data segment;
a processing unit, configured to, when the judgment result of the judging unit is yes, decode the current data segment using the webpage encoding information and, when the current data segment has been completely decoded, decode the webpage response data using the webpage encoding information;
the processing unit being further configured to, when the judgment result of the judging unit is no, decode another data segment using the preset encoding information, the judging unit then judging whether webpage encoding information is recorded therein.
8. The device according to claim 7, characterized in that the acquiring unit is further configured to obtain the preset encoding information, and the acquiring unit comprises:
a reading module, configured to read the encoding field in the webpage response packet;
a judging module, configured to judge whether response packet encoding information is recorded in the encoding field read by the reading module;
a determining module, configured to determine the response packet encoding information as the preset encoding information when the judging module judges that response packet encoding information is recorded in the encoding field;
the determining module being further configured to determine default encoding information as the preset encoding information when the judging module judges that no response packet encoding information is recorded in the encoding field.
9. The device according to claim 7, characterized in that the judging unit comprises:
a segmentation module, configured to segment the webpage response data according to a preset segmentation rule;
a decoding module, configured to decode the webpage response data segment by segment from the beginning using the preset encoding information.
10. The device according to claim 7, characterized in that the judging unit further comprises:
a reading module, configured to read the encoding field in the current data segment and to judge whether webpage encoding information is recorded in the encoding field.
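
The selection of the preset encoding information recited in claims 2 and 8 can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: it assumes the encoding field of the response packet is the `charset` parameter of an HTTP `Content-Type` header, and the function name `preset_encoding` and the UTF-8 default are hypothetical, since the claims do not name a default.

```python
from email.message import Message

DEFAULT_ENCODING = "utf-8"  # assumed default; the claims do not name one


def preset_encoding(content_type, default=DEFAULT_ENCODING):
    """Pick the preset encoding from a Content-Type header value.

    Returns the charset recorded in the header if there is one,
    otherwise the default encoding.
    """
    if not content_type:
        return default  # header missing: nothing recorded
    msg = Message()
    msg["Content-Type"] = content_type
    # get_param returns None when the charset parameter is absent.
    charset = msg.get_param("charset")
    if isinstance(charset, str) and charset.strip():
        return charset.strip().lower()
    return default
```

For example, a header value of `text/html; charset=GBK` yields `gbk`, while a bare `text/html` or a missing header yields the default.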
CN201510670507.2A 2015-10-13 2015-10-13 Method and device for analyzing webpage codes Active CN106570044B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510670507.2A CN106570044B (en) 2015-10-13 2015-10-13 Method and device for analyzing webpage codes

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510670507.2A CN106570044B (en) 2015-10-13 2015-10-13 Method and device for analyzing webpage codes

Publications (2)

Publication Number Publication Date
CN106570044A true CN106570044A (en) 2017-04-19
CN106570044B CN106570044B (en) 2019-12-24

Family

ID=58508827

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510670507.2A Active CN106570044B (en) 2015-10-13 2015-10-13 Method and device for analyzing webpage codes

Country Status (1)

Country Link
CN (1) CN106570044B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6952425B1 (en) * 2000-11-14 2005-10-04 Cisco Technology, Inc. Packet data analysis with efficient and flexible parsing capabilities
CN101101606A (en) * 2007-08-03 2008-01-09 中兴通讯股份有限公司 Web page coding language automatic identification method and device for embedded type browser
CN101526963A (en) * 2009-04-17 2009-09-09 深圳华为通信技术有限公司 Method for identifying web page coding, device and terminal equipment
CN103207877A (en) * 2012-01-17 2013-07-17 阿里巴巴集团控股有限公司 Decoding method and device
CN103443741A (en) * 2011-02-07 2013-12-11 黑莓有限公司 Method and apparatus for receiving presentation metadata
CN103870487A (en) * 2012-12-13 2014-06-18 腾讯科技(深圳)有限公司 Webpage file processing method and mobile terminal

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王姝文: "嵌入式浏览器跨平台服务组件研究与设计", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110020343A (en) * 2017-09-01 2019-07-16 北京国双科技有限公司 The determination method and apparatus of web page coding format
CN110020343B (en) * 2017-09-01 2021-03-30 北京国双科技有限公司 Method and device for determining webpage coding format

Also Published As

Publication number Publication date
CN106570044B (en) 2019-12-24

Similar Documents

Publication Publication Date Title
US10567407B2 (en) Method and system for detecting malicious web addresses
CN104881603B (en) Webpage redirects leak detection method and device
CN104866512B (en) Extract the method, apparatus and system of web page contents
CN104104649B (en) The method of page login, apparatus and system
JP6203374B2 (en) Web page style address integration
CN101526963A (en) Method for identifying web page coding, device and terminal equipment
CN104320679B (en) A kind of user information acquiring method and server based on HLS protocol
CN108334508B (en) Webpage information extraction method and device
CN107153716B (en) Webpage content extraction method and device
CN103416073B (en) For providing the method and apparatus of the feedback about the process to video content
CN103546505A (en) Method, system and device for displaying page blocks in priority order
US20130007274A1 (en) Method for Analyzing Browsing and Device for Implementing the Method
US20210064453A1 (en) Automated application programming interface (api) specification construction
CN111104587A (en) Webpage display method and device and server
CN110286917A (en) File packing method, device, equipment and storage medium
CN107239970A (en) A kind of Behavior-based control daily record determines the method and system of ad click rate
CN103207877B (en) Coding/decoding method and device
CN110851136A (en) Data acquisition method and device, electronic equipment and storage medium
CN110381363A (en) Video encoding/decoding method, device, server and storage medium
CN105630927A (en) Link generation method and apparatus
CN104978325B (en) A kind of web page processing method, device and user terminal
CN103825772A (en) Method for identifying user click behavior and gateway equipment
CN102681996B (en) Pre-head method and device
CN106570044A (en) Method and device for analyzing webpage code
CN107368484A (en) Compression method and device, the acquisition methods and device of the static resource file of webpage

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Applicant after: Beijing Guoshuang Technology Co.,Ltd.

Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing

Applicant before: Beijing Guoshuang Technology Co.,Ltd.

GR01 Patent grant