CN104978325B - Web page processing method, device and user terminal

Publication number: CN104978325B
Application number: CN201410133677.2A
Authority: CN (China)
Prior art keywords: page, initial data, resource file, webpage, archived
Legal status: Active (the legal status is an assumption and is not a legal conclusion)
Other languages: Chinese (zh)
Other versions: CN104978325A (en)
Inventor: 王文涛
Current Assignee: Shenzhen Yayue Technology Co ltd (the listed assignees may be inaccurate)
Original Assignee: Tencent Technology Shenzhen Co Ltd
Application filed by: Tencent Technology Shenzhen Co Ltd
Priority to: CN201410133677.2A
Publication of CN104978325A
Application granted
Publication of CN104978325B

Landscapes

  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the invention disclose a web page processing method, device and user terminal. The method comprises: obtaining the page initial data of a webpage to be archived, and obtaining the code identification of that page initial data; parsing the page initial data of the webpage to be archived to determine each link page associated with the webpage to be archived, and obtaining the page initial data and code identification of each associated link page; encoding the page initial data and code identification of the webpage to be archived to obtain a primary resource file, and encoding the page initial data and code identification of each link page to obtain child resource files; and encapsulating the primary resource file and each child resource file into a cluster web page document. The invention obtains cluster web page documents for all kinds of webpages more effectively and completely, and meets users' demand for automated, intelligent processing of cluster web page documents.

Description

Web page processing method, device and user terminal
Technical field
The present invention relates to the technical field of computer website applications, and in particular to a web page processing method, device and user terminal.
Background
An MHT file, also called a cluster web page HTML document or single-file web page, stores a webpage containing one or more elements (for example pictures, Flash animations or short videos) as a single file with the extension .mht; a file in this format is called an MHT file for short. This makes it more convenient for users to save and manage web page content.
Existing MHT implementations generally archive only the page initial data of the current webpage. If the current webpage also includes link pages for other elements, such as the pictures and animations attached to some webpages, archiving may fail, or garbled characters may appear when the MHT file is opened.
Summary of the invention
The technical problem to be solved by the embodiments of the invention is to provide a web page processing method, device and user terminal that can obtain cluster web page documents for all kinds of webpages more effectively and completely.
To solve the above technical problem, an embodiment of the invention provides a web page processing method, comprising:
Obtaining the page initial data of a webpage to be archived, and obtaining the code identification of that page initial data;
Parsing the page initial data of the webpage to be archived to determine each link page associated with the webpage to be archived, and obtaining the page initial data and code identification of each associated link page;
Encoding the page initial data and code identification of the webpage to be archived to obtain a primary resource file, and encoding the page initial data and code identification of each link page to obtain child resource files;
Encapsulating the primary resource file and each child resource file into a cluster web page document.
An embodiment of the invention also provides another web page processing method, comprising:
According to the boundary marker in the header information read from the cluster web page document, splitting the cluster web page document to obtain the primary resource file and each child resource file;
Decoding the primary resource file to obtain the page initial data of the webpage to be archived, and decoding each child resource file in turn to obtain the page initial data of each link page;
Naming and storing the page initial data corresponding to each decoded child resource file according to a preset local file naming rule;
According to the storage address of the page initial data corresponding to each child resource file, successively modifying the corresponding link addresses in the decoded page initial data of the webpage to be archived into local link addresses, and storing the page initial data of the webpage to be archived, with its link addresses modified, as a web page file.
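The decoding steps above can be sketched in Python with the standard `email` parser, which splits a multipart document on the boundary declared in its header. This is a minimal sketch under stated assumptions, not the patented implementation: the function name `unpack_mht`, the `resource_N.ext` naming rule and the sample document are hypothetical, and the sketch assumes each link address appears in the main page exactly as in the part's Content-Location.

```python
import email

def unpack_mht(mht_text):
    """Split an MHT document and return (html, files) with links made local."""
    message = email.message_from_string(mht_text)
    parts = message.get_payload()          # parts split on the header's boundary
    primary, children = parts[0], parts[1:]
    charset = primary.get_content_charset() or "utf-8"
    html = primary.get_payload(decode=True).decode(charset)
    files = {}
    for index, part in enumerate(children):
        location = part["Content-Location"]
        extension = part.get_content_subtype()          # "png" for image/png
        local_name = f"resource_{index}.{extension}"    # hypothetical naming rule
        files[local_name] = part.get_payload(decode=True)
        html = html.replace(location, local_name)       # point the link at the local file
    return html, files

SAMPLE = (
    'MIME-Version: 1.0\n'
    'Content-Type: multipart/related; type="text/html"; '
    'boundary="----=_NextPart_000_00"\n'
    '\n'
    '------=_NextPart_000_00\n'
    'Content-Type: text/html; charset="utf-8"\n'
    'Content-Transfer-Encoding: quoted-printable\n'
    'Content-Location: http://example.com/\n'
    '\n'
    '<img src=3D"http://example.com/logo.png">\n'
    '------=_NextPart_000_00\n'
    'Content-Type: image/png\n'
    'Content-Transfer-Encoding: base64\n'
    'Content-Location: http://example.com/logo.png\n'
    '\n'
    'iVBORw==\n'
    '------=_NextPart_000_00--\n'
)

html, files = unpack_mht(SAMPLE)
print(html.strip())
print(sorted(files))
```

A real decoder would also have to map relative link addresses in the page to the absolute Content-Location values before replacing them.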
Correspondingly, an embodiment of the invention also provides a web page processing device, comprising:
An obtaining module, configured to obtain the page initial data of a webpage to be archived and to obtain the code identification of that page initial data;
A parsing module, configured to parse the page initial data of the webpage to be archived to determine each link page associated with the webpage to be archived, and to obtain the page initial data and code identification of each associated link page;
A coding module, configured to encode the page initial data and code identification of the webpage to be archived to obtain a primary resource file, and to encode the page initial data and code identification of each link page to obtain child resource files;
An archiving module, configured to encapsulate the primary resource file and each child resource file into a cluster web page document.
An embodiment of the invention also provides another web page processing device, comprising:
A segmentation module, configured to split the cluster web page document, according to the boundary marker in the header information read from it, to obtain the primary resource file and each child resource file;
A decoding module, configured to decode the primary resource file to obtain the page initial data of the webpage to be archived, and to decode each child resource file in turn to obtain the page initial data of each link page;
A child resource processing module, configured to name and store the page initial data corresponding to each decoded child resource file according to a preset local file naming rule;
A storage module, configured to successively modify, according to the storage address of the page initial data corresponding to each child resource file, the corresponding link addresses in the decoded page initial data of the webpage to be archived into local link addresses, and to store the page initial data of the webpage to be archived, with its link addresses modified, as a web page file.
Correspondingly, an embodiment of the invention provides a user terminal, comprising a processor and a memory;
The processor is configured to: obtain the page initial data of a webpage to be archived, and obtain the code identification of that page initial data; parse the page initial data of the webpage to be archived to determine each link page associated with the webpage to be archived, and obtain the page initial data and code identification of each associated link page; encode the page initial data and code identification of the webpage to be archived to obtain a primary resource file, and encode the page initial data and code identification of each link page to obtain child resource files; and encapsulate the primary resource file and each child resource file into a cluster web page document;
The memory is configured to store the encapsulated cluster web page document.
An embodiment of the invention provides another user terminal, comprising a processor, a memory and a display;
The memory is configured to store the encapsulated cluster web page document;
The processor is configured to: split the cluster web page document, according to the boundary marker in the header information read from it, to obtain the primary resource file and each child resource file; decode the primary resource file to obtain the page initial data of the webpage to be archived, and decode each child resource file in turn to obtain the page initial data of each link page; name and store the page initial data corresponding to each decoded child resource file according to a preset local file naming rule; and successively modify, according to the storage address of the page initial data corresponding to each child resource file, the corresponding link addresses in the decoded page initial data of the webpage to be archived into local link addresses, and store the page initial data of the webpage to be archived, with its link addresses modified, as a web page file;
The display is configured to show the page that the processor parses and opens, which comprises the page initial data of the webpage to be archived and the page initial data obtained by decoding the child resource files.
The embodiments of the invention process the page initial data of the webpage to be archived to obtain the primary resource file of the cluster web page document, obtain the page initial data of each link page of the webpage to be archived from that page initial data, and process the page initial data of the link pages to obtain the child resource files of the cluster web page document, which are finally archived into the cluster web page document. The data of the webpage to be archived is thereby archived more comprehensively and accurately, so that cluster web page documents for all kinds of webpages are obtained more effectively and completely. On decoding, all data of the webpage to be archived can be recovered in full, avoiding archiving errors and garbled characters when the cluster web page document is opened, and meeting users' demand for automated, intelligent processing of cluster web page documents.
Brief description of the drawings
To explain the technical solutions of the embodiments of the invention or of the prior art more clearly, the drawings needed to describe the embodiments or the prior art are briefly introduced below. The drawings described below are evidently only some embodiments of the invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow diagram of a web page processing method of an embodiment of the invention;
Fig. 2 is a flow diagram of another web page processing method of an embodiment of the invention;
Fig. 3 is a flow diagram of a further web page processing method of an embodiment of the invention;
Fig. 4 is a schematic diagram of how the decoded files corresponding to the primary resource file and child resource files are stored in an embodiment of the invention;
Fig. 5 is a flow diagram of yet another web page processing method of an embodiment of the invention;
Fig. 6 is a structural schematic diagram of a web page processing device of an embodiment of the invention;
Fig. 7 is a structural schematic diagram of one implementation of the parsing module in Fig. 6;
Fig. 8 is a structural schematic diagram of one implementation of the coding module in Fig. 6;
Fig. 9 is a structural schematic diagram of one implementation of the archiving module in Fig. 6;
Fig. 10 is a structural schematic diagram of a user terminal of an embodiment of the invention;
Fig. 11 is a structural schematic diagram of another web page processing device of an embodiment of the invention;
Fig. 12 is a structural schematic diagram of one implementation of the segmentation module in Fig. 11;
Fig. 13 is a structural schematic diagram of another user terminal of an embodiment of the invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings. The described embodiments are evidently only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.
The embodiments of the invention obtain the page initial data of a webpage to be archived, obtain all associated link pages from that page initial data, and further obtain the page initial data of all those link pages. The page initial data obtained is then processed and archived into a cluster web page document. Webpages of all types can thus be saved effectively as cluster web page documents, and the corresponding page content can be displayed correctly and completely when the cluster web page document is later opened.
Refer to Fig. 1, a flow diagram of a web page processing method of an embodiment of the invention. The method can be applied in user terminals with internet browsing functions, such as mobile phones, tablet computers, PCs and smart wearable devices. Specifically, the method comprises:
S101: Obtain the page initial data of the webpage to be archived, and obtain the code identification of that page initial data.
The page initial data of the webpage to be archived mainly includes the source code data of the page. The webpage to be archived may be the main page that the browser opens after the user enters a webpage link address (URL, Uniform Resource Locator) in the browser, in which case the terminal reads the page initial data, including the source code data, directly from the opened main page. Alternatively, when the user wants to archive a webpage, the user may enter its link address in a webpage link address input box; the terminal then automatically locates the corresponding webpage on the corresponding server according to that link address and pulls its page initial data.
The code identification can be obtained from the page initial data. It comprises the character set identifier, which indicates how the page initial data is encoded, and the content type. Page initial data generally includes content-type and charset entries, from which the code identification can be derived. For example, if some page initial data contains meta http-equiv="content-type" content="text/html;charset=utf-8", the code identification of that page initial data is "text/html" (text page) and "utf-8"; that is, its content type is "text/html" and its character set is "utf-8".
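As an illustration of reading the code identification from a meta tag, the following Python sketch extracts the content type and character set. It is only a sketch: the helper name `code_identification` and the fallback values are assumptions, and the pattern only handles tags in which http-equiv precedes content.

```python
import re

# Hypothetical helper pattern: captures the content attribute of a
# <meta http-equiv="content-type" ...> tag.
META_RE = re.compile(
    r"""<meta[^>]+http-equiv\s*=\s*["']?content-type["']?[^>]*"""
    r"""content\s*=\s*["']([^"']+)["']""",
    re.IGNORECASE,
)

def code_identification(html, default_charset="utf-8"):
    """Return (content_type, charset) declared in the page's meta tag."""
    match = META_RE.search(html)
    if not match:
        return "text/html", default_charset   # assumed fallback
    parts = [p.strip() for p in match.group(1).split(";")]
    content_type = parts[0].lower()
    charset = default_charset
    for part in parts[1:]:
        name, _, value = part.partition("=")
        if name.strip().lower() == "charset" and value:
            charset = value.strip().lower()
    return content_type, charset

sample = '<meta http-equiv="content-type" content="text/html;charset=utf-8">'
print(code_identification(sample))  # → ('text/html', 'utf-8')
```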
S102: Parse the page initial data of the webpage to be archived to determine each link page associated with the webpage to be archived, and obtain the page initial data and code identification of each associated link page.
A link page in the embodiments of the invention is any web page element that makes up the webpage to be archived, such as a picture, video or Flash animation used by that webpage. The page initial data of the webpage to be archived records the relative storage addresses of these web page elements.
In S102, the page initial data of the corresponding link page is obtained according to the relative storage address in the page initial data. If the page initial data of a link page itself contains relative addresses of further link pages, those link pages are obtained recursively until the page initial data and code identification of all link pages have been obtained. The page initial data and code identification of each link page are obtained in the same way as those of the main page of the webpage to be archived.
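The recursive collection described above might look like the following Python sketch. The `fetch` callback, the simplified link pattern and the in-memory `FAKE_SITE` are all assumptions made for illustration; a real implementation would download each address and use the fuller set of link patterns.

```python
import re
from urllib.parse import urljoin

# Simplified, hypothetical link pattern (quoted src/href attributes only).
LINK_RE = re.compile(r"""(?:src|href)\s*=\s*["']([^"']+)["']""", re.IGNORECASE)

def collect_pages(url, fetch, seen=None):
    """Recursively gather {absolute_url: raw_data} for url and everything it links to.

    `fetch` is a caller-supplied function returning (raw_text, content_type).
    """
    if seen is None:
        seen = {}
    raw, content_type = fetch(url)
    seen[url] = raw
    if content_type.startswith("text/"):          # only text pages can hold links
        for relative in LINK_RE.findall(raw):
            absolute = urljoin(url, relative)
            if absolute not in seen:              # guard against cycles
                collect_pages(absolute, fetch, seen)
    return seen

# Stand-in for the network, used only for this demonstration.
FAKE_SITE = {
    "http://example.com/": ('<img src="/a.png"><iframe src="sub.html">', "text/html"),
    "http://example.com/a.png": ("PNGDATA", "image/png"),
    "http://example.com/sub.html": ('<img src="/a.png"><img src="b.png">', "text/html"),
    "http://example.com/b.png": ("PNGDATA2", "image/png"),
}

pages = collect_pages("http://example.com/", lambda u: FAKE_SITE[u])
print(sorted(pages))
```

The `seen` dictionary both accumulates results and stops the recursion from revisiting a page that links back to one already fetched.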
S103: Encode the page initial data and code identification of the webpage to be archived to obtain a primary resource file, and encode the page initial data and code identification of each link page to obtain child resource files.
The content type of the main page of a webpage to be archived is generally "Content-Type:text/html" (text page). The page initial data and code identification of the webpage to be archived can therefore be encoded directly with Quoted-printable (printable character reference) encoding to obtain the primary resource file of the cluster web page document. Different encodings may be used for the page initial data of the link pages. For example, when the content type is "Content-Type:image/png" (picture), the page initial data of the link page is encoded with base64 to obtain the corresponding child resource file; when the content type is "Content-Type:text/html" (text page), the page initial data of the link page is likewise encoded with Quoted-printable to obtain the corresponding child resource file.
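The choice of encoding by content type can be sketched with Python's standard `quopri` and `base64` modules. The function name `encode_resource` is hypothetical, and treating every `text/*` type as text (and everything else as binary) is an assumption of this sketch.

```python
import base64
import quopri

def encode_resource(raw: bytes, content_type: str):
    """Return (transfer_encoding, encoded_bytes) for one resource."""
    if content_type.lower().startswith("text/"):
        # Text pages: printable ASCII stays readable, "=" becomes "=3D".
        return "quoted-printable", quopri.encodestring(raw)
    # Pictures, video and other binary data: base64.
    return "base64", base64.encodebytes(raw)

print(encode_resource(b'<a href="x">', "text/html"))
print(encode_resource(b"\x89PNG", "image/png"))
```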
S104: Encapsulate the primary resource file and each child resource file into a cluster web page document.
Once the primary resource and each child resource have been obtained, the primary resource file and each child resource file can be encapsulated according to the MHT protocol of the cluster web page document to obtain the cluster web page document.
Specifically, in an embodiment of the invention, encapsulation proceeds as follows. First, the MHT header is built and filled with the MHT protocol information; the header may also set the content type of the primary resource, the boundary marker and similar information. Setting the content type of the primary resource lets a later decoder treat the first resource whose content type matches the one in the MHT header as the primary resource, so that the main page content of the webpage to be archived is decoded correctly. Next, each component of the MHT is built in turn: normally the primary resource is the first part and the child resources follow. When each part is built, the boundary marker set in the MHT header is added to the resource, and the content type of the resource's page initial data is recorded, so that the parts can be split and decoded accurately. Finally, the completed parts are spliced into a complete MHT file, which is saved as utf-8 text. In embodiments of the invention, after the child resource and code identification of every link page have been built into the corresponding part of the cluster web page document MHT, an end marker is set; detecting the end marker triggers the splicing of the completed parts into the complete MHT file.
When the primary resource file and the child resource files are archived, the child resource files are archived in the order in which the link addresses corresponding to their page initial data appear in the page initial data of the primary resource file. On decoding, the child resource files can then be decoded in sequence and matched to the page initial data of the primary resource file, so that the browser opens the corresponding pictures, videos and other content in the correct positions.
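The encapsulation steps (header, then the primary part, then the child parts in order, then an end marker) can be sketched as plain string assembly. This is an illustrative sketch only: `build_part`, `build_mht`, the example.com addresses and the use of "\n" line endings are assumptions, and the boundary value is the one shown in this document's header example.

```python
import base64
import quopri

BOUNDARY = "----=_NextPart_000_00"

def build_part(location, content_type, raw, charset=None):
    """Encode one resource and prefix it with its boundary marker and headers."""
    if content_type.startswith("text/"):
        encoding, body = "quoted-printable", quopri.encodestring(raw)
    else:
        encoding, body = "base64", base64.encodebytes(raw)
    type_line = f"Content-Type: {content_type}"
    if charset:
        type_line += f'; charset="{charset}"'
    headers = [
        f"--{BOUNDARY}",
        type_line,
        f"Content-Transfer-Encoding: {encoding}",
        f"Content-Location: {location}",
    ]
    return "\n".join(headers) + "\n\n" + body.decode("ascii").rstrip("\n") + "\n"

def build_mht(primary, children):
    """Splice header, primary part, child parts and the end marker together."""
    header = (
        "MIME-Version: 1.0\n"
        f'Content-Type: multipart/related; type="{primary[1]}"; '
        f'boundary="{BOUNDARY}"\n\n'
    )
    parts = [build_part(*primary)] + [build_part(*child) for child in children]
    return header + "".join(parts) + f"--{BOUNDARY}--\n"

mht = build_mht(
    ("http://example.com/", "text/html", b'<html><img src="logo.png"></html>', "utf-8"),
    [("http://example.com/logo.png", "image/png", b"\x89PNG", None)],
)
print(mht)
```

Passing the child resources in the order their links occur in the main page preserves the ordering the paragraph above describes.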
The embodiments of the invention process the page initial data of the webpage to be archived to obtain the primary resource file of the cluster web page document, obtain the page initial data of each link page of the webpage to be archived from that page initial data, and process the page initial data of the link pages to obtain the child resource files of the cluster web page document, which are finally archived into the cluster web page document. The data of the webpage to be archived is thereby archived more comprehensively and accurately, so that cluster web page documents for all kinds of webpages are obtained more effectively and completely. On decoding, all data of the webpage to be archived can be recovered in full, avoiding archiving errors and garbled characters when the cluster web page document is opened, and meeting users' demand for automated, intelligent processing of cluster web page documents.
Refer now to Fig. 2, a flow diagram of another web page processing method of an embodiment of the invention. The method can be applied in user terminals with internet browsing functions, such as mobile phones, tablet computers, PCs and smart wearable devices. Specifically, the method comprises:
S201: Obtain the page initial data of the webpage to be archived, and obtain the code identification of that page initial data. The code identification of the page initial data of the main page of the webpage to be archived comprises: a character set identifier for determining the encoding of the main page's page initial data, and a content type for distinguishing whether the page initial data is binary format data.
In embodiments of the invention, the character set can be determined by intercepting the character stream of the HTML head of the webpage to be archived and repeatedly attempting to decode its characters to detect the coded character set they belong to; this determines the character set of the webpage to be archived efficiently and with high accuracy. The content type is determined from the "content-type" entry in the page initial data, so that page initial data with content types such as picture or video is identified as binary format data.
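One simple way to detect a character set by repeated decoding is to try candidate encodings in turn and keep the first that decodes without error. The sketch below illustrates that idea only; the function name `detect_charset` and the candidate list are assumptions, and the patent text does not specify which character sets are tried or in what order.

```python
def detect_charset(raw: bytes, candidates=("utf-8", "gb18030", "big5")):
    """Return the first candidate character set that decodes raw without error."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue                     # not this character set, try the next
        return name
    return None                          # none of the candidates fit

print(detect_charset("网页".encode("utf-8")))
print(detect_charset("网页".encode("gbk")))
```

Strict encodings such as utf-8 should come first, because permissive multi-byte encodings accept many byte sequences that were not written in them.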
S202: Take the page initial data of the webpage to be archived as target data and parse it to obtain each link page recorded in the target data.
In S202, obtaining each link page recorded in the target data comprises: intercepting each link address recorded in the target data according to preset regular expressions; completing each intercepted link address according to the address of the webpage to be archived to obtain the absolute address corresponding to each link address; and querying each absolute address to obtain each link page recorded in the target data. The link address is a relative address, relative to the address of the webpage to be archived.
Regular expressions intercept the relative and absolute paths in the page initial data efficiently and exclude interference from links internal to the page. The regular expressions that the embodiments of the invention can use include the following:
Matches: url("image/logo.jpg")
regexRule=@"url\\(\\s*(('\\s*[^']+')|(\"\\s*[^\"]+\")|(\\s*[^\\)]+))";
Matches: src='filename.ext'; background="filename.ext"
@"(\\ssrc|\\sbackground)\\s*=\\s*(('[^']+')|(\"[^\"]+\")|([^\\n\\r\\f]+))";
Matches: @import "style.css" or @import url(style.css)
@"(@import\\s|\\S+-image:|background:)\\s*(url)*\\s*[\"'(]{1,2}[^\"')]+[\"')]{1,2}";
Matches: <link rel=stylesheet href="style.css">
@"<link[^>]+?href\\s*=\\s*('|\")*[^'\">]+('|\")*";
Matches: <iframe src="mypage.htm"> or <frame src="mypage.aspx">
@"<i*frame[^>]+?src\\s*=\\s*['\"]{0,1}[^'\"\\\\>]+['\"]{0,1}"
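For illustration, two of the expressions above can be rendered in Python as raw strings and applied to a page. These are simplified adaptations rather than exact equivalents (the unquoted attribute alternative is dropped), and `extract_links` and the sample page are hypothetical.

```python
import re

# Python renderings of the url(...) and src/background expressions above.
URL_FUNC = re.compile(r"""url\(\s*(?:'\s*([^']+)'|"\s*([^"]+)"|\s*([^)]+))""")
SRC_ATTR = re.compile(r"""\s(?:src|background)\s*=\s*(?:'([^']+)'|"([^"]+)")""")

def extract_links(text):
    """Return the link addresses found by each pattern in turn, without duplicates."""
    found = []
    for pattern in (URL_FUNC, SRC_ATTR):
        for match in pattern.finditer(text):
            # Exactly one alternative matched; take its captured address.
            link = next(g for g in match.groups() if g).strip()
            if link not in found:
                found.append(link)
    return found

page = (
    '<body background="bg.gif"><img src=\'a.png\'>'
    '<div style="background:url(image/logo.jpg)"></div></body>'
)
print(extract_links(page))
```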
For example, the link address url "/static/search/ala/callicon.png" in the page initial data of the webpage to be archived is completed with the address of the webpage to be archived to give the absolute address url "http://www.baidu.com/static/search/ala/callicon.png".
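In Python, this completion step corresponds to `urllib.parse.urljoin`. The sketch below assumes the webpage to be archived is at http://www.baidu.com/, which is consistent with the example but not stated as the page address.

```python
from urllib.parse import urljoin

page_url = "http://www.baidu.com/"   # assumed address of the webpage to be archived
link = "/static/search/ala/callicon.png"

absolute = urljoin(page_url, link)
print(absolute)  # → http://www.baidu.com/static/search/ala/callicon.png

# An address that is already absolute is returned unchanged.
unchanged = urljoin(page_url, "http://m.baidu.com/static/index/logo_index2.png")
```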
S203: Obtain the page initial data and code identification of each link page recorded in the target data.
S204: Determine whether the page initial data and code identification of all associated link pages have been obtained.
S205: If not, take the page initial data of each link page recorded in the target data as new target data and return to S203. The page initial data and code identification of the link pages are thus obtained layer by layer, until the page initial data and code identification of all link pages associated with the webpage to be archived have been obtained.
After the initial data of the link pages at each level is obtained, the regular expressions above can likewise be used to intercept, from that initial data, the relative addresses of the link pages at the next level, whose page initial data and code identification are then obtained. S203 is repeated until the new target data contains no link page, at which point the page initial data and code identification of every link page associated with the webpage to be archived have been obtained.
The code identification of the page initial data of the link pages at every level comprises: a character set identifier for determining the encoding of the page initial data, and a content type for distinguishing whether the page initial data is binary format data.
S206: If the page initial data and code identification of all associated link pages have been obtained, build header information for the cluster web page document. The header information includes the boundary marker and the code identification of the page initial data of the webpage to be archived, as well as information such as the MHT protocol. A specific example follows:
Subject:WebArchive
Date:004,04Mar2XXX23:22:27PST
MIME-Version:1.0 (MIME version)
Content-Type:multipart/related;
type="text/html"; (content type)
boundary="----=_NextPart_000_00" (boundary separator)
The header information example above indicates information such as the creation time, protocol version, content type and boundary marker.
S207: Encode the page initial data of the webpage to be archived to obtain the primary resource file. Specifically, the page initial data of the webpage to be archived is encoded with Quoted-printable (printable character reference) encoding, and the encoded file is combined with the code identification of the page initial data of the webpage to be archived to obtain the primary resource file. A boundary marker is then added to the primary resource file.
Quoted-printable represents characters in various encodings using printable ASCII characters. When the text contains few non-ASCII characters, as with HTML content that consists mostly of markup, the encoded result is compact and remains fairly readable, and the data is handled correctly on the corresponding data paths or media. The following shows the page initial data of a webpage to be archived after encoding, in the form of the resulting primary resource file:
------=_NextPart_000_00
Content-Type:text/html;
charset="utf-8"
Content-Transfer-Encoding:quoted-printable
Content-Location:http://www.baidu.com
<html><!--STATUS OK--><head><meta http-equiv=3D"Content-Type"=
content=3D"text/html;charset=3Dutf-8"><meta name=3D"viewport"=
content=3D"width=3Ddevice-width,minimum-scale=3D1.0,maximum-scale=3D1.0=
,user-scalable=3Dno"><link rel=3D"apple-touch-icon-precomposed"=
href=3D"http://m.baidu.com/static/index/screen_icon.png"><meta=
name=3D"format-detection"= (remainder omitted here)
In the primary resource file form above, the first line is the boundary marker, which makes later splitting possible; the boundary marker is recorded in the MHT header information described above. The second line indicates that the content type of the page initial data of the main page of the webpage to be archived is "text/html", a text page. The third line indicates that the character set used is "utf-8". The fourth line indicates that "quoted-printable" encoding is used, and the fifth line gives the address the primary resource belongs to. The remaining content is the encoded page initial data of the main page.
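To show how such an encoded body reads back, the following Python sketch decodes a short fragment in the style of the example with the standard `quopri` module. The fragment is constructed for illustration (a space has been placed before the soft line break); "=3D" stands for "=" and a trailing "=" marks a soft line break that is removed on decoding.

```python
import quopri

encoded = (
    b'<meta http-equiv=3D"Content-Type" =\n'
    b'content=3D"text/html;charset=3Dutf-8">'
)
decoded = quopri.decodestring(encoded)
print(decoded.decode("utf-8"))
# → <meta http-equiv="Content-Type" content="text/html;charset=utf-8">
```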
S208: Encode the page initial data of each link page to obtain the child resource files. Specifically, if the content type in the code identification of a link page's page initial data indicates binary format data, the page initial data is encoded with the preconfigured base64 encoding. If it is not binary format data, the page initial data is encoded with the preconfigured Quoted-printable (printable character reference) encoding. Each encoded file is combined with the code identification of the page initial data of the corresponding link page to obtain a child resource file, and a boundary marker is then added to each child resource file.
Wherein, after encoding to the page initial data of binary format, the example for obtaining child resource file is as follows:
--- --=_ NextPart_000_00(child resource)
Content-Type:image/png;
Content-Transfer-Encoding:base64
Content-Location:http://m.baidu.com/static/index/logo_index2.png
iVBORw0KGgoAAAANSUhEUgAAAVAAAABrCAYAAAAhItoDAAAXvEl EQVR4Xu3dfXAV5b3A8eeck+Qk
5IWEtwAiBpCXYhEVpFB8KSDiK0WqohX0qrRK0SKtXuzItVaFq1crA8L4Vr hUi5dCuVgKl0Kpg3Ws
KfWClIsIhIAQYmLEmMY0Ho/hd7/D2BnNZPfsnrNPzmbPw8xn+IORhRn8zr O7v+ DZ9V5OyEeMAb0q(is omitted herein)
The first line of the example above is the boundary marker. The second line is the content type of the page initial data corresponding to the child resource, the third line is the encoding format, and the fourth line is the absolute path to which the resource belongs. The content that follows is the related page initial data after base64 encoding. A child resource file obtained from page initial data in non-binary format looks much the same as the example above; only the content type and the encoding format differ.
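The choice between the two encodings in S208 can be sketched as follows. This is a minimal illustration, not the patented implementation: the helper names, the rule that every non-`text/*` content type counts as binary, and the use of `--` plus the boundary string as the delimiter line are all assumptions.

```python
import base64
import quopri


def is_binary(content_type: str) -> bool:
    """Assumption: any content type outside text/* is treated as binary."""
    return not content_type.lower().startswith("text/")


def encode_resource(raw: bytes, content_type: str) -> tuple[str, str]:
    """Return (transfer encoding name, encoded text) for one resource."""
    if is_binary(content_type):
        # base64 for binary data such as pictures
        return "base64", base64.encodebytes(raw).decode("ascii")
    # quoted-printable for text data
    return "quoted-printable", quopri.encodestring(raw).decode("ascii")


def build_child_resource(raw: bytes, content_type: str,
                         location: str, boundary: str) -> str:
    """Combine the boundary marker, the code identification and the encoded data."""
    encoding, body = encode_resource(raw, content_type)
    return (
        f"--{boundary}\n"  # hypothetical delimiter form
        f"Content-Type: {content_type}\n"
        f"Content-Transfer-Encoding: {encoding}\n"
        f"Content-Location: {location}\n"
        "\n"
        f"{body}"
    )
```

Both encodings produce plain ASCII text, which is what allows the finished document to be stored as a text file.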
S209: Save, in a specified storage format, the primary resource file and the child resource files to which boundary markers have been added, together with the constructed header information, to obtain a cluster web pages document that includes the primary resource file and each child resource file.
The completed primary resource file and child resource files are spliced together to obtain the complete MHT file, which is saved as utf-8 text. In the embodiment of the present invention, once the child resource corresponding to every link page, together with its code identification, has been built into the corresponding part of the cluster web pages document MHT, an end marker is set; when the end marker is detected, the completed parts are spliced together to obtain the complete MHT file.
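The splicing in S209 can be illustrated with a short sketch. The header layout below (a `multipart/related` content type carrying `type` and `boundary` parameters) follows common MHTML practice and is an assumption, as are the function names and the `--boundary--` end marker; the text above only states that the header records the boundary marker and the content type of the primary resource.

```python
def build_mht(boundary: str, main_type: str, parts: list[str]) -> str:
    """Splice the header, the resource parts and the end marker into one document.

    Each entry of `parts` is the text of one resource (its code identification
    lines, a blank line, then the encoded data) without a boundary line.
    """
    header = (
        "MIME-Version: 1.0\n"
        f'Content-Type: multipart/related; type="{main_type}"; boundary="{boundary}"\n'
        "\n"
    )
    # The primary resource is expected first, followed by the child resources.
    body = "".join(f"--{boundary}\n{part.rstrip()}\n\n" for part in parts)
    end_marker = f"--{boundary}--\n"  # hypothetical end marker
    return header + body + end_marker


def save_mht(path: str, text: str) -> None:
    """Save the spliced document as utf-8 text."""
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        handle.write(text)
```

Calling `build_mht` with the primary resource as the first element of `parts` keeps the ordering that the decoding side relies on.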
In the embodiment of the present invention, the page initial data of the webpage to be archived is processed to obtain the primary resource file of the cluster web pages document; the page initial data of each link page of the webpage to be archived is obtained on the basis of that page initial data and is processed to obtain the child resource files of the cluster web pages document; and the cluster web pages document is finally obtained by archiving. The embodiment can therefore archive the data of the webpage to be archived more comprehensively and accurately, and obtain the cluster web pages document of all kinds of webpages more effectively and completely, so that all data of the webpage to be archived can be recovered in full during decoding. This avoids archiving errors and problems such as garbled characters when the cluster web pages document is opened, and meets the user's need for automated, intelligent processing of cluster web pages documents.
Referring again to Fig. 3, which is a flow diagram of another web page processing method of the embodiment of the present invention: the method can be applied in user terminals with a web browsing function, such as mobile phones, tablet computers, personal computers and smart wearable devices. It parses the MHT document obtained in the embodiments corresponding to Fig. 1 and Fig. 2 above and restores the corresponding complete web page data. Specifically, the method of the embodiment of the present invention includes:
S301: According to the boundary marker in the header information read from the cluster web pages document, split the cluster web pages document to obtain the primary resource file and each child resource file.
For the processing and archiving of the primary resource file and the child resource files in the cluster web pages document, refer to the description of the embodiments corresponding to Fig. 1 and Fig. 2 above. The header information of the cluster web pages document includes the boundary marker and the content type of the page initial data corresponding to the primary resource file, and of course also includes content such as the MHT protocol information. During splitting, the first resource file whose code identification carries the same content type as the one extracted from the header information is taken as the primary resource file, and the other resource files are determined to be child resource files.
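Splitting by the boundary marker and picking out the primary resource might look like the sketch below. It assumes a header that carries `boundary="..."` and `type="..."` parameters and parts delimited by `--` plus the boundary; these details and all function names are illustrative assumptions rather than the format fixed by the embodiment.

```python
import re


def read_header(mht: str) -> tuple[str, str]:
    """Return (boundary marker, primary content type) from the document header."""
    header = mht.split("\n\n", 1)[0]
    boundary = re.search(r'boundary="([^"]+)"', header).group(1)
    main_type = re.search(r'type="([^"]+)"', header).group(1)
    return boundary, main_type


def split_parts(mht: str, boundary: str) -> list[dict]:
    """Split the body on the boundary marker and parse each part's header lines."""
    body = mht.split("\n\n", 1)[1]
    parts = []
    for chunk in body.split(f"--{boundary}"):
        chunk = chunk.strip("\n")
        if not chunk or chunk == "--":  # leading gap or end marker
            continue
        head, _, payload = chunk.partition("\n\n")
        headers = {}
        for line in head.splitlines():
            name, _, value = line.partition(":")
            headers[name.strip().lower()] = value.strip()
        parts.append({"headers": headers, "payload": payload})
    return parts


def pick_primary(parts: list[dict], main_type: str) -> tuple[dict, list[dict]]:
    """The first part whose content type matches the header is the primary resource."""
    for index, part in enumerate(parts):
        ctype = part["headers"].get("content-type", "").split(";")[0].strip()
        if ctype == main_type:
            return part, parts[:index] + parts[index + 1:]
    raise ValueError("no part matches the primary content type")
```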
After the primary resource file and the child resource files are obtained by splitting, the decoding mode for each (quoted-printable decoding or base64 decoding) is selected according to the corresponding content type, and the primary resource file and each child resource file are decoded. Decoding the primary resource file yields the page initial data of the main page of the webpage to be archived, and decoding a child resource file yields the page initial data of the corresponding subpage.
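Selecting the decoder can be sketched in a few lines. Here the choice is driven by the recorded transfer encoding of each resource, which is an assumption about how the content-type-based selection is realised; the function name is hypothetical.

```python
import base64
import quopri


def decode_payload(payload: str, transfer_encoding: str) -> bytes:
    """Decode one resource according to its recorded transfer encoding."""
    encoding = transfer_encoding.strip().lower()
    data = payload.encode("ascii")
    if encoding == "base64":
        # b64decode ignores the line breaks inserted during encoding
        return base64.b64decode(data)
    if encoding == "quoted-printable":
        return quopri.decodestring(data)
    raise ValueError(f"unsupported transfer encoding: {transfer_encoding}")
```

The result is raw bytes; for a text page, these still have to be interpreted with the character set recorded in the code identification.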
S302: Decode the primary resource file to obtain the page initial data of the webpage to be archived, and decode each child resource file in turn to obtain the page initial data of each link page.
S303: Name and store the page initial data corresponding to each decoded child resource file according to a preset local file naming rule.
The file name of each decoded child resource file is processed, and the link URLs in the main page's page initial data corresponding to the primary resource file are changed to local link addresses. In the archiving process above, the child resource files were archived in the order in which their link URLs appear in the page initial data corresponding to the primary resource file; accordingly, during decoding, the child resource files are decoded, named and stored in that same order. For example, the decoded files can be numbered sequentially, giving files such as unmht_cid_1, unmht_cid_2, unmht_cid_3 and unmht_cid_4. The web page links in the page initial data obtained by decoding the primary resource file are then converted to local link addresses; for example, the URL http://m.baidu.com/static/index/logo_index2.png can be converted to /subfile/unmht_cid_2 (numbered in sequence).
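The naming rule and the link rewriting can be sketched as below. The sketch assumes that the main page text contains each child resource's URL exactly as it was recorded at archiving time; the function names are hypothetical, while the `unmht_cid_N` names and the `subfile` folder come from the example above.

```python
def local_name(index: int) -> str:
    """Sequential local file name, following the unmht_cid_N pattern."""
    return f"unmht_cid_{index}"


def rewrite_links(main_html: str, child_urls: list[str],
                  subdir: str = "subfile") -> tuple[str, dict[str, str]]:
    """Replace each child URL in the main page with its local link address.

    `child_urls` must be in archive order, so that numbering matches storage.
    """
    mapping = {
        url: f"/{subdir}/{local_name(index)}"
        for index, url in enumerate(child_urls, start=1)
    }
    # Replace longer URLs first so a URL that is a prefix of another
    # does not corrupt the longer one.
    for url in sorted(mapping, key=len, reverse=True):
        main_html = main_html.replace(url, mapping[url])
    return main_html, mapping
```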
S304: According to the storage address of the page initial data corresponding to each child resource file, change each corresponding link URL in the decoded page initial data of the webpage to be archived to a local link address in turn, and store the page initial data with the modified link URLs as a web page file.
For the specific storage layout, refer to the description of Fig. 4. The decoded page initial data of the webpage to be archived is saved as the final HTML file, which can be named index.html. The page initial data obtained by decoding each child resource file is saved alongside index.html, specifically in the corresponding subfile folder. With this layout, when a web page browsing tool such as a browser is called to open index.html, all page elements of the webpage to be archived can be parsed and displayed.
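The storage layout described here (index.html plus a subfile folder) can be sketched with the standard library. The function names and the throwaway directory used in `demo` are illustrative assumptions; Fig. 4 itself is not reproduced in this section.

```python
import tempfile
from pathlib import Path


def store_site(root: Path, main_html: str, children: dict[str, bytes],
               subdir: str = "subfile") -> Path:
    """Write index.html under root and each decoded child under root/subdir."""
    sub = root / subdir
    sub.mkdir(parents=True, exist_ok=True)
    for name, data in children.items():
        (sub / name).write_bytes(data)
    index = root / "index.html"
    index.write_text(main_html, encoding="utf-8")
    return index


def demo() -> list[str]:
    """Store a tiny page in a temporary directory and list what was written."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        store_site(root, '<img src="/subfile/unmht_cid_1">',
                   {"unmht_cid_1": b"\x89PNG"})
        return sorted(p.relative_to(root).as_posix() for p in root.rglob("*"))
```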
In the embodiment of the present invention, the page initial data of the webpage to be archived is processed to obtain the primary resource file of the cluster web pages document; the page initial data of each link page of the webpage to be archived is obtained on the basis of that page initial data and is processed to obtain the child resource files of the cluster web pages document; and the cluster web pages document is finally obtained by archiving. The embodiment can therefore archive the data of the webpage to be archived more comprehensively and accurately, and obtain the cluster web pages document of all kinds of webpages more effectively and completely, so that all data of the webpage to be archived can be recovered in full during decoding. This avoids archiving errors and problems such as garbled characters when the cluster web pages document is opened, and meets the user's need for automated, intelligent processing of cluster web pages documents.
Referring again to Fig. 5, which is a flow diagram of another web page processing method of the embodiment of the present invention: the method can be applied in user terminals with a web browsing function, such as mobile phones, tablet computers, personal computers and smart wearable devices. It parses the MHT document obtained in the embodiments corresponding to Fig. 1 and Fig. 2 above and restores the corresponding complete web page data. Specifically, the method of the embodiment of the present invention includes:
S401: According to the boundary marker in the header information read from the cluster web pages document, split the cluster web pages document to obtain the resource files.
S402: Extract the content type from the code identification in the header information of the cluster web pages document, and compare it with the content type in the code identification of each resource file.
S403: According to the comparison result, take the first resource file whose code identification carries the same content type as the one extracted from the header information as the primary resource file, and determine the other resource files to be child resource files.
S404: Decode the primary resource file to obtain the page initial data of the webpage to be archived, and decode each child resource file in turn to obtain the page initial data of each link page.
S405: Name and store the page initial data corresponding to each decoded child resource file according to a preset local file naming rule.
S406: According to the storage address of the page initial data corresponding to each child resource file, change each corresponding link URL in the decoded page initial data of the webpage to be archived to a local link address in turn, and store the page initial data with the modified link URLs as a web page file.
S407: When an open operation on the stored web page file is detected, call a preset web page browsing tool to parse the page initial data corresponding to the web page file.
S408: Load the page initial data of the child resource files stored at the local link addresses in the page initial data.
S409: Display, in the display window of the web page browsing tool, the page that includes the page initial data and the page initial data obtained by decoding the child resource files.
Likewise, for the storage layout, refer to the description of Fig. 4. The decoded page initial data of the webpage to be archived is saved as the final HTML file, which can be named index.html. The page initial data obtained by decoding each child resource file is saved alongside index.html, specifically in the corresponding subfile folder. With this layout, when a web page browsing tool such as a browser is called to open index.html, all page elements of the webpage to be archived can be parsed and displayed.
For the specific implementation of the related steps of the embodiment of the present invention, refer to the description of the embodiments corresponding to Fig. 1 to Fig. 3 above.
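Steps S401 to S406 can also be approximated end to end with Python's standard `email` parser. This swaps the hand-written boundary splitting of the embodiment for a general MIME parser; it is a sketch under stated assumptions, not the method of the embodiment. It assumes that the document header carries a `type` parameter naming the primary content type and that each child resource has a `Content-Location` line; the function name `restore` is hypothetical.

```python
import email


def restore(mht_text: str, subdir: str = "subfile") -> tuple[str, dict[str, bytes]]:
    """Return (main page with local links, {local file name: decoded bytes})."""
    message = email.message_from_string(mht_text)
    main_type = message.get_param("type")   # primary content type from the header
    parts = message.get_payload()           # list of parts, split on the boundary
    primary = next((p for p in parts if p.get_content_type() == main_type), None)
    if primary is None:
        raise ValueError("no part matches the primary content type")
    children = [p for p in parts if p is not primary]

    charset = primary.get_content_charset() or "utf-8"
    html = primary.get_payload(decode=True).decode(charset)

    files = {}
    for index, part in enumerate(children, start=1):
        name = f"unmht_cid_{index}"
        # decode=True undoes base64 or quoted-printable as recorded in the part
        files[name] = part.get_payload(decode=True)
        location = part["Content-Location"]
        if location:
            html = html.replace(location, f"/{subdir}/{name}")
    return html, files
```

The returned page text and file map can then be written out as index.html and the subfile folder described above.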
In the embodiment of the present invention, the page initial data of the webpage to be archived is processed to obtain the primary resource file of the cluster web pages document; the page initial data of each link page of the webpage to be archived is obtained on the basis of that page initial data and is processed to obtain the child resource files of the cluster web pages document; and the cluster web pages document is finally obtained by archiving. The embodiment can therefore archive the data of the webpage to be archived more comprehensively and accurately, and obtain the cluster web pages document of all kinds of webpages more effectively and completely, so that all data of the webpage to be archived can be recovered in full during decoding. This avoids archiving errors and problems such as garbled characters when the cluster web pages document is opened, and meets the user's need for automated, intelligent processing of cluster web pages documents.
The web page processing device and the user terminal of the embodiment of the present invention are described in detail below.
Referring to Fig. 6, which is a structural schematic diagram of a web page processing device of the embodiment of the present invention: the device can be provided in user terminals with a web browsing function, such as mobile phones, tablet computers, personal computers and smart wearable devices. Specifically, the device of the embodiment of the present invention includes:
Acquisition module 11, configured to obtain the page initial data of the webpage to be archived and to obtain the code identification of the page initial data;
Parsing module 12, configured to parse the page initial data of the webpage to be archived, determine each link page associated with the webpage to be archived, and obtain the page initial data and code identification of each associated link page;
Coding module 13, configured to encode the page initial data and code identification of the webpage to be archived to obtain the primary resource file, and to encode the page initial data and code identification of each link page to obtain the child resource files;
Profiling module 14, configured to encapsulate the obtained primary resource file and child resource files into a cluster web pages document.
The page initial data of the webpage to be archived in the embodiment of the present invention mainly includes the source code data of the page. The webpage to be archived can be the main page opened by a browser after the user enters a web page link address (URL) in the browser, in which case the acquisition module 11 can read the page initial data, including the source code data, directly from the opened main page. Alternatively, when the user wishes to archive a certain webpage, the user enters its web page link address in a link address entry box, and the acquisition module 11 automatically locates the corresponding webpage on the corresponding server according to that address and pulls its page initial data.
The acquisition module 11 can obtain the code identification from the page initial data. The code identification includes the character set identifier, which gives the encoding of the page initial data, and the content type. Page initial data usually contains content-type (content type) and charset (character set) entries, from which the code identification can be obtained. For example, if some page initial data contains meta http-equiv="content-type" content="text/html;Charset=utf-8", the acquisition module 11 can determine that the code identification of the page initial data is "text/html" (text page) and "utf-8"; that is, the content type of the page initial data is "text/html" (text page) and the character set used is "utf-8".
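Reading the code identification out of a meta tag of this kind can be sketched with a regular expression. The pattern, the function name and the fallback values used when no meta tag is found are assumptions made for illustration; a production implementation would more likely rely on an HTML parser.

```python
import re

# Hypothetical pattern: a meta tag with http-equiv="content-type" and a content attribute.
META_RE = re.compile(
    r'<meta[^>]*http-equiv\s*=\s*["\']?content-type["\']?[^>]*'
    r'content\s*=\s*["\']([^"\']+)["\']',
    re.IGNORECASE,
)


def code_identification(page: str, default_type: str = "text/html",
                        default_charset: str = "utf-8") -> tuple[str, str]:
    """Return (content type, character set) found in the page's meta tag."""
    match = META_RE.search(page)
    if not match:
        return default_type, default_charset  # assumed fallback
    content_type, _, rest = match.group(1).partition(";")
    charset_match = re.search(r"charset\s*=\s*([\w-]+)", rest, re.IGNORECASE)
    charset = charset_match.group(1).lower() if charset_match else default_charset
    return content_type.strip().lower(), charset
```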
The link pages in the embodiment of the present invention are the web page elements that make up the webpage to be archived, such as the pictures, videos and FLASH animations involved in the webpage. The page initial data of the webpage to be archived records the relative storage addresses of these web page elements.
The parsing module 12 obtains the page initial data of the corresponding link pages according to the relative storage addresses in the page initial data. If the page initial data of an obtained link page itself contains relative addresses of further link pages, the parsing module 12 also obtains the link pages corresponding to those relative addresses; by recursive processing, it obtains the page initial data and code identification of all link pages. The page initial data and code identification of each link page are obtained in the same way as those of the main page of the webpage to be archived.
The content type generally used by the main page of the webpage to be archived is "Content-Type:text/html" (text page). Therefore, when encoding the page initial data and code identification of the webpage to be archived, the coding module 13 can directly use the printable-character encoding Quoted-printable to obtain the primary resource file of the cluster web pages document. For the page initial data of each link page, different encodings can be used. For example, when the content type is "Content-Type:image/png" (picture type), the page initial data of the link page is encoded with base64 to obtain the corresponding child resource file; when the content type of a link page is "Content-Type:text/html" (text page), its page initial data is likewise encoded with Quoted-printable to obtain the corresponding child resource file.
After the primary resource file and each child resource file are obtained, the profiling module 14 can encapsulate them directly according to the protocol of the cluster web pages document MHT to obtain the cluster web pages document.
Specifically, in the embodiment of the present invention, the encapsulation and archiving performed by the profiling module 14 includes the following. First, the MHT header is built and filled with the MHT protocol information; information such as the content type of the primary resource and the boundary marker can also be set in the header. Setting the content type of the primary resource allows later decoding to take the first resource whose content type matches the primary resource content type in the MHT header as the primary resource, so that the main page content of the webpage to be archived is decoded correctly. Second, the component parts of the MHT are built in turn; normally the primary resource is set as the first part and the child resources as the second part. When each part of the MHT is built, the boundary marker set in the MHT header is added for each resource, and the content type of the page initial data corresponding to each resource is recorded, so that splitting and decoding can be performed accurately. Finally, the completed parts are spliced together to obtain the complete MHT file, which is saved as utf-8 text. In the embodiment of the present invention, once the child resource corresponding to every link page, together with its code identification, has been built into the corresponding part of the cluster web pages document MHT, an end marker is set; when the end marker is detected, the completed parts are spliced together to obtain the complete MHT file.
When archiving the primary resource file and the child resource files, the profiling module 14 archives the child resource files one after another according to the position, within the page initial data corresponding to the primary resource file, of the link URL that corresponds to each child resource file's page initial data. This allows later decoding to decode the child resource files in order and match them to the page initial data of the primary resource file, so that the browser can open the corresponding pictures, videos and other content at the correct positions.
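The position-based ordering can be sketched as a sort keyed on where each link URL first appears in the main page. Placing URLs that do not appear in the main page at the end is an assumption, as is the function name.

```python
def order_by_position(main_html: str, link_urls: list[str]) -> list[str]:
    """Sort link URLs by their first occurrence in the primary page.

    URLs not found in the page (for example, ones referenced only by a
    nested link page) are placed last, keeping their relative order.
    """
    def position(url: str) -> int:
        found = main_html.find(url)
        return found if found >= 0 else len(main_html)

    return sorted(link_urls, key=position)  # sorted() is stable
```

Archiving in this order is what lets the decoding side number the files sequentially and still match each one to the right link.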
Optionally, in the embodiment of the present invention, the parsing module 12 can include the following units; refer to Fig. 7 for details.
Determination unit 121, configured to take the page initial data of the webpage to be archived as target data and parse it to obtain each link page recorded in the target data;
Processing unit 122, configured to obtain the page initial data and code identification of each link page recorded in the target data, and to process the page initial data of each such link page as new target data, thereby obtaining the page initial data and code identification of every link page associated with the webpage to be archived.
Specifically, the determination unit 121 is configured to extract each link URL recorded in the target data according to a preset regular expression; to complete each extracted link URL according to the URL of the webpage to be archived, obtaining the absolute URL corresponding to each link URL; and to query, according to each absolute URL obtained, the link pages recorded in the target data.
For the specific implementation of each unit in the parsing module 12, refer to the corresponding description of the embodiments of Fig. 1 and Fig. 2 above.
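The work of the two units (regular-expression extraction, completion to absolute URLs, and recursive processing of each link page as new target data) can be sketched together. The regular expression, which follows only `src` attributes, and the injected `fetch` callable are assumptions; the embodiment does not disclose its actual expression or download routine.

```python
import re
from typing import Callable
from urllib.parse import urljoin

# Hypothetical "preset regular expression": only src="..." attributes are followed.
LINK_RE = re.compile(r'\bsrc\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)

Fetch = Callable[[str], tuple[bytes, str]]  # url -> (page initial data, content type)


def collect_link_pages(root_url: str, fetch: Fetch) -> dict[str, tuple[bytes, str]]:
    """Recursively gather every link page reachable from the page at root_url."""
    collected: dict[str, tuple[bytes, str]] = {}

    def visit(url: str, data: bytes, content_type: str) -> None:
        if not content_type.startswith("text/"):
            return  # binary data records no further links
        text = data.decode("utf-8", errors="replace")
        for ref in LINK_RE.findall(text):
            target = urljoin(url, ref)  # complete a relative URL to an absolute one
            if target == root_url or target in collected:
                continue  # already handled; also stops cycles
            collected[target] = fetch(target)
            visit(target, *collected[target])  # treat it as new target data

    visit(root_url, *fetch(root_url))
    return collected
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP client and makes it easy to exercise with an in-memory table of pages.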
Optionally, in the embodiment of the present invention, the code identification of page initial data includes a character set identifier, used to determine the encoding of the page initial data, and a content type, used to distinguish whether the page initial data is binary format data. As shown in Fig. 8, the coding module 13 of the embodiment of the present invention can include:
Primary resource coding unit 131, configured to encode the page initial data of the webpage to be archived with the printable-character encoding Quoted-printable, and to combine the encoded file with the code identification of the page initial data of the webpage to be archived to obtain the primary resource file;
Child resource coding unit 132, configured to: if the content type in the code identification of a link page's page initial data indicates that the data is in binary format, encode the page initial data with the preconfigured base64 encoding; otherwise, encode the page initial data in non-binary format with the preconfigured printable-character encoding Quoted-printable; and combine each encoded file with the code identification of the page initial data of the corresponding link page to obtain a child resource file.
For the specific implementation of each unit in the coding module 13, refer to the corresponding description of the embodiments of Fig. 1 and Fig. 2 above.
Optionally, as shown in Fig. 9, the profiling module 14 of the embodiment of the present invention can include:
Head setting unit 141, configured to build the header information for the cluster web pages document;
Flag setting unit 142, configured to add a boundary marker to the primary resource file and to each child resource file;
Storage unit 143, configured to save, in a specified storage format, the primary resource file and child resource files to which boundary markers have been added, together with the constructed header information, to obtain a cluster web pages document that includes the primary resource file and each child resource file.
For the specific implementation of each unit in the profiling module 14, refer to the corresponding description of the embodiments of Fig. 1 and Fig. 2 above.
Referring to Fig. 10, which is a structural schematic diagram of a user terminal of the embodiment of the present invention: the user terminal includes at least one processor 1001 (for example a CPU), at least one communication bus 1002, at least one network interface 1003, and a memory 1004. The communication bus 1002 provides the connections and communication between these components. The network interface 1003 can optionally include a standard wired interface and a wireless interface (such as a WI-FI or mobile communication interface). The memory 1004 can be a high-speed RAM memory or a non-volatile memory, for example at least one magnetic disk storage; optionally it can also be at least one storage device located remotely from the processor 1001. As shown in Fig. 10, the memory 1004, as a computer storage medium, stores an operating system, a network communication module, a web page processing application program and other programs.
Specifically, the processor 1001 can call the web page processing application program stored in the memory 1004 in order to: obtain the page initial data of the webpage to be archived and obtain the code identification of the page initial data; parse the page initial data of the webpage to be archived, determine each link page associated with the webpage to be archived, and obtain the page initial data and code identification of each associated link page; encode the page initial data and code identification of the webpage to be archived to obtain the primary resource file, and encode the page initial data and code identification of each link page to obtain the child resource files; and encapsulate the obtained primary resource file and child resource files into a cluster web pages document.
The memory 1004 is also used to store the cluster web pages document obtained by encapsulation.
In the embodiment of the present invention, the page initial data of the webpage to be archived is processed to obtain the primary resource file of the cluster web pages document; the page initial data of each link page of the webpage to be archived is obtained on the basis of that page initial data and is processed to obtain the child resource files of the cluster web pages document; and the cluster web pages document is finally obtained by archiving. The embodiment can therefore archive the data of the webpage to be archived more comprehensively and accurately, and obtain the cluster web pages document of all kinds of webpages more effectively and completely, so that all data of the webpage to be archived can be recovered in full during decoding. This avoids archiving errors and problems such as garbled characters when the cluster web pages document is opened, and meets the user's need for automated, intelligent processing of cluster web pages documents.
Referring to Fig. 11, which is a structural schematic diagram of another web page processing device of the embodiment of the present invention: the device can be provided in user terminals with a web browsing function, such as mobile phones, tablet computers, personal computers and smart wearable devices. Specifically, the device of the embodiment of the present invention includes:
Segmentation module 21, configured to split the cluster web pages document, according to the boundary marker in the header information read from the cluster web pages document, to obtain the primary resource file and each child resource file;
Decoder module 22, configured to decode the primary resource file to obtain the page initial data of the webpage to be archived, and to decode each child resource file in turn to obtain the page initial data of each link page;
Child resource processing module 23, configured to name and store the page initial data corresponding to each decoded child resource file according to a preset local file naming rule;
Memory module 24, configured to change, in turn and according to the storage address of the page initial data corresponding to each child resource file, each corresponding link URL in the decoded page initial data of the webpage to be archived to a local link address, and to store the page initial data with the modified link URLs as a web page file.
For the processing and archiving of the primary resource file and the child resource files in the cluster web pages document, refer to the description of the embodiments corresponding to Fig. 1 and Fig. 2 above. The header information of the cluster web pages document includes the boundary marker and the content type of the page initial data corresponding to the primary resource file, and of course also includes content such as the MHT protocol information. During splitting, the segmentation module 21 takes the first resource file whose code identification carries the same content type as the one extracted from the header information as the primary resource file, and determines the other resource files to be child resource files.
After the segmentation module 21 obtains the primary resource file and the child resource files by splitting, the decoder module 22 selects the decoding mode for each (quoted-printable decoding or base64 decoding) according to the corresponding content type and decodes the primary resource file and each child resource file. Decoding the primary resource file yields the page initial data of the main page of the webpage to be archived, and decoding a child resource file yields the page initial data of the corresponding subpage.
The child resource processing module 23 processes the file name of each decoded child resource file, and the link URLs in the main page's page initial data corresponding to the primary resource file are changed to local link addresses. In the archiving process above, the child resource files were archived in the order in which their link URLs appear in the page initial data corresponding to the primary resource file; accordingly, during decoding, the child resource files are decoded, named and stored in that same order. For example, the child resource processing module 23 can number the decoded files sequentially, giving files such as unmht_cid_1, unmht_cid_2, unmht_cid_3 and unmht_cid_4. The memory module 24 then converts the link URLs in the page initial data obtained by decoding the primary resource file to local link addresses; for example, the link URL http://m.baidu.com/static/index/logo_index2.png at the corresponding position in that page initial data can be converted to the local link address /subfile/unmht_cid_2 (numbered in sequence).
For the specific storage layout used by the memory module 24, refer to the description of Fig. 4. The decoded page initial data of the webpage to be archived is saved as the final HTML file, which can be named index.html. The page initial data obtained by decoding each child resource file is saved alongside index.html, specifically in the corresponding subfile folder. With this layout, when a web page browsing tool such as a browser is called to open index.html, all page elements of the webpage to be archived can be parsed and displayed.
Further optionally, in the embodiment of the present invention, referring to Figure 12, the segmentation module 21 may specifically include:
a resource file cutting unit 211, configured to split the cluster web pages document according to the boundary marker in the header information read from the cluster web pages document, obtaining each resource file;
a comparing unit 212, configured to extract the content type from the code identification in the header information of the cluster web pages document, and to compare the content type extracted from the header information with the content type of the code identification included in each resource file;
a determination unit 213, configured to determine, according to the comparison result, the first resource file whose code identification has the same content type as the one extracted from the header information as the primary resource file, and to determine the other resource files as child resource files.
For the specific implementation of each unit in the segmentation module 21, refer to the related description of the embodiments corresponding to Fig. 3 to Fig. 5 above.
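The three units above can be illustrated with a short sketch. It assumes an MHTML-style layout that the text does not spell out: a header line carrying `boundary="..."` and `type="..."` parameters, parts delimited by `--` plus the boundary, and `\n` line endings. The function names `split_resources` and `pick_primary` and the sample document are hypothetical.

```python
import re


def split_resources(document: str) -> tuple[str, list[str]]:
    """Split the document on the boundary marker named in its header."""
    header, _, body = document.partition("\n\n")
    match = re.search(r'boundary="([^"]+)"', header)
    if match is None:
        raise ValueError("header information has no boundary marker")
    delimiter = "--" + match.group(1)
    # Drop the (empty) text before the first delimiter; the chunk that
    # follows the closing delimiter starts with "--" and is skipped.
    chunks = body.split(delimiter)[1:]
    return header, [c.strip("\n") for c in chunks if not c.startswith("--")]


def pick_primary(header: str, parts: list[str]) -> tuple[str, list[str]]:
    """Return (primary, children): the first part whose content type
    equals the one announced in the header is the primary resource."""
    wanted = re.search(r'type="([^"]+)"', header).group(1).lower()
    for i, part in enumerate(parts):
        part_head = part.partition("\n\n")[0]
        found = re.search(r"^Content-Type:\s*([^;\s]+)", part_head, re.I | re.M)
        if found and found.group(1).lower() == wanted:
            return part, parts[:i] + parts[i + 1:]
    raise ValueError("no resource matches the header content type")


# Hypothetical sample document with one main page and two child resources.
sample = (
    'Content-Type: multipart/related; type="text/html"; boundary="B0UND"\n'
    "\n"
    "--B0UND\n"
    "Content-Type: text/html; charset=utf-8\n"
    "\n"
    "<html>main page</html>\n"
    "--B0UND\n"
    "Content-Type: image/png\n"
    "\n"
    "iVBORw==\n"
    "--B0UND\n"
    "Content-Type: text/html; charset=utf-8\n"
    "\n"
    "<html>linked page</html>\n"
    "--B0UND--\n"
)
header, parts = split_resources(sample)
primary, children = pick_primary(header, parts)
```

Taking the first match matters here: a linked page may share the main page's content type, and only the first matching resource is treated as the primary resource file.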
Further optionally, referring again to Figure 11, the device of the embodiment of the present invention may also include:
a primary resource file load module 25, configured to call a preset web page browsing tool to parse the page initial data corresponding to the stored web page file when an opening operation on the web page file is detected;
a child resource file load module 26, configured to load the page initial data of the child resource files stored at the local link addresses in the page initial data;
a display module 27, configured to display, in the display window of the web page browsing tool, a page that includes the page initial data and the page initial data obtained by decoding the child resource files.
Referring again to Figure 13, which is a structural schematic diagram of another user terminal of the embodiment of the present invention, the user terminal includes at least one processor 2001 (such as a CPU), at least one communication bus 2002, at least one network interface 2003, a memory 2004 and a display 2005. The communication bus 2002 is used to realize connection and communication between these components. The network interface 2003 may optionally include a standard wired interface and a wireless interface (such as Wi-Fi or a mobile communication interface). The memory 2004 may be a high-speed RAM, or a non-volatile memory, for example at least one magnetic disk storage. The memory 2004 may optionally also be at least one storage device located remotely from the aforementioned processor 2001. As shown in Figure 13, the memory 2004, as a computer storage medium, stores an operating system and a network communication module, as well as a web page processing application program and other programs.
Specifically, the memory 2004 is also used to store the cluster web pages document obtained by encapsulation;
The processor 2001 can be used to call the web page processing application program stored in the memory 2004, so as to: split the cluster web pages document into a primary resource file and each child resource file according to the boundary marker in the header information read from the cluster web pages document; decode the primary resource file to obtain the page initial data of the webpage to be archived, and decode each child resource file in turn to obtain the page initial data of each link page; name and store the page initial data corresponding to each decoded child resource file according to a preset local file naming rule; and, according to the storage address of the page initial data corresponding to each child resource file in turn, modify the corresponding link network addresses in the decoded page initial data of the webpage to be archived into local link addresses, and store the page initial data of the webpage to be archived, with the link network addresses modified, as a web page file;
The display 2005 is used to display the page, parsed and opened by the processor 2001, that includes the page initial data and the page initial data obtained by decoding the child resource files.
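The decoding step performed by the processor can be sketched as below. The sketch assumes each resource file carries `Content-Type` and `Content-Transfer-Encoding` fields ahead of its payload, which is how base64 and Quoted-printable content is usually labelled but is an assumption about this document's layout. The function name `decode_resource` and the sample parts are hypothetical.

```python
import base64
import quopri


def decode_resource(part: str) -> tuple[dict[str, str], bytes]:
    """Split one resource file into its header fields and decoded payload."""
    head, _, payload = part.partition("\n\n")
    fields: dict[str, str] = {}
    for line in head.splitlines():
        name, _, value = line.partition(":")
        fields[name.strip().lower()] = value.strip()
    encoding = fields.get("content-transfer-encoding", "").lower()
    raw = payload.encode("ascii")
    if encoding == "base64":
        data = base64.b64decode(raw)          # binary resources
    elif encoding == "quoted-printable":
        data = quopri.decodestring(raw)       # textual resources
    else:
        data = raw                            # assumed: stored unencoded
    return fields, data


# Hypothetical sample parts.
html_part = (
    "Content-Type: text/html; charset=utf-8\n"
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "<p>caf=C3=A9</p>"
)
image_part = (
    "Content-Type: image/png\n"
    "Content-Transfer-Encoding: base64\n"
    "\n"
    "iVBORw=="
)
html_fields, html_bytes = decode_resource(html_part)
image_fields, image_bytes = decode_resource(image_part)
print(html_bytes.decode("utf-8"))
```

The character set named in the code identification (here utf-8) is what allows the decoded bytes to be turned back into text without garbling.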
The embodiment of the present invention processes the page initial data of the webpage to be archived to obtain the primary resource file of the cluster web pages document, obtains the page initial data of each link page of the webpage to be archived based on that page initial data, processes the page initial data of the link pages to obtain each child resource file of the cluster web pages document, and finally archives them into the cluster web pages document. The embodiment of the present invention can thus archive the data of the webpage to be archived more comprehensively and accurately, and obtain the cluster web pages document of all kinds of webpages more effectively and completely. When the corresponding decoding is carried out, all data of the webpage to be archived can be decoded completely, which avoids archiving errors and problems such as garbled characters when the cluster web pages document is opened, and meets the user's demand for automated and intelligent processing of cluster web pages documents.
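The archiving side summarized above (encode each resource, then encapsulate with boundary markers) can be sketched as follows. Several points are assumptions rather than details from the text: any content type not starting with `text/` is treated as binary, the header uses MHTML-style `type` and `boundary` parameters, and the names `encode_resource` and `package` and the boundary string are hypothetical.

```python
import base64
import quopri


def encode_resource(data: bytes, content_type: str, charset: str = "") -> str:
    """Encode one resource and prepend its code identification."""
    # Assumption: non-text content types are treated as binary data.
    if not content_type.startswith("text/"):
        body = base64.encodebytes(data).decode("ascii")
        transfer = "base64"
    else:
        body = quopri.encodestring(data).decode("ascii")
        transfer = "quoted-printable"
    ident = content_type + (f"; charset={charset}" if charset else "")
    return f"Content-Type: {ident}\nContent-Transfer-Encoding: {transfer}\n\n{body}"


def package(primary: str, children: list[str], boundary: str = "----=_Part_0") -> str:
    """Build header information and wrap every resource in boundary markers."""
    header = (
        f'Content-Type: multipart/related; type="text/html"; boundary="{boundary}"\n\n'
    )
    parts = "".join(f"--{boundary}\n{p.rstrip()}\n" for p in [primary, *children])
    return header + parts + f"--{boundary}--\n"


# Hypothetical sample resources.
main = encode_resource("<p>caf\u00e9</p>".encode("utf-8"), "text/html", "utf-8")
image = encode_resource(b"\x89PNG", "image/png")
archive = package(main, [image], boundary="B0UND")
print(archive)
```

Placing the primary resource first keeps it discoverable on decoding, where the first resource whose content type matches the header is taken as the primary resource file.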
Those of ordinary skill in the art will appreciate that all or part of the processes in the above embodiment methods can be completed by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, may include the processes of the embodiments of each method above. The storage medium can be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), or the like.
The above disclosure is only the preferred embodiments of the present invention and of course cannot limit the scope of the rights of the present invention; equivalent changes made in accordance with the claims of the present invention therefore still fall within the scope of the present invention.

Claims (19)

1. A web page processing method, characterized by comprising:
obtaining the page initial data of a webpage to be archived, and obtaining the code identification of the page initial data;
parsing the page initial data of the webpage to be archived, determining each link page associated with the webpage to be archived, and obtaining the page initial data and code identification of each associated link page;
encoding the page initial data and code identification of the webpage to be archived to obtain a primary resource file, and separately encoding the page initial data and code identification of each link page to obtain child resource files;
encapsulating the obtained primary resource file and each child resource file into a cluster web pages document;
wherein the encoding of the page initial data and code identification of the webpage to be archived to obtain a primary resource file, and the separate encoding of the page initial data and code identification of each link page to obtain child resource files, comprises:
encoding the page initial data of the webpage to be archived based on quoted-printable (Quoted-printable) encoding, and combining the encoded file with the code identification of the page initial data of the webpage to be archived to obtain the primary resource file;
if the content type in the code identification corresponding to the page initial data of a link page indicates that the page initial data is data in binary format, encoding the page initial data in binary format based on preconfigured base64 encoding; otherwise, encoding the page initial data in non-binary format based on preconfigured quoted-printable (Quoted-printable) encoding; and combining each encoded file with the code identification of the page initial data of the corresponding link page to obtain the child resource files.
2. The method as claimed in claim 1, characterized in that the parsing of the page initial data of the webpage to be archived, determining each link page associated with the webpage to be archived, and obtaining the page initial data and code identification of each associated link page, comprises:
parsing the page initial data of the webpage to be archived as target data, and obtaining each link page recorded in the target data;
obtaining the page initial data and code identification of each link page recorded in the target data, taking the page initial data of each link page recorded in the target data as new target data, and repeating this step until the page initial data and code identification of each link page associated with the webpage to be archived have been obtained.
3. The method according to claim 2, characterized in that the obtaining of each link page recorded in the target data comprises:
intercepting each link network address recorded in the target data according to a preset regular expression;
completing each intercepted link network address according to the network address of the webpage to be archived, obtaining the absolute network address corresponding to each link network address;
querying according to each obtained absolute network address to obtain each link page recorded in the target data.
4. The method as claimed in any one of claims 1 to 3, characterized in that the code identification of the page initial data comprises:
a character set identifier for determining the encoding mode of the page initial data, and a content type for distinguishing whether the page initial data is data in binary format.
5. The method as claimed in claim 4, characterized in that the encapsulating of the obtained primary resource file and each child resource file into a cluster web pages document comprises:
constructing header information for the cluster web pages document;
adding a boundary marker for the primary resource file, and adding a boundary marker for each child resource file;
saving, in a specified storage format, the primary resource file and child resource files to which boundary markers have been added together with the constructed header information, obtaining a cluster web pages document that includes the primary resource file and each child resource file.
6. The method as claimed in claim 5, characterized in that the header information comprises: the boundary marker, and the code identification of the page initial data of the webpage to be archived.
7. A web page processing method, characterized by comprising:
splitting a cluster web pages document into a primary resource file and each child resource file according to the boundary marker in the header information read from the cluster web pages document;
decoding the primary resource file to obtain the page initial data of the webpage to be archived, and decoding each child resource file in turn to obtain the page initial data of each link page;
naming and storing the page initial data corresponding to each decoded child resource file according to a preset local file naming rule;
according to the storage address of the page initial data corresponding to each child resource file in turn, modifying the corresponding link network addresses in the decoded page initial data of the webpage to be archived into local link addresses, and storing the page initial data of the webpage to be archived, with the link network addresses modified, as a web page file.
8. The method as claimed in claim 7, characterized in that the splitting of the cluster web pages document into a primary resource file and each child resource file according to the boundary marker in the header information read from the cluster web pages document comprises:
splitting the cluster web pages document according to the boundary marker in the header information read from the cluster web pages document, obtaining each resource file;
extracting the content type from the code identification in the header information of the cluster web pages document, and comparing the content type extracted from the header information with the content type of the code identification included in each resource file;
according to the comparison result, determining the first resource file whose code identification has the same content type as the one extracted from the header information as the primary resource file, and determining the other resource files as child resource files.
9. The method as claimed in claim 7 or 8, characterized by further comprising:
when an opening operation on the stored web page file is detected, calling a preset web page browsing tool to parse the page initial data corresponding to the web page file;
loading the page initial data of the child resource files stored at the local link addresses in the page initial data;
displaying, in the display window of the web page browsing tool, a page that includes the page initial data and the page initial data obtained by decoding the child resource files.
10. A web page processing device, characterized by comprising:
an obtaining module, configured to obtain the page initial data of a webpage to be archived, and to obtain the code identification of the page initial data;
a parsing module, configured to parse the page initial data of the webpage to be archived, determine each link page associated with the webpage to be archived, and obtain the page initial data and code identification of each associated link page;
a coding module, configured to encode the page initial data and code identification of the webpage to be archived to obtain a primary resource file, and to separately encode the page initial data and code identification of each link page to obtain child resource files;
a profiling module, configured to encapsulate the obtained primary resource file and each child resource file into a cluster web pages document;
wherein the coding module comprises:
a primary resource coding unit, configured to encode the page initial data of the webpage to be archived based on quoted-printable (Quoted-printable) encoding, and to combine the encoded file with the code identification of the page initial data of the webpage to be archived to obtain the primary resource file;
a child resource coding unit, configured to: if the content type in the code identification corresponding to the page initial data of a link page indicates that the page initial data is data in binary format, encode the page initial data in binary format based on preconfigured base64 encoding; otherwise, encode the page initial data in non-binary format based on preconfigured quoted-printable (Quoted-printable) encoding; and combine each encoded file with the code identification of the page initial data of the corresponding link page to obtain the child resource files.
11. The device as claimed in claim 10, characterized in that the parsing module comprises:
a determination unit, configured to parse the page initial data of the webpage to be archived as target data, and to obtain each link page recorded in the target data;
a processing unit, configured to obtain the page initial data and code identification of each link page recorded in the target data, and to process the page initial data of each link page recorded in the target data as new target data, obtaining the page initial data and code identification of each link page associated with the webpage to be archived.
12. The device as claimed in claim 11, characterized in that:
the determination unit is configured to intercept each link network address recorded in the target data according to a preset regular expression; to complete each intercepted link network address according to the network address of the webpage to be archived, obtaining the absolute network address corresponding to each link network address; and to query according to each obtained absolute network address to obtain each link page recorded in the target data.
13. The device as claimed in any one of claims 10 to 12, characterized in that the code identification of the page initial data comprises: a character set identifier for determining the encoding mode of the page initial data, and a content type for distinguishing whether the page initial data is data in binary format.
14. The device as claimed in claim 13, characterized in that the profiling module comprises:
a head setting unit, configured to construct header information for the cluster web pages document;
a flag setting unit, configured to add a boundary marker for the primary resource file, and to add a boundary marker for each child resource file;
a storage unit, configured to save, in a specified storage format, the primary resource file and child resource files to which boundary markers have been added together with the constructed header information, obtaining a cluster web pages document that includes the primary resource file and each child resource file.
15. A web page processing device, characterized by comprising:
a segmentation module, configured to split a cluster web pages document into a primary resource file and each child resource file according to the boundary marker in the header information read from the cluster web pages document;
a decoder module, configured to decode the primary resource file to obtain the page initial data of the webpage to be archived, and to decode each child resource file in turn to obtain the page initial data of each link page;
a child resource processing module, configured to name and store the page initial data corresponding to each decoded child resource file according to a preset local file naming rule;
a memory module, configured to modify, according to the storage address of the page initial data corresponding to each child resource file in turn, the corresponding link network addresses in the decoded page initial data of the webpage to be archived into local link addresses, and to store the page initial data of the webpage to be archived, with the link network addresses modified, as a web page file.
16. The device as claimed in claim 15, characterized in that the segmentation module comprises:
a resource file cutting unit, configured to split the cluster web pages document according to the boundary marker in the header information read from the cluster web pages document, obtaining each resource file;
a comparing unit, configured to extract the content type from the code identification in the header information of the cluster web pages document, and to compare the content type extracted from the header information with the content type of the code identification included in each resource file;
a determination unit, configured to determine, according to the comparison result, the first resource file whose code identification has the same content type as the one extracted from the header information as the primary resource file, and to determine the other resource files as child resource files.
17. The device as claimed in claim 15 or 16, characterized by further comprising:
a primary resource file load module, configured to call a preset web page browsing tool to parse the page initial data corresponding to the stored web page file when an opening operation on the web page file is detected;
a child resource file load module, configured to load the page initial data of the child resource files stored at the local link addresses in the page initial data;
a display module, configured to display, in the display window of the web page browsing tool, a page that includes the page initial data and the page initial data obtained by decoding the child resource files.
18. A user terminal, characterized by comprising: a processor and a memory;
the processor is configured to obtain the page initial data of a webpage to be archived, and to obtain the code identification of the page initial data; to parse the page initial data of the webpage to be archived, determine each link page associated with the webpage to be archived, and obtain the page initial data and code identification of each associated link page; to encode the page initial data and code identification of the webpage to be archived to obtain a primary resource file, and to separately encode the page initial data and code identification of each link page to obtain child resource files; and to encapsulate the obtained primary resource file and each child resource file into a cluster web pages document;
the memory is configured to store the cluster web pages document obtained by encapsulation;
wherein the processor is specifically configured to encode the page initial data of the webpage to be archived based on quoted-printable (Quoted-printable) encoding, and to combine the encoded file with the code identification of the page initial data of the webpage to be archived to obtain the primary resource file; and, if the content type in the code identification corresponding to the page initial data of a link page indicates that the page initial data is data in binary format, to encode the page initial data in binary format based on preconfigured base64 encoding; otherwise, to encode the page initial data in non-binary format based on preconfigured quoted-printable (Quoted-printable) encoding; and to combine each encoded file with the code identification of the page initial data of the corresponding link page to obtain the child resource files.
19. A user terminal, characterized by comprising: a processor, a memory and a display;
the memory is configured to store the cluster web pages document obtained by encapsulation;
the processor is configured to split the cluster web pages document into a primary resource file and each child resource file according to the boundary marker in the header information read from the cluster web pages document; to decode the primary resource file to obtain the page initial data of the webpage to be archived, and to decode each child resource file in turn to obtain the page initial data of each link page; to name and store the page initial data corresponding to each decoded child resource file according to a preset local file naming rule; and, according to the storage address of the page initial data corresponding to each child resource file in turn, to modify the corresponding link network addresses in the decoded page initial data of the webpage to be archived into local link addresses, and to store the page initial data of the webpage to be archived, with the link network addresses modified, as a web page file;
the display is configured to display the page, parsed and opened by the processor, that includes the page initial data and the page initial data obtained by decoding the child resource files.
CN201410133677.2A 2014-04-03 2014-04-03 A kind of web page processing method, device and user terminal Active CN104978325B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410133677.2A CN104978325B (en) 2014-04-03 2014-04-03 A kind of web page processing method, device and user terminal

Publications (2)

Publication Number Publication Date
CN104978325A CN104978325A (en) 2015-10-14
CN104978325B true CN104978325B (en) 2019-06-25

Family

ID=54274840

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410133677.2A Active CN104978325B (en) 2014-04-03 2014-04-03 A kind of web page processing method, device and user terminal

Country Status (1)

Country Link
CN (1) CN104978325B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106959975B (en) * 2016-01-11 2021-06-04 阿里巴巴(中国)有限公司 Transcoding resource cache processing method, device and equipment
CN106161427B (en) * 2016-06-08 2020-02-11 北京兰云科技有限公司 Webpage processing method, network analyzer and HTTP server
CN110175302B (en) * 2019-05-20 2021-04-23 北京字节跳动网络技术有限公司 Method and device for embedding webpage in document
CN111931113B (en) * 2020-09-16 2021-01-05 深圳壹账通智能科技有限公司 Data cleaning method and related equipment
CN112632919B (en) * 2020-09-28 2022-03-08 腾讯科技(深圳)有限公司 Document editing method and device, computer equipment and storage medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1873644A (en) * 2005-06-03 2006-12-06 国际商业机器公司 Method and computer system for content recovery due to user triggering
CN101179550A (en) * 2006-12-14 2008-05-14 腾讯科技(深圳)有限公司 Personal homepage implementing method and system
CN101578592A (en) * 2006-08-22 2009-11-11 雅虎公司 Persistent saving portal
CN101785005A (en) * 2007-08-29 2010-07-21 国际商业机器公司 Apparatus, system, and method for cooperation between a browser and a server to package small objects in one or more archives
CN102065571A (en) * 2010-12-30 2011-05-18 深圳市五巨科技有限公司 Mobile terminal browser and working method thereof
CN102651017A (en) * 2012-03-30 2012-08-29 北京英富森信息技术有限公司 Webpage original edition and original appearance display method based on uniform resource locator (URL) address rewrite
CN102737116A (en) * 2012-05-29 2012-10-17 深圳市同洲电子股份有限公司 Method and device for storing webpage resources
CN103365860A (en) * 2012-03-28 2013-10-23 腾讯科技(深圳)有限公司 Method, device and terminal for processing web pages
CN103631916A (en) * 2013-11-29 2014-03-12 北京奇虎科技有限公司 Method and device for downloading downloadable resources

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8874694B2 (en) * 2009-08-18 2014-10-28 Facebook, Inc. Adaptive packaging of network resources

Also Published As

Publication number Publication date
CN104978325A (en) 2015-10-14

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20221111

Address after: 1402, Floor 14, Block A, Haina Baichuan Headquarters Building, No. 6, Baoxing Road, Haibin Community, Xin'an Street, Bao'an District, Shenzhen, Guangdong 518133

Patentee after: Shenzhen Yayue Technology Co.,Ltd.

Address before: Room 403, East Block 2, SEG Science and Technology Park, Zhenxing Road, Futian District, Shenzhen, Guangdong 518000

Patentee before: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd.