CN104978325A

CN104978325A - Webpage processing method and device, and user terminal

Info

Publication number: CN104978325A
Application number: CN201410133677.2A
Authority: CN
Inventors: 王文涛
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Shenzhen Yayue Technology Co ltd
Priority date: 2014-04-03
Filing date: 2014-04-03
Publication date: 2015-10-14
Anticipated expiration: 2034-04-03
Also published as: CN104978325B

Abstract

An embodiment of the invention discloses a webpage processing method and device, and a user terminal. The method comprises the following steps: obtaining the page original data of a webpage to be archived, and obtaining a coding identifier of the page original data; parsing the page original data of the webpage to be archived, determining each link page associated with the webpage to be archived, and obtaining the page original data and the coding identifier of each associated link page; encoding the page original data and the coding identifier of the webpage to be archived to obtain a main resource file, and separately encoding the page original data and the coding identifier of each associated link page to obtain child resource files; and packaging the main resource file and each child resource file to obtain an aggregation webpage document. The method and device can obtain the aggregation webpage document of each type of webpage effectively and completely, and meet users' requirements for automated, intelligent processing of aggregation webpage documents.

Description

Web page processing method, device, and user terminal

Technical field

The present invention relates to the technical field of computer network applications, and in particular to a web page processing method, a device, and a user terminal.

Background art

An MHT file, also called a cluster web page HTML document or a single-file web page, stores a web page containing one or more elements (for example, pictures, Flash animations, or short videos) as a single file with the extension .mht; files in this format are called MHT files for short. This makes it more convenient for users to save and manage web page content.

Existing MHT implementations generally handle only the page raw data of the current web page. If the current web page also involves other linked pages, such as the link pages of the pictures, animations, and similar elements attached to it, archiving may fail, or garbled text may appear when the MHT file is opened.

Summary of the invention

The technical problem to be solved by the embodiments of the present invention is to provide a web page processing method, a device, and a user terminal that can obtain the cluster web page document of all kinds of web pages relatively effectively and completely.

To solve the above technical problem, an embodiment of the present invention provides a web page processing method, comprising:

obtaining the page raw data of a web page to be archived, and obtaining the code identification of the page raw data;

parsing the page raw data of the web page to be archived, determining each link page associated with the web page to be archived, and obtaining the page raw data and code identification of each associated link page;

encoding the page raw data and code identification of the web page to be archived to obtain a primary resource file, and separately encoding the page raw data and code identification of each link page to obtain child resource files; and

packaging the obtained primary resource file and each child resource file into a cluster web page document.

An embodiment of the present invention further provides another web page processing method, comprising:

splitting a cluster web page document into a primary resource file and child resource files according to the boundary marker in the header information read from the cluster web page document;

decoding the primary resource file to obtain the page raw data of the archived web page, and decoding each child resource file in turn to obtain the page raw data of each link page;

naming and storing the decoded page raw data corresponding to each child resource file according to a preset local file naming rule; and

in turn, according to the storage address of the page raw data corresponding to each child resource file, changing the corresponding link URL in the decoded page raw data of the archived web page to a local link address, and storing the page raw data of the archived web page, with its link URLs modified, as a web page file.

Correspondingly, an embodiment of the present invention further provides a web page processing device, comprising:

an acquisition module, configured to obtain the page raw data of a web page to be archived and to obtain the code identification of the page raw data;

a parsing module, configured to parse the page raw data of the web page to be archived, determine each link page associated with the web page to be archived, and obtain the page raw data and code identification of each associated link page;

a coding module, configured to encode the page raw data and code identification of the web page to be archived to obtain a primary resource file, and to separately encode the page raw data and code identification of each link page to obtain child resource files; and

an archiving module, configured to package the obtained primary resource file and each child resource file into a cluster web page document.

An embodiment of the present invention further provides another web page processing device, comprising:

a segmentation module, configured to split a cluster web page document into a primary resource file and child resource files according to the boundary marker in the header information read from the cluster web page document;

a decoding module, configured to decode the primary resource file to obtain the page raw data of the archived web page, and to decode each child resource file in turn to obtain the page raw data of each link page;

a child resource processing module, configured to name and store the decoded page raw data corresponding to each child resource file according to a preset local file naming rule; and

a storage module, configured to change, in turn and according to the storage address of the page raw data corresponding to each child resource file, the corresponding link URL in the decoded page raw data of the archived web page to a local link address, and to store the page raw data of the archived web page, with its link URLs modified, as a web page file.

Correspondingly, an embodiment of the present invention provides a user terminal, comprising a processor and a memory.

The processor is configured to: obtain the page raw data of a web page to be archived, and obtain the code identification of the page raw data; parse the page raw data of the web page to be archived, determine each link page associated with the web page to be archived, and obtain the page raw data and code identification of each associated link page; encode the page raw data and code identification of the web page to be archived to obtain a primary resource file, and separately encode the page raw data and code identification of each link page to obtain child resource files; and package the obtained primary resource file and each child resource file into a cluster web page document.

The memory is configured to store the packaged cluster web page document.

An embodiment of the present invention provides another user terminal, comprising a processor, a memory, and a display.

The memory is configured to store the packaged cluster web page document.

The processor is configured to: split the cluster web page document into a primary resource file and child resource files according to the boundary marker in the header information read from the cluster web page document; decode the primary resource file to obtain the page raw data of the archived web page, and decode each child resource file in turn to obtain the page raw data of each link page; name and store the decoded page raw data corresponding to each child resource file according to a preset local file naming rule; and, in turn and according to the storage address of the page raw data corresponding to each child resource file, change the corresponding link URL in the decoded page raw data of the archived web page to a local link address, and store the page raw data of the archived web page, with its link URLs modified, as a web page file.

The display is configured to display the page that the processor parses and opens, the page comprising the decoded page raw data and the decoded child resource files.

In the embodiments of the present invention, the page raw data of the web page to be archived is processed to obtain the primary resource file of the cluster web page document; the page raw data of each link page of the web page to be archived is obtained on the basis of that page raw data and processed to obtain each child resource file of the cluster web page document; and the cluster web page document is finally obtained by archiving. The embodiments can archive the data of the web page to be archived relatively comprehensively and accurately, and thus obtain the cluster web page document of all kinds of web pages relatively effectively and completely. On decoding, all data of the archived web page can be recovered in full, which avoids archiving errors and garbled text when the cluster web page document is opened, and meets users' requirements for automated, intelligent processing of cluster web page documents.

Brief description of the drawings

To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a schematic flowchart of a web page processing method according to an embodiment of the present invention;

Fig. 2 is a schematic flowchart of another web page processing method according to an embodiment of the present invention;

Fig. 3 is a schematic flowchart of yet another web page processing method according to an embodiment of the present invention;

Fig. 4 is a schematic diagram of how the decoded files corresponding to the primary resource file and the child resource files are stored, according to an embodiment of the present invention;

Fig. 5 is a schematic flowchart of still another web page processing method according to an embodiment of the present invention;

Fig. 6 is a schematic structural diagram of a web page processing device according to an embodiment of the present invention;

Fig. 7 is a schematic structural diagram of one form of the parsing module in Fig. 6;

Fig. 8 is a schematic structural diagram of one form of the coding module in Fig. 6;

Fig. 9 is a schematic structural diagram of one form of the archiving module in Fig. 6;

Fig. 10 is a schematic structural diagram of a user terminal according to an embodiment of the present invention;

Fig. 11 is a schematic structural diagram of another web page processing device according to an embodiment of the present invention;

Fig. 12 is a schematic structural diagram of one form of the segmentation module in Fig. 11;

Fig. 13 is a schematic structural diagram of another user terminal according to an embodiment of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.

The embodiments of the present invention obtain the page raw data of a web page to be archived, obtain from that page raw data all the link pages it involves, obtain the page raw data of all those link pages, process the obtained page raw data, and finally archive it as a cluster web page document. Web pages of all types can thus be saved more effectively in the form of a cluster web page document, and when the cluster web page document is later opened, the corresponding page content can be displayed correctly and completely.

Refer to Fig. 1, which is a schematic flowchart of a web page processing method according to an embodiment of the present invention. The method can be applied in user terminals with web browsing capability, such as mobile phones, tablet computers, PCs, and smart wearable devices. Specifically, the method comprises:

S101: obtain the page raw data of the web page to be archived, and obtain the code identification of the page raw data.

The page raw data of the web page to be archived mainly comprises the source code data of the page. The web page to be archived can be the main page opened by the browser after the user enters a web page link address, that is, a URL (Uniform Resource Locator), in the browser; the terminal can read the page raw data, including the source code data, directly from the opened main page. Alternatively, when the user wishes to archive a certain web page, the user can enter its link address in a link address input box; the terminal then automatically locates the corresponding web page on the corresponding server according to the link address and pulls its page raw data.

The code identification can be obtained from the page raw data and comprises the character set mark indicating how the page raw data is encoded, and the content type. Page raw data generally contains content-type (content type) and charset (character set) entries, and the code identification can be derived from these two entries. For example, if a piece of page raw data contains <meta http-equiv="content-type" content="text/html; charset=utf-8">, its code identification is "text/html" and "utf-8": the content type of the page raw data is "text/html" (a text page), and the character set used is "utf-8".
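
The extraction of the content type and character set described above can be sketched as follows. This is only a minimal illustration, assuming the declaration appears as a quoted `meta` tag in the page source; the names `META_RE` and `code_identification` are hypothetical and do not come from the patent.

```python
import re
from typing import Optional, Tuple

# Hypothetical pattern: finds the content attribute of a
# <meta http-equiv="content-type" ...> declaration in the page source.
META_RE = re.compile(
    r'<meta[^>]+http-equiv\s*=\s*["\']?content-type["\']?[^>]*'
    r'content\s*=\s*["\']([^"\']+)["\']',
    re.IGNORECASE,
)

def code_identification(page_source: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (content_type, charset); a missing item is reported as None."""
    match = META_RE.search(page_source)
    if match is None:
        return None, None
    content_type, _, params = match.group(1).partition(";")
    charset = None
    for param in params.split(";"):
        name, _, value = param.partition("=")
        if name.strip().lower() == "charset":
            charset = value.strip().lower()
    return content_type.strip().lower(), charset

html = '<meta http-equiv="content-type" content="text/html; charset=utf-8">'
print(code_identification(html))
```

A real page may declare its character set in other ways (for example in HTTP response headers), which this sketch does not cover.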

S102: parse the page raw data of the web page to be archived, determine each link page associated with the web page to be archived, and obtain the page raw data and code identification of each associated link page.

In the embodiments of the present invention, a link page is a web page element that makes up the web page to be archived, such as a picture, video, or Flash animation used by the page. The page raw data records the relative storage addresses of these pictures, videos, Flash animations, and other web page elements.

In S102, the page raw data of each link page is obtained according to the relative storage addresses in the page raw data. If the page raw data of an obtained link page itself contains relative addresses of further link pages, the link pages corresponding to those relative addresses are obtained as well; by recursive processing, the page raw data and code identification of all link pages are obtained. The page raw data and code identification of each link page are obtained in the same way as those of the main page of the web page to be archived.
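
The recursive processing described in S102 can be sketched as follows. To stay self-contained, the sketch replaces the network with an in-memory dictionary; `FAKE_WEB`, `LINK_RE`, and `collect_link_pages` are hypothetical names, and the link pattern is deliberately simpler than the regular expressions given later in this description.

```python
import re
from urllib.parse import urljoin

# Hypothetical stand-in for the network: URL -> page raw data (as text).
FAKE_WEB = {
    "http://example.com/index.html": '<link href="css/site.css"><img src="/img/logo.png">',
    "http://example.com/css/site.css": "body { background: url(../img/bg.png) }",
    "http://example.com/img/logo.png": "",
    "http://example.com/img/bg.png": "",
}

# Simplified link pattern (an assumption made for this sketch only).
LINK_RE = re.compile(r'(?:src|href)\s*=\s*["\']([^"\']+)["\']|url\(\s*([^)\s]+)\s*\)')

def collect_link_pages(url, fetch, seen=None):
    """Recursively gather the main page and every link page reachable from it."""
    seen = {} if seen is None else seen
    if url in seen:
        return seen  # already fetched; this also guards against link cycles
    seen[url] = fetch(url)
    for attr_link, css_link in LINK_RE.findall(seen[url]):
        relative = attr_link or css_link
        # Relative addresses are resolved against the page that contains them.
        collect_link_pages(urljoin(url, relative), fetch, seen)
    return seen

pages = collect_link_pages("http://example.com/index.html", FAKE_WEB.__getitem__)
print(sorted(pages))
```

The `seen` dictionary doubles as the result and as a visited set, so a resource referenced from several pages is fetched only once.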

S103: encode the page raw data and code identification of the web page to be archived to obtain a primary resource file, and separately encode the page raw data and code identification of each link page to obtain child resource files.

The main page of a web page to be archived generally has the content type "Content-Type: text/html" (a text page). The page raw data and code identification of the web page to be archived can therefore be encoded directly with quoted-printable encoding to obtain the primary resource file of the cluster web page document. The page raw data of the link pages can use different encodings. For example, when the content type is "Content-Type: image/png" (a picture type), the page raw data of the link page is encoded with base64 to obtain the corresponding child resource file; for a link page whose content type is "Content-Type: text/html" (a text page), quoted-printable encoding is likewise used to obtain the corresponding child resource file.
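
The choice between the two encodings can be sketched with Python's standard `quopri` and `base64` modules. The function name `encode_resource` and the rule "anything under text/ is treated as text" are assumptions made for this illustration.

```python
import base64
import quopri

def encode_resource(raw: bytes, content_type: str) -> tuple:
    """Return (transfer_encoding, encoded_bytes) for one resource."""
    if content_type.lower().startswith("text/"):
        # Mostly-ASCII text stays readable under quoted-printable.
        return "quoted-printable", quopri.encodestring(raw)
    # Pictures, video and other binary data go through base64.
    return "base64", base64.encodebytes(raw)

text_encoding, text_body = encode_resource(b'<a href="x">', "text/html")
image_encoding, image_body = encode_resource(b"\x89PNG\r\n", "image/png")
print(text_encoding, text_body)
print(image_encoding, image_body)
```

Both encodings are reversible, so the decoder only needs the recorded transfer encoding to recover the original bytes.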

S104: package the obtained primary resource file and each child resource file into a cluster web page document.

After the primary resource and each child resource are obtained, the primary resource file and each child resource file can be packaged directly according to the protocol corresponding to the cluster web page document MHT, yielding the cluster web page document.

Specifically, in the embodiments of the present invention, one way of packaging comprises the following steps. First, the MHT header is built and filled with the MHT protocol information; information such as the content type of the primary resource and the boundary marker can also be set in the header. Setting the content type of the primary resource allows subsequent decoding to treat the first resource whose content type equals the primary resource content type in the MHT header as the primary resource, so that the main page content of the web page to be archived is decoded correctly. Second, each component of the MHT is built in turn; generally the primary resource is set as the first part and the child resources as the second part. When each part of the MHT is built, the boundary marker set in the MHT header is added for each resource, and the content type of the page raw data corresponding to each resource is recorded, to allow accurate segmentation and decoding. Finally, the built parts are spliced into a complete MHT file, which is saved as UTF-8 text. In the embodiments of the present invention, after the child resource and code identification corresponding to each link page have all been built into the corresponding parts of the cluster web page document MHT, an end mark is set, so that the splicing of the built parts into a complete MHT file is performed when the end mark is detected.
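
The packaging steps above (header, primary part, child parts, end mark) can be sketched as follows. This assumes standard MIME multipart conventions, in which each part is introduced by "--" followed by the boundary and the document is closed by the boundary followed by "--"; the function name `package_mht` is hypothetical.

```python
def package_mht(boundary: str, primary_part: str, child_parts: list) -> str:
    """Join the header, the primary part, the child parts and the end mark."""
    header = (
        "MIME-Version: 1.0\r\n"
        "Content-Type: multipart/related;\r\n"
        '\ttype="text/html";\r\n'
        f'\tboundary="{boundary}"\r\n'
    )
    sections = [header]
    for part in [primary_part, *child_parts]:  # primary resource comes first
        sections.append(f"--{boundary}\r\n{part}\r\n")
    sections.append(f"--{boundary}--\r\n")  # end mark closes the document
    return "\r\n".join(sections)

doc = package_mht(
    "----=_NextPart_000_00",
    "Content-Type: text/html\r\n\r\n<html>",
    ["Content-Type: image/png\r\n\r\niVBO"],
)
print(doc)
```

Each part passed in is expected to carry its own headers and encoded body already; the function only adds the boundaries.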

When the primary resource file and each child resource file are archived, the child resource files are archived in order according to the position, in the page raw data corresponding to the primary resource file, of the link URL corresponding to the page raw data of each child resource file. On subsequent decoding, the child resource files can then be decoded in the order in which they correspond to the page raw data of the primary resource file, so that the browser can open the corresponding picture, video, and other content at the correct positions.
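
One way to realise this ordering is to sort the child resources by the position at which their link first appears in the main page source. The sketch below is an assumption about how that could be done; `order_children` is a hypothetical name, and links that do not appear directly in the main page (for example nested ones) are simply placed last.

```python
def order_children(primary_source: str, child_urls: list) -> list:
    """Sort child resources by where their link first appears in the main page."""
    def position(url):
        index = primary_source.find(url)
        # Links not found in the main page itself are sorted to the end.
        return index if index >= 0 else len(primary_source)
    return sorted(child_urls, key=position)

source = '<img src="/a.png"><script src="/b.js"></script><img src="/c.png">'
ordered = order_children(source, ["/c.png", "/a.png", "/b.js"])
print(ordered)
```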

Refer to Fig. 2, which is a schematic flowchart of another web page processing method according to an embodiment of the present invention. The method can be applied in user terminals with web browsing capability, such as mobile phones, tablet computers, PCs, and smart wearable devices. Specifically, the method comprises:

S201: obtain the page raw data of the web page to be archived, and obtain the code identification of the page raw data. The code identification of the page raw data of the main page of the web page to be archived comprises a character set mark, used to determine how the page raw data of the main page is encoded, and a content type, used to distinguish whether the page raw data is binary-format data.

In the embodiments of the present invention, the character stream of the HTML header and the characters of the web page to be archived can be intercepted and repeatedly decoded and checked to obtain the coded character set to which the characters belong. In this way the character set used by the web page to be archived can be determined efficiently and with high accuracy. The content type can be determined from the "content-type" in the page raw data, so that page raw data whose content type is picture, video, or the like is identified as binary-format data.
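
The description does not spell out the detection procedure. One plausible reading, sketched below purely as an assumption, is trial decoding: the intercepted bytes are decoded against a list of candidate character sets, and the first one that decodes without error is accepted. The candidate list and the name `detect_charset` are hypothetical.

```python
CANDIDATE_CHARSETS = ("utf-8", "gb18030", "big5")  # assumed candidate list

def detect_charset(head_bytes: bytes, candidates=CANDIDATE_CHARSETS):
    """Return the first candidate charset that decodes the bytes without error."""
    for charset in candidates:
        try:
            head_bytes.decode(charset)
        except UnicodeDecodeError:
            continue  # this candidate cannot represent the bytes; try the next
        return charset
    return None

print(detect_charset("网页".encode("utf-8")))
print(detect_charset("网页".encode("gbk")))
```

The order of the candidates matters, because permissive character sets accept many byte sequences; stricter ones such as UTF-8 are therefore tried first.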

S202: parse the page raw data of the web page to be archived as target data, and obtain each link page recorded in the target data.

In S202, obtaining each link page recorded in the target data comprises: extracting each link URL recorded in the target data according to preset regular expressions; completing each extracted link URL according to the URL of the web page to be archived, to obtain the absolute URL corresponding to each link URL; and querying each absolute URL to obtain each link page recorded in the target data. Here a link URL is a relative URL, relative to the URL of the web page to be archived.

Regular expressions can efficiently extract the relative paths and absolute paths in the page raw data while excluding interference from in-page links. The regular expressions that the embodiments of the present invention can use include the following:

Matches: url("image/logo.jpg")

regexRule=@"url\\(\\s*(('\\s*[^']+')|(\"\\s*[^\"]+\")|(\\s*[^\\)]+))";

Matches: src='filename.ext'; background="filename.ext"

@"(\\ssrc|\\sbackground)\\s*=\\s*(('[^']+')|(\"[^\"]+\")|([^\\n\\r\\f]+))";

Matches: @import "style.css" or @import url(style.css)

@"(@import\\s|\\S+-image:|background:)\\s*(url)*\\s*[\"'(]{1,2}[^\"')]+[\"')]{1,2}";

Matches: <link rel=stylesheet href="style.css">

@"<link[^>]+?href\\s*=\\s*('|\")*[^'\">]+('|\")*";

Matches: <iframe src="mypage.htm"> or <frame src="mypage.aspx">

@"<i*frame[^>]+?src\\s*=\\s*['\"]{0,1}[^'\"\\\\>]+['\"]{0,1}".

For example, the link URL "/static/search/ala/callicon.png" in the page raw data of the web page to be archived is completed against the URL of the web page to be archived to obtain the absolute URL "http://www.baidu.com/static/search/ala/callicon.png".
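
The extraction and completion steps can be sketched in Python as follows. The pattern is a Python rendering of the src/background expression above, restricted to quoted values for brevity, and the completion uses the standard library function `urllib.parse.urljoin`; the names `SRC_RE` and `absolute_links` are hypothetical.

```python
import re
from urllib.parse import urljoin

# Python rendering of the src/background pattern, quoted values only.
SRC_RE = re.compile(r"""(?:\ssrc|\sbackground)\s*=\s*(?:'([^']+)'|"([^"]+)")""")

def absolute_links(page_source: str, page_url: str) -> list:
    """Extract link URLs and complete them against the URL of the page."""
    links = []
    for single_quoted, double_quoted in SRC_RE.findall(page_source):
        links.append(urljoin(page_url, single_quoted or double_quoted))
    return links

source = '<img src="/static/search/ala/callicon.png">'
links = absolute_links(source, "http://www.baidu.com/")
print(links)
```

`urljoin` also leaves links that are already absolute unchanged, so the same call handles both relative and absolute paths.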

S203: obtain the page raw data and code identification of each link page recorded in the target data.

S204: determine whether the page raw data and code identification of all associated link pages have been obtained.

S205: if not, take the page raw data of each link page recorded in the target data as new target data and return to S203. That is, the page raw data and code identification of the link pages are obtained layer by layer, until the page raw data and code identification of all link pages associated with the web page to be archived have been obtained.

After the raw data of the link pages at each level is obtained, the regular expressions above can likewise be used to extract the relative addresses of the next level of link pages from that raw data, and then to obtain the page raw data and code identification of the link pages at the next level. S203 is repeated until the new target data contains no further link pages, at which point the page raw data and code identification of every link page associated with the web page to be archived have been obtained.

The code identification of the page raw data of the link pages at every level comprises a character set mark, used to determine how the page raw data is encoded, and a content type, used to distinguish whether the page raw data is binary-format data.

S206: if the page raw data and code identification of all associated link pages have been obtained, build the header information for the cluster web page document. The header information comprises the boundary marker and the code identification of the page raw data of the web page to be archived, and also includes information such as the MHT protocol. A concrete example follows:

Subject: WebArchive

Date: 004, 04 Mar 2XXX 23:22:27 PST

MIME-Version: 1.0 (MIME version)

Content-Type: multipart/related;

type="text/html"; (content type)

boundary="----=_NextPart_000_00" (boundary delimiter)

The header information example above indicates the creation time, protocol version, content type, boundary marker, and similar information.

S207: encode the page raw data of the web page to be archived to obtain the primary resource file. Specifically, the page raw data of the web page to be archived is encoded with quoted-printable encoding, the encoded data is combined with the code identification of the page raw data of the web page to be archived to obtain the primary resource file, and the boundary marker is then added to the primary resource file.

Quoted-printable uses printable ASCII characters to represent characters in various encodings. When the text does not contain many non-ASCII characters, as with HTML content, which consists mostly of tags, the encoded result is fairly readable and compact, and the data can be handled correctly on the corresponding data path or medium. The following shows the format of the primary resource file finally obtained by encoding the page raw data of a web page to be archived:

------=_NextPart_000_00

Content-Type:text/html;

charset="utf-8"

Content-Transfer-Encoding:quoted-printable

Content-Location:http://www.baidu.com

<html><head><meta http-equiv=3D"Content-Type"=

content=3D"text/html;charset=3Dutf-8"><meta name=3D"viewport"=

content=3D"width=3Ddevice-width,minimum-scale=3D1.0,maximum-scale=3D1.0=

,user-scalable=3Dno"><link rel=3D"apple-touch-icon-precomposed"=

href=3D"http://m.baidu.com/static/index/screen_icon.png"><meta=

name=3D"format-detection"= (remainder omitted here)

The first line of the primary resource file format above is the boundary marker, which allows later segmentation; this boundary marker is recorded and declared in the MHT header information described above. The second line shows that the content type of the main page raw data of the web page to be archived is "text/html", a text page type. The third line shows that the character set used is "utf-8". The fourth line shows that the encoding used is "quoted-printable", and the fifth line gives the URL to which the primary resource belongs. The content that follows is the encoded main page raw data.
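
A part with this layout can be produced with Python's standard `quopri` module, as sketched below. The function name `primary_part` and the use of CRLF line endings are assumptions for this illustration; the header lines follow the example above.

```python
import quopri

def primary_part(boundary: str, url: str, html: str, charset: str = "utf-8") -> bytes:
    """Build one quoted-printable part for the main page, boundary line included."""
    headers = (
        f"--{boundary}\r\n"
        "Content-Type: text/html;\r\n"
        f'\tcharset="{charset}"\r\n'
        "Content-Transfer-Encoding: quoted-printable\r\n"
        f"Content-Location: {url}\r\n"
        "\r\n"  # blank line separates the part headers from the body
    )
    body = quopri.encodestring(html.encode(charset))
    return headers.encode("ascii") + body

page = '<meta http-equiv="Content-Type">'
part = primary_part("----=_NextPart_000_00", "http://www.baidu.com", page)
print(part.decode("ascii"))
```

Because "=" is the escape character of quoted-printable, every literal "=" in the HTML appears as "=3D" in the encoded body, as in the example above.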

S208: encode the page raw data of each link page to obtain the child resource files. Specifically, if the content type in the code identification of a link page's page raw data indicates that the page raw data is binary-format data, the binary-format page raw data is encoded with the preconfigured base64 encoding. If it is not binary-format data, the page raw data is encoded with the preconfigured quoted-printable encoding. In both cases, the encoded data is combined with the code identification of the page raw data of the corresponding link page to obtain a child resource file, and a boundary marker is then added to each child resource file.

An example of the child resource file obtained by encoding binary-format page raw data follows:

------=_NextPart_000_00 (child resource)

Content-Type:image/png;

Content-Transfer-Encoding:base64

Content-Location:http://m.baidu.com/static/index/logo_index2.png

iVBORw0KGgoAAAANSUhEUgAAAVAAAABrCAYAAAAhItoDAAAXvElEQVR4Xu3dfXAV5b3A8eeck+Qk

5IWEtwAiBpCXYhEVpFB8KSDiK0WqohX0qrRK0SKtXuzItVaFq1crA8L4VrhUi5dCuVgKl0Kpg3Ws

KfWClIsIhIAQYmLEmMY0Ho/hd7/D2BnNZPfsnrNPzmbPw8xn+IORhRn8zrO7v+dZ9V5OyEeMAb0q (remainder omitted here)

The first line of the example above is the boundary marker. The second line is the content type of the page raw data corresponding to the child resource, the third line is the encoding format, and the fourth line is the absolute path to which the resource belongs; the content that follows is the page raw data after base64 encoding. A child resource file obtained from non-binary page raw data is essentially similar to the example above and differs only in the content type and encoding format.

S209: save, in the designated storage format, the primary resource file and child resource files to which boundary markers have been added, together with the header information that was built, to obtain the cluster web page document comprising the primary resource file and each child resource file.

The built primary resource file and child resource files are spliced into a complete MHT file, which is saved as UTF-8 text. In the embodiments of the present invention, after the child resource and code identification corresponding to each link page have all been built into the corresponding parts of the cluster web page document MHT, an end mark is set, so that the splicing of the built parts into a complete MHT file is performed when the end mark is detected.

Refer to Fig. 3, which is a schematic flowchart of yet another web page processing method according to an embodiment of the present invention. The method can be applied in user terminals with web browsing capability, such as mobile phones, tablet computers, PCs, and smart wearable devices. It parses the MHT document obtained in the embodiments corresponding to Fig. 1 and Fig. 2 above and restores the corresponding complete web page data. Specifically, the method comprises:

S301: according to the boundary marker in the cluster web pages document header information read, from described cluster web pages document, segmentation obtains primary resource file and each child resource file.

In described cluster web pages document, the process of primary resource file and each child resource file can with reference to the description of the corresponding embodiment of above-mentioned Fig. 1 to Fig. 2 with filing.The content type of boundary marker and page raw data corresponding to primary resource file is included, certainly the content of also protocol information comprising some MHT and so in cluster web pages document header information.When splitting, specifically can using resource file identical with the described content type extracted from header information for content type in first code identification as primary resource file, other resource files are defined as child resource file.

After the primary resource file and the child resource files are obtained by splitting, the appropriate decoding method (quoted-printable decoding or base64 decoding) can be selected according to the corresponding content type to decode the primary resource file and each child resource file. Decoding the primary resource file yields the page raw data of the home page of the webpage to be archived, and decoding a child resource file yields the page raw data of the corresponding sub-page.
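A minimal sketch of this selection, using Python's standard `quopri` and `base64` modules. The function name `decode_resource` is hypothetical, and treating every `text/*` content type as quoted-printable is an assumption made for illustration; the text only says the decoding method is chosen by content type.

```python
import base64
import quopri

def decode_resource(payload, content_type):
    """Decode one resource body according to its content type."""
    if content_type.startswith("text/"):
        # Assumed: text resources were archived as quoted-printable.
        return quopri.decodestring(payload.encode("ascii"))
    # Assumed: everything else (images, video, ...) was archived as base64.
    return base64.b64decode(payload)

html_bytes = decode_resource("caf=C3=A9", "text/html")
image_bytes = decode_resource("aGVsbG8=", "image/png")
```

The function returns bytes in both cases; text resources still have to be decoded with the character set recorded in their code identification.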

S302: Decode the primary resource file to obtain the page raw data of the webpage to be archived, and decode each child resource file in turn to obtain the page raw data of each link page.

S303: Name and store the page raw data decoded from each child resource file according to a preset local file naming rule.

That is, the filename of each decoded child resource file is processed, and the link addresses in the home page raw data corresponding to the primary resource file are changed to local link addresses. During archiving, the child resource files were archived in the order in which their link addresses appear in the page raw data of the primary resource file. Accordingly, during decoding, the child resource files are decoded in sequence, a filename is assigned to the data decoded from each one, and the data are stored in sequence. For example, the decoded child resource files can be numbered sequentially, giving files such as unmht_cid_1, unmht_cid_2, unmht_cid_3 and unmht_cid_4. The web page links in the page raw data decoded from the primary resource file are then converted to local link addresses; for example, the URL http://m.baidu.com/static/index/logo_index2.png is converted to /subfile/unmht_cid_2 (numbered in sequence).
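The naming and link rewriting can be illustrated with the sketch below. The function name `localize`, the first sample URL (`http://example.com/style.css`) and the plain string replacement are assumptions; a URL that is a prefix of another would need more careful matching.

```python
def localize(main_html, child_urls, folder="/subfile"):
    """Number child resources in order and rewrite their URLs in the main page."""
    mapping = {}
    for index, url in enumerate(child_urls, start=1):
        local = f"{folder}/unmht_cid_{index}"  # naming rule from the example above
        mapping[url] = local
        main_html = main_html.replace(url, local)
    return main_html, mapping

css_url = "http://example.com/style.css"  # hypothetical first child resource
logo_url = "http://m.baidu.com/static/index/logo_index2.png"
page = f'<link href="{css_url}"><img src="{logo_url}">'
new_page, mapping = localize(page, [css_url, logo_url])
```

Because the child resources are processed in the same order in which they were archived, the second link receives the name unmht_cid_2, matching the example in the text.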

S304: According to the storage address of the page raw data corresponding to each child resource file, modify in turn the corresponding link address in the decoded page raw data of the webpage to be archived into a local link address, and store the page raw data of the webpage to be archived, with its link addresses modified, as a web page file.

For the specific storage layout, refer to Fig. 4: the decoded page raw data of the webpage to be archived can be saved as the final HTML file, which can be named index.html. The page raw data decoded from each child resource file is saved alongside index.html, specifically under the corresponding subfile folder. With this layout, when the browser's web browsing tool is invoked to open index.html, all page elements of the archived webpage can be parsed and opened.
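As a rough sketch of this layout, assuming the child resources are already decoded to bytes and numbered in archive order (the function name `store_archive` and its parameters are hypothetical):

```python
from pathlib import Path

def store_archive(out_dir, main_html, children):
    """Write index.html plus a subfile folder holding the decoded child resources."""
    root = Path(out_dir)
    sub = root / "subfile"
    sub.mkdir(parents=True, exist_ok=True)
    for index, data in enumerate(children, start=1):
        (sub / f"unmht_cid_{index}").write_bytes(data)
    index_path = root / "index.html"
    index_path.write_text(main_html, encoding="utf-8")
    return index_path
```

Opening the returned index.html in a browser then resolves the rewritten local links against the subfile folder.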

Referring to Fig. 5, which is a schematic flowchart of yet another web page processing method according to an embodiment of the present invention. The method can be applied in user terminals with Internet browsing capability, such as mobile phones, tablet computers, PCs and smart wearable devices. It parses the MHT document obtained in the embodiments corresponding to Fig. 1 and Fig. 2 and restores the corresponding complete web page data. Specifically, the method comprises:

S401: According to the boundary marker in the header information read from the cluster web page document, split the cluster web page document to obtain the individual resource files.

S402: Extract the content type in the code identification of the header information of the cluster web page document, and compare the content type extracted from the header information with the content type in the code identification contained in each resource file.

S403: According to the comparison result, take the first resource file whose code identification carries the same content type as the one extracted from the header information as the primary resource file, and determine the other resource files to be child resource files.

S404: Decode the primary resource file to obtain the page raw data of the webpage to be archived, and decode each child resource file in turn to obtain the page raw data of each link page.

S405: Name and store the page raw data decoded from each child resource file according to a preset local file naming rule.

S406: According to the storage address of the page raw data corresponding to each child resource file, modify in turn the corresponding link address in the decoded page raw data of the webpage to be archived into a local link address, and store the page raw data of the webpage to be archived, with its link addresses modified, as a web page file.

S407: When an opening operation on the stored web page file is detected, invoke a preset web browsing tool to parse the page raw data corresponding to the web page file.

S408: Load the page raw data of the child resource files stored at the local link addresses in that page raw data.

S409: Display, in the display window of the web browsing tool, a page comprising that page raw data and the page raw data obtained by decoding the child resource files.

Likewise, for the storage layout, refer to Fig. 4: the decoded page raw data of the webpage to be archived can be saved as the final HTML file, which can be named index.html. The page raw data decoded from each child resource file is saved alongside index.html, specifically under the corresponding subfile folder. With this layout, when the browser's web browsing tool is invoked to open index.html, all page elements of the archived webpage can be parsed and opened.

For the specific implementation of the relevant steps of this embodiment, refer to the description of the embodiments corresponding to Fig. 1 to Fig. 3.

The web page processing device and the user terminal of the embodiments of the present invention are described in detail below.

Referring to Fig. 6, which is a schematic structural diagram of a web page processing device according to an embodiment of the present invention. The device can be arranged in user terminals with Internet browsing capability, such as mobile phones, tablet computers, PCs and smart wearable devices. Specifically, the device comprises:

Acquisition module 11, configured to obtain the page raw data of the webpage to be archived and to obtain the code identification of that page raw data;

Parsing module 12, configured to parse the page raw data of the webpage to be archived, determine each link page associated with the webpage to be archived, and obtain the page raw data and code identification of each associated link page;

Coding module 13, configured to encode the page raw data and code identification of the webpage to be archived to obtain the primary resource file, and to encode the page raw data and code identification of each link page separately to obtain the child resource files;

Profiling module 14, configured to encapsulate the obtained primary resource file and each child resource file into a cluster web page document.

The page raw data of the webpage to be archived in the embodiments of the present invention mainly comprises the source code data of the page. The webpage to be archived can be the main page opened by the browser after the user enters a web page link address (URL) in the browser, in which case the acquisition module 11 can read the page raw data, including the source code data, directly from the opened main page. Alternatively, when the user wishes to archive a certain webpage, the user can enter its link address in a link address input box, and the acquisition module 11 can automatically locate the corresponding webpage on the respective server according to this link address and pull the page raw data of that webpage.

The acquisition module 11 can obtain the code identification from the page raw data. The code identification comprises the character set identifier, which indicates the encoding of the page raw data, and the content type. Page raw data generally contains content-type (content type) and charset (character set) fields, and the code identification of the page raw data can be obtained from these two fields. For example, if certain page raw data contains: meta http-equiv="content-type" content="text/html; charset=utf-8", the acquisition module 11 can determine that the code identification of this page raw data is "text/html" (text page) and "utf-8"; that is, the content type of this page raw data is "text/html" (text page), and the character set used is "utf-8".
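A minimal sketch of extracting the code identification from such a meta declaration follows. The regular expression, the function name `code_identification` and the fallback default are assumptions; the text does not specify how the two fields are located.

```python
import re

# Hypothetical pattern: finds the content attribute of a content-type meta tag.
META_RE = re.compile(
    r'<meta[^>]*http-equiv=["\']content-type["\'][^>]*content=["\']([^"\']+)["\']',
    re.IGNORECASE,
)

def code_identification(page_source, default=("text/html", "utf-8")):
    """Return (content_type, charset) declared in the page source."""
    match = META_RE.search(page_source)
    if not match:
        return default  # assumed fallback, not stated in the text
    content_type, _, params = match.group(1).partition(";")
    charset = default[1]
    for param in params.split(";"):
        key, _, value = param.partition("=")
        if key.strip().lower() == "charset" and value.strip():
            charset = value.strip().lower()
    return content_type.strip().lower(), charset

ident = code_identification(
    '<meta http-equiv="content-type" content="text/html; charset=utf-8">'
)
```

For non-HTML resources such as images, the content type would normally come from the HTTP response rather than from the data itself; that case is outside this sketch.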

The parsing module 12 obtains the page raw data of the corresponding link pages according to the relative addresses recorded in the page raw data. If the page raw data of an obtained link page itself contains relative addresses of further link pages, the parsing module 12 also needs to obtain the link pages corresponding to those relative addresses; through recursive processing, the page raw data and code identification of all link pages are obtained. The page raw data and code identification of each link page are obtained in the same way as those of the main page of the webpage to be archived.
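The recursive processing can be sketched as below. The callables `fetch` and `extract_links` are hypothetical placeholders for downloading a page and listing its links, and the visited-page check is an addition of this sketch (the text does not say how cyclic links are handled).

```python
def collect_linked_pages(url, fetch, extract_links, seen=None):
    """Recursively gather {url: raw data} for every page reachable from url."""
    if seen is None:
        seen = {}
    if url in seen:  # assumed safeguard against cyclic links
        return seen
    data = fetch(url)
    seen[url] = data
    for link in extract_links(url, data):
        collect_linked_pages(link, fetch, extract_links, seen)
    return seen

# Toy "site": each page's raw data is simply the list of pages it links to.
site = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
pages = collect_linked_pages("a", lambda u: site[u], lambda u, d: d)
```

Injecting `fetch` keeps the recursion independent of any particular network library.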

The content type generally used by the main page of the webpage to be archived is "Content-Type:text/html" (text page). Therefore, when encoding the page raw data and code identification of the webpage to be archived, the coding module 13 can directly use quoted-printable encoding to obtain the primary resource file of the cluster web page document. The page raw data of the link pages can be encoded in different ways. For example, when the content type is "Content-Type:image/png" (picture type), base64 encoding is used to encode the page raw data of that link page and obtain the corresponding child resource file; when the content type of a link page's raw data is "Content-Type:text/html" (text page), quoted-printable encoding is likewise used to obtain the corresponding child resource file.
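The choice between the two encodings can be sketched with Python's standard `quopri` and `base64` modules. The function name `encode_resource` and the rule "any `text/*` type is non-binary" are assumptions of this illustration.

```python
import base64
import quopri

def encode_resource(raw, content_type):
    """Return (transfer encoding name, encoded ASCII body) for one resource."""
    if content_type.startswith("text/"):
        # Assumed: text content types are the non-binary case.
        return "quoted-printable", quopri.encodestring(raw).decode("ascii")
    return "base64", base64.encodebytes(raw).decode("ascii")

page_raw = "<p>caf\u00e9</p>".encode("utf-8")
qp_name, qp_body = encode_resource(page_raw, "text/html")
b64_name, b64_body = encode_resource(b"\x89PNG\r\n", "image/png")
```

Both encodings produce pure ASCII, which is what allows the final document to be saved as plain text.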

After the primary resource file and each child resource file are obtained, the profiling module 14 can encapsulate them directly according to the protocol corresponding to the cluster web page document (MHT) to obtain the cluster web page document.

Specifically, in this embodiment, the encapsulation and archiving performed by the profiling module 14 comprises the following. First, the MHT header is built and filled with the MHT protocol information; information such as the content type of the primary resource and the boundary marker can be set in the header. Setting the content type of the primary resource allows subsequent decoding to take the first resource whose content type equals the primary resource content type in the MHT header as the primary resource, so that the main page content of the webpage to be archived is decoded correctly. Next, the components of the MHT are built in turn; generally the primary resource is set as the first part and the child resources as the second part. When building each part, the boundary marker set in the MHT header is added for each resource, and the content type of the page raw data corresponding to each resource is recorded, so that splitting and decoding are accurate. Finally, the built parts are spliced into the complete MHT file, which is saved as UTF-8 text. In this embodiment, after the child resource corresponding to each URL page and its code identification have been built into the corresponding part of the cluster web page document (MHT), an end mark is set, so that the splicing of the built parts into the complete MHT file is performed when the end mark is detected.
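The build order described above (header, primary part, child parts, then splicing) can be sketched as follows. The function name `build_mht`, the default boundary string, and the use of the MIME closing delimiter as a stand-in for the end mark are assumptions; real MHT files usually also carry headers such as Content-Location, which this sketch omits.

```python
def build_mht(primary, children, boundary="----=_mht_boundary"):
    """primary and each child are (content_type, transfer_encoding, body) tuples."""
    lines = [
        "MIME-Version: 1.0",
        # The header records the primary content type and the boundary marker.
        f'Content-Type: multipart/related; type="{primary[0]}"; boundary="{boundary}"',
        "",
    ]
    for content_type, encoding, body in [primary, *children]:
        lines += [
            "--" + boundary,  # boundary marker before every resource
            f"Content-Type: {content_type}",
            f"Content-Transfer-Encoding: {encoding}",
            "",
            body,
        ]
    lines.append("--" + boundary + "--")  # closing delimiter, assumed end mark
    return "\n".join(lines) + "\n"

doc = build_mht(
    ("text/html", "quoted-printable", "<html></html>"),
    [("image/png", "base64", "aGVsbG8=")],
    boundary="B42",
)
```

The returned string can then be written to disk with UTF-8 encoding, as the text describes.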

When archiving the primary resource file and the child resource files, the profiling module 14 archives the child resource files in sequence, according to the position in the primary resource file's page raw data of the link address corresponding to each child resource file's page raw data. This makes subsequent decoding easier: the child resource files can be decoded in sequence and matched to the page raw data of the primary resource file, so that the browser can open the corresponding content, such as pictures and videos, at the correct positions.

Further, optionally, in this embodiment the parsing module 12 can comprise the following units, as shown in Fig. 7.

Determining unit 121, configured to parse the page raw data of the webpage to be archived as target data and obtain each link page recorded in the target data;

Processing unit 122, configured to obtain the page raw data and code identification of each link page recorded in the target data, and to process the page raw data of each of those link pages as new target data, thereby obtaining the page raw data and code identification of each link page associated with the webpage to be archived.

Specifically, the determining unit 121 is configured to extract each link address recorded in the target data according to a preset regular expression; to complete each extracted link address according to the address of the webpage to be archived, obtaining the absolute address corresponding to each link address; and to query each obtained absolute address to obtain each link page recorded in the target data.
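The extraction and completion steps can be sketched as follows. The pattern `LINK_RE` is only one possible "preset regular expression" (the text does not give the actual one), the function name `absolute_links` is hypothetical, and `news/a.html` is an invented sample path; completion uses the standard `urllib.parse.urljoin`.

```python
import re
from urllib.parse import urljoin

# Assumed pattern: picks up src and href attribute values.
LINK_RE = re.compile(r'(?:src|href)\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)

def absolute_links(page_url, page_source):
    """Extract link addresses and complete them against the page's own URL."""
    links = []
    for raw in LINK_RE.findall(page_source):
        absolute = urljoin(page_url, raw)
        if absolute not in links:  # keep first occurrence, preserve page order
            links.append(absolute)
    return links

found = absolute_links(
    "http://m.baidu.com/index.html",
    '<img src="/static/index/logo_index2.png"><a href="news/a.html">x</a>',
)
```

Keeping the links in page order matters because the child resources are later archived and numbered in that same order.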

For the specific implementation of each unit in the parsing module 12, refer to the description of the embodiments of Fig. 1 and Fig. 2.

Further, optionally, in this embodiment the code identification of the page raw data comprises: a character set identifier for determining the encoding of the page raw data, and a content type for distinguishing whether the page raw data is binary format data. As shown in Fig. 8, the coding module 13 of this embodiment can comprise:

Primary resource coding unit 131, configured to encode the page raw data of the webpage to be archived using quoted-printable encoding, and to combine the encoded file with the code identification of the page raw data of the webpage to be archived to obtain the primary resource file;

Child resource coding unit 132, configured to encode the page raw data of a link page using pre-configured base64 encoding if the content type in the corresponding code identification indicates that the page raw data is binary format data, and otherwise to encode the non-binary page raw data using pre-configured quoted-printable encoding, and to combine each encoded file with the code identification of the page raw data of the corresponding link page to obtain the child resource file.

For the specific implementation of each unit in the coding module 13, refer to the description of the embodiments of Fig. 1 and Fig. 2.

Further, optionally, as shown in Fig. 9, the profiling module 14 of this embodiment can comprise:

Header setting unit 141, configured to build the header information for the cluster web page document;

Marker setting unit 142, configured to add a boundary marker for the primary resource file and for each child resource file;

Storage unit 143, configured to save, in a designated storage format, the primary resource file and child resource files to which boundary markers have been added together with the built header information, obtaining the cluster web page document comprising the primary resource file and each child resource file.

For the specific implementation of each unit in the profiling module 14, refer to the description of the embodiments of Fig. 1 and Fig. 2.

Referring to Fig. 10, which is a schematic structural diagram of a user terminal according to an embodiment of the present invention. The user terminal comprises at least one processor 1001 (for example a CPU), at least one communication bus 1002, at least one network interface 1003, and a memory 1004. The communication bus 1002 implements the connection and communication between these components. The network interface 1003 can optionally comprise a standard wired interface and a wireless interface (such as a WI-FI or mobile communication interface). The memory 1004 can be high-speed RAM or non-volatile memory, for example at least one disk memory. Optionally, the memory 1004 can also be at least one storage device located remotely from the processor 1001. As shown in Fig. 10, the memory 1004, as a computer storage medium, stores an operating system and a network communication module, as well as a web page processing application program and other programs.

Specifically, the processor 1001 can be used to invoke the web page processing application program stored in the memory 1004 in order to: obtain the page raw data of the webpage to be archived and the code identification of that page raw data; parse the page raw data of the webpage to be archived, determine each link page associated with the webpage to be archived, and obtain the page raw data and code identification of each associated link page; encode the page raw data and code identification of the webpage to be archived to obtain the primary resource file, and encode the page raw data and code identification of each link page separately to obtain the child resource files; and encapsulate the obtained primary resource file and each child resource file into a cluster web page document.

The memory 1004 is also used to store the cluster web page document obtained by encapsulation.

Referring to Fig. 11, which is a schematic structural diagram of another web page processing device according to an embodiment of the present invention. The device can be arranged in user terminals with Internet browsing capability, such as mobile phones, tablet computers, PCs and smart wearable devices. Specifically, the device comprises:

Segmentation module 21, configured to split the cluster web page document, according to the boundary marker in the header information read from it, to obtain the primary resource file and each child resource file;

Decoder module 22, configured to decode the primary resource file to obtain the page raw data of the webpage to be archived, and to decode each child resource file in turn to obtain the page raw data of each link page;

Child resource processing module 23, configured to name and store the page raw data decoded from each child resource file according to a preset local file naming rule;

Memory module 24, configured to modify in turn, according to the storage address of the page raw data corresponding to each child resource file, the corresponding link address in the decoded page raw data of the webpage to be archived into a local link address, and to store the page raw data of the webpage to be archived, with its link addresses modified, as a web page file.

For the processing and archiving of the primary resource file and each child resource file in the cluster web page document, refer to the description of the embodiments corresponding to Fig. 1 and Fig. 2. The header information of the cluster web page document includes the boundary marker and the content type of the page raw data corresponding to the primary resource file, and also includes other content such as MHT protocol information. When splitting, the segmentation module 21 can take the first resource file whose code identification carries the same content type as the one extracted from the header information as the primary resource file, and determine the other resource files to be child resource files.

After the segmentation module 21 obtains the primary resource file and the child resource files by splitting, the decoder module 22 can select the appropriate decoding method (quoted-printable decoding or base64 decoding) according to the corresponding content type to decode the primary resource file and each child resource file. Decoding the primary resource file yields the page raw data of the home page of the webpage to be archived, and decoding a child resource file yields the page raw data of the corresponding sub-page.

The child resource processing module 23 processes the filename of each decoded child resource file, and the link addresses in the home page raw data corresponding to the primary resource file are changed to local link addresses. During archiving, the child resource files were archived in the order in which their link addresses appear in the page raw data of the primary resource file. Accordingly, during decoding, the child resource files are decoded in sequence, a filename is assigned to the data decoded from each one, and the data are stored in sequence. For example, the child resource processing module 23 can number the decoded child resource files sequentially, giving files such as unmht_cid_1, unmht_cid_2, unmht_cid_3 and unmht_cid_4. The memory module 24 then converts the link addresses in the page raw data decoded from the primary resource file into local link addresses; for example, the link address http://m.baidu.com/static/index/logo_index2.png at the corresponding position in the decoded page raw data of the primary resource file is converted to the local link address /subfile/unmht_cid_2 (numbered in sequence).

For the specific storage layout used by the memory module 24, refer to Fig. 4: the decoded page raw data of the webpage to be archived can be saved as the final HTML file, which can be named index.html. The page raw data decoded from each child resource file is saved alongside index.html, specifically under the corresponding subfile folder. With this layout, when the browser's web browsing tool is invoked to open index.html, all page elements of the archived webpage can be parsed and opened.

Further, optionally, in this embodiment, referring to Fig. 12, the segmentation module 21 can comprise:

Resource file splitting unit 211, configured to split the cluster web page document, according to the boundary marker in the header information read from it, to obtain the individual resource files;

Comparing unit 212, configured to extract the content type in the code identification of the header information of the cluster web page document, and to compare the content type extracted from the header information with the content type in the code identification contained in each resource file;

Determining unit 213, configured to take, according to the comparison result, the first resource file whose code identification carries the same content type as the one extracted from the header information as the primary resource file, and to determine the other resource files to be child resource files.

For the specific implementation of each unit in the segmentation module 21, refer to the description of the embodiments corresponding to Fig. 3 to Fig. 5.

Further, optionally, referring again to Fig. 11, the device of this embodiment can also comprise:

Primary resource file load module 25, configured to invoke a preset web browsing tool to parse the page raw data corresponding to the web page file when an opening operation on the stored web page file is detected;

Child resource file load module 26, configured to load the page raw data of the child resource files stored at the local link addresses in that page raw data;

Display module 27, configured to display, in the display window of the web browsing tool, a page comprising that page raw data and the page raw data obtained by decoding the child resource files.

Referring to Fig. 13, which is a schematic structural diagram of another user terminal according to an embodiment of the present invention. The user terminal comprises at least one processor 2001 (for example a CPU), at least one communication bus 2002, at least one network interface 2003, a memory 2004 and a display 2005. The communication bus 2002 implements the connection and communication between these components. The network interface 2003 can optionally comprise a standard wired interface and a wireless interface (such as a WI-FI or mobile communication interface). The memory 2004 can be high-speed RAM or non-volatile memory, for example at least one disk memory. Optionally, the memory 2004 can also be at least one storage device located remotely from the processor 2001. As shown in Fig. 13, the memory 2004, as a computer storage medium, stores an operating system and a network communication module, as well as a web page processing application program and other programs.

Specifically, the memory 2004 is also used to store the cluster web page document obtained by encapsulation.

The processor 2001 can be used to invoke the web page processing application program stored in the memory 2004 in order to: split the cluster web page document, according to the boundary marker in the header information read from it, to obtain the primary resource file and each child resource file; decode the primary resource file to obtain the page raw data of the webpage to be archived, and decode each child resource file in turn to obtain the page raw data of each link page; name and store the page raw data decoded from each child resource file according to a preset local file naming rule; and modify in turn, according to the storage address of the page raw data corresponding to each child resource file, the corresponding link address in the decoded page raw data of the webpage to be archived into a local link address, storing the page raw data of the webpage to be archived, with its link addresses modified, as a web page file.

The display 2005 is used to display the page, parsed and opened by the processor 2001, comprising that page raw data and the page raw data obtained by decoding the child resource files.

A person of ordinary skill in the art will appreciate that all or part of the flows in the above method embodiments can be implemented by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, can include the flows of the method embodiments described above. The storage medium can be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.

The above discloses only preferred embodiments of the present invention, which of course cannot be used to limit the scope of the rights of the present invention. Equivalent changes made according to the claims of the present invention therefore still fall within the scope covered by the present invention.

Claims

1. A web page processing method, characterized by comprising:

2. The method of claim 1, characterized in that parsing the page raw data of the webpage to be archived, determining each link page associated with the webpage to be archived, and obtaining the page raw data and code identification of each associated link page comprises:

parsing the page raw data of the webpage to be archived as target data, and obtaining each link page recorded in the target data;

obtaining the page raw data and code identification of each link page recorded in the target data, taking the page raw data of each link page recorded in the target data as new target data, and repeating this step until the page raw data and code identification of each link page associated with the webpage to be archived have been obtained.

3. The method of claim 2, characterized in that obtaining each link page recorded in the target data comprises:

extracting each link address recorded in the target data according to a preset regular expression;

completing each extracted link address according to the address of the webpage to be archived, to obtain the absolute address corresponding to each link address;

querying each obtained absolute address to obtain each link page recorded in the target data.

4. The method of any one of claims 1 to 3, characterized in that the code identification of the page raw data comprises:

a character set identifier for determining the encoding of the page raw data, and a content type for distinguishing whether the page raw data is binary format data.

5. The method of claim 4, characterized in that encoding the page raw data and code identification of the webpage to be archived to obtain the primary resource file, and encoding the page raw data and code identification of each link page separately to obtain the child resource files, comprises:

encoding the page raw data of the webpage to be archived using quoted-printable encoding, and combining the encoded file with the code identification of the page raw data of the webpage to be archived to obtain the primary resource file;

if the content type in the code identification corresponding to the page raw data of a link page indicates that the page raw data is binary format data, encoding the binary page raw data using pre-configured base64 encoding; otherwise, encoding the non-binary page raw data using pre-configured quoted-printable encoding; and combining each encoded file with the code identification of the page raw data of the corresponding link page to obtain the child resource file.

6. The method of claim 4, characterized in that packaging the obtained primary resource file and each child resource file into a cluster web page document comprises:

Building header information for the cluster web page document;

Adding a boundary marker to the primary resource file, and adding a boundary marker to each child resource file;

Saving, in a designated storage format, the primary resource file and child resource files to which boundary markers have been added, together with the built header information, to obtain a cluster web page document comprising the primary resource file and each child resource file.

7. The method of claim 6, characterized in that the header information comprises: the boundary marker and the encoding identifier of the page raw data of the webpage to be archived.
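The packaging steps of claims 6 and 7 can be sketched as below, assuming the designated storage format is a MIME `multipart/related` layout of the kind commonly used for MHT files. That layout, the function name `package_cluster_document`, and its parameters are assumptions for illustration; each resource is expected to arrive as an already encoded string that starts with its own header lines.

```python
from typing import List


def package_cluster_document(
    primary: str, children: List[str], primary_content_type: str, boundary: str
) -> str:
    """Join the primary and child resource files into one boundary-delimited document."""
    # Header information: the boundary marker plus the content type of the archived page.
    header = (
        "MIME-Version: 1.0\r\n"
        f'Content-Type: multipart/related; type="{primary_content_type}"; boundary="{boundary}"\r\n'
        "\r\n"
    )
    # Each resource file is preceded by a boundary marker; the primary file comes first.
    sections = "".join(
        f"--{boundary}\r\n{resource}\r\n" for resource in [primary, *children]
    )
    # A closing marker ends the document.
    return header + sections + f"--{boundary}--\r\n"
```

Placing the primary resource file first keeps it easy to identify when the document is later split.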

8. A webpage processing method, characterized by comprising:

9. The method of claim 8, characterized in that splitting the cluster web page document into the primary resource file and each child resource file according to the boundary marker in the read header information of the cluster web page document comprises:

Splitting the cluster web page document according to the boundary marker in the read header information of the cluster web page document, to obtain each resource file;

Extracting the content type in the encoding identifier of the header information of the cluster web page document, and comparing the content type extracted from the header information with the content type in the encoding identifier contained in each resource file;

According to the comparison result, taking the first resource file whose encoding identifier has the same content type as that extracted from the header information as the primary resource file, and determining the other resource files to be child resource files.
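The splitting and comparison in claim 9 can be sketched with Python's standard `email` package, again assuming the document is stored as MIME `multipart/related` with the primary content type recorded in a `type` parameter of the header. The function name `split_cluster_document` is hypothetical.

```python
import email
from email.message import Message
from typing import List, Optional, Tuple


def split_cluster_document(document: str) -> Tuple[Optional[Message], List[Message]]:
    """Split a document on its boundary marker and pick out the primary resource file."""
    msg = email.message_from_string(document)
    # Content type recorded in the header information of the document.
    declared_type = msg.get_param("type")
    # For a multipart message the parser has already split on the boundary marker.
    parts = msg.get_payload() if msg.is_multipart() else []
    primary = None
    children = []
    for part in parts:
        # The first part whose content type matches the header is the primary file.
        if primary is None and part.get_content_type() == declared_type:
            primary = part
        else:
            children.append(part)
    return primary, children
```

Calling `get_payload(decode=True)` on a returned part reverses its base64 or Quoted-printable transfer encoding and yields the original bytes.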

10. The method of claim 8 or 9, characterized by further comprising:

When an operation to open the stored web page file is detected, calling a preset web browsing tool to parse the page raw data corresponding to the web page file;

Loading the page raw data of the child resource files stored in correspondence with the local link addresses in the page raw data;

Displaying, in the display window of the web browsing tool, a page comprising the page raw data and the page raw data obtained by decoding the child resource files.

11. A webpage processing device, characterized by comprising:

12. The device of claim 11, characterized in that the parsing module comprises:

A determining unit, configured to parse the page raw data of the webpage to be archived as target data, to obtain each link page recorded in the target data;

A processing unit, configured to obtain the page raw data and encoding identifier of each link page recorded in the target data, and to process the page raw data of each link page recorded in the target data as new target data, so as to obtain the page raw data and encoding identifier of each link page associated with the webpage to be archived.

13. The device of claim 12, characterized in that

The determining unit is configured to: extract each link URL recorded in the target data according to a preset regular expression; complete each extracted link URL according to the URL of the webpage to be archived, to obtain the absolute URL corresponding to each link URL; and query and obtain each link page recorded in the target data according to each absolute URL obtained.

14. The device of any one of claims 11 to 13, characterized in that the encoding identifier of page raw data comprises: a character set identifier for determining the encoding of the page raw data, and a content type for distinguishing whether the page raw data is binary-format data.

15. The device of claim 14, characterized in that the encoding module comprises:

A primary resource encoding unit, configured to encode the page raw data of the webpage to be archived using Quoted-printable encoding, and to combine the encoded file with the encoding identifier of the page raw data of the webpage to be archived to obtain the primary resource file;

A child resource encoding unit, configured to: if the content type in the encoding identifier corresponding to the page raw data of a link page indicates that the page raw data is binary-format data, encode the binary-format page raw data using preconfigured base64 encoding; otherwise, encode the non-binary page raw data using preconfigured Quoted-printable encoding; and combine each encoded file with the encoding identifier of the page raw data of the corresponding link page to obtain a child resource file.

16. The device of claim 14, characterized in that the profiling module comprises:

A header setting unit, configured to build header information for the cluster web page document;

A marker setting unit, configured to add a boundary marker to the primary resource file, and to add a boundary marker to each child resource file;

A storage unit, configured to save, in a designated storage format, the primary resource file and child resource files to which boundary markers have been added, together with the built header information, to obtain a cluster web page document comprising the primary resource file and each child resource file.

17. A webpage processing device, characterized by comprising:

18. The device of claim 17, characterized in that the segmentation module comprises:

A resource file splitting unit, configured to split the cluster web page document according to the boundary marker in the read header information of the cluster web page document, to obtain each resource file;

A comparing unit, configured to extract the content type in the encoding identifier of the header information of the cluster web page document, and to compare the content type extracted from the header information with the content type in the encoding identifier contained in each resource file;

A determining unit, configured to, according to the comparison result, take the first resource file whose encoding identifier has the same content type as that extracted from the header information as the primary resource file, and determine the other resource files to be child resource files.

19. The device of claim 17 or 18, characterized by further comprising:

A primary resource file loading module, configured to, when an operation to open the stored web page file is detected, call a preset web browsing tool to parse the page raw data corresponding to the web page file;

A child resource file loading module, configured to load the page raw data of the child resource files stored in correspondence with the local link addresses in the page raw data;

A display module, configured to display, in the display window of the web browsing tool, a page comprising the page raw data and the page raw data obtained by decoding the child resource files.

20. A user terminal, characterized by comprising: a processor and a memory;

21. A user terminal, characterized by comprising: a processor, a memory and a display;