CN103617224A - Webpage collecting method, webpage collecting device and webpage collecting system - Google Patents

Publication number
CN103617224A
Authority
CN
China
Prior art keywords
webpage
web page
page contents
content
capturing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310603186.5A
Other languages
Chinese (zh)
Other versions
CN103617224B (en)
Inventor
曾强
张平
魏钦刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd and Qizhi Software Beijing Co Ltd
Priority to CN201310603186.5A (patent CN103617224B)
Priority claimed from CN201210092944.7A (patent CN102646135B)
Publication of CN103617224A
Application granted
Publication of CN103617224B
Legal status: Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a webpage collecting method, a webpage collecting device and a webpage collecting system. The method includes: after receiving a collect operation instruction issued by a user for a browsed webpage, capturing content description information of the webpage by means of script code that has been written into the webpage for capturing webpage content; parsing the content description information and capturing the webpage content according to the parsing result; and saving the captured webpage content. The method, device and system help ensure that the captured webpage content is comprehensive, make the collected result better ordered, and make it easier for users to read.

Description

Webpage collecting method, device and system
This application is a divisional application of the Chinese invention patent application filed on March 31, 2012, with application number 201210092944.7 and entitled "Webpage collecting method, device and system".
Technical field
The present invention relates to the field of network data processing, and in particular to a webpage collecting method, device and system.
Background
Collecting a webpage means that an Internet user saves a webpage of interest so that the relevant information can be read again at any time.
In the prior art, one webpage collecting method saves the content of the webpage the user has browsed in the form of a snapshot. Specifically, the method locates the webpage from the hyperlink of the webpage to be collected, as provided by the user, takes a snapshot of the webpage, and saves the snapshot as the webpage collection information. The user can further edit information such as the title, introduction and tags of the collected webpage. After the webpage has been collected successfully, the user can view it at any time.
However, when the prior-art method displays collected content in the form of a snapshot, a large amount of the webpage's original information is lost. The displayed content therefore tends to lose its original typesetting format, leaving the page disordered and hard to read.
Summary of the invention
The object of the present invention is to provide a webpage collecting method, device and system that can save the content of a collected webpage relatively completely.
To achieve the above object, the invention provides the following solutions:
A webpage collecting method, comprising:
after receiving a collect operation instruction issued by a user for a browsed webpage, capturing content description information of the webpage by means of script code that has been written into the webpage for capturing webpage content;
parsing the content description information, and capturing the content of the webpage according to the parsing result;
saving the captured webpage content.
The method further comprises:
when it is detected that the webpage browsed by the user has finished loading, writing the script code for capturing webpage content into the webpage browsed by the user;
or,
when the collect operation instruction issued by the user for the browsed webpage is received, writing the script code for capturing webpage content into the webpage browsed by the user.
Writing the script code for capturing webpage content into the webpage browsed by the user comprises:
adding an embedded frame to the webpage browsed by the user;
writing the script code into the embedded frame.
Capturing the content description information of the webpage comprises:
capturing Document Object Model (DOM) information of the webpage.
Saving the captured webpage content comprises:
saving the captured webpage content in a structured form according to the DOM information of the webpage.
Capturing the content of the webpage according to the parsing result comprises:
filtering out, according to preset rules, content in the webpage that is not worth collecting, and capturing the content of the webpage according to the filtering result.
Capturing the content of the webpage according to the parsing result may also comprise:
when the webpage content includes pictures, determining whether the number of pictures in the webpage is greater than a preset threshold and, if so, downloading the picture content of the webpage asynchronously.
Further,
after the content description information of the webpage is captured, the method further comprises: sending the content description information to a server-side device;
the server-side device parses the content description information, captures the content of the webpage according to the parsing result, and saves the captured webpage content.
A webpage collecting device, comprising:
a description information capturing unit, configured to, after receiving a collect operation instruction issued by a user for a browsed webpage, capture content description information of the webpage by means of script code that has been written into the webpage for capturing webpage content;
a webpage content capturing unit, configured to parse the content description information and capture the content of the webpage according to the parsing result;
a webpage content saving unit, configured to save the captured webpage content.
The device further comprises:
a code injection unit, configured to write the script code for capturing webpage content into the webpage browsed by the user when it is detected that the webpage has finished loading; or to write the script code for capturing webpage content into the webpage browsed by the user when the collect operation instruction issued by the user for the browsed webpage is received.
The code injection unit comprises:
a frame adding subunit, configured to add an embedded frame to the webpage browsed by the user;
a code writing subunit, configured to write the script code into the embedded frame.
The description information capturing unit is specifically configured to:
after receiving the user's collect operation instruction, capture the Document Object Model (DOM) information of the webpage by means of the script code written in advance.
The webpage content saving unit is specifically configured to:
save the captured webpage content in a structured form according to the DOM information of the webpage.
The webpage content capturing unit is specifically configured to:
filter out, according to preset rules, content in the webpage that is not worth collecting, and capture the content of the webpage according to the filtering result.
The webpage content capturing unit is also specifically configured to:
when the webpage content includes pictures, determine whether the number of pictures in the webpage is greater than a preset threshold and, if so, download the picture content of the webpage asynchronously.
A webpage collecting system, comprising a client device and a server-side device;
the client device comprises:
a description information capturing unit, configured to, after receiving a collect operation instruction issued by a user for a browsed webpage, capture content description information of the webpage by means of script code that has been written into the webpage for capturing webpage content;
a description information sending unit, configured to send the content description information to the server-side device;
the server-side device comprises:
a description information receiving unit, configured to receive the content description information sent by the client device;
a webpage content capturing unit, configured to parse the content description information and capture the content of the webpage according to the parsing result;
a webpage content saving unit, configured to save the captured webpage content.
The client device further comprises:
a code injection unit, configured to write the script code for capturing webpage content into the webpage browsed by the user when it is detected that the webpage has finished loading; or to write the script code for capturing webpage content into the webpage browsed by the user when the collect operation instruction issued by the user for the browsed webpage is received.
The code injection unit comprises:
a frame adding subunit, configured to add an embedded frame to the webpage browsed by the user;
a code writing subunit, configured to write the script code into the embedded frame.
The description information capturing unit is specifically configured to:
after receiving the user's collect operation instruction, capture the Document Object Model (DOM) information of the webpage by means of the script code written in advance.
The webpage content saving unit is specifically configured to:
save the captured webpage content in a structured form according to the DOM information of the webpage.
The webpage content capturing unit is specifically configured to:
filter out, according to preset rules, content in the webpage that is not worth collecting, and capture the content of the webpage according to the filtering result.
The webpage content capturing unit is also specifically configured to:
when the webpage content includes pictures, determine whether the number of pictures in the webpage is greater than a preset threshold and, if so, download the picture content of the webpage asynchronously.
In the technical solution provided by the embodiments of the invention, the description information of the webpage is captured by script code written into the webpage in advance. On the one hand, this ensures that the captured webpage content is comprehensive. On the other hand, because the description information carries the style information of the webpage, the webpage content can be typeset according to that style information when it is saved, which makes the collected result better ordered and easier for the user to read.
Brief description of the drawings
To explain the technical solutions of the embodiments of the invention or of the prior art more clearly, the drawings used in the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention, and those of ordinary skill in the art can obtain other drawings from them without creative work.
Fig. 1 is a flowchart of one embodiment of the webpage collecting method of the invention;
Fig. 2 is a flowchart of another embodiment of the webpage collecting method of the invention;
Fig. 3 is a structural diagram of an embodiment of the webpage collecting device of the invention;
Fig. 4 is a structural diagram of an embodiment of the webpage collecting system of the invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the invention fall within the scope of protection of the invention.
A webpage collecting method embodiment provided by the invention is described first. The method may comprise the following steps:
after receiving a collect operation instruction issued by a user for a browsed webpage, capturing content description information of the webpage by means of script code that has been written into the webpage for capturing webpage content;
parsing the content description information, and capturing the content of the webpage according to the parsing result;
saving the captured webpage content.
In one embodiment of the invention, all of the above steps can be implemented in a client device, for example in the browser itself, in a browser plug-in, or in dedicated webpage collecting software.
In another embodiment of the invention, the steps of writing the script code and capturing the content description information are implemented in the client device. After capturing the content description information, the client sends it to a server-side device, and the server completes the subsequent steps.
First, as shown in Fig. 1, the webpage collecting method comprises the following steps:
S101: after receiving a collect operation instruction issued by a user for a browsed webpage, capturing content description information of the webpage by means of script code that has been written into the webpage for capturing webpage content;
In the embodiments of the invention, the server does not capture the webpage content directly. This is because the server cannot capture some webpages directly: for example, some pages are displayed only after login, and if the client has not logged in, the server side cannot capture them either. Therefore, in the embodiments of the invention, the operation of capturing the webpage content is completed by the client, for example by software such as a browser.
According to the scheme of the embodiments of the invention, while the user is browsing, the script code can be written into the browsed webpage once it is detected that the webpage has finished loading. This code can display a button (labeled, for example, "I like") at a specified position on the webpage (for example the right side), and clicking the "I like" button triggers the collect operation. Alternatively, in another implementation, a button (labeled, for example, "I like") is displayed by default at a specified position on the webpage (for example the right side). If the user wants to collect the webpage currently being browsed, the user clicks this "I like" button; the script code is then written into the browsed webpage, and the click also counts as the user triggering the collect operation.
The script code written into the webpage has the function of capturing webpage content. Because many webpages are currently developed with JS (JavaScript), the embodiments of the invention write JS script code into the webpage. This both solves the problem of capturing page content that is available only after the user logs in and helps keep the captured information secure.
In an improved embodiment of the invention, an embedded frame can first be added to the webpage browsed by the user, and the script code is then written into the embedded frame.
The embedded frame may be an iframe, which can isolate the script code from the browser interfaces. The reason for this is as follows. In practice, if a user can obtain the script code directly, the user can use it to operate the browser interfaces, which brings security problems. For example, the user could use the script code to initiate cross-domain requests in the browser, modify the browser configuration by operating the browser interfaces, or call other interface functions of the browser. To prevent the script code from being exploited maliciously, the embodiments of the invention write the script code into an embedded frame, which isolates the script code from the browser interfaces and thereby increases security.
After the script code has been written into the webpage, a button or a user interaction panel can be drawn on one side of the page once the page has finished loading, so that the user can click the button to trigger the collect operation. Of course, in the invention, the way the user issues the collect operation instruction is not limited to clicking a button. In addition, the user can use the interaction panel for operations such as setting the button skin and configuring sharing, which are not described further here.
Of course, in practice the scheme of the embodiments of the invention can be implemented as a browser plug-in. Where the browser plug-in supports it, the script can also be injected directly into the webpage browsed by the user, without adding the embedded frame described above.
After the user issues the collect operation instruction, by clicking the collect button or in some other way, the content description information of the webpage is captured by means of the script code written in advance.
In the invention, the content description information that mainly needs to be captured is the DOM (Document Object Model) information of the webpage. The DOM tree of the webpage contains the layout structure information of the page. With this information, the webpage content can later be typeset according to the original style of the webpage when it is saved, and stored in a structured form.
Those skilled in the art will understand that, besides the DOM information, information such as the page hyperlink and the title of the webpage can also be captured when capturing the content description information. The embodiments of the invention place no limitation on this.
S102: parsing the content description information, and capturing the content of the webpage according to the parsing result;
By parsing the DOM tree of the webpage, the text, pictures and other content contained in the page can be extracted. The picture content obtained by parsing is only the source location of each picture file, so the actual picture files must still be downloaded from those source locations to local storage.
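As an illustration only, the parsing step described above can be sketched in Python with the standard-library `html.parser` module. The sketch assumes the captured DOM information arrives as serialized HTML; the names `ContentExtractor` and `extract_content` are hypothetical and do not appear in the original.

```python
from html.parser import HTMLParser


class ContentExtractor(HTMLParser):
    """Collects text and picture source locations from serialized DOM markup."""

    SKIPPED_TAGS = {"script", "style"}  # text inside these tags is not page content

    def __init__(self):
        super().__init__()
        self.text_parts = []      # text fragments in document order
        self.image_sources = []   # values of <img src="...">
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.image_sources.append(src)

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract_content(dom_html):
    """Return (text fragments, picture source locations) found in dom_html."""
    parser = ContentExtractor()
    parser.feed(dom_html)
    parser.close()
    return parser.text_parts, parser.image_sources
```

The picture entries returned here are only source locations, so a separate download step is still needed to obtain the actual files.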
When downloading picture files, it can first be determined whether the number of pictures in the webpage is greater than a preset threshold (for example 10 or 20). If not, each picture file is downloaded directly. When the webpage contains many pictures, however, capturing the picture files can be very time-consuming. To improve system performance, the picture files can be downloaded asynchronously in batches using multiple threads, and all picture files are archived together once they have been processed, which effectively reduces the time needed to capture the pictures.
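A minimal Python sketch of this threshold check is given below, under stated assumptions: the threshold value, the worker count and the function name `download_images` are hypothetical, and the actual network fetch is passed in as a `fetch(url)` callable so that the sketch does not commit to any particular HTTP client.

```python
from concurrent.futures import ThreadPoolExecutor

IMAGE_COUNT_THRESHOLD = 10  # hypothetical preset threshold


def download_images(urls, fetch, threshold=IMAGE_COUNT_THRESHOLD, max_workers=4):
    """Download every picture in urls and return a {url: data} mapping.

    fetch(url) is supplied by the caller and returns the picture bytes.
    """
    if len(urls) <= threshold:
        # Few pictures: download them directly, one after another.
        return {url: fetch(url) for url in urls}
    # Many pictures: download them concurrently on a thread pool, then
    # gather all results together once every download has finished.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(fetch, urls))
    return dict(zip(urls, results))
```

Collecting the results only after the pool has finished mirrors the description's point that the picture files are archived together once all of them have been processed.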
In practice, some websites use anti-hotlinking techniques, so their picture files cannot be downloaded directly. For this situation, in the embodiments of the invention, when the request to download a picture file is initiated, the source domain name of the website hosting the picture resource can be added to the Referer field of the HTTP header. When the server of that website parses the request, it will consider the request to have been initiated by its own site and will return the picture content.
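The request construction can be sketched in Python with the standard-library `urllib` modules. This is a sketch under assumptions: the function name `build_image_request` is hypothetical, and the Referer value is derived from the scheme and host of the picture URL itself, which is one plausible reading of "the source domain name of the website hosting the picture resource".

```python
from urllib.parse import urlsplit
from urllib.request import Request


def build_image_request(image_url):
    """Build a download request whose Referer names the picture's own site."""
    parts = urlsplit(image_url)
    # Assumption: the hosting site's root URL is used as the Referer value.
    referer = f"{parts.scheme}://{parts.netloc}/"
    return Request(image_url, headers={"Referer": referer})
```

The returned object can be passed to `urllib.request.urlopen` to perform the actual download.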
When capturing picture content, the size of each picture in the webpage can also be obtained first, and small pictures are not downloaded. In this way, only pictures whose size exceeds a preset size threshold are kept. The reason is that a webpage may contain many pictures, including a large number of advertising pictures and other content that is not worth collecting, whereas pictures that form the main content of the webpage are usually fairly large. Filtering by picture size therefore effectively reduces the capture of useless picture content, which both saves system resources and improves the readability of the collection result.
It can be understood that, besides filtering picture content by picture size, other preset rules, such as URL keywords or file-name keywords, can be used here to filter out information in the webpage that is not worth collecting, so as to save system resources and improve the readability of the collection result. The embodiments of the invention place no limitation on this.
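The two kinds of filtering rule described above can be sketched in Python as follows. The size thresholds, the keyword list and the names `ImageInfo`, `worth_collecting` and `filter_images` are all hypothetical; the original does not give concrete values or rules.

```python
from dataclasses import dataclass

MIN_WIDTH = 100    # hypothetical preset size threshold, in pixels
MIN_HEIGHT = 100
BLOCKED_KEYWORDS = ("/ads/", "banner")  # hypothetical URL and file-name keywords


@dataclass
class ImageInfo:
    url: str
    width: int
    height: int


def worth_collecting(image, min_width=MIN_WIDTH, min_height=MIN_HEIGHT,
                     blocked=BLOCKED_KEYWORDS):
    """Return True if the picture passes both the size rule and the keyword rule."""
    if image.width < min_width or image.height < min_height:
        return False  # small pictures are treated as not worth collecting
    lowered = image.url.lower()
    return not any(keyword in lowered for keyword in blocked)


def filter_images(images):
    """Keep only the pictures that are worth collecting."""
    return [image for image in images if worth_collecting(image)]
```

Applying the filter before downloading means that rejected pictures cost no network traffic, which is where the saving in system resources comes from.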
S103: saving the captured webpage content.
In this step, the webpage content captured in S102 is saved. In particular, according to the DOM tree information of the webpage, the captured content can be typeset according to the original style and layout of the webpage and saved in a structured form.
Further, a webpage summary can be generated from the saved content and shown to the user in the collection list, which makes browsing easier. In a specific implementation, the summary title can be generated from the webpage title, the text part of the summary from the page text, the thumbnail in the summary from the page picture information, and so on. Once the summary information has been saved, the user can view the summaries of collected webpages directly in the webpage collection list during later browsing.
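One possible shape for such a summary is sketched below in Python. The field names, the 120-character excerpt length and the choice of the first picture as the thumbnail source are assumptions made for illustration; the original only states which inputs each part of the summary is generated from.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Summary:
    title: str
    excerpt: str
    thumbnail: Optional[str]  # source location of the picture used as thumbnail


def build_summary(title: str, text_parts: List[str], image_sources: List[str],
                  max_chars: int = 120) -> Summary:
    """Build a summary from the saved title, page text and picture information."""
    body = " ".join(text_parts)
    if len(body) <= max_chars:
        excerpt = body
    else:
        excerpt = body[:max_chars].rstrip() + "..."
    # Assumption: the first picture on the page serves as the thumbnail source.
    thumbnail = image_sources[0] if image_sources else None
    return Summary(title=title.strip(), excerpt=excerpt, thumbnail=thumbnail)
```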
In addition, the scheme of the invention also allows the user to share collected webpages with other websites. By calling the interfaces of other websites, the typeset webpage content and the summary information can be sent to a target website, which realizes the sharing of user information and improves the user experience.
The webpage collecting method provided above captures the description information of the webpage by means of script code written into the webpage in advance. On the one hand, this ensures that the captured webpage content is comprehensive. On the other hand, because the description information carries the style information of the webpage, the webpage content can be typeset according to that style information when it is saved, which makes the collected result better ordered and easier for the user to read.
In the above embodiment, all of the webpage collecting steps are implemented in the client device. In another embodiment of the invention, the client and the server-side device cooperate to complete the webpage collecting operation. As shown in Fig. 2, the method comprises the following steps:
S201: after receiving a collect operation instruction issued by a user for a browsed webpage, the client device captures content description information of the webpage by means of script code that has been written into the webpage for capturing webpage content;
S202: the client device sends the content description information to the server-side device;
S203: the server-side device parses the content description information and captures the content of the webpage according to the parsing result;
S204: the server-side device saves the captured webpage content.
Compared with the previous embodiment: S201 is identical to S101; S203 and S204 differ from S102 and S103 only in that they are executed by the server-side device rather than the client device; and S202, in which the client device sends the content description information to the server-side device, is added.
The server side is far stronger than a front-end JS script in parsing capability, download controllability, re-typesetting and similar aspects, so this approach can effectively improve the quality of the captured webpage content. The server side also has more storage space, which makes it easier to share information between users.
In addition, as described earlier, the server side cannot capture some webpages directly, so the step of capturing the webpage description information is still completed by the client, which ensures the success rate of the capture.
It can be understood that the client device can use data compression when sending the content description information to the server-side device, which further improves transmission efficiency.
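As a sketch of the compression step only, the content description information could be serialized and compressed before transmission as shown below. The JSON payload layout, the use of zlib and the names `pack_description` and `unpack_description` are assumptions; the original does not name a serialization format or a compression algorithm.

```python
import json
import zlib


def pack_description(url, title, dom_html):
    """Client side: serialize and compress the content description information."""
    payload = {"url": url, "title": title, "dom": dom_html}  # hypothetical layout
    raw = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return zlib.compress(raw)


def unpack_description(blob):
    """Server side: decompress and deserialize the received description."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))
```

Serialized DOM markup tends to be repetitive, so general-purpose compression of this kind usually shrinks it considerably, though the actual gain depends on the page.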
Corresponding to the method embodiment above, the embodiments of the invention also provide a webpage collecting device. As shown in Fig. 3, the device may comprise:
a description information capturing unit 301, configured to, after receiving a collect operation instruction issued by a user for a browsed webpage, capture content description information of the webpage by means of script code that has been written into the webpage for capturing webpage content;
a webpage content capturing unit 302, configured to parse the content description information and capture the content of the webpage according to the parsing result;
a webpage content saving unit 303, configured to save the captured webpage content.
In a specific implementation, the device may further comprise:
a code injection unit, configured to write the script code for capturing webpage content into the webpage browsed by the user when it is detected that the webpage has finished loading; or to write the script code for capturing webpage content into the webpage browsed by the user when the collect operation instruction issued by the user for the browsed webpage is received.
In one embodiment of the invention, the code injection unit may comprise:
a frame adding subunit, configured to add an embedded frame to the webpage browsed by the user;
a code writing subunit, configured to write the script code into the embedded frame.
The description information capturing unit 301 may be specifically configured to:
after receiving the user's collect operation instruction, capture the Document Object Model (DOM) information of the webpage by means of the script code written in advance.
The webpage content saving unit 303 may be specifically configured to:
save the captured webpage content in a structured form according to the DOM information of the webpage.
In one embodiment of the invention, the webpage content capturing unit 302 may be specifically configured to:
filter out, according to preset rules, content in the webpage that is not worth collecting, and capture the content of the webpage according to the filtering result.
In another embodiment of the invention, the webpage content capturing unit 302 may also be specifically configured to:
when the webpage content includes pictures, determine whether the number of pictures in the webpage is greater than a preset threshold and, if so, download the picture content of the webpage asynchronously.
The webpage collecting device provided above may be a functional module located at the client, such as the browser itself, a browser plug-in, or dedicated webpage collecting software.
Corresponding and the above-mentioned scheme of all collecting operation that realizes in client, the embodiment of the present invention also provides a kind of web page storage system, shown in Figure 4, and this system comprises client device 401 and server end equipment 402;
Described client device 401, comprising:
Descriptor placement unit 4011, for after receiving user's collection operational order, utilizes the scripted code writing in advance, captures the content description information of described webpage;
Descriptor transmitting element 4012, for being sent to server end equipment by described web page contents descriptor;
Described server end equipment 402, comprising:
Descriptor receiving element 4021, the web page contents descriptor sending for receiving client device;
Capturing webpage contents unit 4022, for described web page contents descriptor is resolved, captures the content of webpage according to analysis result;
Web page contents storage unit 4023, for preserving captured web page contents.
Due to the analysis ability of service end, download controllability, the aspect such as typesetting exceeds much than front end JS script again.Therefore the web page storage system that the embodiment of the present invention provides can effectively promote the crawl quality of web page contents.And the storage space of service end is more abundant, the Information Sharing of being also more convenient between user.
In addition, according to description before, because service end cannot directly capture some webpage, the step that therefore captures webpage descriptor is still completed by client, thereby guarantees the success ratio of crawl.
During specific implementation, client device 401 can also comprise:
Code injection unit, for when webpage that described user browses being detected and loaded, writes in the webpage of browsing for capturing the scripted code of web page contents to user; Or, when receiving the collection operational order that user carries out browsed webpage, in the webpage of browsing to user, write for capturing the scripted code of web page contents.
In one embodiment of the invention, described code injection unit can comprise:
Framework adds subelement, for adding embedded framework in the webpage of browsing user;
Code writes subelement, for writing described scripted code at described embedded framework.
In one embodiment of the present invention, the description information capture unit 4011 may be specifically configured to:
After receiving the user's collect operation instruction, capture the Document Object Model (DOM) information of the web page by using the script code written in advance.
In one embodiment of the present invention, the web page content storage unit 4023 may be specifically configured to:
Save the captured web page content in structured form according to the DOM information of the web page.
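One possible reading of "structured form" is a nested tree that mirrors the DOM. The following minimal sketch, which assumes the captured content arrives as an HTML string, builds such a tree with Python's standard `html.parser`. The node layout (`tag`, `attrs`, `children`, `text`) and the names `DomTreeBuilder` and `to_structured` are assumptions for illustration; the patent does not specify a storage format.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (partial list, enough for a sketch).
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class DomTreeBuilder(HTMLParser):
    """Builds a nested dict tree that mirrors the element structure."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "#document", "attrs": {}, "children": []}
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "attrs": dict(attrs), "children": []}
        self._stack[-1]["children"].append(node)
        if tag not in VOID_TAGS:
            self._stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy markup.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i]["tag"] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self._stack[-1]["children"].append({"text": text})


def to_structured(html: str) -> dict:
    builder = DomTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

A tree like this can be serialized (for example as JSON) and later re-rendered or queried by element, which a flat text dump does not allow.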
In one embodiment of the present invention, the web page content capture unit 4022 may be specifically configured to:
Filter out, according to preset rules, the content in the web page that is not worth collecting, and capture the content of the web page according to the filtering result.
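The patent does not enumerate the preset rules. As a hedged illustration, the sketch below assumes two kinds of rule: a tag blocklist and a set of class names that typically mark advertisements. Both rule sets, and the name `RuleFilter`, are hypothetical, and the depth counting assumes reasonably well-formed markup.

```python
from html.parser import HTMLParser

# Hypothetical preset rules: tags and class names treated as not worth collecting.
BLOCKED_TAGS = {"script", "style", "iframe", "noscript"}
BLOCKED_CLASSES = {"ad", "advert", "banner"}
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class RuleFilter(HTMLParser):
    """Keeps text outside filtered subtrees; drops everything inside them."""

    def __init__(self):
        super().__init__()
        self.kept = []
        self._skip_depth = 0  # greater than 0 while inside a filtered subtree

    def _blocked(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        return tag in BLOCKED_TAGS or any(c in BLOCKED_CLASSES for c in classes)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements have no end tag, so they never change depth
        if self._skip_depth or self._blocked(tag, attrs):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.kept.append(data.strip())


def filter_content(html: str) -> list:
    parser = RuleFilter()
    parser.feed(html)
    parser.close()
    return parser.kept
```

Keeping the rules in plain sets means they can be tuned on the server without touching the script injected into the client page.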
In one embodiment of the present invention, the web page content capture unit 4022 may further be specifically configured to:
When the web page content includes pictures, determine whether the number of pictures in the web page exceeds a preset threshold, and if so, download the picture content of the web page asynchronously.
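The threshold decision can be sketched as below. This is an assumption-laden sketch: the threshold value, the function name `download_images`, and the caller-supplied `fetch` callable are all hypothetical, and "asynchronous" is interpreted here as handing the downloads to background workers and returning immediately with futures.

```python
from concurrent.futures import Executor, Future, ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

IMAGE_THRESHOLD = 3  # hypothetical preset threshold


def download_images(
    urls: List[str],
    fetch: Callable[[str], bytes],
    executor: Executor,
    threshold: int = IMAGE_THRESHOLD,
) -> Tuple[str, Dict[str, object]]:
    """Return (mode, results); results hold bytes (sync) or Future objects (async)."""
    if len(urls) > threshold:
        # Many pictures: queue them in the background and return at once,
        # so saving the rest of the page is not blocked by the downloads.
        return "async", {u: executor.submit(fetch, u) for u in urls}
    # Few pictures: download inline before returning.
    return "sync", {u: fetch(u) for u in urls}
```

A caller would save the textual content first and attach each picture once its future completes, for example through `Future.add_done_callback`.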
From the description of the above embodiments, those skilled in the art can clearly understand that the present invention can be implemented by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present invention or in certain parts of the embodiments.
The embodiments in this specification are described in a progressive manner. For identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the device and system embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, refer to the corresponding description of the method embodiments. The device and system embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed across multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.
The web page collection method, device and system provided by the present invention have been described in detail above. Specific examples are used herein to explain the principles and embodiments of the present invention, and the description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. Meanwhile, those of ordinary skill in the art may, following the idea of the present invention, make changes to the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.
The embodiments of the present invention disclose A1, a web page collection method, comprising:
After receiving a collect operation instruction issued by a user for a browsed web page, capturing the content description information of the web page by using script code that has been written into the web page for capturing web page content;
Parsing the content description information, and capturing the content of the web page according to the parsing result;
Saving the captured web page content.
A2. The method according to A1, further comprising:
When it is detected that the web page browsed by the user has finished loading, writing script code for capturing web page content into the web page browsed by the user;
Or,
When a collect operation instruction issued by the user for the browsed web page is received, writing script code for capturing web page content into the web page browsed by the user.
A3. The method according to A2, wherein writing script code for capturing web page content into the web page browsed by the user comprises:
Adding an embedded frame (iframe) to the web page browsed by the user;
Writing the script code into the embedded frame.
A4. The method according to A1, wherein capturing the content description information of the web page comprises:
Capturing the Document Object Model (DOM) information of the web page.
A5. The method according to A1, wherein saving the captured web page content comprises:
Saving the captured web page content in structured form according to the DOM information of the web page.
A6. The method according to A1, wherein capturing the content of the web page according to the parsing result comprises:
Filtering out, according to preset rules, the content in the web page that is not worth collecting, and capturing the content of the web page according to the filtering result.
A7. The method according to A1, wherein capturing the content of the web page according to the parsing result comprises:
When the web page content includes pictures, determining whether the number of pictures in the web page exceeds a preset threshold, and if so, downloading the picture content of the web page asynchronously.
A8. The method according to any one of A1 to A7, wherein,
After the content description information of the web page is captured, the method further comprises: sending the content description information to a server-side device;
The server-side device parses the content description information, captures the content of the web page according to the parsing result, and saves the captured web page content.
B9. A web page collection device, comprising:
A description information capture unit, configured to, after receiving a collect operation instruction issued by a user for a browsed web page, capture the content description information of the web page by using script code that has been written into the web page for capturing web page content;
A web page content capture unit, configured to parse the content description information and capture the content of the web page according to the parsing result;
A web page content storage unit, configured to save the captured web page content.
B10. The device according to B9, further comprising:
A code injection unit, configured to write script code for capturing web page content into the web page browsed by the user when it is detected that the web page has finished loading; or, to write script code for capturing web page content into the web page browsed by the user when a collect operation instruction issued by the user for the browsed web page is received.
B11. The device according to B10, wherein the code injection unit comprises:
A frame adding subunit, configured to add an embedded frame (iframe) to the web page browsed by the user;
A code writing subunit, configured to write the script code into the embedded frame.
B12. The device according to B9, wherein the description information capture unit is specifically configured to:
After receiving the user's collect operation instruction, capture the Document Object Model (DOM) information of the web page by using the script code written in advance.
B13. The device according to B9, wherein the web page content storage unit is specifically configured to:
Save the captured web page content in structured form according to the DOM information of the web page.
B14. The device according to B9, wherein the web page content capture unit is specifically configured to:
Filter out, according to preset rules, the content in the web page that is not worth collecting, and capture the content of the web page according to the filtering result.
B15. The device according to B9, wherein the web page content capture unit is specifically configured to:
When the web page content includes pictures, determine whether the number of pictures in the web page exceeds a preset threshold, and if so, download the picture content of the web page asynchronously.
C16. A web page collection system, comprising a client device and a server-side device;
The client device comprises:
A description information capture unit, configured to, after receiving a collect operation instruction issued by a user for a browsed web page, capture the content description information of the web page by using script code that has been written into the web page for capturing web page content;
A description information sending unit, configured to send the web page content description information to the server-side device;
The server-side device comprises:
A description information receiving unit, configured to receive the web page content description information sent by the client device;
A web page content capture unit, configured to parse the web page content description information and capture the content of the web page according to the parsing result;
A web page content storage unit, configured to save the captured web page content.
C17. The system according to C16, wherein the client device further comprises:
A code injection unit, configured to write script code for capturing web page content into the web page browsed by the user when it is detected that the web page has finished loading; or, to write script code for capturing web page content into the web page browsed by the user when a collect operation instruction issued by the user for the browsed web page is received.
C18. The system according to C17, wherein the code injection unit comprises:
A frame adding subunit, configured to add an embedded frame (iframe) to the web page browsed by the user;
A code writing subunit, configured to write the script code into the embedded frame.
C19. The system according to C16, wherein the description information capture unit is specifically configured to:
After receiving the user's collect operation instruction, capture the Document Object Model (DOM) information of the web page by using the script code written in advance.
C20. The system according to C16, wherein the web page content storage unit is specifically configured to:
Save the captured web page content in structured form according to the DOM information of the web page.
C21. The system according to C16, wherein the web page content capture unit is specifically configured to:
Filter out, according to preset rules, the content in the web page that is not worth collecting, and capture the content of the web page according to the filtering result.
C22. The system according to C16, wherein the web page content capture unit is specifically configured to:
When the web page content includes pictures, determine whether the number of pictures in the web page exceeds a preset threshold, and if so, download the picture content of the web page asynchronously.

Claims (10)

1. A web page collection method, characterized in that it comprises:
After receiving a collect operation instruction issued by a user for a browsed web page, capturing the content description information of the web page by using script code that has been written into the web page for capturing web page content;
Parsing the content description information, and capturing the content of the web page according to the parsing result;
Saving the captured web page content.
2. The method according to claim 1, characterized in that it further comprises:
When it is detected that the web page browsed by the user has finished loading, writing script code for capturing web page content into the web page browsed by the user;
Or,
When a collect operation instruction issued by the user for the browsed web page is received, writing script code for capturing web page content into the web page browsed by the user.
3. The method according to claim 2, characterized in that writing script code for capturing web page content into the web page browsed by the user comprises:
Adding an embedded frame (iframe) to the web page browsed by the user;
Writing the script code into the embedded frame.
4. The method according to claim 1, characterized in that capturing the content description information of the web page comprises:
Capturing the Document Object Model (DOM) information of the web page.
5. The method according to claim 1, characterized in that saving the captured web page content comprises:
Saving the captured web page content in structured form according to the DOM information of the web page.
6. The method according to claim 1, characterized in that capturing the content of the web page according to the parsing result comprises:
Filtering out, according to preset rules, the content in the web page that is not worth collecting, and capturing the content of the web page according to the filtering result.
7. The method according to claim 1, characterized in that capturing the content of the web page according to the parsing result comprises:
When the web page content includes pictures, determining whether the number of pictures in the web page exceeds a preset threshold, and if so, downloading the picture content of the web page asynchronously.
8. The method according to any one of claims 1 to 7, characterized in that,
After the content description information of the web page is captured, the method further comprises: sending the content description information to a server-side device;
The server-side device parses the content description information, captures the content of the web page according to the parsing result, and saves the captured web page content.
9. A web page collection device, characterized in that it comprises:
A description information capture unit, configured to, after receiving a collect operation instruction issued by a user for a browsed web page, capture the content description information of the web page by using script code that has been written into the web page for capturing web page content;
A web page content capture unit, configured to parse the content description information and capture the content of the web page according to the parsing result;
A web page content storage unit, configured to save the captured web page content.
10. A web page collection system, characterized in that it comprises a client device and a server-side device;
The client device comprises:
A description information capture unit, configured to, after receiving a collect operation instruction issued by a user for a browsed web page, capture the content description information of the web page by using script code that has been written into the web page for capturing web page content;
A description information sending unit, configured to send the web page content description information to the server-side device;
The server-side device comprises:
A description information receiving unit, configured to receive the web page content description information sent by the client device;
A web page content capture unit, configured to parse the web page content description information and capture the content of the web page according to the parsing result;
A web page content storage unit, configured to save the captured web page content.
CN201310603186.5A 2012-03-31 2012-03-31 Webpage collection method, apparatus and system Expired - Fee Related CN103617224B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310603186.5A CN103617224B (en) 2012-03-31 2012-03-31 Webpage collection method, apparatus and system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201210092944.7A CN102646135B (en) 2012-03-31 2012-03-31 Webpage collecting method, device and system
CN201310603186.5A CN103617224B (en) 2012-03-31 2012-03-31 Webpage collection method, apparatus and system

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201210092944.7A Division CN102646135B (en) 2012-03-31 2012-03-31 Webpage collecting method, device and system

Publications (2)

Publication Number Publication Date
CN103617224A true CN103617224A (en) 2014-03-05
CN103617224B CN103617224B (en) 2018-01-19

Family

ID=50167927

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310603186.5A Expired - Fee Related CN103617224B (en) 2012-03-31 2012-03-31 Webpage collection method, apparatus and system

Country Status (1)

Country Link
CN (1) CN103617224B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105354204A (en) * 2014-08-22 2016-02-24 北京金山安全软件有限公司 Method and device for collecting webpage data
CN105550179A (en) * 2014-10-29 2016-05-04 腾讯科技(深圳)有限公司 Webpage collection method and browser plug-in
CN105893428A (en) * 2015-12-07 2016-08-24 乐视移动智能信息技术(北京)有限公司 Advertisement filtering method, device and mobile terminal

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1756160A (en) * 2004-09-27 2006-04-05 戴志军 Individualized website convenient for user accessing Internet
CN101051325A (en) * 2007-05-16 2007-10-10 杭州华三通信技术有限公司 Method and device for collecting web page active
CN101727486A (en) * 2009-12-04 2010-06-09 中国人民解放军信息工程大学 Web forum information extraction system
CN101782911A (en) * 2009-06-23 2010-07-21 北京搜狗科技发展有限公司 Method and system for prompting network resource content
WO2010102165A1 (en) * 2009-03-05 2010-09-10 Alibaba Group Holding Limited Method, apparatus and system for visualizing user's web page browsing behavior

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1756160A (en) * 2004-09-27 2006-04-05 戴志军 Individualized website convenient for user accessing Internet
CN101051325A (en) * 2007-05-16 2007-10-10 杭州华三通信技术有限公司 Method and device for collecting web page active
WO2010102165A1 (en) * 2009-03-05 2010-09-10 Alibaba Group Holding Limited Method, apparatus and system for visualizing user's web page browsing behavior
CN101782911A (en) * 2009-06-23 2010-07-21 北京搜狗科技发展有限公司 Method and system for prompting network resource content
CN101727486A (en) * 2009-12-04 2010-06-09 中国人民解放军信息工程大学 Web forum information extraction system

Also Published As

Publication number Publication date
CN103617224B (en) 2018-01-19

Similar Documents

Publication Publication Date Title
CN102646135B (en) Webpage collecting method, device and system
US11907642B2 (en) Enhanced links in curation and collaboration applications
CN106294648B (en) Processing method and device for page access path
US10515142B2 (en) Method and apparatus for extracting webpage information
WO2016173200A1 (en) Malicious website detection method and system
US20160188551A1 (en) System for clipping webpages
CN102799372B (en) A kind of method for uploading of pictorial information and upload device
US20170337168A1 (en) System and method for generating and monitoring feedback of a published webpage as implemented on a remote client
WO2015120327A2 (en) Developer based document collaboration
CN103473302A (en) Lock screen information display method, device and system
CN104765746B (en) Data processing method and device for mobile communication terminal browser
CN104243273A (en) Method and device for displaying information on instant messaging client and information display system
CN103678487A (en) Method and device for generating web page snapshot
CN105550179B (en) Webpage collection method and browser plug-in
CN106874271A (en) A kind of method and system that PC webpages are converted to mobile terminal webpage
CN111177623A (en) Information processing method and device
CN102624910B (en) Method, the Apparatus and system of the web page contents that process user chooses
CN104899212A (en) Webpage display method, server and system
CN104361007B (en) The processing method of browser and its collection
CN102955852A (en) Method, device and equipment for webpage resource processing
CN103617224A (en) Webpage collecting method, webpage collecting device and webpage collecting system
CN105450460B (en) Network operation recording method and system
CN108108381B (en) Page monitoring method and device
CN112307386A (en) Information monitoring method, system, electronic device and computer readable storage medium
CN103617223A (en) Webpage collecting method and webpage collecting device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20220722

Address after: Room 801, 8th floor, No. 104, floors 1-19, building 2, yard 6, Jiuxianqiao Road, Chaoyang District, Beijing 100015

Patentee after: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Address before: 100088 room 112, block D, 28 new street, new street, Xicheng District, Beijing (Desheng Park)

Patentee before: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Patentee before: Qizhi software (Beijing) Co.,Ltd.

CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20180119