CN103678511B - Method and device for extracting web page content according to a visual template

Info

Publication number
CN103678511B
CN103678511B CN201310606505.8A CN201310606505A CN103678511B CN 103678511 B CN103678511 B CN 103678511B CN 201310606505 A CN201310606505 A CN 201310606505A CN 103678511 B CN103678511 B CN 103678511B
Authority
CN
China
Prior art keywords
web page
page template
webpage
content
mark
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310606505.8A
Other languages
Chinese (zh)
Other versions
CN103678511A (en)
Inventor
马晓辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd and Qizhi Software Beijing Co Ltd
Priority to CN201310606505.8A
Publication of CN103678511A
Application granted
Publication of CN103678511B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The present invention discloses a method and device for extracting web page content according to a visual template, and belongs to the field of Internet technology. The method includes: when a target website is crawled by directed crawling, checking whether the web page template library records a web page template for the target website that was generated from visual annotation; and, if the library records such a template, extracting content from the target website according to that template. The invention can improve the accuracy of web page content extraction.

Description

Method and device for extracting web page content according to a visual template
Technical field
The present invention relates to the field of Internet technology, and in particular to a method and device for extracting web page content according to a visual template.
Background
Web page templates can be used to extract web page content. For example, some search engines use directed collection when crawling websites: the directed-collection spider uses a web page template to extract the relevant content of a website and obtain formatted content, including the page title, author, publication time, and body text.
One existing method of generating a web page template is as follows. First, the source code of a page is downloaded according to the page's URL (Uniform Resource Locator). Second, the page structure is analyzed automatically from the source code, and a hash value is calculated for each structure in the page. Then, a person judges from the source code which structure corresponds to the title, which to the body text, which to the publication time, and so on, and marks them accordingly. Finally, the correspondence between the hash values of the structures and their content types is generated, yielding the web page template.
The existing method of generating web page templates has at least the following disadvantages:
The content types of page structures are marked manually in a text editor. A web page template contains a large amount of irrelevant content, and some templates even run to tens of thousands of lines, so manual marking is very inefficient;
The various pieces of content in a web page template are mixed in with the page code. Because the page content is not presented visually, a person who is not familiar with web design languages cannot easily determine the content type of a page structure, and manual marking is error-prone. The generated web page template is therefore not very accurate, and neither is content extraction based on it.
Summary of the invention
In view of the above problems, the present invention is proposed to provide a method and device for extracting web page content according to a visual template that overcome the above problems or at least partially solve them.
According to one aspect of the present invention, a method for extracting web page content according to a visual template is provided, the method including:
when a target website is crawled by directed crawling, checking whether the web page template library records a web page template for the target website that was generated from visual annotation;
if the web page template library records such a template for the target website, extracting content from the target website according to the web page template.
Optionally, the web page template is identified by the URL of the website's home page.
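The lookup described above can be sketched as a small in-memory library keyed by home page URL. This is only an illustration under stated assumptions: the names `TemplateLibrary` and `homepage_key` are hypothetical, and the patent does not say how the library is stored or how a page URL is reduced to its home page URL.

```python
from urllib.parse import urlparse


def homepage_key(url):
    """Reduce any URL of a site to its home page URL (assumed key format)."""
    parts = urlparse(url)
    return f"{parts.scheme}://{parts.netloc}/"


class TemplateLibrary:
    """Hypothetical in-memory web page template library."""

    def __init__(self):
        self._templates = {}

    def save(self, homepage_url, template):
        # Templates are identified by the home page URL of the website.
        self._templates[homepage_key(homepage_url)] = template

    def find(self, target_url):
        # Returns None when no template is recorded for the target website.
        return self._templates.get(homepage_key(target_url))
```

A crawler would call `find` before extraction and only proceed with template-based extraction when the result is not `None`.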
Optionally, extracting content from the target website according to the web page template includes:
extracting the URLs of all outgoing links on the home page according to the home page URL, removing those that jump out to other websites, and putting the remaining URLs into a scheduling queue;
extracting content from the web page corresponding to each URL in the scheduling queue according to the web page template.
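The link filtering step above could look roughly like the following sketch. It assumes that "jumping out to another website" means the link's host differs from the home page's host; the function name `build_scheduling_queue` and the removal of duplicate URLs are additions for illustration, not part of the patent text.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def build_scheduling_queue(homepage_url, homepage_html):
    """Return a queue of same-site URLs found on the home page."""
    collector = LinkCollector()
    collector.feed(homepage_html)
    site_host = urlparse(homepage_url).netloc
    queue, seen = deque(), set()
    for href in collector.hrefs:
        url = urljoin(homepage_url, href)  # resolve relative links
        # Drop links that jump out to another site (assumed: different host).
        if urlparse(url).netloc != site_host or url in seen:
            continue
        seen.add(url)
        queue.append(url)
    return queue
```

Each URL taken from the returned queue would then be fetched and passed to template-based extraction.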
Optionally, the method further includes:
building a visual effect framework for annotating a web page, the visual effect framework including content areas, a mask positioned over the selected content area, and an annotation menu, the annotation menu including multiple content type menu items;
obtaining instructions for annotating the content areas of the web page, each annotation being the content type selected through the annotation menu for the selected content area;
recording the correspondence between content areas and content types to obtain the web page template.
Optionally, the method further includes:
collecting statistics over multiple web page templates generated from multiple pages of the same resource website, and extracting the parts that are identical across those templates to generate the final web page template.
According to another aspect of the present invention, a device for extracting web page content according to a visual template is provided, the device including:
a web page template library, adapted to store web page templates generated from visual annotation;
a finder, adapted to check, when a target website is crawled by directed crawling, whether the web page template library records a web page template for the target website that was generated from visual annotation;
a content extractor, adapted to extract content from the target website according to the web page template when the web page template library records such a template for the target website.
Optionally, the web page template is identified by the URL of the website's home page.
Optionally, the content extractor is further adapted to:
extract the URLs of all outgoing links on the home page according to the home page URL, remove those that jump out to other websites, and put the remaining URLs into a scheduling queue;
extract content from the web page corresponding to each URL in the scheduling queue according to the web page template.
Optionally, the device further includes:
a visual effect framework builder, adapted to build a visual effect framework for annotating a web page, the visual effect framework including content areas, a mask positioned over the selected content area, and an annotation menu, the annotation menu including multiple content type menu items;
an annotation instruction getter, adapted to obtain instructions for annotating the content areas of the web page, each annotation being the content type selected through the annotation menu for the selected content area;
a web page template generator, adapted to record the correspondence between content areas and content types to obtain the web page template.
Optionally, the device further includes a statistics unit, adapted to collect statistics over multiple web page templates generated from multiple pages of the same resource website and to extract the parts that are identical across those templates to generate the final web page template.
According to one or more of the above technical solutions, web page content is extracted using a web page template generated from visual annotation. Because such a template is more accurate, content extraction based on it is also more accurate.
The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the description, and so that the above and other objects, features, and advantages of the invention are more apparent, specific embodiments of the invention are set out below.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art from the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered limiting of the invention. Throughout the drawings, the same reference numerals denote the same parts. In the drawings:
Fig. 1 is a flowchart of a method of generating a web page template according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of annotating the title of a web page in an embodiment of the present invention;
Fig. 3 is a schematic diagram of annotating the body text of a web page in an embodiment of the present invention;
Fig. 4 is a detailed flowchart of a method of generating a web page template according to an embodiment of the present invention;
Fig. 5 is a structural diagram of a device for generating a web page template according to an embodiment of the present invention;
Fig. 6 is a flowchart of a method of providing visual annotation for a web page according to an embodiment of the present invention;
Fig. 7 is a structural diagram of a device for providing visual annotation for a web page according to an embodiment of the present invention;
Fig. 8 is a flowchart of a method for extracting web page content according to a visual template according to an embodiment of the present invention;
Fig. 9 is a structural diagram of a device for extracting web page content according to a visual template according to an embodiment of the present invention;
Fig. 10 is a structural diagram of a system for extracting web page content according to a visual template according to an embodiment of the present invention.
Detailed description
Exemplary embodiments of the present disclosure are described in more detail below with reference to the drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and its scope conveyed fully to those skilled in the art.
Embodiment 1
This embodiment provides a method and device for generating a web page template.
Fig. 1 is a flowchart of a method of generating a web page template according to an embodiment of the present invention. With reference to Fig. 1, the method includes:
Step 102: build a visual effect framework for annotating a web page.
In one implementation, the visual effect framework may include content areas, a mask positioned over the selected content area, and an annotation menu that includes multiple content type menu items.
The visual effect framework of a web page can be built by obtaining the page's source code, for example an HTML (Hypertext Markup Language) document, attaching a style sheet file such as a CSS (Cascading Style Sheets) file to the HTML document, and adding a JS (JavaScript) script to the HTML document. Specifically, the JS script can make the mask and the annotation menu appear over a content area when that area is detected as selected, and the way the mask and the annotation menu are displayed can be governed by the rules defined in the style sheet file.
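As a rough illustration of attaching a style sheet and a script to a downloaded HTML document, the sketch below inserts a `<link>` element before `</head>` and a `<script>` element before `</body>`. The file names `annotate.css` and `annotate.js` and the function name are hypothetical; the patent does not specify how the files are attached.

```python
def add_annotation_framework(html, css_href="annotate.css", js_src="annotate.js"):
    """Attach a style sheet and a script to an HTML document (string-based sketch)."""
    link_tag = f'<link rel="stylesheet" href="{css_href}">'
    script_tag = f'<script src="{js_src}"></script>'

    # Put the style sheet link at the end of <head>, or at the start if there is none.
    head_end = html.lower().find("</head>")
    if head_end != -1:
        html = html[:head_end] + link_tag + html[head_end:]
    else:
        html = link_tag + html

    # Put the script at the end of <body>, or at the very end if there is none.
    body_end = html.lower().rfind("</body>")
    if body_end != -1:
        html = html[:body_end] + script_tag + html[body_end:]
    else:
        html = html + script_tag
    return html
```

The style sheet would define how the mask and menu look, and the script would show them when a content area is selected; neither file is shown here.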
With this visual effect framework, each content area of the page can have a visual effect when the page is displayed in a browser. When a content area is selected (for example, when the mouse is detected moving over the content area, or when a tap on the content area or a swipe gesture within it is detected on a touch screen), a mask appears over the content area. The annotation menu can appear over the content area at the same time, or appear in response to a trigger: for example, right-clicking the selected content area brings up the content type menu items. As shown in Fig. 2 and Fig. 3, the content type menu items may include "Mark as title", "Mark as body text", and "Mark as date"; they may also include "Save mark" and "End marking".
Step 104: obtain instructions for annotating the content areas of the web page.
In this embodiment, annotation is performed by the client, which may be operated by a user, operations staff, or an administrator. A page can be annotated with the mouse: move the mouse over a content area, right-click, and then click a content type menu item to complete the annotation of that content area. On a touch screen, the content type can likewise be selected by touching a menu item. As shown in Fig. 2, clicking "Mark as title" marks the corresponding content area as the title; as shown in Fig. 3, clicking "Mark as body text" marks the corresponding content area as the body text.
Step 106: record the correspondence between content areas and annotation instructions to obtain the web page template.
Each time a content area is marked and the "Save mark" menu item is then selected, the correspondence between that content area and the selected content type is stored in the web page template. Selecting the "End marking" menu item finishes the marking of all content areas in the page that need to be marked and yields the web page template (also called the web page content template) corresponding to the page.
Thus, with the technical solution of this embodiment, a web page template can be defined easily just by selecting content areas in the visual effect framework and operating on them visually, which makes template generation more efficient. Moreover, because the page content is presented visually, the content type of a page structure is easy to determine, which makes the generated template more accurate.
The above scheme generates, from one web page, the web page template corresponding to that page. A resource website may include many pages, and these pages are usually generated from the same web design template, so their structures are essentially the same and differ only slightly. For example, some pages include comments and others do not, but all of them include a title, author, publication time, body text, and so on. Performing the above steps on every page to generate a template would be a very large amount of work.
To further improve the efficiency of template generation, the method may therefore also include: collecting statistics over multiple web page templates generated from multiple pages of the same resource website, and extracting the parts that are identical across those templates to generate the final web page template. Specifically, all pages of the resource website can be sampled to obtain multiple pages; multiple web page templates are then generated by the method above; finally, the parts that are identical across those templates are extracted to generate the final web page template (also called the web page template of the resource website). Each correspondence between a content area and a content type counts as one part of a template.
For example, for the 360 website, the HTML document of the home page can first be obtained from the site's home page URL (http://www.360.cn/). Analysis of this HTML document shows that the site includes many sub-pages (for example, 1000). Next, 50 sub-pages are drawn from these 1000 by a predetermined algorithm (for example, a random algorithm). Visually annotating these 50 sub-pages produces 50 web page templates. Finally, the parts that are identical across these 50 templates are extracted to generate the web page template for the 360 website.
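One way to read "extracting the identical parts" is as an intersection of template entries. The sketch below assumes each template is a dictionary that maps a content area identifier to a content type, and that an entry counts as identical only when both the identifier and the type match in every template; the function name `merge_templates` is hypothetical.

```python
def merge_templates(templates):
    """Keep only the (content area, content type) entries shared by all templates."""
    if not templates:
        return {}
    common = set(templates[0].items())
    for template in templates[1:]:
        common &= set(template.items())
    return dict(common)


# Two sampled pages: only the second one has no comment area.
page_a = {"area-title": "title", "area-body": "text", "area-comment": "comment"}
page_b = {"area-title": "title", "area-body": "text"}
site_template = merge_templates([page_a, page_b])
```

Here `site_template` keeps the title and body text entries and drops the comment entry, which matches the comment example given earlier. A looser rule, such as keeping entries found in most templates, would also fit the word "statistics", but the text does not say which rule is intended.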
In addition, in this embodiment, to make it easier to locate and display the content areas of a page, a hash value attribute can be added to the tag to which each content area belongs. The web page template then stores the correspondence between the hash value of the tag to which a content area belongs and the selected content type. In this case, the method of generating a web page template may also include the following steps before the step of building the visual effect framework:
First, obtain the source code of the web page and generate the page's DOM (Document Object Model) tree from the source code.
Then, obtain the hash value of the tag corresponding to each node in the DOM tree.
Finally, add a hash value attribute to each tag of the web page.
The hash value may include the tag's level hash value in the DOM tree and the hash value of the tag itself. The level hash value can be calculated from the hierarchical relationship of the DOM tree in which the current tag sits, and the tag's own hash value can be calculated from the attribute nodes the current tag has.
In a specific implementation, the hash values of tags can be calculated by the server. As shown in Fig. 10, the server 210 is located in the search engine 200, the search engine 200 is communicatively connected to multiple third-party website servers 300 (three are shown in the figure), and the server 210 can cooperate with the client 100 to generate web page templates. In this case, obtaining the hash value of the tag corresponding to each node in the DOM tree may include:
First, the client 100 adds an index attribute to each tag of the web page.
Then, the client 100 sends the source code of the page, with the index attributes added, to the server 210.
Next, the server 210 calculates the hash values of the tags.
Finally, the server 210 sends the correspondence between tag index values and hash values to the client 100.
When the invention is implemented, the operation of the client may include the following steps:
First, a plug-in that generates the visual effect framework is installed on the client, and a web page on a third-party website server 300 is accessed.
Then, in one implementation, the mouse is moved over a content area of the page and a light blue mask appears over the content area, indicating that it is selected. Right-clicking brings up a selection menu from which the content type of the content area, such as title or body text, can be chosen.
Finally, after marking is complete, the client generates the web page template.
The client can send the generated web page template to the server, and the server can use the template to gather information when it performs directed collection of web page content.
A detailed flow of the method of generating a web page template in an embodiment of the present invention is given below. With reference to Fig. 4, the method includes:
Step 402: the client obtains the source code of the web page and generates the page's DOM tree from the source code.
Step 404: the client adds an index attribute to each tag of the DOM tree; the DOM tree can be traversed with a depth-first algorithm.
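A minimal sketch of steps 402 and 404, assuming the page source is well-formed XHTML so that Python's standard `xml.etree.ElementTree` can parse it (real pages would need a tolerant HTML parser). The function name `add_index_attributes` and the choice of starting the index at 0 are assumptions.

```python
import xml.etree.ElementTree as ET


def add_index_attributes(source):
    """Parse the page and number every tag in depth-first document order."""
    root = ET.fromstring(source)
    # Element.iter() walks the tree depth-first, starting with the root itself.
    for index, element in enumerate(root.iter()):
        element.set("index", str(index))
    return root


root = add_index_attributes(
    "<html><body><div><p>a</p></div><span>b</span></body></html>"
)
indexed_source = ET.tostring(root, encoding="unicode")
```

In this example the `p` tag inside the `div` is numbered before the sibling `span`, which is what depth-first order means here. `indexed_source` is the text a client would send to the server in the next step.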
Step 406: the client sends the source code of the page, with the index attributes added, to the server. The transmitted content is, for example:
Step 408: the server receives the source code with index attributes sent by the client, analyzes the source code, calculates the hash values of the tags across the whole page structure, and returns all the calculated hash values to the client.
The hash values calculated by the server are matched to the indexes of the tags and can be packaged in JSON format and returned to the client. The JSON content has, for example, the form: {tag index value: {hash 1: hash1, hash 2: hash2}, ...}.
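The JSON response could be produced along these lines. The key names `frame_hash` and `self_hash` are borrowed from the attribute names used in step 410 as an assumption; the text itself only calls them "hash 1" and "hash 2". The function name `package_hashes` is hypothetical.

```python
import json


def package_hashes(hashes_by_index):
    """Serialize {tag index: (level hash, own hash)} as a JSON string."""
    payload = {
        str(index): {"frame_hash": frame_hash, "self_hash": self_hash}
        for index, (frame_hash, self_hash) in hashes_by_index.items()
    }
    return json.dumps(payload)


# Server side: one tag with index 45 and its two hash values.
response_body = package_hashes({45: ("46131321231613", "174461815164")})

# Client side: decode the response and look up the hashes for tag 45.
hashes_for_tag_45 = json.loads(response_body)["45"]
```

JSON object keys are always strings, so the tag index is converted with `str` on the way out and would be converted back with `int` if the client needs a number.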
Step 410: the client receives the JSON data returned by the server and, using the correspondence between tag index values and hash values, adds two attribute values to each corresponding tag: frame_hash, the tag's level hash value in the DOM tree, and self_hash, the hash value of the tag itself.
For example, a div tag with the hash value attributes added looks like this:
<div frame_hash="46131321231613" self_hash="174461815164" index="45">
content
</div>
Here, the frame_hash value is calculated from the hierarchical relationship of the DOM tree in which the current tag sits. For example:
To calculate the frame_hash of a div tag, MD5 can be applied to the string "html body div" to produce a hash value. Various algorithms can be used, and the embodiments of the present invention do not limit the specific algorithm.
The self_hash value is calculated from the attribute nodes of the current tag. For example, if a div tag has a class attribute and an id attribute, MD5 can be applied to the string "class:name id:author" to produce a hash value. Again, various algorithms can be used, and the embodiments do not limit the specific algorithm.
In this way, a node of the DOM tree can be located from its frame_hash and self_hash.
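The two hash values can be sketched with MD5 as in the examples above. This is one possible reading, not a definitive implementation: it assumes well-formed XHTML input, joins the ancestor tag names with spaces for frame_hash, joins "name:value" attribute pairs with spaces for self_hash, and skips the helper `index` attribute so that the hash does not depend on a tag's position. The function names are hypothetical.

```python
import hashlib
import xml.etree.ElementTree as ET


def md5_hex(text):
    """Return the hexadecimal MD5 digest of a string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def compute_hashes(root):
    """Return {element: (frame_hash, self_hash)} for every element under root."""
    hashes = {}

    def visit(element, ancestors):
        path = ancestors + [element.tag]
        # Level hash: MD5 of the tag path, e.g. "html body div".
        frame_hash = md5_hex(" ".join(path))
        # Own hash: MD5 of the attributes, e.g. "class:name id:author".
        attributes = " ".join(
            f"{name}:{value}"
            for name, value in element.attrib.items()
            if name != "index"  # assumption: the helper attribute is ignored
        )
        hashes[element] = (frame_hash, md5_hex(attributes))
        for child in element:
            visit(child, path)

    visit(root, [])
    return hashes
```

Two tags with the same path and the same attributes receive the same pair of hashes, so pages built from one design template yield matching values for corresponding areas.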
Step 412: the client adds visual effects to the page elements of the web page according to the hash value attributes. In one implementation, when the mouse moves over an element, a light blue mask appears over it, indicating that the element's content area is selected; right-clicking the selected content area brings up menu items such as "Mark as title" and "Mark as body text".
Step 414: as each content area of the page is marked, the client records the correspondence between the hash values of the content area and the marked content type, and generates the web page template. The content of the web page template is, for example:
frame_hash:243092489 self_hash:49348393 title
frame_hash:434389298 self_hash:23439438 author
frame_hash:023473843 self_hash:34934932 text
frame_hash:483928384 self_hash:23487388 date
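The recording in step 414 can be sketched as a small recorder that collects one entry per saved mark and renders the template in the line format shown above. The class and method names are hypothetical and simply mirror the "Save mark" and "End marking" menu items.

```python
class TemplateRecorder:
    """Collects (frame_hash, self_hash, content type) entries while a page is marked."""

    def __init__(self):
        self._entries = []

    def save_mark(self, frame_hash, self_hash, content_type):
        # Called when the user marks a content area and chooses "Save mark".
        self._entries.append((frame_hash, self_hash, content_type))

    def end_marking(self):
        # Called on "End marking": render one template line per saved mark.
        return "\n".join(
            f"frame_hash:{frame_hash} self_hash:{self_hash} {content_type}"
            for frame_hash, self_hash, content_type in self._entries
        )


recorder = TemplateRecorder()
recorder.save_mark("243092489", "49348393", "title")
recorder.save_mark("434389298", "23439438", "author")
template_text = recorder.end_marking()
```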
Step 416: the client sends the generated web page template to the server, and the server stores it. During directed collection of this website, the server uses the template to extract the title, body text, and other content of its pages.
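Extraction with a stored template might proceed as in the sketch below: the server recomputes the two hashes for every tag of a fetched page and, where they match a template entry, takes the tag's text as the content of that type. It assumes well-formed XHTML input and the same MD5-based hash recipe as the examples in step 410; the function name `extract_content` and the dictionary form of the template are illustrative.

```python
import hashlib
import xml.etree.ElementTree as ET


def md5_hex(text):
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def extract_content(page_source, template):
    """template maps (frame_hash, self_hash) to a content type such as "title"."""
    root = ET.fromstring(page_source)
    result = {}

    def visit(element, ancestors):
        path = ancestors + [element.tag]
        attributes = " ".join(f"{k}:{v}" for k, v in element.attrib.items())
        key = (md5_hex(" ".join(path)), md5_hex(attributes))
        if key in template:
            # Use all text inside the matched tag, including nested tags.
            result[template[key]] = "".join(element.itertext()).strip()
        for child in element:
            visit(child, path)

    visit(root, [])
    return result


template = {
    (md5_hex("html body h1"), md5_hex("class:headline")): "title",
    (md5_hex("html body div"), md5_hex("id:main")): "text",
}
page = ('<html><body><h1 class="headline">Hello</h1>'
        '<div id="main">Body <b>text</b></div></body></html>')
extracted = extract_content(page, template)
```

If several tags match the same entry, this sketch keeps the last one; how such collisions should be handled is not covered by the text.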
An embodiment of the present invention also provides a device for generating a web page template. With reference to Fig. 5, the device includes a visual effect framework builder 10, an annotation instruction getter 20, and a web page template generator 30, where:
The visual effect framework builder 10 is adapted to build a visual effect framework for annotating a web page. In one implementation, the visual effect framework includes content areas, a mask positioned over the selected content area, and an annotation menu that includes multiple content type menu items. The visual effect framework builder 10 can build the visual effect framework of a page by obtaining the page's source code, for example an HTML document, attaching a style sheet file such as a CSS file to the HTML document, and adding a JS script to the HTML document.
The annotation instruction getter 20 is adapted to obtain instructions for annotating the content areas of the web page. A page can be annotated with the mouse or on a touch screen: for example, moving the mouse over a content area, right-clicking, and then clicking a content type menu item completes the annotation of that content area. The annotation instruction getter 20 can detect the annotation operation and obtain the content type selected through the right-click menu.
The web page template generator 30 is adapted to record the correspondence between content areas and annotation instructions to obtain the web page template. After the annotation instruction getter 20 obtains the selected content type, the web page template generator 30 can record the correspondence between the content area and the selected content type, thereby generating the web page template.
Optionally, the device also includes a statistics unit (not shown), adapted to collect statistics over multiple web page templates generated from multiple pages of the same resource website and to extract the parts that are identical across those templates to generate the final web page template.
In this embodiment, to make it easier to locate and display the content areas of a page, a hash value attribute can also be added to the tag to which each content area belongs. The device for generating a web page template may therefore also include a DOM tree generator, a hash value getter, and a hash value attribute adder. The DOM tree generator obtains the source code of the web page and generates the page's DOM tree from it; the hash value getter obtains the hash value of the tag corresponding to each node in the DOM tree; and the hash value attribute adder adds a hash value attribute to each tag of the page. The hash value may include the tag's level hash value in the DOM tree and the hash value of the tag itself. The level hash value can be calculated from the hierarchical relationship of the DOM tree in which the current tag sits, and the tag's own hash value can be calculated from the attribute nodes the current tag has. Correspondingly, the web page template generator 30 obtains the web page template by recording the correspondence between the hash value of the tag to which a content area belongs and the selected content type.
In a specific implementation, the hash values of tags can be calculated by the server. In this case, the hash value getter obtains the hash values of tags as follows: it adds an index attribute to each tag of the web page; sends the source code of the page, with the index attributes added, to the server so that the server can calculate the hash values of the tags; and receives the correspondence between tag index values and hash values sent by the server.
It should be noted that the steps of the method in Embodiment 1 can be split up and selected from as required, and so can the modules of the device in Embodiment 1. For example, steps 102 and 104 constitute a method of providing visual annotation for a web page, and the visual effect framework builder 10 and the annotation instruction getter 20 constitute a device for providing visual annotation for a web page.
Embodiment 2
The present embodiment provides a kind of method and device that webpage provides visualization mark.
Fig. 6 shows the method flow that webpage provides visualization mark according to an embodiment of the invention Figure, with reference to Fig. 6, described method includes:
Step 602, be constructed by that webpage is labeled by the masking-out being positioned at web page contents overlying regions can Depending on changing effect framework;
Described effect of visualization framework may include that content area, is positioned at above the content area chosen Masking-out and mark menu, described mark menu includes plurality of kinds of contents type menu item.
By obtaining the source code such as html document of webpage, stylesheet files such as css file is added To html document, and in html document, increase js (javascript) script, webpage can be built Effect of visualization framework.Specifically, can be realized when detecting that certain content area is selected by js script Time middle, masking-out and mark menu, described masking-out and mark menu occur above the content area chosen Display mode can be limited by the rule in stylesheet files.
With the above visualization effect framework, when the web page is displayed in a browser, each content region of the web page can have a visual effect. When a content region is selected (for example, when the mouse is detected moving over the region, or when a tap on the region or a swipe gesture at the region is detected on a touch screen), a mask appears above the region. The marking menu may appear above the region at the same time, or may appear in response to a trigger; for example, right-clicking the selected content region brings up the content type menu items. As shown in Fig. 2 and Fig. 3, the content type menu items may include "Mark as title", "Mark as body text", "Mark as date" and so on; they may also include "Save marks" and "End marking".
Step 604: obtain the instructions for marking each content region of the web page in the mask.
The instruction may be the content type, selected through the marking menu, that corresponds to the selected content region. In the embodiments of the present invention, marking is performed by the client, which may be operated by a user, an operator or an administrator. The web page can be marked with a mouse: move the mouse over a content region, right-click, and then click a content type menu item to complete the marking of that region. On a touch screen, the content type can likewise be selected by touching a menu item. As shown in Fig. 2, clicking "Mark as title" marks the corresponding content region as the title; as shown in Fig. 3, clicking "Mark as body text" marks the corresponding content region as the body text.
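The marking instructions described above can be pictured as a mapping from selected content regions to content types. The following Python sketch assumes that each region is identified by some string key; the class name MarkingSession, its methods, and the type names "title", "body" and "date" are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass, field

# Hypothetical content types, mirroring the menu items in the description.
CONTENT_TYPES = {"title", "body", "date"}


@dataclass
class MarkingSession:
    """Collects the content type chosen for each selected content region."""

    page_url: str
    marks: dict = field(default_factory=dict)

    def mark(self, region_id, content_type):
        # One instruction: the menu item chosen for the selected region.
        if content_type not in CONTENT_TYPES:
            raise ValueError(f"unknown content type: {content_type}")
        self.marks[region_id] = content_type

    def save(self):
        # Corresponds loosely to the "Save marks" menu item.
        return {"url": self.page_url, "regions": dict(self.marks)}
```

Marking the same region again simply overwrites the earlier choice, which matches the idea that the last menu selection for a region is the one that counts.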
It can be seen that the technical solution of the embodiments of the present invention, by building a visualization effect framework, allows a web page to be marked visually, which improves marking efficiency; moreover, because the web page content is presented intuitively, the content type of each part of the page structure is easy to determine, which improves marking accuracy.
In addition, in the embodiments of the present invention, to make it easier to locate and display the content regions of the web page, a hash value attribute may also be added to the tag to which each content region belongs. In this case, before the step of building the visualization effect framework for marking the web page, the method for providing visual marking for a web page may further include the following steps:
First, obtain the source code of the web page and generate the DOM tree of the web page from the source code;
Then, obtain the hash value of the tag corresponding to each node in the DOM tree;
Finally, add a hash value attribute to each tag of the web page.
The hash value may include the hash value of the tag's level in the DOM tree and the hash value of the tag itself. The level hash value can be computed from the hierarchical relationship of the current tag within the DOM tree, and the tag's own hash value can be computed from the attribute nodes of the current tag.
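The two hash values can be illustrated with a short Python sketch. The patent does not specify a hash function or the exact inputs, so the choices here are assumptions: the level hash is taken over the path of tag names from the root to the current tag, the tag's own hash is taken over its name and sorted attributes, and a truncated SHA-256 digest stands in for the hash function. The names TagHasher and hash_tags are hypothetical, and writing the values back into the page as attributes is left out.

```python
import hashlib
from html.parser import HTMLParser

# Elements that never have a closing tag and so never become ancestors.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


def _digest(text):
    # Assumption: any stable hash would do; SHA-256 is used for illustration.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:8]


class TagHasher(HTMLParser):
    def __init__(self):
        super().__init__()
        self.path = []     # tag names of the open ancestors, root first
        self.records = []  # (tag, level_hash, self_hash) in document order

    def handle_starttag(self, tag, attrs):
        # Level hash: depends only on the position in the hierarchy.
        level_hash = _digest("/".join(self.path + [tag]))
        # Self hash: depends on the tag name and its attributes.
        pairs = sorted((name, value or "") for name, value in attrs)
        self_hash = _digest(tag + "|" + "|".join(f"{n}={v}" for n, v in pairs))
        self.records.append((tag, level_hash, self_hash))
        if tag not in VOID_TAGS:
            self.path.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray end tags are ignored.
        if tag in self.path:
            while self.path.pop() != tag:
                pass


def hash_tags(html):
    parser = TagHasher()
    parser.feed(html)
    parser.close()
    return parser.records
```

With this scheme, two sibling paragraphs under the same parent share a level hash but differ in their own hash when their attributes differ, which is one way the pair of values could help locate a content region.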
In a specific implementation, the hash values of the tags may be computed by a server. As shown in Fig. 10, the server 210 is located in the search engine 200, the search engine 200 is communicatively connected to multiple (three are shown in the figure) third-party website servers 300, and the server 210 can cooperate with the client 100 to generate web page templates. In this case, obtaining the hash value of the tag corresponding to each node in the DOM tree may include:
First, the client 100 adds an index attribute to each tag of the web page;
Then, the client 100 sends the source code of the web page with the added index attributes to the server 210;
Next, the server 210 computes the hash values of the tags;
Finally, the server 210 sends the correspondence between tag index values and hash values to the client 100.
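The four steps above can be sketched in Python with the network transfer replaced by plain function calls. Everything named here is an assumption made for illustration: the attribute name data-index, the functions add_index_attributes and server_compute_hashes, the regex-based tag rewriting (which would also touch a "<" followed by a letter inside scripts or comments), and the use of a truncated SHA-256 digest over the tag name and attributes.

```python
import hashlib
import itertools
import re
from html.parser import HTMLParser

INDEX_ATTR = "data-index"  # hypothetical attribute name


def add_index_attributes(html):
    """Client side: number every start tag in document order."""
    counter = itertools.count()

    def repl(match):
        return f'{match.group(0)} {INDEX_ATTR}="{next(counter)}"'

    # Matches "<tagname" only, so closing tags ("</...") are left alone.
    return re.sub(r"<[A-Za-z][A-Za-z0-9]*", repl, html)


class _IndexHashCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.mapping = {}  # index value -> hash value

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        index = attrs.pop(INDEX_ATTR, None)
        if index is None:
            return
        # The index itself is excluded so that equal tags hash equally.
        own = tag + "|" + "|".join(f"{k}={attrs[k] or ''}" for k in sorted(attrs))
        self.mapping[int(index)] = hashlib.sha256(own.encode("utf-8")).hexdigest()[:8]


def server_compute_hashes(indexed_html):
    """Server side: return the index-to-hash correspondence."""
    collector = _IndexHashCollector()
    collector.feed(indexed_html)
    collector.close()
    return collector.mapping
```

The client would call add_index_attributes, send the result to the server, and use the returned mapping to attach a hash value attribute to the tag carrying each index.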
When the present invention is implemented, the marking operation at the client may include the following steps. First, the client installs a plug-in that generates the visualization effect framework and accesses a web page on a third-party website server 300. Then, in one implementation, when the mouse moves over a content region of the web page, a light blue mask appears above the region to indicate that it is selected; a right click brings up a selection menu, in which the content type of the region, such as title or body text, can be chosen. Repeating these steps completes the marking of the web page.
An embodiment of the present invention also provides a device for providing visual marking for a web page. Referring to Fig. 7, the device includes a visualization effect framework builder 10 and a marking instruction obtainer 20, wherein:
The visualization effect framework builder 10 is adapted to build a visualization effect framework in which the web page is marked through a mask located above a content region of the web page. In one implementation, the visualization effect framework includes: content regions, a mask located above the selected content region, and a marking menu, the marking menu including menu items for multiple content types. The visualization effect framework builder 10 can build the visualization effect framework of the web page by obtaining the source code of the web page, such as an html document, attaching a stylesheet file, such as a css file, to the html document, and adding a js script to the html document.
The marking instruction obtainer 20 is adapted to obtain the instructions for marking each content region of the web page in the mask, the instruction being the content type, selected through the marking menu, that corresponds to the selected content region. The web page can be marked with a mouse or on a touch screen; for example, the mouse is moved over a content region and right-clicked, and a content type menu item is then clicked to complete the marking of that region. The marking instruction obtainer 20 can detect the marking operation and obtain the content type selected through the right-click menu.
In the embodiments of the present invention, to make it easier to locate and display the content regions of the web page, a hash value attribute may also be added to the tag to which each content region belongs. The device for providing visual marking for a web page may therefore further include a DOM tree generator, a hash value obtainer and a hash value attribute adder. The DOM tree generator obtains the source code of the web page and generates the DOM tree of the web page from the source code; the hash value obtainer obtains the hash value of the tag corresponding to each node in the DOM tree; and the hash value attribute adder adds a hash value attribute to each tag of the web page. The hash value may include the hash value of the tag's level in the DOM tree and the hash value of the tag itself. The level hash value can be computed from the hierarchical relationship of the current tag within the DOM tree, and the tag's own hash value can be computed from the attribute nodes of the current tag.
In a specific implementation, the hash values of the tags may be computed by a server. In this case, the hash value obtainer further obtains the hash values of the tags as follows: adding an index attribute to each tag of the web page; sending the source code of the web page with the added index attributes to the server, so that the server computes the hash values of the tags; and receiving, from the server, the correspondence between tag index values and hash values.
Embodiment 3
This embodiment provides a method and a device for extracting web page content according to a visual template.
Fig. 10 shows a structural diagram of a system for extracting web page content according to a visual template according to an embodiment of the present invention. Referring to Fig. 10, the system includes a client 100, a search engine 200 and multiple (three are shown in the figure) third-party website servers 300. The search engine 200 includes a server 210 and is communicatively connected to the third-party website servers 300. The server 210 can cooperate with the client 100 to generate web page templates, and the search engine 200 can extract web page content according to a web page template, that is, extract the structured content of web pages on the third-party website servers 300 according to the web page template.
Fig. 8 shows a flowchart of a method for extracting web page content according to a visual template according to an embodiment of the present invention. Referring to Fig. 8, the method includes:
Step 802: when performing a directed crawl of a target website, look up whether the web page template library records a web page template, generated according to visual marking, that corresponds to the target website;
The web page template library stores web page templates generated according to visual marking. A web page template may be one generated according to the solution of Embodiment 1 or Embodiment 2. The library stores multiple web page templates, and each template can be identified by the homepage URL of a website. Whether a corresponding template exists in the library can therefore be looked up by the homepage URL of the target website.
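The lookup in step 802 can be sketched as a dictionary keyed by homepage URL. The sketch below is an assumption about one possible in-memory form of the library: the names TemplateLibrary and homepage_key are hypothetical, the homepage URL is normalized to scheme and host only, and the template is treated as an opaque value.

```python
from urllib.parse import urlsplit


def homepage_key(url):
    """Reduce any URL of a site to a normalized homepage URL."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc.lower()}/"


class TemplateLibrary:
    """In-memory stand-in for the web page template library."""

    def __init__(self):
        self._templates = {}

    def save(self, homepage_url, template):
        self._templates[homepage_key(homepage_url)] = template

    def find(self, target_url):
        # Returns None when no template is recorded for the target site.
        return self._templates.get(homepage_key(target_url))
```

A crawler would call find with the target website's URL before crawling and proceed to template-based extraction (step 804) only when a template is returned.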
Step 804: if the web page template library records a web page template, generated according to visual marking, that corresponds to the target website, extract content from the target website according to that web page template.
The URLs of all links on the homepage can be extracted according to the homepage URL, those that jump out to other websites removed, and the remaining URLs put into a scheduling queue; content is then extracted, according to the web page template, from the web page corresponding to each URL in the scheduling queue. A web page crawler can be used to perform the content extraction; the crawler may be a web spider, a web crawler, a search robot, a web crawling script, or the like.
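The link filtering described above can be sketched in Python using only the standard library. The sketch takes the homepage source as a string rather than fetching it; the function name build_schedule_queue is hypothetical, "same website" is assumed to mean the same host name, and the removal of duplicate URLs is an addition that the patent does not mention.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class _LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def build_schedule_queue(homepage_url, homepage_html):
    """Queue the homepage links that stay on the same website."""
    collector = _LinkCollector()
    collector.feed(homepage_html)
    collector.close()

    site = urlsplit(homepage_url).netloc.lower()
    queue, seen = deque(), set()
    for href in collector.hrefs:
        url = urljoin(homepage_url, href)  # resolve relative links
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript: and similar
        if parts.netloc.lower() != site:
            continue  # link jumps out to another website
        if url not in seen:
            seen.add(url)
            queue.append(url)
    return queue
```

Each URL taken from the queue would then be fetched and its content extracted with the web page template found for the site.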
In the technical solution of the embodiments of the present invention, web page content is extracted using a web page template generated by visual marking. Because such a template is more accurate, the accuracy of content extraction based on it is also improved.
Fig. 9 shows a structural diagram of a device for extracting web page content according to a visual template according to an embodiment of the present invention. Referring to Fig. 9, the device includes a web page template library 902, a lookup unit 904 and a content extractor 906, wherein:
The web page template library 902 is adapted to store web page templates generated according to visual marking; a web page template may be identified by the URL of a web page, or by the homepage URL of a website.
The lookup unit 904 is adapted to look up, when a directed crawl of a target website is performed, whether the web page template library records a web page template, generated according to visual marking, that corresponds to the target website.
The content extractor 906 is adapted to extract content from the target website according to the web page template when the web page template library records a web page template, generated according to visual marking, that corresponds to the target website. The content extractor 906 can extract the URLs of all links on the homepage according to the homepage URL, remove those that jump out to other websites, and put the remaining URLs into a scheduling queue; it then extracts content, according to the web page template, from the web page corresponding to each URL in the scheduling queue. The content extractor 906 may be a web spider, a web crawler, a search robot, a web crawling script, or the like.
Optionally, the device further includes means for generating web page templates, that is, it may include the visualization effect framework builder 10, the marking instruction obtainer 20 and the web page template generator 30 of Embodiment 1; for the connections between these modules and their operating principles, see the description in Embodiment 1.
In summary, according to the technical solution of the embodiments of the present invention, a visualization effect framework for marking web pages is built, so that web page template text need not be edited by hand: selecting content regions in the framework and operating on them visually is enough to mark the web page content, which improves marking efficiency and, in turn, the efficiency of generating web page templates. Moreover, because the web page content is presented intuitively, the content type of each part of the page structure can be determined easily without expertise in web page design, which improves marking accuracy and, in turn, the accuracy of the generated web page templates. Furthermore, because the web page templates are more accurate, the content obtained by crawling web pages according to such a template is also more accurate.
The algorithms and displays provided herein are not inherently related to any particular computer, virtual system or other apparatus. Various general-purpose systems may also be used with the teachings herein. From the above description, the structure required to construct such systems is apparent. Moreover, the present invention is not directed to any particular programming language. It should be understood that the content of the invention described herein may be implemented in various programming languages, and that the above description of a specific language is given to disclose the best mode of the invention.
In the description provided herein, numerous specific details are set forth. It is to be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail, so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid the understanding of one or more of the various inventive aspects, features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments. However, this method of disclosure is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single embodiment disclosed above. The claims following the detailed description are therefore expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will appreciate that the modules in the device of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units or components of an embodiment may be combined into one module, unit or component, and may furthermore be divided into multiple sub-modules, sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract and drawings), and all processes or units of any method or device so disclosed, may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract and drawings) may be replaced by an alternative feature serving the same, equivalent or similar purpose.
Furthermore, those skilled in the art will appreciate that, although some embodiments described herein include certain features that are included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.
The various component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art should understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the device for extracting web page content according to a visual template according to embodiments of the invention. The invention may also be implemented as device or apparatus programs (for example, computer programs and computer program products) for performing part or all of the methods described herein. Such programs implementing the invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second and third does not indicate any order; these words may be interpreted as names.

Claims (8)

1. A method for extracting web page content according to a visual template, comprising:
when performing a directed crawl of a target website, looking up whether a web page template library records a web page template, generated according to visual marking, that corresponds to the target website;
if the web page template library records a web page template, generated according to visual marking, that corresponds to the target website, extracting content from the target website according to the web page template, wherein the process of generating the web page template comprises:
obtaining the html document of a web page, and inserting a stylesheet file and a js script into the html document, wherein the js script causes a mask and a marking menu to appear above a content region when the content region is detected as selected, and the display style of the mask and the marking menu is defined by rules in the stylesheet file, thereby building a visualization effect framework for marking the web page;
obtaining instructions for marking each content region of the web page, the instruction being the content type, selected through the marking menu, that corresponds to the selected content region;
recording the correspondence between content regions and content types to obtain the web page template.
2. The method as claimed in claim 1, wherein the web page template is identified by the homepage URL of a website.
3. The method as claimed in claim 1 or 2, wherein extracting content from the target website according to the web page template comprises:
extracting the URLs of all links on the homepage according to the homepage URL, removing those that jump out to other websites, and putting the remaining URLs into a scheduling queue;
extracting content, according to the web page template, from the web page corresponding to each URL in the scheduling queue.
4. The method as claimed in claim 1, further comprising:
performing statistics on multiple web page templates generated from multiple web pages of the same resource website, and extracting the parts common to the multiple web page templates to generate a final web page template.
5. A device for extracting web page content according to a visual template, comprising:
a web page template library, adapted to store web page templates generated according to visual marking;
a lookup unit, adapted to look up, when a directed crawl of a target website is performed, whether the web page template library records a web page template, generated according to visual marking, that corresponds to the target website;
a content extractor, adapted to extract content from the target website according to the web page template when the web page template library records a web page template, generated according to visual marking, that corresponds to the target website;
a visualization effect framework builder, adapted to obtain the html document of a web page and insert a stylesheet file and a js script into the html document, wherein the js script causes a mask and a marking menu to appear above a content region when the content region is detected as selected, and the display style of the mask and the marking menu is defined by rules in the stylesheet file, thereby building a visualization effect framework for marking the web page;
a marking instruction obtainer, adapted to obtain instructions for marking each content region of the web page, the instruction being the content type, selected through the marking menu, that corresponds to the selected content region;
a web page template generator, adapted to record the correspondence between content regions and content types to obtain the web page template.
6. The device as claimed in claim 5, wherein the web page template is identified by the homepage URL of a website.
7. The device as claimed in claim 5 or 6, wherein the content extractor is further adapted to:
extract the URLs of all links on the homepage according to the homepage URL, remove those that jump out to other websites, and put the remaining URLs into a scheduling queue;
extract content, according to the web page template, from the web page corresponding to each URL in the scheduling queue.
8. The device as claimed in claim 5, further comprising:
a statistics unit, adapted to perform statistics on multiple web page templates generated from multiple web pages of the same resource website, and to extract the parts common to the multiple web page templates to generate a final web page template.
CN201310606505.8A 2013-11-25 2013-11-25 The method and device of webpage content extraction is carried out according to visual template Active CN103678511B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310606505.8A CN103678511B (en) 2013-11-25 2013-11-25 The method and device of webpage content extraction is carried out according to visual template

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310606505.8A CN103678511B (en) 2013-11-25 2013-11-25 The method and device of webpage content extraction is carried out according to visual template

Publications (2)

Publication Number Publication Date
CN103678511A CN103678511A (en) 2014-03-26
CN103678511B true CN103678511B (en) 2016-11-16

Family

ID=50316056

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310606505.8A Active CN103678511B (en) 2013-11-25 2013-11-25 The method and device of webpage content extraction is carried out according to visual template

Country Status (1)

Country Link
CN (1) CN103678511B (en)

Also Published As

Publication number Publication date
CN103678511A (en) 2014-03-26


Legal Events

Date Code Title Description
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220725

Address after: Room 801, 8th floor, No. 104, floors 1-19, building 2, yard 6, Jiuxianqiao Road, Chaoyang District, Beijing 100015

Patentee after: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Address before: 100088 room 112, block D, 28 new street, new street, Xicheng District, Beijing (Desheng Park)

Patentee before: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Patentee before: Qizhi software (Beijing) Co.,Ltd.

TR01 Transfer of patent right