CN103678511B - Method and device for extracting web page content according to a visual template - Google Patents
- Publication number: CN103678511B (application CN201310606505.8A)
- Authority: CN (China)
- Prior art keywords: web page, web page template, webpage, content, mark
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/253—Grammatical analysis; Style critique
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The present invention discloses a method and device for extracting web page content according to a visual template, and belongs to the field of Internet technology. The method includes: when a target website is crawled in a targeted manner, searching a web page template library for a web page template that corresponds to the target website and was generated from visual annotation; and, if the library holds such a template, extracting content from the target website according to that template. The invention can improve the accuracy of web page content extraction.
Description
Technical field
The present invention relates to the field of Internet technology, and in particular to a method and device for extracting web page content according to a visual template.
Background
A web page template can be used to extract the content of a web page. For example, some search engines use targeted crawling when they fetch a website: the targeted crawler (spider) applies a web page template to pull out the relevant content of the site and obtain it in a structured form, including the page title, author, publication time, and body text.
One existing method of generating a web page template is as follows. First, the source code of a page is downloaded according to its URL (Uniform Resource Locator). Second, the page structure is analyzed automatically from the source code, and a hash value is calculated for each structure in the page. Then a person reads the source code and judges which structure holds the title, which holds the body text, which holds the publication time, and so on, and marks each accordingly. Finally, the correspondence between the hash value of each structure and its content type is generated, which yields the web page template.
This existing method has at least the following disadvantages:
The content type of each page structure is marked by hand in a text editor. A web page template contains a large amount of irrelevant content, and some templates run to tens of thousands of lines, so manual marking is very inefficient.
The various kinds of content in the template are mixed into the page code. Because the page content is not presented visually, a person who is not familiar with web design languages cannot easily determine the content type of a page structure, and manual marking is error-prone. The resulting template is not very accurate, and neither is the content extraction that relies on it.
Summary of the invention
In view of the above problems, the present invention is proposed to provide a method and device for extracting web page content according to a visual template that overcome the above problems or at least partly solve them.
According to one aspect of the present invention, a method for extracting web page content according to a visual template is provided. The method includes:
when a target website is crawled in a targeted manner, searching the web page template library for a web page template that corresponds to the target website and was generated from visual annotation;
if the web page template library holds such a template, extracting content from the target website according to the template.
Optionally, the web page template is identified by the homepage URL of the website.
Optionally, extracting content from the target website according to the web page template includes:
extracting the URLs of all links in the homepage according to the homepage URL, removing those that jump out to other websites, and putting the remaining URLs into a scheduling queue;
extracting content from the web page corresponding to each URL in the scheduling queue according to the web page template.
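The link-filtering step can be sketched in Python as follows. This is only an illustration under stated assumptions: the patent does not say how "jumping out to another website" is detected, so the sketch simply compares host names, and the de-duplication of URLs is an addition of the sketch. The names `LinkCollector` and `build_scheduling_queue` are hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def build_scheduling_queue(homepage_url, homepage_html):
    """Return a queue of same-site URLs linked from the homepage."""
    collector = LinkCollector()
    collector.feed(homepage_html)
    site_host = urlsplit(homepage_url).netloc.lower()

    queue, seen = deque(), set()
    for href in collector.hrefs:
        absolute = urljoin(homepage_url, href)  # resolve relative links
        parts = urlsplit(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        if parts.netloc.lower() != site_host:
            continue  # assumption: a different host means another website
        if absolute not in seen:  # de-duplication is an addition of this sketch
            seen.add(absolute)
            queue.append(absolute)
    return queue
```

Each URL taken from the queue would then be fetched and passed to the template-based extraction. A real crawler would probably also treat sub-domains of the same site as internal, which this host comparison does not do.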
Optionally, the method further includes:
building a visual effect framework for annotating a web page, the framework including content areas, a mask displayed over the selected content area, and an annotation menu that contains multiple content type menu items;
obtaining instructions that annotate the content areas of the web page, each instruction being the content type chosen from the annotation menu for the selected content area;
recording the correspondence between content areas and content types to obtain the web page template.
Optionally, the method further includes:
gathering statistics over the multiple web page templates generated from multiple web pages of the same resource website, and extracting the part that the templates have in common to generate a final web page template.
According to another aspect of the present invention, a device for extracting web page content according to a visual template is provided. The device includes:
a web page template library, adapted to store web page templates generated from visual annotation;
a finder, adapted to search the web page template library, when a target website is crawled in a targeted manner, for a web page template that corresponds to the target website and was generated from visual annotation;
a content extractor, adapted to extract content from the target website according to the web page template when the library holds such a template for the target website.
Optionally, the web page template is identified by the homepage URL of the website.
Optionally, the content extractor is further adapted to:
extract the URLs of all links in the homepage according to the homepage URL, remove those that jump out to other websites, and put the remaining URLs into a scheduling queue;
extract content from the web page corresponding to each URL in the scheduling queue according to the web page template.
Optionally, the device further includes:
a visual effect framework builder, adapted to build a visual effect framework for annotating a web page, the framework including content areas, a mask displayed over the selected content area, and an annotation menu that contains multiple content type menu items;
an annotation instruction obtainer, adapted to obtain instructions that annotate the content areas of the web page, each instruction being the content type chosen from the annotation menu for the selected content area;
a web page template generator, adapted to record the correspondence between content areas and content types to obtain the web page template.
Optionally, the device further includes a counter, adapted to gather statistics over the multiple web page templates generated from multiple web pages of the same resource website and to extract the part that the templates have in common to generate a final web page template.
According to one or more of the above technical solutions, web page content is extracted with a web page template generated from visual annotation. Because such a template is more accurate, content extraction based on it is more accurate as well.
The above is only an overview of the technical solution of the present invention. Specific embodiments are set out below so that the technical means of the invention can be understood more clearly and implemented according to the description, and so that the above and other objects, features, and advantages of the invention become more apparent.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art from reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be regarded as limiting the invention. Throughout the drawings, the same reference signs denote the same parts. In the drawings:
Fig. 1 shows a flowchart of a method for generating a web page template according to an embodiment of the present invention;
Fig. 2 shows a schematic diagram of annotating the title of a web page in an embodiment of the present invention;
Fig. 3 shows a schematic diagram of annotating the body text of a web page in an embodiment of the present invention;
Fig. 4 shows a detailed flowchart of a method for generating a web page template according to an embodiment of the present invention;
Fig. 5 shows a structural diagram of a device for generating a web page template according to an embodiment of the present invention;
Fig. 6 shows a flowchart of a method for providing visual annotation for a web page according to an embodiment of the present invention;
Fig. 7 shows a structural diagram of a device for providing visual annotation for a web page according to an embodiment of the present invention;
Fig. 8 shows a flowchart of a method for extracting web page content according to a visual template according to an embodiment of the present invention;
Fig. 9 shows a structural diagram of a device for extracting web page content according to a visual template according to an embodiment of the present invention;
Fig. 10 shows a structural diagram of a system for extracting web page content according to a visual template according to an embodiment of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present disclosure are described in more detail below with reference to the drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and its scope conveyed completely to those skilled in the art.
Embodiment 1
This embodiment provides a method and device for generating a web page template.
Fig. 1 shows a flowchart of a method for generating a web page template according to an embodiment of the present invention. With reference to Fig. 1, the method includes:
Step 102: build a visual effect framework for annotating a web page.
In one implementation, the visual effect framework may include: content areas, a mask displayed over the selected content area, and an annotation menu that contains multiple content type menu items.
The visual effect framework of a web page can be built by obtaining the page's source code, such as an HTML (HyperText Markup Language) document, attaching a style sheet file, such as a CSS (Cascading Style Sheets) file, to the HTML document, and adding a JS (JavaScript) script to the HTML document. Specifically, the JS script can make a mask and an annotation menu appear over a content area when it detects that the area has been selected, and the way the mask and menu are displayed can be governed by rules defined in the style sheet file.
With this framework in place, each content area of the web page gains a visual effect when the page is displayed in a browser. When a content area is selected (for example, when the mouse is detected moving over it, or, on a touch screen, when a tap or a swipe gesture on it is detected), a mask appears over the area. The annotation menu can appear over the area at the same time, or it can appear in response to a trigger; for example, right-clicking the selected content area brings up the content type menu items. As shown in Fig. 2 and Fig. 3, the content type menu items may include "Mark as title", "Mark as body text", and "Mark as date"; they may also include "Save mark" and "End mark".
Step 104: obtain the instructions that annotate the content areas of the web page.
In embodiments of the present invention, annotation is performed by the client, which may be operated by a user, operations staff, or an administrator. A web page can be annotated with the mouse: move the pointer over a content area, right-click, and then click a content type menu item to annotate that area. On a touch screen, the content type can likewise be chosen by touching a menu item. As shown in Fig. 2, clicking "Mark as title" marks the corresponding content area as the title; as shown in Fig. 3, clicking "Mark as body text" marks the corresponding content area as the body text.
Step 106: record the correspondence between content areas and annotation instructions to obtain the web page template.
Each time a content area is marked and the "Save mark" menu item is then selected, the correspondence between that content area and the chosen content type is stored in the web page template. Selecting the "End mark" menu item finishes the marking of all content areas in the page that need it and yields the web page template (also called the web page content template) for this page.
It can be seen that, with the technical solution of this embodiment, a web page template can be defined simply by selecting content areas in the visual effect framework and operating on them visually, which makes template generation more efficient. Moreover, because the page content is presented visually, the content type of each page structure is easy to determine, which makes the generated template more accurate.
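The bookkeeping behind steps 104 and 106 can be illustrated with a small Python sketch. It models only the record that the "Save mark" and "End mark" menu items build up, not the browser-side mask and menu. The class name `TemplateBuilder`, its method names, and the region identifiers are hypothetical; the patent does not prescribe a data structure.

```python
class TemplateBuilder:
    """Collects content area -> content type correspondences for one page."""

    def __init__(self):
        self._pending = None   # last (region, content type) chosen in the menu
        self._entries = {}     # saved correspondences
        self.finished = False

    def mark(self, region_id, content_type):
        """A content type menu item was chosen for the selected region."""
        self._pending = (region_id, content_type)

    def save_mark(self):
        """'Save mark': store the pending correspondence in the template."""
        if self._pending is None:
            raise ValueError("no region has been marked")
        region_id, content_type = self._pending
        self._entries[region_id] = content_type
        self._pending = None

    def end_mark(self):
        """'End mark': finish annotating and return the page template."""
        self.finished = True
        return dict(self._entries)
```

In this sketch a mark that is chosen but never saved is dropped when marking ends; whether the actual client behaves that way is not stated in the source.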
The scheme above generates, from one web page, the template for that page. A resource website may contain many web pages. These pages are usually generated from the same web design template, so their structures are essentially the same, with only minor differences: for example, some pages include comments and others do not, but all of them include a title, author, publication time, and body text. Running the above steps on every page to generate a template would be a great deal of work.
Therefore, to make template generation still more efficient, the method may further include: gathering statistics over the multiple web page templates generated from multiple web pages of the same resource website, and extracting the part that the templates have in common to generate a final web page template. Specifically, all the web pages of the resource website can be sampled to obtain multiple pages; multiple web page templates are then generated by the method above; finally, the part common to these templates (each correspondence between a content area and a content type is one part of a template) is extracted to generate the final web page template (also called the web page template of the resource website).
For example, for the 360 website, the HTML document of the homepage is first obtained from the site's homepage URL (http://www.360.cn/). Analysis of this document shows that the site includes many sub-pages (for example, 1000). 50 of these 1000 sub-pages are then drawn by a predetermined algorithm (for example, random selection). Visually annotating the 50 sub-pages produces 50 web page templates. Finally, the part common to the 50 templates is extracted to generate the web page template for the 360 website.
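The sampling and merging described above might look like the following Python sketch. It takes one strict reading of "the part the templates have in common": an entry survives only if every sampled template contains exactly the same content area and content type. The source does not rule out a looser rule such as a majority vote. The function names and the representation of a template as a dictionary are assumptions.

```python
import random


def sample_pages(page_urls, sample_size, seed=None):
    """Draw sub-pages at random, e.g. 50 out of 1000."""
    rng = random.Random(seed)
    return rng.sample(page_urls, min(sample_size, len(page_urls)))


def merge_templates(templates):
    """Keep only the area -> content type entries shared by every template."""
    if not templates:
        return {}
    common = set(templates[0].items())
    for template in templates[1:]:
        common &= set(template.items())
    return dict(common)
```

With this rule, an area that appears on only some pages, such as a comment block, drops out of the site-level template, while title, author, date, and body text remain as long as every sampled page marks them the same way.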
In addition, in embodiments of the present invention, to make content areas in the page easier to locate and display, a hash value attribute can be added to the tag to which each content area belongs. The web page template then stores the correspondence between the hash value of that tag and the chosen content type. In this case, before the step of building the visual effect framework, the method may further include the following steps:
First, obtain the source code of the web page and generate the page's DOM (Document Object Model) tree from it;
Then, obtain the hash value of the tag corresponding to each node in the DOM tree;
Finally, add a hash value attribute to each tag of the web page.
The hash values may include the tag's level hash value in the DOM tree and the hash value of the tag itself. The level hash value can be calculated from the hierarchical relationship of the DOM tree in which the current tag sits, and the tag's own hash value can be calculated from the attribute nodes that the current tag has.
In a specific implementation, the hash values of the tags can be calculated by a server. As shown in Fig. 10, the server 210 is located in a search engine 200, which is communicatively connected to multiple (three are shown in the figure) third-party website servers 300. The server 210 can cooperate with the client 100 to generate web page templates. In this case, obtaining the hash value of the tag corresponding to each node in the DOM tree may include:
First, the client 100 adds an index attribute to each tag of the web page;
Then, the client 100 sends the source code of the web page, with the index attributes added, to the server 210;
Next, the server 210 calculates the hash values of the tags;
Finally, the server 210 sends the correspondence between tag index values and hash values to the client 100.
When the invention is implemented, operation at the client may include the following steps:
First, a visual effect framework generation plug-in is installed on the client, and a web page on a third-party website server 300 is accessed;
Then, in one implementation, the mouse is moved over a content area of the page, and a light blue mask appears over the area to show that it is selected; right-clicking brings up a selection menu from which the content type of the area, such as title or body text, can be chosen;
Finally, after marking is complete, the client generates the web page template.
The client can send the generated web page template to the server, which can use it to gather information when it crawls web page content in a targeted manner.
A detailed flow of the method for generating a web page template according to an embodiment of the present invention is given below. With reference to Fig. 4, the method includes:
Step 402: the client obtains the source code of the web page and generates the page's DOM tree from it;
Step 404: the client adds an index attribute to each tag of the DOM tree; the DOM tree can be traversed with a depth-first algorithm;
Step 406: the client sends the source code of the web page, with the index attributes added, to the server. The transmitted content is, for example:
Step 408: the server receives the source code with the index attributes that the client sent, analyzes it, calculates the hash values of the tags corresponding to the entire page structure, and returns all the calculated hash values to the client;
The hash values calculated by the server correspond to the tag indexes and can be packaged in JSON format and returned to the client. The JSON content has, for example, the form: {tag index value: {hash 1: hash1, hash 2: hash2} ...}.
Step 410: the client receives the JSON data returned by the server and, using the correspondence between tag index values and hash values, adds two attribute values to each corresponding tag: the tag's level hash value in the DOM tree, frame_hash, and the hash value of the tag itself, self_hash;
For example, a div tag to which the hash value attributes have been added reads as follows:
<div frame_hash="46131321231613" self_hash="174461815164" index="45">
content
</div>
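The client side of this round trip (adding index attributes in step 404 and merging the server's reply in step 410) can be sketched in Python. The sketch assumes well-formed, XML-like markup because it uses the standard library's `xml.etree.ElementTree`; real-world HTML would need a tolerant HTML parser. The JSON key names `frame_hash` and `self_hash` and both function names are assumptions, since the source only gives the reply format schematically.

```python
import json
import xml.etree.ElementTree as ET


def add_index_attributes(root):
    """Number every element in depth-first (document) order."""
    for index, element in enumerate(root.iter()):
        element.set("index", str(index))
    return root


def apply_hashes(root, json_text):
    """Copy frame_hash and self_hash from the server reply onto each tag."""
    hashes_by_index = json.loads(json_text)  # {"45": {"frame_hash": ..., "self_hash": ...}}
    for element in root.iter():
        entry = hashes_by_index.get(element.get("index"))
        if entry:
            element.set("frame_hash", entry["frame_hash"])
            element.set("self_hash", entry["self_hash"])
    return root
```

`Element.iter()` visits the element itself and then its descendants in document order, which is a depth-first traversal, so the index values follow the order in which tags open in the source.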
Here, the frame_hash value is calculated from the hierarchical relationship of the DOM tree in which the current tag sits. For example, to calculate the frame_hash of the div tag, an MD5 hash can be computed over the string "html body div". Other algorithms can be used; the embodiment of the present invention does not limit the specific algorithm.
The self_hash value is calculated from the attribute nodes that the current tag has. For example, if the div tag has a class attribute and an id attribute, an MD5 hash can be computed over the string "class:name id:author". Here too, other algorithms can be used, and the embodiment does not limit the specific algorithm.
In this way, a node of the DOM tree can be located from its frame_hash and self_hash.
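The two hash values can be illustrated with Python's `hashlib`. This sketch follows the MD5 example from the text, but the details are assumptions: attributes are sorted by name so that the result does not depend on attribute order, and the digest is returned as a hexadecimal string, whereas the sample values in the text are written as decimal digits. The function names are hypothetical.

```python
import hashlib


def md5_hex(text):
    """Return the MD5 digest of a string as 32 hexadecimal characters."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def frame_hash(tag_path):
    """Hash the chain of tag names from the root down to the tag.

    tag_path is a list such as ["html", "body", "div"].
    """
    return md5_hex(" ".join(tag_path))


def self_hash(attributes):
    """Hash the tag's own attributes, e.g. {"class": "name", "id": "author"}.

    Sorting by attribute name is an assumption of this sketch.
    """
    pairs = sorted(attributes.items())
    return md5_hex(" ".join(f"{name}:{value}" for name, value in pairs))
```

Because frame_hash depends only on the tag path, sibling tags with the same path share it; the pair (frame_hash, self_hash) separates them only when their attributes differ, so a production scheme might need more context than this sketch uses.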
Step 412: the client adds visual effects to the page elements of the web page according to the hash value attributes. In one implementation, when the mouse moves over an element, a light blue mask appears over it to show that the element's content area is selected; right-clicking the selected content area brings up menu items such as "Mark as title" and "Mark as body text".
Step 414: as each content area of the web page is marked, the client records the correspondence between the hash values of the content area and the marked content type, and generates the web page template. The content of the template is, for example:
frame_hash:243092489 self_hash:49348393 title
frame_hash:434389298 self_hash:23439438 author
frame_hash:023473843 self_hash:34934932 text
frame_hash:483928384 self_hash:23487388 date
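A template in this line-oriented form can be read back into a lookup table with a few lines of Python. The sketch assumes one entry per line in the order frame_hash, self_hash, content type, and it tolerates missing whitespace between the two hash fields. The function name `parse_template` is hypothetical; the source does not define a file format beyond the example.

```python
import re

# frame_hash:<digits> self_hash:<digits> <content type>
ENTRY = re.compile(r"frame_hash:(\d+)\s*self_hash:(\d+)\s+(\S.*)", re.IGNORECASE)


def parse_template(text):
    """Map (frame_hash, self_hash) pairs to content types."""
    template = {}
    for line in text.splitlines():
        match = ENTRY.search(line)
        if match:
            frame, own, content_type = match.groups()
            # Hashes stay strings so leading zeros (e.g. 023473843) survive.
            template[(frame, own)] = content_type.strip()
    return template
```

Keying the table on the pair of hashes lets the extraction step look up a tag's content type directly.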
Step 416: the client sends the generated web page template to the server, and the server saves it. When this website is crawled in a targeted manner, the template is used to extract the title, body text, and other content of each page.
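Applying a saved template during crawling can be sketched as follows. To keep the example short it assumes that every element of the fetched page already carries `frame_hash` and `self_hash` attributes; a crawler would normally compute them from the downloaded source first. It also assumes well-formed markup, because it uses `xml.etree.ElementTree`. The function name `extract_content` is hypothetical.

```python
import xml.etree.ElementTree as ET


def extract_content(page_root, template):
    """Return {content type: text} for every element the template names.

    template maps (frame_hash, self_hash) pairs to content types.
    """
    result = {}
    for element in page_root.iter():
        key = (element.get("frame_hash"), element.get("self_hash"))
        content_type = template.get(key)
        if content_type is not None:
            text = "".join(element.itertext()).strip()
            # Keep the first match for each content type.
            result.setdefault(content_type, text)
    return result
```

Elements whose hash pair is not in the template, such as advertising blocks, are simply skipped, which is how the template filters irrelevant content out of the result.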
An embodiment of the present invention also provides a device for generating a web page template. With reference to Fig. 5, the device includes a visual effect framework builder 10, an annotation instruction obtainer 20, and a web page template generator 30, where:
The visual effect framework builder 10 is adapted to build a visual effect framework for annotating a web page. In one implementation, the framework includes: content areas, a mask displayed over the selected content area, and an annotation menu that contains multiple content type menu items. The builder 10 can build the framework by obtaining the page's source code, such as an HTML document, attaching a style sheet file, such as a CSS file, to the HTML document, and adding a JS script to the HTML document.
The annotation instruction obtainer 20 is adapted to obtain the instructions that annotate the content areas of the web page. A web page can be annotated with the mouse or a touch screen: for example, move the pointer over a content area, right-click, and then click a content type menu item to annotate that area. The obtainer 20 can detect the annotation operation and obtain the content type chosen from the right-click menu.
The web page template generator 30 is adapted to record the correspondence between content areas and annotation instructions to obtain the web page template. After the obtainer 20 obtains the chosen content type, the generator 30 can record the correspondence between the content area and the chosen content type, and thereby generate the web page template.
Optionally, the device further includes a counter (not shown in the figure), adapted to gather statistics over the multiple web page templates generated from multiple web pages of the same resource website and to extract the part that the templates have in common to generate a final web page template.
In embodiments of the present invention, to make content areas in the page easier to locate and display, a hash value attribute can also be added to the tag to which each content area belongs. The device for generating a web page template may therefore further include a DOM tree generator, a hash value obtainer, and a hash value attribute adder. The DOM tree generator obtains the source code of the web page and generates the page's DOM tree from it; the hash value obtainer obtains the hash value of the tag corresponding to each node in the DOM tree; the hash value attribute adder adds a hash value attribute to each tag of the web page. The hash values may include the tag's level hash value in the DOM tree and the hash value of the tag itself. The level hash value can be calculated from the hierarchical relationship of the DOM tree in which the current tag sits, and the tag's own hash value can be calculated from the attribute nodes that the current tag has. Correspondingly, the web page template generator 30 obtains the web page template by recording the correspondence between the hash value of the tag to which a content area belongs and the chosen content type.
In a specific implementation, the hash values of the tags can be calculated by a server. In this case, the hash value obtainer further obtains the hash values of the tags as follows: add an index attribute to each tag of the web page; send the source code of the web page, with the index attributes added, to the server so that the server can calculate the hash values of the tags; receive the correspondence between tag index values and hash values sent by the server.
It should be noted that the steps of the method in Embodiment 1 can be split up or selected as needed, and so can the modules of the device in Embodiment 1. For example, steps 102 and 104 constitute a method for providing visual annotation for a web page, and the visual effect framework builder 10 and the annotation instruction obtainer 20 constitute a device for providing visual annotation for a web page.
Embodiment 2
This embodiment provides a method and a device for providing visual annotation of a web page.
Fig. 6 shows a flowchart of a method for providing visual annotation of a web page according to
an embodiment of the invention. With reference to Fig. 6, the method includes:
Step 602: constructing a visual-effect framework in which a web page is annotated through a mask
located above the content regions of the web page.
The visual-effect framework may include: content regions, a mask located above the selected
content region, and an annotation menu; the annotation menu includes multiple content type menu
items.
The visual-effect framework of the web page can be built by obtaining the source code of the web
page, such as an html document, attaching a stylesheet file, such as a css file, to the html
document, and adding a js (javascript) script to the html document. Specifically, the js script
can make the mask and the annotation menu appear above a content region when it detects that the
content region is selected, and the display mode of the mask and the annotation menu can be
defined by rules in the stylesheet file.
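The attachment of a stylesheet and a script to the html document can be sketched as plain string
insertion. This is only an illustration under stated assumptions: the function name
`build_annotation_page` and the file names `annotate.css` and `annotate.js` are hypothetical, and
the contents of the script and stylesheet themselves are not shown.

```python
import re


def build_annotation_page(html: str, css_href: str, js_src: str) -> str:
    """Insert a stylesheet link and a script tag into an html document string."""
    link_tag = f'<link rel="stylesheet" href="{css_href}">'
    script_tag = f'<script src="{js_src}"></script>'

    # Put the stylesheet just before </head>, or at the very start if there is no head.
    head_close = re.search(r"</head\s*>", html, re.IGNORECASE)
    if head_close:
        html = html[:head_close.start()] + link_tag + html[head_close.start():]
    else:
        html = link_tag + html

    # Put the script just before the last </body>, or at the very end if there is no body.
    body_closes = list(re.finditer(r"</body\s*>", html, re.IGNORECASE))
    if body_closes:
        position = body_closes[-1].start()
        html = html[:position] + script_tag + html[position:]
    else:
        html = html + script_tag
    return html


page = build_annotation_page(
    "<html><head></head><body><p>x</p></body></html>", "annotate.css", "annotate.js"
)
```

Loading the script at the end of the body is a design choice of this sketch, so that the content
regions already exist when the script starts listening for selection events.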
With this visual-effect framework, when the web page is displayed in a browser, each content
region of the web page can have a visual effect. When a content region is selected (for example,
when the mouse is detected moving over the content region, or when a tap on the content region or
a swipe gesture over it is detected on a touch screen), a mask appears above the content region.
The annotation menu can appear above the content region at the same time, or can appear when
triggered; for example, right-clicking on the selected content region brings up the content type
menu items. As shown in Fig. 2 and Fig. 3, the content type menu items may include "mark as
title", "mark as body text", "mark as date" and so on; in addition, the menu items may also
include "save annotations", "end annotation" and so on.
Step 604: obtaining instructions for annotating each content region of the web page in the mask.
The instruction can be the content type, selected through the annotation menu, that corresponds
to the selected content region. In the embodiments of the invention, annotation is performed at
the client, which can be operated by a user, operations staff or an administrator. The web page
can be annotated with the mouse: move the mouse over a content region, right-click, and then
click a content type menu item to complete the annotation of that content region. On a touch
screen, the content type can likewise be selected by touching a menu item, which annotates the
web page in the same way. As shown in Fig. 2, clicking "mark as title" annotates the
corresponding content region as the title; as shown in Fig. 3, clicking "mark as body text"
annotates the corresponding content region as the body text.
It can be seen that, by building a visual-effect framework, the technical solution of the
embodiments of the invention allows a web page to be annotated visually, which improves
annotation efficiency. Moreover, because the web page content is presented intuitively, the
content types of the page structure are easy to determine, which improves annotation accuracy.
In addition, in the embodiments of the invention, to make it easier to locate and present the
content regions in the web page, a hash value attribute can also be added to the tag to which
each content region belongs. In this case, before the step of constructing the visual-effect
framework for annotating the web page, the method for providing visual annotation of a web page
may further include the following steps:
First, obtaining the source code of the web page and generating the DOM tree of the web page from
the source code;
Then, obtaining the hash value of the tag corresponding to each node in the DOM tree;
Finally, adding a hash value attribute to each tag of the web page.
The hash value may include the tag's level hash value in the DOM tree and the hash value of the
tag itself. The level hash value can be calculated from the hierarchical relationship of the DOM
tree in which the current tag is located, and the hash value of the tag itself can be calculated
from the attribute nodes of the current tag.
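The two hash values can be illustrated with a short sketch. It rests on assumptions that the
source does not confirm: the level hash is taken over the chain of tag names from the root to the
current tag, the tag's own hash is taken over its name and sorted attributes, and MD5 stands in
for the unspecified hash function. The names `digest` and `TagHasher` are hypothetical.

```python
import hashlib
from html.parser import HTMLParser

# Void elements never get an end tag, so they must not stay on the path stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


def digest(text: str) -> str:
    """Assumed hash function; the source does not name one."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


class TagHasher(HTMLParser):
    """Collects (tag, level_hash, self_hash) for every start tag, in document order."""

    def __init__(self):
        super().__init__()
        self.path = []     # tag names from the root down to the innermost open tag
        self.hashes = []

    def handle_starttag(self, tag, attrs):
        # Level hash: depends only on where the tag sits in the hierarchy.
        level_hash = digest("/".join(self.path + [tag]))
        # Self hash: depends on the tag name and its attribute nodes.
        ordered = sorted(attrs, key=lambda pair: pair[0])
        attr_text = "&".join(f"{name}={value}" for name, value in ordered)
        self_hash = digest(f"{tag}|{attr_text}")
        self.hashes.append((tag, level_hash, self_hash))
        if tag not in VOID_TAGS:
            self.path.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; this tolerates unclosed children.
        if tag in self.path:
            while self.path.pop() != tag:
                pass


hasher = TagHasher()
hasher.feed('<html><body><h1 class="t">A</h1><p class="x">B</p><p class="y">C</p></body></html>')
```

In this example the two `p` tags share a level hash because they sit at the same position in the
hierarchy, while their different `class` attributes give them different hashes of their own.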
In a specific implementation, the hash values of the tags may be calculated by the server. As
shown in Figure 10, the server 210 is located in the search engine 200, the search engine 200 is
communicatively connected to multiple (three are shown in the figure) third-party website servers
300, and the server 210 can cooperate with the client 100 to generate web page templates. In this
case, obtaining the hash value of the tag corresponding to each node in the DOM tree may include:
First, the client 100 adds an index attribute to each tag of the web page;
Then, the client 100 sends the source code of the web page, with the index attributes added, to
the server 210;
Next, the server 210 calculates the hash values of the tags;
Finally, the server 210 sends the correspondence between tag index values and hash values to the
client 100.
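The four steps above can be sketched in one process, with two functions standing in for the
client and the server. Everything here is an assumption made for illustration: the attribute
name `data-index`, the function names, the use of MD5, and the regular expression, which is a
simplification that would also touch tag-like text inside scripts or comments. Network transfer
is omitted.

```python
import hashlib
import itertools
import re
from html.parser import HTMLParser


def add_index_attributes(html: str) -> str:
    """Client side: give every start tag a sequential index attribute."""
    counter = itertools.count()
    return re.sub(
        r"<([a-zA-Z][a-zA-Z0-9]*)",
        lambda match: f'<{match.group(1)} data-index="{next(counter)}"',
        html,
    )


class _IndexHashCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.mapping = {}

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        index = attr_map.pop("data-index", None)
        if index is not None:
            # The index itself is excluded so the hash reflects only the original tag.
            attr_text = "&".join(f"{k}={v}" for k, v in sorted(attr_map.items()))
            payload = f"{tag}|{attr_text}"
            self.mapping[int(index)] = hashlib.md5(payload.encode("utf-8")).hexdigest()


def compute_hashes_on_server(indexed_html: str) -> dict:
    """Server side: return the correspondence between tag index values and hash values."""
    collector = _IndexHashCollector()
    collector.feed(indexed_html)
    return collector.mapping


indexed = add_index_attributes('<div id="a"><p>hi</p></div>')
index_to_hash = compute_hashes_on_server(indexed)
```

With the returned mapping, the client can look up each tag by its index and write the matching
hash value into that tag as an attribute.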
When the invention is implemented, the annotation operation at the client may include the
following steps. First, the client installs a plug-in that generates the visual-effect framework
and accesses a web page on a third-party website server 300. Then, in one implementation, the
mouse is moved over a content region of the web page and a light blue mask appears above the
content region, indicating that the content region is selected; a right-click brings up a
selection menu, from which the content type of the content region, such as title or body text,
can be chosen. Repeating these steps completes the annotation of the web page.
The embodiments of the invention also provide a device for providing visual annotation of a web
page. With reference to Fig. 7, the device includes a visual-effect framework builder 10 and an
annotation instruction acquirer 20, wherein:
The visual-effect framework builder 10 is adapted to construct a visual-effect framework in which
a web page is annotated through a mask located above the content regions of the web page. In one
implementation, the visual-effect framework includes: content regions, a mask located above the
selected content region, and an annotation menu; the annotation menu includes multiple content
type menu items. The visual-effect framework builder 10 can build the visual-effect framework of
the web page by obtaining the source code of the web page, such as an html document, attaching a
stylesheet file, such as a css file, to the html document, and adding a js script to the html
document.
The annotation instruction acquirer 20 is adapted to obtain instructions for annotating each
content region of the web page in the mask; each instruction is the content type, selected
through the annotation menu, that corresponds to the selected content region. The web page can be
annotated with a mouse or a touch screen. For example, the mouse is moved over a content region
and right-clicked, and a content type menu item is then clicked to complete the annotation of
that content region; the annotation instruction acquirer 20 can detect the annotation operation
and obtain the content type selected through the right-click menu.
In the embodiments of the invention, to make it easier to locate and present the content regions
in the web page, a hash value attribute can also be added to the tag to which each content region
belongs. The device for providing visual annotation of a web page may therefore further include a
DOM tree generator, a hash value acquirer and a hash value attribute adder. The DOM tree
generator obtains the source code of the web page and generates the DOM tree of the web page from
the source code; the hash value acquirer obtains the hash value of the tag corresponding to each
node in the DOM tree; and the hash value attribute adder adds a hash value attribute to each tag
of the web page. The hash value may include the tag's level hash value in the DOM tree and the
hash value of the tag itself. The level hash value can be calculated from the hierarchical
relationship of the DOM tree in which the current tag is located, and the hash value of the tag
itself can be calculated from the attribute nodes of the current tag.
In a specific implementation, the hash values of the tags may be calculated by the server. In
this case, the hash value acquirer further obtains the hash values of the tags as follows: adding
an index attribute to each tag of the web page; sending the source code of the web page, with the
index attributes added, to the server so that the server calculates the hash values of the tags;
and receiving the correspondence between tag index values and hash values sent by the server.
Embodiment 3
This embodiment provides a method and a device for extracting web page content according to a
visual template.
Figure 10 shows a structural diagram of a system for extracting web page content according to a
visual template according to an embodiment of the invention. With reference to Figure 10, the
system includes a client 100, a search engine 200 and multiple (three are shown in the figure)
third-party website servers 300. The search engine 200 includes a server 210 and is
communicatively connected to the third-party website servers 300. The server 210 can cooperate
with the client 100 to generate web page templates, and the search engine 200 can extract web
page content according to a web page template, that is, extract the structured content of web
pages on the third-party website servers 300 according to the web page template.
Fig. 8 shows a flowchart of a method for extracting web page content according to a visual
template according to an embodiment of the invention. With reference to Fig. 8, the method
includes:
Step 802: when performing a targeted crawl of a target website, searching the web page template
library for a recorded web page template that corresponds to the target website and was generated
from visual annotation.
The web page template library stores web page templates generated from visual annotation. A web
page template can be one generated according to the solution of Embodiment 1 or Embodiment 2. The
library stores multiple web page templates, and each web page template can be identified by the
homepage URL of a website, so the library can be searched for a corresponding web page template
by the homepage URL of the target website.
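The lookup by homepage URL can be sketched as a dictionary keyed by a normalized homepage
address. The class `TemplateLibrary`, the helper `homepage_url` and the way a homepage URL is
derived from an arbitrary page URL (scheme plus host) are assumptions of this sketch, not details
given in the source.

```python
from typing import Dict, Optional
from urllib.parse import urlsplit


def homepage_url(url: str) -> str:
    """Reduce any URL of a site to an assumed homepage form: scheme://host/."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc.lower()}/"


class TemplateLibrary:
    """Hypothetical template library: one template per website homepage URL."""

    def __init__(self) -> None:
        self._templates: Dict[str, dict] = {}

    def save(self, site_url: str, template: dict) -> None:
        self._templates[homepage_url(site_url)] = template

    def find(self, target_url: str) -> Optional[dict]:
        # Returns None when no template has been recorded for the target website.
        return self._templates.get(homepage_url(target_url))


library = TemplateLibrary()
library.save("http://news.example.com/", {"a1b2": "title"})
found = library.find("http://news.example.com/2013/11/25/story.html")
missing = library.find("http://other.example.org/")
```

A result of `None` corresponds to the case in which step 804 does not apply, because no template
generated from visual annotation exists for the target website.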
Step 804: if the web page template library records a web page template that corresponds to the
target website and was generated from visual annotation, extracting content from the target
website according to the web page template.
The URLs of all links on the homepage can be extracted according to the homepage URL, the links
that lead to other websites removed, and the remaining URLs put into a scheduling queue; content
is then extracted, according to the web page template, from the web page corresponding to each
URL in the scheduling queue. The content extraction can be performed by a web page fetcher, which
may be a web spider, a web crawler, a search robot, a web crawling script or the like.
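The link filtering and queueing described above can be sketched as follows. The function name
`build_scheduling_queue` is hypothetical, comparing host names is only one possible way to decide
that a link leads to another website, and the removal of duplicate URLs is an addition of this
sketch that the source does not mention. The homepage html is passed in as a string, so no
network access is involved.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class _LinkCollector(HTMLParser):
    """Collects the href of every anchor tag in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def build_scheduling_queue(homepage: str, homepage_html: str) -> deque:
    collector = _LinkCollector()
    collector.feed(homepage_html)
    site_host = urlsplit(homepage).netloc.lower()
    queue = deque()
    seen = set()
    for href in collector.hrefs:
        absolute = urljoin(homepage, href)   # resolve relative links
        if urlsplit(absolute).netloc.lower() != site_host:
            continue                         # link leads to another website
        if absolute not in seen:
            seen.add(absolute)
            queue.append(absolute)
    return queue


sample_html = (
    '<a href="/a.html">A</a>'
    '<a href="http://other.example.org/x">X</a>'
    '<a href="b.html">B</a>'
    '<a href="/a.html">A again</a>'
)
queue = build_scheduling_queue("http://example.com/", sample_html)
```

Each URL taken from the queue would then be fetched and passed to the template-based extraction
step.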
In the technical solution of the embodiments of the invention, web page content is extracted with
a web page template generated from visual annotation. Because such a template is more accurate,
the content extracted according to it is more accurate as well.
Fig. 9 shows a structural diagram of a device for extracting web page content according to a
visual template according to an embodiment of the invention. With reference to Fig. 9, the device
includes a web page template library 902, a searcher 904 and a content extractor 906, wherein:
The web page template library 902 is adapted to store web page templates generated from visual
annotation; a web page template can be identified by the URL of a web page, or by the homepage
URL of a website.
The searcher 904 is adapted to search, when a targeted crawl of a target website is performed,
the web page template library for a recorded web page template that corresponds to the target
website and was generated from visual annotation.
The content extractor 906 is adapted to extract content from the target website according to the
web page template when the web page template library records a web page template that
corresponds to the target website and was generated from visual annotation. The content extractor
906 can extract the URLs of all links on the homepage according to the homepage URL, remove the
links that lead to other websites, and put the remaining URLs into a scheduling queue; it then
extracts content, according to the web page template, from the web page corresponding to each URL
in the scheduling queue. The content extractor 906 can be a web spider, a web crawler, a search
robot, a web crawling script or the like.
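Extraction according to a template can be sketched as below, assuming the template maps tag hash
values to content types. The hash scheme (MD5 over the tag path and sorted attributes), the
function `tag_hash` and the class `TemplateExtractor` are all hypothetical; the only requirement
is that extraction uses the same hash scheme that was used during annotation.

```python
import hashlib
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


def tag_hash(path, attrs) -> str:
    """Assumed hash: tag path from the root plus the sorted attributes."""
    ordered = sorted(attrs, key=lambda pair: pair[0])
    key = "/".join(path) + "|" + "&".join(f"{k}={v}" for k, v in ordered)
    return hashlib.md5(key.encode("utf-8")).hexdigest()


class TemplateExtractor(HTMLParser):
    """Collects the text of every tag whose hash appears in the template."""

    def __init__(self, template: dict):
        super().__init__()
        self.template = template
        self.stack = []    # (tag name, content type or None) for each open tag
        self.result = {}

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return         # void tags hold no text and never close
        path = [name for name, _ in self.stack] + [tag]
        self.stack.append((tag, self.template.get(tag_hash(path, attrs))))

    def handle_endtag(self, tag):
        if any(name == tag for name, _ in self.stack):
            while self.stack.pop()[0] != tag:
                pass

    def handle_data(self, data):
        # Text belongs to every enclosing region that the template annotates.
        for _, content_type in self.stack:
            if content_type:
                self.result[content_type] = self.result.get(content_type, "") + data


page_html = '<html><body><h1 class="t">Hello</h1><div id="c">World</div></body></html>'
template = {
    tag_hash(["html", "body", "h1"], [("class", "t")]): "title",
    tag_hash(["html", "body", "div"], [("id", "c")]): "body",
}
extractor = TemplateExtractor(template)
extractor.feed(page_html)
```

After parsing, `extractor.result` holds one entry per annotated content type, which is the
structured content that the search engine would keep for the page.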
Optionally, the device also includes the means for generating web page templates, that is, it can
include the visual-effect framework builder 10, the annotation instruction acquirer 20 and the
web page template generator 30 of Embodiment 1. For the connections between these modules and
their operating principles, see the description in Embodiment 1.
In summary, the technical solution of the embodiments of the invention constructs a visual-effect
framework for annotating a web page, so the text of a web page template does not have to be
edited by hand: selecting content regions of the web page in the visual-effect framework and
operating on them visually is enough to annotate the web page content. This improves annotation
efficiency and, in turn, the efficiency of generating web page templates. Moreover, because the
web page content is presented intuitively, the content types of the page structure can easily be
determined without professional knowledge of web page design, which improves annotation accuracy
and, in turn, the accuracy of the generated web page templates. Further, because the web page
templates are more accurate, the content obtained by crawling web pages according to them is more
accurate as well.
The algorithms and displays provided herein are not inherently related to any particular
computer, virtual system or other device. Various general-purpose systems can also be used with
the teachings herein. From the description above, the structure required to construct such a
system is apparent. Moreover, the invention is not directed to any particular programming
language. It should be understood that various programming languages can be used to implement the
content of the invention described herein, and that the description of a specific language above
is given to disclose the best mode of the invention.
In the description provided herein, numerous specific details are set forth. However, it can be
understood that embodiments of the invention can be practiced without these specific details. In
some instances, well-known methods, structures and techniques are not shown in detail so as not
to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid the
understanding of one or more of the various inventive aspects, the features of the invention are
sometimes grouped together in a single embodiment, figure, or description thereof in the above
description of exemplary embodiments of the invention. However, the disclosed method should not
be interpreted as reflecting an intention that the claimed invention requires more features than
are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects
lie in fewer than all features of a single embodiment disclosed above. Therefore, the claims
following the detailed description are hereby expressly incorporated into this detailed
description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will understand that the modules in the device of an embodiment can be
adaptively changed and arranged in one or more devices different from that embodiment. The
modules or units or components of the embodiments can be combined into one module or unit or
component, and they can furthermore be divided into multiple sub-modules or sub-units or
sub-components. Except where at least some of such features and/or processes or units are
mutually exclusive, all features disclosed in this specification (including the accompanying
claims, abstract and drawings), and all processes or units of any method or device so disclosed,
can be combined in any combination. Unless expressly stated otherwise, each feature disclosed in
this specification (including the accompanying claims, abstract and drawings) can be replaced by
an alternative feature serving the same, equivalent or similar purpose.
Furthermore, those skilled in the art will understand that, although some embodiments described
herein include certain features included in other embodiments and not others, combinations of
features of different embodiments are meant to be within the scope of the invention and to form
different embodiments. For example, in the following claims, any of the claimed embodiments can
be used in any combination.
The various component embodiments of the invention can be implemented in hardware, or in software
modules running on one or more processors, or in a combination thereof. Those skilled in the art
will understand that a microprocessor or a digital signal processor (DSP) can be used in practice
to implement some or all of the functions of some or all of the components in the device for
extracting web page content according to a visual template according to the embodiments of the
invention. The invention can also be implemented as a device or apparatus program (for example, a
computer program and a computer program product) for performing part or all of the method
described herein. Such a program implementing the invention can be stored on a computer-readable
medium, or can take the form of one or more signals. Such signals can be downloaded from an
Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and
that those skilled in the art can design alternative embodiments without departing from the scope
of the appended claims. In the claims, any reference signs placed between parentheses shall not
be construed as limiting the claim. The word "comprising" does not exclude the presence of
elements or steps not listed in a claim. The word "a" or "an" preceding an element does not
exclude the presence of a plurality of such elements. The invention can be implemented by means
of hardware comprising several distinct elements and by means of a suitably programmed computer.
In a unit claim enumerating several devices, several of these devices can be embodied by one and
the same item of hardware. The use of the words first, second and third does not indicate any
order; these words can be interpreted as names.
Claims (8)
1. A method for extracting web page content according to a visual template, comprising:
when performing a targeted crawl of a target website, searching a web page template library for a
recorded web page template that corresponds to the target website and was generated from visual
annotation;
if the web page template library records a web page template that corresponds to the target
website and was generated from visual annotation, extracting content from the target website
according to the web page template, wherein the process of generating the web page template
comprises:
obtaining the html document of a web page, and inserting a stylesheet file and a js script into
the html document, wherein the js script makes a mask and an annotation menu appear above a
content region when it detects that the content region is selected, and the display mode of the
mask and the annotation menu is defined by rules in the stylesheet file, thereby building a
visual-effect framework for annotating the web page;
obtaining instructions for annotating each content region of the web page, each instruction being
the content type, selected through the annotation menu, that corresponds to the selected content
region;
recording the correspondence between content regions and content types to obtain the web page
template.
2. The method as claimed in claim 1, wherein the web page template is identified by the homepage
URL of the website.
3. The method as claimed in claim 1 or 2, wherein extracting content from the target website
according to the web page template comprises:
extracting the URLs of all links on the homepage according to the homepage URL, removing the
links that lead to other websites, and putting the remaining URLs into a scheduling queue;
extracting content, according to the web page template, from the web page corresponding to each
URL in the scheduling queue.
4. The method as claimed in claim 1, further comprising:
compiling statistics over multiple web page templates generated from multiple pages under the
same resource website, and extracting the parts that the multiple web page templates have in
common to generate the final web page template.
5. A device for extracting web page content according to a visual template, comprising:
a web page template library, adapted to store web page templates generated from visual
annotation;
a searcher, adapted to search, when a targeted crawl of a target website is performed, the web
page template library for a recorded web page template that corresponds to the target website and
was generated from visual annotation;
a content extractor, adapted to extract content from the target website according to the web page
template when the web page template library records a web page template that corresponds to the
target website and was generated from visual annotation;
a visual-effect framework builder, adapted to obtain the html document of a web page and insert a
stylesheet file and a js script into the html document, wherein the js script makes a mask and an
annotation menu appear above a content region when it detects that the content region is
selected, and the display mode of the mask and the annotation menu is defined by rules in the
stylesheet file, thereby building a visual-effect framework for annotating the web page;
an annotation instruction acquirer, adapted to obtain instructions for annotating each content
region of the web page, each instruction being the content type, selected through the annotation
menu, that corresponds to the selected content region;
a web page template generator, adapted to record the correspondence between content regions and
content types to obtain the web page template.
6. The device as claimed in claim 5, wherein the web page template is identified by the homepage
URL of the website.
7. The device as claimed in claim 5 or 6, wherein the content extractor is further adapted to:
extract the URLs of all links on the homepage according to the homepage URL, remove the links
that lead to other websites, and put the remaining URLs into a scheduling queue;
extract content, according to the web page template, from the web page corresponding to each URL
in the scheduling queue.
8. The device as claimed in claim 5, further comprising:
a statistics unit, adapted to compile statistics over multiple web page templates generated from
multiple pages under the same resource website, and to extract the parts that the multiple web
page templates have in common to generate the final web page template.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310606505.8A CN103678511B (en) | 2013-11-25 | 2013-11-25 | The method and device of webpage content extraction is carried out according to visual template |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310606505.8A CN103678511B (en) | 2013-11-25 | 2013-11-25 | The method and device of webpage content extraction is carried out according to visual template |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103678511A CN103678511A (en) | 2014-03-26 |
CN103678511B true CN103678511B (en) | 2016-11-16 |
Family
ID=50316056
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310606505.8A Active CN103678511B (en) | 2013-11-25 | 2013-11-25 | The method and device of webpage content extraction is carried out according to visual template |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103678511B (en) |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105100904A (en) * | 2014-05-09 | 2015-11-25 | 深圳市快播科技有限公司 | Video advertisement blocking method, device and browser |
CN104657340B (en) * | 2015-02-10 | 2018-09-11 | 上海创景信息科技有限公司 | Expansible Word report preparing systems and method based on script |
CN105989167B (en) * | 2015-03-04 | 2019-11-08 | 北大方正集团有限公司 | Collecting method and device based on news client |
CN105095416B (en) * | 2015-07-13 | 2018-12-07 | 北京奇虎科技有限公司 | A kind of method and apparatus realizing content in the search and promoting |
CN105574084A (en) * | 2015-12-10 | 2016-05-11 | 天津海量信息技术有限公司 | Extraction method of case information in webpage |
CN107085578B (en) * | 2016-02-16 | 2020-05-12 | 腾讯科技(深圳)有限公司 | Webpage editing method and device |
US9871911B1 (en) * | 2016-09-30 | 2018-01-16 | Microsoft Technology Licensing, Llc | Visualizations for interactions with external computing logic |
CN110020296A (en) * | 2017-10-31 | 2019-07-16 | 北京国双科技有限公司 | A kind of method and device for extracting news web page text |
CN108549678B (en) * | 2018-04-02 | 2020-06-19 | 北京今朝在线科技有限公司 | Information acquisition system |
CN110321177B (en) * | 2019-06-18 | 2022-06-03 | 北京奇艺世纪科技有限公司 | Mobile application localized loading method and device and electronic equipment |
CN110837614A (en) * | 2019-11-05 | 2020-02-25 | 上海嘉道信息技术有限公司 | Method and system for efficiently generating webpage information extraction rule |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1777632A2 (en) * | 2005-10-20 | 2007-04-25 | Intro Mobile Co., Ltd. | Method and server for extracting content based on RSS |
CN101192234A (en) * | 2007-06-07 | 2008-06-04 | 腾讯科技(深圳)有限公司 | Searching system and method based on web page extraction |
CN101727498A (en) * | 2010-01-15 | 2010-06-09 | 西安交通大学 | Automatic extraction method of web page information based on WEB structure |
CN102890681A (en) * | 2011-07-20 | 2013-01-23 | 阿里巴巴集团控股有限公司 | Method and system for generating webpage structure template |
-
2013
- 2013-11-25 CN CN201310606505.8A patent/CN103678511B/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN103678511A (en) | 2014-03-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103678511B (en) | The method and device of webpage content extraction is carried out according to visual template | |
CN103678509B (en) | Generate the method and device of web page template | |
US11294968B2 (en) | Combining website characteristics in an automatically generated website | |
EP2987088B1 (en) | Client side page processing | |
CN110069683B (en) | Method and device for crawling data based on browser | |
US10657323B2 (en) | Method of preparing documents in markup languages | |
CN104077387B (en) | A kind of web page contents display methods and browser device | |
CN103678510B (en) | The method and device of visualization mark is provided webpage | |
CA2817554A1 (en) | Mobile content management system | |
TW201250492A (en) | Method and system of extracting web page information | |
CN103034518B (en) | The method and browser of loading browser control instrument | |
CN110309386B (en) | Method and device for crawling web page | |
CN108595697A (en) | Webpage integrated approach, apparatus and system | |
CN106874502A (en) | A kind of method of video search, device and terminal | |
CN102902784B (en) | Web page classification storage system and method | |
WO2019000894A1 (en) | Method and device for generating article outline | |
TWI570579B (en) | An information retrieving method utilizing webpage visual features and webpage language features and a system using thereof | |
CN110413765A (en) | A kind of interactive system and its method of mass data set analysis and displaying | |
CN113051333B (en) | Data processing method and device, electronic equipment and storage medium | |
JP5380874B2 (en) | Information retrieval method, program and apparatus | |
CN110147477B (en) | Data resource modeling extraction method, device and equipment of Web system | |
Fung et al. | Discover information and knowledge from websites using an integrated summarization and visualization framework | |
CN103246662A (en) | Processing method and device of area data contents in web pages | |
CN105138701B (en) | Index page method for extracting content and device, search engine | |
Su et al. | KaitoroCap: A document navigation capture and visualisation tool |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right |
Effective date of registration: 20220725 Address after: Room 801, 8th floor, No. 104, floors 1-19, building 2, yard 6, Jiuxianqiao Road, Chaoyang District, Beijing 100015 Patentee after: BEIJING QIHOO TECHNOLOGY Co.,Ltd. Address before: 100088 room 112, block D, 28 new street, new street, Xicheng District, Beijing (Desheng Park) Patentee before: BEIJING QIHOO TECHNOLOGY Co.,Ltd. Patentee before: Qizhi software (Beijing) Co.,Ltd. |
|
TR01 | Transfer of patent right |