CN103678510B - Method and device for providing visual annotation of web pages - Google Patents
- Publication number
- CN103678510B CN201310606202.6A
- Authority
- CN
- China
- Prior art keywords
- webpage
- cryptographic hash
- label
- mark
- web page
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/166—Editing, e.g. inserting or deleting
- G06F40/169—Annotation, e.g. comment data or footnotes
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The present invention discloses a method and device for providing visual annotation of web pages, belonging to the field of Internet technology. The method includes: building a visual-effect framework for annotating a web page by means of a mask located above the content regions of the web page; and obtaining, within the mask, an indication annotating each content region of the web page. The invention can improve the efficiency and accuracy of annotating web pages.
Description
Technical field
The present invention relates to the field of Internet technology, and in particular to a method and device for providing visual annotation of web pages.
Background art
A web page template can be used to extract the content of a web page. For example, some search engines use targeted crawling when fetching websites: a targeted-crawling spider uses a web page template to extract the relevant content of a website and obtains formatted content, including the page's title, author, publication time, body text, and other information.
One existing method of generating a web page template is as follows. First, the source code of the page is downloaded according to the page's URL (Uniform Resource Locator). Second, the page structure is analyzed automatically from the source code, and a hash value is calculated for each structure in the page. Then, an operator judges manually from the source code which structure corresponds to the title, which to the body text, which to the publication time, and so on, and marks them. Finally, the correspondence between each structure's hash value and its content type is generated, yielding the web page template.
The existing method of generating web page templates has at least the following disadvantages:
The content type of each page structure is marked manually by text editing. A web page template contains a great deal of irrelevant content, and some templates run to tens of thousands of lines, so manual marking is very inefficient;
The various contents of a web page template are mixed into the page code. Because the page content is not displayed intuitively, someone unfamiliar with web design languages cannot easily determine the content type of a page structure and is prone to errors when marking manually. The generated template is therefore not very accurate, and neither is content extraction performed with it.
Summary of the invention
In view of the above problems, the present invention is proposed to provide a method and device for providing visual annotation of web pages that overcome the above problems or at least partially solve them.
According to one aspect of the present invention, a method for providing visual annotation of web pages is provided, the method including:
building a visual-effect framework for annotating a web page by means of a mask located above the content regions of the web page;
obtaining, within the mask, an indication annotating each content region of the web page.
Optionally, the visual-effect framework includes content regions, a mask located above the selected content region, and an annotation menu. The annotation menu includes menu items for multiple content types, and the indication is the content type, selected through the annotation menu, that corresponds to the selected content region.
Optionally, before the visual-effect framework for annotating the web page is built, the method further includes:
obtaining the source code of the web page and generating the DOM tree of the web page from the source code;
obtaining the hash value of the tag corresponding to each node in the DOM tree;
adding a hash value attribute to each tag of the web page, wherein the hash value is used to locate and display the content regions in the web page.
Alternatively, the cryptographic Hash includes:The Hash of level cryptographic Hash and label itself of the label in the dom tree
Value.
Alternatively, the cryptographic Hash for obtaining label corresponding to each node in the dom tree, including:
Index attributes are added for each label of the webpage;
The source code of webpage after addition index attributes is sent to service end, so that service end enters the cryptographic Hash of row label
Calculate;
Receive the tab indexes value of service end transmission and the corresponding relation of cryptographic Hash.
According to another aspect of the present invention, a device for providing visual annotation of web pages is provided, the device including:
a visual-effect framework builder, adapted to build a visual-effect framework for annotating a web page by means of a mask located above the content regions of the web page;
an annotation indication acquirer, adapted to obtain, within the mask, an indication annotating each content region of the web page.
Optionally, the visual-effect framework includes content regions, a mask located above the selected content region, and an annotation menu. The annotation menu includes menu items for multiple content types, and the indication is the content type, selected through the annotation menu, that corresponds to the selected content region.
Optionally, the device further includes:
a DOM tree generator, adapted to obtain the source code of the web page and generate the DOM tree of the web page from the source code;
a hash value acquirer, adapted to obtain the hash value of the tag corresponding to each node in the DOM tree;
a hash value attribute adder, adapted to add a hash value attribute to each tag of the web page, wherein the hash value is used to locate and display the content regions in the web page.
Alternatively, the cryptographic Hash includes:The Hash of level cryptographic Hash and label itself of the label in the dom tree
Value.
Alternatively, the cryptographic Hash getter is further adapted for:
Index attributes are added for each label of the webpage;
The source code of webpage after addition index attributes is sent to service end, so that service end enters the cryptographic Hash of row label
Calculate;
Receive the tab indexes value of service end transmission and the corresponding relation of cryptographic Hash.
According to one or more of the above technical solutions of the present invention, building the visual-effect framework makes it possible to annotate a web page visually, which improves annotation efficiency. Moreover, because the page content is displayed intuitively, the content type of each page structure is easy to determine, which improves annotation accuracy.
The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and practiced according to the content of the specification, and so that the above and other objects, features, and advantages of the invention become more apparent, specific embodiments of the invention are set out below.
Brief description of the drawings
By reading the following detailed description of the preferred embodiments, various other advantages and benefits will become clear to those of ordinary skill in the art. The drawings serve only to illustrate the preferred embodiments and are not to be considered a limitation of the invention. Throughout the drawings, the same reference numerals denote the same parts. In the drawings:
Fig. 1 shows a flowchart of a method of generating a web page template according to an embodiment of the invention;
Fig. 2 shows a schematic diagram of annotating the title of a web page in an embodiment of the invention;
Fig. 3 shows a schematic diagram of annotating the body text of a web page in an embodiment of the invention;
Fig. 4 shows a detailed flowchart of a method of generating a web page template according to an embodiment of the invention;
Fig. 5 shows a structural diagram of a device for generating a web page template according to an embodiment of the invention;
Fig. 6 shows a flowchart of a method for providing visual annotation of web pages according to an embodiment of the invention;
Fig. 7 shows a structural diagram of a device for providing visual annotation of web pages according to an embodiment of the invention;
Fig. 8 shows a flowchart of a method of extracting web page content according to a visual template, according to an embodiment of the invention;
Fig. 9 shows a structural diagram of a device for extracting web page content according to a visual template, according to an embodiment of the invention;
Fig. 10 shows a structural diagram of a system for extracting web page content according to a visual template, according to an embodiment of the invention.
Detailed description of the embodiments
Exemplary embodiments of the present disclosure are described in more detail below with reference to the drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and so that its scope can be conveyed completely to those skilled in the art.
Embodiment 1
This embodiment provides a method and device for generating a web page template.
Fig. 1 shows a flowchart of a method of generating a web page template according to an embodiment of the invention. Referring to Fig. 1, the method includes:
Step 102: build a visual-effect framework for annotating the web page;
In one implementation, the visual-effect framework may include content regions, a mask located above the selected content region, and an annotation menu; the annotation menu includes menu items for multiple content types.
The visual-effect framework of the web page can be built by obtaining the source code of the web page, such as an html (hypertext markup language) document, attaching a style sheet file, such as a css (cascading style sheets) file, to the html document, and adding a js (javascript) script to the html document. Specifically, the js script can make a mask and an annotation menu appear above a content region when that region is detected as selected, and the display style of the mask and the annotation menu can be defined by the rules in the style sheet file.
With this visual-effect framework, each content region of the web page has a visual effect when the page is displayed in a browser. When a content region is selected (for example, when the mouse is detected moving over the region, or when a tap on the region or a swipe gesture within the region is detected on a touch screen), a mask appears above that region. The annotation menu can appear above the region at the same time, or can appear in response to a trigger; for example, right-clicking the selected content region brings up the content type menu items. As shown in Fig. 2 and Fig. 3, the content type menu items may include "annotate as title", "annotate as body text", "annotate as date", and so on; they may also include "save annotation", "end annotation", and so on.
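The framework construction described for step 102 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patent's implementation: the function name `build_annotation_page`, the class name `annot-mask`, and the style sheet and script text are all hypothetical, and the script is a placeholder for the selection logic.

```python
# Hypothetical CSS rule: a light blue translucent mask for the selected region.
MASK_CSS = ".annot-mask { background: rgba(173, 216, 230, 0.5); }"

# Placeholder for the js script that would show the mask and annotation menu.
MASK_JS = "/* selection and menu logic would go here */"


def build_annotation_page(html: str, css: str = MASK_CSS, js: str = MASK_JS) -> str:
    """Attach a style sheet and a script to an html document (illustrative only)."""
    style_tag = f"<style>{css}</style>"
    script_tag = f"<script>{js}</script>"
    # Put the style sheet at the end of <head> when there is one.
    if "</head>" in html:
        html = html.replace("</head>", style_tag + "</head>", 1)
    else:
        html = style_tag + html
    # Put the script at the end of <body> when there is one.
    if "</body>" in html:
        html = html.replace("</body>", script_tag + "</body>", 1)
    else:
        html = html + script_tag
    return html


page = build_annotation_page(
    "<html><head></head><body><div>content</div></body></html>"
)
```

The original page content is left untouched; only the style rules and the script are added, which matches the idea that the mask and menu are layered over the existing content regions.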
Step 104: obtain an indication annotating each content region of the web page;
In the embodiment of the invention, annotation is performed at the client, which may be operated by a user, an operator, or an administrator. The web page can be annotated with a mouse: move the mouse over a content region, right-click, and then click a content type menu item to complete the annotation of that region. On a touch screen, the content type can be selected by a touch operation on a menu item. As shown in Fig. 2, clicking "annotate as title" annotates the corresponding content region as the title; as shown in Fig. 3, clicking "annotate as body text" annotates the corresponding content region as the body text.
Step 106: record the correspondence between content regions and annotation indications to obtain the web page template.
Each time a content region is annotated and the "save annotation" menu item is then selected, the correspondence between that content region and the selected content type is stored in the web page template. Selecting the "end annotation" menu item completes the annotation of all content regions of the page that need annotating and yields the web page template (also called the web page content template) corresponding to the page.
It can be seen that, with the technical solution of the embodiment of the invention, a web page template can be defined simply by selecting content regions in the visual-effect framework and operating on them visually, which improves the efficiency of generating web page templates. Moreover, because the page content is displayed intuitively, the content type of each page structure is easy to determine, which improves the accuracy of the generated template.
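The recording behaviour of step 106 can be pictured with a small sketch. The class `TemplateRecorder`, its method names, and the region identifiers are hypothetical; the text only states that a correspondence is stored when the save menu item is chosen and that the template is obtained when annotation is ended.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class TemplateRecorder:
    """Collects (content region -> content type) pairs for one web page."""

    pending: Optional[Tuple[str, str]] = None   # chosen but not yet saved
    template: Dict[str, str] = field(default_factory=dict)
    finished: bool = False

    def annotate(self, region_id: str, content_type: str) -> None:
        # Corresponds to choosing a content type for the selected region.
        self.pending = (region_id, content_type)

    def save(self) -> None:
        # Corresponds to the save menu item: store the pending pair.
        if self.pending is not None:
            region_id, content_type = self.pending
            self.template[region_id] = content_type
            self.pending = None

    def end(self) -> Dict[str, str]:
        # Corresponds to the end menu item: the template is complete.
        self.finished = True
        return dict(self.template)


recorder = TemplateRecorder()
recorder.annotate("div#headline", "title")
recorder.save()
recorder.annotate("div#article", "body text")
recorder.save()
recorder.annotate("div#footer", "date")   # never saved, so not recorded
web_page_template = recorder.end()
```

In this sketch an annotation that is chosen but never saved does not reach the template, which is one reading of the two separate menu items.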
The above scheme generates, from one web page, the web page template corresponding to that page. A resource website may contain many web pages, and these pages are usually generated from the same web design template, so their structures are essentially the same and differ only slightly. For example, some pages may contain comments while others do not, but all of them contain a title, author, publication time, body text, and so on. If the above steps were carried out for every page to generate a template, the workload would still be large.
Therefore, to further improve the efficiency of template generation, the method may also include: collecting statistics on multiple web page templates generated from multiple pages of the same resource website, and extracting the parts common to those templates to generate the final web page template. Specifically, all the pages of the resource website are sampled to obtain multiple pages; multiple web page templates are then generated by the above method; finally, the parts common to those templates are extracted to generate the final web page template (also called the web page template of the resource website). Each correspondence between a content region and a content type counts as one part of a web page template.
For example, for the 360 website, the html document of the home page can first be obtained from the home page URL (http://www.360.cn/). Analysis of the html document then finds that the website contains multiple sub-pages (say 1000), so 50 sub-pages are drawn from these 1000 according to a predetermined algorithm (for example, a random algorithm). Visual annotation of these 50 sub-pages produces 50 web page templates. Finally, the parts common to these 50 templates are extracted to generate the web page template for the 360 website.
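The sampling and merging just described can be sketched as follows. The function names and the tiny example templates are hypothetical, and random sampling is only one possible choice for the "predetermined algorithm"; the merge keeps a correspondence only when every sampled template contains it.

```python
import random


def sample_pages(urls, k, seed=None):
    """Pick k sub-pages at random (one possible 'predetermined algorithm')."""
    rng = random.Random(seed)
    return rng.sample(list(urls), k)


def merge_templates(templates):
    """Keep only the (region, content type) entries shared by every template."""
    if not templates:
        return {}
    common = set(templates[0].items())
    for template in templates[1:]:
        common &= set(template.items())
    return dict(common)


# Hypothetical templates: some pages have a comment region, some do not.
templates = [
    {"h1": "title", "a1": "author", "b1": "body text", "c1": "comment"},
    {"h1": "title", "a1": "author", "b1": "body text"},
    {"h1": "title", "a1": "author", "b1": "body text", "c1": "comment"},
]
final_template = merge_templates(templates)
chosen = sample_pages(range(1000), 50, seed=0)
```

Here the comment entry drops out of the final template because one of the sampled pages lacks it, which mirrors the example of pages with and without comments.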
In addition, to make it easier to locate and display the content regions in the web page, the embodiment of the invention can also add a hash value attribute to the tag to which each content region belongs. Accordingly, what the web page template stores is the correspondence between the hash value of the tag to which a content region belongs and the selected content type. In this case, the method of generating a web page template may also include the following steps before the step of building the visual-effect framework for annotating the web page:
First, obtain the source code of the web page and generate the DOM (Document Object Model) tree of the page from the source code;
Then, obtain the hash value of the tag corresponding to each node in the DOM tree;
Finally, add a hash value attribute to each tag of the web page.
The hash value may include the hierarchy hash value of the tag in the DOM tree and the hash value of the tag itself. The hierarchy hash value can be calculated from the hierarchical relationship of the DOM tree in which the current tag sits, and the hash value of the tag itself can be calculated from the attribute nodes that the current tag has.
In a specific implementation, the hash values of the tags can be calculated by a server. As shown in Fig. 10, the server 210 is located in the search engine 200, the search engine 200 is communicatively connected to multiple third-party website servers 300 (three are shown in the figure), and the server 210 can cooperate with the client 100 to generate web page templates. In this case, obtaining the hash value of the tag corresponding to each node in the DOM tree may include:
First, the client 100 adds an index attribute to each tag of the web page;
Then, the client 100 sends the source code of the web page with the index attributes added to the server 210;
Next, the server 210 calculates the hash values of the tags;
Finally, the server 210 sends the correspondence between tag index values and hash values to the client 100.
When the invention is implemented, the operation of the client may include the following steps:
First, a visual-effect framework generation plug-in is installed on the client, and a web page on a third-party website server 300 is accessed;
Then, in one implementation, when the mouse moves over a content region of the page, a light blue mask appears above the region to indicate that it is selected; a right-click brings up a selection menu, from which the content type of the region, such as title or body text, can be chosen;
Finally, after annotation is complete, the client generates the web page template.
The client can send the generated web page template to the server, and the server can use the template to collect information when it performs targeted crawling of web page content.
A detailed flow of the method of generating a web page template according to an embodiment of the invention is given below. Referring to Fig. 4, the method includes:
Step 402: the client obtains the source code of the web page and generates the DOM tree of the page from the source code;
Step 404: the client adds an index attribute to each tag of the DOM tree, where the DOM tree can be traversed with a depth-first algorithm;
Step 406: the client sends the source code of the web page with the index attributes added to the server. The content sent is, for example:
Step 408: the server receives the source code with the index attributes from the client, analyzes it, calculates the hash values of the tags for the structure of the whole page, and returns all the calculated hash values to the client;
The hash values calculated by the server correspond to the indexes of the tags and can be packaged in JSON format and returned to the client. The JSON content format is, for example: {tag index value: {hash 1: hash1, hash 2: hash2}...}.
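A server-side calculation of the kind described in step 408 might look like the sketch below. It is an assumption-laden illustration: the class `HashCollector`, the function `compute_hashes`, the JSON key names, and the use of md5 hexadecimal digests are hypothetical choices, and the parser handles only simple, well-formed markup.

```python
import hashlib
import json
from html.parser import HTMLParser

# Tags that never have a closing tag and so never become a parent.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


def md5_hex(text):
    return hashlib.md5(text.encode("utf-8")).hexdigest()


class HashCollector(HTMLParser):
    """Walks tags in document order and hashes each tag that has an index."""

    def __init__(self):
        super().__init__()
        self.path = []      # names of the currently open tags
        self.result = {}    # index value -> two hash values

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        index = attr_map.pop("index", None)
        # Hierarchy string, e.g. "html body div".
        frame = " ".join(self.path + [tag])
        # Attribute string, e.g. "class:name id:author".
        own = " ".join(f"{k}:{v}" for k, v in attr_map.items())
        if index is not None:
            self.result[index] = {
                "frame_hash": md5_hex(frame),
                "self_hash": md5_hex(own),
            }
        if tag not in VOID_TAGS:
            self.path.append(tag)

    def handle_endtag(self, tag):
        # Close the most recent matching open tag, if any.
        if tag in self.path:
            while self.path.pop() != tag:
                pass


def compute_hashes(indexed_source):
    """Return a JSON string mapping each tag index to its hash values."""
    collector = HashCollector()
    collector.feed(indexed_source)
    collector.close()
    return json.dumps(collector.result)


source = (
    '<html index="0"><body index="1">'
    '<div index="2" class="name" id="author">x</div>'
    "</body></html>"
)
reply = json.loads(compute_hashes(source))
```

The client could then look up each tag by its index value in `reply` and copy the two hash values onto the tag as attributes.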
Step 410: the client receives the JSON data returned by the server and, using the correspondence between tag index values and hash values, adds two attribute values to each corresponding tag: frame_hash, the hierarchy hash value of the tag in the DOM tree, and self_hash, the hash value of the tag itself;
For example, a div tag with the hash value attributes added looks like this:
<div frame_hash="46131321231613" self_hash="174461815164" index="45">
content
</div>
The frame_hash value is calculated from the hierarchical relationship of the DOM tree in which the current tag sits, for example:
To calculate the frame_hash of the div tag, an md5 calculation can be applied to the string "html body div" to obtain a hash value. Various algorithms can be used, and the embodiment of the invention is not limited to a specific one.
The self_hash value is calculated from the attribute nodes that the current tag has. For example, if the div tag has a class attribute and an id attribute, an md5 calculation can be applied to the string "class:name id:author" to obtain a hash value. Again, various algorithms can be used, and the embodiment of the invention is not limited to a specific one.
In this way, a node of the DOM tree can be located by its frame_hash and self_hash.
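The two calculations, and the way the pair locates a node, can be tried out with a short sketch. The helper names and the flat node list are hypothetical, and rendering the digest as hexadecimal text is an assumption; the text only names md5 as one possible algorithm.

```python
import hashlib


def md5_hash(text: str) -> str:
    """Return the md5 digest of a string as hexadecimal text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


frame_hash = md5_hash("html body div")          # hierarchy of the tag
self_hash = md5_hash("class:name id:author")    # attributes of the tag

# Hypothetical flat view of a DOM tree: (hierarchy, attribute string, text).
nodes = [
    ("html body h1", "class:headline", "Some title"),
    ("html body div", "class:name id:author", "Some author"),
    ("html body div", "class:article", "Some body text"),
]


def locate(nodes, frame_hash, self_hash):
    """Return the nodes whose two hash values both match."""
    return [
        node for node in nodes
        if md5_hash(node[0]) == frame_hash and md5_hash(node[1]) == self_hash
    ]


matches = locate(nodes, frame_hash, self_hash)
```

The two div nodes share the same hierarchy string, so frame_hash alone cannot tell them apart; the attribute-based self_hash narrows the match to one node.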
Step 412: the client adds visual effects to the page elements of the web page according to the hash value attributes. In one implementation, when the mouse moves over an element, a light blue mask appears above it to indicate that the element's content region is selected; right-clicking the selected content region brings up menu items such as "annotate as title" and "annotate as body text".
Step 414: as each content region of the web page is annotated, the client records the correspondence between the hash values of the content region and the annotated content type and generates the web page template. The content of the web page template is, for example:
frame_hash:243092489 self_hash:49348393 title
frame_hash:434389298 self_hash:23439438 author
frame_hash:023473843 self_hash:34934932 body text
frame_hash:483928384 self_hash:23487388 date
Step 416: the client sends the generated web page template to the server. The server saves the template and, during targeted crawling of the website, uses it to extract the title, body text, and other content of web pages.
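How a saved template might be applied during crawling can be sketched as follows. The line format parsed here follows the example template content given for step 414, but the function names, the tag dictionaries, and the sample text are hypothetical, and the sketch assumes the crawled tags already carry their two hash values.

```python
def parse_template(text):
    """Parse lines of the form 'frame_hash:<v> self_hash:<v> <content type>'."""
    template = {}
    for line in text.strip().splitlines():
        frame_part, self_part, content_type = line.split(None, 2)
        frame_hash = frame_part.split(":", 1)[1]
        self_hash = self_part.split(":", 1)[1]
        template[(frame_hash, self_hash)] = content_type
    return template


def extract(template, tags):
    """Map each content type to the text of the tag whose hashes match."""
    extracted = {}
    for tag in tags:
        key = (tag["frame_hash"], tag["self_hash"])
        if key in template:
            extracted[template[key]] = tag["text"]
    return extracted


TEMPLATE_TEXT = """
frame_hash:243092489 self_hash:49348393 title
frame_hash:434389298 self_hash:23439438 author
"""

# Hypothetical tags from a crawled page, already carrying their hash values.
tags = [
    {"frame_hash": "243092489", "self_hash": "49348393", "text": "A headline"},
    {"frame_hash": "999", "self_hash": "111", "text": "navigation bar"},
    {"frame_hash": "434389298", "self_hash": "23439438", "text": "A. Writer"},
]
result = extract(parse_template(TEMPLATE_TEXT), tags)
```

Tags whose hash pair does not appear in the template, such as the navigation bar here, are simply skipped, so only annotated content types end up in the result.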
An embodiment of the invention also provides a device for generating a web page template. Referring to Fig. 5, the device includes a visual-effect framework builder 10, an annotation indication acquirer 20, and a web page template generator 30, wherein:
The visual-effect framework builder 10 is adapted to build a visual-effect framework for annotating the web page. In one implementation, the visual-effect framework includes content regions, a mask located above the selected content region, and an annotation menu; the annotation menu includes menu items for multiple content types. The visual-effect framework builder 10 can build the framework by obtaining the source code of the web page, such as an html document, attaching a style sheet file, such as a css file, to the html document, and adding a js script to the html document.
The annotation indication acquirer 20 is adapted to obtain an indication annotating each content region of the web page. The page can be annotated with a mouse or a touch screen. For example, the mouse is moved over a content region and right-clicked, and a content type menu item is then clicked to complete the annotation of that region; the annotation indication acquirer 20 can detect the annotation operation and obtain the content type selected through the right-click menu.
The web page template generator 30 is adapted to record the correspondence between content regions and annotation indications to obtain the web page template. After the annotation indication acquirer 20 obtains the selected content type, the web page template generator 30 can record the correspondence between the content region and the selected content type, thereby generating the web page template.
Optionally, the device also includes a statistics unit (not shown), adapted to collect statistics on multiple web page templates generated from multiple pages of the same resource website and to extract the parts common to those templates to generate the final web page template.
In the embodiment of the invention, to make it easier to locate and display the content regions in the web page, a hash value attribute can also be added to the tag to which each content region belongs. The device for generating a web page template may therefore also include a DOM tree generator, a hash value acquirer, and a hash value attribute adder. The DOM tree generator obtains the source code of the web page and generates the DOM tree of the page from the source code; the hash value acquirer obtains the hash value of the tag corresponding to each node in the DOM tree; and the hash value attribute adder adds a hash value attribute to each tag of the web page. The hash value may include the hierarchy hash value of the tag in the DOM tree and the hash value of the tag itself. The hierarchy hash value can be calculated from the hierarchical relationship of the DOM tree in which the current tag sits, and the hash value of the tag itself can be calculated from the attribute nodes that the current tag has. Accordingly, the web page template generator 30 obtains the web page template by recording the correspondence between the hash value of the tag to which a content region belongs and the selected content type.
In a specific implementation, the hash values of the tags can be calculated by a server. In this case, the hash value acquirer obtains the hash values of the tags as follows: it adds an index attribute to each tag of the web page; sends the source code of the web page with the index attributes added to the server, so that the server calculates the hash values of the tags; and receives the correspondence between tag index values and hash values sent by the server.
It should be noted that the steps of the method in Embodiment 1 can be split up or selectively kept as needed, and so can the modules of the device in Embodiment 1. For example, step 102 and step 104 together form a method for providing visual annotation of web pages, and the visual-effect framework builder 10 and the annotation indication acquirer 20 together form a device for providing visual annotation of web pages.
Embodiment 2
This embodiment provides a method and device for providing visual annotation of web pages.
Fig. 6 shows a flowchart of a method for providing visual annotation of web pages according to an embodiment of the invention. Referring to Fig. 6, the method includes:
Step 602: build a visual-effect framework for annotating the web page by means of a mask located above the content regions of the web page;
The visual-effect framework may include content regions, a mask located above the selected content region, and an annotation menu; the annotation menu includes menu items for multiple content types.
The visual-effect framework of the web page can be built by obtaining the source code of the web page, such as an html document, attaching a style sheet file, such as a css file, to the html document, and adding a js (javascript) script to the html document. Specifically, the js script can make a mask and an annotation menu appear above a content region when that region is detected as selected, and the display style of the mask and the annotation menu can be defined by the rules in the style sheet file.
With this visual-effect framework, each content region of the web page has a visual effect when the page is displayed in a browser. When a content region is selected (for example, when the mouse is detected moving over the region, or when a tap on the region or a swipe gesture within the region is detected on a touch screen), a mask appears above that region. The annotation menu can appear above the region at the same time, or can appear in response to a trigger; for example, right-clicking the selected content region brings up the content type menu items. As shown in Fig. 2 and Fig. 3, the content type menu items may include "annotate as title", "annotate as body text", "annotate as date", and so on; they may also include "save annotation", "end annotation", and so on.
Step 604: obtain the instruction, given in the mask, for annotating each content area of the web page.
The instruction may be the content type, selected through the annotation menu, that corresponds to the selected content area. In this
embodiment of the invention, annotation is performed at the client, which may be operated by a user, an operator or an
administrator. The web page can be annotated with a mouse: move the mouse over a content area and right-click,
then click a content type menu item to complete the annotation of that content area. On a touch screen,
the content type can be selected by a touch operation on a menu item. As shown in Fig. 2, clicking "Mark as
title" marks the corresponding content area as the title; as shown in Fig. 3, clicking "Mark as body text" marks the
corresponding content area as body text.
Thus, in the technical solution of this embodiment, building a visual effect framework allows the web page to be
annotated visually, which improves annotation efficiency. Moreover, because the web page content is presented intuitively, the content type of each part of the page
structure is easy to determine, which improves annotation accuracy.
In addition, to make it easier to locate and present the content areas of a web page, this embodiment can also add a hash value attribute to the tag
to which each content area belongs. In this case, the method for providing visual annotation of web pages
may further include the following steps before the step of building the visual effect framework for annotating the web page:
First, obtain the source code of the web page and generate the DOM tree of the web page from the source code;
Then, obtain the hash value of the tag corresponding to each node in the DOM tree;
Finally, add a hash value attribute to each tag of the web page.
The hash value may include the level hash value of the tag in the DOM tree and the hash value of the tag
itself. The level hash value can be computed from the hierarchical relationship of the current tag within the DOM tree, and the
hash value of the tag itself can be computed from the attribute nodes that the current tag has.
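The two hash values described above can be sketched as follows. The patent does not specify a hash function or exactly what is hashed, so everything concrete here is an assumption made for illustration: MD5 as the digest, the chain of tag names from the root as the input of the level hash, and the tag name plus its sorted attributes as the input of the tag's own hash. The names `TagHasher` and `hash_tags` are hypothetical.

```python
import hashlib
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


def _digest(text):
    """Assumed digest: MD5 hex string (the patent names no hash function)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


class TagHasher(HTMLParser):
    """Collects (tag, level_hash, self_hash) for every start tag."""

    def __init__(self):
        super().__init__()
        self.stack = []    # open ancestors, root first
        self.records = []

    def handle_starttag(self, tag, attrs):
        # Level hash: derived from the chain of tag names from the root.
        path = "/".join(self.stack + [tag])
        # Self hash: derived from the tag name and its sorted attributes.
        pairs = sorted((name, value or "") for name, value in attrs)
        own = tag + "?" + "&".join(f"{n}={v}" for n, v in pairs)
        self.records.append((tag, _digest(path), _digest(own)))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def hash_tags(html):
    """Return one (tag, level_hash, self_hash) record per start tag."""
    parser = TagHasher()
    parser.feed(html)
    parser.close()
    return parser.records
```

Under these assumptions, two sibling `<p>` tags share a level hash but differ in their own hash when their attributes differ, which is the kind of distinction a template could use to locate a content area.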
In a specific implementation, the tag hash values can be computed by the server. As shown in Figure 10, the server 210
is located in the search engine 200, the search engine 200 is communicatively connected to multiple third-party website servers 300 (three are shown in the
figure), and the server 210 can cooperate with the client 100 to generate web page templates. In this case, obtaining the
hash value of the tag corresponding to each node in the DOM tree may include:
First, the client 100 adds an index attribute to each tag of the web page;
Then, the client 100 sends the source code of the web page, with the index attributes added, to the server 210;
Next, the server 210 computes the tag hash values;
Finally, the server 210 sends the correspondence between tag index values and hash values to the client 100.
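The four-step exchange above can be sketched as three plain functions, with the network transfer left out. All names here (`add_index_attributes`, `server_compute_hashes`, `apply_hashes`, and the `hash` attribute name) are hypothetical, the digest is an assumed MD5 over the tag name and attributes, and the regular expressions are a simplification that does not handle `>` inside attribute values.

```python
import hashlib
import re

# Matches a start tag: name, optional attribute text, optional self-closing slash.
TAG_OPEN = re.compile(r"<([a-zA-Z][a-zA-Z0-9]*)((?:\s[^<>]*?)?)(/?)>")
# Matches a start tag that already carries an index attribute as its last attribute.
INDEXED_TAG = re.compile(
    r'<([a-zA-Z][a-zA-Z0-9]*)((?:\s[^<>]*?)?)\sindex="(\d+)"/?>')


def add_index_attributes(html):
    """Client side: number every start tag with an index attribute."""
    counter = 0

    def repl(match):
        nonlocal counter
        out = (f'<{match.group(1)}{match.group(2)} '
               f'index="{counter}"{match.group(3)}>')
        counter += 1
        return out

    return TAG_OPEN.sub(repl, html)


def server_compute_hashes(indexed_html):
    """Server side: return the correspondence {index value: hash value}."""
    mapping = {}
    for match in INDEXED_TAG.finditer(indexed_html):
        name, attrs = match.group(1), match.group(2).strip()
        digest = hashlib.md5(f"{name} {attrs}".encode("utf-8")).hexdigest()
        mapping[int(match.group(3))] = digest
    return mapping


def apply_hashes(indexed_html, mapping):
    """Client side: add a hash attribute next to each index attribute."""
    return re.sub(
        r'index="(\d+)"',
        lambda m: f'{m.group(0)} hash="{mapping[int(m.group(1))]}"',
        indexed_html)
```

The index attribute gives the client a stable way to match each returned hash value to the tag it belongs to, which is why only the index-to-hash correspondence needs to travel back from the server.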
When the invention is implemented, the annotation operation at the client may include the following steps. First, install a visual
effect framework generation plug-in on the client and access a web page on a third-party website server 300. Then, in one implementation, when the
mouse moves over a content area of the web page, a light blue mask appears above the content area to indicate that it is selected;
right-clicking brings up a selection menu, from which the content type of the content area (title, body text and so on) can be chosen. Repeat these
steps to complete the annotation of the web page.
An embodiment of the present invention also provides a device for providing visual annotation of web pages. Referring to Fig. 7, the device includes
a visual effect framework builder 10 and an annotation instruction acquirer 20, wherein:
The visual effect framework builder 10 is adapted to build a visual effect framework in which the web page is annotated through a mask located above a content area of the
web page. In one implementation, the visual effect framework includes: a content area, a mask located
above the selected content area, and an annotation menu, the annotation menu including menu items for multiple content types. The visual
effect framework builder 10 can build the framework by obtaining the source code of the web page (for example an HTML document), attaching a style sheet file (for example a CSS file)
to the HTML document, and adding a JS script to the HTML document.
The annotation instruction acquirer 20 is adapted to obtain the instruction, given in the mask, for annotating each content area of the web
page; the instruction is the content type, selected through the annotation menu, that corresponds to the selected content area. The web page can be annotated with a mouse or
a touch screen. For example, the mouse is moved over a content area and right-clicked, and
a content type menu item is then clicked to complete the annotation of that content area; the annotation instruction acquirer 20 can detect
the annotation operation and obtain the content type selected through the right-click menu.
In this embodiment, to make it easier to locate and present the content areas of a web page, a hash value attribute can also be added to the tag to which each content area
belongs. The device for providing visual annotation of web pages may therefore also
include a DOM tree generator, a hash value acquirer and a hash value attribute adder. The DOM tree generator obtains the source
code of the web page and generates the DOM tree of the web page from the source code; the hash value acquirer obtains the
hash value of the tag corresponding to each node in the DOM tree; the hash value attribute adder adds a hash value attribute to each tag of the web page.
The hash value may include the level hash value of the tag in the DOM tree and the hash value of the tag itself. The
level hash value can be computed from the hierarchical relationship of the current tag within the DOM tree, and the
hash value of the tag itself can be computed from the attribute nodes that the current tag has.
In a specific implementation, the tag hash values can be computed by the server. In this case, the hash value
acquirer obtains the tag hash values as follows: it adds an index attribute to each tag of the web page; it sends
the source code of the web page, with the index attributes added, to the server so that the server can compute the tag hash values; and it receives
the correspondence between tag index values and hash values sent by the server.
Embodiment 3
This embodiment provides a method and a device for extracting web page content according to a visual template.
Figure 10 shows a structural diagram of a system for extracting web page content according to a visual template, according to an embodiment of the present
invention. Referring to Figure 10, the system includes a client 100, a search engine 200 and multiple third-party
website servers 300 (three are shown in the figure). The search engine 200 includes a server 210 and is communicatively connected to the third-party website servers 300.
The server 210 can cooperate with the client 100 to generate web page templates, and the search engine 200 can extract web page
content according to a web page template, that is, extract the structured content of web pages on the third-party website servers 300 according to the template.
Fig. 8 shows a flowchart of a method for extracting web page content according to a visual template, according to an embodiment of the present
invention. Referring to Fig. 8, the method includes:
Step 802: when performing a targeted crawl of a target website, search the web page template library for a web page template, generated from visual annotation, that corresponds to the
target website;
The web page template library stores web page templates generated from visual annotation. A web page template may be one
generated according to the scheme of Embodiment 1 or Embodiment 2. The library stores multiple web page templates, and each
template can be identified by the home page URL of its website. The library can therefore be searched by the home page URL of the target website
to determine whether a corresponding web page template exists.
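A lookup keyed by home page URL, as described above, can be sketched with an in-memory dictionary. The class name `TemplateLibrary`, the normalisation in `homepage_url`, and the shape of the stored template are all assumptions for illustration; the patent does not describe the storage format.

```python
from urllib.parse import urlsplit


def homepage_url(url):
    """Reduce any URL of a site to an assumed home page form: scheme://host/"""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc.lower()}/"


class TemplateLibrary:
    """Hypothetical in-memory template library keyed by home page URL."""

    def __init__(self):
        self._templates = {}

    def save(self, url, template):
        # Store the template under the home page URL of its website.
        self._templates[homepage_url(url)] = template

    def find(self, url):
        # Return the template for the site of this URL, or None if absent.
        return self._templates.get(homepage_url(url))
```

For example, after `save("http://news.example.com/", template)`, calling `find` with any page URL on that host returns the same template, and a site with no record yields `None`, which corresponds to the "no template recorded" outcome of step 802.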
Step 804: if the web page template library contains a web page template, generated from visual annotation, that corresponds to the target website,
extract content from the target website according to that web page template.
The URLs of all outgoing links on the home page can be extracted starting from the home page URL; those that jump to other websites are
removed, and the remaining URLs are put into a scheduling queue. Content is then extracted, according to the web page template, from the web page corresponding to each URL in the scheduling
queue. The content extraction can be performed by a web page crawler, which may be a web
spider, a spider program, a search robot, a web crawling script or the like.
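The link filtering and queueing step can be sketched as follows. This is an illustration under stated assumptions: "same website" is taken to mean the same host name as the home page, duplicate URLs are dropped, and the names `LinkCollector` and `build_schedule_queue` are hypothetical. Fetching the pages and applying the template are not shown.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def build_schedule_queue(home_url, home_html):
    """Queue the links on the home page that stay on the same website."""
    collector = LinkCollector()
    collector.feed(home_html)
    collector.close()

    site = urlsplit(home_url).netloc.lower()
    queue, seen = deque(), set()
    for href in collector.hrefs:
        url = urljoin(home_url, href)           # resolve relative links
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https"):
            continue                            # skip mailto:, javascript:, ...
        if parts.netloc.lower() != site:
            continue                            # link jumps to another website
        if url not in seen:
            seen.add(url)
            queue.append(url)
    return queue
```

A crawler would then pop URLs from the left of the queue and run template-based extraction on each page.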
In the technical solution of this embodiment, web page content is extracted using a web page template generated from visual
annotation. Because such a template is more accurate, content extraction performed according to it is also
more accurate.
Fig. 9 shows a structural diagram of a device for extracting web page content according to a visual template, according to an embodiment of the present
invention. Referring to Fig. 9, the device includes a web page template library 902, a finder 904 and a content extractor 906, wherein:
The web page template library 902 is adapted to store web page templates generated from visual annotation; a web page template can be identified by
the URL of a web page, or by the home page URL of a website.
The finder 904 is adapted to search the web page template library, when a targeted crawl of a target website is performed, for a web page template,
generated from visual annotation, that corresponds to the target website.
The content extractor 906 is adapted to extract content from the target website according to the web page template when the web page template library contains a web page template, generated from visual
annotation, that corresponds to the target website. The content extractor 906 can
extract the URLs of all outgoing links on the home page starting from the home page URL, remove those that jump to other websites, and put the remaining
URLs into a scheduling queue; it then extracts content, according to the web page template, from the web page corresponding to each URL in the scheduling
queue. The content extractor 906 may be a web spider, a spider program, a search robot, a web crawling script or the like.
Optionally, the device also includes means for generating web page templates, that is, it may include the visual
effect framework builder 10, the annotation instruction acquirer 20 and the web page template generator 30 of Embodiment 1; for the connections between these modules and their
working principles, see the description in Embodiment 1.
In summary, in the technical solution of the embodiments of the present invention, a visual effect framework for annotating web pages is
built, so the web page template text does not have to be edited by hand; selecting content areas of the web page in the visual effect framework
and operating on them visually is enough to annotate the page content. This improves annotation efficiency and, in turn, the efficiency of generating web page
templates. Moreover, because the web page content is presented intuitively, the content type of each part of the page structure can be determined easily without specialist knowledge
of web design, which improves annotation accuracy and, in turn, the accuracy of the generated web page
templates. Further, because the web page templates are more accurate, the content obtained by crawling web pages according to them
is more accurate as well.
The algorithms and displays provided herein are not inherently related to any particular computer, virtual system or other apparatus.
Various general-purpose systems may also be used with the teachings herein. From the description above, the structure required to construct such a system
is apparent. In addition, the present invention is not directed to any particular programming language. It should be understood that the
content of the invention described herein can be implemented in a variety of programming languages, and that the description above of a specific language is given to disclose the
best mode of the invention.
Numerous specific details are set forth in the specification provided here. It should be understood, however, that embodiments of the
invention can be practiced without these specific details. In some instances, well-known methods, structures
and techniques are not shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid understanding of one or more of the inventive aspects,
the features of the invention are sometimes grouped together in a single embodiment,
figure or description thereof in the above description of exemplary embodiments. This method of disclosure, however, should not be interpreted as reflecting an intention that the
claimed invention requires more features than are expressly recited in each claim. Rather, as the following
claims reflect, inventive aspects lie in fewer than all features of a single embodiment disclosed above.
The claims following the detailed description are therefore expressly incorporated into the detailed description, with each claim standing on its own
as a separate embodiment of the invention.
Those skilled in the art will understand that the modules in the device of an embodiment can be adaptively
changed and arranged in one or more devices different from that embodiment. The modules or units or
components in an embodiment can be combined into one module or unit or component, and they can furthermore be divided into multiple sub-modules or sub-units or
sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this
specification (including the accompanying claims, abstract and drawings) and all processes or units of any method or device so disclosed
may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying
claims, abstract and drawings) may be replaced by an alternative feature serving the same, equivalent or similar
purpose.
In addition, those skilled in the art will understand that although some embodiments described herein include some features
included in other embodiments and not others, combinations of features of different embodiments are meant to be within the
scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed
embodiments can be used in any combination.
The component embodiments of the present invention may be implemented in hardware, in software modules running on one or more
processors, or in a combination of them. Those skilled in the art will understand that a
microprocessor or a digital signal processor (DSP) can be used in practice to implement some or all of the functions of some or all of the components of the
device for providing visual annotation of web pages according to embodiments of the present invention. The invention may also be implemented as a device or apparatus program
(for example, a computer program and a computer program product) for performing part or all of the method described herein.
Such a program implementing the invention may be stored on a computer-readable medium, or may take the form of one or more
signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any
other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the
art can design alternative embodiments without departing from the scope of the appended claims. In the claims,
any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of
elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such
elements. The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed
computer. In a unit claim enumerating several devices, several of these devices may be embodied by one and the same item of
hardware. The use of the words first, second and third does not indicate any order; these words may be interpreted as
names.
Claims (10)
1. A method for providing visual annotation of web pages, comprising:
building a visual effect framework in which a web page is annotated through a mask located above a content area of the web page, wherein the
visual effect framework includes a content area, a mask located above the selected content area, and an annotation menu, the annotation
menu including menu items for multiple content types;
when it is detected that the content area is selected, displaying the mask and the annotation menu above the selected content
area;
obtaining the instruction, given in the mask, for annotating each content area of the web page.
2. The method of claim 1, wherein the instruction is the content type, selected through the annotation menu, that corresponds to the selected
content area.
3. The method of claim 1, wherein, before building the visual effect framework for annotating the web page, the
method further comprises:
obtaining the source code of the web page and generating the DOM tree of the web page from the source code;
obtaining the hash value of the tag corresponding to each node in the DOM tree;
adding a hash value attribute to each tag of the web page, wherein the hash value is used to locate and present the content
areas in the web page.
4. The method of claim 3, wherein the hash value includes:
the level hash value of the tag in the DOM tree and the hash value of the tag itself.
5. The method of claim 3, wherein obtaining the hash value of the tag corresponding to each node in the DOM tree
comprises:
adding an index attribute to each tag of the web page;
sending the source code of the web page, with the index attributes added, to a server so that the server computes the tag hash
values;
receiving the correspondence between tag index values and hash values sent by the server.
6. A device for providing visual annotation of web pages, comprising:
a visual effect framework builder, adapted to build a visual effect framework in which a web page is annotated through a mask located above a content area of the web page,
wherein the visual effect framework includes a content area, a mask located above the selected content area,
and an annotation menu, the annotation menu including menu items for multiple content types; and, when it is detected that the content area is selected,
to display the mask and the annotation menu above the selected content area;
an annotation instruction acquirer, adapted to obtain the instruction, given in the mask, for annotating each content area of the web page.
7. The device of claim 6, wherein the instruction is the content type, selected through the annotation menu, that corresponds to the selected
content area.
8. The device of claim 6, further comprising:
a DOM tree generator, adapted to obtain the source code of the web page and generate the DOM tree of the web page from the source code;
a hash value acquirer, adapted to obtain the hash value of the tag corresponding to each node in the DOM tree;
a hash value attribute adder, adapted to add a hash value attribute to each tag of the web page, wherein the hash value is used to
locate and present the content areas in the web page.
9. The device of claim 8, wherein the hash value includes:
the level hash value of the tag in the DOM tree and the hash value of the tag itself.
10. The device of claim 8, wherein the hash value acquirer is further adapted to:
add an index attribute to each tag of the web page;
send the source code of the web page, with the index attributes added, to a server so that the server computes the tag hash
values;
receive the correspondence between tag index values and hash values sent by the server.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310606202.6A CN103678510B (en) | 2013-11-25 | 2013-11-25 | The method and device of visualization mark is provided webpage |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310606202.6A CN103678510B (en) | 2013-11-25 | 2013-11-25 | The method and device of visualization mark is provided webpage |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103678510A CN103678510A (en) | 2014-03-26 |
CN103678510B true CN103678510B (en) | 2018-02-02 |
Family
ID=50316055
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310606202.6A Active CN103678510B (en) | 2013-11-25 | 2013-11-25 | The method and device of visualization mark is provided webpage |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103678510B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105912613A (en) * | 2016-04-06 | 2016-08-31 | 江苏中威科技软件系统有限公司 | Website template quick migration method |
CN107423322B (en) * | 2017-03-31 | 2020-03-03 | 广州视源电子科技股份有限公司 | Method and device for displaying label nesting hierarchy of webpage |
CN109189688B (en) * | 2018-09-11 | 2022-06-03 | 北京奇艺世纪科技有限公司 | Test case script generation method and device and electronic equipment |
CN109522490B (en) * | 2018-09-18 | 2021-11-09 | 武汉大学 | Map visualization method for internet information |
CN111400581B (en) * | 2020-03-13 | 2024-02-06 | 京东科技控股股份有限公司 | System, method and apparatus for labeling samples |
CN113419781A (en) * | 2021-07-19 | 2021-09-21 | 湖南四方天箭信息科技有限公司 | Crawler method and device based on Chrome plug-in, computer equipment and storage medium |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101192234A (en) * | 2007-06-07 | 2008-06-04 | 腾讯科技(深圳)有限公司 | Searching system and method based on web page extraction |
CN102682055A (en) * | 2011-01-03 | 2012-09-19 | 三星电子株式会社 | Method and apparatus for managing e-book contents |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050237336A1 (en) * | 2004-04-23 | 2005-10-27 | Jens Guhring | Method and system for multi-object volumetric data visualization |
CN101777045A (en) * | 2008-09-01 | 2010-07-14 | 西北工业大学 | Method for analyzing XML file by indexing |
CN101464905B (en) * | 2009-01-08 | 2011-03-23 | 中国科学院计算技术研究所 | Web page information extraction system and method |
CN102779172B (en) * | 2012-06-25 | 2016-06-01 | 北京奇虎科技有限公司 | The recognition system of non-body text and method in a kind of webpage |
Non-Patent Citations (1)
Title |
---|
"一种基于模板的快速网页文本自动抽取算法";陈冶昂 等;《计算机应用研究》;20090731;第26卷(第7期);第2646-2649页 * |
Also Published As
Publication number | Publication date |
---|---|
CN103678510A (en) | 2014-03-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103678509B (en) | Generate the method and device of web page template | |
CN103678511B (en) | The method and device of webpage content extraction is carried out according to visual template | |
CN103678510B (en) | The method and device of visualization mark is provided webpage | |
US20210248204A1 (en) | Systems and methods for automatically identifying and linking names in digital resources | |
US10509555B2 (en) | Machine data analysis in an information technology environment | |
US10423697B2 (en) | User interface with navigation controls for the display or concealment of adjacent content | |
CN102360368A (en) | Web data extraction method based on visual customization of extraction template | |
TW201250492A (en) | Method and system of extracting web page information | |
CN105095067A (en) | User interface element object identification and automatic test method and apparatus | |
Huynh et al. | Enabling web browsers to augment web sites' filtering and sorting functionalities | |
US10789302B2 (en) | Method and system for extracting user-specific content | |
CN104268289B (en) | The abatement detecting method and device of link URL | |
CN107092670A (en) | A kind of visual network crawler system and analysis method based on embedded browser | |
CN107368546A (en) | A kind of method and apparatus for generating outline | |
CN106951405A (en) | Data processing method and device based on typesetting engine | |
JP5380874B2 (en) | Information retrieval method, program and apparatus | |
Sellers et al. | Taking the OXPath down the deep web | |
CN110147477A (en) | Data resource modelling extracting method, device and the equipment of Web system | |
JP5652519B2 (en) | Information retrieval method, program and apparatus | |
CN108228542A (en) | A kind of processing method and processing device of non-structured text | |
Neeli et al. | Automated data mining from web servers using perl script | |
JP7023612B2 (en) | Log structure visualization device, log structure visualization method, and program | |
Wara | A Framework for Fashion Data Gathering, Hierarchical-Annotation and Analysis for Social Media and Online Shop: TOOLKIT FOR DETAILED STYLE ANNOTATIONS FOR ENHANCED FASHION RECOMMENDATION | |
Rozinajová et al. | One approach to HTML wrappers creation: using document object model tree | |
Huhtamäki et al. | Context-driven social network visualisation: Case wiki co-creation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | Effective date of registration: 20220725 Address after: Room 801, 8th floor, No. 104, floors 1-19, building 2, yard 6, Jiuxianqiao Road, Chaoyang District, Beijing 100015 Patentee after: BEIJING QIHOO TECHNOLOGY Co.,Ltd. Address before: 100088 room 112, block D, 28 new street, new street, Xicheng District, Beijing (Desheng Park) Patentee before: BEIJING QIHOO TECHNOLOGY Co.,Ltd. Patentee before: Qizhi software (Beijing) Co.,Ltd. |