CN103678510B - Method and device for providing visual annotation of web pages - Google Patents

Publication number
CN103678510B
CN103678510B CN201310606202.6A CN201310606202A CN103678510B CN 103678510 B CN103678510 B CN 103678510B CN 201310606202 A CN201310606202 A CN 201310606202A CN 103678510 B CN103678510 B CN 103678510B
Authority
CN
China
Prior art keywords
webpage
hash value
tag
annotation
web page
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310606202.6A
Other languages
Chinese (zh)
Other versions
CN103678510A (en
Inventor
马晓辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd and Qizhi Software Beijing Co Ltd
Priority to CN201310606202.6A
Publication of CN103678510A
Application granted
Publication of CN103678510B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/166: Editing, e.g. inserting or deleting
    • G06F40/169: Annotation, e.g. comment data or footnotes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The present invention discloses a method and device for providing visual annotation of web pages, belonging to the field of Internet technology. The method includes: building a visual-effect framework for annotating a web page by means of a mask located above the content areas of the web page; and obtaining the instruction, made on the mask, for annotating each content area of the web page. The invention can improve the efficiency and accuracy of web page annotation.

Description

Method and device for providing visual annotation of web pages
Technical field
The present invention relates to the field of Internet technology, and in particular to a method and device for providing visual annotation of web pages.
Background
A web page template can be used to extract the content of a web page. For example, some search engines use targeted crawling when fetching websites: a targeted crawler (spider) uses a web page template to extract the relevant content of a website and obtain formatted content, including the page's title, author, publication time, and body text.
One existing method of generating a web page template is as follows. First, the source code of the page is downloaded according to the page's URL (Uniform Resource Locator). Second, the page structure is analyzed automatically from the source code, and a hash value is calculated for each structure in the page. Then, a person inspects the source code and judges which structure corresponds to the title, which to the body text, which to the publication time, and so on, and marks them accordingly. Finally, the correspondence between each structure's hash value and its content type is generated, yielding the web page template.
The existing method of generating web page templates has at least the following disadvantages:
The content type of each page structure is marked manually in a text editor. A web page template contains a large amount of irrelevant content, and some templates run to tens of thousands of lines, so manual marking is very inefficient;
The various contents of a web page template are mixed into the page code. Because the page content is not displayed visually, a person who is not proficient in web design languages cannot easily determine the content type of a page structure and is prone to errors during manual marking. This lowers the accuracy of the generated web page template and, in turn, the accuracy of content extraction based on that template.
Summary of the invention
In view of the above problems, the present invention is proposed to provide a method and device for providing visual annotation of web pages that overcome the above problems or at least partially solve them.
According to one aspect of the present invention, a method for providing visual annotation of web pages is provided, the method including:
building a visual-effect framework for annotating the web page by means of a mask located above the content areas of the web page;
obtaining the instruction, made on the mask, for annotating each content area of the web page.
Optionally, the visual-effect framework includes the content areas, a mask located above the selected content area, and an annotation menu; the annotation menu includes menu items for multiple content types; and the instruction is the content type, chosen through the annotation menu, that corresponds to the selected content area.
Optionally, before building the visual-effect framework for annotating the web page, the method further includes:
obtaining the source code of the web page and generating the DOM tree of the web page from the source code;
obtaining the hash value of the tag corresponding to each node in the DOM tree;
adding a hash value attribute to each tag of the web page, where the hash value is used to locate and display content areas in the web page.
Optionally, the hash value includes the hierarchy hash value of the tag in the DOM tree and the hash value of the tag itself.
Optionally, obtaining the hash value of the tag corresponding to each node in the DOM tree includes:
adding an index attribute to each tag of the web page;
sending the source code of the web page with the index attributes added to the server, so that the server calculates the hash values of the tags;
receiving the correspondence between tag index values and hash values sent by the server.
According to another aspect of the present invention, a device for providing visual annotation of web pages is provided, the device including:
a visual-effect framework builder, adapted to build a visual-effect framework for annotating the web page by means of a mask located above the content areas of the web page;
an annotation instruction acquirer, adapted to obtain the instruction, made on the mask, for annotating each content area of the web page.
Optionally, the visual-effect framework includes the content areas, a mask located above the selected content area, and an annotation menu; the annotation menu includes menu items for multiple content types; and the instruction is the content type, chosen through the annotation menu, that corresponds to the selected content area.
Optionally, the device further includes:
a DOM tree generator, adapted to obtain the source code of the web page and generate the DOM tree of the web page from the source code;
a hash value acquirer, adapted to obtain the hash value of the tag corresponding to each node in the DOM tree;
a hash value attribute adder, adapted to add a hash value attribute to each tag of the web page, where the hash value is used to locate and display content areas in the web page.
Optionally, the hash value includes the hierarchy hash value of the tag in the DOM tree and the hash value of the tag itself.
Optionally, the hash value acquirer is further adapted to:
add an index attribute to each tag of the web page;
send the source code of the web page with the index attributes added to the server, so that the server calculates the hash values of the tags;
receive the correspondence between tag index values and hash values sent by the server.
According to one or more of the above technical solutions of the present invention, building a visual-effect framework allows web pages to be annotated visually, which improves annotation efficiency; moreover, because the page content is displayed visually, the content type of a page structure is easy to determine, which improves annotation accuracy.
The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the specification, and so that the above and other objects, features, and advantages of the invention are more apparent, specific embodiments of the invention are set out below.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered limiting of the invention. Throughout the drawings, the same reference numerals denote the same components. In the drawings:
Fig. 1 shows a flowchart of a method of generating a web page template according to an embodiment of the invention;
Fig. 2 shows a schematic diagram of annotating the title of a web page in an embodiment of the invention;
Fig. 3 shows a schematic diagram of annotating the body text of a web page in an embodiment of the invention;
Fig. 4 shows a detailed flowchart of a method of generating a web page template according to an embodiment of the invention;
Fig. 5 shows a structural diagram of a device for generating a web page template according to an embodiment of the invention;
Fig. 6 shows a flowchart of a method for providing visual annotation of web pages according to an embodiment of the invention;
Fig. 7 shows a structural diagram of a device for providing visual annotation of web pages according to an embodiment of the invention;
Fig. 8 shows a flowchart of a method for extracting web page content according to a visual template, according to an embodiment of the invention;
Fig. 9 shows a structural diagram of a device for extracting web page content according to a visual template, according to an embodiment of the invention;
Fig. 10 shows a structural diagram of a system for extracting web page content according to a visual template, according to an embodiment of the invention.
Detailed description
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and its scope conveyed completely to those skilled in the art.
Embodiment 1
This embodiment provides a method and device for generating a web page template.
Fig. 1 shows a flowchart of a method of generating a web page template according to an embodiment of the invention. Referring to Fig. 1, the method includes:
Step 102: build a visual-effect framework for annotating the web page;
In one implementation, the visual-effect framework may include: the content areas, a mask located above the selected content area, and an annotation menu, where the annotation menu includes menu items for multiple content types.
The visual-effect framework of the web page can be built by obtaining the source code of the web page, such as an HTML (HyperText Markup Language) document, attaching a stylesheet file such as a CSS (Cascading Style Sheets) file to the HTML document, and adding a JS (JavaScript) script to the HTML document. Specifically, the JS script can make the mask and the annotation menu appear above a content area when that area is detected as selected, and the way the mask and the annotation menu are displayed can be defined by rules in the stylesheet file.
With this visual-effect framework, when the web page is displayed in a browser, each content area of the page can have a visual effect. When a content area is selected (for example, the mouse is detected moving over the content area, or, on a touch screen, a tap on the content area or a swipe gesture within it is detected), a mask appears above the content area. The annotation menu can appear above the content area at the same time, or appear in response to a trigger; for example, right-clicking the selected content area brings up the content type menu items. As shown in Fig. 2 and Fig. 3, the content type menu items may include "Mark as title", "Mark as body text", and "Mark as date"; they may also include "Save annotation" and "End annotation".
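As a rough illustration of step 102, the following Python sketch attaches a stylesheet and a script reference to a downloaded HTML document. The CSS class names, the colour, the script name annotate.js, and the function name are all assumptions made for this example; the patent does not specify them, and the in-browser behaviour itself would live in the JS script, which is not shown.

```python
# Hypothetical stylesheet: class names and colours are illustrative only.
OVERLAY_CSS = (
    ".annot-mask { position: absolute; background: rgba(135, 206, 250, 0.4); }"
    ".annot-menu { position: absolute; background: #fff; border: 1px solid #999; }"
)


def build_annotation_page(html: str, script_url: str = "annotate.js") -> str:
    """Return the HTML with an overlay stylesheet and an annotation script attached."""
    style_tag = f"<style>{OVERLAY_CSS}</style>"
    script_tag = f'<script src="{script_url}"></script>'
    # Put the style at the end of <head> and the script at the end of <body>,
    # falling back to plain concatenation when those tags are missing.
    if "</head>" in html:
        html = html.replace("</head>", style_tag + "</head>", 1)
    else:
        html = style_tag + html
    if "</body>" in html:
        html = html.replace("</body>", script_tag + "</body>", 1)
    else:
        html = html + script_tag
    return html
```

String replacement is used here only to keep the sketch short; a real plug-in would more likely inject these elements through the browser's DOM.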
Step 104: obtain the instruction for annotating each content area of the web page;
In this embodiment of the invention, annotation is performed on the client, which can be operated by a user, an operator, or an administrator. The web page can be annotated with the mouse: move the mouse over a content area, right-click, and then click a content type menu item to complete the annotation of that content area. On a touch screen, the content type can also be selected by touching a menu item. As shown in Fig. 2, clicking "Mark as title" marks the corresponding content area as the title; as shown in Fig. 3, clicking "Mark as body text" marks the corresponding content area as the body text.
Step 106: record the correspondence between content areas and annotation instructions to obtain the web page template.
Each time a content area is annotated and the "Save annotation" menu item is then selected, the correspondence between that content area and the chosen content type is stored in the web page template. Selecting the "End annotation" menu item completes the annotation of all content areas in the page that need annotating, yielding the web page template (also called the web page content template) for that page.
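The save-then-end behaviour of steps 104 and 106 can be sketched as a small state holder. The class name, method names, and the use of a string to identify a content area are assumptions for illustration; the patent only describes the menu items and the resulting correspondence.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class AnnotationSession:
    """Collects (content area -> content type) pairs until annotation ends."""

    pending: Optional[Tuple[str, str]] = None
    template: Dict[str, str] = field(default_factory=dict)

    def mark(self, area_id: str, content_type: str) -> None:
        # Corresponds to choosing e.g. "Mark as title" on a selected area.
        self.pending = (area_id, content_type)

    def save(self) -> None:
        # Corresponds to the "Save annotation" menu item.
        if self.pending is not None:
            area_id, content_type = self.pending
            self.template[area_id] = content_type
            self.pending = None

    def end(self) -> Dict[str, str]:
        # Corresponds to the "End annotation" menu item: return the template.
        return dict(self.template)
```

In this sketch an annotation that is marked but never saved is simply left out of the template, which is one possible reading of the flow described above.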
It can be seen that, with the technical solution of this embodiment, a web page template can be defined easily just by selecting content areas of the web page in the visual-effect framework and operating on them visually, which improves the efficiency of template generation. Moreover, because the page content is displayed visually, the content type of a page structure is easy to determine, which improves the accuracy of the generated template.
The above scheme generates, from one web page, the web page template corresponding to that page. A resource website may contain many web pages, and these are usually generated from the same web design template, so their structures are largely the same and differ only slightly. For example, some pages may include comments while others do not, but all of them include a title, author, publication time, body text, and so on. Generating a web page template for every page with the above steps would still involve a large workload.
To further improve the efficiency of template generation, the method may also include: collecting statistics over the multiple web page templates generated from multiple pages of the same resource website, and extracting the parts the templates have in common to generate a final web page template. Specifically, all the pages of the resource website can be sampled to obtain multiple pages; multiple web page templates are then generated by the above method; finally, the parts common to these templates are extracted to generate the final web page template (also called the web page template of the resource website). Here, each correspondence between a content area and a content type counts as one part of a web page template.
For example, for the 360 website, the HTML document of the home page can first be obtained from the site's home page URL (http://www.360.cn/). Analysis of that HTML document then finds that the site contains multiple sub-pages (say 1000), so 50 sub-pages are drawn from these 1000 according to a predetermined algorithm (such as a random algorithm). Visual annotation of these 50 sub-pages produces 50 web page templates. Finally, the parts common to the 50 templates are extracted to generate the web page template for the 360 website.
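The sampling and merging just described might look like the sketch below. It assumes that each template is a dictionary from a content-area key to a content type and that "the parts in common" means entries present in every template; both are interpretations, and the function names are hypothetical.

```python
import random


def sample_pages(urls, k, seed=None):
    """Pick k sub-pages at random (one possible 'predetermined algorithm')."""
    rng = random.Random(seed)
    urls = list(urls)
    return rng.sample(urls, min(k, len(urls)))


def merge_templates(templates):
    """Keep only the (content area, content type) entries found in every template."""
    if not templates:
        return {}
    common = set(templates[0].items())
    for template in templates[1:]:
        common &= set(template.items())
    return dict(common)
```

A looser rule, such as keeping entries that appear in most sampled templates, would also fit the word "statistics"; strict intersection is used here only because it is the simplest reading.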
In addition, to make it easier to locate and display content areas in the web page, this embodiment may also add a hash value attribute to the tag to which each content area belongs. Correspondingly, what the web page template stores is the correspondence between the hash value of the tag to which a content area belongs and the chosen content type. In this case, before the step of building the visual-effect framework for annotating the web page, the method of generating a web page template may also include the following steps:
First, obtain the source code of the web page and generate the DOM (Document Object Model) tree of the web page from the source code;
Then, obtain the hash value of the tag corresponding to each node in the DOM tree;
Finally, add a hash value attribute to each tag of the web page.
The hash value may include the hierarchy hash value of the tag in the DOM tree and the hash value of the tag itself. The hierarchy hash value can be calculated from the hierarchical relationship of the DOM tree in which the current tag sits, and the tag's own hash value can be calculated from the attribute nodes the current tag has.
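A minimal sketch of the two hash values follows, using MD5 over strings of the form "html body div" and "class:name id:author", which are the examples the description gives in its detailed flow. The function names, the space-separated serialization, and the hexadecimal output are assumptions; the patent states that the algorithm is not limited to any particular one.

```python
import hashlib


def frame_hash(tag_path):
    """Hash of the tag path from the root to the current tag, e.g. ["html", "body", "div"]."""
    path = " ".join(tag_path)
    return hashlib.md5(path.encode("utf-8")).hexdigest()


def self_hash(attributes):
    """Hash of the tag's own attributes, serialized as "name:value" pairs."""
    text = " ".join(f"{name}:{value}" for name, value in attributes.items())
    return hashlib.md5(text.encode("utf-8")).hexdigest()
```

For a div with class "name" and id "author" nested directly under body, frame_hash(["html", "body", "div"]) and self_hash({"class": "name", "id": "author"}) together identify the tag.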
In a specific implementation, the hash values of the tags can be calculated by the server. As shown in Fig. 10, the server 210 is located in the search engine 200, the search engine 200 is communicatively connected to multiple third-party website servers 300 (three are shown in the figure), and the server 210 can cooperate with the client 100 to generate web page templates. In this case, obtaining the hash value of the tag corresponding to each node in the DOM tree may include:
First, the client 100 adds an index attribute to each tag of the web page;
Then, the client 100 sends the source code of the web page with the index attributes added to the server 210;
Next, the server 210 calculates the hash values of the tags;
Finally, the server 210 sends the correspondence between tag index values and hash values to the client 100.
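The first of these steps, numbering every tag before the page is sent to the server, can be sketched with Python's standard XML parser. This assumes the page is well-formed XHTML, which real pages often are not; the function name and the zero-based numbering are also assumptions.

```python
import xml.etree.ElementTree as ET


def add_index_attributes(xhtml: str) -> str:
    """Number every element in depth-first (document) order via an "index" attribute."""
    root = ET.fromstring(xhtml)
    # Element.iter() walks the tree depth-first, starting with the root itself.
    for i, element in enumerate(root.iter()):
        element.set("index", str(i))
    return ET.tostring(root, encoding="unicode")
```

The index gives the server a stable way to refer to each tag in its reply without sending the markup back.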
When implementing the invention, operation on the client may include the following steps:
First, a plug-in that generates the visual-effect framework is installed on the client, and a web page on a third-party website server 300 is accessed;
Then, in one implementation, when the mouse moves over a content area of the page, a light blue mask appears above the content area to indicate that it is selected; right-clicking brings up a selection menu from which the content type of the area, such as title or body text, can be chosen;
Finally, once annotation is complete, the client generates the web page template.
The client can send the generated web page template to the server, and the server can use the template for information collection when performing targeted crawling of web page content.
A detailed flow of the method of generating a web page template in an embodiment of the invention is given below. Referring to Fig. 4, the method includes:
Step 402: the client obtains the source code of the web page and generates the DOM tree of the web page from the source code;
Step 404: the client adds an index attribute to each tag of the DOM tree; the DOM tree can be traversed with a depth-first algorithm;
Step 406: the client sends the source code of the web page with the index attributes added to the server; the content sent is, for example:
Step 408: the server receives the source code with index attributes sent by the client, analyzes it, calculates the hash values of the tags for the entire page structure, and returns all the calculated hash values to the client;
The hash values calculated by the server correspond to the tags' index values and can be packaged in JSON format and returned to the client. The JSON content has, for example, the format: {tag index value: {hash 1: hash1, hash 2: hash2}, ...}.
Step 410: the client receives the JSON data returned by the server and, using the correspondence between tag index values and hash values, adds two attribute values to each corresponding tag: frame_hash, the hierarchy hash value of the tag in the DOM tree, and self_hash, the hash value of the tag itself;
For example, a div tag with the hash value attributes added looks as follows:
<div frame_hash="46131321231613" self_hash="174461815164" index="45">
content
</div>
Here, the frame_hash value is calculated from the hierarchical relationship of the DOM tree in which the current tag sits, for example:
To calculate the frame_hash of the div tag, an MD5 calculation can be performed on the string "html body div" to obtain a hash value. Various algorithms can be used; the embodiment of the invention is not limited to a specific algorithm.
The self_hash value is calculated from the attribute nodes the current tag has. For example, if the div tag has a class attribute and an id attribute, an MD5 calculation can be performed on the string "class:name id:author" to obtain a hash value. Again, various algorithms can be used; the embodiment is not limited to a specific algorithm.
In this way, a node of the DOM tree can be located from its frame_hash and self_hash.
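Step 410 and the lookup it enables can be sketched as follows. The JSON key names frame_hash and self_hash inside the server reply are an assumption (the description only gives the shape {tag index value: {hash 1, hash 2}}), the page is assumed to be well-formed XHTML that already carries index attributes, and both function names are hypothetical.

```python
import json
import xml.etree.ElementTree as ET


def apply_hashes(indexed_xhtml: str, mapping_json: str) -> ET.Element:
    """Copy frame_hash/self_hash from the server's mapping onto the indexed tags."""
    mapping = json.loads(mapping_json)
    root = ET.fromstring(indexed_xhtml)
    for element in root.iter():
        hashes = mapping.get(element.get("index"))
        if hashes:
            element.set("frame_hash", hashes["frame_hash"])
            element.set("self_hash", hashes["self_hash"])
    return root


def find_node(root: ET.Element, frame_hash: str, self_hash: str):
    """Return the first element carrying both hash values, or None."""
    for element in root.iter():
        if element.get("frame_hash") == frame_hash and element.get("self_hash") == self_hash:
            return element
    return None
```

Because two sibling tags with the same path and the same attributes would share both hashes, this sketch returns the first match; how the patent resolves such collisions is not stated.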
Step 412: the client adds visual effects to the page elements of the web page according to the hash value attributes. In one implementation, when the mouse moves over an element, a light blue mask appears above it to indicate that the element's content area is selected; right-clicking the selected content area brings up menu items such as "Mark as title" and "Mark as body text".
Step 414: as each content area of the web page is annotated, the client records the correspondence between the content area's hash values and the annotated content type and generates the web page template. The content of the web page template is, for example:
frame_hash:243092489 self_hash:49348393 title
frame_hash:434389298 self_hash:23439438 author
frame_hash:023473843 self_hash:34934932 body text
frame_hash:483928384 self_hash:23487388 date
Step 416: the client sends the generated web page template to the server, and the server saves it; when performing targeted crawling of the website, the server uses the web page template to extract the page's title, body text, and other content.
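To illustrate how a saved template of the kind shown in step 414 might drive extraction in step 416, the sketch below matches elements by their two hash values. It assumes the crawled page is well-formed XHTML whose tags already carry frame_hash and self_hash attributes computed the same way as during annotation; the dictionary layout and function name are hypothetical.

```python
import xml.etree.ElementTree as ET

# Template entries in the form shown in step 414: (frame_hash, self_hash) -> content type.
TEMPLATE = {
    ("243092489", "49348393"): "title",
    ("434389298", "23439438"): "author",
}


def extract_content(hashed_xhtml: str, template: dict) -> dict:
    """Return {content type: text} for every element whose hashes match the template."""
    result = {}
    root = ET.fromstring(hashed_xhtml)
    for element in root.iter():
        key = (element.get("frame_hash"), element.get("self_hash"))
        content_type = template.get(key)
        if content_type is not None:
            # itertext() yields the text of the element and all of its descendants.
            result[content_type] = "".join(element.itertext()).strip()
    return result
```

Elements that match no template entry are ignored, so irrelevant parts of the page never reach the formatted output.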
An embodiment of the invention also provides a device for generating a web page template. Referring to Fig. 5, the device includes a visual-effect framework builder 10, an annotation instruction acquirer 20, and a web page template generator 30, where:
The visual-effect framework builder 10 is adapted to build a visual-effect framework for annotating the web page. In one implementation, the visual-effect framework includes: the content areas, a mask located above the selected content area, and an annotation menu, where the annotation menu includes menu items for multiple content types. The visual-effect framework builder 10 can build the visual-effect framework of the web page by obtaining the source code of the web page, such as an HTML document, attaching a stylesheet file such as a CSS file to the HTML document, and adding a JS script to the HTML document.
The annotation instruction acquirer 20 is adapted to obtain the instruction for annotating each content area of the web page. The web page can be annotated with a mouse or a touch screen; for example, the mouse is moved over a content area and right-clicked, and a content type menu item is then clicked to complete the annotation of that content area. The annotation instruction acquirer 20 can detect the annotation operation and obtain the content type chosen through the right-click menu.
The web page template generator 30 is adapted to record the correspondence between content areas and annotation instructions to obtain the web page template. After the annotation instruction acquirer 20 obtains the chosen content type, the web page template generator 30 can record the correspondence between the content area and the chosen content type, thereby generating the web page template.
Optionally, the device also includes a statistics unit (not shown), adapted to collect statistics over the multiple web page templates generated from multiple pages of the same resource website and to extract the parts the templates have in common to generate a final web page template.
In this embodiment, to make it easier to locate and display content areas in the web page, a hash value attribute may also be added to the tag to which each content area belongs. The device for generating a web page template may therefore also include a DOM tree generator, a hash value acquirer, and a hash value attribute adder. The DOM tree generator obtains the source code of the web page and generates the DOM tree of the web page from it; the hash value acquirer obtains the hash value of the tag corresponding to each node in the DOM tree; and the hash value attribute adder adds a hash value attribute to each tag of the web page. The hash value may include the hierarchy hash value of the tag in the DOM tree and the hash value of the tag itself. The hierarchy hash value can be calculated from the hierarchical relationship of the DOM tree in which the current tag sits, and the tag's own hash value can be calculated from the attribute nodes the current tag has. Correspondingly, the web page template generator 30 obtains the web page template by recording the correspondence between the hash value of the tag to which a content area belongs and the chosen content type.
In a specific implementation, the hash values of the tags can be calculated by the server. In this case, the hash value acquirer obtains the hash values of the tags as follows: it adds an index attribute to each tag of the web page; sends the source code of the web page with the index attributes added to the server, so that the server calculates the hash values of the tags; and receives the correspondence between tag index values and hash values sent by the server.
It should be noted that the steps of the method in Embodiment 1 can be split up or selectively adopted as needed, and so can the modules of the device in Embodiment 1. For example, steps 102 and 104 together form a method for providing visual annotation of web pages, and the visual-effect framework builder 10 and the annotation instruction acquirer 20 together form a device for providing visual annotation of web pages.
Embodiment 2
The present embodiment provides a kind of method and device for providing webpage visualization mark.
Fig. 6 shows the method flow diagram according to an embodiment of the invention for providing webpage visualization mark, reference Fig. 6, methods described include:
Step 602, it is constructed by the effect of visualization being labeled positioned at the masking-out of web page contents overlying regions to webpage Framework;
The effect of visualization framework can include:Content area, the masking-out above the content area chosen and mark Menu is noted, the mark menu includes plurality of kinds of contents type menu item.
By obtaining the source code such as html documents of webpage, stylesheet files such as css files are attached to html texts Shelves, and increase js (javascript) script in html documents, the effect of visualization framework of webpage can be built.Specifically, Can be realized by js scripts when detecting that some content area is selected, occur above the content area chosen masking-out and Menu is marked, the masking-out and the display mode of mark menu can be limited by the rule in stylesheet files.
With this visual effect framework, when the web page is displayed in a browser, each content area of the web page can have a visual effect. When a content area is selected (for example, when the mouse is detected moving over the content area, or, on a touch screen, when a tap or a swipe gesture is detected on the content area), a mask appears over that content area. The annotation menu can appear over the content area at the same time, or it can appear in response to a trigger; for example, right-clicking the selected content area brings up the content type menu items. As shown in Fig. 2 and Fig. 3, the content type menu items may include "Mark as title", "Mark as body text", and "Mark as date"; they may also include "Save annotations" and "End annotation".
Step 604: obtain the instructions, given on the mask, for annotating the content areas of the web page.
An instruction may be the content type, selected through the annotation menu, that corresponds to the selected content area. In this embodiment, annotation is performed at the client, which may be operated by a user, an operator, or an administrator. The web page can be annotated with a mouse: move the mouse over a content area and right-click, then click a content type menu item to complete the annotation of that content area. On a touch screen, the content type can be selected by touching a menu item. As shown in Fig. 2, clicking "Mark as title" annotates the corresponding content area as a title; as shown in Fig. 3, clicking "Mark as body text" annotates the corresponding content area as body text.
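The interaction described in steps 602 and 604 (select an area, open the menu, pick a content type) can be modeled as a small state holder. This is a minimal sketch, not the patent's implementation: the class `AnnotationSession`, its method names, and the area identifiers are hypothetical, and in a real page this logic would live in the JS script rather than in Python.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

# Content types taken from the menu items named in the text.
CONTENT_TYPES = ("title", "body", "date")


@dataclass
class AnnotationSession:
    selected: Optional[str] = None          # area currently under the mask
    menu_open: bool = False                 # whether the annotation menu is shown
    annotations: Dict[str, str] = field(default_factory=dict)

    def hover(self, area_id: str) -> None:
        """Mouse moved over an area: the mask moves there, the menu closes."""
        self.selected = area_id
        self.menu_open = False

    def right_click(self) -> None:
        """Right-click on the selected area opens the annotation menu."""
        if self.selected is not None:
            self.menu_open = True

    def choose(self, content_type: str) -> Tuple[str, str]:
        """Menu item clicked: record and return the annotation instruction."""
        if not self.menu_open or self.selected is None:
            raise RuntimeError("no content area is selected with the menu open")
        if content_type not in CONTENT_TYPES:
            raise ValueError(f"unknown content type: {content_type}")
        self.annotations[self.selected] = content_type
        self.menu_open = False
        return self.selected, content_type
```

Repeating `hover`, `right_click`, and `choose` for each content area accumulates the page's annotations in `annotations`.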
It can be seen that, in the technical solution of this embodiment, building the visual effect framework makes it possible to annotate a web page visually, which improves annotation efficiency. Moreover, because the web page content is displayed intuitively, the content types in the page structure are easy to determine, which improves annotation accuracy.
In addition, to make it easier to locate and present content areas in the web page, a hash value attribute can be added to the tag to which each content area belongs. In that case, before the step of building the visual effect framework for annotating the web page, the method of this embodiment may further include the following steps:
First, obtain the source code of the web page and generate the DOM tree of the web page from the source code;
Then, obtain the hash value of the tag corresponding to each node in the DOM tree;
Finally, add a hash value attribute to each tag of the web page.
The hash value may include the tag's hierarchy hash value in the DOM tree and the hash value of the tag itself. The hierarchy hash value can be computed from the hierarchical relationship of the current tag within the DOM tree, and the tag's own hash value can be computed from the attribute nodes that the current tag has.
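The two hash values can be illustrated with a short sketch. The patent does not name a hash function or say exactly which inputs are hashed, so everything specific here is an assumption: MD5 truncated to 8 hex characters, a hierarchy hash over the chain of ancestor tag names, and a self hash over the tag name plus its sorted attributes. The names `TagHasher` and `hash_tags` are hypothetical.

```python
import hashlib
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the path.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


def _digest(text: str) -> str:
    # Assumption: MD5, truncated for readability; any stable hash would do.
    return hashlib.md5(text.encode("utf-8")).hexdigest()[:8]


class TagHasher(HTMLParser):
    def __init__(self):
        super().__init__()
        self.path = []     # names of the currently open ancestor tags
        self.records = []  # (tag, hierarchy_hash, self_hash) per start tag

    def handle_starttag(self, tag, attrs):
        # Hierarchy hash: depends only on the tag's position in the tree.
        hierarchy_hash = _digest("/".join(self.path + [tag]))
        # Self hash: depends on the tag name and its attribute nodes.
        pairs = sorted((name, value or "") for name, value in attrs)
        attr_text = "&".join(f"{name}={value}" for name, value in pairs)
        self_hash = _digest(f"{tag}?{attr_text}")
        self.records.append((tag, hierarchy_hash, self_hash))
        if tag not in VOID_TAGS:
            self.path.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.path:
            while self.path.pop() != tag:
                pass


def hash_tags(html: str):
    parser = TagHasher()
    parser.feed(html)
    parser.close()
    return parser.records
```

With these choices, two tags at the same depth under the same ancestors share a hierarchy hash, while differing attributes give them different self hashes, which is one way the pair of values could help locate a content area.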
In a specific implementation, the tag hash values can be computed by the server. As shown in Fig. 10, the server 210 is located in the search engine 200, the search engine 200 is communicatively connected to multiple (three are shown in the figure) third-party website servers 300, and the server 210 can cooperate with the client 100 to generate web page templates. In this case, obtaining the hash value of the tag corresponding to each node in the DOM tree may include:
First, the client 100 adds an index attribute to each tag of the web page;
Then, the client 100 sends the source code of the web page, with the index attributes added, to the server 210;
Next, the server 210 computes the tag hash values;
Finally, the server 210 sends the correspondence between tag index values and hash values to the client 100.
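The four-step exchange above can be simulated in one process. This is a hedged sketch: the attribute names `data-index` and `data-hash`, the function names, and the MD5-based hash are all assumptions, the network transfer is replaced by plain function calls, and the regular expression used on the client side is a naive stand-in for real DOM manipulation (it would also match `<` followed by a letter inside scripts or comments).

```python
import hashlib
import itertools
import re
from html.parser import HTMLParser


def client_add_index(html: str) -> str:
    """Client: number every start tag with a hypothetical data-index attribute."""
    counter = itertools.count()
    return re.sub(r"<[a-zA-Z][a-zA-Z0-9]*",
                  lambda m: f'{m.group(0)} data-index="{next(counter)}"',
                  html)


class _IndexHasher(HTMLParser):
    def __init__(self):
        super().__init__()
        self.mapping = {}  # index value -> hash value

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        index = attr_map.pop("data-index", None)
        if index is None:
            return
        # Assumption: hash the tag name and its remaining attributes.
        parts = [f"{name}={value or ''}" for name, value in sorted(attr_map.items())]
        text = tag + "|" + "&".join(parts)
        self.mapping[int(index)] = hashlib.md5(text.encode("utf-8")).hexdigest()[:8]


def server_compute_hashes(indexed_html: str) -> dict:
    """Server: return the correspondence between index values and hash values."""
    parser = _IndexHasher()
    parser.feed(indexed_html)
    parser.close()
    return parser.mapping


def client_apply_hashes(indexed_html: str, mapping: dict) -> str:
    """Client: add a hypothetical data-hash attribute next to each index."""
    return re.sub(r'data-index="(\d+)"',
                  lambda m: f'{m.group(0)} data-hash="{mapping[int(m.group(1))]}"',
                  indexed_html)
```

The index attribute is what lets the client match each returned hash value to the right tag without the server sending the whole document back.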
When implementing the invention, annotation at the client may include the following steps. First, install the visual effect framework generation plug-in on the client and visit a web page on a third-party website server 300. Then, in one implementation, when the mouse moves over a content area of the web page, a light blue mask appears over the content area to indicate that it is selected; right-clicking brings up a selection menu, from which the content type of the content area (such as title or body text) can be chosen. Repeating these steps completes the annotation of the web page.
An embodiment of the invention also provides a device for providing visual annotation of a web page. Referring to Fig. 7, the device includes a visual effect framework builder 10 and an annotation instruction acquirer 20, wherein:
The visual effect framework builder 10 is adapted to build a visual effect framework for annotating the web page by means of a mask positioned over the content areas of the web page. In one implementation, the visual effect framework includes: content areas, a mask located over the selected content area, and an annotation menu; the annotation menu includes multiple content type menu items. The visual effect framework builder 10 can build the visual effect framework of the web page by obtaining the source code of the web page (for example, its HTML document), attaching a style sheet file (for example, a CSS file) to the HTML document, and adding a JS script to the HTML document.
The annotation instruction acquirer 20 is adapted to obtain the instructions, given on the mask, for annotating the content areas of the web page; an instruction is the content type, selected through the annotation menu, that corresponds to the selected content area. The web page can be annotated with a mouse or a touch screen. For example, the mouse is moved over a content area and right-clicked, and then a content type menu item is clicked to complete the annotation of that content area; the annotation instruction acquirer 20 can detect the annotation operation and obtain the content type selected through the right-click menu.
In this embodiment, to make it easier to locate and present content areas in the web page, a hash value attribute can also be added to the tag to which each content area belongs. To that end, the device for providing visual annotation of a web page may further include a DOM tree generator, a hash value acquirer, and a hash value attribute adder. The DOM tree generator obtains the source code of the web page and generates the DOM tree of the web page from the source code; the hash value acquirer obtains the hash value of the tag corresponding to each node in the DOM tree; the hash value attribute adder adds a hash value attribute to each tag of the web page. The hash value may include the tag's hierarchy hash value in the DOM tree and the hash value of the tag itself. The hierarchy hash value can be computed from the hierarchical relationship of the current tag within the DOM tree, and the tag's own hash value can be computed from the attribute nodes that the current tag has.
In a specific implementation, the tag hash values can be computed by the server. In this case, the hash value acquirer obtains the tag hash values as follows: it adds an index attribute to each tag of the web page; sends the source code of the web page, with the index attributes added, to the server so that the server can compute the tag hash values; and receives the correspondence between tag index values and hash values sent by the server.
Embodiment 3
This embodiment provides a method and device for extracting web page content according to a visual template.
Fig. 10 shows a structural diagram of a system for extracting web page content according to a visual template, according to an embodiment of the invention. Referring to Fig. 10, the system includes a client 100, a search engine 200, and multiple (three are shown in the figure) third-party website servers 300. The search engine 200 includes a server 210 and is communicatively connected to the third-party website servers 300. The server 210 can cooperate with the client 100 to generate web page templates, and the search engine 200 can extract web page content according to a web page template, that is, extract the structured content of web pages on the third-party website servers 300 according to the template.
Fig. 8 shows a flowchart of a method for extracting web page content according to a visual template, according to an embodiment of the invention. Referring to Fig. 8, the method includes:
Step 802: when performing a directed crawl of a target website, look up whether the web page template library contains a web page template, generated from visual annotation, that corresponds to the target website;
The web page template library stores web page templates generated from visual annotation. A web page template may be one generated according to the scheme of Embodiment 1 or Embodiment 2. The library stores multiple web page templates, and a web page template can be identified by the home page URL of a website. Whether a corresponding web page template exists can be looked up in the library by the home page URL of the target website.
Step 804: if the web page template library contains a web page template, generated from visual annotation, that corresponds to the target website, extract content from the target website according to that web page template.
The URLs of all outgoing links on the home page can be extracted according to the home page URL, those that lead to other websites are removed, and the remaining URLs are put into a scheduling queue; then content extraction is performed, according to the web page template, on the web page corresponding to each URL in the scheduling queue. Content extraction can be performed by a web crawler, which may be a web spider, a crawler, a search robot, a web crawling script, or the like.
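Steps 802 and 804 up to the point of filling the scheduling queue can be sketched as below. The sketch makes several assumptions that the text does not state: the template library is a plain dictionary keyed by home page URL, "same website" is decided by comparing host names, duplicate URLs are dropped, and the home page HTML is passed in rather than fetched. The names `build_schedule_queue` and `plan_crawl` are hypothetical, and the extraction itself is left out because the template format is not described in this section.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class _LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def build_schedule_queue(home_url: str, home_html: str) -> deque:
    """Queue the home page's links that stay on the same host."""
    collector = _LinkCollector()
    collector.feed(home_html)
    home_host = urlparse(home_url).netloc
    queue, seen = deque(), set()
    for href in collector.hrefs:
        url = urljoin(home_url, href)          # resolve relative links
        if urlparse(url).netloc != home_host:  # drop links to other websites
            continue
        if url not in seen:                    # assumption: skip duplicates
            seen.add(url)
            queue.append(url)
    return queue


def plan_crawl(home_url: str, home_html: str, template_library: dict):
    """Step 802: look up the template; step 804: prepare the URLs to extract."""
    template = template_library.get(home_url)
    if template is None:
        return None
    return template, build_schedule_queue(home_url, home_html)
```

A crawler would then take URLs from the queue one at a time and apply the returned template to each fetched page.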
In the technical solution of this embodiment, web page content is extracted using a web page template generated from visual annotation. Because such a template is more accurate, content extraction based on it is more accurate as well.
Fig. 9 shows a structural diagram of a device for extracting web page content according to a visual template, according to an embodiment of the invention. Referring to Fig. 9, the device includes a web page template library 902, a finder 904, and a content extractor 906, wherein:
The web page template library 902 is adapted to store web page templates generated from visual annotation. A web page template can be identified by the URL of a web page, or by the home page URL of a website.
The finder 904 is adapted to look up, when performing a directed crawl of a target website, whether the web page template library contains a web page template, generated from visual annotation, that corresponds to the target website.
The content extractor 906 is adapted to extract content from the target website according to the web page template when the web page template library contains a web page template, generated from visual annotation, that corresponds to the target website. The content extractor 906 can extract the URLs of all outgoing links on the home page according to the home page URL, remove those that lead to other websites, and put the remaining URLs into a scheduling queue; it then performs content extraction, according to the web page template, on the web page corresponding to each URL in the scheduling queue. The content extractor 906 may be a web spider, a crawler, a search robot, a web crawling script, or the like.
Optionally, the device further includes means for generating web page templates; that is, it may include the visual effect framework builder 10, the annotation instruction acquirer 20, and the web page template generator 30 of Embodiment 1. For the connections between these modules and their working principles, see the description in Embodiment 1.
In summary, in the technical solution of the embodiments of the invention, building a visual effect framework for annotating a web page removes the need to edit web page template text by hand: selecting content areas in the visual effect framework and operating on them visually is enough to annotate the web page content. This improves annotation efficiency and, in turn, the efficiency of generating web page templates. Moreover, because the web page content is displayed intuitively, the content types in the page structure are easy to determine without specialist knowledge of web page design, which improves annotation accuracy and, in turn, the accuracy of the generated web page templates. Further, because the web page templates are more accurate, the content obtained by crawling web pages according to those templates is more accurate as well.
The algorithms and displays provided herein are not inherently related to any particular computer, virtual system, or other apparatus. Various general-purpose systems may also be used with the teachings herein. From the description above, the structure required to construct such systems is apparent. Furthermore, the invention is not directed to any particular programming language. It should be understood that the content of the invention described herein can be implemented in a variety of programming languages, and that the description above of a specific language is given to disclose the best mode of the invention.
In the specification provided here, numerous specific details are set forth. It is to be understood, however, that embodiments of the invention can be practiced without these specific details. In some instances, well-known methods, structures, and techniques are not shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid the understanding of one or more of the various inventive aspects, in the above description of exemplary embodiments of the invention, features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single embodiment disclosed above. Therefore, the claims following the detailed description are hereby expressly incorporated into that detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will appreciate that the modules in the device of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units, or components of an embodiment can be combined into one module, unit, or component, and can furthermore be divided into multiple sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings), and all processes or units of any method or device so disclosed, may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.
Furthermore, those skilled in the art will understand that, although some embodiments described herein include some features included in other embodiments but not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.
The various component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the device for providing visual annotation of a web page according to embodiments of the invention. The invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals can be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means can be embodied by one and the same item of hardware. The use of the words first, second, and third does not indicate any order; these words may be interpreted as names.

Claims (10)

1. A method for providing visual annotation of a web page, comprising:
building a visual effect framework for annotating the web page by means of a mask positioned over the content areas of the web page, wherein the visual effect framework includes content areas, a mask located over a selected content area, and an annotation menu, the annotation menu including multiple content type menu items;
when it is detected that a content area is selected, displaying the mask and the annotation menu over the selected content area;
obtaining instructions, given on the mask, for annotating the content areas of the web page.
2. The method of claim 1, wherein the instruction is the content type, selected through the annotation menu, that corresponds to the selected content area.
3. The method of claim 1, wherein, before building the visual effect framework for annotating the web page, the method further comprises:
obtaining the source code of the web page, and generating the DOM tree of the web page from the source code;
obtaining the hash value of the tag corresponding to each node in the DOM tree;
adding a hash value attribute to each tag of the web page, wherein the hash value is used to locate and present content areas in the web page.
4. The method of claim 3, wherein the hash value includes:
the tag's hierarchy hash value in the DOM tree and the hash value of the tag itself.
5. The method of claim 3, wherein obtaining the hash value of the tag corresponding to each node in the DOM tree comprises:
adding an index attribute to each tag of the web page;
sending the source code of the web page, with the index attributes added, to a server so that the server computes the tag hash values;
receiving the correspondence between tag index values and hash values sent by the server.
6. A device for providing visual annotation of a web page, comprising:
a visual effect framework builder, adapted to build a visual effect framework for annotating the web page by means of a mask positioned over the content areas of the web page, wherein the visual effect framework includes content areas, a mask located over a selected content area, and an annotation menu, the annotation menu including multiple content type menu items; and, when it is detected that a content area is selected, to display the mask and the annotation menu over the selected content area;
an annotation instruction acquirer, adapted to obtain instructions, given on the mask, for annotating the content areas of the web page.
7. The device of claim 6, wherein the instruction is the content type, selected through the annotation menu, that corresponds to the selected content area.
8. The device of claim 6, further comprising:
a DOM tree generator, adapted to obtain the source code of the web page and generate the DOM tree of the web page from the source code;
a hash value acquirer, adapted to obtain the hash value of the tag corresponding to each node in the DOM tree;
a hash value attribute adder, adapted to add a hash value attribute to each tag of the web page, wherein the hash value is used to locate and present content areas in the web page.
9. The device of claim 8, wherein the hash value includes:
the tag's hierarchy hash value in the DOM tree and the hash value of the tag itself.
10. The device of claim 8, wherein the hash value acquirer is further adapted to:
add an index attribute to each tag of the web page;
send the source code of the web page, with the index attributes added, to a server so that the server computes the tag hash values;
receive the correspondence between tag index values and hash values sent by the server.
CN201310606202.6A 2013-11-25 2013-11-25 The method and device of visualization mark is provided webpage Active CN103678510B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310606202.6A CN103678510B (en) 2013-11-25 2013-11-25 The method and device of visualization mark is provided webpage

Publications (2)

Publication Number Publication Date
CN103678510A CN103678510A (en) 2014-03-26
CN103678510B true CN103678510B (en) 2018-02-02

Family

ID=50316055

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310606202.6A Active CN103678510B (en) 2013-11-25 2013-11-25 The method and device of visualization mark is provided webpage

Country Status (1)

Country Link
CN (1) CN103678510B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105912613A (en) * 2016-04-06 2016-08-31 江苏中威科技软件系统有限公司 Website template quick migration method
CN107423322B (en) * 2017-03-31 2020-03-03 广州视源电子科技股份有限公司 Method and device for displaying label nesting hierarchy of webpage
CN109189688B (en) * 2018-09-11 2022-06-03 北京奇艺世纪科技有限公司 Test case script generation method and device and electronic equipment
CN109522490B (en) * 2018-09-18 2021-11-09 武汉大学 Map visualization method for internet information
CN111400581B (en) * 2020-03-13 2024-02-06 京东科技控股股份有限公司 System, method and apparatus for labeling samples
CN113419781A (en) * 2021-07-19 2021-09-21 湖南四方天箭信息科技有限公司 Crawler method and device based on Chrome plug-in, computer equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101192234A (en) * 2007-06-07 2008-06-04 腾讯科技(深圳)有限公司 Searching system and method based on web page extraction
CN102682055A (en) * 2011-01-03 2012-09-19 三星电子株式会社 Method and apparatus for managing e-book contents

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050237336A1 (en) * 2004-04-23 2005-10-27 Jens Guhring Method and system for multi-object volumetric data visualization
CN101777045A (en) * 2008-09-01 2010-07-14 西北工业大学 Method for analyzing XML file by indexing
CN101464905B (en) * 2009-01-08 2011-03-23 中国科学院计算技术研究所 Web page information extraction system and method
CN102779172B (en) * 2012-06-25 2016-06-01 北京奇虎科技有限公司 The recognition system of non-body text and method in a kind of webpage

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"一种基于模板的快速网页文本自动抽取算法";陈冶昂 等;《计算机应用研究》;20090731;第26卷(第7期);第2646-2649页 *

Also Published As

Publication number Publication date
CN103678510A (en) 2014-03-26

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220725

Address after: Room 801, 8th floor, No. 104, floors 1-19, building 2, yard 6, Jiuxianqiao Road, Chaoyang District, Beijing 100015

Patentee after: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Address before: 100088 room 112, block D, 28 new street, new street, Xicheng District, Beijing (Desheng Park)

Patentee before: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Patentee before: Qizhi software (Beijing) Co.,Ltd.

TR01 Transfer of patent right