CN106897287A - Web page publication time extraction method and device for web page publication time extraction - Google Patents

Web page publication time extraction method and device for web page publication time extraction

Publication number
CN106897287A
Authority
CN
China
Prior art keywords
node
time
web page
timing
page title
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510955640.2A
Other languages
Chinese (zh)
Other versions
CN106897287B (en)
Inventor
丁圣勇
黄志兰
樊勇兵
陈楠
金华敏
赖培源
区洪辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Telecom Corp Ltd
Original Assignee
China Telecom Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Telecom Corp Ltd
Priority to CN201510955640.2A
Publication of CN106897287A
Application granted
Publication of CN106897287B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Abstract

The invention discloses a web page publication time extraction method and a device for web page publication time extraction, and relates to the field of cloud computing. The method includes: building a Document Object Model (DOM) tree of the web page source code; determining a web page title node in the DOM tree; and determining the web page publication time according to the relative position of the web page publication time node and the web page title node in the DOM tree. Because the publication time is determined from the positional relationship between the publication time node and the title node in the DOM, the publication time can be located accurately, which makes the method suitable for automated web page publication time extraction.

Description

Web page publication time extraction method and device for web page publication time extraction
Technical field
The present invention relates to the field of cloud computing, and in particular to a web page publication time extraction method and a device for web page publication time extraction.
Background
In the Internet era, web pages are an important carrier for publishing information. Beyond obtaining and reading information directly from web pages, in-depth analysis of that information is also a focus of attention.
Analyzing the information in a web page first requires parsing the content of the page. In web page extraction, especially for news and information pages, the publication time is an important attribute. At present, the publication time is extracted mainly with regular expression rules. However, a web page usually contains multiple times, and a time obtained by simple regular expression matching alone cannot be confirmed as the publication time. In addition, when crawling a web page, a search engine often takes the time in the HTTP (HyperText Transfer Protocol) header of the page as its publication time. But the time in the HTTP header is the last modification time of the page, and a page may be modified after it is published, so this time cannot represent the publication time.
Summary of the invention
A technical problem to be solved by an embodiment of the present invention is how to extract the publication time of a web page accurately.
According to one aspect of the embodiments of the present invention, a web page publication time extraction method is provided, characterized by including: building a Document Object Model (DOM) tree of the web page source code; determining a web page title node in the DOM tree; and determining the web page publication time according to the relative position of the web page publication time node and the web page title node in the DOM tree.
In one embodiment, determining the web page publication time according to the relative position of the web page publication time node and the web page title node in the DOM tree includes: if there is a time node under the parent node of the web page title node, extracting the time information in the time node as the web page publication time.
In one embodiment, if there is a time leaf node under the parent node of the node corresponding to the tag containing the web page title, the time information in the time leaf node is extracted as the web page publication time; or, if there is a tag containing a time under the parent node of the node corresponding to the tag containing the web page title, the time information is extracted from that tag and used as the web page publication time.
In one embodiment, determining the web page publication time according to the relative position of the web page publication time node and the web page title node in the DOM tree includes: determining the second-left child node of the parent node of the web page title node and, if the left subtree node among the subtrees of the second-left child node is a time node, extracting the time information in the time node as the web page publication time.
In one embodiment, determining the web page publication time according to the relative position of the web page publication time node and the web page title node in the DOM tree includes: if there is a time node under the parent node of the web page title node, extracting the time information in the time node as the web page publication time; if there is no time node under the parent node of the web page title node, determining the second-left child node of the parent node of the web page title node and, if the left subtree node among the subtrees of the second-left child node is a time node, extracting the time information in the time node as the web page publication time.
In one embodiment, determining the web page publication time according to the relative position of the web page publication time node and the web page title node in the DOM tree includes: if there is a time leaf node under the parent node of the node corresponding to the tag containing the web page title, extracting the time information in the time leaf node as the web page publication time; if there is no time leaf node under that parent node, searching under that parent node for a tag containing a time and, if one is found, extracting the time information from that tag as the web page publication time; if there is no tag containing a time under that parent node, determining the second-left child node of the parent node of the web page title node and, if the left subtree node among the subtrees of the second-left child node is a time node, extracting the time information in the time node as the web page publication time.
In one embodiment, determining the web page publication time according to the relative position of the web page publication time node and the web page title node in the DOM tree includes: searching for time nodes in the DOM tree, judging whether each time node found and the web page title node satisfy the relative position relationship between a web page publication time node and a web page title node in the DOM tree, determining a time node that satisfies the condition as the web page publication time node, and extracting the web page publication time from the web page publication time node.
In one embodiment, the method further includes: if multiple time nodes satisfy the condition, determining the qualifying time node closest to the root node of the DOM tree as the web page publication time node.
In one embodiment, determining the web page title node in the DOM tree includes: determining the web page title node in the DOM tree according to the tag type, the id (unique identifier) attribute, or the class attribute of the tag containing the web page title.
According to a second aspect of the embodiments of the present invention, a device for web page publication time extraction is provided, including: a Document Object Model (DOM) tree building module, configured to build the DOM tree of the web page source code; a title node determining module, configured to determine the web page title node in the DOM tree; and a publication time determining module, configured to determine the web page publication time according to the relative position of the web page publication time node and the web page title node in the DOM tree.
In one embodiment, the publication time determining module includes a first time node lookup unit and a first time information extraction unit. The first time node lookup unit is configured to look up whether there is a time node under the parent node of the web page title node; if there is, the first time information extraction unit is configured to extract the time information in the time node as the web page publication time.
In one embodiment, the first time node lookup unit is configured to look up whether there is a time leaf node under the parent node of the node corresponding to the tag containing the web page title; if there is, the first time information extraction unit is configured to extract the time information in the time leaf node as the web page publication time. Alternatively, the first time node lookup unit is configured to look up whether there is a tag containing a time under the parent node of the node corresponding to the tag containing the web page title; if there is, the first time information extraction unit is configured to extract the time information from that tag as the web page publication time.
In one embodiment, the publication time determining module includes a second time node lookup unit and a second time information extraction unit. The second time node lookup unit is configured to determine the second-left child node of the parent node of the web page title node and to look up whether the left subtree node among the subtrees of the second-left child node contains a time node; if it does, the second time information extraction unit is configured to extract the time information in the time node as the web page publication time.
In one embodiment, the publication time determining module includes a third time node lookup unit and a third time information extraction unit. The publication time determining module is configured to look up whether there is a time node under the parent node of the web page title node; if there is, the third time information extraction unit is configured to extract the time information in the time node as the web page publication time. The publication time determining module is further configured, when there is no time node under the parent node of the web page title node, to determine the second-left child node of the parent node of the web page title node and to judge whether the left subtree node among the subtrees of the second-left child node is a time node; if it is, the third time information extraction unit is configured to extract the time information in the time node as the web page publication time.
In one embodiment, the publication time determining module includes a fourth time node lookup unit and a fourth time information extraction unit. The fourth time node lookup unit is configured to look up whether there is a time leaf node under the parent node of the node corresponding to the tag containing the web page title; if there is, the fourth time information extraction unit is configured to extract the time information in the time leaf node as the web page publication time. The fourth time node lookup unit is further configured, when there is no time leaf node under that parent node, to look up whether there is a tag containing a time under that parent node; if there is, the fourth time information extraction unit is configured to extract the time information from that tag as the web page publication time. The fourth time node lookup unit is further configured, when there is no tag containing a time under that parent node, to determine the second-left child node of the parent node of the web page title node and to look up whether the left subtree node among the subtrees of the second-left child node contains a time node; if it does, the fourth time information extraction unit is configured to extract the time information in the time node as the web page publication time.
In one embodiment, the publication time determining module includes a fifth time node lookup unit, a position relationship judging unit, and a fifth time information extraction unit. The fifth time node lookup unit is configured to search for time nodes in the DOM tree. The position relationship judging unit is configured to judge whether each time node found and the web page title node satisfy the relative position relationship between a web page publication time node and a web page title node in the DOM tree. The fifth time information extraction unit is configured to determine a time node that satisfies the condition as the web page publication time node and to extract the web page publication time from the web page publication time node.
In one embodiment, the fifth time information extraction unit is configured, when multiple time nodes satisfy the condition, to determine the qualifying time node closest to the root node of the DOM tree as the web page publication time node.
In one embodiment, the title node determining module is configured to determine the web page title node in the DOM tree according to the tag type, the id (unique identifier) attribute, or the class attribute of the tag containing the web page title.
Because the publication time is determined from the positional relationship between the publication time node and the title node in the DOM, the publication time can be located accurately, which makes the approach suitable for automated web page publication time extraction.
Other features and advantages of the present invention will become clear from the following detailed description of exemplary embodiments with reference to the drawings.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 shows a flow chart of one embodiment of the web page publication time extraction method of the present invention.
Fig. 2 shows a schematic partial screenshot of an information web page.
Fig. 3 shows a schematic diagram of the DOM tree formed by the text and tags in one version of the code corresponding to the page fragment shown in Fig. 2.
Fig. 4 shows a schematic diagram of the DOM tree formed by the text and tags in another version of the code corresponding to the page fragment shown in Fig. 2.
Fig. 5 shows a schematic diagram of the DOM tree formed by the text and tags in yet another version of the code corresponding to the page fragment shown in Fig. 2.
Fig. 6 shows a structural diagram of one embodiment of the device of the present invention for web page publication time extraction.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are only some, not all, of the embodiments of the present invention. The following description of at least one exemplary embodiment is merely illustrative and in no way limits the present invention or its application or use. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
The web page publication time extraction method of one embodiment of the present invention is described below with reference to Fig. 1.
Fig. 1 is a flow chart of one embodiment of the web page publication time extraction method of the present invention. As shown in Fig. 1, the method of this embodiment includes:
Step S102: build the DOM (Document Object Model) tree of the web page source code.
The DOM provides a platform-independent and language-independent way to access and modify the content and structure of HTML and XML documents. In a DOM tree built from an HTML document, every tag and every text node is a node of the tree.
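As a minimal sketch of step S102, the following Python snippet builds a DOM tree with the standard library's `xml.dom.minidom` and lists its nodes. The page fragment is hypothetical, and `minidom` is an assumption made only to keep the example self-contained: it accepts well-formed markup only, so real-world HTML would need a more tolerant parser.

```python
from xml.dom import minidom

# Hypothetical page fragment; real pages are rarely well-formed XML.
SOURCE = "<div><h1>Title</h1>2016-1-1<span>120 reads</span></div>"

def describe(node, depth=0):
    """Return one line per DOM node, indented by its depth in the tree."""
    if node.nodeType == node.TEXT_NODE:
        label = "#text " + repr(node.data)
    else:
        label = node.tagName
    lines = ["  " * depth + label]
    for child in node.childNodes:
        lines.extend(describe(child, depth + 1))
    return lines

dom = minidom.parseString(SOURCE)
tree_lines = describe(dom.documentElement)
print("\n".join(tree_lines))
```

Both tags (`div`, `h1`, `span`) and text fragments appear as separate nodes, which is what the later steps rely on.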
Step S104: determine the web page title node in the DOM tree.
The web page title node can be the node in the DOM tree that holds the title text, or the node of the tag that contains the title.
Step S106: determine the web page publication time according to the relative position of the web page publication time node and the web page title node in the DOM tree.
The web page publication time node can be the node in the DOM tree that holds the publication time text, or the node of the tag that contains the publication time.
Because the publication time is determined from the positional relationship between the publication time node and the title node in the DOM, the publication time can be located accurately, which makes the method suitable for automated web page publication time extraction.
In step S104, the web page title node can be determined in the DOM tree according to the tag type, the id (unique identifier) attribute, or the class attribute of the tag containing the web page title. Title text is usually placed in a dedicated tag, for example an h1 tag, which denotes a large heading, or an a tag, which denotes a link. Besides identifying the title's tag by such special tags, the DOM node containing the web page title can also be determined from specific attribute values of that tag. A tag can indicate the meaning of its content through the value of its id attribute or class attribute. A tag holding a title can be, for example: <div id="title">Title</div>, where the div tag uses the value "title" in its id attribute to mark its content as the title. Other methods can of course also be used to determine the tag containing the web page title and are not described further here.
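The title lookup in step S104 could be sketched roughly as follows. The function names `iter_elements` and `find_title_node` are hypothetical, and the concrete rules (an `h1` tag, or the word "title" in the `id` or `class` attribute) are illustrative assumptions rather than an exhaustive list from the patent.

```python
from xml.dom import minidom

def iter_elements(node):
    """Yield element nodes in document order."""
    if node.nodeType == node.ELEMENT_NODE:
        yield node
    for child in node.childNodes:
        yield from iter_elements(child)

def find_title_node(root):
    """Return the first element that looks like a title, or None.

    Assumed rules: the tag is h1, or its id contains "title",
    or "title" is one of its class names.
    """
    for el in iter_elements(root):
        if el.tagName.lower() == "h1":
            return el
        if "title" in el.getAttribute("id").lower():
            return el
        if "title" in el.getAttribute("class").lower().split():
            return el
    return None

page = minidom.parseString('<body><div id="title">Title</div></body>')
title_node = find_title_node(page.documentElement)
```

Here `title_node` is the `div` whose `id` is "title"; a page that uses an `h1` heading instead would be matched by the tag rule.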
Usually, after a web page is parsed and rendered in a browser, the publication time immediately follows the title. Because front-end layouts differ between pages, two pages that render identically may have differently structured source code; that is, the relative position of the publication time node and the title node in the DOM tree may differ. Statistics and analysis over a large number of information pages show that there are two main relative positions between the DOM node containing the publication time and the DOM node containing the title. Fig. 2 is a schematic partial screenshot of an information web page. Taking Fig. 2 as an example, the two methods for determining the publication time according to the relative position of the publication time node and the title node in the DOM tree are described below.
The first determination method is: if there is a time node under the parent node of the web page title node, extract the time information in the time node as the web page publication time. Depending on whether the publication time node is a text node or a tag node, the first determination method takes one of two forms: if there is a time leaf node under the parent node of the node corresponding to the tag containing the web page title, the time information in the time leaf node is extracted as the web page publication time; or, if there is a tag containing a time under that parent node, the time information is extracted from that tag and used as the web page publication time. Both forms are illustrated below using code for the page fragment in Fig. 2.
When there is a time leaf node under the parent node of the node corresponding to the tag containing the web page title, the code for the part of the page shown in Fig. 2 can be, for example:
<div><h1>A web page publication time extraction method</h1>2016-1-1<span>120 reads</span></div>
Fig. 3 is a schematic structural diagram of the DOM tree formed by the text and tags in the code above. As shown in Fig. 3, the tag node containing the title is the h1 node, and its parent is the div node. Searching the other child nodes of the div node yields the time leaf node "2016-1-1", which is the web page publication time.
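The time-leaf form of the first determination method might be sketched as below. The helper name `time_leaf_under_parent` and the date pattern `DATE_RE` (which only covers dates written like 2016-1-1) are assumptions made for illustration, and `minidom` is used only because the sample markup is well-formed.

```python
import re
from xml.dom import minidom

# Assumed date pattern (YYYY-M-D); real pages need a broader pattern.
DATE_RE = re.compile(r"\d{4}-\d{1,2}-\d{1,2}")

def time_leaf_under_parent(title_node):
    """Return the date found in a text node that is a sibling of the title."""
    for sibling in title_node.parentNode.childNodes:
        if sibling is title_node:
            continue
        if sibling.nodeType == sibling.TEXT_NODE:
            match = DATE_RE.search(sibling.data)
            if match:
                return match.group(0)
    return None

dom = minidom.parseString(
    "<div><h1>Title</h1>2016-1-1<span>120 reads</span></div>"
)
title = dom.getElementsByTagName("h1")[0]
publication_time = time_leaf_under_parent(title)
print(publication_time)
```

Only text nodes that are direct children of the title's parent are inspected, so the "120 reads" text inside the `span` is never considered.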
When there is a tag containing a time under the parent node of the node corresponding to the tag containing the web page title, the code for the part of the page shown in Fig. 2 can be, for example:
<div><h1>A web page publication time extraction method</h1><span>2016-1-1</span><span>120 reads</span></div>
Fig. 4 is a schematic structural diagram of the DOM tree formed by the text and tags in the code above. As shown in Fig. 4, the tag node containing the title is the h1 node, and its parent is the div node. Searching the other child nodes of the div node yields the span tag containing the time "2016-1-1". The time information is then extracted from the span tag, and "2016-1-1" is the web page publication time.
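A sketch of the tag form of the first determination method follows. As before, `time_tag_under_parent` and `DATE_RE` are hypothetical names, and the date format is an assumption; the function only looks at text held directly by a sibling tag of the title.

```python
import re
from xml.dom import minidom

DATE_RE = re.compile(r"\d{4}-\d{1,2}-\d{1,2}")  # assumed date format

def time_tag_under_parent(title_node):
    """Return the date held directly by a sibling tag of the title, or None."""
    for sibling in title_node.parentNode.childNodes:
        if sibling is title_node or sibling.nodeType != sibling.ELEMENT_NODE:
            continue
        for child in sibling.childNodes:
            if child.nodeType == child.TEXT_NODE:
                match = DATE_RE.search(child.data)
                if match:
                    return match.group(0)
    return None

dom = minidom.parseString(
    "<div><h1>Title</h1><span> 2016-1-1</span><span>120 reads</span></div>"
)
title = dom.getElementsByTagName("h1")[0]
publication_time = time_tag_under_parent(title)
print(publication_time)
```

Using `search` rather than a full match lets the sketch tolerate surrounding whitespace or labels inside the tag.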
The first determination method is generally used for pages whose front-end code has a relatively simple hierarchical layout. The title node and the time node are not only visually close in the rendered page but also only a small number of levels apart in the DOM tree. With this method, the publication time node can be located quickly and the publication time extracted accurately.
For the two specific forms of the first determination method, the former can be used first and then the latter. That is, if there is a time node under the parent node of the web page title node, the time information in the time node is extracted as the web page publication time; if there is no time node under the parent node of the web page title node, the second-left child node of the parent node of the web page title node is determined and, if the left subtree node among the subtrees of the second-left child node is a time node, the time information in the time node is extracted as the web page publication time.
The second determination method is: determine the second-left child node of the parent node of the web page title node and, if the left subtree node among the subtrees of the second-left child node is a time node, extract the time information in the time node as the web page publication time. When the publication time is obtained with the second determination method, the code for the part of the page shown in Fig. 2 can be, for example:
<div><h1>A web page publication time extraction method</h1><p><span>2016-1-1</span><h4>120 reads</h4></p></div>
Fig. 5 is a schematic structural diagram of the DOM tree formed by the text and tags in the code above. As shown in Fig. 5, the parent node of the web page title node is the div node, and the second-left child node of the div node is the p tag. There are two subtrees under the p tag: one rooted at span and one rooted at h4. The subtree rooted at span is the left subtree of the p tag. Searching for a time node in the subtree rooted at span yields the time information "2016-1-1", which is the web page publication time. This situation usually occurs in pages whose front-end code has a relatively complex hierarchical layout.
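The second determination method could be sketched as follows. Several points are assumptions for illustration: the helper names, the date pattern, and the choice to count only element children (so that whitespace text nodes do not shift which child is "second from the left"). The sample markup is parsed as XML by `minidom`, which keeps the `h4` nested inside the `p` exactly as written.

```python
import re
from xml.dom import minidom

DATE_RE = re.compile(r"\d{4}-\d{1,2}-\d{1,2}")  # assumed date format

def element_children(node):
    """Child element nodes only, in left-to-right order."""
    return [c for c in node.childNodes if c.nodeType == c.ELEMENT_NODE]

def all_text(node):
    """Concatenate all text beneath a node."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(all_text(c) for c in node.childNodes)

def time_in_second_left_child(title_node):
    """Look for a date in the left subtree of the parent's second-left child."""
    children = element_children(title_node.parentNode)
    if len(children) < 2:
        return None
    subtrees = element_children(children[1])
    if not subtrees:
        return None
    match = DATE_RE.search(all_text(subtrees[0]))
    return match.group(0) if match else None

dom = minidom.parseString(
    "<div><h1>Title</h1><p><span>2016-1-1</span><h4>120 reads</h4></p></div>"
)
title = dom.getElementsByTagName("h1")[0]
publication_time = time_in_second_left_child(title)
print(publication_time)
```

A date that appears only in a right-hand subtree of the second-left child is deliberately ignored by this sketch.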
The second determination method is generally used for pages whose front-end code has a relatively complex hierarchical layout. Although the title node and the time node are further apart in level in the DOM tree, the method above can still identify the time node that is visually close to the title in the rendered page. With this method, complex page layouts can be handled and the publication time extracted accurately.
The two determination methods can be combined: the first is used first and then the second. That is, time nodes are first sought among the nodes at levels close to the title node, and then among the nodes at levels further from the title node.
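The fallback order described here, nearer levels first and deeper levels second, might look like the sketch below. All function names and the date pattern are hypothetical, and the second stage counts element children only, which is an assumption about how "second-left child" is meant.

```python
import re
from xml.dom import minidom

DATE_RE = re.compile(r"\d{4}-\d{1,2}-\d{1,2}")  # assumed date format

def find_date(text):
    match = DATE_RE.search(text)
    return match.group(0) if match else None

def own_text(node):
    """Text held directly by a node: its data, or its text-node children."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(c.data for c in node.childNodes if c.nodeType == c.TEXT_NODE)

def all_text(node):
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(all_text(c) for c in node.childNodes)

def first_method(title):
    """Time leaf nodes first, then tags that directly contain a time."""
    siblings = [s for s in title.parentNode.childNodes if s is not title]
    for wanted in (title.TEXT_NODE, title.ELEMENT_NODE):
        for sibling in siblings:
            if sibling.nodeType == wanted:
                date = find_date(own_text(sibling))
                if date:
                    return date
    return None

def second_method(title):
    """Left subtree of the second-left child of the title's parent."""
    elements = [c for c in title.parentNode.childNodes
                if c.nodeType == c.ELEMENT_NODE]
    if len(elements) < 2:
        return None
    subtrees = [c for c in elements[1].childNodes
                if c.nodeType == c.ELEMENT_NODE]
    return find_date(all_text(subtrees[0])) if subtrees else None

def publication_time(title):
    """Try the nearer levels first, then fall back to the deeper ones."""
    return first_method(title) or second_method(title)

def extract(source):
    dom = minidom.parseString(source)
    return publication_time(dom.getElementsByTagName("h1")[0])

nested = "<div><h1>T</h1><p><span>2016-1-1</span><h4>120 reads</h4></p></div>"
print(extract(nested))
```

For the nested sample, the first stage finds nothing directly under the `div`, so the result comes from the second stage.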
In the two determination methods above, the order of determination is: first identify the nodes that satisfy the positional relationship between the title node and the publication time node, then judge whether those nodes are time nodes. Alternatively, the order can be reversed: first identify the time nodes in the page, then judge whether those time nodes satisfy the positional relationship. That is: search for time nodes in the DOM tree, judge whether each time node found and the web page title node satisfy the relative position relationship between a web page publication time node and a web page title node in the DOM tree, determine a time node that satisfies the condition as the web page publication time node, and extract the web page publication time from it. The two orders approach the search and the judgment from different directions. When the hierarchical structure of the front-end code is relatively simple, the former is faster; when the page contains little time information, the latter is more efficient.
When the second order is used, that is, time nodes are found first and the positional relationship is judged afterwards, multiple results may be obtained. A result can then be chosen as follows: if multiple time nodes satisfy the condition, the qualifying time node closest to the root node of the DOM tree is determined as the web page publication time node. The qualifying time node closest to the root of the DOM tree has a simpler hierarchical relationship with the title node and is therefore more likely than the other nodes to contain the web page publication time.
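The reversed order, with the closest-to-root tie-break, could be sketched as below. The positional check is deliberately simplified and is an assumption of this sketch: a time node qualifies if it lies under the title's parent but not inside the title itself. The helper names and the date pattern are likewise hypothetical.

```python
import re
from xml.dom import minidom

DATE_RE = re.compile(r"\d{4}-\d{1,2}-\d{1,2}")  # assumed date format

def time_text_nodes(node):
    """Yield every text node whose content contains a date."""
    if node.nodeType == node.TEXT_NODE and DATE_RE.search(node.data):
        yield node
    for child in node.childNodes:
        yield from time_text_nodes(child)

def depth(node):
    """Number of ancestors between the node and the document."""
    d = 0
    while node.parentNode is not None:
        node = node.parentNode
        d += 1
    return d

def is_inside(node, ancestor):
    while node is not None:
        if node is ancestor:
            return True
        node = node.parentNode
    return False

def pick_publication_time(root, title):
    """Find time nodes first, filter by position, keep the one nearest the root."""
    scope = title.parentNode  # simplified stand-in for the position check
    candidates = [n for n in time_text_nodes(root)
                  if is_inside(n, scope) and not is_inside(n, title)]
    if not candidates:
        return None
    best = min(candidates, key=depth)
    return DATE_RE.search(best.data).group(0)

dom = minidom.parseString(
    "<body><div><h1>T</h1><span>2016-1-1</span>"
    "<p><em>2015-12-30</em></p></div><footer>2010-5-5</footer></body>"
)
title = dom.getElementsByTagName("h1")[0]
chosen = pick_publication_time(dom.documentElement, title)
print(chosen)
```

In the sample, the footer date fails the positional check, and of the two remaining candidates the shallower one is kept.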
When searching for time nodes as described above, a regular expression can be used to judge whether the text of a node expresses a time. Mainstream languages such as Java, Python, and JavaScript all support extracting text with regular expressions, so a regular expression matching tool can be chosen according to the actual environment and the needs of subsequent processing.
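As a rough illustration in Python, a time check based on the standard `re` module might look like this. The pattern `TIME_RE` and the formats it accepts (2016-1-1, 2016/01/01, 2016年1月1日, each optionally followed by a clock time) are assumptions; the patent does not specify which formats to cover.

```python
import re

# Assumed formats: 2016-1-1, 2016/01/01, 2016.1.1, 2016年1月1日,
# optionally followed by HH:MM or HH:MM:SS.
TIME_RE = re.compile(
    r"\d{4}(?:[-/.]\d{1,2}[-/.]\d{1,2}|年\d{1,2}月\d{1,2}日)"
    r"(?:\s+\d{1,2}:\d{2}(?::\d{2})?)?"
)

def extract_time(text):
    """Return the first time-like substring in text, or None."""
    match = TIME_RE.search(text)
    return match.group(0) if match else None

print(extract_time("Published 2016/01/01 08:30 by admin"))
```

A node whose text yields a non-`None` result would be treated as a time node; text such as "120 reads" yields `None`.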
Below with reference to Fig. 6 describe one embodiment of the invention for Homepage Publishing decimation in time Device.
Fig. 6 is structure of the present invention for one embodiment of the device of Homepage Publishing decimation in time Figure.As shown in fig. 6, the device of the embodiment includes:DOM Document Object Model dom tree sets up mould Block 62, the dom tree for setting up webpage source code;Title node determining module 64, for Web page title node is determined in dom tree;Issuing time determining module 66, for according to webpage Issuing time node determines webpage with relative position relation of the web page title node in dom tree Issuing time.
Wherein, issuing time determining module 66 can specifically use following several concrete structures.
The first structure is:Issuing time determining module 66 can include very first time node checks Unit and very first time information extraction unit;Very first time node checks unit is used to search in net Whether there is timing node under father node belonging to page head node, if it has, very first time information Extraction unit is used to for the temporal information in timing node to be extracted as the Homepage Publishing time.
Additionally, very first time node checks unit can be also used for searching where web page title Whether there is time leaf node under father node belonging to the corresponding node of label, if it has, first Temporal information extraction unit is used to for the temporal information in time leaf node to be extracted as Homepage Publishing Time;Or, very first time node checks unit is used to search in the label where web page title Label where whether having the time under father node belonging to corresponding node, if it has, when first Between information extraction unit be used for from the label where the time extracting time information and as webpage send out The cloth time.
Second structure be:Issuing time determining module 66 can be searched including the second timing node Unit and the second temporal information extraction unit;Second timing node searching unit is used to determine webpage The secondary left child node of the father node of title node, and search under some subtrees of secondary left child node Left subtree node whether have timing node, if it has, the second temporal information extraction unit is used for Temporal information in timing node is extracted as the Homepage Publishing time.
The third structure is:Issuing time determining module 66 can be searched including the 3rd timing node Unit and the 3rd temporal information extraction unit;Issuing time determining module 66 is used to search in webpage Whether there is timing node under father node belonging to title node, if it has, the 3rd temporal information is carried Unit is taken for the temporal information in timing node to be extracted as into the Homepage Publishing time;Issuing time When determining module 66 is additionally operable to not have timing node under the father node belonging to web page title node, Determine the secondary left child node of the father node of web page title node, and judge some of time left child node Whether the left subtree node under subtree is timing node, if it is, the 3rd temporal information extracts single Unit by the temporal information in timing node for being extracted as the Homepage Publishing time.
The fourth structure is: the release time determining module 66 may include a fourth time node search unit and a fourth time information extraction unit. The fourth time node search unit is configured to search for a time leaf node under the parent node of the node corresponding to the tag that contains the web page title; if one is found, the fourth time information extraction unit extracts the time information in the time leaf node as the webpage release time. The fourth time node search unit is further configured, when there is no time leaf node under that parent node, to search under the same parent node for a tag that contains the time; if one is found, the fourth time information extraction unit extracts the time information from that tag and uses it as the webpage release time. The fourth time node search unit is further configured, when there is no tag that contains the time under that parent node, to determine the second-left child node of the parent node of the web page title node and to search the left subtree nodes under the several subtrees of the second-left child node for a time node; if one is found, the fourth time information extraction unit extracts the time information in the time node as the webpage release time.
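The three-step fallback of the fourth structure can be sketched as a chain of strategies tried in order. The sketch below is an illustration under stated assumptions, not the patent's implementation: "under the parent node" is read as "among the direct children of the title's parent", a "time leaf node" is taken to be a childless element whose text is exactly a time, a "tag that contains the time" is taken to be a child whose text merely contains a time, and `TIME_RE` and all function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Assumed time format; the source does not prescribe one.
TIME_RE = re.compile(r"\d{4}-\d{2}-\d{2}(?: \d{2}:\d{2})?")


def leaf_time(parent):
    """Step 1: a childless child of the parent whose whole text is a time."""
    for child in parent:
        text = (child.text or "").strip()
        if len(child) == 0 and TIME_RE.fullmatch(text):
            return text
    return None


def tag_time(parent):
    """Step 2: any child tag whose text contains a time among other text."""
    for child in parent:
        match = TIME_RE.search(child.text or "")
        if match:
            return match.group(0)
    return None


def second_left_time(parent):
    """Step 3: leftmost nodes of each subtree of the parent's second child."""
    if len(parent) < 2:
        return None
    for subtree in parent[1]:
        node = subtree
        while node is not None:
            match = TIME_RE.search(node.text or "")
            if match:
                return match.group(0)
            node = node[0] if len(node) else None
    return None


def extract_release_time(title_parent):
    """Try the three steps in order; return (time, name of the step that hit)."""
    for step in (leaf_time, tag_time, second_left_time):
        found = step(title_parent)
        if found:
            return found, step.__name__
    return None, None


# Hypothetical fragments, one per step.
leaf_page = ET.fromstring("<div><h1>Title</h1><span>2015-12-18</span></div>")
tag_page = ET.fromstring(
    "<div><h1>Title</h1><p>Posted 2015-12-18 09:00 <a>share</a></p></div>"
)
deep_page = ET.fromstring(
    "<div><h1>Title</h1><div><p><em>2015-12-18</em></p></div></div>"
)
```

Calling `extract_release_time` on each fragment shows which step produced the result, which makes the fallback order easy to check.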
The fifth structure is: the release time determining module 66 may include a fifth time node search unit, a position relationship judging unit and a fifth time information extraction unit. The fifth time node search unit is configured to search for time nodes in the DOM tree. The position relationship judging unit is configured to judge whether each time node found and the web page title node satisfy the relative position relationship, in the DOM tree, between the webpage release time node and the web page title node. The fifth time information extraction unit is configured to determine a time node that satisfies the condition as the webpage release time node, and to extract the webpage release time from the webpage release time node.
In addition, when there are multiple time nodes that satisfy the condition, the fifth time information extraction unit may determine the qualifying time node nearest to the root node of the DOM tree as the webpage release time node.
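The fifth structure reverses the search direction: all time nodes are collected first, then filtered by their position relative to the title node, and ties are broken by distance from the root. The following sketch assumes, purely for illustration, that the positional condition is "the time node lies under the parent of the title node"; the actual condition is whichever relative position relationship the embodiment uses. `TIME_RE`, `walk` and `release_time_node` are hypothetical names.

```python
import re
import xml.etree.ElementTree as ET

# Assumed time format; the source does not prescribe one.
TIME_RE = re.compile(r"\d{4}-\d{2}-\d{2}(?: \d{2}:\d{2})?")


def walk(node, depth=0, parent=None):
    """Yield (node, depth, parent) for every element, in document order."""
    yield node, depth, parent
    for child in node:
        yield from walk(child, depth + 1, node)


def release_time_node(root, title):
    """Pick the qualifying time node nearest to the root, or None."""
    info = {node: (depth, parent) for node, depth, parent in walk(root)}
    title_parent = info[title][1]

    def under_title_parent(node):
        # Assumed positional condition: the node descends from the title's parent.
        while node is not None:
            if node is title_parent:
                return True
            node = info[node][1]
        return False

    candidates = [
        node
        for node in info
        if node is not title
        and TIME_RE.search(node.text or "")
        and under_title_parent(node)
    ]
    # Smallest depth means nearest to the root of the tree.
    return min(candidates, key=lambda node: info[node][0], default=None)


# Hypothetical page: the footer date is nearest to the root but fails the
# positional condition; the comment date qualifies but lies deeper.
root = ET.fromstring(
    "<html><body>"
    "<div id='main'><h1>Headline</h1><span>2015-12-18</span>"
    "<div><p>Comment at 2016-01-02</p></div></div>"
    "<div id='footer'>2014-01-01</div>"
    "</body></html>"
)
title = root.find(".//h1")
chosen = release_time_node(root, title)
```

In this fragment the `<span>` is selected: the footer date is excluded by the positional filter, and the comment date loses the tie-break because it sits one level deeper.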
The title node determining module 64 may be configured to determine the web page title node in the DOM tree according to the tag type, the unique identifier (id) attribute or the class attribute of the tag that contains the web page title.
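Locating the title node by tag type, id attribute or class attribute can be sketched as below. This is a minimal illustration using `xml.etree.ElementTree` in place of a real DOM; the function name `find_title_node`, its parameters and the sample attribute values are assumptions, and a real page would need a per-site choice of criterion.

```python
import xml.etree.ElementTree as ET


def find_title_node(root, tag=None, node_id=None, css_class=None):
    """Return the first element, in document order, that matches every
    criterion supplied (tag type, id attribute, class attribute)."""
    if tag is None and node_id is None and css_class is None:
        raise ValueError("at least one criterion is required")
    for node in root.iter():
        if tag is not None and node.tag != tag:
            continue
        if node_id is not None and node.get("id") != node_id:
            continue
        if css_class is not None and css_class not in node.get("class", "").split():
            continue
        return node
    return None


# Hypothetical page with two possible title locations.
page = ET.fromstring(
    "<html><head><title>Site name</title></head><body>"
    "<h1 class='headline big'>Story title</h1>"
    "<div id='artTitle'>Alternate title</div>"
    "</body></html>"
)

by_tag = find_title_node(page, tag="h1")
by_id = find_title_node(page, node_id="artTitle")
by_class = find_title_node(page, css_class="headline")
```

Matching the class as a whitespace-separated token, rather than as a substring, avoids treating `headline-small` as `headline`.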
In addition, the method according to the present invention may also be implemented as a computer program product that includes a computer-readable medium storing a computer program for performing the above functions defined in the method of the present invention. Those skilled in the art will also appreciate that the various illustrative logical blocks, modules, circuits and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or a combination of both.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit it. Any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (18)

1. A webpage release time extraction method, characterized by comprising:
establishing a Document Object Model (DOM) tree of the web page source code;
determining a web page title node in the DOM tree;
determining the webpage release time according to the relative position relationship, in the DOM tree, between the webpage release time node and the web page title node.
2. The method according to claim 1, characterized in that determining the webpage release time according to the relative position relationship, in the DOM tree, between the webpage release time node and the web page title node comprises:
if there is a time node under the parent node of the web page title node, extracting the time information in the time node as the webpage release time.
3. The method according to claim 2, characterized in that:
if there is a time leaf node under the parent node of the node corresponding to the tag that contains the web page title, the time information in the time leaf node is extracted as the webpage release time;
or,
if there is a tag that contains the time under the parent node of the node corresponding to the tag that contains the web page title, the time information is extracted from that tag and used as the webpage release time.
4. The method according to claim 1, characterized in that determining the webpage release time according to the relative position relationship, in the DOM tree, between the webpage release time node and the web page title node comprises:
determining the second-left child node of the parent node of the web page title node and, if a left subtree node under the several subtrees of the second-left child node is a time node, extracting the time information in the time node as the webpage release time.
5. The method according to claim 1, characterized in that determining the webpage release time according to the relative position relationship, in the DOM tree, between the webpage release time node and the web page title node comprises:
if there is a time node under the parent node of the web page title node, extracting the time information in the time node as the webpage release time;
if there is no time node under the parent node of the web page title node, determining the second-left child node of the parent node of the web page title node and, if a left subtree node under the several subtrees of the second-left child node is a time node, extracting the time information in the time node as the webpage release time.
6. The method according to claim 1, characterized in that determining the webpage release time according to the relative position relationship, in the DOM tree, between the webpage release time node and the web page title node comprises:
if there is a time leaf node under the parent node of the node corresponding to the tag that contains the web page title, extracting the time information in the time leaf node as the webpage release time;
if there is no time leaf node under the parent node of the node corresponding to the tag that contains the web page title, searching under that parent node for a tag that contains the time and, if one is found, extracting the time information from that tag and using it as the webpage release time;
if there is no tag that contains the time under the parent node of the node corresponding to the tag that contains the web page title, determining the second-left child node of the parent node of the web page title node and, if a left subtree node under the several subtrees of the second-left child node is a time node, extracting the time information in the time node as the webpage release time.
7. The method according to claim 1, characterized in that determining the webpage release time according to the relative position relationship, in the DOM tree, between the webpage release time node and the web page title node comprises:
searching for time nodes in the DOM tree; judging whether each time node found and the web page title node satisfy the relative position relationship, in the DOM tree, between the webpage release time node and the web page title node; determining a time node that satisfies the condition as the webpage release time node; and extracting the webpage release time from the webpage release time node.
8. The method according to claim 7, characterized by further comprising:
if there are multiple time nodes that satisfy the condition, determining the qualifying time node nearest to the root node of the DOM tree as the webpage release time node.
9. The method according to claim 1, characterized in that determining the web page title node in the DOM tree comprises:
determining the web page title node in the DOM tree according to the tag type, the unique identifier (id) attribute or the class attribute of the tag that contains the web page title.
10. A device for webpage release time extraction, characterized by comprising:
a Document Object Model (DOM) tree establishing module, configured to establish the DOM tree of the web page source code;
a title node determining module, configured to determine a web page title node in the DOM tree;
a release time determining module, configured to determine the webpage release time according to the relative position relationship, in the DOM tree, between the webpage release time node and the web page title node.
11. The device according to claim 10, characterized in that the release time determining module includes a first time node search unit and a first time information extraction unit;
the first time node search unit is configured to search for a time node under the parent node of the web page title node; if one is found, the first time information extraction unit extracts the time information in the time node as the webpage release time.
12. The device according to claim 11, characterized in that the first time node search unit is configured to search for a time leaf node under the parent node of the node corresponding to the tag that contains the web page title; if one is found, the first time information extraction unit extracts the time information in the time leaf node as the webpage release time;
or,
the first time node search unit is configured to search, under the parent node of the node corresponding to the tag that contains the web page title, for a tag that contains the time; if one is found, the first time information extraction unit extracts the time information from that tag and uses it as the webpage release time.
13. The device according to claim 10, characterized in that the release time determining module includes a second time node search unit and a second time information extraction unit;
the second time node search unit is configured to determine the second-left child node of the parent node of the web page title node, and to search the left subtree nodes under the several subtrees of the second-left child node for a time node; if one is found, the second time information extraction unit extracts the time information in the time node as the webpage release time.
14. The device according to claim 10, characterized in that the release time determining module includes a third time node search unit and a third time information extraction unit;
the release time determining module is configured to search for a time node under the parent node of the web page title node; if one is found, the third time information extraction unit extracts the time information in the time node as the webpage release time;
the release time determining module is further configured, when there is no time node under the parent node of the web page title node, to determine the second-left child node of the parent node of the web page title node and to judge whether a left subtree node under the several subtrees of the second-left child node is a time node; if so, the third time information extraction unit extracts the time information in the time node as the webpage release time.
15. The device according to claim 10, characterized in that the release time determining module includes a fourth time node search unit and a fourth time information extraction unit;
the fourth time node search unit is configured to search for a time leaf node under the parent node of the node corresponding to the tag that contains the web page title; if one is found, the fourth time information extraction unit extracts the time information in the time leaf node as the webpage release time;
the fourth time node search unit is further configured, when there is no time leaf node under the parent node of the node corresponding to the tag that contains the web page title, to search under that parent node for a tag that contains the time; if one is found, the fourth time information extraction unit extracts the time information from that tag and uses it as the webpage release time;
the fourth time node search unit is further configured, when there is no tag that contains the time under the parent node of the node corresponding to the tag that contains the web page title, to determine the second-left child node of the parent node of the web page title node and to search the left subtree nodes under the several subtrees of the second-left child node for a time node; if one is found, the fourth time information extraction unit extracts the time information in the time node as the webpage release time.
16. The device according to claim 10, characterized in that the release time determining module includes a fifth time node search unit, a position relationship judging unit and a fifth time information extraction unit;
the fifth time node search unit is configured to search for time nodes in the DOM tree;
the position relationship judging unit is configured to judge whether each time node found and the web page title node satisfy the relative position relationship, in the DOM tree, between the webpage release time node and the web page title node;
the fifth time information extraction unit is configured to determine a time node that satisfies the condition as the webpage release time node, and to extract the webpage release time from the webpage release time node.
17. The device according to claim 16, characterized in that the fifth time information extraction unit is configured, when there are multiple time nodes that satisfy the condition, to determine the qualifying time node nearest to the root node of the DOM tree as the webpage release time node.
18. The device according to claim 10, characterized in that the title node determining module is configured to determine the web page title node in the DOM tree according to the tag type, the unique identifier (id) attribute or the class attribute of the tag that contains the web page title.
CN201510955640.2A 2015-12-18 2015-12-18 Webpage release time extraction method and device for webpage release time extraction Active CN106897287B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510955640.2A CN106897287B (en) 2015-12-18 2015-12-18 Webpage release time extraction method and device for webpage release time extraction

Publications (2)

Publication Number Publication Date
CN106897287A true CN106897287A (en) 2017-06-27
CN106897287B CN106897287B (en) 2020-06-16

Family

ID=59189612

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510955640.2A Active CN106897287B (en) 2015-12-18 2015-12-18 Webpage release time extraction method and device for webpage release time extraction

Country Status (1)

Country Link
CN (1) CN106897287B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101470728A (en) * 2007-12-25 2009-07-01 北京大学 Method and device for automatically abstracting text of Chinese news web page
CN102129428A (en) * 2010-01-20 2011-07-20 腾讯科技(深圳)有限公司 Method and device for subscribing information from webpage
CN103064827A (en) * 2013-01-16 2013-04-24 盘古文化传播有限公司 Method and device for extracting webpage content
CN104462151A (en) * 2013-09-25 2015-03-25 腾讯科技(深圳)有限公司 Method for evaluating web page publishing time and related device

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108268433A (en) * 2018-02-26 2018-07-10 杭州数梦工场科技有限公司 Title abstracting method and device based on webpage article
CN108268433B (en) * 2018-02-26 2019-06-11 杭州数梦工场科技有限公司 Title abstracting method and device based on webpage article
CN109829092A (en) * 2018-12-26 2019-05-31 厦门邑通软件科技有限公司 The method that a kind of pair of webpage is oriented monitoring
CN109829092B (en) * 2018-12-26 2021-05-28 厦门邑通软件科技有限公司 Method for directionally monitoring webpage
CN116484831A (en) * 2023-02-22 2023-07-25 北京麦克斯泰科技有限公司 Multi-dimension-based release time identification method and device
CN116484831B (en) * 2023-02-22 2024-03-12 北京麦克斯泰科技有限公司 Multi-dimension-based release time identification method and device

Also Published As

Publication number Publication date
CN106897287B (en) 2020-06-16

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant