CN106897287A - Web page publication time extraction method and device for web page publication time extraction - Google Patents
Web page publication time extraction method and device for web page publication time extraction
- Publication number
- CN106897287A (application CN201510955640.2A)
- Authority
- CN
- China
- Prior art keywords
- node
- time
- web page
- timing
- page title
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
Abstract
The invention discloses a web page publication time extraction method and a device for web page publication time extraction, and relates to the field of cloud computing. The method includes: building a Document Object Model (DOM) tree of the web page source code; determining the web page title node in the DOM tree; and determining the web page publication time according to the relative positional relationship between the web page publication time node and the web page title node in the DOM tree. Because the publication time is determined from the positional relationship between the publication time node and the title node in the DOM tree, the publication time can be located accurately, which makes the method suitable for automated extraction of web page publication times.
Description
Technical field
The present invention relates to the field of cloud computing, and in particular to a web page publication time extraction method and a device for web page publication time extraction.
Background
In the Internet era, web pages are an important carrier for publishing information. Besides obtaining and reading information directly from web pages, in-depth analysis of that information is also a focus of attention.
Analyzing the information in a web page presupposes parsing the content of the page. In web page extraction, especially when extracting from news and information pages, the publication time of the page is an important attribute. At present, publication time extraction relies mainly on regular expression rules. However, a web page usually contains several times, and a time obtained by simple regular expression matching alone cannot be identified as the actual publication time. In addition, when crawling a page, search engines often take the time in the HTTP (HyperText Transfer Protocol) header of the page as its publication time. But the time in the HTTP header is the last modification time of the page, and a page may be modified after it is published, so that time cannot represent the publication time.
Summary of the invention
One technical problem to be solved by embodiments of the present invention is how to extract the publication time of a web page accurately.
According to one aspect of embodiments of the present invention, a web page publication time extraction method is provided, characterized by including: building a Document Object Model (DOM) tree of the web page source code; determining the web page title node in the DOM tree; and determining the web page publication time according to the relative positional relationship between the web page publication time node and the web page title node in the DOM tree.
In one embodiment, determining the publication time according to the relative positional relationship between the publication time node and the title node in the DOM tree includes: if there is a time node under the parent node of the title node, extracting the time information in that time node as the publication time.
In one embodiment, if there is a time leaf node under the parent node of the node corresponding to the tag that contains the title, the time information in the time leaf node is extracted as the publication time; or, if there is a tag containing a time under the parent node of the node corresponding to the tag that contains the title, the time information is extracted from that tag and used as the publication time.
In one embodiment, determining the publication time according to the relative positional relationship between the publication time node and the title node in the DOM tree includes: determining the second-left child node (the second child counting from the left) of the parent node of the title node; and, if the left subtree node among the subtrees of the second-left child node is a time node, extracting the time information in the time node as the publication time.
In one embodiment, determining the publication time according to the relative positional relationship between the publication time node and the title node in the DOM tree includes: if there is a time node under the parent node of the title node, extracting the time information in the time node as the publication time; if there is no time node under the parent node of the title node, determining the second-left child node of the parent node of the title node, and, if the left subtree node among the subtrees of the second-left child node is a time node, extracting the time information in the time node as the publication time.
In one embodiment, determining the publication time according to the relative positional relationship between the publication time node and the title node in the DOM tree includes: if there is a time leaf node under the parent node of the node corresponding to the tag that contains the title, extracting the time information in the time leaf node as the publication time; if there is no time leaf node under that parent node, checking whether there is a tag containing a time under that parent node and, if there is, extracting the time information from that tag and using it as the publication time; if there is no tag containing a time under that parent node, determining the second-left child node of the parent node of the title node, and, if the left subtree node among the subtrees of the second-left child node is a time node, extracting the time information in the time node as the publication time.
In one embodiment, determining the publication time according to the relative positional relationship between the publication time node and the title node in the DOM tree includes: searching for time nodes in the DOM tree; judging whether a found time node and the title node satisfy the relative positional relationship between the publication time node and the title node in the DOM tree; determining a time node that satisfies the condition to be the publication time node; and extracting the publication time from the publication time node.
In one embodiment, the method further includes: if several time nodes satisfy the condition, determining the qualifying time node closest to the root node of the DOM tree to be the publication time node.
In one embodiment, determining the title node in the DOM tree includes: determining the title node in the DOM tree according to the tag type, id (unique identifier) attribute, or class attribute of the tag that contains the title.
According to a second aspect of embodiments of the present invention, a device for web page publication time extraction is provided, including: a DOM tree building module, configured to build the Document Object Model (DOM) tree of the web page source code; a title node determining module, configured to determine the web page title node in the DOM tree; and a publication time determining module, configured to determine the publication time according to the relative positional relationship between the publication time node and the title node in the DOM tree.
In one embodiment, the publication time determining module includes a first time node lookup unit and a first time information extraction unit. The first time node lookup unit is configured to check whether there is a time node under the parent node of the title node; if there is, the first time information extraction unit is configured to extract the time information in the time node as the publication time.
In one embodiment, the first time node lookup unit is configured to check whether there is a time leaf node under the parent node of the node corresponding to the tag that contains the title; if there is, the first time information extraction unit is configured to extract the time information in the time leaf node as the publication time. Alternatively, the first time node lookup unit is configured to check whether there is a tag containing a time under the parent node of the node corresponding to the tag that contains the title; if there is, the first time information extraction unit is configured to extract the time information from that tag and use it as the publication time.
In one embodiment, the publication time determining module includes a second time node lookup unit and a second time information extraction unit. The second time node lookup unit is configured to determine the second-left child node of the parent node of the title node, and to check whether the left subtree node among the subtrees of the second-left child node is a time node; if it is, the second time information extraction unit is configured to extract the time information in the time node as the publication time.
In one embodiment, the publication time determining module includes a third time node lookup unit and a third time information extraction unit. The publication time determining module is configured to check whether there is a time node under the parent node of the title node; if there is, the third time information extraction unit is configured to extract the time information in the time node as the publication time. The publication time determining module is further configured, when there is no time node under the parent node of the title node, to determine the second-left child node of the parent node of the title node and to judge whether the left subtree node among the subtrees of the second-left child node is a time node; if it is, the third time information extraction unit is configured to extract the time information in the time node as the publication time.
In one embodiment, the publication time determining module includes a fourth time node lookup unit and a fourth time information extraction unit. The fourth time node lookup unit is configured to check whether there is a time leaf node under the parent node of the node corresponding to the tag that contains the title; if there is, the fourth time information extraction unit is configured to extract the time information in the time leaf node as the publication time. The fourth time node lookup unit is further configured, when there is no time leaf node under that parent node, to check whether there is a tag containing a time under that parent node; if there is, the fourth time information extraction unit is configured to extract the time information from that tag and use it as the publication time. The fourth time node lookup unit is further configured, when there is no tag containing a time under that parent node, to determine the second-left child node of the parent node of the title node and to check whether the left subtree node among the subtrees of the second-left child node is a time node; if it is, the fourth time information extraction unit is configured to extract the time information in the time node as the publication time.
In one embodiment, the publication time determining module includes a fifth time node lookup unit, a positional relationship judging unit, and a fifth time information extraction unit. The fifth time node lookup unit is configured to search for time nodes in the DOM tree. The positional relationship judging unit is configured to judge whether a found time node and the title node satisfy the relative positional relationship between the publication time node and the title node in the DOM tree. The fifth time information extraction unit is configured to determine a time node that satisfies the condition to be the publication time node, and to extract the publication time from the publication time node.
In one embodiment, the fifth time information extraction unit is configured, when several time nodes satisfy the condition, to determine the qualifying time node closest to the root node of the DOM tree to be the publication time node.
In one embodiment, the title node determining module is configured to determine the title node in the DOM tree according to the tag type, id (unique identifier) attribute, or class attribute of the tag that contains the title.
Because the publication time is determined from the positional relationship between the publication time node and the title node in the DOM tree, the publication time can be located accurately, which makes the approach suitable for automated extraction of web page publication times.
Other features and advantages of the present invention will become apparent from the following detailed description of exemplary embodiments of the invention with reference to the drawings.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flow chart of one embodiment of the web page publication time extraction method of the present invention.
Fig. 2 is a schematic diagram of a partial screenshot of a news and information web page.
Fig. 3 is a schematic diagram of the DOM tree formed by the text and tags in one version of the code corresponding to the partial web page shown in Fig. 2.
Fig. 4 is a schematic diagram of the DOM tree formed by the text and tags in another version of the code corresponding to the partial web page shown in Fig. 2.
Fig. 5 is a schematic diagram of the DOM tree formed by the text and tags in yet another version of the code corresponding to the partial web page shown in Fig. 2.
Fig. 6 is a structural diagram of one embodiment of the device for web page publication time extraction of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. The following description of at least one exemplary embodiment is merely illustrative and in no way limits the invention, its application, or its uses. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
The web page publication time extraction method of one embodiment of the present invention is described below with reference to Fig. 1.
Fig. 1 is a flow chart of one embodiment of the web page publication time extraction method of the present invention. As shown in Fig. 1, the method of this embodiment includes:
Step S102: build the DOM (Document Object Model) tree of the web page source code.
The DOM provides a platform-independent and language-independent way to access and modify the content and structure of HTML and XML documents. In a DOM tree built from an HTML document, every tag and every text node is a node of the tree.
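Step S102 can be illustrated with a minimal sketch in Python. This is not the patent's implementation: the names Node, TreeBuilder, and build_dom are hypothetical, the sketch assumes well-nested markup without void elements such as br, and it performs none of the error recovery a real HTML parser does. It only shows that tags and text both become nodes of the tree.

```python
from html.parser import HTMLParser


class Node:
    """A simplified DOM node: a tag node (tag set) or a text node (text set)."""

    def __init__(self, tag=None, text=None, attrs=None, parent=None):
        self.tag = tag
        self.text = text
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a simplified tree; assumes every start tag has an end tag."""

    def __init__(self):
        super().__init__()
        self.root = Node(tag="#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag=tag, attrs=attrs, parent=self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Well-nested markup is assumed, so closing a tag means moving up.
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        if data.strip():  # ignore whitespace-only text
            self.current.children.append(
                Node(text=data.strip(), parent=self.current))


def build_dom(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


root = build_dom("<div><h1>Title</h1>2016-1-1<span>120 reads</span></div>")
div = root.children[0]
print([child.tag or child.text for child in div.children])
# ['h1', '2016-1-1', 'span']
```

The bare text "2016-1-1" appears as a child of the div node alongside the h1 and span tag nodes, which is the property the later steps rely on.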
Step S104: determine the web page title node in the DOM tree.
Here, the title node may be the node in the DOM tree that holds the title text, or the node of the tag that contains the title.
Step S106: determine the publication time according to the relative positional relationship between the publication time node and the title node in the DOM tree.
Here, the publication time node may be the node in the DOM tree that holds the publication time text, or the node of the tag that contains the publication time.
Because the publication time is determined from the positional relationship between the publication time node and the title node in the DOM tree, the publication time can be located accurately, which makes the method suitable for automated extraction of web page publication times.
In step S104, the title node can be determined in the DOM tree according to the tag type, id (unique identifier) attribute, or class attribute of the tag that contains the title. Title text is usually placed in a particular tag, for example an h1 tag, which denotes a top-level heading, or an a tag, which denotes a link. Besides recognizing the title's tag by its type, the DOM node containing the title can also be determined from the attribute values of that tag. A tag can indicate the meaning of its content through the value of its id attribute or class attribute. A tag containing a title may be, for example: <div id="title">Title</div>, where the div tag uses "title" in its id attribute to indicate that its content is the title. Other methods can of course also be used to determine the tag that contains the title, and are not described further here.
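The title lookup in step S104 might be sketched as follows. The sketch uses Python's xml.etree.ElementTree, which only works here because the sample markup is well-formed XML; the function name find_title_node and the choices of "h1" as the title tag and "title" as the id/class hint are assumptions for illustration, not rules stated in the patent.

```python
import xml.etree.ElementTree as ET

TITLE_TAGS = {"h1"}       # assumed: tags treated as title tags
TITLE_HINTS = ("title",)  # assumed: substrings looked for in id/class values


def find_title_node(root):
    """Return the first element that looks like the page title, or None."""
    for element in root.iter():  # document order
        if element.tag in TITLE_TAGS:
            return element
        hints = (element.get("id", "") + " " + element.get("class", "")).lower()
        if any(hint in hints for hint in TITLE_HINTS):
            return element
    return None


page = ET.fromstring('<body><p>intro</p><div id="title">Title</div></body>')
print(find_title_node(page).text)  # Title
```

The a tag mentioned above is left out of TITLE_TAGS because most links on a page are not titles; a real implementation would need some further rule to tell them apart.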
Generally, when a page is parsed and rendered in a browser, the publication time tends to follow immediately after the title. But because front-end layouts differ from page to page, two pages that look identical when rendered may have differently structured source code; that is, the relative positional relationship between the publication time node and the title node in the DOM tree may differ. Statistics and analysis over a large number of news and information pages show that the DOM node containing the publication time and the DOM node containing the title stand mainly in one of two relative positional relationships. Fig. 2 is a schematic diagram of a partial screenshot of a news and information page. Taking Fig. 2 as an example, the two methods of determining the publication time from the relative positional relationship between the publication time node and the title node in the DOM tree are described below.
The first determination method is: if there is a time node under the parent node of the title node, extract the time information in the time node as the publication time. Depending on whether the publication time node is a text node or a tag node, the first determination method takes one of the following two forms: if there is a time leaf node under the parent node of the node corresponding to the tag that contains the title, the time information in the time leaf node is extracted as the publication time; or, if there is a tag containing a time under the parent node of the node corresponding to the tag that contains the title, the time information is extracted from that tag and used as the publication time. The two forms are illustrated below using code for the part of the web page shown in Fig. 2.
When there is a time leaf node under the parent node of the node corresponding to the tag that contains the title, the code for the part of the page shown in Fig. 2 may be, for example:
<div><h1>A web page publication time extraction method</h1>2016-1-1<span>120 reads</span></div>
Fig. 3 is a structural diagram of the DOM tree formed by the text and tags in the above code. As shown in Fig. 3, the tag node containing the title is the h1 node, and the parent of the h1 node is the div node. Searching the other child nodes of the div node yields the time leaf node "2016-1-1", and "2016-1-1" is the publication time.
When there is a tag containing a time under the parent node of the node corresponding to the tag that contains the title, the code for the part of the page shown in Fig. 2 may be, for example:
<div><h1>A web page publication time extraction method</h1><span>2016-1-1</span><span>120 reads</span></div>
Fig. 4 is a structural diagram of the DOM tree formed by the text and tags in the above code. As shown in Fig. 4, the tag node containing the title is the h1 node, and the parent of the h1 node is the div node. Searching the other child nodes of the div node yields the span tag containing the time "2016-1-1"; the time information is then extracted from the span tag, and "2016-1-1" is the publication time.
The first determination method generally suits pages whose front-end code has a relatively simple hierarchical layout. The title node and the time node are not only visually close in the rendered page; they are also only a few levels apart in the DOM tree. With this method, the publication time node can be located quickly and the publication time extracted accurately.
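Both forms of the first determination method can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses xml.etree.ElementTree on well-formed markup, the function name time_under_parent is hypothetical, and DATE_RE is an assumed pattern that only covers dates written like 2016-1-1. Note that ElementTree has no separate text nodes: text that directly follows a child element is stored in that child's tail attribute, so the "time leaf node" form is checked through parent.text and the children's tails.

```python
import re
import xml.etree.ElementTree as ET

DATE_RE = re.compile(r"\d{4}-\d{1,2}-\d{1,2}")  # assumed date format


def time_under_parent(root, title):
    """Look for a time among the children of the title node's parent."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    parent = parent_of.get(title)
    if parent is None:
        return None
    # Form 1: a bare text (leaf) node directly under the parent.
    for text in [parent.text] + [child.tail for child in parent]:
        match = DATE_RE.search(text or "")
        if match:
            return match.group()
    # Form 2: a sibling tag whose own text contains the time.
    for child in parent:
        if child is not title:
            match = DATE_RE.search(child.text or "")
            if match:
                return match.group()
    return None


leaf = ET.fromstring(
    "<div><h1>Title</h1>2016-1-1<span>120 reads</span></div>")
tagged = ET.fromstring(
    "<div><h1>Title</h1><span>2016-1-1</span><span>120 reads</span></div>")
print(time_under_parent(leaf, leaf.find("h1")))      # 2016-1-1
print(time_under_parent(tagged, tagged.find("h1")))  # 2016-1-1
```

Checking the leaf form before the tag form mirrors the order in which the two forms are listed above.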
For the two specific methods within the first determination method, the former can be tried first and then the latter. That is, if there is a time node under the parent node of the title node, the time information in the time node is extracted as the publication time; if there is no time node under the parent node of the title node, the second-left child node of the parent node of the title node is determined, and if the left subtree node among the subtrees of the second-left child node is a time node, the time information in the time node is extracted as the publication time.
The second determination method is: determine the second-left child node (the second child counting from the left) of the parent node of the title node; if the left subtree node among the subtrees of the second-left child node is a time node, extract the time information in the time node as the publication time. When the publication time is obtained with the second determination method, the code for the part of the page shown in Fig. 2 may be, for example:
<div><h1>A web page publication time extraction method</h1><p><span>2016-1-1</span><h4>120 reads</h4></p></div>
Fig. 5 is a structural diagram of the DOM tree formed by the text and tags in the above code. As shown in Fig. 5, the parent node of the title node is the div node, and the second-left child node of the div node is the p tag. There are two subtrees under the p tag: one rooted at span and one rooted at h4. The subtree rooted at span is the left subtree of the p tag. Searching for a time node in the subtree rooted at span yields the time information "2016-1-1", and "2016-1-1" is the publication time. This situation generally occurs in pages whose front-end code has a relatively complex hierarchical layout.
The second determination method generally suits pages whose front-end code has a relatively complex hierarchical layout. Although the title node and the time node are many levels apart in the DOM tree, the above method can still locate a time node that is visually close to the title in the rendered page. With this method, complex page layouts can be handled and the publication time extracted accurately.
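The second determination method can be sketched in Python as well. As before, this is an illustration rather than the patent's implementation: it assumes well-formed markup parsed with xml.etree.ElementTree, the name time_under_second_left_child is hypothetical, DATE_RE is an assumed date pattern, and "second-left child" is read as the second child of the title node's parent counting from the left.

```python
import re
import xml.etree.ElementTree as ET

DATE_RE = re.compile(r"\d{4}-\d{1,2}-\d{1,2}")  # assumed date format


def time_under_second_left_child(root, title):
    """Search the leftmost subtree of the parent's second child for a time."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    parent = parent_of.get(title)
    if parent is None or len(parent) < 2:
        return None
    second_left = parent[1]        # second child counting from the left
    if len(second_left) == 0:
        return None
    left_subtree = second_left[0]  # root of its leftmost subtree
    for element in left_subtree.iter():
        match = DATE_RE.search(element.text or "")
        if match:
            return match.group()
    return None


page = ET.fromstring(
    "<div><h1>Title</h1>"
    "<p><span>2016-1-1</span><h4>120 reads</h4></p></div>"
)
print(time_under_second_left_child(page, page.find("h1")))  # 2016-1-1
```

A browser's HTML parser would not keep an h4 nested inside a p the way this markup is written, so the sketch treats the snippet purely as a tree, in the same way Fig. 5 does.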
The two determination methods can be combined by applying the first and then the second: first check whether there is a time node among the nodes at levels close to the title node, and then check whether there is a time node among the nodes at levels farther from the title node.
In the two determination methods above, the order of determination is: first identify the nodes that satisfy the positional relationship between the title node and the publication time node, and then judge whether those nodes are time nodes. Approached from the other direction, the order can also be to identify the time nodes in the page first and then judge whether those time nodes satisfy the positional relationship between the title node and the publication time node. That is: search for time nodes in the DOM tree; judge whether a found time node and the title node satisfy the relative positional relationship between the publication time node and the title node in the DOM tree; determine a time node that satisfies the condition to be the publication time node; and extract the publication time from the publication time node. The two orders start the search and the judgment from different directions. When the hierarchical structure of the page's front-end code is relatively simple, the former is faster; when the page contains little time information, the latter is more efficient.
When the second order is used, that is, time nodes are found first and the positional relationship is judged afterwards, several results may be obtained. In that case a result can be chosen as follows: if several time nodes satisfy the condition, the qualifying time node closest to the root node of the DOM tree is determined to be the publication time node. A qualifying time node close to the root of the DOM tree has a simpler hierarchical relationship with the title node, and is therefore more likely than the other nodes to be the node that contains the publication time.
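The reverse order, together with the closest-to-root tie-break, might look like the following Python sketch. Several simplifications are assumptions of this sketch and not part of the patent: the positional test is reduced to "the time element lies somewhere under the title node's parent", only element text is examined (bare text stored in ElementTree tails is ignored), and the names pick_publication_time and DATE_RE are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

DATE_RE = re.compile(r"\d{4}-\d{1,2}-\d{1,2}")  # assumed date format


def pick_publication_time(root, title):
    """Find time elements first, filter by position, keep the shallowest."""
    parent_of = {child: parent for parent in root.iter() for child in parent}

    def ancestors(element):
        while element in parent_of:
            element = parent_of[element]
            yield element

    title_parent = parent_of.get(title)
    if title_parent is None:
        return None
    candidates = []
    for element in root.iter():
        match = DATE_RE.search(element.text or "")
        if match is None or element is title:
            continue
        chain = list(ancestors(element))
        # Simplified positional test (assumption): under the title's parent.
        if any(node is title_parent for node in chain):
            candidates.append((len(chain), match.group()))
    if not candidates:
        return None
    # Fewest ancestors means closest to the root of the tree.
    return min(candidates, key=lambda item: item[0])[1]


page = ET.fromstring(
    "<body><div><h1>Title</h1><p><span>2016-1-1</span></p>"
    "<span>2015-12-30</span></div><footer>2010-5-5</footer></body>"
)
print(pick_publication_time(page, page.find("div/h1")))  # 2015-12-30
```

In the sample page the footer date fails the positional test, and of the two remaining candidates the span directly under the div wins because it is one level closer to the root.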
When searching for time nodes as described above, a regular expression can be used to judge whether the text of a node is text that expresses a time. Mainstream languages such as Java, Python, and JavaScript all support extracting text with regular expressions, so the regular expression matching tool can be chosen according to the actual runtime environment and the needs of subsequent processing.
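As an illustration of the regular expression check, the following Python sketch tests whether a piece of node text contains a date. The pattern TIME_RE and the function extract_time are assumptions made for this example; they cover only a few common numeric date formats, and real pages would need a broader set of patterns.

```python
import re

# Assumed formats: 2016-1-1, 2016/01/01, 2016.1.1, and 2016年1月1日.
TIME_RE = re.compile(
    r"\d{4}[-/.]\d{1,2}[-/.]\d{1,2}"
    r"|\d{4}年\d{1,2}月\d{1,2}日"
)


def extract_time(text):
    """Return the first date-like substring of text, or None."""
    match = TIME_RE.search(text)
    return match.group() if match else None


print(extract_time("Published 2016-1-1 by admin"))  # 2016-1-1
print(extract_time("120 reads"))                    # None
```

A node can then be treated as a time node when extract_time returns a value for its text.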
The device for web page publication time extraction of one embodiment of the present invention is described below with reference to Fig. 6.
Fig. 6 is a structural diagram of one embodiment of the device for web page publication time extraction of the present invention. As shown in Fig. 6, the device of this embodiment includes: a DOM tree building module 62, configured to build the Document Object Model (DOM) tree of the web page source code; a title node determining module 64, configured to determine the web page title node in the DOM tree; and a publication time determining module 66, configured to determine the publication time according to the relative positional relationship between the publication time node and the title node in the DOM tree.
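The three-module structure can be pictured as a small composition of callables. The Python sketch below is a loose illustration only: the class name PublicationTimeExtractor and its parameters are hypothetical, and the lambdas passed in the usage example are trivial placeholders standing in for modules 62, 64, and 66, not the lookup logic described in this document.

```python
import xml.etree.ElementTree as ET


class PublicationTimeExtractor:
    """Chains three pluggable stages that mirror modules 62, 64, and 66."""

    def __init__(self, build_tree, find_title, find_time):
        self.build_tree = build_tree  # role of DOM tree building module 62
        self.find_title = find_title  # role of title node determining module 64
        self.find_time = find_time    # role of publication time module 66

    def extract(self, source):
        root = self.build_tree(source)
        title = self.find_title(root)
        if title is None:
            return None  # without a title node there is no reference point
        return self.find_time(root, title)


# Placeholder stages, chosen only to make the sketch runnable.
extractor = PublicationTimeExtractor(
    build_tree=ET.fromstring,
    find_title=lambda root: root.find(".//h1"),
    find_time=lambda root, title: root.findtext(".//span"),
)
print(extractor.extract("<div><h1>Title</h1><span>2016-1-1</span></div>"))
# 2016-1-1
```

Keeping the stages separate makes it possible to swap in any of the structures of the publication time determining module described next without touching the other two stages.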
The publication time determining module 66 can take any of the following structures.
The first structure is: the publication time determining module 66 can include a first time node lookup unit and a first time information extraction unit. The first time node lookup unit is configured to check whether there is a time node under the parent node of the title node; if there is, the first time information extraction unit is configured to extract the time information in the time node as the publication time.
In addition, the first time node lookup unit can be configured to check whether there is a time leaf node under the parent node of the node corresponding to the tag that contains the title; if there is, the first time information extraction unit is configured to extract the time information in the time leaf node as the publication time. Alternatively, the first time node lookup unit is configured to check whether there is a tag containing a time under the parent node of the node corresponding to the tag that contains the title; if there is, the first time information extraction unit is configured to extract the time information from that tag and use it as the publication time.
The second structure is: the publication time determining module 66 can include a second time node lookup unit and a second time information extraction unit. The second time node lookup unit is configured to determine the second-left child node of the parent node of the title node, and to check whether the left subtree node among the subtrees of the second-left child node is a time node; if it is, the second time information extraction unit is configured to extract the time information in the time node as the publication time.
The third structure is: the publication time determining module 66 can include a third time node lookup unit and a third time information extraction unit. The publication time determining module 66 is configured to check whether there is a time node under the parent node of the title node; if there is, the third time information extraction unit is configured to extract the time information in the time node as the publication time. The publication time determining module 66 is further configured, when there is no time node under the parent node of the title node, to determine the second-left child node of the parent node of the title node and to judge whether the left subtree node among the subtrees of the second-left child node is a time node; if it is, the third time information extraction unit is configured to extract the time information in the time node as the publication time.
The fourth structure is: the publication time determining module 66 may include
a fourth time node search unit and a fourth time information extraction unit.
The fourth time node search unit is configured to search whether there is a time
leaf node under the parent node to which the node corresponding to the tag where
the webpage title is located belongs; if so, the fourth time information
extraction unit is configured to extract the time information in the time leaf
node as the webpage publication time. The fourth time node search unit is
further configured to, when there is no time leaf node under that parent node,
search whether there is a tag where a time is located under the parent node to
which the node corresponding to the tag where the webpage title is located
belongs; if so, the fourth time information extraction unit is configured to
extract time information from the tag where the time is located and use it as
the webpage publication time. The fourth time node search unit is further
configured to, when there is no tag where a time is located under that parent
node, determine the second-left child node of the parent node of the webpage
title node and search whether a left-subtree node under several subtrees of the
second-left child node is a time node; if so, the fourth time information
extraction unit is configured to extract the time information in the time node
as the webpage publication time.
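The three-stage cascade of the fourth structure might look like the sketch below.
It assumes, for illustration only, that a "time leaf node" is a childless sibling
of the title whose text is nothing but a date, that a "tag where a time is
located" is a sibling whose text contains a date among other text, and that dates
match the hypothetical pattern `TIME_RE`. The returned rule label is added purely
so the example shows which stage fired.

```python
import re
import xml.etree.ElementTree as ET

TIME_RE = re.compile(r"\d{4}-\d{1,2}-\d{1,2}")  # hypothetical date format


def cascade_publication_time(root, title):
    """Return (rule, time) for the first rule that matches, or None."""
    parents = {child: parent for parent in root.iter() for child in parent}
    parent = parents.get(title)
    if parent is None:
        return None
    siblings = [child for child in parent if child is not title]

    # Rule 1: time leaf node (no children, text consists only of the time).
    for node in siblings:
        text = (node.text or "").strip()
        if len(node) == 0 and TIME_RE.fullmatch(text):
            return ("leaf", text)

    # Rule 2: a tag whose own text contains the time among other text.
    for node in siblings:
        match = TIME_RE.search(node.text or "")
        if match:
            return ("tag", match.group(0))

    # Rule 3: left branch of each subtree under the second-left child.
    if len(parent) >= 2:
        for subtree in parent[1]:
            node = subtree
            while node is not None:
                match = TIME_RE.search(node.text or "")
                if match:
                    return ("second-left", match.group(0))
                node = node[0] if len(node) else None
    return None


page = ET.fromstring("<div><h1>T</h1><p>Source: Example, 2015-12-18</p></div>")
print(cascade_publication_time(page, page.find("h1")))
```

Ordering the rules from most to least specific means a clean, dedicated time
element wins over a date that merely appears inside a longer byline.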
The fifth structure is: the publication time determining module 66 may include a
fifth time node search unit, a position relationship judging unit, and a fifth
time information extraction unit. The fifth time node search unit is configured
to search for time nodes in the DOM tree. The position relationship judging unit
is configured to judge whether a found time node and the webpage title node
satisfy the relative position relationship between the webpage publication time
node and the webpage title node in the DOM tree. The fifth time information
extraction unit is configured to determine a qualifying time node as the webpage
publication time node and to extract the webpage publication time from the
webpage publication time node.
In addition, the fifth time information extraction unit may be configured to,
when there are multiple qualifying time nodes, determine the qualifying time
node closest to the root node of the DOM tree as the webpage publication time
node.
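The search-then-filter approach of the fifth structure, including the tie-break
toward the root, can be sketched as below. The position test used here (the time
node lies somewhere under the title's parent) is a simplified stand-in for the
relative position relationships described earlier, and `TIME_RE` is a
hypothetical date pattern; both are assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET

TIME_RE = re.compile(r"\d{4}-\d{1,2}-\d{1,2}")  # hypothetical date format


def pick_publication_time(root, title):
    parents = {child: parent for parent in root.iter() for child in parent}
    title_parent = parents.get(title)

    def depth(node):
        """Number of edges between the node and the root of the tree."""
        steps = 0
        while node in parents:
            node = parents[node]
            steps += 1
        return steps

    def qualifies(node):
        """Simplified position test: the node is a descendant of the title's parent."""
        while node in parents:
            node = parents[node]
            if node is title_parent:
                return True
        return False

    # Search the whole tree for time nodes, then keep those that pass the position test.
    candidates = [
        node
        for node in root.iter()
        if node is not title and TIME_RE.search(node.text or "") and qualifies(node)
    ]
    if not candidates:
        return None
    best = min(candidates, key=depth)  # closest to the root wins
    return TIME_RE.search(best.text).group(0)


page = ET.fromstring(
    "<html><body>"
    "<div><h1>T</h1><div><p><span>2015-12-18</span></p></div><span>2016-01-01</span></div>"
    "<div><span>2020-01-01</span></div>"
    "</body></html>"
)
print(pick_publication_time(page, page.find(".//h1")))
```

In this sample the footer date is rejected by the position test, and of the two
remaining candidates the shallower one is chosen.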
The title node determining module 64 may be configured to determine the webpage
title node in the DOM tree according to the tag type, the unique identifier (id)
attribute, or the class attribute of the tag where the webpage title is located.
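As an illustration of locating the title node by tag type, id attribute, or class
attribute, consider the sketch below. The default values (`h1`, `title`) and the
order in which the three criteria are tried are assumptions chosen for the
example; the text does not prescribe specific values or a priority.

```python
import xml.etree.ElementTree as ET


def find_title_node(root, tag="h1", id_value="title", class_value="title"):
    """Return the first node matching the tag type, then the id, then the class."""
    # Criterion 1: tag type of the element that holds the title.
    for node in root.iter():
        if node.tag == tag:
            return node
    # Criterion 2: unique id attribute.
    for node in root.iter():
        if node.get("id") == id_value:
            return node
    # Criterion 3: class attribute (a space-separated list of class names).
    for node in root.iter():
        if class_value in (node.get("class") or "").split():
            return node
    return None


page = ET.fromstring(
    "<html><body><div class='article-title main'>Hello</div></body></html>"
)
title = find_title_node(page, class_value="article-title")
print(title.text if title is not None else None)
```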
In addition, the method according to the present invention may also be
implemented as a computer program product. The computer program product
includes a computer-readable medium on which is stored a computer program for
performing the above-described functions defined in the method of the present
invention. Those skilled in the art will also appreciate that the various
illustrative logical blocks, modules, circuits, and algorithm steps described in
connection with the disclosure herein may be implemented as electronic hardware,
computer software, or a combination of both.
The foregoing describes only preferred embodiments of the present invention and
is not intended to limit the present invention. Any modification, equivalent
substitution, or improvement made within the spirit and principles of the
present invention shall fall within the protection scope of the present
invention.
Claims (18)
1. A webpage publication time extraction method, characterized by comprising:
establishing a Document Object Model (DOM) tree of the webpage source code;
determining a webpage title node in the DOM tree;
determining the webpage publication time according to the relative position
relationship between the webpage publication time node and the webpage title
node in the DOM tree.
2. The method according to claim 1, characterized in that determining the
webpage publication time according to the relative position relationship between
the webpage publication time node and the webpage title node in the DOM tree
comprises:
if there is a time node under the parent node to which the webpage title node
belongs, extracting the time information in the time node as the webpage
publication time.
3. The method according to claim 2, characterized in that:
if there is a time leaf node under the parent node to which the node
corresponding to the tag where the webpage title is located belongs, the time
information in the time leaf node is extracted as the webpage publication time;
or,
if there is a tag where a time is located under the parent node to which the
node corresponding to the tag where the webpage title is located belongs, time
information is extracted from the tag where the time is located and used as the
webpage publication time.
4. The method according to claim 1, characterized in that determining the
webpage publication time according to the relative position relationship between
the webpage publication time node and the webpage title node in the DOM tree
comprises:
determining the second-left child node of the parent node of the webpage title
node, and if a left-subtree node under several subtrees of the second-left child
node is a time node, extracting the time information in the time node as the
webpage publication time.
5. The method according to claim 1, characterized in that determining the
webpage publication time according to the relative position relationship between
the webpage publication time node and the webpage title node in the DOM tree
comprises:
if there is a time node under the parent node to which the webpage title node
belongs, extracting the time information in the time node as the webpage
publication time;
if there is no time node under the parent node to which the webpage title node
belongs, determining the second-left child node of the parent node of the
webpage title node, and if a left-subtree node under several subtrees of the
second-left child node is a time node, extracting the time information in the
time node as the webpage publication time.
6. The method according to claim 1, characterized in that determining the
webpage publication time according to the relative position relationship between
the webpage publication time node and the webpage title node in the DOM tree
comprises:
if there is a time leaf node under the parent node to which the node
corresponding to the tag where the webpage title is located belongs, extracting
the time information in the time leaf node as the webpage publication time;
if there is no time leaf node under the parent node to which the node
corresponding to the tag where the webpage title is located belongs, searching
whether there is a tag where a time is located under that parent node, and if
so, extracting time information from the tag where the time is located and using
it as the webpage publication time;
if there is no tag where a time is located under the parent node to which the
node corresponding to the tag where the webpage title is located belongs,
determining the second-left child node of the parent node of the webpage title
node, and if a left-subtree node under several subtrees of the second-left child
node is a time node, extracting the time information in the time node as the
webpage publication time.
7. The method according to claim 1, characterized in that determining the
webpage publication time according to the relative position relationship between
the webpage publication time node and the webpage title node in the DOM tree
comprises:
searching for time nodes in the DOM tree, judging whether a found time node and
the webpage title node satisfy the relative position relationship between the
webpage publication time node and the webpage title node in the DOM tree,
determining a qualifying time node as the webpage publication time node, and
extracting the webpage publication time from the webpage publication time node.
8. The method according to claim 7, characterized by further comprising:
if there are multiple qualifying time nodes, determining the qualifying time
node closest to the root node of the DOM tree as the webpage publication time
node.
9. The method according to claim 1, characterized in that determining the
webpage title node in the DOM tree comprises:
determining the webpage title node in the DOM tree according to the tag type,
the unique identifier (id) attribute, or the class attribute of the tag where
the webpage title is located.
10. A device for webpage publication time extraction, characterized by
comprising:
a Document Object Model (DOM) tree establishing module, configured to establish
a DOM tree of the webpage source code;
a title node determining module, configured to determine a webpage title node in
the DOM tree;
a publication time determining module, configured to determine the webpage
publication time according to the relative position relationship between the
webpage publication time node and the webpage title node in the DOM tree.
11. The device according to claim 10, characterized in that the publication time
determining module comprises a first time node search unit and a first time
information extraction unit;
the first time node search unit is configured to search whether there is a time
node under the parent node to which the webpage title node belongs, and if so,
the first time information extraction unit is configured to extract the time
information in the time node as the webpage publication time.
12. The device according to claim 11, characterized in that the first time node
search unit is configured to search whether there is a time leaf node under the
parent node to which the node corresponding to the tag where the webpage title
is located belongs, and if so, the first time information extraction unit is
configured to extract the time information in the time leaf node as the webpage
publication time;
or,
the first time node search unit is configured to search whether there is a tag
where a time is located under the parent node to which the node corresponding to
the tag where the webpage title is located belongs, and if so, the first time
information extraction unit is configured to extract time information from the
tag where the time is located and use it as the webpage publication time.
13. The device according to claim 10, characterized in that the publication time
determining module comprises a second time node search unit and a second time
information extraction unit;
the second time node search unit is configured to determine the second-left
child node of the parent node of the webpage title node and to search whether a
left-subtree node under several subtrees of the second-left child node is a time
node, and if so, the second time information extraction unit is configured to
extract the time information in the time node as the webpage publication time.
14. The device according to claim 10, characterized in that the publication time
determining module comprises a third time node search unit and a third time
information extraction unit;
the publication time determining module is configured to search whether there is
a time node under the parent node to which the webpage title node belongs, and
if so, the third time information extraction unit is configured to extract the
time information in the time node as the webpage publication time;
the publication time determining module is further configured to, when there is
no time node under the parent node to which the webpage title node belongs,
determine the second-left child node of the parent node of the webpage title
node and judge whether a left-subtree node under several subtrees of the
second-left child node is a time node, and if so, the third time information
extraction unit is configured to extract the time information in the time node
as the webpage publication time.
15. The device according to claim 10, characterized in that the publication time
determining module comprises a fourth time node search unit and a fourth time
information extraction unit;
the fourth time node search unit is configured to search whether there is a time
leaf node under the parent node to which the node corresponding to the tag where
the webpage title is located belongs, and if so, the fourth time information
extraction unit is configured to extract the time information in the time leaf
node as the webpage publication time;
the fourth time node search unit is further configured to, when there is no time
leaf node under the parent node to which the node corresponding to the tag where
the webpage title is located belongs, search whether there is a tag where a time
is located under that parent node, and if so, the fourth time information
extraction unit is configured to extract time information from the tag where the
time is located and use it as the webpage publication time;
the fourth time node search unit is further configured to, when there is no tag
where a time is located under the parent node to which the node corresponding to
the tag where the webpage title is located belongs, determine the second-left
child node of the parent node of the webpage title node and search whether a
left-subtree node under several subtrees of the second-left child node is a time
node, and if so, the fourth time information extraction unit is configured to
extract the time information in the time node as the webpage publication time.
16. The device according to claim 10, characterized in that the publication time
determining module comprises a fifth time node search unit, a position
relationship judging unit, and a fifth time information extraction unit;
the fifth time node search unit is configured to search for time nodes in the
DOM tree;
the position relationship judging unit is configured to judge whether a found
time node and the webpage title node satisfy the relative position relationship
between the webpage publication time node and the webpage title node in the DOM
tree;
the fifth time information extraction unit is configured to determine a
qualifying time node as the webpage publication time node and to extract the
webpage publication time from the webpage publication time node.
17. The device according to claim 16, characterized in that the fifth time
information extraction unit is configured to, when there are multiple qualifying
time nodes, determine the qualifying time node closest to the root node of the
DOM tree as the webpage publication time node.
18. The device according to claim 10, characterized in that the title node
determining module is configured to determine the webpage title node in the DOM
tree according to the tag type, the unique identifier (id) attribute, or the
class attribute of the tag where the webpage title is located.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510955640.2A CN106897287B (en) | 2015-12-18 | 2015-12-18 | Webpage release time extraction method and device for webpage release time extraction |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106897287A (en) | 2017-06-27 |
CN106897287B (en) | 2020-06-16 |
Family
ID=59189612
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510955640.2A (Active, CN106897287B) | Webpage release time extraction method and device for webpage release time extraction | 2015-12-18 | 2015-12-18 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106897287B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108268433A (en) * | 2018-02-26 | 2018-07-10 | 杭州数梦工场科技有限公司 | Title abstracting method and device based on webpage article |
CN109829092A (en) * | 2018-12-26 | 2019-05-31 | 厦门邑通软件科技有限公司 | The method that a kind of pair of webpage is oriented monitoring |
CN116484831A (en) * | 2023-02-22 | 2023-07-25 | 北京麦克斯泰科技有限公司 | Multi-dimension-based release time identification method and device |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101470728A (en) * | 2007-12-25 | 2009-07-01 | 北京大学 | Method and device for automatically abstracting text of Chinese news web page |
CN102129428A (en) * | 2010-01-20 | 2011-07-20 | 腾讯科技(深圳)有限公司 | Method and device for subscribing information from webpage |
CN103064827A (en) * | 2013-01-16 | 2013-04-24 | 盘古文化传播有限公司 | Method and device for extracting webpage content |
CN104462151A (en) * | 2013-09-25 | 2015-03-25 | 腾讯科技(深圳)有限公司 | Method for evaluating web page publishing time and related device |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108268433A (en) * | 2018-02-26 | 2018-07-10 | 杭州数梦工场科技有限公司 | Title abstracting method and device based on webpage article |
CN108268433B (en) * | 2018-02-26 | 2019-06-11 | 杭州数梦工场科技有限公司 | Title abstracting method and device based on webpage article |
CN109829092A (en) * | 2018-12-26 | 2019-05-31 | 厦门邑通软件科技有限公司 | The method that a kind of pair of webpage is oriented monitoring |
CN109829092B (en) * | 2018-12-26 | 2021-05-28 | 厦门邑通软件科技有限公司 | Method for directionally monitoring webpage |
CN116484831A (en) * | 2023-02-22 | 2023-07-25 | 北京麦克斯泰科技有限公司 | Multi-dimension-based release time identification method and device |
CN116484831B (en) * | 2023-02-22 | 2024-03-12 | 北京麦克斯泰科技有限公司 | Multi-dimension-based release time identification method and device |
Also Published As
Publication number | Publication date |
---|---|
CN106897287B (en) | 2020-06-16 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||