CN108920434A

CN108920434A - A general method and system for extracting the main content of web pages

Info

Publication number: CN108920434A
Application number: CN201810572726.0A
Authority: CN
Inventors: 钟刚
Original assignee: Wuhan Cool Dog Data Technology Co Ltd
Current assignee: Wuhan Cool Dog Data Technology Co Ltd
Priority date: 2018-06-06
Filing date: 2018-06-06
Publication date: 2018-11-30
Anticipated expiration: 2038-06-06
Also published as: CN108920434B

Abstract

The present invention relates to a general method and system for extracting the main content of web pages. The method comprises the following steps: constructing the DOM tree of a target web page, cleaning up the nodes of the DOM tree, and labelling the remaining nodes with attributes according to their relevance to the body content; traversing the DOM tree and caching the remaining nodes by category; and judging, from the distance between the nodes of each category and the visual title node, whether the content of those nodes is main content, and extracting the main content of the target web page according to the result. The invention provides an improved semantics-based method for extracting web page information. Relying on the strong associations present in page structure, it identifies the visual title node of the body in the DOM tree and caches the other nodes by category, then uses the distance in the DOM tree between the nodes of each category and the visual title node as the key evidence for deciding whether a node belongs to the main content, thereby improving the precision and efficiency of web page information extraction.

Description

A general method and system for extracting the main content of web pages

Technical field

The present invention relates to the technical field of computer software, and in particular to a general method and system for extracting the main content of web pages.

Background art

In the current Internet era, most of the information publicly visible on the network is presented as main content, for example blog posts on blogs and domestic news on portal sites. This main content is an important channel through which most Internet users obtain information, a massive base corpus for academic researchers, and of significant value in the field of natural language processing. However, for a variety of reasons, a content page on the network does not consist of the main content alone; it also contains information not directly related to the main content, such as advertisements, comments, related recommendations and site navigation. How to extract the main content of a web page from such cluttered page information has become a problem to be solved.

Existing main-content extraction approaches generally fall into two kinds: semantics-based web page information extraction, and vision-based web page segmentation. Both attempt to locate, from the page structure, the information block that actually contains the main content.

Semantics-based web page information extraction generally takes one of two forms. The first analyses information from the entire website, trying to find blocks repeated across different pages, such as navigation bars, and then removes those repeated blocks when analysing a specific page in order to find the main content. The second relies only on the page under analysis: it tries to find block-level element nodes in the HTML, analyses textual properties of each node's content, such as text length, and by comparison selects the block-level element with the longest text.

Vision-based web page segmentation renders the whole page with a browser engine and then partitions the rendered page according to factors such as the background colour, font and border of the page elements. Closely associated elements are merged, and loosely associated elements are treated as separate blocks, which completes a vision-based block reconstruction of the whole page. This approach has a drawback: during analysis it must build the DOM tree from the page source while also loading the CSS (Cascading Style Sheets) files and other resources the page depends on, and it relies on browser-engine rendering, so it is comparatively slow when analysing large volumes of data.

Summary of the invention

The present invention provides a general method and system for extracting the main content of web pages, which solve the technical problem in the prior art that main-content extraction from web pages has low precision and efficiency.

The technical solution by which the present invention solves the above technical problem is as follows: a general method for extracting the main content of web pages, comprising the following steps:

Step 1: constructing the DOM tree of the target web page, cleaning up the nodes of the DOM tree, and attribute-labelling the remaining nodes of the DOM tree according to their relevance to the body content;

Step 2: traversing the attribute-labelled DOM tree, and caching the remaining nodes of the DOM tree by category as picture nodes, date nodes, body text nodes or visual title nodes;

Step 3: according to the distance of the picture nodes, the date nodes and the body text nodes from the visual title node, judging whether the content of each picture node, date node and body text node is main content, and completing the extraction of the main content of the target web page according to the result, the main content comprising body pictures, publication time and body text.

The beneficial effects of the invention are as follows: the invention provides an improved semantics-based method for extracting web page information. Relying on the strong associations present in page structure, it identifies the visual title node of the body in the DOM tree and caches the other nodes by category, then uses the distance in the DOM tree between the nodes of each category and the visual title node as the key evidence for deciding whether a node belongs to the main content, thereby improving the precision and efficiency of web page information extraction.

On the basis of the above technical solution, the present invention may be further improved as follows.

Further, step 1 specifically comprises the following steps:

S101: downloading the source code of the target web page and parsing the source code into a DOM tree;

S102: obtaining and caching the content of the title tag node in the DOM tree, performing Chinese word segmentation and stop-word removal on that content, and generating a title word set containing several title words;

S103: traversing the DOM tree depth-first; after cleaning up nodes of preset types in the DOM tree, judging whether the id attribute, class attribute and/or style attribute of each remaining node satisfies a first preset condition, and according to the result labelling each remaining node as an element definitely unrelated to the body text, an element possibly unrelated to the body text, or another element.

Further, step 2 specifically comprises the following steps:

S201: selecting the body element of the DOM tree as the start node of a depth-first recursive traversal, and generating the node access path of each remaining element in the DOM tree;

S202: according to the attribute labels of the remaining elements in the DOM tree, treating both the elements possibly unrelated to the body text and the other elements as to-be-collected elements, collecting information from the to-be-collected elements, and caching them by category as picture nodes, author nodes, date nodes, body text nodes or visual title nodes.

Further, in step S202, collecting information from the to-be-collected elements and caching them by category specifically comprises the following steps:

Step a: judging whether the tag of the to-be-collected element is an img tag; if so, collecting and caching the element as a picture node; if not, executing step b;

Step b: judging whether the id attribute or class attribute of the to-be-collected element contains an image, photo or gallery marker; if not, executing step c; if so, determining that the element is a definite picture information block node and globally marking that the traversal of the DOM tree has entered a picture information collection block; when traversing the child nodes of the element, judging whether each child node is a picture node; if so, collecting and caching the child node as a picture node; if not, continuing with the next to-be-collected element;

Step c: judging whether the id attribute or class attribute of the to-be-collected element contains an author, writtenby or byline marker; if not, executing step d; if so, determining that the element is a definite author information block node and globally marking that the traversal of the DOM tree has entered an author information collection block; when traversing the child nodes of the element, judging whether each child node is an author node; if so, collecting and caching the child node as an author node; if not, continuing with the next to-be-collected element;

Step d: judging whether the id attribute or class attribute of the to-be-collected element contains an article, post, main or content marker; if not, executing step e; if so, determining that the element is a definite body information block node and globally marking that the traversal of the DOM tree has entered a body information collection block; at the same time, if no definite body information block has been collected globally so far and only non-definite body information blocks have been collected, clearing the non-definite body information blocks collected so far;

Step e: determining whether the to-be-collected element has child elements; if it has, judging whether the child elements can be merged and replaced; if they can, replacing the content of the to-be-collected element with the merged content of all its child elements and executing step f; if they cannot, executing step f directly;

Step f: traversing all child nodes of the to-be-collected element and processing them one by one as follows: judging the type of the child node; if the child node is an element node, incrementing the global node count by one and returning to step a to recurse depth-first; if the child node is a text child node, identifying its content and, according to the result, caching it as a visual title node, a date node or a possible body text node;

During the above depth-first recursive traversal, the node count serial number, the text node count serial number and the node access path of each to-be-collected element in the DOM tree are recorded.

Further, extracting the body text in step 3 from the cached body text nodes specifically comprises the following steps:

Sorting all possible body text nodes in ascending order of node count serial number;

Among all possible body text nodes, finding the first target node, namely the first node whose node count serial number is greater than that of the visual title node and whose sentence count is greater than 0 or whose content words are related to the content words of the visual title node, and denoting this first target node p1;

Starting from p1, searching backwards for a second target node whose node count serial number differs from that of p1 by less than 3 and whose access path is similar, replacing p1 with it, and repeating this step until no new second target node can be found;

Discarding all possible body text nodes before p1, and grouping the remaining possible body text nodes by node access path, the nodes within each group being sorted in ascending order of node count serial number and the groups being sorted in ascending order of the node count serial number of each group's first node;

Computing the preset parameter values of each group and feeding them to a pre-trained prediction model for scoring, yielding the target groups whose score exceeds a preset score;

Sorting the nodes of all target groups in ascending order of node count serial number to form the body text node set;

Caching the body text node set.

Further, extracting the publication time in step 3 from the cached date nodes specifically comprises the following steps:

Discarding the invalid nodes among all date nodes, an invalid node being one whose node count serial number comes after the first of the first target nodes;

From the date nodes remaining after cleanup, obtaining the target date node nearest to the visual title node, where the node count serial number difference of the target date node is below a first preset value and its text node count serial number difference is below a second preset value.

Further, extracting the body pictures in step 3 from the cached picture nodes specifically comprises the following steps:

Step 001: sorting the cached picture nodes in ascending order of node count serial number;

Step 002: obtaining the target picture node and discarding it together with all picture nodes after it, the target picture node being the picture node that is nearest to the last first target node and whose node count serial number difference from it is greater than a third preset value;

Step 003: obtaining the picture nodes whose node count serial numbers lie between the body text nodes and the visual title node and denoting them illustration picture nodes; then also denoting as illustration picture nodes the picture nodes that lie before the visual title node at a node distance from it below a fourth preset value, and adding them to the illustration picture node set; meanwhile caching the non-illustration picture nodes;

Step 004: obtaining the distance in node count serial number between each illustration picture node and the visual title node, and sorting all illustration picture nodes in ascending order of that distance;

Step 005: pre-screening all illustration picture nodes according to preset screening rules, filtering out invalid pictures unrelated to the body text;

Step 006: obtaining the node access paths of the illustration picture nodes remaining after pre-screening, finding the nodes with the same node access path in the illustration picture node set of step 003, then repeating steps 004 and 005, and merging the illustration picture nodes screened out again with the non-illustration picture nodes.

To solve the technical problem addressed by the invention, a general system for extracting the main content of web pages is also provided, comprising a DOM tree processing module, a cache module and an extraction module.

The DOM tree processing module is used to construct the DOM tree of the target web page, clean up the nodes of the DOM tree, and attribute-label the remaining nodes of the DOM tree according to their relevance to the body content;

The cache module is used to traverse the attribute-labelled DOM tree and cache the remaining nodes of the DOM tree by category as picture nodes, date nodes, body text nodes or visual title nodes;

The extraction module is used to judge, according to the distance of the picture nodes, the date nodes and the body text nodes from the visual title node, whether the content of each picture node, date node or body text node is main content, and to complete the extraction of the main content of the target web page according to the result, the main content comprising body pictures, publication time and body text.

Further, the DOM tree processing module comprises:

A parsing unit, for downloading the source code of the target web page and parsing the source code into a DOM tree;

A title word generation unit, for obtaining and caching the content of the title tag node in the DOM tree, performing Chinese word segmentation and stop-word removal on that content, and generating a title word set containing several title words;

A labelling unit, for traversing the DOM tree depth-first and, after cleaning up nodes of preset types in the DOM tree, judging whether the id attribute, class attribute and/or style attribute of each remaining node satisfies the first preset condition, and according to the result labelling each remaining node as an element definitely unrelated to the body text, an element possibly unrelated to the body text, or another element.

Further, the cache module comprises:

A path generation unit, for selecting the body element of the DOM tree as the start node of a depth-first recursive traversal and generating the node access path of each remaining element in the DOM tree;

A cache unit, for treating, according to the attribute labels of the remaining elements in the DOM tree, both the elements possibly unrelated to the body text and the other elements as to-be-collected elements, collecting information from the to-be-collected elements, and caching them by category as picture nodes, author nodes, date nodes, body text nodes or visual title nodes.

Advantages of additional aspects of the invention will be set forth in part in the description that follows, and in part will become apparent from that description or be learned through practice of the invention.

Brief description of the drawings

Fig. 1 is a flow diagram of the general method for extracting the main content of web pages provided by embodiment 1;

Fig. 2 is a structural diagram of the general system for extracting the main content of web pages provided by embodiment 2.

Detailed description of the embodiments

The principles and features of the present invention are described below with reference to the accompanying drawings. The examples given serve only to explain the invention and are not intended to limit its scope.

Fig. 1 is a flow diagram of the general method for extracting the main content of web pages provided by embodiment 1. As shown in Fig. 1, the method comprises the following steps:

Step 1: construct the DOM tree of the target web page, and clean up and attribute-label the nodes of the DOM tree;

Step 2: traverse the cleaned and labelled DOM tree, and cache the remaining nodes of the DOM tree by category as picture nodes, visual title nodes, date nodes or body text nodes;

Step 3: extract the main content of the target web page from the cached information, the main content comprising body text, publication time and body pictures.

The above embodiment relies on the strong associations present in page structure. It identifies the visual title node of the body in the DOM tree and caches the other nodes by category, then uses the distance in the DOM tree between the nodes of each category and the visual title node as the key evidence for deciding whether a node belongs to the main content, thereby improving the precision and efficiency of web page information extraction. Each step of the above embodiment is described in detail below.

In embodiment 1 above, step 1 specifically comprises the following steps:

S101: download the source code of the target web page and parse it into a DOM tree. The link of the target web page is usually given, so the source code can be downloaded through the link and then parsed into a DOM tree with open-source tools. Specific parsing methods are documented in the prior art and are not described in detail here.

S102: obtain and cache the content of the title tag node in the DOM tree, perform Chinese word segmentation and stop-word removal on that content, and generate a title word set containing several title words. Specifically, a CSS selector can be used to find the title tag node in the DOM tree and obtain its content, which is the title information of the target web page; after Chinese word segmentation and stop-word removal, the title word set is obtained, and its title words are used to identify the visual title node in the body. The visual title node here is the node in which the title words appear, not the title tag node mentioned above.

S103: traverse the DOM tree depth-first; after cleaning up nodes of preset types in the DOM tree, judge whether the id attribute, class attribute and/or style attribute of each remaining node satisfies the first preset condition, and according to the result label each remaining node as an element definitely unrelated to the body text, an element possibly unrelated to the body text, or another element. In this embodiment, the nodes of preset types are nodes obviously unrelated to the body content, for example nodes that are neither text nodes nor element nodes, the various script nodes, and nodes such as meta, title and link.

The first preset condition is: if the id attribute or class attribute of a node contains text such as banner, comment, sidebar or logo, or the style attribute of the node contains display:none, the node is judged to be an element definitely unrelated to the body text. After the attributes of the remaining nodes have been judged and a result generated, the node is marked with the special marker attribute score but is not removed directly, so that the subsequent node count serial numbering is not disturbed. In addition, because the remaining nodes have been given element attribute labels in this step, they are called remaining elements in the steps described below; the two terms mean the same thing.
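By way of illustration only, the first preset condition can be sketched in Python as follows. The keyword tuple holds just the four examples named in the text (the text implies the list is open-ended), and the function name, the substring matching and the dictionary representation of attributes are assumptions made for this sketch rather than details given in the specification.

```python
import re

# Keywords named in the first preset condition; the real list may be longer.
UNRELATED_KEYWORDS = ("banner", "comment", "sidebar", "logo")


def is_definitely_unrelated(attrs: dict) -> bool:
    """Return True if id/class/style mark a node as definitely unrelated to the body text.

    `attrs` is a hypothetical plain dict of attribute name -> string value.
    """
    id_and_class = f"{attrs.get('id', '')} {attrs.get('class', '')}".lower()
    # Assumption: "contains" is interpreted as a case-insensitive substring match.
    if any(keyword in id_and_class for keyword in UNRELATED_KEYWORDS):
        return True
    style = attrs.get("style", "").lower()
    # Tolerate optional whitespace, e.g. "display:none" and "display: none".
    return re.search(r"display\s*:\s*none", style) is not None
```

A node for which this returns True would be marked rather than deleted, in keeping with the requirement that the node count serial numbering stay undisturbed.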

The DOM tree is then traversed and information is collected, which specifically comprises the following steps:

S201: select the body element of the DOM tree as the start node of a depth-first recursive traversal, and generate the node access path of each remaining element in the DOM tree. The node access path of the body element is the empty string. The node access path of any other remaining element is the full path from the body element to that element, formed by concatenating, for each node on the path, the node name and the node's index under its parent; since there is only one body element, its index need not be recorded. For example, the access path of the third p element under the second div element under body is body.div[2].p[3]. When node access paths are long, a loose access path can be defined, which ignores the indexes on the last 3 levels of an element's node access path, and the loose access path is used in place of the node access path in subsequent steps. For example, for an element whose node access path is body.div[2].div[1].table[1].div[2].p[1], the corresponding loose access path is body.div[2].div[1].table.div.p.
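The derivation of a loose access path from a node access path can be sketched as follows. This is a minimal illustration that assumes paths are dot-separated strings in the form used by the examples in the text; the function name and the `levels` parameter are hypothetical, and the text does not say how long a path must be before the loose form is used.

```python
import re


def loose_access_path(path: str, levels: int = 3) -> str:
    """Drop the [n] sibling indexes from the last `levels` segments of an access path."""
    segments = path.split(".")
    cut = max(len(segments) - levels, 0)
    # Strip a trailing "[digits]" from each of the last `levels` segments only.
    tail = [re.sub(r"\[\d+\]$", "", segment) for segment in segments[cut:]]
    return ".".join(segments[:cut] + tail)
```

Applied to the example path from the text, the sketch yields the loose path given there; two sibling paragraphs such as p[1] and p[2] under the same ancestors map to the same loose path, which is what allows them to be grouped later.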

S202: according to the attribute labels of the remaining elements in the DOM tree, treat both the elements possibly unrelated to the body text and the other elements as to-be-collected elements, collect information from the to-be-collected elements, and cache them by category as picture nodes, author nodes, visual title nodes, date nodes or body text nodes. The specific caching method is as follows:

Step a: judge whether the tag of the to-be-collected element is an img tag. If so, collect and cache the element as a picture node, because a picture element cannot at the same time be a date, title or similar element; if not, execute step b.

Step b: judge whether the id attribute or class attribute of the to-be-collected element contains an image, photo or gallery marker. If not, execute step c. If so, determine that the element is a definite picture information block node and globally mark that the traversal of the DOM tree has entered a picture information collection block. When traversing the child nodes of the element, it is then only necessary to judge whether a child node is a picture node, without attempting to judge whether it is an author, date or title element, which improves extraction efficiency. If the child node is a picture node, collect and cache it as a picture node; if not, continue with the next to-be-collected element.

Step c: judge whether the id attribute or class attribute of the to-be-collected element contains an author, writtenby or byline marker. If not, execute step d. If so, determine that the element is a definite author information block node and globally mark that the traversal of the DOM tree has entered an author information collection block. When traversing the child nodes of the element, it is then only necessary to judge whether a child node is an author node, without attempting to identify whether it is a picture, date or title element, which further improves extraction efficiency. If the child node is an author node, collect and cache it as an author node; if not, continue with the next to-be-collected element.

Step d: judge whether the id attribute or class attribute of the to-be-collected element contains an article, post, main or content marker. If not, execute step e. If so, determine that the element is a definite body information block node and globally mark that the traversal of the DOM tree has entered a body information collection block; at the same time, if no definite body information block has been collected globally so far and only non-definite body information blocks have been collected, clear the non-definite body information blocks collected so far.

Step e: determine whether the to-be-collected element has child elements. If it has, judge whether the child elements can be merged and replaced; if they can, replace the content of the to-be-collected element with the merged content of all its child elements and execute step f; if they cannot, execute step f directly.

Step f: traverse all child nodes of the to-be-collected element and process them one by one as follows: judge the type of the child node; if the child node is an element node, increment the global node count by one and return to step a to recurse depth-first; if the child node is a text child node, identify its content and, according to the result, cache it as a visual title node, a date node or a possible body text node.

During the above depth-first recursive traversal, the node count serial number, the text node count serial number and the node access path of each to-be-collected element in the DOM tree are recorded.
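The order of the checks in steps a to d can be illustrated with the following sketch. It covers only the dispatch on the tag and on the id and class attributes; the global block markers, the merging of step e and the recursion of step f are omitted. The function name, the return labels and the use of substring matching are assumptions made for illustration.

```python
# Marker keywords from steps b, c and d, checked in that order.
BLOCK_KEYWORDS = [
    ("picture", ("image", "photo", "gallery")),        # step b
    ("author", ("author", "writtenby", "byline")),     # step c
    ("body", ("article", "post", "main", "content")),  # step d
]


def classify_element(tag: str, attrs: dict) -> str:
    """Return a hypothetical label for a to-be-collected element (steps a to d only)."""
    if tag.lower() == "img":
        return "picture_node"  # step a: an img element is cached directly
    marker = f"{attrs.get('id', '')} {attrs.get('class', '')}".lower()
    for kind, keywords in BLOCK_KEYWORDS:
        # Assumption: a marker "contains" a keyword if it occurs as a substring.
        if any(keyword in marker for keyword in keywords):
            return f"{kind}_block"
    return "unclassified"  # would fall through to steps e and f
```

Because the checks run in a fixed order, an element whose class matches both a picture keyword and a body keyword is treated as a picture information block, mirroring the step b before step d ordering in the text.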

In step e of the above embodiment, the specific method for judging whether the child elements of the to-be-collected element can be merged and replaced is as follows:

1) If the to-be-collected element is a pre element, a heading element h1 to h6, or another presentational tag such as strong, b, i or em, its child elements can be merged and replaced, i.e. merged directly into a single element;

2) If the to-be-collected element is a p element, judge whether it satisfies the first pre-integration condition and, if it does, judge whether it also satisfies the second pre-integration condition; when both conditions are satisfied, its child elements can be merged and replaced;

The first pre-integration condition is: the to-be-collected element contains more than one text child node, or the ratio of the word count of link text to the word count of plain text in its child elements is less than one third;

The second pre-integration condition is: the to-be-collected element is a node containing more than one sentence, its access path is the same as the node access path of the previously collected text node, or it is a simple element. A simple element is an element that contains at most simple elements and text nodes; the definition is recursive;

3) If the to-be-collected element contains both child element nodes and text child nodes, check whether all of its text constitutes a short text; if so, its child elements can be merged and replaced. A short text is text that, after Chinese word segmentation, contains fewer than 3 stop words.

In step f of the above embodiment, the specific method for caching the text child node, according to the recognition result, as a visual title node, a date node or a possible body text node is as follows:

1) Compare the text content of the text child node with the title words in the title word set for similarity, and judge from the result whether the text child node is a visual title node;

2) Extract the date-time information in the text content of the text child node with a regular expression; if extraction succeeds and the date-time text accounts for more than the preset threshold of 0.5 of the entire text content, the text child node is determined to be a pure date node and is not treated as any other type of node, for example "2018-04-13 07:03:37 Source: Xinhua News Agency";

3) If the text child node is neither a visual title node nor a pure date node, cache it as a possible body text node for subsequent analysis.
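The pure date node test in item 2) can be sketched as below. The regular expression is an assumption chosen for illustration and recognises only one date-time layout (YYYY-MM-DD with an optional time); the specification does not give the expression actually used. The ratio is measured in characters, which is also an assumption.

```python
import re

# Hypothetical pattern: "2018-04-13" optionally followed by "07:03" or "07:03:37".
DATE_TIME_RE = re.compile(r"\d{4}-\d{1,2}-\d{1,2}(?:\s+\d{1,2}:\d{2}(?::\d{2})?)?")


def is_pure_date_node(text: str, threshold: float = 0.5) -> bool:
    """Return True if a date-time string makes up more than `threshold` of the text."""
    stripped = text.strip()
    match = DATE_TIME_RE.search(stripped)
    if not stripped or match is None:
        return False
    # Share of the text, in characters, occupied by the matched date-time string.
    return len(match.group(0)) / len(stripped) > threshold
```

A byline such as a timestamp followed by a short source name passes the test, while a full sentence that merely mentions a date does not, so the sentence remains available as a possible body text node.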

Body text is then extracted from the cached possible body text nodes. The extraction rests mainly on two facts. First, body text nodes come after the visual title node, that is, their node count serial number is greater than that of the visual title node. Second, body text nodes have similar access paths. Based on these facts, extracting the body text specifically includes the following steps:

1) Sort all possible body text nodes in ascending order by node count serial number;

2) Among all possible body text nodes, find the first target node, that is, the first node whose node count serial number is greater than that of the visual title node and whose sentence count is greater than 0 or whose content words are correlated with the content words of the visual title node; denote this first target node as the p1 node;

3) Starting from the p1 node, search backward for a second target node whose node count serial number differs from that of the p1 node by less than 3 and whose access path is similar, and replace p1 with it; repeat this step until no new second target node can be found;

4) Clear all possible body text nodes before the p1 node, and group the remaining possible body text nodes by node access path; within each group, sort the nodes in ascending order by node count serial number, and sort the groups in ascending order by the node count serial number of each group's first node;

5) Compute the preset parameter values of each group, feed the preset parameter values into a pre-trained prediction model for scoring, and produce the target groups whose score is greater than a preset score;

6) Sort the nodes of all target groups in ascending order by node count serial number to form the body text node set;

7) Cache the body text node set.
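Steps 1) to 6) can be sketched as follows, under stated assumptions. The TextNode class, the default path comparison (exact equality), the related callback, and the score callback standing in for the pre-trained prediction model are all hypothetical; the description does not say how path similarity or word correlation is computed, nor whether the backward search in step 3) may pass the visual title node, so the sketch leaves it unrestricted.

```python
from dataclasses import dataclass


@dataclass
class TextNode:
    serial: int      # node count serial number (DOM traversal order)
    path: str        # node access path, e.g. "html/body/div/p"
    sentences: int   # number of sentences in the node's text


def find_body_groups(nodes, title_serial,
                     related=lambda n: False,
                     similar=lambda a, b: a == b):
    # Step 1: ascending sort by node count serial number.
    nodes = sorted(nodes, key=lambda n: n.serial)
    # Step 2: first node after the title with sentences or title-related words.
    p1 = next((n for n in nodes
               if n.serial > title_serial and (n.sentences > 0 or related(n))), None)
    if p1 is None:
        return []
    # Step 3: move p1 backward while a close predecessor with a similar path exists.
    moved = True
    while moved:
        moved = False
        for n in reversed(nodes):
            if n.serial < p1.serial and p1.serial - n.serial < 3 and similar(n.path, p1.path):
                p1, moved = n, True
                break
    # Step 4: drop nodes before p1, group by path, order groups by first serial.
    groups = {}
    for n in nodes:
        if n.serial >= p1.serial:
            groups.setdefault(n.path, []).append(n)
    return sorted(groups.values(), key=lambda g: g[0].serial)


def select_text_nodes(groups, score, min_score):
    # Steps 5-6: keep groups scoring above the preset value, then merge and sort.
    kept = [n for g in groups if score(g) > min_score for n in g]
    return sorted(kept, key=lambda n: n.serial)
```

Because the nodes are sorted before grouping, each group is already in ascending order and no second sort inside the groups is needed.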

In step 5) of this embodiment, the preset parameter values include the number of nodes, the total number of sentences, the total number of related words, the average number of related words, the text node count serial number difference, the node count serial number difference, and the similarity between the node access path of the current group and the node access path of the previous target group. For the first group, the text node count serial number difference is the distance between the first text node of the current group and the visual title node. For the other groups, it is the distance between the first text node of the current group and the last node of the previous target group.
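A minimal sketch of how these parameter values might be computed for one group is given here. The dictionary keys, the use of difflib for path similarity, and the value 1.0 assigned to the first group's path similarity are assumptions. The description defines the reference point only for the text node count serial number difference; the sketch assumes the node count serial number difference uses the same reference point.

```python
from difflib import SequenceMatcher


def group_features(group, title, prev_target=None):
    """group: list of dicts with keys serial, text_serial, sentences, related_words, path."""
    first = group[0]
    # Reference point: the visual title node for the first group,
    # otherwise the last node of the previous target group.
    ref = title if prev_target is None else prev_target[-1]
    related = sum(n["related_words"] for n in group)
    if prev_target is None:
        path_similarity = 1.0  # assumed default; there is no previous group to compare
    else:
        path_similarity = SequenceMatcher(None, first["path"], prev_target[0]["path"]).ratio()
    return {
        "node_count": len(group),
        "total_sentences": sum(n["sentences"] for n in group),
        "total_related_words": related,
        "avg_related_words": related / len(group),
        "text_serial_gap": first["text_serial"] - ref["text_serial"],
        "node_serial_gap": first["serial"] - ref["serial"],
        "path_similarity": path_similarity,
    }
```

The resulting dictionary would be converted to whatever feature vector the prediction model expects; the description does not specify the model type.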

The publication time is then extracted from the cached date nodes, which specifically includes the following steps:

1) Clear the invalid nodes among all date nodes. An invalid node is a node whose node count serial number comes after the first of the first target nodes found during body text extraction and analysis, because a publication date node is located either before the visual title node or between the visual title node and the first body text node.

2) Among the date nodes remaining after clearing, obtain the target date node nearest to the visual title node, where the node count serial number difference of the target date node is lower than a first preset value and its text node count serial number difference is lower than a second preset value.
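The two steps above can be sketched as follows. The dictionary keys and the default threshold values are hypothetical; the description names a first and a second preset value without giving numbers. The sketch reads step 2) literally: it takes the nearest remaining date node and then checks it against both preset values.

```python
def pick_publish_date(date_nodes, title, first_body_serial,
                      max_node_gap=10, max_text_gap=5):
    """date_nodes and title are dicts with keys serial and text_serial."""
    # Step 1: discard date nodes located after the first body text node.
    valid = [d for d in date_nodes if d["serial"] < first_body_serial]
    if not valid:
        return None
    # Step 2: nearest date node to the visual title node, subject to both thresholds.
    nearest = min(valid, key=lambda d: abs(d["serial"] - title["serial"]))
    node_gap = abs(nearest["serial"] - title["serial"])
    text_gap = abs(nearest["text_serial"] - title["text_serial"])
    return nearest if node_gap < max_node_gap and text_gap < max_text_gap else None
```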

Finally, the body pictures are extracted from the cached picture nodes, which specifically includes the following steps:

Step 002, obtain the target picture node, and completely clear the target picture node and all other picture nodes after it; the target picture node is the picture node that is close to the last first target node and whose node count serial number difference is greater than a preset value;

Step 003, obtain the picture nodes whose node count serial number lies between the body text node and the visual title node, and denote them as interpolated picture nodes; then also denote as interpolated picture nodes the picture nodes that are located before the visual title node and whose node distance is lower than a preset value, and merge them into the interpolated picture node set, while caching the non-interpolated picture nodes;

Step 004, obtain the distance, in node count serial number, between each interpolated picture node and the visual title node, and sort all interpolated picture nodes in ascending order of distance;
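Steps 003 and 004 can be sketched as follows, with pictures represented only by their node count serial numbers. The function name, the body_serial bound supplied by the caller, and the default value of max_before are assumptions; the description does not state which body text node bounds the range or what the preset distance is.

```python
def split_pictures(picture_serials, title_serial, body_serial, max_before=5):
    """Return (interpolated, non_interpolated) lists of picture serial numbers."""
    inside, outside = [], []
    for serial in picture_serials:
        # Step 003: between the visual title node and the body text node ...
        between = title_serial < serial < body_serial
        # ... or before the title but closer than the preset distance.
        just_before = 0 < title_serial - serial < max_before
        (inside if between or just_before else outside).append(serial)
    # Step 004: ascending order of distance to the visual title node.
    inside.sort(key=lambda s: abs(s - title_serial))
    return inside, outside
```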

In the above embodiment, the preset screening rules used include the following:

Rule 1: based on the image link of the interpolated picture node, filter out common advertisement links, for example links whose URL (Uniform Resource Locator) path contains common advertising wording, common social network external links, or a logo.

Rule 2: obtain the picture size information of the interpolated picture node, and filter out banner pictures according to the picture's aspect ratio, as well as small pictures whose size is lower than a preset value. The picture size information can be obtained as follows: if the current node already specifies width and height attributes and the attribute values are within a valid range, the size is taken directly from them; otherwise a network input stream is opened from the picture URL to obtain it. Obtaining the size over the network does not require downloading the full picture; only the size information at the head of the network input stream is read. The loose access path of the picture node is recorded at the same time, so that when other picture nodes are traversed, any node with the same loose access path can use this picture node's size information directly without opening an additional network request.

Rule 3: based on the node path, starting from the picture node, trace back at most 3 levels of nodes, score the picture node using the id attribute and class attribute of those nodes, and filter out the pictures determined to be unrelated according to the score. The loose access path of the node is recorded at the same time, so that when other picture nodes are traversed, any node with the same loose access path is filtered out directly.
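The three rules can be sketched as follows. The token list, the size and aspect ratio limits, and the negative keywords are hypothetical values chosen for illustration, and nodes are modeled as plain dictionaries with a parent key. Only the PNG case is shown for reading the size from the head of a stream; it relies on the PNG layout in which the width and height occupy bytes 16 to 23 of the file, inside the IHDR chunk. The check that width and height attributes fall within a valid range is omitted.

```python
import re
import struct
from urllib.parse import urlparse

BLOCKED_TOKENS = {"ad", "ads", "advert", "banner", "logo", "share"}  # hypothetical
MIN_SIDE, MAX_ASPECT = 100, 4.0                                      # hypothetical
NEGATIVE = ("sidebar", "footer", "comment", "related")               # hypothetical
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def looks_like_ad(image_url):
    """Rule 1: flag URLs whose path contains a blocked token."""
    tokens = set(re.split(r"[^a-z0-9]+", urlparse(image_url).path.lower()))
    return bool(tokens & BLOCKED_TOKENS)


def png_size(header):
    """Rule 2 helper: (width, height) from the first 24 bytes of a PNG stream."""
    if len(header) < 24 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        return None
    return struct.unpack(">II", header[16:24])


def picture_size(loose_path, attrs, fetch_header, cache):
    """Rule 2: use the cache first, then width/height attributes, then the stream head."""
    if loose_path not in cache:
        try:
            cache[loose_path] = (int(attrs["width"]), int(attrs["height"]))
        except (KeyError, ValueError):
            cache[loose_path] = png_size(fetch_header())
    return cache[loose_path]


def size_ok(width, height):
    """Rule 2: reject small pictures and banner-shaped pictures."""
    short, long_ = min(width, height), max(width, height)
    return short >= MIN_SIDE and long_ / short <= MAX_ASPECT


def ancestor_score(node, max_levels=3):
    """Rule 3: penalize pictures whose nearest ancestors look like non-article blocks."""
    score, current = 0, node.get("parent")
    for _ in range(max_levels):
        if current is None:
            break
        label = (current.get("id", "") + " " + current.get("class", "")).lower()
        score -= sum(word in label for word in NEGATIVE)
        current = current.get("parent")
    return score
```

Keying the cache on the loose access path means that pictures sharing a template position reuse one measurement, which is the saving the rules describe.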

Step 3 of the above embodiment groups the nodes by node access path and scores them at the group level to determine whether the node content in each group belongs to the subject content, which further improves the efficiency of web page subject content extraction.

The flow of the general web page subject content extraction method has been described above with reference to Fig. 1; the structure of the general web page subject content extraction system is described below with reference to Fig. 2.

Fig. 2 is a schematic structural diagram of a general web page subject content extraction system provided by Embodiment 2 of the present invention. As shown in Fig. 2, the system includes a DOM tree processing module, a cache module, and an extraction module,

The above embodiment relies on the strong association present in the page structure: it identifies the visual title node of the body text in the DOM tree and caches the other nodes by category, and then uses the distance in the DOM tree between each other category of node and the visual title node as an important basis for judging whether a node belongs to the subject content, thereby improving the precision and efficiency of web page information extraction.

In a preferred embodiment, the DOM tree processing module includes:

In another preferred embodiment, the cache module includes：

The cache unit includes a picture node cache unit, a picture information block node cache unit, an author information block node cache unit, a text information block node cache unit, a child element integration and replacement unit, a child element cache unit, and an information recording unit.

The picture node cache unit is used to judge whether the element tag of the element to be collected is an img tag; if so, it collects and caches the element to be collected as a picture node; if not, it drives the first judging unit;

The picture information block node cache unit is used to judge whether the id attribute or class attribute of the element to be collected contains the label image, photo, or gallery. If not, it drives the author information block node cache unit. If so, it determines that the element to be collected is a definite picture information block node and sets a global flag indicating that the DOM tree traversal has entered a picture information collection block; when the child nodes of the element to be collected are traversed, it judges whether each child node is a picture node, and if so, collects and caches the child node as a picture node; if not, it continues with the next element to be collected;

The author information block node cache unit is used to judge whether the id attribute or class attribute of the element to be collected contains the label author, writtenby, or byline. If not, it drives the text information block node cache unit. If so, it determines that the element to be collected is a definite author information block node and sets a global flag indicating that the DOM tree traversal has entered an author information collection block; when the child nodes of the element to be collected are traversed, it judges whether each child node is an author node, and if so, collects and caches the child node as an author node; if not, it continues with the next element to be collected;

The text information block node cache unit is used to judge whether the id attribute or class attribute of the element to be collected contains the label article, post, main, or content. If not, it drives the child element integration and replacement unit. If so, it determines that the element to be collected is a definite text information block node and sets a global flag indicating that the DOM tree traversal has entered a text information collection block; in addition, if only non-definite text information blocks have been collected globally so far and no definite text information block has been collected, it empties the non-definite text information blocks collected so far;

The child element integration and replacement unit is used to determine whether the element to be collected has child elements. If it has, the unit judges whether the child elements can be integrated and replaced; if they can, it integrates the content of all child elements, replaces the content of the element to be collected with the result, and drives the child element cache unit; if they cannot, it drives the child element cache unit directly;

The child element cache unit is used to traverse all child nodes of the element to be collected and judge the type of each child node one by one. If a child node is an element node, the global node count is incremented by one and the picture node cache unit is driven to perform a recursive depth traversal again; if a child node is a text child node, its content is identified and, according to the identification result, the text child node is cached as a visual title node, a date node, or a possible body text node;

The information recording unit is used to record the node count serial number, the text node count serial number, and the node access path of the element to be collected in the DOM tree.
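The chain of checks performed by the picture node, picture information block, author information block, and text information block cache units can be sketched as a single classification function. The keyword lists are the ones named above; the function name, the returned labels, and the use of substring matching on the id and class attributes are assumptions.

```python
# Checked in the order in which the cache units drive one another.
BLOCK_KEYWORDS = [
    ("picture", ("image", "photo", "gallery")),
    ("author", ("author", "writtenby", "byline")),
    ("text", ("article", "post", "main", "content")),
]


def classify_block(tag, attrs):
    """Return a label for the element to be collected, or None if no unit claims it."""
    if tag.lower() == "img":
        return "picture_node"
    label = (attrs.get("id", "") + " " + attrs.get("class", "")).lower()
    for kind, words in BLOCK_KEYWORDS:
        if any(word in label for word in words):
            return kind + "_block"
    # No keyword matched: the element falls through to child element handling.
    return None
```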

In another preferred embodiment, the extraction module includes a body text extraction module, a publication time extraction module, and a body picture extraction module. The body text extraction module specifically includes:

A first sorting unit, for sorting all possible body text nodes in ascending order by node count serial number;

A first target node generation unit, for finding, among all possible body text nodes, the first target node, that is, the first node whose node count serial number is greater than that of the visual title node and whose sentence count is greater than 0 or whose content words are correlated with the content words of the visual title node, and denoting the first target node as the p1 node;

A loop unit, for searching backward from the p1 node for a second target node whose node count serial number differs from that of the p1 node by less than 3 and whose access path is similar, and replacing p1 with it, until no new second target node can be found;

A grouping unit, for clearing all possible body text nodes before the p1 node and grouping the remaining possible body text nodes by node access path, where the nodes within each group are sorted in ascending order by node count serial number and the groups are sorted in ascending order by the node count serial number of each group's first node;

A scoring unit, for computing the preset parameter values of each group, feeding the preset parameter values into a pre-trained prediction model for scoring, and producing the target groups whose score is greater than a preset score;

A first extraction unit, for sorting the nodes of all target groups in ascending order by node count serial number to form the body text node set, and caching the body text node set.

In a preferred embodiment, the publication time extraction module specifically includes:

A clearing unit, for clearing the invalid nodes among all date nodes, an invalid node being a node whose node count serial number comes after the first of the first target nodes;

A second extraction unit, for obtaining, among the date nodes remaining after clearing, the target date node nearest to the visual title node, where the node count serial number difference of the target date node is lower than a first preset value and its text node count serial number difference is lower than a second preset value.

Preferably, the body picture extraction module specifically includes:

A second sorting unit, for sorting the cached picture nodes in ascending order by node count serial number;

A second target node generation unit, for obtaining the target picture node and completely clearing the target picture node and all other picture nodes after it, the target picture node being the picture node that is close to the last first target node and whose node count serial number difference is greater than a third preset value;

An interpolated picture node generation unit, for obtaining the picture nodes whose node count serial number lies between the body text node and the visual title node and denoting them as interpolated picture nodes, then also denoting as interpolated picture nodes the picture nodes that are located before the visual title node and whose node distance from the visual title node is lower than a fourth preset value, merging them into the interpolated picture node set, and caching the non-interpolated picture nodes;

A third sorting unit, for obtaining the distance, in node count serial number, between each interpolated picture node and the visual title node, and sorting all interpolated picture nodes in ascending order of distance;

A pre-screening unit, for pre-screening all interpolated picture nodes according to the preset screening rules and filtering out invalid pictures unrelated to the body text;

A third extraction unit, for obtaining the node access paths of the interpolated picture nodes remaining after pre-screening, finding the nodes with the same node access path in the interpolated picture node set of the interpolated picture node generation unit, then driving the third sorting unit and the pre-screening unit again, and integrating the resulting interpolated picture nodes and non-interpolated picture nodes.

The reader should understand that, in the description of this specification, reference to the terms "one embodiment", "some embodiments", "an example", "a specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, provided they do not conflict with each other, those skilled in the art may combine the different embodiments or examples described in this specification and the features of those different embodiments or examples.

It is apparent to those skilled in the art that, for convenience and brevity of description, the specific working processes of the devices and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here.

In the several embodiments provided in this application, it should be understood that the disclosed devices and methods may be implemented in other ways. For example, the device embodiments described above are merely illustrative; the division into units is only a division by logical function, and other divisions are possible in an actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.

Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiments of the present invention.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

The above are only specific embodiments of the present invention, but the protection scope of the present invention is not limited to them. Any person skilled in the art can readily conceive of various equivalent modifications or substitutions within the technical scope disclosed by the present invention, and these modifications or substitutions shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

The foregoing are merely preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.

Claims

1. A general web page subject content extraction method, characterized in that it includes the following steps:

2. The general web page subject content extraction method according to claim 1, characterized in that step 1 specifically includes the following steps:

3. The general web page subject content extraction method according to claim 2, characterized in that step 2 specifically includes the following steps:

4. The general web page subject content extraction method according to claim 3, characterized in that, in step S202, collecting information from the element to be collected and caching it by category specifically includes the following steps:

5. The general web page subject content extraction method according to claim 4, characterized in that, in step 3, extracting the body text from the cached body text nodes specifically includes the following steps:

Cache the body text node set.

6. The general web page subject content extraction method according to claim 5, characterized in that, in step 3, extracting the publication time from the cached date nodes specifically includes the following steps:

7. The general web page subject content extraction method according to claim 6, characterized in that, in step 3, extracting the body pictures from the cached picture nodes specifically includes the following steps:

8. A general web page subject content extraction system, characterized in that it includes a DOM tree processing module, a cache module, and an extraction module,

9. The general web page subject content extraction system according to claim 8, characterized in that the DOM tree processing module includes:

10. The general web page subject content extraction system according to claim 9, characterized in that the cache module includes: