CN108920434A - A general method and system for extracting web page subject content - Google Patents
- Publication number: CN108920434A (CN201810572726.0A)
- Authority: CN (China)
- Prior art keywords: node, text, DOM tree, content, picture
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/14—Tree-structured documents
- G06F40/146—Coding or compression of tree-structured data
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The present invention relates to a general method and system for extracting web page subject content. The method comprises the following steps: constructing the DOM tree of a target web page, clearing nodes of the DOM tree, and attribute-marking the remaining nodes according to their relevance to the body text; traversing the DOM tree and caching the remaining nodes by category; and judging, from the distance between the nodes of each category and the visual title node, whether the content of those nodes is subject content, and completing the extraction of the subject content of the target web page according to the judgment result. The invention provides an improved semantic-based web page information extraction method. Exploiting the strong associations present in page structure, it identifies the visual title node of the body text in the DOM tree and caches the other nodes by category, then uses the distance in the DOM tree between each categorized node and the visual title node as an important basis for judging whether the node belongs to the subject content, thereby improving the precision and efficiency of web page information extraction.
Description
Technical field
The present invention relates to the technical field of computer software, and in particular to a general method and system for extracting web page subject content.
Background art
In the current Internet era, most of the information publicly visible on the network is presented as subject content, for example blog posts on blogs and domestic news on portal sites. Such subject content is an important channel through which most Internet users obtain information, serves as a massive base corpus for academic researchers, and is of significant value in the field of natural language processing. For many reasons, however, subject-content pages on the network do not consist of the subject content alone; they also contain information not directly related to it, such as advertisements, comments, related recommendations, and site navigation. How to extract the subject content of a web page from this cluttered page information has become a problem to be solved.
Existing subject-content extraction approaches generally fall into two kinds: semantic-based web page information extraction methods and vision-based web page segmentation methods. Both attempt to extract, from the structure of the web page, the information block in which the real subject content is located.
Semantic-based web page information extraction generally takes one of two forms. The first analyzes information from an entire website, attempts to find the modules repeated across different pages, such as navigation bars, and then removes these repeated modules when analyzing a specific page in order to find the subject content. The second relies only on the page currently being analyzed: it attempts to find certain block-level element nodes in the HTML, analyzes the text information of the node content, such as text length, and by comparison obtains the block-level element with the longest text.
The vision-based web page segmentation method attempts to render the whole page with a browser engine and then segments the rendered page according to factors such as the background color, font, and borders of the page elements. Closely associated elements are merged, while loosely associated elements are treated as separate blocks, which completes a vision-based block reconstruction of the whole page. This method has a drawback: during analysis it must build the DOM tree from the page source code while loading the CSS (Cascading Style Sheets) files and other resources the page depends on, and it relies on the browser engine for rendering, so it is relatively slow when analyzing massive amounts of data.
Summary of the invention
The present invention provides a general method and system for extracting web page subject content, which solve the technical problem in the prior art that web page subject content is extracted with low precision and efficiency.
The technical solution by which the present invention solves the above technical problem is as follows: a general method for extracting web page subject content, comprising the following steps:
Step 1, constructing the DOM tree of a target web page, clearing nodes of the DOM tree, and attribute-marking the remaining nodes of the DOM tree according to their relevance to the body text;
Step 2, traversing the attribute-marked DOM tree, and caching the remaining nodes of the DOM tree by category as picture nodes, date nodes, body text nodes, or visual title nodes;
Step 3, judging, according to the respective distances of the picture nodes, the date nodes, and the body text nodes from the visual title node, whether the content of the picture nodes, the content of the date nodes, and the content of the body text nodes is subject content, and completing the extraction of the subject content of the target web page according to the judgment result, the subject content comprising body-text pictures, publication time, and body text.
The beneficial effects of the invention are as follows: the invention provides an improved semantic-based web page information extraction method. Exploiting the strong associations present in page structure, it identifies the visual title node of the body text in the DOM tree and caches the other nodes by category, then uses the distance in the DOM tree between each categorized node and the visual title node as an important basis for judging whether the node belongs to the subject content, thereby improving the precision and efficiency of web page information extraction.
On the basis of the above technical solution, the present invention can be further improved as follows.
Further, step 1 specifically comprises the following steps:
S101, downloading the source code of the target web page and parsing the source code into a DOM tree;
S102, obtaining and caching the content of the title tag node in the DOM tree, and performing Chinese word segmentation and stop-word removal on the content of the title tag node to generate a title word set comprising several title words;
S103, traversing the DOM tree depth-first; after clearing the nodes of preset types from the DOM tree, judging whether the id attribute, class attribute, and/or style attribute of each remaining node satisfies a first preset condition, and according to the judgment result attribute-marking the remaining node as an element determined to be unrelated to the body text, an element possibly unrelated to the body text, or another element.
Further, step 2 specifically comprises the following steps:
S201, selecting the body element of the DOM tree as the start node of a depth-first recursive traversal, and generating the node access path corresponding to each remaining element in the DOM tree;
S202, according to the attribute marking information of the remaining elements in the DOM tree, taking both the elements possibly unrelated to the body text and the other elements as to-be-collected elements, performing information collection on the to-be-collected elements, and caching them by category as picture nodes, author nodes, date nodes, body text nodes, or visual title nodes.
Further, in step S202, performing information collection on the to-be-collected elements and caching them by category specifically comprises the following steps:
Step a, judging whether the element tag of the to-be-collected element is an img tag; if so, collecting and caching the to-be-collected element as a picture node; if not, executing step b;
Step b, judging whether the id attribute or class attribute of the to-be-collected element contains the keyword image, photo, or gallery; if not, executing step c; if so, determining that the to-be-collected element is a definite picture information block node and setting a global flag indicating that the traversal of the DOM tree has entered a picture information collection block; when the child nodes of the to-be-collected element are traversed, judging whether each child node is a picture node; if so, collecting and caching the child node as a picture node; if not, continuing to judge the next to-be-collected element;
Step c, judging whether the id attribute or class attribute of the to-be-collected element contains the keyword author, writtenby, or byline; if not, executing step d; if so, determining that the to-be-collected element is a definite author information block node and setting a global flag indicating that the traversal of the DOM tree has entered an author information collection block; when the child nodes of the to-be-collected element are traversed, judging whether each child node is an author node; if so, collecting and caching the child node as an author node; if not, continuing to judge the next to-be-collected element;
Step d, judging whether the id attribute or class attribute of the to-be-collected element contains the keyword article, post, main, or content; if not, executing step e; if so, determining that the to-be-collected element is a definite body text information block node and setting a global flag indicating that the traversal of the DOM tree has entered a body text information collection block; at the same time, if no definite body text information block has yet been collected globally and only non-definite body text information blocks have been collected, clearing the non-definite body text information blocks collected so far;
Step e, determining whether the to-be-collected element has child elements; if it has, judging whether the child elements can be integrated and replaced; if they can, replacing the content of the to-be-collected element with the integrated content of all its child elements and executing step f; if they cannot, executing step f directly;
Step f, traversing all child nodes of the to-be-collected element and processing them one by one as follows: judging the type of the child node; if the child node is an element node, incrementing the global node count by one and returning to step a to perform the recursive depth-first traversal again; if the child node is a text child node, identifying the content of the text child node and, according to the recognition result, caching the text child node as a visual title node, a date node, or a possible body text node;
During the above depth-first recursive traversal, the node count sequence number, the text node count sequence number, and the node access path of each to-be-collected element in the DOM tree are recorded.
Further, extracting the body text in step 3 according to the cached body text nodes specifically comprises the following steps:
sorting all possible body text nodes in ascending order of node count sequence number;
finding, among all possible body text nodes, the first target node, namely the first node whose node count sequence number is greater than that of the visual title node and whose sentence count is greater than 0 or whose content words are related to the content words of the visual title node, and recording the first target node as the p1 node;
starting from the p1 node, searching backward for a second target node whose node count sequence number differs from that of the p1 node by less than 3 and whose access path is similar, replacing p1 with it, and then repeating this step until no new second target node can be found;
clearing all possible body text nodes before the p1 node, and grouping all remaining possible body text nodes by node access path, the nodes within each group being sorted in ascending order of node count sequence number and the groups being sorted in ascending order of the node count sequence number of the first node of each group;
calculating the preset parameter values of each group and feeding them into a pre-trained prediction model for scoring, to obtain the target groups whose score is greater than a preset score;
sorting the nodes of all target groups in ascending order of node count sequence number to form a body text node set;
caching the body text node set.
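The first three steps above (sorting, locating p1, and the backward search) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `TextNode` record, the function name `find_body_start`, and the reading of "similar access path" as plain string equality are all hypothetical, and the sketch does not restrict the backward search to nodes after the visual title node because the text leaves that point open.

```python
from dataclasses import dataclass

@dataclass
class TextNode:
    seq: int                        # node count sequence number
    path: str                       # node access path (or loose access path)
    sentences: int                  # number of sentences in the node text
    related_to_title: bool = False  # content words related to the title words

def find_body_start(nodes, title_seq, max_gap=3):
    """Return the final p1 node, or None if no candidate follows the title."""
    ordered = sorted(nodes, key=lambda n: n.seq)
    # First target node: first node after the title with at least one
    # sentence, or with content words related to the title.
    p1 = next((n for n in ordered
               if n.seq > title_seq and (n.sentences > 0 or n.related_to_title)),
              None)
    if p1 is None:
        return None
    # Backward search: move p1 to a nearby earlier node on the same path
    # until no such node remains. Path equality is an assumed stand-in
    # for the "similar access path" test.
    while True:
        earlier = [n for n in ordered
                   if n.seq < p1.seq and p1.seq - n.seq < max_gap
                   and n.path == p1.path]
        if not earlier:
            return p1
        p1 = earlier[-1]  # nearest preceding candidate
```

The loop always terminates because the sequence number of p1 strictly decreases on every replacement.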
Further, extracting the publication time in step 3 according to the cached date nodes specifically comprises the following steps:
clearing the invalid nodes among all date nodes, an invalid node being a node whose node count sequence number comes after that of the earliest first target node;
obtaining, among the date nodes remaining after clearing, the target date node nearest to the visual title node, where the node count sequence number difference of the target date node is lower than a first preset value and its text node count sequence number difference is lower than a second preset value.
Further, extracting the body-text pictures in step 3 according to the cached picture nodes specifically comprises the following steps:
Step 001, sorting the cached picture nodes in ascending order of node count sequence number;
Step 002, obtaining a target picture node, and clearing the target picture node together with all picture nodes after it, the target picture node being the picture node that is closest to the last first target node and whose node count sequence number difference is greater than a third preset value;
Step 003, obtaining the picture nodes whose node count sequence numbers lie between the body text nodes and the visual title node and recording them as interpolated picture nodes; then also recording as interpolated picture nodes the picture nodes that are located before the visual title node and whose node distance from the visual title node is lower than a fourth preset value, and merging them into the interpolated picture node set, while caching the non-interpolated picture nodes;
Step 004, obtaining the distance between the node count sequence number of each interpolated picture node and that of the visual title node, and sorting all interpolated picture nodes in ascending order of distance;
Step 005, pre-screening all interpolated picture nodes according to preset screening rules, and filtering out the invalid pictures unrelated to the body text;
Step 006, obtaining the node access paths of the interpolated picture nodes remaining after pre-screening, finding the nodes with the same node access path in the interpolated picture node set of step 003, then repeating step 004 and step 005, and integrating the re-screened interpolated picture nodes with the non-interpolated picture nodes.
To solve the technical problem of the invention, a general system for extracting web page subject content is also provided, comprising a DOM tree processing module, a cache module, and an extraction module.
The DOM tree processing module is configured to construct the DOM tree of a target web page, clear nodes of the DOM tree, and attribute-mark the remaining nodes of the DOM tree according to their relevance to the body text;
The cache module is configured to traverse the attribute-marked DOM tree and cache the remaining nodes of the DOM tree by category as picture nodes, date nodes, body text nodes, or visual title nodes;
The extraction module is configured to judge, according to the respective distances of the picture nodes, the date nodes, and the body text nodes from the visual title node, whether the content of the picture nodes, the content of the date nodes, or the content of the body text nodes is subject content, and to complete the extraction of the subject content of the target web page according to the judgment result, the subject content comprising body-text pictures, publication time, and body text.
Further, the DOM tree processing module comprises:
a parsing unit, configured to download the source code of the target web page and parse the source code into a DOM tree;
a title word generation unit, configured to obtain and cache the content of the title tag node in the DOM tree, and to perform Chinese word segmentation and stop-word removal on the content of the title tag node to generate a title word set comprising several title words;
a marking unit, configured to traverse the DOM tree depth-first and, after the nodes of preset types have been cleared from the DOM tree, to judge whether the id attribute, class attribute, and/or style attribute of each remaining node satisfies a first preset condition, and according to the judgment result to attribute-mark the remaining node as an element determined to be unrelated to the body text, an element possibly unrelated to the body text, or another element.
Further, the cache module comprises:
a path generation unit, configured to select the body element of the DOM tree as the start node of a depth-first recursive traversal and to generate the node access path corresponding to each remaining element in the DOM tree;
a cache unit, configured to take, according to the attribute marking information of the remaining elements in the DOM tree, both the elements possibly unrelated to the body text and the other elements as to-be-collected elements, to perform information collection on the to-be-collected elements, and to cache them by category as picture nodes, author nodes, date nodes, body text nodes, or visual title nodes.
Additional aspects and advantages of the invention will be set forth in part in the description that follows, and in part will become apparent from that description or be learned through practice of the invention.
Brief description of the drawings
Fig. 1 is a schematic flowchart of the general method for extracting web page subject content provided by Embodiment 1;
Fig. 2 is a schematic structural diagram of the general system for extracting web page subject content provided by Embodiment 2.
Detailed description of the embodiments
The principles and features of the present invention are described below with reference to the accompanying drawings. The examples given serve only to explain the invention and are not intended to limit its scope.
Fig. 1 is a schematic flowchart of the general method for extracting web page subject content provided by Embodiment 1. As shown in Fig. 1, the method comprises the following steps:
Step 1, constructing the DOM tree of the target web page, and clearing and attribute-marking the nodes of the DOM tree;
Step 2, traversing the cleared and attribute-marked DOM tree, and caching the remaining nodes of the DOM tree by category as picture nodes, visual title nodes, date nodes, or body text nodes;
Step 3, extracting the subject content of the target web page from the cached information, the subject content comprising body text, publication time, and body-text pictures.
The above embodiment exploits the strong associations present in page structure: it identifies the visual title node of the body text in the DOM tree and caches the other nodes by category, then uses the distance in the DOM tree between each categorized node and the visual title node as an important basis for judging whether the node belongs to the subject content, thereby improving the precision and efficiency of web page information extraction. Each step of the above embodiment is described in detail below.
In Embodiment 1 above, step 1 specifically comprises the following steps:
S101, downloading the source code of the target web page and parsing the source code into a DOM tree. The web page link of the target web page is usually given, so the source code can be downloaded through that link and then parsed into a DOM tree using open-source tools. Specific parsing methods are documented in the prior art and are not described in detail here.
S102, obtaining and caching the content of the title tag node in the DOM tree, and performing Chinese word segmentation and stop-word removal on the content of the title tag node to generate a title word set comprising several title words. Specifically, a CSS selector can be used to find the title tag node in the DOM tree and obtain its content, which is the title information of the target web page. After Chinese word segmentation and stop-word removal are applied to the title information, the title word set is obtained, and the title words in this set are used to identify the visual title node in the body text. Here the visual title node refers to the node in which the title words appear, not the title tag node mentioned above.
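Step S102 can be sketched with the Python standard library as follows. This is only an illustrative sketch: the names `TitleParser`, `title_word_set`, and `STOP_WORDS` are hypothetical, the stop-word list is a toy one, and the default regex tokenizer is a stand-in for a real Chinese word segmenter, which would be passed in through the assumed `segment` parameter.

```python
import re
from html.parser import HTMLParser

# Toy stop-word list, assumed for illustration only.
STOP_WORDS = {"the", "a", "of", "and"}

class TitleParser(HTMLParser):
    """Collects the text inside the <title> tag."""
    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def title_word_set(html, segment=None):
    """Return the set of title words after segmentation and stop-word removal."""
    parser = TitleParser()
    parser.feed(html)
    # Placeholder tokenizer; a Chinese segmenter would replace this callable.
    segment = segment or (lambda text: re.findall(r"\w+", text.lower()))
    return {w for w in segment(parser.title) if w not in STOP_WORDS}
```

The resulting set is what later steps would compare against the text of candidate nodes when looking for the visual title node.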
S103, traversing the DOM tree depth-first; after clearing the nodes of preset types from the DOM tree, judging whether the id attribute, class attribute, and/or style attribute of each remaining node satisfies the first preset condition, and according to the judgment result attribute-marking the remaining node as an element determined to be unrelated to the body text, an element possibly unrelated to the body text, or another element. In this embodiment, the nodes of preset types are nodes obviously unrelated to the body text, such as nodes that are neither text nodes nor element nodes, the various script nodes, and nodes such as meta, title, and link.
The first preset condition is as follows: if the id attribute or class attribute of a node contains text such as banner, comment, sidebar, or logo, or the style attribute of the node contains display:none, the node is judged to be an element determined to be unrelated to the body text. After the attributes of a remaining node have been judged and a judgment result generated, the node is marked with a special marking attribute, score, but is not cleared directly, so that the subsequent node count sequence numbering is not disturbed. Because the remaining nodes have been given element attribute marks in this step, they are referred to as remaining elements in the description of the following steps; the two terms have the same meaning.
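The first preset condition can be sketched as a small predicate over a node's attributes. This is a hedged sketch, not the patent's code: the function name `is_definitely_unrelated`, the dictionary representation of attributes, and the whitespace-insensitive matching of `display:none` are assumptions. The criteria for the "possibly unrelated" mark are not spelled out in this passage, so the sketch covers only the "determined to be unrelated" case.

```python
import re

# Keywords named in the first preset condition.
UNRELATED_KEYWORDS = ("banner", "comment", "sidebar", "logo")

def is_definitely_unrelated(attrs):
    """attrs: dict of attribute name -> value for one element (assumed shape)."""
    id_class = (attrs.get("id", "") + " " + attrs.get("class", "")).lower()
    if any(keyword in id_class for keyword in UNRELATED_KEYWORDS):
        return True
    # Strip whitespace so "display: none" and "display:none" both match.
    style = re.sub(r"\s+", "", attrs.get("style", "").lower())
    return "display:none" in style
```

Consistent with the text above, a caller would record the result as a mark on the node rather than delete the node, so that later sequence numbering stays stable.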
The DOM tree is then traversed and information is collected, which specifically comprises the following steps:
S201, selecting the body element of the DOM tree as the start node of a depth-first recursive traversal, and generating the node access path corresponding to each remaining element in the DOM tree. The node access path of the body element is the empty string. The node access path of a remaining element in the DOM tree is the full path from the body element to that element, formed by concatenating, for each node on the path, the node name and the sequence number of the node under its parent node; since there is only one body element, its sequence number does not need to be recorded. For example, the access path of the third p element under the second div element under body is body.div[2].p[3]. In addition, when a node access path is long, a loose access path of the node can be defined, that is, the indices on the last 3 levels of the element's node access path are ignored, and the loose access path is used in place of the node access path in subsequent steps. For example, if the node access path of an element is body.div[2].div[1].table[1].div[2].p[1], the corresponding loose access path is body.div[2].div[1].table.div.p.
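The path construction and its loose variant can be sketched as below. The helper names `access_path` and `loose_path` and the `(tag, index)` input format are assumptions made for illustration; the text does not state how long a path must be before the loose form is used, so the sketch leaves that decision to the caller.

```python
import re

def access_path(steps):
    """Build a node access path from (tag, index) pairs below body."""
    return ".".join(["body"] + [f"{tag}[{index}]" for tag, index in steps])

def loose_path(path, levels=3):
    """Drop the sibling indices on the last `levels` levels of a path."""
    parts = path.split(".")
    keep = max(len(parts) - levels, 0)
    # Remove a trailing "[n]" from each of the last `levels` components.
    tail = [re.sub(r"\[\d+\]$", "", part) for part in parts[keep:]]
    return ".".join(parts[:keep] + tail)
```

Applied to the two examples in the paragraph above, `access_path([("div", 2), ("p", 3)])` yields `body.div[2].p[3]`, and `loose_path("body.div[2].div[1].table[1].div[2].p[1]")` yields `body.div[2].div[1].table.div.p`.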
S202, according to the attribute marking information of the remaining elements in the DOM tree, taking both the elements possibly unrelated to the body text and the other elements as to-be-collected elements, performing information collection on the to-be-collected elements, and caching them by category as picture nodes, author nodes, visual title nodes, date nodes, or body text nodes. The specific caching method is as follows:
Step a, judging whether the element tag of the to-be-collected element is an img tag; if so, collecting and caching the to-be-collected element as a picture node, because a picture element cannot at the same time be a date, title, or similar element; if not, executing step b.
Step b, judging whether the id attribute or class attribute of the to-be-collected element contains the keyword image, photo, or gallery; if not, executing step c; if so, determining that the to-be-collected element is a definite picture information block node and setting a global flag indicating that the traversal of the DOM tree has entered a picture information collection block. When the child nodes of the to-be-collected element are traversed, it is only necessary to judge whether each child node is a picture node, without attempting to judge whether it is an author, date, or title element, which improves extraction efficiency. If so, the child node is collected and cached as a picture node; if not, the next to-be-collected element is judged.
Step c, judging whether the id attribute or class attribute of the to-be-collected element contains the keyword author, writtenby, or byline; if not, executing step d; if so, determining that the to-be-collected element is a definite author information block node and setting a global flag indicating that the traversal of the DOM tree has entered an author information collection block. When the child nodes of the to-be-collected element are traversed, it is only necessary to judge whether each child node is an author node, without attempting to identify whether it is a picture, date, or title element, which further improves extraction efficiency. If so, the child node is collected and cached as an author node; if not, the next to-be-collected element is judged.
Step d, judging whether the id attribute or class attribute of the to-be-collected element contains the keyword article, post, main, or content; if not, executing step e; if so, determining that the to-be-collected element is a definite body text information block node and setting a global flag indicating that the traversal of the DOM tree has entered a body text information collection block; at the same time, if no definite body text information block has yet been collected globally and only non-definite body text information blocks have been collected, clearing the non-definite body text information blocks collected so far.
Step e, determining whether the to-be-collected element has child elements; if it has, judging whether the child elements can be integrated and replaced; if they can, replacing the content of the to-be-collected element with the integrated content of all its child elements and executing step f; if they cannot, executing step f directly.
Step f, traversing all child nodes of the to-be-collected element and processing them one by one as follows: judging the type of the child node; if the child node is an element node, incrementing the global node count by one and returning to step a to perform the recursive depth-first traversal again; if the child node is a text child node, identifying the content of the text child node and, according to the recognition result, caching the text child node as a visual title node, a date node, or a possible body text node.
During the above depth-first recursive traversal, the node count sequence number, the text node count sequence number, and the node access path of each to-be-collected element in the DOM tree are recorded.
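The tag and keyword checks of steps a through d can be sketched as a single dispatch function. This is a simplified illustration rather than the patented procedure: the function name `classify_element`, the returned labels, and the use of substring matching are assumptions, and the global traversal flags, child-node handling, and cache clearing described above are omitted.

```python
# Keyword lists taken from steps b, c and d; labels are hypothetical.
PICTURE_KEYS = ("image", "photo", "gallery")
AUTHOR_KEYS = ("author", "writtenby", "byline")
TEXT_KEYS = ("article", "post", "main", "content")

def classify_element(tag, attrs):
    """Return a label for the element, or None to fall through to step e."""
    if tag == "img":                                  # step a
        return "picture_node"
    marker = (attrs.get("id", "") + " " + attrs.get("class", "")).lower()
    if any(key in marker for key in PICTURE_KEYS):    # step b
        return "picture_block"
    if any(key in marker for key in AUTHOR_KEYS):     # step c
        return "author_block"
    if any(key in marker for key in TEXT_KEYS):       # step d
        return "text_block"
    return None
```

Checking the conditions in this fixed order mirrors the a, b, c, d sequence, so an element matching several keyword lists takes the first label that applies.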
In step e of the above embodiment, the specific method for judging whether the child elements of the to-be-collected element can be integrated and replaced is as follows:
1) If the to-be-collected element is a pre element, a heading element h1 to h6, or another display tag such as strong, b, i, or em, the child elements of the to-be-collected element can be integrated and replaced, that is, merged directly into one element;
2) If the to-be-collected element is a p element, it is judged whether the to-be-collected element satisfies a first pre-integration condition and, if it does, whether it also satisfies a second pre-integration condition; when both conditions are satisfied, the child elements of the to-be-collected element can be integrated and replaced;
The first pre-integration condition is: the to-be-collected element contains more than one text child node, or the ratio of the word count of link text to the word count of plain text in the child elements of the to-be-collected element is less than one third;
The second pre-integration condition is: the to-be-collected element has a node containing more than one sentence, the access path of the to-be-collected element is consistent with the node access path of the previously collected text node, or the to-be-collected element is a simple element. A simple element is an element that contains at most simple elements and text nodes; the definition is recursive;
3) If the to-be-collected element contains both child element nodes and text child nodes, it is checked whether all the text of the to-be-collected element constitutes a short text; if it does, the child elements of the to-be-collected element can be integrated and replaced. A short text is a text that contains fewer than 3 stop words after Chinese word segmentation.
In step f of the above embodiment, the text child node is cached, according to the recognition result, as a visible title node, a date node or a possible body text node as follows:
1) Compare the text content of the text child node with the title words in the title word set for similarity, and determine from the comparison result whether the text child node is a visible title node;
2) Extract date and time information from the text content of the text child node using a regular expression; if extraction succeeds and the date and time text accounts for more than the preset threshold of 0.5 of the entire text content, determine that the text child node is a pure date node and not a node of any other type, for example "2018-04-13 07:03:37 Source: Xinhua News Agency";
3) If the text child node is neither a visible title node nor a pure date node, cache it as a possible body text node for subsequent analysis.
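The pure date check in item 2) can be sketched as follows. The regular expression and the function name are assumptions chosen for illustration; only the 0.5 threshold comes from the text, and the ratio is computed on the text as it appears on the page.

```python
import re

# Hypothetical pattern: a year-month-day date with an optional time of day.
DATE_RE = re.compile(
    r"\d{4}[-/.年]\d{1,2}[-/.月]\d{1,2}日?(?:\s+\d{1,2}:\d{2}(?::\d{2})?)?"
)


def is_pure_date(text: str, threshold: float = 0.5) -> bool:
    """True if a date/time match covers more than `threshold` of the text."""
    stripped = text.strip()
    if not stripped:
        return False
    match = DATE_RE.search(stripped)
    if match is None:
        return False
    return len(match.group(0)) / len(stripped) > threshold
```

A node whose text is mostly a timestamp plus a short source label passes the check, while a sentence that merely mentions a date does not.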
Body text is then extracted from the cached possible body text nodes. The extraction rests mainly on two facts: first, body text nodes come after the visible title node, that is, their node count index is greater than that of the visible title node; second, body text nodes have similar access paths. Based on these facts, extracting the body text specifically includes the following steps:
1) Sort all possible body text nodes in ascending order of node count index;
2) Among all possible body text nodes, find the first target node, which is the first node whose node count index is greater than that of the visible title node and whose sentence count is greater than 0 or whose content words are correlated with the content words of the visible title node, and denote the first target node as the p1 node;
3) Starting from the p1 node, search backward for a second target node whose node count index differs from that of the p1 node by less than 3 and whose access path is similar, and replace p1 with it; repeat this step until no new second target node can be found;
4) Remove all possible body text nodes before the p1 node, and group the remaining possible body text nodes by node access path; within each group, sort in ascending order of node count index, and sort the groups in ascending order of the node count index of the first node of each group;
5) Calculate the preset parameter values of each group, and feed them into a pre-trained prediction model for scoring, producing the target groups whose score is greater than a preset score;
6) Sort the nodes of all target groups in ascending order of node count index to form the body text node set;
7) Cache the body text node set.
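Steps 1) to 4) above can be sketched as follows. The TextNode class and the function names are hypothetical, "similar access path" is simplified to exact path equality, and the title keyword correlation test of step 2) and the model scoring of step 5) are left out because the text does not specify them.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class TextNode:
    index: int       # node count index
    path: str        # node access path, e.g. "body/div/p"
    sentences: int   # number of sentences in the node text


def find_start(nodes: List[TextNode], title_index: int) -> Optional[TextNode]:
    """Steps 1-3: locate p1, then walk it backward over close, same-path nodes."""
    ordered = sorted(nodes, key=lambda n: n.index)
    p1 = next((n for n in ordered
               if n.index > title_index and n.sentences > 0), None)
    if p1 is None:
        return None
    moved = True
    while moved:
        moved = False
        for n in reversed(ordered):
            # Nearest earlier node within an index gap of less than 3.
            if n.index < p1.index and p1.index - n.index < 3 and n.path == p1.path:
                p1, moved = n, True
                break
    return p1


def group_by_path(nodes: List[TextNode], p1: TextNode) -> List[List[TextNode]]:
    """Step 4: drop nodes before p1, group by path, order groups by first node."""
    groups: Dict[str, List[TextNode]] = {}
    for n in sorted(nodes, key=lambda n: n.index):
        if n.index >= p1.index:
            groups.setdefault(n.path, []).append(n)
    return sorted(groups.values(), key=lambda g: g[0].index)
```

Each resulting group would then be scored by the prediction model, and the nodes of the accepted groups merged back into index order.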
In step 5) of this embodiment, the preset parameter values include the number of nodes, the total number of sentences, the total number of related words, the average number of related words, the text node count index difference, the node count index difference, and the similarity between the node access path of the current group and that of the previous target group. As for the text node count index difference: for the first group, it is the distance between the first text node of the current group and the visible title node; for the other groups, it is the distance between the first text node of the current group and the last node of the previous target group.
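A few of these parameters can be computed as in the sketch below. The tuple layout, the function name and the fallback to the title when no previous target group exists are assumptions; the related-word counts and the path similarity are omitted because the text does not define how they are measured.

```python
from typing import Dict, List, Optional, Tuple

# Each node is (text_node_count_index, sentence_count); the layout is illustrative.
Group = List[Tuple[int, int]]


def group_features(group: Group,
                   title_text_index: int,
                   prev_target: Optional[Group] = None) -> Dict[str, int]:
    """Compute node count, total sentences and the text node index gap."""
    first_text_index = group[0][0]
    if prev_target is None:
        # First group: distance to the visible title node.
        gap = first_text_index - title_text_index
    else:
        # Later groups: distance to the last node of the previous target group.
        gap = first_text_index - prev_target[-1][0]
    return {
        "node_count": len(group),
        "total_sentences": sum(s for _, s in group),
        "text_gap": gap,
    }
```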
The publication time is then extracted from the cached date nodes, which specifically includes the following steps:
1) Remove the invalid nodes from all date nodes. An invalid node is a node whose node count index comes after that of the first of the first target nodes found in the body text extraction analysis, because the publication date node is either before the visible title node or between the visible title node and the first body text node.
2) From the date nodes remaining after the removal, obtain the target date node nearest to the visible title node, where the node count index difference of the target date node is lower than a first preset value and its text node count index difference is lower than a second preset value.
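The two steps above can be sketched as follows. The DateNode class, the function name and the default values for the two preset limits are hypothetical; the text does not give concrete values.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DateNode:
    index: int        # node count index
    text_index: int   # text node count index
    text: str


def pick_publication_date(dates: List[DateNode],
                          title_index: int,
                          title_text_index: int,
                          first_body_index: int,
                          max_node_gap: int = 20,
                          max_text_gap: int = 5) -> Optional[DateNode]:
    """Drop dates after the first body text node, then take the one nearest the title."""
    valid = [d for d in dates if d.index < first_body_index]
    if not valid:
        return None
    best = min(valid, key=lambda d: abs(d.index - title_index))
    if (abs(best.index - title_index) < max_node_gap
            and abs(best.text_index - title_text_index) < max_text_gap):
        return best
    return None
```

Dates that appear inside or after the body text, such as comment timestamps, are discarded before the nearest candidate is chosen.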
Finally, the body images are extracted from the cached picture nodes, which specifically includes the following steps:
Step 001, sort the cached picture nodes in ascending order of node count index;
Step 002, obtain the target picture node, and remove the target picture node and all picture nodes after it; the target picture node is the picture node that is nearest to the last first target node and whose node count index difference is greater than a preset value;
Step 003, obtain the picture nodes whose node count index lies between the body text nodes and the visible title node and denote them as interpolated picture nodes; also denote as interpolated picture nodes the picture nodes that are located before the visible title node and whose node distance is lower than a preset value, and add them to the interpolated picture node set; at the same time, cache the non-interpolated picture nodes;
Step 004, obtain the node count index distance between each interpolated picture node and the visible title node, and sort all interpolated picture nodes in ascending order of distance;
Step 005, pre-screen all interpolated picture nodes according to the preset screening rules to filter out invalid pictures unrelated to the body text;
Step 006, obtain the node access paths of the interpolated picture nodes remaining after pre-screening, find the nodes with the same node access path in the interpolated picture node set of step 003, then repeat step 004 and step 005, and integrate the re-screened interpolated picture nodes with the non-interpolated picture nodes.
In the above embodiment, the preset screening rules include the following:
Rule 1: based on the image link of an interpolated picture node, filter out common advertising links, for example where the path of the URL (Uniform Resource Locator) contains common advertising wording, common social network external links or a logo.
Rule 2: obtain the picture size information of the interpolated picture node, and filter out banner pictures according to the aspect ratio of the picture as well as small pictures whose size is lower than a preset value. The picture size information can be obtained as follows: if the current node specifies the width and height attributes and the attribute values are within a valid range, the picture size information is taken directly from them; otherwise, a network input stream is opened from the picture URL to obtain it. When the picture size information is obtained over the network, the full picture does not need to be downloaded; only the size information at the head of the network input stream is read. The loose access path of the picture node is recorded at the same time; when other picture nodes are traversed, if another node has the same loose access path as this node, it can use the size information of this picture node directly, without opening an additional network request.
Rule 3: based on the node path, starting from the picture node, trace back at most 3 levels of nodes, score the picture node using the id and class attributes of those nodes, and filter out the pictures determined to be unrelated according to the score. The loose access path of the node is recorded at the same time; when other picture nodes are traversed, if another node has the same loose access path as this node, it is filtered out directly.
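Rules 1 and 2 can be sketched as follows. The keyword list, the size limits and all function names are assumptions made for illustration, and the network read of the image header is replaced by a caller-supplied fetch function so that the sketch stays self-contained.

```python
from typing import Callable, Dict, Tuple
from urllib.parse import urlparse

# Hypothetical hints for advertising, social sharing and logo images.
AD_HINTS = ("/ads/", "advert", "banner", "logo", "share")


def looks_like_ad(url: str) -> bool:
    """Rule 1: check the URL path for common advertising wording."""
    path = urlparse(url).path.lower()
    return any(hint in path for hint in AD_HINTS)


def bad_shape(width: int, height: int,
              min_side: int = 100, max_ratio: float = 4.0) -> bool:
    """Rule 2: reject small pictures and banner-like aspect ratios."""
    if width <= 0 or height <= 0:
        return True
    if min(width, height) < min_side:
        return True
    return max(width, height) / min(width, height) > max_ratio


# Sizes keyed by loose access path, so similar nodes reuse one lookup.
size_cache: Dict[str, Tuple[int, int]] = {}


def cached_size(loose_path: str,
                fetch: Callable[[], Tuple[int, int]]) -> Tuple[int, int]:
    """Return the size recorded for this loose path, fetching it only once."""
    if loose_path not in size_cache:
        size_cache[loose_path] = fetch()
    return size_cache[loose_path]


def keep_image(url: str, width: int, height: int) -> bool:
    """Combine rules 1 and 2 into a single keep-or-drop decision."""
    return not looks_like_ad(url) and not bad_shape(width, height)
```

In a fuller implementation the fetch function would read only the leading bytes of the image stream to obtain the dimensions, as the rule describes.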
In step 3 of the above embodiment, nodes are grouped by node access path, and scoring is performed per group to determine whether the node content in a group belongs to the subject content, which further improves the efficiency of web page subject content extraction.
The flow of the general web page subject content extraction method has been described above with reference to Fig. 1; the structure of the general web page subject content extraction system is described below with reference to Fig. 2.
Fig. 2 is a structural schematic diagram of a general web page subject content extraction system provided by embodiment 2 of the present invention. As shown in Fig. 2, the system includes a DOM tree processing module, a cache module and an extraction module,
The DOM tree processing module is used to build the DOM tree of the target web page, clean up the nodes of the DOM tree, and mark the attributes of the remaining nodes of the DOM tree according to their relevance to the body content;
The cache module is used to traverse the attribute-marked DOM tree and to classify and cache the remaining nodes of the DOM tree as picture nodes, date nodes, body text nodes or visible title nodes;
The extraction module is used to determine, according to the distances of the picture nodes, the date nodes and the body text nodes from the visible title node, whether the content of the picture nodes, the content of the date nodes or the content of the body text nodes is subject content, and to complete the extraction of the subject content of the target web page according to the determination result; the subject content includes body images, publication time and body text.
The above embodiment relies on the strong association present in the page structure: it identifies the visible title node of the body text in the DOM tree and classifies and caches the other nodes, and then uses the distance between each other category of node and the visible title node as an important basis for determining whether a node belongs to the subject content, thereby improving the precision and efficiency of web page information extraction.
In a preferred embodiment, the DOM tree processing module includes:
A parsing unit, used to download the source code of the target web page and parse the source code into a DOM tree;
A title word generation unit, used to obtain and cache the content of the title tag node in the DOM tree, and to perform Chinese word segmentation and stop word removal on the content of the title tag node to generate a title word set including several title words;
A marking unit, used to traverse the DOM tree depth-first and, after cleaning up the nodes of preset types in the DOM tree, to determine whether the id attribute, class attribute and/or style attribute of each remaining node meets a first preset condition, and to mark each remaining node, according to the determination result, as an element determined to be unrelated to the body text, an element possibly unrelated to the body text, or another element.
In another preferred embodiment, the cache module includes:
A path generation unit, used to select the body element of the DOM tree as the start node for a depth-first recursive traversal and to generate the node access path corresponding to each remaining element in the DOM tree;
A cache unit, used to take, according to the attribute mark information of the remaining elements in the DOM tree, both the elements possibly unrelated to the body text and the other elements as elements to be collected, and to collect information from the elements to be collected and classify and cache them as picture nodes, author nodes, date nodes, body text nodes or visible title nodes.
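The path generation performed during the depth-first traversal can be sketched as follows. The sketch assumes well-formed markup so that the standard library XML parser can stand in for an HTML parser, and the function name and the slash-separated path format are illustrative choices, not taken from the text.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def node_paths(markup: str) -> List[Tuple[int, str]]:
    """Return (node_count_index, access_path) for every element, depth-first."""
    root = ET.fromstring(markup)
    records: List[Tuple[int, str]] = []
    counter = 0

    def visit(elem: ET.Element, parent_path: str) -> None:
        nonlocal counter
        counter += 1  # the global node count grows with each element visited
        path = f"{parent_path}/{elem.tag}" if parent_path else elem.tag
        records.append((counter, path))
        for child in elem:
            visit(child, path)

    visit(root, "")
    return records
```

Sibling paragraphs under the same container receive the same access path, which is what later allows body text nodes to be grouped by path.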
The cache unit includes a picture node cache unit, a picture information block node cache unit, an author information block node cache unit, a body text information block node cache unit, a child element merge-and-replace unit, a child element cache unit and an information recording unit,
The picture node cache unit is used to determine whether the element tag of the element to be collected is an img tag; if so, it collects and caches the element to be collected as a picture node; if not, it drives the first judging unit;
The picture information block node cache unit is used to determine whether the id attribute or class attribute of the element to be collected contains an image, photo or gallery label; if not, it drives the author information block node cache unit; if so, it determines that the element to be collected is a definite picture information block node and sets a global flag indicating that the DOM tree traversal has entered a picture information collection block; when the child nodes of the element to be collected are traversed, it determines whether each child node is a picture node; if so, it collects and caches the child node as a picture node; if not, it goes on to examine the next element to be collected;
The author information block node cache unit is used to determine whether the id attribute or class attribute of the element to be collected contains an author, writtenby or byline label; if not, it drives the body text information block node cache unit; if so, it determines that the element to be collected is a definite author information block node and sets a global flag indicating that the DOM tree traversal has entered an author information collection block; when the child nodes of the element to be collected are traversed, it determines whether each child node is an author node; if so, it collects and caches the child node as an author node; if not, it goes on to examine the next element to be collected;
The body text information block node cache unit is used to determine whether the id attribute or class attribute of the element to be collected contains an article, post, main or content label; if not, it drives the child element merge-and-replace unit; if so, it determines that the element to be collected is a definite body text information block node and sets a global flag indicating that the DOM tree traversal has entered a body text information collection block; at the same time, if so far only non-definite body text information blocks, and no definite body text information block, have been collected globally, it empties the non-definite body text information blocks collected so far;
The child element merge-and-replace unit is used to determine whether the element to be collected has child elements; if it has, it determines whether the child elements can be merged and replaced; if they can, it merges the content of all child elements, replaces the content of the element to be collected with the result, and drives the child element cache unit; if they cannot, it drives the child element cache unit directly;
The child element cache unit is used to traverse all child nodes of the element to be collected and determine the type of each child node one by one; if a child node is an element node, the global node count is incremented by one and the picture node cache unit is driven to perform the recursive depth traversal again; if a child node is a text child node, its content is recognized, and the text child node is cached as a visible title node, a date node or a possible body text node according to the recognition result;
The information recording unit is used to record the node count index, the text node count index and the node access path of each element to be collected in the DOM tree.
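The chain of checks performed by the first four units can be sketched as a single classification function. The keyword lists come from the text; the function name, the returned labels and the use of case-insensitive substring matching on the id and class attributes are assumptions.

```python
from typing import Dict, Optional

# Checked in this order: picture block, author block, body text block.
BLOCK_KEYWORDS = {
    "picture": ("image", "photo", "gallery"),
    "author": ("author", "writtenby", "byline"),
    "body": ("article", "post", "main", "content"),
}


def classify_block(tag: str, attrs: Dict[str, str]) -> Optional[str]:
    """Return a label for the element, or None if no check matches."""
    if tag.lower() == "img":
        return "picture_node"
    haystack = (attrs.get("id", "") + " " + attrs.get("class", "")).lower()
    for kind, words in BLOCK_KEYWORDS.items():
        if any(word in haystack for word in words):
            return kind + "_block"
    return None
```

Substring matching is deliberately loose and would, for instance, also match a class named "poster"; a stricter variant could compare whole tokens of the class attribute instead.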
In another preferred embodiment, the extraction module includes a body text extraction module, a publication time extraction module and a body image extraction module. The body text extraction module specifically includes:
A first sorting unit, used to sort all possible body text nodes in ascending order of node count index;
A first target node generation unit, used to find, among all possible body text nodes, the first target node, which is the first node whose node count index is greater than that of the visible title node and whose sentence count is greater than 0 or whose content words are correlated with the content words of the visible title node, and to denote the first target node as the p1 node;
A loop unit, used to search backward, starting from the p1 node, for a second target node whose node count index differs from that of the p1 node by less than 3 and whose access path is similar, and to replace p1 with it, until no new second target node can be found;
A grouping unit, used to remove all possible body text nodes before the p1 node and to group the remaining possible body text nodes by node access path, where each group is sorted internally in ascending order of node count index and the groups are sorted in ascending order of the node count index of the first node of each group;
A scoring unit, used to calculate the preset parameter values of each group and feed them into a pre-trained prediction model for scoring, producing the target groups whose score is greater than a preset score;
A first extraction unit, used to sort the nodes of all target groups in ascending order of node count index to form the body text node set, and to cache the body text node set.
In a preferred embodiment, the publication time extraction module specifically includes:
A cleanup unit, used to remove the invalid nodes from all date nodes, an invalid node being a node whose node count index comes after that of the first of the first target nodes;
A second extraction unit, used to obtain, from the date nodes remaining after the removal, the target date node nearest to the visible title node, where the node count index difference of the target date node is lower than a first preset value and its text node count index difference is lower than a second preset value.
Preferably, the body image extraction module specifically includes:
A second sorting unit, used to sort the cached picture nodes in ascending order of node count index;
A second target node generation unit, used to obtain the target picture node and to remove the target picture node and all picture nodes after it, the target picture node being the picture node that is nearest to the last first target node and whose node count index difference is greater than a third preset value;
An interpolated picture node generation unit, used to obtain the picture nodes whose node count index lies between the body text nodes and the visible title node and denote them as interpolated picture nodes, to also denote as interpolated picture nodes the picture nodes that are located before the visible title node and whose node distance from the visible title node is lower than a fourth preset value, to add them to the interpolated picture node set, and at the same time to cache the non-interpolated picture nodes;
A third sorting unit, used to obtain the node count index distance between each interpolated picture node and the visible title node, and to sort all interpolated picture nodes in ascending order of distance;
A pre-screening unit, used to pre-screen all interpolated picture nodes according to the preset screening rules to filter out invalid pictures unrelated to the body text;
A third extraction unit, used to obtain the node access paths of the interpolated picture nodes remaining after pre-screening, to find the nodes with the same node access path in the interpolated picture node set of the interpolated picture node generation unit, then to drive the third sorting unit and the pre-screening unit again, and to integrate the re-screened interpolated picture nodes with the non-interpolated picture nodes.
The reader should understand that, in the description of this specification, reference to the terms "one embodiment", "some embodiments", "example", "specific example" or "some examples" means that the specific features, structures, materials or characteristics described in connection with that embodiment or example are included in at least one embodiment or example of the present invention. In this specification, schematic uses of the above terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, provided they do not conflict with one another, those skilled in the art may combine the different embodiments or examples described in this specification and the features of the different embodiments or examples.
It is apparent to those skilled in the art that, for convenience and brevity of description, the specific working processes of the devices and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed devices and methods may be implemented in other ways. For example, the device embodiments described above are merely illustrative; the division into units is only a division by logical function, and other divisions are possible in an actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments of the present invention.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to execute all or part of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
The above are only specific embodiments of the present invention, but the protection scope of the present invention is not limited to them. Any person skilled in the art can readily conceive of various equivalent modifications or substitutions within the technical scope disclosed by the present invention, and these modifications or substitutions shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
The foregoing are merely preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.
Claims (10)
1. A general web page subject content extraction method, characterized by including the following steps:
Step 1, build the DOM tree of the target web page, clean up the nodes of the DOM tree, and mark the attributes of the remaining nodes of the DOM tree according to their relevance to the body content;
Step 2, traverse the attribute-marked DOM tree, and classify and cache the remaining nodes of the DOM tree as picture nodes, date nodes, body text nodes or visible title nodes;
Step 3, determine, according to the distances of the picture nodes, the date nodes and the body text nodes from the visible title node, whether the content of the picture nodes, the content of the date nodes and the content of the body text nodes is subject content, and complete the extraction of the subject content of the target web page according to the determination result; the subject content includes body images, publication time and body text.
2. The general web page subject content extraction method according to claim 1, characterized in that step 1 specifically includes the following steps:
S101, download the source code of the target web page and parse the source code into a DOM tree;
S102, obtain and cache the content of the title tag node in the DOM tree, and perform Chinese word segmentation and stop word removal on the content of the title tag node to generate a title word set including several title words;
S103, traverse the DOM tree depth-first; after cleaning up the nodes of preset types in the DOM tree, determine whether the id attribute, class attribute and/or style attribute of each remaining node meets a first preset condition, and mark each remaining node, according to the determination result, as an element determined to be unrelated to the body text, an element possibly unrelated to the body text, or another element.
3. The general web page subject content extraction method according to claim 2, characterized in that step 2 specifically includes the following steps:
S201, select the body element of the DOM tree as the start node for a depth-first recursive traversal, and generate the node access path corresponding to each remaining element in the DOM tree;
S202, according to the attribute mark information of the remaining elements in the DOM tree, take both the elements possibly unrelated to the body text and the other elements as elements to be collected, collect information from the elements to be collected, and classify and cache them as picture nodes, author nodes, date nodes, body text nodes or visible title nodes.
4. The general web page subject content extraction method according to claim 3, characterized in that, in step S202, collecting information from the elements to be collected and classifying and caching them specifically includes the following steps:
Step a, determine whether the element tag of the element to be collected is an img tag; if so, collect and cache the element to be collected as a picture node; if not, execute step b;
Step b, determine whether the id attribute or class attribute of the element to be collected contains an image, photo or gallery label; if not, execute step c; if so, determine that the element to be collected is a definite picture information block node, and set a global flag indicating that the DOM tree traversal has entered a picture information collection block; when the child nodes of the element to be collected are traversed, determine whether each child node is a picture node; if so, collect and cache the child node as a picture node; if not, go on to examine the next element to be collected;
Step c, determine whether the id attribute or class attribute of the element to be collected contains an author, writtenby or byline label; if not, execute step d; if so, determine that the element to be collected is a definite author information block node, and set a global flag indicating that the DOM tree traversal has entered an author information collection block; when the child nodes of the element to be collected are traversed, determine whether each child node is an author node; if so, collect and cache the child node as an author node; if not, go on to examine the next element to be collected;
Step d, determine whether the id attribute or class attribute of the element to be collected contains an article, post, main or content label; if not, execute step e; if so, determine that the element to be collected is a definite body text information block node, and set a global flag indicating that the DOM tree traversal has entered a body text information collection block; at the same time, if so far only non-definite body text information blocks, and no definite body text information block, have been collected globally, empty the non-definite body text information blocks collected so far;
Step e, determine whether the element to be collected has child elements; if it has, determine whether the child elements can be merged and replaced; if they can, merge the content of all child elements, replace the content of the element to be collected with the result, and execute step f; if they cannot, execute step f directly;
Step f, traverse all child nodes of the element to be collected and process them one by one as follows: determine the type of the child node; if the child node is an element node, increment the global node count by one and return to step a to perform the recursive depth traversal again; if the child node is a text child node, recognize its content and cache the text child node as a visible title node, a date node or a possible body text node according to the recognition result;
During the above depth-first recursive traversal, the node count index, the text node count index and the node access path of each element to be collected in the DOM tree are recorded.
5. The general web page subject content extraction method according to claim 4, characterized in that extracting the body text from the cached body text nodes in step 3 specifically includes the following steps:
Sort all possible body text nodes in ascending order of node count index;
Among all possible body text nodes, find the first target node, which is the first node whose node count index is greater than that of the visible title node and whose sentence count is greater than 0 or whose content words are correlated with the content words of the visible title node, and denote the first target node as the p1 node;
Starting from the p1 node, search backward for a second target node whose node count index differs from that of the p1 node by less than 3 and whose access path is similar, and replace p1 with it; repeat this step until no new second target node can be found;
Remove all possible body text nodes before the p1 node, and group the remaining possible body text nodes by node access path; within each group, sort in ascending order of node count index, and sort the groups in ascending order of the node count index of the first node of each group;
Calculate the preset parameter values of each group, and feed them into a pre-trained prediction model for scoring, producing the target groups whose score is greater than a preset score;
Sort the nodes of all target groups in ascending order of node count index to form the body text node set;
Cache the body text node set.
6. The general web page subject content extraction method according to claim 5, characterized in that extracting the publication time from the cached date nodes in step 3 specifically comprises the following steps:
Removing the invalid nodes from all date nodes, an invalid node being a date node whose node count serial number places it after the first of the first target nodes;
Obtaining, from the date nodes remaining after the removal, the target date node closest to the visual title node, wherein the node count serial number difference of the target date node is lower than a first preset value and its text node count serial number difference is lower than a second preset value.
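A compact sketch of the date selection in claim 6 is given below. It assumes that both differences are measured against the visual title node, which the claim wording leaves open; `DateNode`, `pick_date` and the parameter names `max_gap` and `max_text_gap` (standing in for the first and second preset values) are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class DateNode:
    seq: int        # node count serial number
    text_seq: int   # text node count serial number
    value: str      # raw date string found in the node


def pick_date(dates: Iterable[DateNode], title_seq: int, title_text_seq: int,
              p1_seq: int, max_gap: int, max_text_gap: int) -> Optional[DateNode]:
    # Date nodes located after the first target node (p1) are treated as invalid.
    valid = [d for d in dates if d.seq <= p1_seq]
    # Keep only nodes within both preset distances of the visual title (assumed reference).
    close = [d for d in valid
             if abs(d.seq - title_seq) < max_gap
             and abs(d.text_seq - title_text_seq) < max_text_gap]
    # The nearest surviving node is the publication-time candidate.
    return min(close, key=lambda d: abs(d.seq - title_seq), default=None)
```

Returning `None` when nothing qualifies lets the caller decide whether to fall back to another source for the publication time.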
7. The general web page subject content extraction method according to claim 6, characterized in that extracting the body-text pictures from the cached picture nodes in step 3 specifically comprises the following steps:
Step 001, sorting the cached picture nodes in ascending order of node count serial number;
Step 002, obtaining a target picture node, and removing the target picture node together with all picture nodes after it, the target picture node being the picture node that is closest to the last first target node and whose node count serial number difference from it is greater than a third preset value;
Step 003, obtaining the picture nodes whose node count serial numbers lie between the body text nodes and the visual title node and marking them as inserted-picture nodes; then also marking as inserted-picture nodes the picture nodes that are located before the visual title node and whose node distance from the visual title node is lower than a fourth preset value, merging them into the inserted-picture node set, and caching the non-inserted picture nodes;
Step 004, obtaining the distance between the node count serial number of each inserted-picture node and that of the visual title node, and sorting all inserted-picture nodes in ascending order of distance;
Step 005, pre-screening all inserted-picture nodes according to preset screening rules to filter out invalid pictures unrelated to the body text;
Step 006, obtaining the node visit paths of the inserted-picture nodes remaining after pre-screening, finding the nodes with the same node visit path in the inserted-picture node set of step 003, repeating step 004 and step 005, and integrating the re-screened inserted-picture nodes with the non-inserted picture nodes.
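Steps 001 to 004 can be sketched as below, working only on node count serial numbers. Steps 005 and 006 are left out because the claim does not state the screening rules. The function name `split_pictures`, the reading of "between the body text nodes and the visual title node" as "after the title and before the last body text node", and the parameters `gap3` and `gap4` (standing in for the third and fourth preset values) are assumptions made for this example.

```python
from typing import Iterable, List, Tuple


def split_pictures(pic_seqs: Iterable[int], title_seq: int, last_text_seq: int,
                   gap3: int, gap4: int) -> Tuple[List[int], List[int]]:
    # Step 001: ascending order of node count serial number.
    ordered = sorted(pic_seqs)

    # Step 002: the nearest picture lying more than gap3 past the last body
    # text node is the cut-off; it and everything after it are dropped.
    cutoff = next((s for s in ordered if s - last_text_seq > gap3), None)
    if cutoff is not None:
        ordered = [s for s in ordered if s < cutoff]

    # Step 003: separate inserted pictures from the remaining pictures.
    inserted: List[int] = []
    other: List[int] = []
    for s in ordered:
        in_body = title_seq < s < last_text_seq
        just_before_title = s < title_seq and title_seq - s < gap4
        (inserted if in_body or just_before_title else other).append(s)

    # Step 004: order inserted pictures by distance from the visual title.
    inserted.sort(key=lambda s: abs(s - title_seq))
    return inserted, other
```

A real implementation would carry full picture nodes rather than bare serial numbers so that the visit-path comparison in step 006 remains possible.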
8. A general web page subject content extraction system, characterized by comprising a DOM tree processing module, a cache module and an extraction module;
The DOM tree processing module is used to construct the DOM tree of a target web page, clean up the nodes of the DOM tree, and attribute-mark the remaining nodes of the DOM tree according to their correlation with the body content;
The cache module is used to traverse the attribute-marked DOM tree and to classify and cache the remaining nodes of the DOM tree as picture nodes, date nodes, body text nodes or visual title nodes;
The extraction module is used to judge, according to the respective distances of the picture nodes, the date nodes and the body text nodes from the visual title node, whether the content of the picture nodes, the content of the date nodes or the content of the body text nodes is subject content, and to complete the extraction of the subject content of the target web page according to the judgment result, the subject content including body-text pictures, publication time and body text.
9. The general web page subject content extraction system according to claim 8, characterized in that the DOM tree processing module comprises:
A parsing unit, used to download the source code of the target web page and parse the source code into a DOM tree;
A title word generation unit, used to obtain and cache the content of the title tag node in the DOM tree, and to perform Chinese word segmentation and stop-word removal on the content of the title tag node to generate a title word set comprising several title words;
A marking unit, used to traverse the DOM tree depth-first, clean up the nodes of preset types in the DOM tree, then judge whether the id attribute, class attribute and/or style attribute of each remaining node meets a first preset condition, and according to the judgment result attribute-mark each remaining node as an element determined to be unrelated to the body text, an element possibly unrelated to the body text, or another element.
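The marking unit can be illustrated with the sketch below. The claim does not disclose the preset node types or the first preset condition, so the tag set `REMOVED_TAGS` and the three regular expressions are purely hypothetical stand-ins, as are the names `Label` and `label_node`.

```python
import re
from enum import Enum
from typing import Dict, Optional


class Label(Enum):
    UNRELATED = "determined to be unrelated to the body text"
    MAYBE_UNRELATED = "possibly unrelated to the body text"
    OTHER = "other element"


# Assumed examples only; the patent does not list the actual values.
REMOVED_TAGS = {"script", "style", "noscript", "iframe"}
DEFINITE = re.compile(r"footer|copyright|advert|banner|nav", re.I)
POSSIBLE = re.compile(r"sidebar|comment|related|share", re.I)
HIDDEN = re.compile(r"display\s*:\s*none", re.I)


def label_node(tag: str, attrs: Dict[str, str]) -> Optional[Label]:
    """Return None for nodes that are cleaned up, otherwise an attribute mark."""
    if tag.lower() in REMOVED_TAGS:
        return None
    ident = " ".join(attrs.get(key, "") for key in ("id", "class"))
    if HIDDEN.search(attrs.get("style", "")) or DEFINITE.search(ident):
        return Label.UNRELATED
    if POSSIBLE.search(ident):
        return Label.MAYBE_UNRELATED
    return Label.OTHER
```

Keeping a separate "possibly unrelated" mark defers the final decision on borderline elements to the later distance-based steps instead of discarding them during marking.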
10. The general web page subject content extraction system according to claim 9, characterized in that the cache module comprises:
A path generation unit, used to select the body element of the DOM tree as the starting node of a depth-first recursive traversal and to generate the node visit path corresponding to each remaining element in the DOM tree;
A cache unit, used to take, according to the attribute mark information of the remaining elements in the DOM tree, the elements possibly unrelated to the body text and the other elements as to-be-collected elements, to collect information from the to-be-collected elements, and to classify and cache them as picture nodes, author nodes, date nodes, body text nodes or visual title nodes.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810572726.0A CN108920434B (en) | 2018-06-06 | 2018-06-06 | Universal webpage theme content extraction method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810572726.0A CN108920434B (en) | 2018-06-06 | 2018-06-06 | Universal webpage theme content extraction method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108920434A (en) | 2018-11-30 |
CN108920434B (en) | 2022-08-30 |
Family
ID=64419788
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810572726.0A Active CN108920434B (en) | 2018-06-06 | 2018-06-06 | Universal webpage theme content extraction method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108920434B (en) |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109657180A (en) * | 2018-12-11 | 2019-04-19 | 中科国力(镇江)智能技术有限公司 | It is a kind of intelligence web page contents automatically obscure extraction system |
CN109815326A (en) * | 2019-01-24 | 2019-05-28 | 网易(杭州)网络有限公司 | Dialog control method and device |
CN110309474A (en) * | 2019-06-05 | 2019-10-08 | 上海易点时空网络有限公司 | Document off-line system and method based on Electron |
CN111241446A (en) * | 2020-01-13 | 2020-06-05 | 杭州安恒信息技术股份有限公司 | Method, device, equipment and medium for extracting text content of web page |
CN111460259A (en) * | 2020-03-31 | 2020-07-28 | 腾讯科技(深圳)有限公司 | Method and device for determining similar elements, computer equipment and storage medium |
CN112667940A (en) * | 2020-10-15 | 2021-04-16 | 广东电子工业研究院有限公司 | Webpage text extraction method based on deep learning |
CN112765941A (en) * | 2021-01-21 | 2021-05-07 | 语联网(武汉)信息技术有限公司 | Method and system for automatically extracting webpage text |
CN112765940A (en) * | 2021-01-20 | 2021-05-07 | 南京万得资讯科技有限公司 | Novel webpage duplicate removal method based on subject characteristics and content semantics |
CN113204723A (en) * | 2021-04-12 | 2021-08-03 | 仲恺农业工程学院 | Page background matching method and device based on page theme |
CN113392354A (en) * | 2021-06-28 | 2021-09-14 | 山东亿云信息技术有限公司 | Webpage text analysis method, system, medium and electronic equipment |
CN113626737A (en) * | 2021-10-12 | 2021-11-09 | 北京天际友盟信息技术有限公司 | Method and device for identifying main body link, electronic equipment and storage medium |
CN113807050A (en) * | 2021-07-01 | 2021-12-17 | 西安华讯科技有限责任公司 | Node interception method, system, equipment and storage medium based on rich text |
CN114528811A (en) * | 2022-01-21 | 2022-05-24 | 北京麦克斯泰科技有限公司 | Article content extraction method, device, equipment and storage medium |
CN114610580A (en) * | 2022-03-17 | 2022-06-10 | 北京火山引擎科技有限公司 | Page white screen monitoring method, device, equipment and medium |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102270206A (en) * | 2010-06-03 | 2011-12-07 | 北京迅捷英翔网络科技有限公司 | Method and device for capturing valid web page contents |
US20130014002A1 (en) * | 2011-06-15 | 2013-01-10 | Alibaba Group Holding Limited | Method and System of Extracting Web Page Information |
CN105426388A (en) * | 2015-10-23 | 2016-03-23 | 青岛恒波仪器有限公司 | Apparatus for extracting and comparing webpage text |
CN105574066A (en) * | 2015-10-23 | 2016-05-11 | 青岛恒波仪器有限公司 | Web page text extraction and comparison method and system thereof |
CN105630941A (en) * | 2015-12-23 | 2016-06-01 | 成都电科心通捷信科技有限公司 | Statistics and webpage structure based Wen body text content extraction method |
CN106528583A (en) * | 2015-11-14 | 2017-03-22 | 孙燕群 | Method for extracting and comparing web page main body |
CN107391678A (en) * | 2017-07-21 | 2017-11-24 | 福州大学 | Web page content information extracting method based on cluster |
CN107590219A (en) * | 2017-09-04 | 2018-01-16 | 电子科技大学 | Webpage personage subject correlation message extracting method |
Cited By (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109657180B (en) * | 2018-12-11 | 2021-11-26 | 中科国力(镇江)智能技术有限公司 | Intelligent automatic fuzzy extraction system for webpage content |
CN109657180A (en) * | 2018-12-11 | 2019-04-19 | 中科国力(镇江)智能技术有限公司 | It is a kind of intelligence web page contents automatically obscure extraction system |
CN109815326B (en) * | 2019-01-24 | 2021-09-10 | 网易(杭州)网络有限公司 | Conversation control method and device |
CN109815326A (en) * | 2019-01-24 | 2019-05-28 | 网易(杭州)网络有限公司 | Dialog control method and device |
CN110309474A (en) * | 2019-06-05 | 2019-10-08 | 上海易点时空网络有限公司 | Document off-line system and method based on Electron |
CN111241446A (en) * | 2020-01-13 | 2020-06-05 | 杭州安恒信息技术股份有限公司 | Method, device, equipment and medium for extracting text content of web page |
CN111241446B (en) * | 2020-01-13 | 2023-10-31 | 杭州安恒信息技术股份有限公司 | Method, device, equipment and medium for extracting text content of web page |
CN111460259A (en) * | 2020-03-31 | 2020-07-28 | 腾讯科技(深圳)有限公司 | Method and device for determining similar elements, computer equipment and storage medium |
CN111460259B (en) * | 2020-03-31 | 2023-04-14 | 腾讯科技(深圳)有限公司 | Method and device for determining similar elements, computer equipment and storage medium |
CN112667940A (en) * | 2020-10-15 | 2021-04-16 | 广东电子工业研究院有限公司 | Webpage text extraction method based on deep learning |
CN112667940B (en) * | 2020-10-15 | 2022-02-18 | 广东电子工业研究院有限公司 | Webpage text extraction method based on deep learning |
CN112765940A (en) * | 2021-01-20 | 2021-05-07 | 南京万得资讯科技有限公司 | Novel webpage duplicate removal method based on subject characteristics and content semantics |
CN112765940B (en) * | 2021-01-20 | 2024-04-19 | 南京万得资讯科技有限公司 | Webpage deduplication method based on theme features and content semantics |
CN112765941A (en) * | 2021-01-21 | 2021-05-07 | 语联网(武汉)信息技术有限公司 | Method and system for automatically extracting webpage text |
CN113204723A (en) * | 2021-04-12 | 2021-08-03 | 仲恺农业工程学院 | Page background matching method and device based on page theme |
CN113392354A (en) * | 2021-06-28 | 2021-09-14 | 山东亿云信息技术有限公司 | Webpage text analysis method, system, medium and electronic equipment |
CN113807050A (en) * | 2021-07-01 | 2021-12-17 | 西安华讯科技有限责任公司 | Node interception method, system, equipment and storage medium based on rich text |
CN113807050B (en) * | 2021-07-01 | 2024-04-09 | 西安华讯科技有限责任公司 | Node interception method, system, equipment and storage medium based on rich text |
CN113626737B (en) * | 2021-10-12 | 2022-03-11 | 北京天际友盟信息技术有限公司 | Method and device for identifying main body link, electronic equipment and storage medium |
CN113626737A (en) * | 2021-10-12 | 2021-11-09 | 北京天际友盟信息技术有限公司 | Method and device for identifying main body link, electronic equipment and storage medium |
CN114528811A (en) * | 2022-01-21 | 2022-05-24 | 北京麦克斯泰科技有限公司 | Article content extraction method, device, equipment and storage medium |
CN114528811B (en) * | 2022-01-21 | 2022-09-02 | 北京麦克斯泰科技有限公司 | Article content extraction method, device, equipment and storage medium |
CN114610580A (en) * | 2022-03-17 | 2022-06-10 | 北京火山引擎科技有限公司 | Page white screen monitoring method, device, equipment and medium |
Also Published As
Publication number | Publication date |
---|---|
CN108920434B (en) | 2022-08-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108920434A (en) | A kind of general Web page subject method for extracting content and system | |
CN102156737B (en) | Method for extracting subject content of Chinese webpage | |
CN103544255B (en) | Text semantic relativity based network public opinion information analysis method | |
CN103823824B (en) | A kind of method and system that text classification corpus is built automatically by the Internet | |
CN103226578B (en) | Towards the website identification of medical domain and the method for webpage disaggregated classification | |
Das et al. | Text mining and topic modeling of compendiums of papers from transportation research board annual meetings | |
CN104199833B (en) | The clustering method and clustering apparatus of a kind of network search words | |
CN107590219A (en) | Webpage personage subject correlation message extracting method | |
CN105930469A (en) | Hadoop-based individualized tourism recommendation system and method | |
TWI695277B (en) | Automatic website data collection method | |
CN102119383A (en) | Method and subsystem for information acquisition and aggregation to facilitate ontology and language-model generation within a content-search-service system | |
CN103488746B (en) | Method and device for acquiring business information | |
CN103646078B (en) | Method and device for realizing internet propaganda monitoring target evaluations | |
CN104268148A (en) | Forum page information auto-extraction method and system based on time strings | |
CN102591992A (en) | Webpage classification identifying system and method based on vertical search and focused crawler technology | |
CN105095175B (en) | Obtain the method and device of truncated web page title | |
CN101630330A (en) | Method for webpage classification | |
CN105550359B (en) | Webpage sorting method and device based on vertical search and server | |
CN104331438B (en) | To novel web page contents selectivity abstracting method and device | |
CN108874870A (en) | A kind of data pick-up method, equipment and computer can storage mediums | |
CN110083760B (en) | Multi-recording dynamic webpage information extraction method based on visual block | |
CN105528421A (en) | Search dimension excavation method of query terms in mass data | |
CN108694192A (en) | The judgment method and device of type of webpage | |
CN100357942C (en) | Mobile internet intelligent information retrieval engine based on key-word retrieval | |
EP1910918A2 (en) | Method and system for automatically extracting data from web sites |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||