CN108920434B - Universal webpage theme content extraction method and system - Google Patents

Publication number
CN108920434B
Authority
CN
China
Prior art keywords
node
nodes
text
picture
dom tree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810572726.0A
Other languages
Chinese (zh)
Other versions
CN108920434A (en
Inventor
钟刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Kuquan Data Technology Co ltd
Original Assignee
Wuhan Kuquan Data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/14Tree-structured documents
    • G06F40/146Coding or compression of tree-structured data
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention particularly relates to a universal webpage subject content extracting method and a universal webpage subject content extracting system, wherein the method comprises the following steps: constructing a DOM tree of a target webpage, cleaning nodes of the DOM tree, and performing attribute marking on the rest nodes according to the correlation with the text content; traversing the DOM tree, and classifying and caching the rest nodes of the DOM tree; and judging whether the content of the node is the subject content according to the distance between the node in each category and the visible title node, and finishing the extraction of the subject content of the target webpage according to the judgment result. The invention provides a more optimized semantic-based webpage information extraction method, which is characterized in that based on strong association relation existing on a page structure, text visual title nodes of a DOM tree are identified and other nodes are classified and cached, and then the distances between other category nodes and the text visual title nodes in the DOM tree are used as important basis for judging whether the nodes belong to subject contents, so that the precision and the efficiency of webpage information extraction are improved.

Description

Universal webpage theme content extraction method and system
Technical Field
The invention relates to the technical field of computer software, in particular to a universal webpage theme content extraction method and a universal webpage theme content extraction system.
Background
In the internet era today, most of the information disclosed and visible in the network is presented in the form of subject matter, such as blog articles in blogs, news information of web portals, etc. The subject contents are important channels for most Internet users to obtain information, are massive basic corpora of academic researchers, and have important value in the field of natural language processing. However, for many reasons, the subject content web pages on the network are not composed of pure subject content, and include information that is not directly related to the subject content, such as advertisements, comments, related recommendations, and website navigation. How to extract the subject content of the web page from the complicated web page information becomes a problem to be solved.
Currently, the existing topic content extraction methods are generally divided into two types: one is a semantic-based webpage information extraction method, and the other is a visual-based webpage blocking method. Both of the above approaches attempt to extract the information block where the true subject matter is located from the web page structure.
The semantic-based webpage information extraction generally has two modes, the first mode is to analyze the information based on the whole website, try to find out the repeated modules, such as a navigation bar and the like, among different webpages, and then remove the repeated modules when a certain webpage is specifically analyzed to find out the subject content; the second way is to simply rely on the currently analyzed web page itself to try to find some nodes of block-level elements in the HTML, and then analyze the text information of the node contents, such as the text length, to obtain the block-level element with the longest text length by comparison.
The visual-based webpage blocking method attempts to render the whole page through a browser engine, then divides the rendered page into blocks based on the background colors, fonts, borders and other properties of the page elements, merging closely related elements and treating loosely related elements as separate blocks, thereby completing a visual-based block reconstruction of the whole page. Its drawback is that, in addition to analyzing the DOM tree built from the webpage source code, it must load the CSS (cascading style sheet) files and other resources the page depends on and render the page with a browser engine, which makes the analysis of massive data relatively slow.
Disclosure of Invention
The invention provides a universal webpage theme content extraction method and a universal webpage theme content extraction system, which solve the technical problems of low accuracy and efficiency of webpage theme content extraction in the prior art.
The technical scheme for solving the technical problems is as follows: a general webpage subject content extraction method comprises the following steps:
step 1, constructing a DOM tree of a target webpage, cleaning nodes of the DOM tree, and marking attributes of the rest nodes of the DOM tree according to the relevance of the nodes and the text content;
step 2, traversing the DOM tree after attribute marking, and classifying and caching the rest nodes of the DOM tree into picture nodes, date nodes, text nodes or visual title nodes;
step 3, judging whether the content of the picture node, the content of the date node and the content of the text node are subject contents according to the distances between the picture node, the date node and the text node and the visual title node respectively, and finishing the extraction of the subject contents of the target webpage according to the judgment result, wherein the subject contents comprise a text picture, release time and a text.
The invention has the beneficial effects that: the invention provides a more optimized semantic-based webpage information extraction method, which is characterized in that based on strong association relation existing on a page structure, text visual title nodes of a DOM tree are identified and other nodes are classified and cached, and then the distances between other category nodes and the text visual title nodes in the DOM tree are used as important basis for judging whether the nodes belong to subject contents, so that the precision and the efficiency of webpage information extraction are improved.
On the basis of the technical scheme, the invention can be further improved as follows.
Further, the step 1 specifically includes the following steps:
s101, downloading a source code of a target webpage, and analyzing the source code into a DOM tree;
s102, acquiring and caching the content of title label nodes in the DOM tree, and simultaneously performing Chinese word segmentation and stop word removal on the content of the title label nodes to generate a title word set comprising a plurality of title words;
s103, traversing the DOM tree in a depth-first mode, after cleaning nodes of preset types in the DOM tree, judging whether the id attribute, the class attribute and/or the style attribute of the remaining nodes meet a first preset condition, and performing attribute marking on the remaining nodes according to a judgment result to determine elements irrelevant to the text, elements possibly irrelevant to the text and other elements.
Further, the step 2 specifically includes the following steps:
s201, selecting a body element of the DOM tree as an initial node for performing depth-first recursive traversal, and generating a node access path corresponding to each remaining element in the DOM tree;
s202, according to attribute marking information of the rest elements in the DOM tree, taking the elements which are possibly irrelevant to the text and other elements as the elements to be collected, collecting the information of the elements to be collected, and classifying and caching the elements to be collected into picture nodes, author nodes, date nodes, text nodes or visual title nodes.
Further, in step S202, the information collection and classified caching of the to-be-collected elements specifically includes the following steps:
step a, judging whether an element tag of the element to be collected is an img tag, if so, collecting and caching the element to be collected as a picture node, and if not, executing the step b;
step b, judging whether the id attribute or the class attribute of the element to be collected contains an image, photo or gallery tag; if not, executing step c; if so, determining that the element to be collected is a confirmed picture information block node and globally marking the DOM tree traversal as having entered a picture information collection block, so that when the child nodes of the element to be collected are traversed, each child node is only judged as to whether it is a picture node; if it is, collecting and caching the child node as a picture node, and if not, continuing with the next element to be collected;
step c, judging whether the id attribute or the class attribute of the element to be collected contains an author tag, a writenby tag or a byline tag; if not, executing step d; if so, determining that the element to be collected is a confirmed author information block node and globally marking the DOM tree traversal as having entered an author information collection block, so that when the child nodes of the element to be collected are traversed, each child node is only judged as to whether it is an author node; if it is, collecting and caching the child node as an author node, and if not, continuing with the next element to be collected;
step d, judging whether the id attribute or the class attribute of the element to be collected contains an article, post, main or content tag; if not, executing step e; if so, determining that the element to be collected is a confirmed text information block node and globally marking the DOM tree traversal as having entered a text information collection block; in addition, if no confirmed text information block has been collected globally so far and only unconfirmed text information blocks have been collected, emptying the unconfirmed text information blocks collected so far;
step e, judging whether the element to be collected has a sub-element, if so, judging whether the sub-element can be integrated and replaced, if so, replacing the integrated contents of all the sub-elements with the contents of the element to be collected, and executing the step f, otherwise, directly executing the step f;
step f, traversing all child nodes of the elements to be collected and processing one by one, wherein the processing method comprises the following steps: judging the type of the child node, if the child node is an element node, adding one to the global node count, returning to the step a to perform recursive deep traversal again, if the child node is a text child node, identifying the content of the text child node, and caching the text child node as a visible title node, a date node or a possible text node according to an identification result;
and recording the node counting sequence number, the text node counting sequence number and the node access path of the element to be collected in the DOM tree in the process of performing the depth-first recursion traversal.
Further, the step 3 of extracting the body according to the cached body text node specifically includes the following steps:
sequencing all possible text nodes in an ascending order according to the node counting sequence number;
finding the first target node among all possible text nodes, the first target node being the first node whose node count sequence number is greater than that of the visible title node and whose sentence count is greater than 0 or whose content words are related to the content words of the visible title node, and marking the first target node as the p1 node;
taking the p1 node as a starting point, searching forward and backward for a second target node whose node count sequence number differs from that of the p1 node by less than 3 and which is similar to the p1 node, taking the second target node as the new p1 node, and repeating this step until no new second target node can be found;
cleaning all possible text nodes before the p1 node, grouping all the remaining possible text nodes according to the node access path, sequencing the interior of each group in an ascending order according to the node counting sequence number, and sequencing the groups in an ascending order according to the node counting sequence number of the first node of each group;
calculating a preset parameter value of each group, importing the preset parameter value into a pre-trained prediction model for scoring, and generating a target group with the score larger than a preset score;
sequencing the nodes in all the target groups in an ascending order according to the node counting sequence number, and forming a text node set;
caching the text node set.
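The first and fourth of the steps above (locating p1, then grouping the remaining candidates by access path) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the dictionary keys `no`, `path` and `text`, the punctuation-based sentence test and the `\w+` tokenizer (a stand-in for Chinese word segmentation) are all hypothetical. The p1 expansion step and the pre-trained scoring model are omitted because the text does not specify them.

```python
import re

# Assumption: a sentence is detected by the presence of a terminator character.
SENTENCE_END = re.compile(r"[.!?。！？]")


def find_p1(nodes, title_no, title_words):
    """Return the first candidate after the visible title that looks like body text."""
    for node in sorted(nodes, key=lambda n: n["no"]):
        if node["no"] <= title_no:
            continue  # body text is expected after the visible title
        has_sentence = len(SENTENCE_END.findall(node["text"])) > 0
        words = set(re.findall(r"\w+", node["text"].lower()))
        if has_sentence or words & title_words:
            return node
    return None


def group_after_p1(nodes, p1):
    """Drop candidates before p1, then group the rest by node access path."""
    groups = {}
    for node in sorted(nodes, key=lambda n: n["no"]):
        if node["no"] >= p1["no"]:
            groups.setdefault(node["path"], []).append(node)
    # Groups are ordered by the count number of their first node.
    return sorted(groups.values(), key=lambda g: g[0]["no"])


nodes = [
    {"no": 2, "path": "body.div.p", "text": "Menu"},
    {"no": 7, "path": "body.div.div.p", "text": "First paragraph."},
    {"no": 8, "path": "body.div.div.p", "text": "Second paragraph."},
    {"no": 12, "path": "body.div.span", "text": "Share"},
]
p1 = find_p1(nodes, title_no=5, title_words={"dom"})
groups = group_after_p1(nodes, p1)
```

In a fuller version each group would then be turned into feature values and scored by the prediction model mentioned above.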
Further, the step 3 of extracting the release time according to the cached date node specifically includes the following steps:
clearing invalid nodes in all date nodes, wherein the invalid nodes are nodes with node counting serial numbers behind the first target node;
and acquiring a target date node closest to the visible title node in the cleaned residual date nodes, wherein the node counting sequence number difference of the target date node is lower than a first preset value, and the text node counting sequence number difference is lower than a second preset value.
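A minimal sketch of this date selection is given below. The tuple layout `(node_no, text_no, text)`, the function name and the default gaps of 10 and 5 are assumptions made for illustration; the text only calls them the first and second preset values. Closeness to the title is measured here by the node count sequence number, which is also an assumption.

```python
def pick_publish_date(date_nodes, title_no, title_text_no, p1_no,
                      max_node_gap=10, max_text_gap=5):
    """Pick the date closest to the visible title, or None if none qualifies.

    date_nodes: list of (node_no, text_no, text) tuples.
    """
    # Date nodes located after the first target node (p1) are invalid.
    valid = [d for d in date_nodes if d[0] <= p1_no]
    if not valid:
        return None
    best = min(valid, key=lambda d: abs(d[0] - title_no))
    node_gap = abs(best[0] - title_no)
    text_gap = abs(best[1] - title_text_no)
    if node_gap < max_node_gap and text_gap < max_text_gap:
        return best[2]
    return None


dates = [(5, 2, "2018-04-13"), (40, 20, "2017-01-01"), (8, 3, "2018-04-12")]
chosen = pick_publish_date(dates, title_no=4, title_text_no=1, p1_no=10)
```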
Further, the extracting of the text picture according to the cached picture node in the step 3 specifically includes the following steps:
step 001, sorting the cached picture nodes in an ascending order according to the node counting sequence number;
step 002, acquiring a target picture node, and cleaning the target picture node and other picture nodes behind the target picture node, wherein the target picture node is the picture node which is closest to the last first target node and the node counting sequence number difference value is larger than a third preset value;
step 003, obtaining a picture node with a node counting sequence number between the text node and the visual title node, marking as an interpolation picture node, then marking a picture node which is positioned in front of the visual title node and has a node distance with the visual title node lower than a fourth preset value as an interpolation picture node, merging the interpolation picture node into an interpolation picture node set, and caching a non-interpolation picture node;
step 004, obtaining the distance between each interpolation picture node and the node counting serial number of the visible title node, and sequencing all interpolation picture nodes in an ascending order according to the distance;
step 005, pre-screening all interpolation picture nodes according to a preset screening rule to filter out invalid pictures irrelevant to the text;
step 006, obtaining node access paths of the remaining interpolation picture nodes after the pre-screening, finding out nodes with the same node access paths in the interpolation picture node set in the step 003, and then repeating the step 004 and the step 005 to integrate the interpolation picture nodes and the non-interpolation picture nodes which are screened again.
In order to solve the technical problem of the invention, the invention also provides a general webpage theme content extraction system, which comprises a DOM tree processing module, a cache module and an extraction module,
the DOM tree processing module is used for constructing a DOM tree of a target webpage, cleaning nodes of the DOM tree and marking attributes of the rest nodes of the DOM tree according to the relevance of the nodes and the text content;
the cache module is used for traversing the DOM tree after attribute marking, and classifying and caching the rest nodes of the DOM tree into picture nodes, date nodes, text nodes or visual title nodes;
the extraction module is used for judging whether the content of the picture node, the content of the date node or the content of the text node is subject content according to the distances between the picture node, the date node and the text node and the visual title node respectively, and finishing extraction of the subject content of the target webpage according to the judgment result, wherein the subject content comprises a text picture, release time and a text.
Further, the DOM tree processing module includes:
the analysis unit is used for downloading a source code of a target webpage and analyzing the source code into a DOM tree;
the title word generation unit is used for acquiring and caching the content of the title label nodes in the DOM tree, performing Chinese word segmentation and stop word removal on the content of the title label nodes, and generating a title word set comprising a plurality of title words;
and the marking unit is used for traversing the DOM tree in a depth-first mode, judging whether the id attribute, the class attribute and/or the style attribute of the residual nodes meet a first preset condition after cleaning the nodes of the preset type in the DOM tree, and performing attribute marking on the residual nodes according to the judgment result to determine elements irrelevant to the text, elements possibly irrelevant to the text and other elements.
Further, the cache module comprises:
the path generation unit is used for selecting a body element of the DOM tree as an initial node for performing depth-first recursive traversal, and generating a node access path corresponding to each residual element in the DOM tree;
and the caching unit is used for taking the elements which are possibly irrelevant to the text and other elements as the elements to be collected according to the attribute mark information of the rest elements in the DOM tree, collecting the information of the elements to be collected, and classifying and caching the elements to be collected into picture nodes, author nodes, date nodes, text nodes or visual title nodes.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
Fig. 1 is a schematic flowchart of a general method for extracting webpage theme content according to embodiment 1;
Fig. 2 is a schematic structural diagram of a general webpage theme content extraction system provided in embodiment 2.
Detailed Description
The principles and features of this invention are described below in conjunction with the following drawings, which are set forth to illustrate, but are not to be construed to limit the scope of the invention.
Fig. 1 is a schematic flowchart of a general method for extracting webpage theme content provided in embodiment 1, and as shown in fig. 1, the method includes the following steps:
step 1, constructing a DOM tree of a target webpage, and cleaning and marking the nodes of the DOM tree;
step 2, traversing the cleaned DOM tree with the attribute marked, and classifying and caching the rest nodes of the DOM tree into picture nodes, visual title nodes, date nodes or text nodes;
step 3, extracting the subject content of the target webpage from the cached information, wherein the subject content comprises a text, release time and a text picture.
The embodiment identifies the text visual title node of the DOM tree and classifies and caches other nodes based on the strong association relation existing on the page structure, and then the distance between other category nodes and the text visual title node in the DOM tree is used as an important basis for judging whether the node belongs to the subject content, so that the precision and the efficiency of extracting the webpage information are improved. Each step of the above embodiment is specifically described below.
In the above embodiment 1, the step 1 specifically includes the following steps:
S101, downloading the source code of the target webpage and parsing the source code into a DOM tree. The target webpage is usually given as a webpage link; the source code can be downloaded through this link and then parsed into a DOM tree with an open source tool. The specific parsing method is documented in the prior art and is not described in detail here.
S102, obtaining and caching the content of the title tag node in the DOM tree, performing Chinese word segmentation and stop word removal on that content, and generating a title word set comprising a plurality of title words. Specifically, a CSS selector can be used to find the title tag node in the DOM tree; its content is the title information of the target webpage. Chinese word segmentation and stop word removal are then performed on the title information to obtain the title word set, and the title words in this set are used to identify the visible title node in the text. The visible title node here refers to the node in which the title words appear, not the title tag node described above.
S103, traversing the DOM tree in a depth-first mode, after cleaning nodes of preset types in the DOM tree, judging whether the id attribute, the class attribute and/or the style attribute of the remaining nodes meet a first preset condition, and performing attribute marking on the remaining nodes according to a judgment result to determine elements irrelevant to the text, elements possibly irrelevant to the text and other elements. In this embodiment, the preset type nodes are nodes obviously unrelated to the text content, such as nodes that are neither text nodes nor element nodes, and various script nodes, such as meta, title, link nodes, and the like.
The first preset condition is as follows: if the id attribute or the class attribute of a node contains text such as banner, comment, sidebar or logo, or the style attribute of the node contains display: none, the node is judged to be an element determined to be irrelevant to the text. After the attributes of the remaining nodes have been judged, each such node is marked with the special marking attribute score rather than being removed directly, so that the subsequent node count sequence numbering is not disturbed. Because the remaining nodes have been given element attribute marks in this step, they are also referred to as remaining elements in the detailed steps below; the two terms mean the same thing.
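The first preset condition can be sketched as a small predicate over a node's attributes. This is an illustration under assumptions: the function name `mark_node`, the label strings and the plain substring match are hypothetical, and the token list contains only the four examples the text names. The condition for the "possibly irrelevant" label is not given in the text, so the sketch distinguishes only two labels.

```python
import re

# Tokens named in the text ("banner, comment, sidebar, logo and the like").
UNRELATED_TOKENS = ("banner", "comment", "sidebar", "logo")
# Matches "display:none" with optional whitespace around the colon.
HIDDEN_STYLE = re.compile(r"display\s*:\s*none", re.IGNORECASE)


def mark_node(attrs):
    """Return 'unrelated' or 'other' for a dict of HTML attributes.

    The node is only labelled, never deleted, so node counting stays stable.
    """
    id_class = (attrs.get("id", "") + " " + attrs.get("class", "")).lower()
    if any(token in id_class for token in UNRELATED_TOKENS):
        return "unrelated"
    if HIDDEN_STYLE.search(attrs.get("style", "")):
        return "unrelated"
    return "other"


labels = [
    mark_node({"class": "sidebar-left"}),
    mark_node({"style": "color: red; display: none"}),
    mark_node({"id": "article-body"}),
]
```

Substring matching is deliberately loose and can produce false positives (any class name that merely contains one of the tokens), which is one reason to label rather than delete.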
Then traversing and collecting information of the DOM tree, and specifically comprising the following steps:
S201, selecting the body element of the DOM tree as the initial node for depth-first recursive traversal, and generating a node access path for each remaining element in the DOM tree. The node access path of the body element is an empty character string; the node access path of any other remaining element is the complete path from the body element to that element, formed by joining, for each node on the path, the node name and the node's serial number under its parent node. For example, the third p element under the second div element under body has the access path body.div[2].p[3]. When a node access path is long, a loose access path may be specified instead, that is, the indexes of the last 3 levels of the element's node access path are ignored, and the loose access path replaces the node access path in the subsequent steps. For example, if the node access path of an element is body.div[2].div[1].table[1].div[2].p[1], its loose access path is body.div[2].div[1].table.div.p.
S202, according to attribute mark information of the rest elements in the DOM tree, taking the elements which are possibly irrelevant to the text and other elements as elements to be collected, collecting the information of the elements to be collected, and classifying and caching the elements to be collected into picture nodes, author nodes, visual title nodes, date nodes or text nodes, wherein the specific caching method comprises the following steps:
Step a, judging whether the element tag of the element to be collected is an img tag; if so, collecting and caching the element to be collected as a picture node, because a picture element cannot at the same time be a date, title or similar element; if not, executing step b.
Step b, judging whether the id attribute or the class attribute of the element to be collected contains an image, photo or gallery tag; if not, executing step c; if so, determining that the element to be collected is a confirmed picture information block node and globally marking the DOM tree traversal as having entered the picture information collection block. When the child nodes of the element to be collected are traversed, each child node is only judged as to whether it is a picture node, with no attempt to judge whether it is an author, date, title or similar element, which improves extraction efficiency. If the child node is a picture node, it is collected and cached as a picture node; if not, the next element to be collected is judged.
Step c, judging whether the id attribute or the class attribute of the element to be collected contains an author tag, a writenby tag or a byline tag; if not, executing step d; if so, determining that the element to be collected is a confirmed author information block node and globally marking the DOM tree traversal as having entered the author information collection block. When the child nodes of the element to be collected are traversed, each child node is only judged as to whether it is an author node, with no attempt to identify whether it is a picture, date, title or similar element, which further improves extraction efficiency. If the child node is an author node, it is collected and cached as an author node; if not, the next element to be collected is judged.
Step d, judging whether the id attribute or the class attribute of the element to be collected contains an article, post, main or content tag; if not, executing step e; if so, determining that the element to be collected is a confirmed text information block node and globally marking the DOM tree traversal as having entered the text information collection block; in addition, if no confirmed text information block has been collected globally so far and only unconfirmed text information blocks have been collected, emptying the unconfirmed text information blocks collected so far.
Step e, judging whether the element to be collected has sub-elements; if so, judging whether the sub-elements can be integrated and replaced; if they can, replacing the contents of the element to be collected with the integrated contents of all its sub-elements and executing step f; otherwise, executing step f directly.
Step f, traversing all child nodes of the element to be collected and processing them one by one, as follows: judging the type of the child node; if the child node is an element node, adding one to the global node count and returning to step a to continue the recursive depth-first traversal; if the child node is a text child node, identifying its content and caching it as a visible title node, a date node or a possible text node according to the identification result.
And recording the node counting sequence number, the text node counting sequence number and the node access path of the element to be collected in the DOM tree in the process of depth-first recursive traversal.
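The traversal in S201 and steps a to f can be sketched as below. It is a simplified illustration, not the patented implementation: the `Element` class, the function names and the cache layout are hypothetical, and the picture hint list uses only the tokens "image" and "photo". Step e (merging sub-elements) and the title and date recognition of step f are left out. The access path follows the body.div[2].p[3] example from the text, and `loose_path` drops the indexes of the last 3 levels.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Element:
    tag: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)  # Element objects or plain strings


# Hint tokens checked against the id/class attributes (steps b to d).
BLOCK_HINTS = (
    ("picture", ("image", "photo")),
    ("author", ("author", "writenby", "byline")),
    ("text", ("article", "post", "main", "content")),
)


def block_kind(el):
    """Return the information block kind suggested by id/class, or None."""
    key = (el.attrs.get("id", "") + " " + el.attrs.get("class", "")).lower()
    for kind, hints in BLOCK_HINTS:
        if any(h in key for h in hints):
            return kind
    return None


def loose_path(path, levels=3):
    """Drop the [index] part of the last `levels` segments of an access path."""
    parts = path.split(".")
    cut = max(len(parts) - levels, 1)  # never touch the leading "body"
    return ".".join(parts[:cut] + [re.sub(r"\[\d+\]$", "", p) for p in parts[cut:]])


def collect(body):
    """Depth-first traversal that numbers element nodes and caches candidates."""
    cache = {"picture": [], "author": [], "text": []}
    counter = 0  # global node count

    def visit(el, mode, number, path):
        nonlocal counter
        if el.tag == "img":  # step a
            if mode != "author":  # an author block only yields author nodes
                cache["picture"].append((number, path))
            return
        mode = block_kind(el) or mode  # steps b to d
        seen = {}  # per-tag serial number under this parent
        for child in el.children:  # step f
            if isinstance(child, Element):
                counter += 1
                seen[child.tag] = seen.get(child.tag, 0) + 1
                visit(child, mode, counter, f"{path}.{child.tag}[{seen[child.tag]}]")
            elif child.strip() and mode != "picture":
                bucket = "author" if mode == "author" else "text"
                cache[bucket].append((number, path, child.strip()))

    visit(body, None, 0, "body")
    return cache


page = Element("body", {}, [
    Element("div", {"class": "byline"}, ["Jane Doe"]),
    Element("div", {"id": "content"}, [
        Element("p", {}, ["Hello world."]),
        Element("img", {"src": "a.png"}),
    ]),
])
result = collect(page)
```

Passing the current block kind down the recursion plays the role of the global marking described above: once a block is entered, its descendants are only tested for that block's node type.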
In step e of the above embodiment, a specific method for determining whether the sub-elements of the element to be collected can be integrated and replaced is as follows:
1) if the element to be collected is a pre element, one of the title elements h1 to h6, or another display tag such as strong, b, i or em, the sub-elements of the element to be collected can be integrated and replaced, that is, merged directly into one element;
2) if the element to be collected is a p element, judging whether the element to be collected meets a first pre-integration condition, judging whether the element to be collected meets a second pre-integration condition on the basis of meeting the first pre-integration condition, and if both conditions are met, enabling sub-elements of the element to be collected to be integrated and replaced;
the first pre-integration condition is as follows: the element to be collected contains more than one text child node, or the ratio of the word count of link text to the word count of ordinary text in the sub-elements of the element to be collected is less than one third;
the second pre-integration condition is as follows: the element to be collected contains more than one sentence, or the node access path of the element to be collected is consistent with the node access path of the last collected text node, or the element to be collected is a simple element. A simple element is defined recursively: it is an element that contains only text nodes and at most one child element, which must itself be a simple element;
3) if the element to be collected contains both child element nodes and text child nodes, checking whether all the text of the element to be collected forms a short text, and if so, the sub-elements of the element to be collected can be integrated and replaced. A short text is a text that contains fewer than 3 stop words after Chinese word segmentation.
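The recursive definition of a simple element and the first pre-integration condition can be sketched as follows. This is a hedged illustration: the `Node` class and the function names are hypothetical, and word counts are passed in as plain integers because the source does not specify how they are computed.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical DOM node: tag is None for a text node."""
    tag: Optional[str] = None
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def is_simple(element: Node) -> bool:
    """A simple element contains only text nodes and at most one child
    element, and that child element must itself be simple."""
    child_elements = [c for c in element.children if c.tag is not None]
    if len(child_elements) > 1:
        return False
    return all(is_simple(c) for c in child_elements)


def first_pre_integration(text_children: int,
                          link_words: int,
                          plain_words: int) -> bool:
    """More than one text child node, or link text amounting to less than
    one third of the ordinary text. The comparison is written without a
    division so that an element with no ordinary text does not raise."""
    return text_children > 1 or link_words * 3 < plain_words
```

Writing the ratio test as `link_words * 3 < plain_words` is a design choice of this sketch, not something stated in the source.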
In step f of the foregoing embodiment, a specific method for caching the text child node as a visible title node, a date node, or a possible text node according to the recognition result includes:
1) comparing the similarity of the text content of the text sub-node with the title words in the title word set, and judging whether the text sub-node is a visible title node according to the comparison result;
2) extracting date and time information from the text content of the text child node based on a regular expression; if the extraction succeeds and the ratio of the date-time text to the whole text content is greater than a preset threshold of 0.5, the text child node is judged to be a pure date node and is not used as any other type of node, for example '2018-04-13 07:03:37 source: Xinhua';
3) if the text child node is neither a visible title node nor a pure date node, caching the text child node as a possible text node for subsequent analysis.
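The pure date node test in item 2) can be sketched as follows. The regular expression is an assumption chosen for illustration (the source does not give the pattern actually used); only the 0.5 threshold comes from the text.

```python
import re

# Assumed pattern: a year-month-day date with common separators,
# optionally followed by a time of day.
DATE_RE = re.compile(
    r"\d{4}[-/.年]\d{1,2}[-/.月]\d{1,2}日?"   # date part
    r"(?:\s*\d{1,2}:\d{2}(?::\d{2})?)?"       # optional time part
)


def is_pure_date(text: str, threshold: float = 0.5) -> bool:
    """Return True when a date-time string is found and it makes up more
    than `threshold` of the whole text content."""
    stripped = text.strip()
    if not stripped:
        return False
    match = DATE_RE.search(stripped)
    if match is None:
        return False
    return len(match.group(0)) / len(stripped) > threshold
```

A short byline such as a timestamp followed by a source name passes the test, while a long sentence that merely mentions a date does not, so that sentence stays available as a possible text node.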
The text is then extracted from the cached possible text nodes. Text extraction relies mainly on two observations: first, the text nodes come after the visible title node, i.e. their node count sequence numbers are greater than that of the visible title node; second, the body text nodes have similar access paths. Based on these observations, the text extraction specifically includes the following steps:
1) sequencing all possible text nodes in an ascending order according to the node counting sequence number;
2) finding the first target node among all possible text nodes, namely the first node whose node count sequence number is greater than that of the visible title node and whose sentence count is greater than 0 or whose content words are related to the content words of the visible title node, and marking the first target node as the p1 node;
3) taking the p1 node as a starting point, searching forward and backward for a second target node whose node count sequence number differs from that of the p1 node by less than 3 and whose access path is similar to that of the p1 node, taking the second target node as the new p1 node, and repeating this step until no new second target node can be found;
4) cleaning all possible text nodes before the p1 node, grouping all the remaining possible text nodes according to the node access path, sequencing the interior of each group in an ascending order according to the node counting sequence number, and sequencing the groups in an ascending order according to the node counting sequence number of the first node of each group;
5) calculating a preset parameter value of each group, importing the preset parameter value into a pre-trained prediction model for scoring, and generating a target group with the score larger than a preset score;
6) sorting the nodes in all the target groups in an ascending order according to the node counting sequence numbers, and forming a text node set;
7) caching the text node set.
In step 5) of this embodiment, the preset parameter values include the number of nodes, the total number of sentences, the total number of related phrases, the average number of related phrases, the text node count sequence number difference, the node count sequence number difference, and the similarity between the node access path of the current group and the node access path of the previous target group. For the first group, the text node count sequence number difference is the distance between the first text node of the current group and the visible title node; for every other group, it is the distance between the first text node of the current group and the last node of the previous target group.
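Steps 1), 2) and 4) of the text extraction can be sketched as follows. This is a simplified illustration under stated assumptions: the `Candidate` class and function names are hypothetical, the p1 condition is read as "after the title, and either containing a sentence or related to the title", and the path-similarity expansion of step 3) and the model scoring of step 5) are omitted.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Candidate:
    """Hypothetical cached possible text node."""
    node_seq: int                  # node count sequence number
    path: str                      # node access path
    sentences: int                 # number of sentences in the node
    related_to_title: bool = False


def find_p1(cands: List[Candidate], title_seq: int) -> Optional[Candidate]:
    """First candidate after the visible title node that has at least one
    sentence or whose content words relate to the title (assumed reading)."""
    for c in sorted(cands, key=lambda c: c.node_seq):
        if c.node_seq > title_seq and (c.sentences > 0 or c.related_to_title):
            return c
    return None


def group_after_p1(cands: List[Candidate], p1: Candidate) -> List[List[Candidate]]:
    """Drop candidates before p1, group the rest by access path, keep each
    group in ascending order, and order groups by their first node."""
    kept = sorted((c for c in cands if c.node_seq >= p1.node_seq),
                  key=lambda c: c.node_seq)
    groups: Dict[str, List[Candidate]] = {}
    for c in kept:
        groups.setdefault(c.path, []).append(c)
    return sorted(groups.values(), key=lambda g: g[0].node_seq)
```

Each resulting group would then be turned into the parameter values listed above and scored by the pre-trained prediction model.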
Then, extracting the release time according to the cached date node, which specifically comprises the following steps:
1) clearing invalid nodes from all date nodes, wherein an invalid node is a node whose node count sequence number comes after that of the first target node found in the text extraction analysis, because the release date node lies either before the visible title node or between the visible title node and the first text node;
2) acquiring, from the remaining date nodes, the target date node closest to the visible title node, wherein the node count sequence number difference of the target date node is lower than a first preset value and its text node count sequence number difference is lower than a second preset value.
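The two steps for selecting the release time can be sketched as follows. The `DateNode` class, the function name and the default values `max_node_gap=10` and `max_text_gap=5` are hypothetical; the source only states that a first and a second preset value exist.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DateNode:
    """Hypothetical cached date node."""
    text: str
    node_seq: int   # node count sequence number
    text_seq: int   # text node count sequence number


def pick_release_date(dates: List[DateNode],
                      title_node_seq: int,
                      title_text_seq: int,
                      p1_node_seq: int,
                      max_node_gap: int = 10,   # assumed first preset value
                      max_text_gap: int = 5     # assumed second preset value
                      ) -> Optional[DateNode]:
    # Step 1: date nodes after the first target node (p1) are invalid.
    valid = [d for d in dates if d.node_seq < p1_node_seq]
    # Step 2: take the remaining date node closest to the visible title node.
    best = min(valid, key=lambda d: abs(d.node_seq - title_node_seq),
               default=None)
    if best is None:
        return None
    if (abs(best.node_seq - title_node_seq) < max_node_gap
            and abs(best.text_seq - title_text_seq) < max_text_gap):
        return best
    return None
```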
And finally, extracting the text picture according to the cached picture node, which specifically comprises the following steps:
step 001, sorting the cached picture nodes in an ascending order according to the node counting sequence number;
step 002, acquiring a target picture node, and cleaning the target picture node and other picture nodes behind the target picture node, wherein the target picture node is the picture node which is closest to the last first target node and the node counting sequence number difference value is larger than a preset value;
step 003, obtaining the picture nodes with the node counting serial numbers between the text nodes and the visible title nodes, marking as interpolation picture nodes, then marking the picture nodes which are positioned in front of the visible title nodes and have the node distance lower than a preset value as interpolation picture nodes, merging the interpolation picture nodes into an interpolation picture node set, and caching the non-interpolation picture nodes;
step 004, obtaining the distance between each interpolation picture node and the node counting serial number of the visible title node, and sequencing all interpolation picture nodes according to the ascending order of the distance;
step 005, pre-screening all interpolation picture nodes according to a preset screening rule, and filtering out invalid pictures irrelevant to the text;
step 006, obtaining node access paths of the remaining interpolation picture nodes after the pre-screening, finding out nodes with the same node access paths in the interpolation picture node set in the step 003, and then repeating the step 004 and the step 005 to integrate the interpolation picture nodes and the non-interpolation picture nodes which are screened again.
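Steps 003 and 004 can be illustrated with a small sketch that works on node count sequence numbers only. It assumes that "between the text nodes and the visible title node" means after the title and before the last text node; the function name and the default `max_before_title=5` are hypothetical.

```python
from typing import List, Tuple


def classify_pictures(picture_seqs: List[int],
                      title_seq: int,
                      last_text_seq: int,
                      max_before_title: int = 5  # assumed preset value
                      ) -> Tuple[List[int], List[int]]:
    """Split picture nodes into interpolation picture nodes and
    non-interpolation picture nodes, then sort the former by their
    distance to the visible title node (ascending)."""
    interpolated: List[int] = []
    others: List[int] = []
    for seq in picture_seqs:
        between = title_seq < seq < last_text_seq
        just_before = seq < title_seq and title_seq - seq < max_before_title
        if between or just_before:
            interpolated.append(seq)
        else:
            others.append(seq)
    interpolated.sort(key=lambda s: abs(s - title_seq))
    return interpolated, others
```

The non-interpolation pictures are kept rather than discarded, since step 006 may later promote those that share an access path with a picture that survived pre-screening.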
In the above embodiment, the preset filtering rules include the following:
rule 1, filtering out common advertisement links based on the picture link of the interpolation picture node, for example a URL (uniform resource locator) path that includes common advertisement words, as well as common social network links or logos.
rule 2, acquiring the picture size information of the interpolation picture node, and filtering out banner pictures according to the aspect ratio of the picture and small pictures whose size is lower than a preset value. The picture size information is acquired as follows: if the current node specifies the width and height attributes and they are within a valid range, the size is taken directly from those attributes; otherwise, a network input stream is opened through the picture URL to acquire the size. When the size is obtained over the network, the complete picture does not need to be downloaded, since the size information is read only from the head of the network input stream. Meanwhile, the loose access path of the picture node is recorded; when other picture nodes are traversed, a node with the same loose access path can directly reuse the size information of this picture node without opening an additional network request.
rule 3, taking the picture node as a starting point, backtracking at most 3 levels of nodes along the node path, scoring the picture node by combining the id attributes and class attributes of those nodes, and filtering out pictures determined to be irrelevant according to the score. Meanwhile, the loose access path of the node is recorded; when other picture nodes are traversed, a node with the same loose access path is filtered out directly.
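Rule 2 relies on the fact that image formats store their dimensions near the start of the file. As a hedged illustration, the sketch below reads the width and height from the first 24 bytes of a PNG stream and applies a size and aspect-ratio filter. The choice of PNG, the function names and the thresholds `min_side=100` and `max_aspect=4.0` are assumptions; the source does not name a format or give concrete values.

```python
import struct
from typing import Optional, Tuple

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(header: bytes) -> Optional[Tuple[int, int]]:
    """Read (width, height) from the head of a PNG stream.

    Layout: 8-byte signature, 4-byte chunk length, 4-byte chunk type
    "IHDR", then width and height as big-endian 32-bit integers."""
    if len(header) < 24:
        return None
    if header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        return None
    width, height = struct.unpack(">II", header[16:24])
    return width, height


def is_body_picture(width: int, height: int,
                    min_side: int = 100,        # assumed minimum size
                    max_aspect: float = 4.0     # assumed banner cut-off
                    ) -> bool:
    """Reject small pictures and banner-shaped pictures."""
    if width < min_side or height < min_side:
        return False
    return max(width, height) / min(width, height) <= max_aspect
```

In practice only these leading bytes would need to be fetched from the network stream, and the result could be stored under the loose access path of the node so that sibling pictures reuse it.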
Step 3 of the above embodiment groups the nodes based on the node access paths, and performs scoring in units of groups to determine whether the node contents in the groups belong to the theme contents, thereby further improving the efficiency of extracting the theme contents of the web page.
The flow of the general webpage theme content extracting method is specifically described above with reference to fig. 1, and the structure of the general webpage theme content extracting system is described below with reference to fig. 2.
Fig. 2 is a schematic structural diagram of a general webpage theme content extraction system according to embodiment 2 of the present invention. As shown in Fig. 2, the system includes a DOM tree processing module, a cache module and an extraction module, wherein:
the DOM tree processing module is used for constructing a DOM tree of a target webpage, cleaning nodes of the DOM tree and marking attributes of the rest nodes of the DOM tree according to the relevance of the nodes and the text content;
the cache module is used for traversing the DOM tree after attribute marking, and classifying and caching the remaining nodes of the DOM tree as picture nodes, date nodes, text nodes or visible title nodes;
the extraction module is used for judging whether the content of the picture node, the content of the date node or the content of the text node is subject content according to the distances between the picture node, the date node and the text node and the visible title node respectively, and completing extraction of the subject content of the target webpage according to the judgment result, wherein the subject content comprises a text picture, release time and a text.
The embodiment identifies the text visual title node of the DOM tree and classifies and caches other nodes based on the strong association relation existing on the page structure, and then the distance between other category nodes and the text visual title node in the DOM tree is used as an important basis for judging whether the node belongs to the subject content, so that the precision and the efficiency of extracting the webpage information are improved.
In a preferred embodiment, the DOM tree processing module includes:
the analysis unit is used for downloading a source code of a target webpage and analyzing the source code into a DOM tree;
the title word generating unit is used for acquiring and caching the content of the title label nodes in the DOM tree, and meanwhile performing Chinese word segmentation and stop word removal on the content of the title label nodes to generate a title word set comprising a plurality of title words;
and the marking unit is used for traversing the DOM tree in a depth-first manner, and, after cleaning nodes of preset types in the DOM tree, judging whether the id attribute, the class attribute and/or the style attribute of the remaining nodes meet a first preset condition, and marking the remaining nodes according to the judgment result as elements definitely irrelevant to the text, elements possibly irrelevant to the text, or other elements.
In another preferred embodiment, the cache module includes:
the path generation unit is used for selecting a body element of the DOM tree as an initial node for performing depth-first recursive traversal, and generating a node access path corresponding to each residual element in the DOM tree;
and the cache unit is used for taking, according to the attribute marking information of the remaining elements in the DOM tree, the elements possibly irrelevant to the text and the other elements as the elements to be collected, collecting the information of the elements to be collected, and classifying and caching the elements to be collected as picture nodes, author nodes, date nodes, text nodes or visible title nodes.
The cache unit comprises a picture node cache unit, a picture information block node cache unit, an author information block node cache unit, a text information block node cache unit, a sub-element integration and replacement unit, a sub-element cache unit and an information recording unit,
the picture node cache unit is used for judging whether the element tag of the element to be collected is an img tag; if so, collecting and caching the element to be collected as a picture node, and if not, driving the picture information block node cache unit;
the picture information block node cache unit is used for judging whether the id attribute or the class attribute of the element to be collected contains an image, photo or gallery tag; if not, driving the author information block node cache unit; if so, judging that the element to be collected is a determined picture information block node and globally marking that the traversal of the DOM tree has entered a picture information collection block; when the child nodes of the element to be collected are traversed, judging whether each child node is a picture node, if so, collecting and caching the child node as a picture node, and if not, continuing to judge the next element to be collected;
the author information block node cache unit is used for judging whether the id attribute or the class attribute of the element to be collected contains an author, writenby or byline tag; if not, driving the text information block node cache unit; if so, judging that the element to be collected is a determined author information block node and globally marking that the traversal of the DOM tree has entered an author information collection block; when the child nodes of the element to be collected are traversed, judging whether each child node is an author node, if so, collecting and caching the child node as an author node, and if not, continuing to judge the next element to be collected;
the text information block node cache unit is used for judging whether the id attribute or the class attribute of the element to be collected contains an article, post, main or content tag; if not, driving the sub-element integration and replacement unit; if so, judging that the element to be collected is a determined text information block node and globally marking that the traversal of the DOM tree has entered a text information collection block; meanwhile, if no determined text information block has been collected so far and only non-determined text information blocks have been collected, emptying the currently collected non-determined text information blocks;
the sub-element integration and replacement unit is used for judging whether the element to be collected has sub-elements or not, if so, judging whether the sub-elements can be integrated and replaced or not, if so, replacing the integrated contents of all the sub-elements with the contents of the element to be collected and driving the sub-element cache unit, and if not, directly driving the sub-element cache unit;
the child element caching unit is used for traversing all child nodes of the elements to be collected and judging the types of the child nodes one by one, if the child nodes are element nodes, the global node count is increased by one, the picture node caching unit is driven to conduct recursive depth traversal again, if the child nodes are text child nodes, the content of the text child nodes is identified, and the text child nodes are cached as visible title nodes, date nodes or possible text nodes according to the identification result;
and the information recording unit is used for recording the node counting sequence number, the text node counting sequence number and the node access path of the element to be collected in the DOM tree.
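The chain of cache units above amounts to a sequence of checks on the tag name and on the id and class attributes. A minimal sketch of that dispatch is given below; the function name, the returned labels and the substring matching are assumptions, the spelling "gallery" is assumed, and "writenby" is kept as it appears in the text.

```python
from typing import Dict, Optional

# Keyword lists taken from the description; the matching strategy
# (case-insensitive substring search) is an assumption of this sketch.
BLOCK_KEYWORDS = [
    ("picture", ("image", "photo", "gallery")),
    ("author", ("author", "writenby", "byline")),
    ("text", ("article", "post", "main", "content")),
]


def classify_element(tag: str, attrs: Dict[str, str]) -> Optional[str]:
    """Return a label for the element to be collected, or None when no
    check applies and the sub-element integration step would run next."""
    if tag == "img":
        return "picture_node"
    marker = f"{attrs.get('id', '')} {attrs.get('class', '')}".lower()
    # The checks run in the same order as the cache units are driven.
    for kind, words in BLOCK_KEYWORDS:
        if any(word in marker for word in words):
            return f"{kind}_block"
    return None
```

Because the checks are ordered, an element whose class names both a picture keyword and a text keyword is treated as a picture information block in this sketch.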
In another preferred embodiment, the extraction module comprises a text extraction module, a release time extraction module and a text picture extraction module. The text extraction module specifically comprises:
the first sequencing unit is used for sequencing all possible text nodes in an ascending order according to the node counting sequence number;
the first target node generation unit is used for finding the first target node among all possible text nodes, namely the first node whose node count sequence number is greater than that of the visible title node and whose sentence count is greater than 0 or whose content words are related to the content words of the visible title node, and marking the first target node as the p1 node;
a circulation unit, configured to take the p1 node as a starting point, search forward and backward for a second target node whose node count sequence number differs from that of the p1 node by less than 3 and whose access path is similar to that of the p1 node, and take the second target node as the new p1 node, until no new second target node can be found;
the grouping unit is used for cleaning all possible text nodes before the p1 node, grouping all the remaining possible text nodes according to the node access path, sequencing the interior of each group in an ascending order according to the node counting sequence number, and sequencing the groups in an ascending order according to the node counting sequence number of the first node of each group;
the scoring unit is used for calculating a preset parameter value of each group, importing the preset parameter value into a pre-trained prediction model for scoring, and generating a target group with a score larger than a preset score;
and the first extraction unit is used for sequencing the nodes in all the target groups in an ascending order according to the node counting sequence numbers, forming a text node set and caching the text node set.
In a preferred embodiment, the release time extracting module specifically includes:
a cleaning unit, configured to clean an invalid node in all date nodes, where the invalid node is a node with a node count sequence number after a first target node;
and the second extraction unit is used for acquiring a target date node which is closest to the visible title node in the cleaned residual date nodes, wherein the node counting sequence number difference of the target date node is lower than a first preset value, and the text node counting sequence number difference is lower than a second preset value.
Preferably, the text picture extraction module specifically includes:
the second sorting unit is used for sorting the cached picture nodes in an ascending order according to the node counting sequence number;
the second target node generation unit is used for acquiring a target picture node, and cleaning the target picture node and other picture nodes behind the target picture node, wherein the target picture node is the picture node which is closest to the last first target node and the node counting sequence number difference value is larger than a third preset value;
the interpolation picture node generating unit is used for acquiring picture nodes whose node count sequence numbers lie between the text nodes and the visible title node and recording them as interpolation picture nodes, recording picture nodes that are located before the visible title node and whose node distance from the visible title node is lower than a fourth preset value as interpolation picture nodes as well, merging the interpolation picture nodes into an interpolation picture node set, and caching the non-interpolation picture nodes;
the third sequencing unit is used for acquiring the node count sequence number distance between each interpolation picture node and the visible title node and sorting all the interpolation picture nodes in ascending order of that distance;
the pre-screening unit is used for pre-screening all interpolation picture nodes according to a preset screening rule and filtering out invalid pictures irrelevant to the text;
and the third extraction unit is used for acquiring node access paths of the residual interpolation picture nodes after the pre-screening, finding out nodes with the same node access paths in the interpolation picture node set of the interpolation picture node generation unit, and then repeatedly driving the third sorting unit and the pre-screening unit to integrate the interpolation picture nodes and the non-interpolation picture nodes which are screened again.
The reader should understand that in the description of this specification, reference to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, such references do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, the various embodiments or examples described in this specification, and the features of different embodiments or examples, can be combined by one skilled in the art provided they do not contradict one another.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product that is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.
While the invention has been described with reference to specific embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (9)

1. A general webpage subject content extraction method is characterized by comprising the following steps:
step 1, constructing a DOM tree of a target webpage, cleaning nodes of the DOM tree, and marking attributes of the rest nodes of the DOM tree according to the relevance of the nodes and the text content;
step 2, traversing the DOM tree after attribute marking, and classifying and caching the remaining nodes of the DOM tree as picture nodes, date nodes, text nodes or visible title nodes;
step 3, judging whether the content of the picture node, the content of the date node and the content of the text node are subject content according to the distances between the picture node, the date node and the text node and the visible title node respectively, and finishing the extraction of the subject content of the target webpage according to the judgment result, wherein the subject content comprises a text picture, release time and a text;
the method for extracting the text according to the cached text nodes specifically comprises the following steps of:
sequencing all possible text nodes in an ascending order according to the node counting sequence number;
finding the first target node among all possible text nodes, namely the first node whose node count sequence number is greater than that of the visible title node and whose sentence count is greater than 0 or whose content words are related to the content words of the visible title node, and marking the first target node as the p1 node;
taking the p1 node as a starting point, searching forward and backward for a second target node whose node count sequence number differs from that of the p1 node by less than 3 and whose access path is similar to that of the p1 node, taking the second target node as the new p1 node, and repeating this step until no new second target node can be found;
cleaning all possible text nodes before the p1 node, grouping all the remaining possible text nodes according to the node access path, sequencing the interior of each group in an ascending order according to the node counting sequence number, and sequencing the groups in an ascending order according to the node counting sequence number of the first node of each group;
calculating a preset parameter value of each group, importing the preset parameter value into a pre-trained prediction model for scoring, and generating a target group with the score larger than a preset score;
sequencing the nodes in all the target groups in an ascending order according to the node counting sequence number, and forming a text node set;
caching the text node set.
2. The general webpage subject content extraction method according to claim 1, wherein step 1 specifically comprises the following steps:
S101, downloading the source code of a target webpage, and parsing the source code into a DOM tree;
S102, acquiring and caching the content of the title tag node in the DOM tree, and simultaneously performing Chinese word segmentation and stop word removal on the content of the title tag node to generate a title word set comprising a plurality of title words;
S103, traversing the DOM tree in a depth-first manner, and, after cleaning nodes of preset types in the DOM tree, judging whether the id attribute, the class attribute and/or the style attribute of the remaining nodes meet a first preset condition, and marking the remaining nodes according to the judgment result as elements definitely irrelevant to the text, elements possibly irrelevant to the text, or other elements.
3. The general webpage subject content extraction method according to claim 2, wherein step 2 specifically comprises the following steps:
S201, selecting the body element of the DOM tree as the initial node for depth-first recursive traversal, and generating a node access path corresponding to each remaining element in the DOM tree;
S202, according to the attribute marking information of the remaining elements in the DOM tree, taking the elements possibly irrelevant to the text and the other elements as the elements to be collected, collecting the information of the elements to be collected, and classifying and caching the elements to be collected as picture nodes, author nodes, date nodes, text nodes or visible title nodes.
4. The general webpage subject content extraction method according to claim 3, wherein in step S202, collecting the information of the elements to be collected and classifying and caching the elements to be collected specifically comprises the following steps:
step a, judging whether an element label of the element to be collected is an img label, if so, collecting and caching the element to be collected as a picture node, and if not, executing step b;
step b, judging whether the id attribute or the class attribute of the element to be collected contains an image, photo or gallery tag; if not, executing step c; if so, judging that the element to be collected is a determined picture information block node and globally marking that the traversal of the DOM tree has entered a picture information collection block; when the child nodes of the element to be collected are traversed, judging whether each child node is a picture node, if so, collecting and caching the child node as a picture node, and if not, continuing to judge the next element to be collected;
step c, judging whether the id attribute or the class attribute of the element to be collected contains an author, writenby or byline tag; if not, executing step d; if so, judging that the element to be collected is a determined author information block node and globally marking that the traversal of the DOM tree has entered an author information collection block; when the child nodes of the element to be collected are traversed, judging whether each child node is an author node, if so, collecting and caching the child node as an author node, and if not, continuing to judge the next element to be collected;
step d, judging whether the id attribute or the class attribute of the element to be collected contains an article, post, main or content tag; if not, executing step e; if so, judging that the element to be collected is a determined text information block node and globally marking that the traversal of the DOM tree has entered a text information collection block; meanwhile, if no determined text information block has been collected so far and only non-determined text information blocks have been collected, emptying the currently collected non-determined text information blocks;
step e, judging whether the element to be collected has a sub-element, if so, judging whether the sub-element can be integrated and replaced, if so, replacing the integrated contents of all the sub-elements with the contents of the element to be collected, and executing the step f, otherwise, directly executing the step f;
step f, traversing all child nodes of the element to be collected and processing them one by one as follows: judging the type of the child node; if the child node is an element node, adding one to the global node count and returning to step a to continue the recursive depth-first traversal; if the child node is a text child node, identifying the content of the text child node and caching it as a visible title node, a date node or a possible text node according to the identification result;
and recording the node counting sequence number, the text node counting sequence number and the node access path of the element to be collected in the DOM tree in the process of performing the depth-first recursion traversal.
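The traversal in steps c, d and f can be pictured with a minimal Python sketch. It is an illustration only, not the patented implementation: the `Element` and `TextRecord` classes, the `collect` function and the choice to count the start element as node 1 are all assumptions made for the example. Only the keyword lists come from steps c and d.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union

AUTHOR_HINTS = ("author", "writenby", "byline")      # keywords named in step c
BODY_HINTS = ("article", "post", "main", "content")  # keywords named in step d


@dataclass
class Element:
    """Hypothetical stand-in for a DOM element; children mix elements and text."""
    tag: str
    attrs: Dict[str, str] = field(default_factory=dict)
    children: List[Union["Element", str]] = field(default_factory=list)


@dataclass
class TextRecord:
    text: str
    node_seq: int          # node count sequence number of the enclosing element
    text_seq: int          # text node count sequence number
    path: str              # node access path from the start element
    in_author_block: bool
    in_body_block: bool


def has_hint(el: Element, hints) -> bool:
    """True if the id or class attribute contains one of the hint keywords."""
    value = f"{el.attrs.get('id', '')} {el.attrs.get('class', '')}".lower()
    return any(h in value for h in hints)


def collect(root: Element) -> List[TextRecord]:
    """Depth-first traversal that records counters and paths for text nodes."""
    records: List[TextRecord] = []
    counters = {"node": 0, "text": 0}

    def visit(el: Element, parent_path: str, in_author: bool, in_body: bool) -> None:
        counters["node"] += 1                 # assumption: every element is counted on entry
        node_seq = counters["node"]
        path = f"{parent_path}/{el.tag}"
        if has_hint(el, AUTHOR_HINTS):        # step c
            in_author = True
        elif has_hint(el, BODY_HINTS):        # step d, reached only if step c fails
            in_body = True
        for child in el.children:             # step f
            if isinstance(child, Element):
                visit(child, path, in_author, in_body)
            elif child.strip():
                counters["text"] += 1
                records.append(TextRecord(child.strip(), node_seq,
                                          counters["text"], path,
                                          in_author, in_body))

    visit(root, "", False, False)
    return records
```

Classifying each text record as a title, date or possible text node (the identification in step f) and the merge in step e are left out of the sketch.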
5. The universal webpage theme content extraction method according to claim 4, wherein extracting the release time according to the cached date nodes in step 3 specifically comprises the following steps:
removing invalid nodes from all date nodes, wherein the invalid nodes are date nodes whose node count sequence number is greater than that of the first target node;
and selecting, from the remaining date nodes, the target date node closest to the visible title node, wherein the difference between the node count sequence numbers of the target date node and the visible title node is below a first preset value, and the difference between their text node count sequence numbers is below a second preset value.
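One way to read the two steps of claim 5 is sketched below in Python. The `Cached` record, the `pick_date` function and the default thresholds `max_node_gap` and `max_text_gap` are hypothetical; the claim only states that a first and a second preset value exist and gives no numbers.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Cached:
    """Hypothetical cached node: its text plus the two recorded counters."""
    text: str
    node_seq: int   # node count sequence number
    text_seq: int   # text node count sequence number


def pick_date(dates: List[Cached], title: Cached, first_target: Cached,
              max_node_gap: int = 20, max_text_gap: int = 5) -> Optional[Cached]:
    """Return the date node closest to the title, or None if none qualifies."""
    # Remove invalid nodes: dates that come after the first target node.
    valid = [d for d in dates if d.node_seq <= first_target.node_seq]
    # Keep only dates within both (assumed) thresholds of the title node.
    near = [d for d in valid
            if abs(d.node_seq - title.node_seq) < max_node_gap
            and abs(d.text_seq - title.text_seq) < max_text_gap]
    # Closest by node count sequence number.
    return min(near, key=lambda d: abs(d.node_seq - title.node_seq), default=None)
```

Measuring closeness by the node count sequence number alone is a design choice of the sketch; a combined distance over both counters would fit the claim wording equally well.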
6. The universal webpage theme content extraction method according to claim 5, wherein extracting the text picture according to the cached picture nodes in step 3 specifically comprises the following steps:
step 001, sorting the cached picture nodes in ascending order of node count sequence number;
step 002, acquiring a target picture node and removing the target picture node and all picture nodes after it, wherein the target picture node is the picture node that is closest to the last first target node and whose node count sequence number differs from it by more than a third preset value;
step 003, marking each picture node whose node count sequence number lies between those of the text node and the visible title node as an interpolation picture node, then marking each picture node that is located before the visible title node and whose node distance from the visible title node is below a fourth preset value as an interpolation picture node, merging the interpolation picture nodes into an interpolation picture node set, and caching the non-interpolation picture nodes;
step 004, obtaining, for each interpolation picture node, the distance between its node count sequence number and that of the visible title node, and sorting all interpolation picture nodes in ascending order of that distance;
step 005, pre-screening all interpolation picture nodes according to a preset screening rule, and filtering out invalid pictures irrelevant to the text;
step 006, obtaining the node access paths of the interpolation picture nodes remaining after the pre-screening, finding the nodes with the same node access paths in the interpolation picture node set of step 003, and then repeating step 004 and step 005 to integrate the re-screened interpolation picture nodes and non-interpolation picture nodes.
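Steps 001 to 004 can be illustrated with a short Python sketch that works on node count sequence numbers only. It rests on several assumptions: `select_pictures`, `tail_gap` (third preset value) and `head_gap` (fourth preset value) are hypothetical names and defaults, the "last first target node" is taken to be the last body text node, and "between the text node and the visible title node" is read as lying between the title and that last text node. The screening rule of step 005 and the path matching of step 006 are not specified in the claim and are omitted.

```python
from typing import List, Tuple


def select_pictures(pic_seqs: List[int], title_seq: int, last_text_seq: int,
                    tail_gap: int = 10, head_gap: int = 5) -> Tuple[List[int], List[int]]:
    """Split picture nodes into (interpolation, non-interpolation) lists."""
    # Step 001: ascending order by node count sequence number.
    pics = sorted(pic_seqs)

    # Step 002: the first picture lying more than tail_gap past the last text
    # node is the cut-off; it and every later picture are removed.
    cutoff = next((s for s in pics if s - last_text_seq > tail_gap), None)
    if cutoff is not None:
        pics = [s for s in pics if s < cutoff]

    # Step 003: pictures between the title and the last text node, or shortly
    # before the title, count as interpolation pictures.
    inline = [s for s in pics
              if title_seq < s < last_text_seq or 0 < title_seq - s < head_gap]
    inline_set = set(inline)
    others = [s for s in pics if s not in inline_set]

    # Step 004: order interpolation pictures by distance to the title node.
    inline.sort(key=lambda s: abs(s - title_seq))
    return inline, others
```

Because Python's sort is stable, pictures at equal distance from the title keep their document order.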
7. A universal webpage theme content extraction system, characterized by comprising a DOM tree processing module, a cache module and an extraction module, wherein:
the DOM tree processing module is used for constructing a DOM tree of a target webpage, cleaning the nodes of the DOM tree, and marking attributes on the remaining nodes of the DOM tree according to the relevance of each node to the text content;
the cache module is used for traversing the attribute-marked DOM tree, and classifying and caching the remaining nodes of the DOM tree as picture nodes, date nodes, text nodes or visible title nodes;
the extraction module is used for judging whether the content of a picture node, a date node or a text node is theme content according to the respective distances between the picture node, the date node and the text node and the visible title node, and completing extraction of the theme content of the target webpage according to the judgment result, wherein the theme content comprises a text picture, a release time and a text;
the extraction module comprises a text extraction module, and the text extraction module specifically comprises:
a first sorting unit, used for sorting all possible text nodes in ascending order of node count sequence number;
a first target node generation unit, used for finding, among all possible text nodes, a first target node and marking it as the p1 node, wherein the first target node is the first node whose node count sequence number is greater than that of the visible title node and whose sentence count is greater than 0 or whose content words are related to the content words of the visible title node;
a loop unit, used for searching forward and backward from the p1 node for a second target node whose node count sequence number differs from that of the p1 node by less than 3 and whose node access path is similar to that of the p1 node, and taking the second target node as the new p1 node, until no new second target node can be found;
a grouping unit, used for removing all possible text nodes before the p1 node, grouping all remaining possible text nodes by node access path, sorting the nodes within each group in ascending order of node count sequence number, and sorting the groups in ascending order of the node count sequence number of the first node of each group;
a scoring unit, used for calculating preset parameter values for each group, feeding the preset parameter values into a pre-trained prediction model for scoring, and generating the target groups whose score is greater than a preset score;
and a first extraction unit, used for sorting the nodes in all target groups in ascending order of node count sequence number to form a text node set, and caching the text node set.
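The sorting, first-target, grouping and extraction units can be sketched in Python as follows. This is a simplified illustration under stated assumptions: `TextNode`, `find_p1`, `group_by_path` and `extract_body` are hypothetical names, "related content words" is reduced to a substring test, the loop unit is omitted, and the pre-trained prediction model is replaced by a caller-supplied `score` function, since the claim does not describe the model's parameters.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class TextNode:
    text: str
    node_seq: int   # node count sequence number
    path: str       # node access path


def sentence_count(text: str) -> int:
    """Count sentence-ending punctuation marks (Latin and Chinese)."""
    return len(re.findall(r"[.!?。！？]", text))


def find_p1(nodes: List[TextNode], title_seq: int,
            title_words: List[str]) -> Optional[TextNode]:
    """First node after the title with a sentence or a title word in it."""
    for n in sorted(nodes, key=lambda n: n.node_seq):
        if n.node_seq <= title_seq:
            continue
        related = any(w in n.text for w in title_words)
        if sentence_count(n.text) > 0 or related:
            return n
    return None


def group_by_path(nodes: List[TextNode], p1: TextNode) -> List[List[TextNode]]:
    """Drop nodes before p1, group the rest by access path, order the groups."""
    groups: Dict[str, List[TextNode]] = {}
    for n in sorted(nodes, key=lambda n: n.node_seq):
        if n.node_seq >= p1.node_seq:
            groups.setdefault(n.path, []).append(n)
    return sorted(groups.values(), key=lambda g: g[0].node_seq)


def extract_body(nodes: List[TextNode], title_seq: int, title_words: List[str],
                 score: Callable[[List[TextNode]], float],
                 min_score: float) -> List[TextNode]:
    """Keep the groups scoring above min_score, flattened in document order."""
    p1 = find_p1(nodes, title_seq, title_words)
    if p1 is None:
        return []
    kept = [n for g in group_by_path(nodes, p1) if score(g) > min_score for n in g]
    return sorted(kept, key=lambda n: n.node_seq)
```

A trivial stand-in for the model, such as the total character count of a group, is enough to try the flow; a real system would plug its trained model in through `score`.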
8. The universal webpage theme content extraction system according to claim 7, wherein the DOM tree processing module comprises:
an analysis unit, used for downloading the source code of a target webpage and parsing the source code into a DOM tree;
a title word generation unit, used for acquiring and caching the content of the title tag node in the DOM tree, and performing Chinese word segmentation and stop-word removal on that content to generate a title word set comprising a plurality of title words;
and a marking unit, used for traversing the DOM tree depth-first and, after cleaning the nodes of preset types from the DOM tree, judging whether the id attribute, the class attribute and/or the style attribute of each remaining node meets a first preset condition, and marking attributes on the remaining nodes according to the judgment result, so as to classify them as elements irrelevant to the text, elements possibly irrelevant to the text, and other elements.
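The title word generation unit and the marking unit might look roughly like the Python sketch below. Everything specific in it is assumed for illustration: the stop-word list, the keyword tuples `IRRELEVANT_HINTS` and `MAYBE_HINTS`, and the `display:none` check stand in for the unspecified first preset condition, and a regular-expression tokenizer stands in for Chinese word segmentation, which would need a dedicated segmenter in practice.

```python
import re
from typing import Dict, List

# Hypothetical stop-word list; a real system would load a full one.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "for", "on"}

# Hypothetical keywords standing in for the "first preset condition".
IRRELEVANT_HINTS = ("nav", "footer", "sidebar", "advert")
MAYBE_HINTS = ("comment", "share", "related")


def title_words(title: str) -> List[str]:
    """Tokenize the title, drop stop words and duplicates, keep order."""
    seen, words = set(), []
    for token in re.findall(r"\w+", title.lower()):
        if token not in STOP_WORDS and token not in seen:
            seen.add(token)
            words.append(token)
    return words


def mark(attrs: Dict[str, str]) -> str:
    """Classify an element by its id, class and style attributes."""
    style = attrs.get("style", "").replace(" ", "").lower()
    if "display:none" in style:
        return "irrelevant"
    key = f"{attrs.get('id', '')} {attrs.get('class', '')}".lower()
    if any(h in key for h in IRRELEVANT_HINTS):
        return "irrelevant"
    if any(h in key for h in MAYBE_HINTS):
        return "possibly_irrelevant"
    return "other"
```

Substring matching is deliberately loose here; it can misfire on names such as `canvas`, so a production rule set would likely match whole tokens.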
9. The universal webpage theme content extraction system according to claim 8, wherein the cache module comprises:
a path generation unit, used for selecting the body element of the DOM tree as the starting node of a depth-first recursive traversal, and generating the node access path corresponding to each remaining element in the DOM tree;
and a caching unit, used for taking the elements possibly irrelevant to the text and the other elements as elements to be collected according to the attribute mark information of the remaining elements in the DOM tree, collecting the information of the elements to be collected, and classifying and caching the elements to be collected as picture nodes, author nodes, date nodes, text nodes or visible title nodes.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810572726.0A CN108920434B (en) 2018-06-06 2018-06-06 Universal webpage theme content extraction method and system

Publications (2)

Publication Number Publication Date
CN108920434A CN108920434A (en) 2018-11-30
CN108920434B true CN108920434B (en) 2022-08-30

Family

ID=64419788

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810572726.0A Active CN108920434B (en) 2018-06-06 2018-06-06 Universal webpage theme content extraction method and system

Country Status (1)

Country Link
CN (1) CN108920434B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109657180B (en) * 2018-12-11 2021-11-26 中科国力(镇江)智能技术有限公司 Intelligent automatic fuzzy extraction system for webpage content
CN109815326B (en) * 2019-01-24 2021-09-10 网易(杭州)网络有限公司 Conversation control method and device
CN110309474A (en) * 2019-06-05 2019-10-08 上海易点时空网络有限公司 Document off-line system and method based on Electron
CN111241446B (en) * 2020-01-13 2023-10-31 杭州安恒信息技术股份有限公司 Method, device, equipment and medium for extracting text content of web page
CN111460259B (en) * 2020-03-31 2023-04-14 腾讯科技(深圳)有限公司 Method and device for determining similar elements, computer equipment and storage medium
CN112667940B (en) * 2020-10-15 2022-02-18 广东电子工业研究院有限公司 Webpage text extraction method based on deep learning
CN112765941A (en) * 2021-01-21 2021-05-07 语联网(武汉)信息技术有限公司 Method and system for automatically extracting webpage text
CN113204723A (en) * 2021-04-12 2021-08-03 仲恺农业工程学院 Page background matching method and device based on page theme
CN113392354B (en) * 2021-06-28 2022-09-13 山东亿云信息技术有限公司 Webpage text analysis method, system, medium and electronic equipment
CN113807050B (en) * 2021-07-01 2024-04-09 西安华讯科技有限责任公司 Node interception method, system, equipment and storage medium based on rich text
CN113626737B (en) * 2021-10-12 2022-03-11 北京天际友盟信息技术有限公司 Method and device for identifying main body link, electronic equipment and storage medium
CN114528811B (en) * 2022-01-21 2022-09-02 北京麦克斯泰科技有限公司 Article content extraction method, device, equipment and storage medium
CN114610580A (en) * 2022-03-17 2022-06-10 北京火山引擎科技有限公司 Page white screen monitoring method, device, equipment and medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102270206A (en) * 2010-06-03 2011-12-07 北京迅捷英翔网络科技有限公司 Method and device for capturing valid web page contents
CN105426388A (en) * 2015-10-23 2016-03-23 青岛恒波仪器有限公司 Apparatus for extracting and comparing webpage text
CN105574066A (en) * 2015-10-23 2016-05-11 青岛恒波仪器有限公司 Web page text extraction and comparison method and system thereof
CN105630941A (en) * 2015-12-23 2016-06-01 成都电科心通捷信科技有限公司 Statistics and webpage structure based Wen body text content extraction method
CN106528583A (en) * 2015-11-14 2017-03-22 孙燕群 Method for extracting and comparing web page main body
CN107391678A (en) * 2017-07-21 2017-11-24 福州大学 Web page content information extracting method based on cluster
CN107590219A (en) * 2017-09-04 2018-01-16 电子科技大学 Webpage personage subject correlation message extracting method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102831121B (en) * 2011-06-15 2015-07-08 阿里巴巴集团控股有限公司 Method and system for extracting webpage information

Also Published As

Publication number Publication date
CN108920434A (en) 2018-11-30

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant