CN108920434B

CN108920434B - Universal webpage theme content extraction method and system

Info

Publication number: CN108920434B
Application number: CN201810572726.0A
Authority: CN
Inventors: 钟刚
Original assignee: Wuhan Kuquan Data Technology Co ltd
Current assignee: Wuhan Kuquan Data Technology Co ltd
Priority date: 2018-06-06
Filing date: 2018-06-06
Publication date: 2022-08-30
Anticipated expiration: 2038-06-06
Also published as: CN108920434A

Abstract

The invention relates to a universal webpage subject content extraction method and system. The method comprises the following steps: constructing a DOM tree of a target webpage, cleaning the nodes of the DOM tree, and marking attributes of the remaining nodes according to their relevance to the body text; traversing the DOM tree, and classifying and caching the remaining nodes; and judging whether the content of a node is subject content according to the distance between the node in each category and the visible title node, and completing the extraction of the subject content of the target webpage according to the judgment result. The invention provides an improved semantic-based webpage information extraction method: relying on the strong association present in the page structure, it identifies the visible title node of the body text in the DOM tree, classifies and caches the other nodes, and then uses the distance in the DOM tree between each node of the other categories and the visible title node as an important basis for judging whether that node belongs to the subject content, thereby improving the precision and efficiency of webpage information extraction.

Description

Universal webpage theme content extraction method and system

Technical Field

The invention relates to the technical field of computer software, in particular to a universal webpage theme content extraction method and a universal webpage theme content extraction system.

Background

In today's internet era, most of the publicly visible information on the network is presented in the form of subject content, such as blog articles and the news on web portals. Such subject content is an important channel through which most Internet users obtain information, serves as a massive basic corpus for academic researchers, and has important value in the field of natural language processing. However, for many reasons, a subject content webpage does not consist purely of subject content; it also includes information that is not directly related to the subject content, such as advertisements, comments, related recommendations and website navigation. How to extract the subject content of a webpage from this cluttered webpage information has become a problem to be solved.

Currently, the existing topic content extraction methods are generally divided into two types: one is a semantic-based webpage information extraction method, and the other is a visual-based webpage blocking method. Both of the above approaches attempt to extract the information block where the true subject matter is located from the web page structure.

The semantic-based webpage information extraction generally has two modes, the first mode is to analyze the information based on the whole website, try to find out the repeated modules, such as a navigation bar and the like, among different webpages, and then remove the repeated modules when a certain webpage is specifically analyzed to find out the subject content; the second way is to simply rely on the currently analyzed web page itself to try to find some nodes of block-level elements in the HTML, and then analyze the text information of the node contents, such as the text length, to obtain the block-level element with the longest text length by comparison.

The visual-based webpage blocking method attempts to render the whole page through a browser engine, then divides the rendered page into blocks based on factors such as the background colors, fonts and borders of the page elements, merging closely related elements and treating loosely related elements as different blocks, thereby completing a visual-based block reconstruction of the whole page. Its drawback is that, in addition to analyzing the DOM tree built from the webpage source code, it must load the CSS (cascading style sheet) files and other resources the page depends on and rely on a browser engine for rendering, which makes the analysis of massive data relatively slow.

Disclosure of Invention

The invention provides a universal webpage theme content extraction method and a universal webpage theme content extraction system, which solve the technical problems of low accuracy and efficiency of webpage theme content extraction in the prior art.

The technical scheme for solving the technical problems is as follows: a general webpage subject content extraction method comprises the following steps:

step 1, constructing a DOM tree of a target webpage, cleaning nodes of the DOM tree, and marking attributes of the remaining nodes of the DOM tree according to the relevance of the nodes to the text content;

step 2, traversing the DOM tree after attribute marking, and classifying and caching the remaining nodes of the DOM tree into picture nodes, date nodes, text nodes or visual title nodes;

step 3, judging whether the content of the picture node, the content of the date node and the content of the text node are subject contents according to the distances between the picture node, the date node and the text node, respectively, and the visual title node, and completing the extraction of the subject contents of the target webpage according to the judgment result, wherein the subject contents comprise body text pictures, a release time and the body text.

The beneficial effects of the invention are as follows: the invention provides an improved semantic-based webpage information extraction method which, relying on the strong association present in the page structure, identifies the visual title node of the body text in the DOM tree, classifies and caches the other nodes, and then uses the distance in the DOM tree between each node of the other categories and that title node as an important basis for judging whether the node belongs to the subject content, thereby improving the precision and efficiency of webpage information extraction.

On the basis of the technical scheme, the invention can be further improved as follows.

Further, the step 1 specifically includes the following steps:

S101, downloading a source code of a target webpage, and parsing the source code into a DOM tree;

S102, acquiring and caching the content of the title tag node in the DOM tree, and performing Chinese word segmentation and stop word removal on the content of the title tag node to generate a title word set comprising a plurality of title words;

S103, traversing the DOM tree in a depth-first mode, and, after cleaning nodes of preset types in the DOM tree, judging whether the id attribute, the class attribute and/or the style attribute of the remaining nodes meet a first preset condition, and performing attribute marking on the remaining nodes according to the judgment result to determine elements irrelevant to the text, elements possibly irrelevant to the text and other elements.

Further, the step 2 specifically includes the following steps:

S201, selecting a body element of the DOM tree as an initial node for performing depth-first recursive traversal, and generating a node access path corresponding to each remaining element in the DOM tree;

S202, according to attribute marking information of the remaining elements in the DOM tree, taking the elements which are possibly irrelevant to the text and other elements as the elements to be collected, collecting the information of the elements to be collected, and classifying and caching the elements to be collected into picture nodes, author nodes, date nodes, text nodes or visual title nodes.

Further, in step S202, the information collection and classified caching of the to-be-collected elements specifically includes the following steps:

step a, judging whether an element tag of the element to be collected is an img tag, if so, collecting and caching the element to be collected as a picture node, and if not, executing the step b;

step b, judging whether the id attribute or the class attribute of the element to be collected contains an image, photo or gallery marker; if not, executing step c; if so, judging that the element to be collected is a determined picture information block node, and setting a global flag indicating that the traversal of the DOM tree has entered a picture information collection block, so that when the child nodes of the element to be collected are traversed, each child node is only judged as to whether it is a picture node; if it is, the child node is collected and cached as a picture node, and if not, the next element to be collected is judged;

step c, judging whether the id attribute or the class attribute of the element to be collected contains an author, writenby or byline marker; if not, executing step d; if so, judging that the element to be collected is a determined author information block node, and setting a global flag indicating that the traversal of the DOM tree has entered an author information collection block, so that when the child nodes of the element to be collected are traversed, each child node is only judged as to whether it is an author node; if it is, the child node is collected and cached as an author node, and if not, the next element to be collected is judged;

step d, judging whether the id attribute or the class attribute of the element to be collected contains an article, post, main or content marker; if not, executing step e; if so, judging that the element to be collected is a determined text information block node, and setting a global flag indicating that the traversal of the DOM tree has entered a text information collection block; in addition, if no determined text information block has been collected globally so far and only undetermined text information blocks have been collected, emptying the currently collected undetermined text information blocks;

step e, judging whether the element to be collected has sub-elements; if so, judging whether the sub-elements can be integrated and replaced, and if they can, replacing the content of the element to be collected with the integrated content of all its sub-elements and then executing step f; otherwise, directly executing step f;

step f, traversing all child nodes of the element to be collected and processing them one by one as follows: judging the type of the child node; if the child node is an element node, adding one to the global node count and returning to step a to continue the recursive depth-first traversal; if the child node is a text child node, identifying its content and caching it as a visible title node, a date node or a possible text node according to the identification result;

and recording, during the depth-first recursive traversal, the node count sequence number, the text node count sequence number and the node access path of each element to be collected in the DOM tree.

Further, extracting the body text in step 3 according to the cached possible text nodes specifically includes the following steps:

sorting all possible text nodes in ascending order of node count sequence number;

finding the first target node among all possible text nodes, namely the first node whose node count sequence number is greater than that of the visible title node and whose sentence count is greater than 0 or whose content words are related to the content words of the visible title node, and marking the first target node as the p1 node;

taking the p1 node as the starting point, searching forward and backward for a second target node whose node count sequence number differs from that of the p1 node by less than 3 and which is similar to the p1 node, taking the second target node as the new p1 node, and repeating this step until no new second target node can be found;

cleaning all possible text nodes before the p1 node, grouping all the remaining possible text nodes according to the node access path, sorting the nodes within each group in ascending order of node count sequence number, and sorting the groups in ascending order of the node count sequence number of the first node of each group;

calculating the preset parameter values of each group, feeding the preset parameter values into a pre-trained prediction model for scoring, and selecting as target groups the groups whose score is larger than a preset score;

sorting the nodes in all the target groups in ascending order of node count sequence number to form a text node set;

caching the text node set.

Further, the step 3 of extracting the release time according to the cached date node specifically includes the following steps:

clearing invalid nodes in all date nodes, wherein the invalid nodes are nodes with node counting serial numbers behind the first target node;

and acquiring a target date node closest to the visible title node in the cleaned residual date nodes, wherein the node counting sequence number difference of the target date node is lower than a first preset value, and the text node counting sequence number difference is lower than a second preset value.

Further, the extracting of the text picture according to the cached picture node in the step 3 specifically includes the following steps:

step 001, sorting the cached picture nodes in an ascending order according to the node counting sequence number;

step 002, acquiring a target picture node, and cleaning the target picture node and other picture nodes behind the target picture node, wherein the target picture node is the picture node which is closest to the last first target node and the node counting sequence number difference value is larger than a third preset value;

step 003, obtaining a picture node with a node counting sequence number between the text node and the visual title node, marking as an interpolation picture node, then marking a picture node which is positioned in front of the visual title node and has a node distance with the visual title node lower than a fourth preset value as an interpolation picture node, merging the interpolation picture node into an interpolation picture node set, and caching a non-interpolation picture node;

step 004, obtaining the distance between each interpolation picture node and the node counting serial number of the visible title node, and sequencing all interpolation picture nodes in an ascending order according to the distance;

step 005, pre-screening all interpolation picture nodes according to a preset screening rule to filter out invalid pictures irrelevant to the text;

step 006, obtaining node access paths of the remaining interpolation picture nodes after the pre-screening, finding out nodes with the same node access paths in the interpolation picture node set in the step 003, and then repeating the step 004 and the step 005 to integrate the interpolation picture nodes and the non-interpolation picture nodes which are screened again.
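Steps 003 and 004 can be illustrated with a minimal sketch. It rests on several assumptions that are not stated in the text: picture nodes are represented only by their node count sequence numbers, the "text node" bounding the range is taken to be the last body text node, and the function name `interpolation_pictures` and the default value standing in for the fourth preset value are hypothetical.

```python
def interpolation_pictures(
    picture_seqs: list[int],
    title_seq: int,
    last_text_seq: int,
    max_before_gap: int = 5,  # stands in for the "fourth preset value" (assumed)
) -> list[int]:
    """Pick pictures lying between the title and the body text, or just before
    the title, and order them by distance from the title (nearest first)."""
    chosen = [
        seq
        for seq in sorted(picture_seqs)
        # Between the visual title node and the (assumed) last text node ...
        if title_seq < seq < last_text_seq
        # ... or in front of the title and close enough to it.
        or 0 < title_seq - seq < max_before_gap
    ]
    return sorted(chosen, key=lambda seq: abs(seq - title_seq))


if __name__ == "__main__":
    print(interpolation_pictures([1, 8, 13, 18, 30], title_seq=10, last_text_seq=25))
```

The pre-screening of step 005 and the path-based re-collection of step 006 are left out because the screening rule itself is not described here.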

In order to solve the technical problem of the invention, the invention also provides a general webpage theme content extraction system, which comprises a DOM tree processing module, a cache module and an extraction module,

the DOM tree processing module is used for constructing a DOM tree of a target webpage, cleaning nodes of the DOM tree and marking attributes of the rest nodes of the DOM tree according to the relevance of the nodes and the text content;

the cache module is used for traversing the DOM tree after attribute marking, and classifying and caching the rest nodes of the DOM tree into picture nodes, date nodes, text nodes or visual title nodes;

the extraction module is used for judging whether the content of the picture node, the content of the date node or the content of the text node is subject content according to the distances between the picture node, the date node and the text node and the visual title node respectively, and finishing extraction of the subject content of the target webpage according to the judgment result, wherein the subject content comprises a text picture, release time and a text.

Further, the DOM tree processing module includes:

the analysis unit is used for downloading a source code of a target webpage and analyzing the source code into a DOM tree;

the title word generation unit is used for acquiring and caching the content of the title label nodes in the DOM tree, performing Chinese word segmentation and stop word removal on the content of the title label nodes, and generating a title word set comprising a plurality of title words;

and the marking unit is used for traversing the DOM tree in a depth-first mode, judging whether the id attribute, the class attribute and/or the style attribute of the residual nodes meet a first preset condition after cleaning the nodes of the preset type in the DOM tree, and performing attribute marking on the residual nodes according to the judgment result to determine elements irrelevant to the text, elements possibly irrelevant to the text and other elements.

Further, the cache module comprises:

the path generation unit is used for selecting a body element of the DOM tree as an initial node for performing depth-first recursive traversal, and generating a node access path corresponding to each residual element in the DOM tree;

and the caching unit is used for taking the elements which are possibly irrelevant to the text and other elements as the elements to be collected according to the attribute mark information of the rest elements in the DOM tree, collecting the information of the elements to be collected, and classifying and caching the elements to be collected into picture nodes, author nodes, date nodes, text nodes or visual title nodes.

Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.

Drawings

Fig. 1 is a schematic flowchart of a general method for extracting webpage theme content according to embodiment 1;

Fig. 2 is a schematic structural diagram of a general webpage theme content extraction system provided in embodiment 2.

Detailed Description

The principles and features of this invention are described below in conjunction with the following drawings, which are set forth to illustrate, but are not to be construed to limit the scope of the invention.

Fig. 1 is a schematic flowchart of a general method for extracting webpage theme content provided in embodiment 1, and as shown in fig. 1, the method includes the following steps:

step 1, constructing a DOM tree of a target webpage, and cleaning and marking the nodes of the DOM tree;

step 2, traversing the cleaned DOM tree with the attribute marked, and classifying and caching the rest nodes of the DOM tree into picture nodes, visual title nodes, date nodes or text nodes;

step 3, extracting the subject content of the target webpage from the cached information, wherein the subject content comprises the body text, a release time and body text pictures.

This embodiment relies on the strong association present in the page structure: it identifies the visual title node of the body text in the DOM tree, classifies and caches the other nodes, and then uses the distance in the DOM tree between each node of the other categories and that title node as an important basis for judging whether the node belongs to the subject content, thereby improving the precision and efficiency of webpage information extraction. Each step of the above embodiment is described in detail below.

In the above embodiment 1, the step 1 specifically includes the following steps:

S101, downloading the source code of the target webpage and parsing the source code into a DOM tree. The target webpage is usually given as a webpage link; the source code can be downloaded through the link and then parsed into a DOM tree with an open source tool. The specific parsing method is documented in the prior art and is not described in detail herein.

S102, obtaining and caching the content of the title tag node in the DOM tree, performing Chinese word segmentation and stop word removal on that content, and generating a title word set comprising a plurality of title words. Specifically, a CSS selector can be used to find the title tag node in the DOM tree and obtain its content, which is the title information of the target webpage; Chinese word segmentation and stop word removal are then performed on the title information to obtain the title word set, and the title words in the set are later used to identify the visible title node in the body text. The visible title node here refers to the node in which the title words appear, not the title tag node described above.
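A minimal sketch of how a title word set might be built is given here. The segmenter is only a stand-in: it treats each run of ASCII letters or digits as one token and each CJK character as its own token, whereas a real system would use a proper Chinese word segmentation tool. The names `segment`, `title_word_set` and the tiny `STOP_WORDS` list are hypothetical.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real one would be much larger.
STOP_WORDS = {"的", "了", "和", "是", "在", "the", "a", "an", "of", "and"}


def segment(text: str) -> list[str]:
    """Stand-in segmenter (assumption): ASCII letter/digit runs become one token
    each, and every CJK character becomes a token of its own."""
    return re.findall(r"[A-Za-z0-9]+|[\u4e00-\u9fff]", text.lower())


def title_word_set(title_text: str) -> set[str]:
    """Segment the title tag content and drop stop words."""
    return {token for token in segment(title_text) if token not in STOP_WORDS}


if __name__ == "__main__":
    print(sorted(title_word_set("The Launch of Python 3")))
```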

S103, traversing the DOM tree in a depth-first mode, after cleaning nodes of preset types in the DOM tree, judging whether the id attribute, the class attribute and/or the style attribute of the remaining nodes meet a first preset condition, and performing attribute marking on the remaining nodes according to a judgment result to determine elements irrelevant to the text, elements possibly irrelevant to the text and other elements. In this embodiment, the preset type nodes are nodes obviously unrelated to the text content, such as nodes that are neither text nodes nor element nodes, and various script nodes, such as meta, title, link nodes, and the like.

The first preset condition is as follows: if the id attribute or the class attribute of a node contains text such as banner, comment, sidebar or logo, or the style attribute of the node contains display: none, the node is judged to be an element determined to be irrelevant to the text. After the attributes of the remaining nodes have been judged and the judgment result generated, each such node is marked with the special marking attribute score rather than being cleaned directly, which prevents the subsequent node count sequence numbering from being disturbed. Because attribute marking has been performed on the remaining nodes in this step, the remaining nodes are also referred to as remaining elements in the detailed step description below; the two terms have the same meaning.
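The first preset condition can be sketched as a small predicate over an element's attributes. This is an illustration under assumptions: attributes are passed as a plain dictionary, the keyword tuple contains only the four examples named in the text, and the marker attribute name `data-mark` and its values are hypothetical. Only the "determined irrelevant" case is shown, since the rule for "possibly irrelevant" elements is not spelled out.

```python
import re

# The four example keywords named in the text; a real list may be longer.
IRRELEVANT_KEYWORDS = ("banner", "comment", "sidebar", "logo")
HIDDEN_STYLE = re.compile(r"display\s*:\s*none", re.IGNORECASE)


def is_irrelevant(attrs: dict[str, str]) -> bool:
    """True if id/class contains an irrelevant keyword or the style hides the node."""
    id_class = (attrs.get("id", "") + " " + attrs.get("class", "")).lower()
    if any(keyword in id_class for keyword in IRRELEVANT_KEYWORDS):
        return True
    return bool(HIDDEN_STYLE.search(attrs.get("style", "")))


def mark(attrs: dict[str, str]) -> dict[str, str]:
    """Return a copy of attrs carrying a marker; the node is kept, not removed,
    so later sequence numbering is unaffected."""
    marked = dict(attrs)
    marked["data-mark"] = "irrelevant" if is_irrelevant(attrs) else "other"
    return marked


if __name__ == "__main__":
    print(mark({"id": "sidebar"}))
```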

Then traversing and collecting information of the DOM tree, and specifically comprising the following steps:

s201, selecting a body element of the DOM tree as an initial node for depth-first recursive traversal, and generating a node access path corresponding to each residual element in the DOM tree. The node access path of the body element is an empty character string, the node access paths of the rest elements in the DOM tree are complete paths from the body element to the element, and the node access paths are formed by splicing the node name of each node on the path and the serial numbers of the nodes under the parent nodes of the nodes. For example, the third p element under the second div element under body, with the access path being body.div [2]. p [3 ]. Meanwhile, when the node access path is long, a loose access path of the node may be specified, that is, the indexes of the last 3 levels on the node access path of the ignored element are used, and the loose access path is used to replace the node access path used in the subsequent step. For example, the node access path of an element is body.div [2]. div [1]. table [1]. div [2]. p [1], and its corresponding loose access path is body.div [2]. div [1]. table.div.p.

S202, according to attribute mark information of the rest elements in the DOM tree, taking the elements which are possibly irrelevant to the text and other elements as elements to be collected, collecting the information of the elements to be collected, and classifying and caching the elements to be collected into picture nodes, author nodes, visual title nodes, date nodes or text nodes, wherein the specific caching method comprises the following steps:

step a, judging whether the element label of the element to be collected is img label, if yes, collecting and caching the element to be collected as a picture node, because one picture element cannot be simultaneously elements such as date or title, and if not, executing step b.

Step b, judging whether the id attribute or the class attribute of the element to be collected contains an image, photo or gallery marker; if not, executing step c. If so, the element to be collected is judged to be a determined picture information block node, and a global flag is set indicating that the traversal of the DOM tree has entered a picture information collection block. While the child nodes of the element to be collected are traversed, each child node is only judged as to whether it is a picture node, with no attempt to judge whether it is an author, date, title or similar element, which improves extraction efficiency. If the child node is a picture node, it is collected and cached as a picture node; if not, the next element to be collected is judged.

Step c, judging whether the id attribute or the class attribute of the element to be collected contains an author, writenby or byline marker; if not, executing step d. If so, the element to be collected is judged to be a determined author information block node, and a global flag is set indicating that the traversal of the DOM tree has entered an author information collection block. While the child nodes of the element to be collected are traversed, each child node is only judged as to whether it is an author node, with no attempt to identify whether it is a picture, date, title or similar element, which further improves extraction efficiency. If the child node is an author node, it is collected and cached as an author node; if not, the next element to be collected is judged.

Step d, judging whether the id attribute or the class attribute of the element to be collected contains an article, post, main or content marker; if not, executing step e. If so, the element to be collected is judged to be a determined text information block node, and a global flag is set indicating that the traversal of the DOM tree has entered a text information collection block. In addition, if no determined text information block has been collected globally so far and only undetermined text information blocks have been collected, the currently collected undetermined text information blocks are emptied.

Step e, judging whether the element to be collected has sub-elements; if so, judging whether the sub-elements can be integrated and replaced. If they can, the content of the element to be collected is replaced with the integrated content of all its sub-elements and step f is executed; otherwise, step f is executed directly.

Step f, traversing all child nodes of the element to be collected and processing them one by one as follows: judging the type of the child node; if the child node is an element node, adding one to the global node count and returning to step a to continue the recursive depth-first traversal; if the child node is a text child node, identifying its content and caching it as a visible title node, a date node or a possible text node according to the identification result.

During the depth-first recursive traversal, the node count sequence number, the text node count sequence number and the node access path of each element to be collected in the DOM tree are recorded.
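The keyword checks of steps a to d can be sketched as a single classification function. This is only an illustration: the function name, the returned labels and the dictionary representation of attributes are hypothetical, the keyword tuples are illustrative, and `gallery` in particular is an assumed reading of the third picture keyword. The recursion of steps e and f and the global collection flags are omitted.

```python
# Keyword tuples follow the order of steps b, c and d.
PICTURE_KEYS = ("image", "photo", "gallery")  # "gallery" is an assumed reading
AUTHOR_KEYS = ("author", "writenby", "byline")  # spelled as in the text
TEXT_KEYS = ("article", "post", "main", "content")


def classify_element(tag: str, attrs: dict[str, str]) -> str:
    """Return a hypothetical label describing what kind of block the element opens."""
    # Step a: an img element is always a picture node.
    if tag.lower() == "img":
        return "picture_node"
    id_class = (attrs.get("id", "") + " " + attrs.get("class", "")).lower()
    # Steps b to d: the first keyword family that matches decides the block type.
    for label, keys in (
        ("picture_block", PICTURE_KEYS),
        ("author_block", AUTHOR_KEYS),
        ("text_block", TEXT_KEYS),
    ):
        if any(key in id_class for key in keys):
            return label
    return "undetermined"


if __name__ == "__main__":
    print(classify_element("div", {"class": "post-body"}))
```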

In step e of the above embodiment, a specific method for determining whether the sub-elements of the element to be collected can be integrated and replaced is as follows:

1) if the element to be collected is a pre element, a heading element h1 to h6, or another display tag such as strong, b, i or em, the sub-elements of the element to be collected can be integrated and replaced, i.e., they can be merged directly into one element;

2) if the element to be collected is a p element, judging whether the element to be collected meets a first pre-integration condition, judging whether the element to be collected meets a second pre-integration condition on the basis of meeting the first pre-integration condition, and if both conditions are met, enabling sub-elements of the element to be collected to be integrated and replaced;

the first pre-integration condition is as follows: the element to be collected comprises more than one text child node or the text word ratio value of the link text and the common text in the child elements of the element to be collected is less than one third;

the second pre-integration condition is as follows: the element to be collected has more than one sentence, the node access path of the element to be collected is consistent with the node access path of the last collected text node, or the element to be collected is a simple element. A simple element is an element that contains only text nodes and at most one simple element; the definition is recursive;

3) if the element to be collected contains both child element nodes and text child nodes, checking whether all texts of the element to be collected form short texts, and if so, enabling the child elements of the element to be collected to be integrated and replaced. The short text means that the text contains less than 3 stop words after Chinese word segmentation.
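Two of the checks above lend themselves to a short sketch: the first pre-integration condition and the short-text test. The inputs are assumed to be precomputed counts and token lists, the function names are hypothetical, and treating an element with no plain text as failing the ratio test is an assumption made only to avoid division by zero.

```python
def first_pre_integration(text_child_count: int, link_words: int, plain_words: int) -> bool:
    """More than one text child node, or link text under one third of plain text."""
    if text_child_count > 1:
        return True
    if plain_words == 0:  # assumption: no plain text means the ratio test fails
        return False
    return link_words / plain_words < 1 / 3


def is_short_text(tokens: list[str], stop_words: set[str]) -> bool:
    """Short text: fewer than 3 stop words among the segmented tokens."""
    return sum(1 for token in tokens if token in stop_words) < 3


if __name__ == "__main__":
    print(first_pre_integration(1, 1, 4))
    print(is_short_text(["a", "the", "cat"], {"a", "the", "of"}))
```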

In step f of the foregoing embodiment, a specific method for caching the text child node as a visible title node, a date node, or a possible text node according to the recognition result includes:

1) comparing the similarity of the text content of the text sub-node with the title words in the title word set, and judging whether the text sub-node is a visible title node according to the comparison result;

2) extracting date and time information from the text content of the text child node based on a regular expression; if the extraction succeeds and the ratio of the date and time text to the whole text content is greater than a preset threshold of 0.5, judging the text child node to be a pure date node, which is not used as any other type of node, for example '2018-04-13 07:03:37 Source: Xinhua';

3) if the text child node is neither a visible title node nor a pure date node, caching the text child node as a possible text node of the body text for subsequent analysis.
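The three-way decision above can be sketched as follows. Several details are assumptions: the date pattern covers only a few common layouts, the title similarity is approximated by the fraction of title words found in the node's tokens, the 0.8 title threshold is hypothetical, and the caller is expected to supply already segmented tokens. Only the 0.5 date ratio comes from the text.

```python
import re

# Assumed date layouts such as 2018-04-13, 2018/4/13 or 2018年4月13日, with optional time.
DATE_RE = re.compile(
    r"\d{4}[-/.年]\d{1,2}[-/.月]\d{1,2}日?(?:\s*\d{1,2}:\d{2}(?::\d{2})?)?"
)


def date_ratio(text: str) -> float:
    """Share of the text taken up by the first date/time match (0.0 if none)."""
    stripped = text.strip()
    match = DATE_RE.search(stripped)
    return len(match.group(0)) / len(stripped) if match else 0.0


def title_overlap(tokens: list[str], title_words: set[str]) -> float:
    """Fraction of title words that appear among the node's tokens."""
    return len(title_words & set(tokens)) / len(title_words) if title_words else 0.0


def classify_text(
    text: str,
    tokens: list[str],
    title_words: set[str],
    title_threshold: float = 0.8,  # hypothetical threshold
) -> str:
    if title_overlap(tokens, title_words) >= title_threshold:
        return "visible_title"
    if date_ratio(text) > 0.5:  # threshold stated in the text
        return "date"
    return "possible_text"


if __name__ == "__main__":
    print(classify_text("2018-04-13 07:03:37 Source: Xinhua", ["source", "xinhua"], {"launch"}))
```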

And then extracting the text according to the cached possible text nodes of the text. The text extraction is mainly based on the following two facts: first, the text node is behind the visible title node, i.e., its node count number is greater than the node count number of the visible title node. Second, the body text nodes have similar access paths. Based on the above facts, the text extraction specifically includes the following steps:

1) sequencing all possible text nodes in an ascending order according to the node counting sequence number;

2) finding the first target node among all possible text nodes, namely the first node whose node count sequence number is greater than that of the visible title node and whose sentence count is greater than 0 or whose content words are related to the content words of the visible title node, and marking the first target node as the p1 node;

3) taking the p1 node as the starting point, searching forward and backward for a second target node whose node count sequence number differs from that of the p1 node by less than 3 and which is similar to the p1 node, taking the second target node as the new p1 node, and repeating this step until no new second target node can be found;

4) cleaning all possible text nodes before the p1 node, grouping all the remaining possible text nodes according to the node access path, sequencing the interior of each group in an ascending order according to the node counting sequence number, and sequencing the groups in an ascending order according to the node counting sequence number of the first node of each group;

5) calculating a preset parameter value of each group, importing the preset parameter value into a pre-trained prediction model for scoring, and generating a target group with the score larger than a preset score;

6) sorting the nodes in all the target groups in an ascending order according to the node counting sequence numbers, and forming a text node set;

7) caching the text node set.
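Steps 1) to 4) above can be illustrated with a short sketch. Several points are assumptions made only for the example: the `TextNode` fields, the reading of the first-target condition as "after the title, and (has sentences or is related to the title)", path similarity taken as path equality, and moving p1 only toward earlier nodes so that the loop terminates. The scoring of step 5) is left out.

```python
from dataclasses import dataclass


@dataclass
class TextNode:
    seq: int          # node counting sequence number
    path: str         # node access path, e.g. "body/div/p"
    sentences: int    # number of sentences in the node text
    related: bool     # content words related to the visible title?


def similar_path(a, b):
    """Assumed similarity test: identical paths (the text does not define it)."""
    return a == b


def find_p1(nodes, title_seq):
    """Steps 1-3: find the first target node, then move p1 to nearby similar nodes."""
    ordered = sorted(nodes, key=lambda n: n.seq)
    p1 = next((n for n in ordered
               if n.seq > title_seq and (n.sentences > 0 or n.related)), None)
    if p1 is None:
        return None
    moved = True
    while moved:
        moved = False
        for n in ordered:
            # Assumption: only earlier nodes within a distance of less than 3 qualify.
            if n.seq < p1.seq and p1.seq - n.seq < 3 and similar_path(n.path, p1.path):
                p1, moved = n, True
                break
    return p1


def group_after_p1(nodes, p1):
    """Step 4: drop nodes before p1, group by access path, order groups by first node."""
    groups = {}
    for n in sorted(nodes, key=lambda n: n.seq):
        if n.seq >= p1.seq:
            groups.setdefault(n.path, []).append(n)
    return sorted(groups.values(), key=lambda g: g[0].seq)
```

Because nodes are inserted in ascending order, each group is already sorted internally, and the final sort orders the groups by the sequence number of their first node.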

In step 5) of this embodiment, the preset parameter values include the number of nodes, the total number of sentences, the total number of related phrases, the average number of related phrases, the text node counting sequence number difference, the node counting sequence number difference, and the similarity between the node access path of the current group and the node access path of the previous target group. For the first group, the text node counting sequence number difference is the distance between the first text node of the current group and the visible title node; for other groups, it is the distance between the first text node of the current group and the last node of the previous target group.
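As a rough illustration of how these parameter values might be computed for one group, consider the sketch below. The `Node` fields, the use of `difflib.SequenceMatcher` over path segments as the path similarity, and the default similarity of 1.0 for the first group are all assumptions; the pre-trained prediction model itself is not shown.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Node:
    seq: int              # node counting sequence number
    text_seq: int         # text node counting sequence number
    sentences: int        # sentences in this node
    related_phrases: int  # phrases related to the title
    path: str             # node access path


def path_similarity(a, b):
    """Assumed measure: similarity ratio over the '/'-separated path segments."""
    return SequenceMatcher(None, a.split("/"), b.split("/")).ratio()


def group_features(group, anchor_seq, anchor_text_seq, prev_path=None):
    """Feature values for one group.

    anchor_* belongs to the visible title node for the first group, and to the
    last node of the previous target group for every later group.
    """
    first = group[0]
    related = sum(n.related_phrases for n in group)
    return {
        "node_count": len(group),
        "total_sentences": sum(n.sentences for n in group),
        "total_related_phrases": related,
        "avg_related_phrases": related / len(group),
        "text_seq_diff": first.text_seq - anchor_text_seq,
        "seq_diff": first.seq - anchor_seq,
        # Assumption: with no previous target group, similarity defaults to 1.0.
        "path_similarity": path_similarity(first.path, prev_path) if prev_path else 1.0,
    }
```

The resulting dictionary would then be turned into a feature vector and passed to whatever model was trained for step 5); groups scoring above the preset score become target groups.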

Then, extracting the release time according to the cached date node, which specifically comprises the following steps:

1) clearing invalid nodes from all date nodes, wherein an invalid node is a node whose node counting sequence number comes after that of the first target node found during text extraction, because the release date node lies either before the visible title node or between the visible title node and the first text node.

2) acquiring, from the remaining date nodes, the target date node closest to the visible title node, wherein the node counting sequence number difference of the target date node is lower than a first preset value and its text node counting sequence number difference is lower than a second preset value.
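The two steps above amount to a filter followed by a nearest-neighbour choice. The sketch below assumes a `DateNode` record and illustrative values for the two preset thresholds (`max_seq_diff`, `max_text_seq_diff`); none of these names or numbers appear in the text.

```python
from dataclasses import dataclass


@dataclass
class DateNode:
    seq: int       # node counting sequence number
    text_seq: int  # text node counting sequence number
    text: str      # raw date text


def pick_release_date(dates, title_seq, title_text_seq, first_text_seq,
                      max_seq_diff=20, max_text_seq_diff=5):
    """Return the date node closest to the visible title node, or None."""
    # Step 1: discard date nodes located after the first target (text) node.
    valid = [d for d in dates if d.seq < first_text_seq]
    # Step 2: keep nodes within both preset distances, then take the closest one.
    valid = [d for d in valid
             if abs(d.seq - title_seq) < max_seq_diff
             and abs(d.text_seq - title_text_seq) < max_text_seq_diff]
    return min(valid, key=lambda d: abs(d.seq - title_seq), default=None)
```

Applying the thresholds before choosing the closest node is one possible reading; choosing the closest node first and then checking the thresholds would be another.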

And finally, extracting the text picture according to the cached picture node, which specifically comprises the following steps:

step 002, acquiring a target picture node, and cleaning the target picture node and other picture nodes behind the target picture node, wherein the target picture node is the picture node which is closest to the last first target node and the node counting sequence number difference value is larger than a preset value;

step 003, obtaining the picture nodes with the node counting serial numbers between the text nodes and the visible title nodes, marking as interpolation picture nodes, then marking the picture nodes which are positioned in front of the visible title nodes and have the node distance lower than a preset value as interpolation picture nodes, merging the interpolation picture nodes into an interpolation picture node set, and caching the non-interpolation picture nodes;

step 004, obtaining the distance between each interpolation picture node and the node counting serial number of the visible title node, and sequencing all interpolation picture nodes according to the ascending order of the distance;

step 005, pre-screening all interpolation picture nodes according to a preset screening rule, and filtering out invalid pictures irrelevant to the text;
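The selection and ordering of interpolation picture nodes (steps 003 and 004) can be sketched as below. The function name, the representation of pictures by their sequence numbers alone, the use of the last text node as the upper bound, and the value of `max_before` are assumptions for illustration.

```python
def select_inline_pictures(picture_seqs, title_seq, last_text_seq, max_before=5):
    """Split picture nodes into interpolation pictures and the rest.

    picture_seqs holds node counting sequence numbers of cached picture nodes.
    """
    inline, others = [], []
    for seq in sorted(picture_seqs):
        # Pictures located between the visible title node and the text nodes.
        between = title_seq < seq < last_text_seq
        # Pictures just before the title, closer than the preset distance.
        just_before = seq < title_seq and title_seq - seq < max_before
        (inline if between or just_before else others).append(seq)
    # Step 004: order interpolation pictures by distance to the title node.
    inline.sort(key=lambda s: abs(s - title_seq))
    return inline, others
```

For example, with the title at sequence number 8 and the last text node at 20, pictures at 6, 12 and 15 are kept as interpolation pictures, while pictures at 1 and 30 are cached as non-interpolation pictures.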

In the above embodiment, the preset filtering rules include the following:

Rule 1, filtering out common advertisement links based on the picture links of the interpolation picture nodes, for example links whose URL (uniform resource locator) path includes common advertisement words, or links to common social networks or logos.

Rule 2, acquiring the picture size information of the interpolation picture node, and filtering out banner pictures according to the aspect ratio of the picture and small pictures whose size is lower than a preset value. The picture size information is acquired as follows: if the current node specifies width and height attributes and the attributes are within a valid range, the size is taken directly from them; otherwise, a network input stream is opened through the picture URL to acquire the size. When the size is acquired over the network, the complete picture need not be downloaded; the size information is read only from the head of the network input stream. Meanwhile, the loose access path of the picture node is recorded, so that when other picture nodes are traversed, any node with the same loose access path can reuse the size information of this picture node without opening an additional network request.

Rule 3, taking the picture node as a starting point, backtracking at most 3 levels along the node path, scoring the picture node by combining the id attributes and class attributes of the nodes, and filtering out pictures determined to be irrelevant according to the score. Meanwhile, the loose access path of the node is recorded, and other picture nodes with the same loose access path are filtered directly when they are traversed.
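Rules 1 and 2 might be sketched as follows. The advertisement word list, the size limits, the valid attribute range, and the `fetch_size` callable (standing in for reading the picture header over the network) are hypothetical; rule 3 is not covered.

```python
from urllib.parse import urlparse

# Assumed word list; the text only says "common advertisement words".
AD_WORDS = ("/ads/", "advert", "banner", "logo", "weibo", "facebook")


def looks_like_ad(url):
    """Rule 1: reject pictures whose host or path contains an ad-like word."""
    parsed = urlparse(url.lower())
    target = parsed.netloc + parsed.path
    return any(w in target for w in AD_WORDS)


def is_too_small_or_banner(width, height, min_side=100, max_aspect=4.0):
    """Rule 2: reject small pictures and very elongated (banner) pictures."""
    if min(width, height) < min_side:
        return True
    return max(width, height) / min(width, height) > max_aspect


def resolve_size(attrs, loose_path, size_cache, fetch_size):
    """Prefer width/height attributes, then the per-path cache, then the network."""
    try:
        w, h = int(attrs.get("width", "")), int(attrs.get("height", ""))
        if 0 < w <= 10000 and 0 < h <= 10000:   # assumed valid range
            return w, h
    except ValueError:
        pass  # attributes missing or not plain integers
    if loose_path not in size_cache:
        # fetch_size is a hypothetical callable that reads only the image header.
        size_cache[loose_path] = fetch_size(attrs["src"])
    return size_cache[loose_path]
```

Keying the cache on the loose access path reflects the idea that pictures in the same structural position usually share one layout size, so a single header read can serve the whole group.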

Step 3 of the above embodiment groups the nodes based on the node access paths, and performs scoring in units of groups to determine whether the node contents in the groups belong to the theme contents, thereby further improving the efficiency of extracting the theme contents of the web page.

The flow of the general webpage theme content extracting method is specifically described above with reference to fig. 1, and the structure of the general webpage theme content extracting system is described below with reference to fig. 2.

Fig. 2 is a schematic structural diagram of a general webpage theme content extraction system according to embodiment 2 of the present invention, as shown in fig. 2, including a DOM tree processing module, a caching module and an extraction module,

the cache module is used for traversing the DOM tree after attribute marking, and classifying and caching the rest nodes of the DOM tree into picture nodes, date nodes, text nodes or visible title nodes;

the extraction module is used for judging whether the content of the picture node, the content of the date node or the content of the text node is subject content according to the distances between the picture node, the date node and the text node and the visible title node respectively, and completing extraction of the subject content of the target webpage according to the judgment result, wherein the subject content comprises a text picture, release time and a text.

The embodiment identifies the text visual title node of the DOM tree and classifies and caches other nodes based on the strong association relation existing on the page structure, and then the distance between other category nodes and the text visual title node in the DOM tree is used as an important basis for judging whether the node belongs to the subject content, so that the precision and the efficiency of extracting the webpage information are improved.

In a preferred embodiment, the DOM tree processing module includes:

the title word generating unit is used for acquiring and caching the content of the title label nodes in the DOM tree, and meanwhile performing Chinese word segmentation and stop word removal on the content of the title label nodes to generate a title word set comprising a plurality of title words;

In another preferred embodiment, the cache module includes:

The cache unit comprises a picture node cache unit, a picture information block node cache unit, an author information block node cache unit, a text information block node cache unit, a sub-element integration and replacement unit, a sub-element cache unit and an information recording unit,

the picture node cache unit is used for judging whether the element tag of the element to be collected is an img tag, if so, collecting and caching the element to be collected as a picture node, and if not, driving the first judgment unit;

the picture information block node cache unit is used for judging whether the id attribute or the class attribute of the element to be collected contains an image, photo or gallery label; if not, the author information block node cache unit is driven; if so, the element to be collected is judged to be a determined picture information block node and the traversal of the DOM tree is globally marked as having entered a picture information collection block; when a child node of the element to be collected is traversed, whether the child node is a picture node is judged; if so, the child node is collected and cached as a picture node, and if not, the next element to be collected is judged;

the author information block node cache unit is used for judging whether the id attribute or the class attribute of the element to be collected contains an author tag, a writenby tag or a byline tag; if not, the text information block node cache unit is driven; if so, the element to be collected is judged to be a determined author information block node and the traversal of the DOM tree is globally marked as having entered an author information collection block; when a child node of the element to be collected is traversed, whether the child node is an author node is judged; if so, the child node is collected and cached as an author node, and if not, the next element to be collected is judged;

the text information block node cache unit is used for judging whether the id attribute or the class attribute of the element to be collected contains an article, post, main or content tag; if not, the sub-element integration and replacement unit is driven; if so, the element to be collected is judged to be a determined text information block node and the traversal of the DOM tree is globally marked as having entered a text information collection block; meanwhile, if no determined text information block has been collected so far in the global state and only non-determined text information blocks have been collected, the currently collected non-determined text information blocks are emptied;

the sub-element integration and replacement unit is used for judging whether the element to be collected has sub-elements or not, if so, judging whether the sub-elements can be integrated and replaced or not, if so, replacing the integrated contents of all the sub-elements with the contents of the element to be collected and driving the sub-element cache unit, and if not, directly driving the sub-element cache unit;

the child element caching unit is used for traversing all child nodes of the elements to be collected and judging the types of the child nodes one by one, if the child nodes are element nodes, the global node count is increased by one, the picture node caching unit is driven to conduct recursive depth traversal again, if the child nodes are text child nodes, the content of the text child nodes is identified, and the text child nodes are cached as visible title nodes, date nodes or possible text nodes according to the identification result;

and the information recording unit is used for recording the node counting sequence number, the text node counting sequence number and the node access path of the element to be collected in the DOM tree.
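The bookkeeping performed by the cache module (global node count, text node count, access path, and picture collection) can be illustrated with Python's standard `html.parser`. This is a simplified sketch: the class name, the tuple layouts, and the `VOID_TAGS` set are assumptions, the block-marking by id and class attributes is omitted, and a text node is given the sequence number of the most recently opened element.

```python
from html.parser import HTMLParser

VOID_TAGS = {"img", "br", "hr", "meta", "link", "input"}  # tags with no end tag


class Collector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []       # currently open tags, i.e. the access path
        self.node_count = 0   # global node counting sequence number
        self.text_count = 0   # text node counting sequence number
        self.pictures = []    # (seq, path, src)
        self.texts = []       # (seq, text_seq, path, text)

    def handle_starttag(self, tag, attrs):
        self.node_count += 1
        path = "/".join(self.stack + [tag])
        if tag == "img":
            self.pictures.append((self.node_count, path, dict(attrs).get("src")))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag (tolerates unclosed children).
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.text_count += 1
            self.texts.append(
                (self.node_count, self.text_count, "/".join(self.stack), text))


collector = Collector()
collector.feed("<html><body><h1>Title</h1><div><p>First.</p>"
               "<img src='a.jpg'><p>Second.</p></div></body></html>")
collector.close()
```

After parsing, `collector.pictures` and `collector.texts` hold the records that later steps would classify into visible title, date, and possible text nodes and compare by sequence number and access path.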

In another preferred embodiment, the extraction module comprises a text extraction module, a release time extraction module and a text picture extraction module. The text extraction module specifically comprises:

the first sequencing unit is used for sequencing all possible text nodes in an ascending order according to the node counting sequence number;

the first target node generation unit is used for finding the first target node among all possible text nodes, i.e., the first node whose node counting sequence number is greater than that of the visible title node and whose sentence number is greater than 0 or whose content words are related to the content words of the visible title node, and marking the first target node as a p1 node;

a circulation unit, configured to take the p1 node as a starting point, search forward and backward for a second target node whose node counting sequence number differs from that of the p1 node by less than 3 and whose access path is similar to that of the p1 node, and replace p1 with the second target node, until no new second target node can be found;

the grouping unit is used for cleaning all possible text nodes before the p1 node, grouping all the remaining possible text nodes according to the node access path, sequencing the interior of each group in an ascending order according to the node counting sequence number, and sequencing the groups in an ascending order according to the node counting sequence number of the first node of each group;

the scoring unit is used for calculating a preset parameter value of each group, importing the preset parameter value into a pre-trained prediction model for scoring, and generating a target group with a score larger than a preset score;

and the first extraction unit is used for sequencing the nodes in all the target groups in an ascending order according to the node counting sequence numbers, forming a text node set and caching the text node set.

In a preferred embodiment, the release time extracting module specifically includes:

a cleaning unit, configured to clean an invalid node in all date nodes, where the invalid node is a node with a node count sequence number after a first target node;

and the second extraction unit is used for acquiring a target date node which is closest to the visible title node in the cleaned residual date nodes, wherein the node counting sequence number difference of the target date node is lower than a first preset value, and the text node counting sequence number difference is lower than a second preset value.

Preferably, the text image extracting module specifically includes:

the second sorting unit is used for sorting the cached picture nodes in an ascending order according to the node counting sequence number;

the second target node generation unit is used for acquiring a target picture node, and cleaning the target picture node and other picture nodes behind the target picture node, wherein the target picture node is the picture node which is closest to the last first target node and the node counting sequence number difference value is larger than a third preset value;

the interpolation picture node generating unit is used for acquiring picture nodes with node counting serial numbers between the text nodes and the visual title nodes, recording the picture nodes as interpolation picture nodes, recording the picture nodes which are positioned in front of the visual title nodes and have a node distance from the visual title nodes lower than a fourth preset value as interpolation picture nodes, merging the interpolation picture nodes into an interpolation picture node set, and caching non-interpolation picture nodes;

the third sequencing unit is used for acquiring the distance between the node counting sequence number of each interpolation picture node and that of the visible title node, and sequencing all the interpolation picture nodes in an ascending order according to the distance;

the pre-screening unit is used for pre-screening all interpolation picture nodes according to a preset screening rule and filtering out invalid pictures irrelevant to the text;

and the third extraction unit is used for acquiring node access paths of the residual interpolation picture nodes after the pre-screening, finding out nodes with the same node access paths in the interpolation picture node set of the interpolation picture node generation unit, and then repeatedly driving the third sorting unit and the pre-screening unit to integrate the interpolation picture nodes and the non-interpolation picture nodes which are screened again.

The reader should understand that in the description of this specification, reference to the description of the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed.

Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention essentially or partially contributes to the prior art, or all or part of the technical solution can be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

While the invention has been described with reference to specific embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A general webpage subject content extraction method is characterized by comprising the following steps:

step 2, traversing the DOM tree after attribute marking, and classifying and caching the rest nodes of the DOM tree into picture nodes, date nodes, text nodes or visual title nodes;

step 3, judging whether the content of the picture node, the content of the date node and the content of the text node are subject content according to the distances between the picture node, the date node and the text node and the visible title node respectively, and finishing the extraction of the subject content of the target webpage according to the judgment result, wherein the subject content comprises a text picture, release time and a text;

the method for extracting the text according to the cached text nodes specifically comprises the following steps of:

finding the first target node among all possible text nodes, i.e., the first node whose node counting sequence number is greater than that of the visible title node and whose sentence number is greater than 0 or whose content words are related to the content words of the visible title node, and marking the first target node as a p1 node;

taking the p1 node as a starting point, searching forward and backward for a second target node whose node counting sequence number differs from that of the p1 node by less than 3 and whose access path is similar to that of the p1 node, replacing p1 with the second target node, and repeating this step until no new second target node can be found;

caching the text node set.

2. The method for extracting the subject content of the universal webpage according to claim 1, wherein the step 1 specifically comprises the following steps:

3. The method for extracting the general webpage subject matter content according to the claim 2, wherein the step 2 specifically comprises the following steps:

4. The method for extracting webpage theme content in general according to claim 3, wherein in the step S202, the step of collecting information of the elements to be collected and classifying and caching the elements to be collected specifically includes the following steps:

step a, judging whether an element label of the element to be collected is an img label, if so, collecting and caching the element to be collected as a picture node, and if not, executing step b;

step b, judging whether the id attribute or the class attribute of the element to be collected contains an image, photo or gallery tag, if not, executing step c, if so, judging that the element to be collected is a determined picture information block node, globally marking the traversal of the DOM tree as having entered a picture information collection block, judging whether a child node of the element to be collected is a picture node when traversing the child node, if so, collecting and caching the child node as a picture node, and if not, continuing to judge the next element to be collected;

step c, judging whether the id attribute or the class attribute of the element to be collected contains an author tag, a writenby tag or a byline tag, if not, executing step d, if so, judging that the element to be collected is a determined author information block node, traversing a global markup DOM tree to enter an author information collection block, judging whether a child node of the element to be collected is the author node or not when the child node is traversed, if so, collecting and caching the child node as the author node, and if not, continuing to judge the next element to be collected;

step f, traversing all child nodes of the elements to be collected and processing one by one, wherein the processing method comprises the following steps: judging the type of the child node, if the child node is an element node, adding one to the global node count, returning to the step a to perform recursive deep traversal again, if the child node is a text child node, identifying the content of the text child node, and caching the text child node as a visible title node, a date node or a possible text node according to an identification result;

5. The method for extracting general webpage subject matter according to claim 4, wherein the step 3 of extracting the release time according to the cached date node specifically comprises the following steps:

6. The method for extracting webpage subject matter according to claim 5, wherein the step 3 of extracting the text picture according to the cached picture node specifically comprises the following steps:

7. A general webpage theme content extraction system is characterized by comprising a DOM tree processing module, a cache module and an extraction module,

the extraction module is used for judging whether the content of the picture node, the content of the date node or the content of the text node is subject content according to the distances between the picture node, the date node and the text node and the visual title node respectively, and finishing extraction of the subject content of the target webpage according to the judgment result, wherein the subject content comprises a text picture, release time and a text;

the extraction module comprises a text extraction module, and the text extraction module specifically comprises:

a circulation unit, configured to take the p1 node as a starting point, search forward and backward for a second target node whose node counting sequence number differs from that of the p1 node by less than 3 and whose access path is similar to that of the p1 node, and replace p1 with the second target node, until no new second target node can be found;

8. The universal web page subject matter extraction system according to claim 7, wherein said DOM tree processing module comprises:

9. The system for universal web page theme content extraction as recited in claim 8, wherein the caching module comprises: