CN102156737A

CN102156737A - Method for extracting subject content of Chinese webpage

Info

Publication number: CN102156737A
Application number: CN 201110090737
Authority: CN
Inventors: 刘清堂; 邵明博; 向丹丹; 吴林静
Original assignee: Huazhong Normal University
Current assignee: Huazhong Normal University
Priority date: 2011-04-12
Filing date: 2011-04-12
Publication date: 2011-08-17
Anticipated expiration: 2031-04-12
Also published as: CN102156737B

Abstract

The invention belongs to the field of computer applications and information extraction and provides a method for extracting the subject content of a Chinese web page. The method converts web page data into DOM (Document Object Model) objects, then fuses, sorts, and filters the DOM objects, and finally returns the extracted web page content. The method is convenient to operate and widely applicable; it depends neither on specific software or hardware nor on a specific web page template. According to a large number of experimental results, it effectively eliminates "noise" information from Chinese news web pages of different styles and extracts their subject content, and therefore has relatively high practical applicability.

Description

A method for extracting the subject content of a Chinese web page

Technical field

The invention belongs to the field of computer applications and information extraction, and in particular relates to a method for extracting the subject content of a Chinese web page.

Background technology

With the continuing maturation and development of Internet technology and its environment, the Internet has become an indispensable way for people to obtain information resources. As the volume of information on the Internet grows explosively, the problem of being "data rich but knowledge poor" becomes more and more prominent: when browsing web pages, we find that not all of the information presented on screen is relevant to the subject. A page usually contains a large amount of advertising, navigation, copyright information, and various interactive interfaces (such as questionnaires). This subject-irrelevant information not only burdens users as they browse, but also makes application systems based on web page subject content harder to implement and develop.

Therefore, extracting the subject content of a web page quickly and accurately is a key technology for Web content application services. It not only improves the accuracy of application systems that provide content-based services but also greatly improves their efficiency, and it directly lightens the user's browsing burden. Experts in the field of information extraction have long tried to use computers to solve the problems caused by this subject-irrelevant information.

Web page content extraction is usually based on one of two approaches: templates or block segmentation. Template-based methods generally need to compare, top-down, at least two DOM (Document Object Model) trees generated from the same template, find and remove the subtrees they share, and treat the remainder as the subject content. Experiments show that this approach is feasible and effective, but its limitation is that a page template obtained by machine learning cannot necessarily be reused on other collections of web pages. In addition, the computational cost of machine learning is considerable. Because people access the network in unpredictable ways, such methods cannot extract the subject content of web pages effectively in real time.

Block segmentation methods have more variants. The most representative are segmentation based on the pure DOM tree, segmentation based on visual information (Vision-based Page Segmentation, VIPS), and segmentation based on specific tags. Because the DOM was originally introduced for layout and display in the browser rather than for semantic description of Web pages, segmentation methods based on the pure DOM, relying only on the tag hierarchy it provides, are not fully adequate for content extraction unless supplementary information is introduced. Segmentation based on visual information uses visual cues of the Web page, such as background color, font color, font size, and bold type, together with the hierarchy provided by the DOM, to segment the page; it was applied in the TREC 2003 evaluation and obtained good results. However, because visual features are complex, it is difficult to arrive at a general rule set. In addition, the VIPS algorithm needs to store a large amount of visual information, and its processing performance declines sharply as page complexity increases. Because a few layouts were prevalent in the early Internet, some approaches split a web page into several content blocks according to <table> tags. This segmentation process is very simple, but on increasingly complex pages the results are often unsatisfactory.

In summary, existing methods either use a process that is too simple and can extract content only from web pages built with specific tags, or have too high an algorithmic complexity (template-based machine learning or complex visual computation). Either way, they cannot process people's unpredictable page accesses in real time.

Summary of the invention

The present invention addresses the shortcomings described in the background above and proposes a method for extracting the subject content of a Chinese web page. The method does not rely on information outside the single web document; using only the internal feature information of each atomic (indivisible) node, combined with the linguistic characteristics of Chinese web pages, it effectively extracts the subject content.

The object of the invention is achieved by the following technical measures.

A method for extracting the subject content of a Chinese web page. The hardware components used by the method comprise a DOM generation component, a DOM processing component, a node fusion component, a node feature analysis component, a node element filter, and a filter interim result analysis component. The method comprises the following steps:

(1) the DOM generation component uses a copy of the web page data stream to generate a DOM object;

(2) the DOM processing component processes the DOM object obtained in step (1) according to the page type, using page type information, calculates the feature information of each node, and saves the result; the feature information comprises the text density δ(b) and the link density θ(b) of the current node;

(3) for the result saved in step (2), the node fusion component calculates the similarity of adjacent nodes from their feature information; if the similarity condition is true, it merges the identical fields of the adjacent nodes, keeps the earlier node, and discards the later node (hereinafter called the fusion operation);

(4) the node feature analysis component takes the node set fused in step (3) and, based on the feature information of every three adjacent nodes, divides the nodes into two classes: "content nodes" and "noise nodes";

(5) the node filter applies layered filtering to the "noise nodes" left by step (4) and to some "content nodes" that have special tags; each filtering result is saved by the filter interim result analysis component; the optimal node set obtained by analysis is the extracted subject content.

In the above technical solution, the method can, depending on the client's needs, use a media detection and compression component to return media information contained in the web page, such as pictures and video. Using the node set provided by step (5), the media detection and compression component detects whether the web page contains media information, locates the media information related to the document, compresses it, and caches it locally.

In the above technical solution, the DOM processing component described in step (2) comprises a page type guessing module, a document preprocessing module, and a node calculation module. Its specific working steps are as follows:

(3-1) save a copy of the obtained web page data stream for fault-tolerant processing;

(3-2) extract title information from the <title> node and <H1> node of the DOM object;

(3-3) call the document preprocessing module to filter out the comments contained in the current DOM object, as well as interactive nodes such as scripts, styles, and Flash;

(3-4) call the page type guessing module to guess the type of the target page; if it is a content-type page, perform the following steps in order; if it is a directory-type page, go directly to step (3-7);

(3-5) call the node calculation module to traverse the remaining nodes in the DOM object, ignoring interactive nodes such as <applet> and <button> and decorative nodes such as <b> and <u>; calculate the text density δ(b) and the link density θ(b) of every other remaining node, and save these results together with each node's text information, DOM operation interface, and so on; the formulas are as follows:

(formula 1)

(formula 2)

L(b) denotes the number of text lines of the current node; T(b) denotes the text length of the current node; maxLen denotes the maximum number of characters that one screen line can contain; T'(b) denotes, for a node with more than one line, the text length excluding the last line; Ta(b) denotes the total character length of all <a> nodes in the current node and its descendants;

(3-6) save the result of (3-5) for use by subsequent components;

(3-7) if the guessed page type is the directory type, regenerate the DOM object from the copy of the web page data stream saved in (3-1), traverse the <a> nodes in the object again, and return the directory content.
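Steps (3-2) and (3-3) can be illustrated with a minimal Python sketch. It is not the patented implementation: it uses the standard-library `html.parser` module in place of a full DOM, and the class name `Preprocessor`, the helper `preprocess`, and the `SKIPPED` tag set are assumptions made for illustration only.

```python
from html.parser import HTMLParser

# Hypothetical set of nodes to drop during preprocessing (scripts, styles,
# embedded objects such as Flash); the source does not give an exact list.
SKIPPED = {"script", "style", "object"}


class Preprocessor(HTMLParser):
    """Collects <title>/<h1> text and the remaining visible text."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0          # > 0 while inside a skipped node
        self.capture = None          # "title" or "h1" while inside one
        self.titles = {"title": "", "h1": ""}
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.skip_depth += 1
        elif tag in self.titles:
            self.capture = tag

    def handle_endtag(self, tag):
        if tag in SKIPPED and self.skip_depth:
            self.skip_depth -= 1
        elif tag == self.capture:
            self.capture = None

    def handle_data(self, data):
        text = data.strip()
        if self.skip_depth or not text:
            return                   # inside script/style/object, or blank
        if self.capture:
            self.titles[self.capture] += text
        else:
            self.text_parts.append(text)

    # HTML comments are dropped: the inherited handle_comment does nothing.


def preprocess(html):
    """Return (titles, text_parts) for one page, under the assumptions above."""
    parser = Preprocessor()
    parser.feed(html)
    parser.close()
    return parser.titles, parser.text_parts
```

A real implementation would keep the surviving nodes in a tree so that densities can be computed per node; the flat list here only shows which material survives preprocessing.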

In the above technical solution, the node fusion component described in step (3) comprises an atomic node similarity calculation module and a node fusion module. Its specific working steps are as follows: the atomic node similarity calculation module traverses each node in the result saved in step (2), evaluates formula 3 on the text density δ(b) and link density θ(b) of every two adjacent nodes, and judges whether the two are similar; if the result reaches the empirical threshold ε = 0.1, the node fusion module performs the fusion operation, so that in the end every two adjacent nodes are sufficiently distinct. Formula 3 applies a separate weight to each of the two kinds of values.

(formula 3)
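Because formula 3 is not reproduced in this text, the fusion operation can only be sketched under assumptions. The Python sketch below assumes that similarity is measured as a weighted absolute difference of δ(b) and θ(b), that two nodes count as similar when this difference is at most ε = 0.1, and that both weights equal 0.5. The names `Node`, `dissimilarity`, and `fuse` are hypothetical, and the sketch keeps the earlier node's densities after a merge because the text does not say whether they are recomputed.

```python
from dataclasses import dataclass

EPSILON = 0.1  # empirical threshold ε given in the text


@dataclass
class Node:
    text: str
    text_density: float   # δ(b)
    link_density: float   # θ(b)


def dissimilarity(a, b, w_text=0.5, w_link=0.5):
    """Assumed stand-in for formula 3: weighted distance of the two densities."""
    return (w_text * abs(a.text_density - b.text_density)
            + w_link * abs(a.link_density - b.link_density))


def fuse(nodes, eps=EPSILON):
    """Merge each node into its predecessor when the two are similar.

    The earlier node is kept and the later node is discarded, so that the
    remaining adjacent nodes are clearly distinct from one another.
    """
    fused = []
    for node in nodes:
        if fused and dissimilarity(fused[-1], node) <= eps:
            fused[-1].text += node.text      # keep earlier node, absorb later
        else:
            fused.append(Node(node.text, node.text_density, node.link_density))
    return fused
```

With these assumptions, a run of paragraphs with nearly equal densities collapses into one node, while a navigation block with a very different link density stays separate.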

In the above technical solution, the node feature analysis component described in step (4) loops over each node in the sufficiently distinct node set produced by step (3), and makes the following judgments on the node itself and its preceding and following neighbors:

(5-1) test condition (a); if condition (a) is false, classify the current node as a noise node;

(5-2) if condition (a) is true, test condition (b); if condition (b) is false, test condition (c); if condition (c) is false, classify the current node as a content node;

(5-3) if condition (c) is true, test condition (d); if condition (d) is true, classify the current node as a noise node; otherwise classify it as a content node;

(5-4) if condition (b) is true, test condition (e); if condition (e) is false, classify the current node as a content node; if condition (e) is true, test condition (f); if condition (f) is false, classify the current node as a content node; otherwise test condition (g); if condition (g) is true, classify the current node as a noise node; otherwise classify it as a content node;

where condition (a): the link density of the current node is less than the empirical value 0.353333;

condition (b): the link density of the previous node is less than the empirical value 0.555556;

condition (c): the text density of the current node is less than the empirical value 0.555556;

condition (d): the text density of the next node is less than the empirical value 0.353333;

condition (e): the text density of the current node is less than the empirical value 0.488889;

condition (f): the text density of the next node is less than or equal to the empirical value 0.555556;

condition (g): the text density of the previous node is less than or equal to the empirical value 0.353333.
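The decision procedure in steps (5-1) to (5-4) maps directly onto nested conditionals. The following Python sketch encodes conditions (a) to (g) with the empirical values listed above. The `Node` type, the function names, and the treatment of the first and last nodes (padding with a zero-density neighbor) are assumptions, since the text does not say how boundary nodes are handled.

```python
from dataclasses import dataclass


@dataclass
class Node:
    text_density: float   # δ(b)
    link_density: float   # θ(b)


def classify(prev, curr, nxt):
    """Return "content" or "noise" for curr, following steps (5-1) to (5-4)."""
    if not curr.link_density < 0.353333:          # (a) false -> noise
        return "noise"
    if prev.link_density < 0.555556:              # (b) true -> step (5-4)
        if not curr.text_density < 0.488889:      # (e) false
            return "content"
        if not nxt.text_density <= 0.555556:      # (f) false
            return "content"
        # (g): previous node's text density decides
        return "noise" if prev.text_density <= 0.353333 else "content"
    # (b) false -> steps (5-2) and (5-3)
    if not curr.text_density < 0.555556:          # (c) false
        return "content"
    return "noise" if nxt.text_density < 0.353333 else "content"   # (d)


def classify_all(nodes):
    """Classify every node, padding both ends with an assumed empty neighbor."""
    empty = Node(0.0, 0.0)                        # hypothetical boundary node
    padded = [empty] + list(nodes) + [empty]
    return [classify(padded[i - 1], padded[i], padded[i + 1])
            for i in range(1, len(padded) - 1)]
```

Each node is judged only from its own densities and those of its two neighbors, which is why step (4) speaks of every three adjacent nodes.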

In the above technical solution, the specific working steps of the node filter and the filter interim result analysis component are as follows:

(6-1) use node filter A to filter out blank and invalid noise nodes;

(6-2) use node filter B to filter the <Span> and <TD> nodes among the content nodes: first judge whether the character length of the current node is greater than the empirical value 4; if false, filter the node out, otherwise keep it; then judge whether the current node contains punctuation marks that carry semantic meaning; if false, filter it out, otherwise keep it; the results are saved by the data statistics module of the filter interim result analysis component;

(6-3) use node filter C to filter out the non-standard <P> node information contained in the web page data, and store the result in the filter interim result analysis component; because the W3C standard recommends that a <P> node not contain other container nodes, the criterion of this rule is that the <P> node is a single-layer node;

(6-4) use node filter C to filter out the non-standard <TD> node information contained in the web page data, and store the result in the filter interim result analysis component; because an unclosed <TD> node often mistakenly contains other container nodes, the criterion of this rule is that the <TD> node is a single-layer node;

(6-5) use node filter C to filter out the non-standard <DIV> node information contained in the web page data, and store the result in the filter interim result analysis component; because an unclosed <DIV> node often mistakenly contains other container nodes, the criterion of this rule is that the <DIV> node is a single-layer node;

(6-6) the filter interim result analysis component sorts the result objects produced by the above operations in descending order, first by the separator count of each object and, where that field is equal, by character length; it then traverses the ordered result set and finds the first result that satisfies the following condition: the separator count is greater than or equal to the empirical value 2 and the text density is greater than the empirical value 0.28;

(6-7) if the result is empty, regenerate the DOM object from the copy of the web page data stream saved in (3-1), use the DOM processing component to traverse each node in the object, filter only the <P>, <TD>, <PRE>, and <DIV> nodes, store the results, and take this set as the extracted web page subject content; then check the text density of this web page content; if it is not 0, carry out the next operation;

(6-8) regenerate the DOM object from the copy of the web page data stream saved in (3-1), use the DOM processing component to traverse each node in the object, filter and store only the <a> nodes, and return the directory content.
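Two of the steps above, the filter B test in (6-2) and the ranking in (6-6), are concrete enough to sketch in Python. The sketch rests on assumptions: the `PUNCTUATION` set is a hypothetical choice of marks, the separator count is assumed to be the number of such marks in a result's text, and the names `Result`, `keep_span_or_td`, and `pick_best` do not come from the source.

```python
from dataclasses import dataclass

# Hypothetical set of semantically meaningful punctuation marks; the source
# does not enumerate them.
PUNCTUATION = "，。！？；：、,.!?;:"


@dataclass
class Result:
    text: str
    text_density: float   # δ(b) of the candidate result

    @property
    def separator_count(self):
        # Assumed meaning of the separator statistic: punctuation occurrences.
        return sum(self.text.count(ch) for ch in PUNCTUATION)


def keep_span_or_td(text):
    """Filter B, step (6-2): keep only text longer than 4 characters
    that also contains at least one punctuation mark."""
    return len(text) > 4 and any(ch in PUNCTUATION for ch in text)


def pick_best(results):
    """Step (6-6): sort descending by separator count, then by length, and
    return the first result with count >= 2 and text density > 0.28."""
    ordered = sorted(results,
                     key=lambda r: (r.separator_count, len(r.text)),
                     reverse=True)
    for r in ordered:
        if r.separator_count >= 2 and r.text_density > 0.28:
            return r
    return None   # empty result: the fallback in step (6-7) would apply
```

Returning `None` corresponds to the empty result that triggers the fallback path described in (6-7).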

Compared with the prior art, the present invention has the following advantages: it is easy to operate and widely applicable, and depends neither on specific software or hardware nor on a particular Web page template; a large number of experiments show that, for Chinese news web pages of different styles, the method effectively removes "noise" information from the page and extracts the subject content, and is highly practical.

Description of drawings

Fig. 1 is a schematic diagram of the method for extracting the subject content of a Chinese web page according to an embodiment of the invention.

Fig. 2 is a program flow chart of the method for extracting the subject content of a Chinese web page according to an embodiment of the invention.

Embodiment

The invention is further described below with reference to the accompanying drawings and embodiments.

Fig. 1 is a schematic diagram of the method for extracting the subject content of a Chinese web page according to an embodiment of the invention. The system first formats the URL requested by the user appropriately, obtains the network data from the remote server, and builds an operable DOM object.

The DOM object is an operable data structure holding the raw network data. The DOM processing component (DOMHandler) is used to create a new object that handles this model; based on the enumerated type supplied by the page type guessing module, the DOM processing component selects a different strategy for processing and conversion.

Fig. 2 is a program flow chart of the method of this embodiment. When the input is judged to be a content page, the DOM processing component (DOMHandler) first converts the DOM model into a WebDocument, a custom data structure describing web page features that contains each node's text density, link density, text information, DOM operation interface, and so on; at this point the data itself has not been refined in any way. The node fusion component, node feature analysis component, and node element filter then scan and filter the WebDocument along multiple dimensions. After each scan and filter pass, the result is saved in the filter interim result analysis component, which ranks and analyzes these interim results to obtain the optimal result set as the content extracted from the page.

When the result set is not empty, the content nodes have been found successfully. Depending on the program's configuration, the positions of related pictures or video can be located by working back from the content nodes found. If the access URLs of related media are found, the media are compressed and cached locally. The media content thus obtained is then used to assemble a new page entity.

When the result set is empty, the content nodes have not been found. This usually happens because the original HTML code does not follow the W3C standard, or because the page has no content nodes at all. To provide fault tolerance, the program offers another conversion: the DOM object is converted into a SimpleWebDocument (another custom data structure describing web page features). A dedicated filter performs filtering similar to the above on this structure and directly returns the web page content.

At this point the text density of the current web page content must be checked. When the text density is 0, it is necessary to detect whether the current web page should display media information: if true, the media information is returned; if false, an error message is returned. When the text density is not 0, check whether the density falls within the safe range: if true, the web page content is returned; if false, the DOM model must be converted into an IndexDoc (a custom data structure that describes the features of directory pages).
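The fallback decision just described can be summarized as a small Python function. This is only a sketch: the bounds of the safe range are not given in the text, so the `safe_range` default below is a hypothetical placeholder, and the function name and return labels are likewise assumptions.

```python
def fallback_decision(text_density, has_media, safe_range=(0.28, 1.0)):
    """Decide what to return after the SimpleWebDocument fallback.

    text_density: text density of the content produced by the fallback
    has_media:    whether the page should display media information
    safe_range:   hypothetical (low, high) bounds; not specified in the text
    """
    if text_density == 0:
        # No text at all: return media if there is any, otherwise an error.
        return "media" if has_media else "error"
    low, high = safe_range
    if low <= text_density <= high:
        return "content"          # density is plausible for a content page
    return "index"                # treat the page as a directory (IndexDoc)
```

For example, a page whose fallback content has zero text density but does carry video would yield `"media"`, while a link-heavy page whose density falls outside the assumed range would be re-processed as a directory page.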

The specific steps of this embodiment are as follows.

A method for extracting the subject content of a Chinese web page. The hardware components used by the method comprise a DOM generation component, a DOM processing component, a node fusion component, a node feature analysis component, a node element filter, a filter interim result analysis component, and a media detection and compression component. The method is characterized in that it comprises the following steps:

(3) for the result saved in step (2), the node fusion component examines the feature information of adjacent nodes; if the similarity condition is true, it merges the identical fields of the adjacent nodes, keeps the earlier node, and discards the later node (hereinafter called the fusion operation);

In the above embodiment, the method can, depending on the client's needs, use the media detection and compression component to return media information contained in the web page, such as pictures and video. Using the node set provided by step (5), the media detection and compression component detects whether the web page contains media information, locates the media information related to the document, compresses it, and caches it locally.

In the above embodiment, the DOM processing component comprises a page type guessing module, a document preprocessing module, and a node calculation module. Its specific working steps are as follows:

(3-2) extract title information from the <title> node and <H1> node of the DOM object;

(formula 1)

(formula 2)

(3-6) save the result of (3-5) for use by subsequent components;

In the above embodiment, the node fusion component comprises an atomic node similarity calculation module and a node fusion module. Its specific working steps are as follows: each node in the result saved in step (2) is traversed, formula 3 is evaluated on the text density δ(b) and link density θ(b) of every two adjacent nodes, and the two are judged to be similar or not; if the result reaches the empirical threshold ε = 0.1, the fusion operation is performed, so that in the end every two adjacent nodes are sufficiently distinct. Formula 3 applies a separate weight to each of the two kinds of values.

(formula 3)

In the above embodiment, the node feature analysis component described in step (4) loops over each node in the sufficiently distinct node set produced by step (3), and makes the following judgments on the node itself and its preceding and following neighbors:

In the above embodiment, the specific working steps of the node filter and the filter interim result analysis component are as follows:

(6-1) use node filter A to filter out blank and invalid noise nodes;

(6-3) filter out the non-standard <P> node information contained in the web page data, and store the result in the filter interim result analysis component; because the W3C standard recommends that a <P> node not contain other container nodes, the criterion of this rule is that the <P> node is a single-layer node;

(6-4) filter out the non-standard <TD> node information contained in the web page data, and store the result in the filter interim result analysis component; because an unclosed <TD> node often mistakenly contains other container nodes, the criterion of this rule is that the <TD> node is a single-layer node;

(6-5) filter out the non-standard <DIV> node information contained in the web page data, and store the result in the filter interim result analysis component; because an unclosed <DIV> node often mistakenly contains other container nodes, the criterion of this rule is that the <DIV> node is a single-layer node;

(6-7) if the result is empty, regenerate the DOM object from the copy of the web page data stream saved in (3-1), use the DOM processing component to traverse each node in the object, filter only the <P>, <TD>, <PRE>, and <DIV> nodes, store the results, and directly return the web page content; then check the text density of this web page content; if it is 0, return media information according to the client's needs; if it is not 0, carry out the next operation;

Claims

1. A method for extracting the subject content of a Chinese web page, wherein the hardware components used by the method comprise a DOM generation component, a DOM processing component, a node fusion component, a node feature analysis component, a node element filter, and a filter interim result analysis component, characterized in that the method comprises the following steps:

(3) for the result saved in step (2), the node fusion component calculates the similarity of adjacent nodes from their feature information; if the similarity condition is true, it merges the identical fields of the adjacent nodes, keeps the earlier node, and discards the later node;

(5) the node filter applies layered filtering to the "noise nodes" left by step (4) and to the "content nodes" that have special tags; each filtering result is saved by the filter interim result analysis component; the optimal node set obtained by analysis is the extracted subject content.

2. The method for extracting the subject content of a Chinese web page according to claim 1, characterized in that: the method uses a media detection and compression component to return the picture and video media information contained in the web page; using the node set provided by step (5), the media detection and compression component detects whether the web page contains media information, locates the media information related to the document, compresses it, and caches it locally.

3. The method for extracting the subject content of a Chinese web page according to claim 1, characterized in that the DOM processing component described in step (2) comprises a page type guessing module, a document preprocessing module, and a node calculation module, and its specific working steps are as follows:

(3-2) extract title information from the <title> node and <H1> node of the DOM object;

(3-3) call the document preprocessing module to filter out the comments contained in the current DOM object, as well as script, style, and Flash interactive nodes;

(3-5) call the node calculation module to traverse the remaining nodes in the DOM object, ignoring the <applet> and <button> interactive nodes and the <b> and <u> decorative nodes; calculate the text density δ(b) and the link density θ(b) of every other remaining node, and save these results together with each node's text information and DOM operation interface; the formulas are as follows:

(formula 1)

(formula 2)

(3-6) save the result of (3-5) for use by subsequent components;

4. The method for extracting the subject content of a Chinese web page according to claim 1, characterized in that: the node fusion component described in step (3) comprises an atomic node similarity calculation module and a node fusion module, and its specific working steps are as follows: the atomic node similarity calculation module traverses each node in the result saved in step (2), evaluates formula 3 on the text density δ(b) and link density θ(b) of every two adjacent nodes, and judges whether the two are similar; if the result reaches the empirical threshold ε = 0.1, the node fusion module performs the fusion operation, so that in the end every two adjacent nodes are sufficiently distinct. Formula 3 applies a separate weight to each of the two kinds of values.

(formula 3)

5. The method for extracting the subject content of a Chinese web page according to claim 1, characterized in that: the node feature analysis component described in step (4) loops over each node in the sufficiently distinct node set produced by step (3), and makes the following judgments on the node itself and its preceding and following neighbors:

6. The method for extracting the subject content of a Chinese web page according to claim 1, characterized in that: the specific working steps of the node filter and the filter interim result analysis component described in step (5) are as follows:

(6-1) use node filter A to filter out blank and invalid noise nodes;

(6-2) use node filter B to filter the <Span> and <TD> nodes among the content nodes: first judge whether the character length of the current node is greater than the empirical value 4; if false, filter the node out, otherwise keep it; then judge whether the current node contains punctuation marks with a semantic segmentation function; if false, filter it out, otherwise keep it; the results are saved by the data statistics module of the filter interim result analysis component;

(6-3) use node filter C to filter out the non-standard <P> node information contained in the web page data, and store the result in the filter interim result analysis component; because the W3C standard recommends that a <P> node not contain other container nodes, the criterion of this rule is that the <P> node is a single-layer node;

(6-4) use node filter C to filter out the non-standard <TD> node information contained in the web page data, and store the result in the filter interim result analysis component; because an unclosed <TD> node often mistakenly contains other container nodes, the criterion of this rule is that the <TD> node is a single-layer node;

(6-5) use node filter C to filter out the non-standard <DIV> node information contained in the web page data, and store the result in the filter interim result analysis component; because an unclosed <DIV> node often mistakenly contains other container nodes, the criterion of this rule is that the <DIV> node is a single-layer node;