CN102156737A - Method for extracting subject content of Chinese webpage - Google Patents

Info

Publication number
CN102156737A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN 201110090737
Other languages
Chinese (zh)
Other versions
CN102156737B (en)
Inventor
刘清堂
邵明博
向丹丹
吴林静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong Normal University
Original Assignee
Huazhong Normal University

Landscapes

  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the field of computer applications and information extraction and provides a method for extracting the subject content of a Chinese webpage. The method converts webpage data into DOM (Document Object Model) objects, then fuses, sorts, and filters the DOM nodes, and finally returns the extracted webpage content. The method is convenient to operate and widely applicable, and it depends neither on specific software or hardware nor on a specific webpage template. According to a large number of experimental results, it can effectively eliminate "noise" information from Chinese news webpages of different styles and extract the subject content, so it has relatively high practical applicability.

Description

A method for extracting the subject content of a Chinese web page
Technical field
The invention belongs to the field of computer applications and information extraction, and in particular relates to a method for extracting the subject content of a Chinese web page.
Background art
With the continuing maturation and development of Internet technology and its environment, the Internet has become an indispensable way for people to obtain information resources. As the Internet produces information explosively, the problem of being "rich in data but poor in knowledge" grows ever more prominent: when browsing a web page, we find that not all the information shown on screen is relevant to the subject. A page usually contains a large amount of advertising, navigation, copyright information, and various interactive interfaces (for example, questionnaires). This subject-irrelevant information not only burdens users as they browse, but also makes application systems built on web page subject content harder to implement and develop.
Therefore, extracting the subject content of a web page quickly and accurately is a key technology for web-content-based application services. It improves the accuracy of content-based application systems, greatly increases their efficiency, and directly reduces the burden on users browsing information. Experts in information extraction have long tried to use computers to eliminate the trouble caused by subject-irrelevant information.
Web page content extraction is usually based on one of two approaches: templates or block segmentation. Template-based methods generally compare, top-down, at least two DOM (Document Object Model) trees derived from the same template, find and remove the subtrees they share, and treat the remainder as the subject content. Experiments have shown this approach to be feasible and effective, but its limitation is that a page template obtained by machine learning cannot necessarily be reused on other collections of web pages. The computational cost of machine learning is also considerable. Because people access the web unpredictably, such methods cannot extract subject content effectively in real time. Block-segmentation methods have more branches; the most representative are segmentation based on the pure DOM tree, segmentation based on visual information (Vision-based Page Segmentation, VIPS), and segmentation based on specific tags. Because the DOM was originally introduced for layout and display in the browser rather than for semantic description of web pages, segmentation based on the pure DOM, which relies only on the tag hierarchy, is not fully adequate for content extraction unless supplementary information is introduced. Segmentation based on visual information uses visual cues of the page, such as background color, font color, font size, and bold type, together with the hierarchy provided by the DOM, to segment the page; it was applied in the TREC 2003 evaluation with good results. However, the complexity of visual features makes it hard to establish a general rule set. In addition, the VIPS algorithm must store a large amount of visual information, and its performance drops sharply as page complexity increases. Because a few fixed layouts were popular on the early Internet, some have also split web pages into content blocks according to the <table> tag. Such a segmentation procedure is very simple, but its results on increasingly complex pages are often unsatisfactory.
In summary, existing methods are either too simple, able to extract content only from pages that follow a specific tag style, or too complex (machine learning over templates, or complex visual computation). In both cases they cannot process people's unpredictable page accesses in real time.
Summary of the invention
The present invention addresses the shortcomings described in the background above and proposes a method for extracting the subject content of a Chinese web page. The method does not rely on information outside a single web document; it uses only the internal feature information of each atomic (indivisible) node, combined with the linguistic characteristics of Chinese web pages, to extract the subject content effectively.
The object of the invention is achieved by the following technical measures.
A method for extracting the subject content of a Chinese web page. The hardware components used by the method comprise a DOM generation component, a DOM processing component, a node fusion component, a node feature analysis component, a node filter, and a filter interim-result analysis component. The method comprises the following steps:
(1) The DOM generation component uses a copy of the web data stream to generate a DOM object;
(2) The DOM processing component processes the DOM object obtained in step (1) according to the page type, computes the feature information of each node, and saves the result; the feature information comprises the text density δ(b) and the link density θ(b) of the current node;
(3) For the result saved in step (2), the node fusion component computes a similarity from the feature information of neighboring nodes; if the similarity condition holds, it merges the identical fields of the neighboring nodes, keeps the former node, and discards the latter (hereinafter called the fusion operation);
(4) The node feature analysis component takes the node set fused in step (3) and, based on the feature information of every three adjacent nodes, divides the nodes into two classes, "content nodes" and "noise nodes";
(5) The node filter applies layered filtering to the "noise nodes" remaining from step (4) and to certain "content nodes" with special tags; the result of each filtering pass is saved by the filter interim-result analysis component; analysis then yields the optimal node set, which is the extracted subject content.
In the above technical scheme, the method can, according to the client's needs, use a media detection and compression component to return media information contained in the web page, such as pictures and video. Using the node set provided by step (5), the media detection and compression component detects whether the web page contains media information, locates the media related to the document, compresses it, and caches it locally.
In the above technical scheme, the DOM processing component of step (2) comprises a page type guessing module, a document preprocessing module, and a node computation module. It works as follows:
(3-1) Save a copy of the web data stream obtained, for fault-tolerant processing;
(3-2) Extract title information from the <title> node and the <H1> node of the DOM object;
(3-3) Call the document preprocessing module to filter out the comments contained in the current DOM object, as well as scripts, styles, and interactive nodes such as Flash;
(3-4) Call the page type guessing module to guess the type of the target page; if it is a content-type page, carry out the following steps in order; if it is a directory-type page, go directly to step (3-7);
(3-5) Call the node computation module to traverse the remaining nodes in the DOM object, ignoring interactive nodes such as <applet> and <button> and decorative nodes such as <b> and <u>; compute the text density δ(b) and the link density θ(b) of every other remaining node, and save these results together with each node's text, DOM operation interface, and so on. The formulas are as follows:
[Formula 1: equation image not reproduced in this text]
[Formula 2: equation image not reproduced in this text]
Here L(b) is the number of text lines of the current node; T(b) is the text length of the current node; maxLen is the maximum number of characters one screen line can hold; T'(b) is the text length, excluding the last line, of a node with more than one line; and Ta(b) is the total character length of all <a> nodes in the current node and its descendants;
(3-6) Save the results of (3-5) for use by subsequent components;
(3-7) If the page type is guessed to be the directory type, regenerate the DOM object from the web data stream copy saved in (3-1), traverse the <a> nodes in the object again, and return the directory content.
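Formulas 1 and 2 survive only as image placeholders, so the following Python sketch is a guess and not the patent's own formulas. It assumes a common text-density formulation that matches the symbol definitions given above: for a multi-line block, δ(b) = T'(b) / (maxLen · (L(b) − 1)), for a single-line block δ(b) = T(b) / maxLen, and θ(b) = Ta(b) / T(b). The names `Block`, `MAX_LEN`, `wrapped_lines`, `text_density`, and `link_density`, the value 80, and the character-based line wrapping are all illustrative assumptions.

```python
from dataclasses import dataclass

MAX_LEN = 80  # assumed value of maxLen (characters per screen line)


@dataclass
class Block:
    text: str              # all text of the node, T(b) = len(text)
    anchor_text: str = ""  # text inside <a> descendants, Ta(b) = len(anchor_text)


def wrapped_lines(text: str, max_len: int = MAX_LEN) -> list:
    """Wrap text into screen lines of at most max_len characters (assumed wrapping rule)."""
    return [text[i:i + max_len] for i in range(0, len(text), max_len)] or [""]


def text_density(block: Block, max_len: int = MAX_LEN) -> float:
    """Assumed form of delta(b), normalised to the range 0..1."""
    lines = wrapped_lines(block.text, max_len)
    if len(lines) > 1:
        # T'(b): length of all lines except the last one
        t_prime = sum(len(line) for line in lines[:-1])
        return t_prime / (max_len * (len(lines) - 1))
    return len(block.text) / max_len


def link_density(block: Block) -> float:
    """Assumed form of theta(b): share of the node's characters that sit inside links."""
    return len(block.anchor_text) / len(block.text) if block.text else 0.0
```

Under these assumptions a long paragraph scores a text density near 1 and a link density near 0, while a short navigation entry scores the opposite, which is the contrast the later classification thresholds rely on.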
In the above technical scheme, the node fusion component of step (3) comprises an atomic node similarity calculation module and a node fusion module. It works as follows: the atomic node similarity calculation module traverses each node in the result saved in step (2), evaluates formula 3 on the text density δ(b) and the link density θ(b) of every two adjacent nodes, and judges whether the two are similar; if the result reaches the empirical threshold ε = 0.1, the node fusion module performs the fusion operation, so that in the end every two adjacent nodes are sufficiently distinguishable. In formula 3, two weights (their symbols appear only as images and are not reproduced in this text) weight the two kinds of values.
[Formula 3: equation images not reproduced in this text]
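Because formula 3 is not reproduced, the fusion step can only be illustrated under assumptions. The Python sketch below assumes the comparison is a weighted sum of the absolute differences in text density and link density, and that two neighbors are fused when that value is below ε = 0.1. The weights `w_text` and `w_link`, the dictionary layout, and the choice to keep the former node's densities after merging are hypothetical.

```python
def dissimilarity(a: dict, b: dict, w_text: float = 0.5, w_link: float = 0.5) -> float:
    """Assumed stand-in for formula 3: weighted feature distance of two neighbors."""
    return (w_text * abs(a["text_density"] - b["text_density"])
            + w_link * abs(a["link_density"] - b["link_density"]))


def fuse(nodes: list, eps: float = 0.1) -> list:
    """Merge each node into its predecessor when the two are similar enough."""
    merged = []
    for node in nodes:
        if merged and dissimilarity(merged[-1], node) < eps:
            # Fusion operation: keep the former node, fold the latter's text into it
            merged[-1]["text"] += node["text"]
        else:
            merged.append(dict(node))  # copy so the input list is left untouched
    return merged


nodes = [
    {"text": "第一段。", "text_density": 0.90, "link_density": 0.00},
    {"text": "第二段。", "text_density": 0.88, "link_density": 0.02},
    {"text": "首页", "text_density": 0.10, "link_density": 0.80},
]
fused = fuse(nodes)
```

In this example the two paragraph nodes collapse into one and the navigation node stays separate, so adjacent nodes in `fused` differ clearly from each other.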
In the above technical scheme, the node feature analysis component of step (4) loops over every node in the sufficiently distinguishable node set produced by step (3) and makes the following judgments on the node itself and on its preceding and following neighbors:
(5-1) Test condition (a); if condition (a) is false, classify the current node as a noise node;
(5-2) If condition (a) is true, test condition (b); if condition (b) is false, test condition (c); if condition (c) is false, classify the current node as a content node;
(5-3) If condition (c) is true, test condition (d); if condition (d) is true, classify the current node as a noise node; otherwise classify it as a content node;
(5-4) If condition (b) is true, test condition (e); if condition (e) is false, classify the current node as a content node; if condition (e) is true, test condition (f); if condition (f) is false, classify the current node as a content node; otherwise test condition (g); if condition (g) holds, classify the current node as a noise node; otherwise classify it as a content node;
where:
Condition (a): the link density of the current node is less than the empirical value 0.353333;
Condition (b): the link density of the previous node is less than the empirical value 0.555556;
Condition (c): the text density of the current node is less than the empirical value 0.555556;
Condition (d): the text density of the next node is less than the empirical value 0.353333;
Condition (e): the text density of the current node is less than the empirical value 0.488889;
Condition (f): the text density of the next node is less than or equal to the empirical value 0.555556;
Condition (g): the text density of the previous node is less than or equal to the empirical value 0.353333.
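The decision rules (5-1) to (5-4) and conditions (a) to (g) above translate directly into a small decision tree. The Python sketch below uses the thresholds exactly as stated; the `Node` type and the treatment of a missing neighbor at either end of the list as a zero-density node are assumptions, since the text does not say how boundary nodes are handled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    text_density: float
    link_density: float


EMPTY = Node(0.0, 0.0)  # assumed stand-in for a missing neighbor


def classify(prev: Node, curr: Node, nxt: Node) -> str:
    """Return "content" or "noise" for curr, following rules (5-1) to (5-4)."""
    if not curr.link_density < 0.353333:         # condition (a) false
        return "noise"
    if prev.link_density < 0.555556:             # condition (b) true, rule (5-4)
        if not curr.text_density < 0.488889:     # condition (e) false
            return "content"
        if not nxt.text_density <= 0.555556:     # condition (f) false
            return "content"
        # condition (g) decides the remaining case
        return "noise" if prev.text_density <= 0.353333 else "content"
    if not curr.text_density < 0.555556:         # condition (c) false, rule (5-2)
        return "content"
    # rule (5-3): condition (d) decides the remaining case
    return "noise" if nxt.text_density < 0.353333 else "content"


def classify_all(nodes: List[Node]) -> List[str]:
    """Classify every node from its own features and those of its two neighbors."""
    labels = []
    for i, curr in enumerate(nodes):
        prev = nodes[i - 1] if i > 0 else EMPTY
        nxt = nodes[i + 1] if i + 1 < len(nodes) else EMPTY
        labels.append(classify(prev, curr, nxt))
    return labels
```

For instance, a link-heavy navigation node followed by two dense paragraphs, `classify_all([Node(0.1, 0.9), Node(0.9, 0.0), Node(0.8, 0.0)])`, yields one noise node followed by two content nodes.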
In the above technical scheme, the node filter and the filter interim-result analysis component work as follows:
(6-1) Use node filter A to filter out blank and invalid noise nodes;
(6-2) Use node filter B to filter the <Span> and <TD> nodes among the content nodes: first check whether the character length of the current node is greater than the empirical value 4; if not, filter the node out, otherwise keep it; then check whether the current node contains semantically meaningful punctuation; if not, filter it out, otherwise keep it; the results are saved by the data statistics module of the filter interim-result analysis component;
(6-3) Use node filter C to filter out the nonstandard <P> node information contained in the web data, storing the result in the filter interim-result analysis component; because the W3C standard recommends that a <P> node not contain other container nodes, the criterion of this rule is that the <P> node is a single-level node;
(6-4) Use node filter C to filter out the nonstandard <TD> node information contained in the web data, storing the result in the filter interim-result analysis component; because an unclosed <TD> node often ends up wrongly containing other container nodes, the criterion of this rule is that the <TD> node is a single-level node;
(6-5) Use node filter C to filter out the nonstandard <DIV> node information contained in the web data, storing the result in the filter interim-result analysis component; because an unclosed <DIV> node often ends up wrongly containing other container nodes, the criterion of this rule is that the <DIV> node is a single-level node;
(6-6) The filter interim-result analysis component sorts the result objects produced by the above operations in descending order, first by the separator count of each object and, when that field is equal, by character length; it then traverses the ordered result set and finds the first result that satisfies both of the following conditions: the separator count is greater than or equal to the empirical value 2, and the text density is greater than the empirical value 0.28;
(6-7) If the result is empty, regenerate the DOM object from the web data stream copy saved in (3-1), use the DOM processing component to traverse each node in the object, filter only the <P>, <TD>, <PRE>, and <DIV> nodes accordingly, store them, and take the resulting set as the extracted subject content of the web page; check the text density of this content, and if it is not 0, carry out the next step;
(6-8) Regenerate the DOM object from the web data stream copy saved in (3-1), use the DOM processing component to traverse each node in the object, filter and store only the <a> nodes, and return the directory content.
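The selection rule of step (6-6) can be sketched in Python as a two-key descending sort followed by a scan for the first acceptable result. The thresholds 2 and 0.28 come from the text; the dictionary keys `separators`, `length`, and `text_density` and the function name `pick_best` are illustrative assumptions about how an interim result might be stored.

```python
from typing import Optional


def pick_best(results: list, min_separators: int = 2,
              min_density: float = 0.28) -> Optional[dict]:
    """Return the first interim result that passes both thresholds, or None."""
    # Descending by separator count, ties broken by character length
    ordered = sorted(results, key=lambda r: (r["separators"], r["length"]),
                     reverse=True)
    for result in ordered:
        if result["separators"] >= min_separators and result["text_density"] > min_density:
            return result
    return None  # an empty result triggers the fallback of step (6-7)


results = [
    {"name": "filter B", "separators": 5, "length": 300, "text_density": 0.20},
    {"name": "filter C <P>", "separators": 5, "length": 120, "text_density": 0.60},
    {"name": "filter C <DIV>", "separators": 1, "length": 900, "text_density": 0.90},
]
best = pick_best(results)
```

Here the longest five-separator result is examined first but fails the density test, so the second five-separator result is chosen; the dense single-separator result is never accepted.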
Compared with the prior art, the invention has the following advantages: it is easy to operate and widely applicable, and it depends neither on specific software or hardware nor on a particular web page template. A large number of experimental results show that, for Chinese news web pages of different styles, the method can effectively exclude "noise" information from the page and extract the subject content, so it is highly practical.
Description of drawings
Fig. 1 is a schematic diagram of the method for extracting the subject content of a Chinese web page according to an embodiment of the invention.
Fig. 2 is a program flow chart of the method for extracting the subject content of a Chinese web page according to an embodiment of the invention.
Embodiment
The invention is further described below with reference to the drawings and an embodiment.
Fig. 1 is a schematic diagram of the method for extracting the subject content of a Chinese web page according to an embodiment of the invention. The system first formats the URL requested by the user as appropriate, fetches the network data from the remote server, and builds an operable DOM object.
The DOM object is an operable structure over the raw network data. The DOM processing component (DOMHandler) creates a new object to process this model and, based on the enumerated type supplied by the page type guessing module, selects different strategies for processing and conversion.
Fig. 2 is a program flow chart of the method for extracting the subject content of a Chinese web page according to this embodiment. When the input is judged to be a content page, the DOM processing component (DOMHandler) first converts the DOM model into a WebDocument, a custom data structure that describes web page features, including each node's text density, link density, text, DOM operation interface, and so on; at this point the data itself has not been refined in any way. The node fusion component, the node feature analysis component, and the node filter then scan and filter the WebDocument along several dimensions. After each scan-and-filter pass the results are saved to the filter interim-result analysis component, which computes, sorts, and analyzes these interim results to obtain the optimal result set as the extracted page content.
When the result set is not empty, the content nodes have been found successfully. Depending on the program's configuration, the content nodes found can be used to locate related pictures or videos in reverse. If the access URLs of related media are found, the media are compressed and cached locally. The media content thus obtained is then assembled into a new page entity.
When the result set is empty, the content nodes have not been found. This usually happens because the original HTML code does not follow the W3C standard, or because the page has no content nodes at all. To provide fault tolerance, the program offers another conversion: the DOM object is converted into a SimpleWebDocument (another custom data structure describing web page features). Dedicated filters apply filtering similar to the above to this structure and return the web page content directly.
At this point the text density of the current web page content must be checked. When the text density is 0, the program checks whether the current page can show media information; if so, it returns the media information; if not, it returns an error message. When the text density is not 0, the program checks whether the density falls within the safe range; if so, it returns the web page content; if not, the DOM model must be converted into an IndexDoc (a custom data structure for describing directory page features).
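The fallback decision just described can be sketched as a small Python function. The text does not give the bounds of the safe range, so the `safe_range` parameter and its default are purely hypothetical, as are the function name and the returned labels.

```python
def fallback_decision(text_density: float, has_media: bool,
                      safe_range: tuple = (0.05, 1.0)) -> str:
    """Decide what to return after the SimpleWebDocument fallback.

    safe_range is a hypothetical placeholder; the source does not state
    the actual bounds of the safe range.
    """
    if text_density == 0:
        # No text at all: return media if the page has any, else report an error
        return "media" if has_media else "error"
    low, high = safe_range
    if low <= text_density <= high:
        return "content"  # density is plausible for an article page
    return "index"        # treat the page as a directory page (IndexDoc)
```

A page with no text but an embedded video would map to `"media"`, and a page whose density falls outside the assumed range would be handled as a directory page.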
The concrete steps of present embodiment are as follows.
A method for extracting the subject content of a Chinese web page. The hardware components used by the method comprise a DOM generation component, a DOM processing component, a node fusion component, a node feature analysis component, a node filter, a filter interim-result analysis component, and a media detection and compression component. The method is characterized in that it comprises the following steps:
(1) The DOM generation component uses a copy of the web data stream to generate a DOM object;
(2) The DOM processing component processes the DOM object obtained in step (1) according to the page type, computes the feature information of each node, and saves the result; the feature information comprises the text density δ(b) and the link density θ(b) of the current node;
(3) For the result saved in step (2), the node fusion component examines the feature information of neighboring nodes; if the similarity condition holds, it merges the identical fields of the neighboring nodes, keeps the former node, and discards the latter (hereinafter called the fusion operation);
(4) The node feature analysis component takes the node set fused in step (3) and, based on the feature information of every three adjacent nodes, divides the nodes into two classes, "content nodes" and "noise nodes";
(5) The node filter applies layered filtering to the "noise nodes" remaining from step (4) and to certain "content nodes" with special tags; the result of each filtering pass is saved by the filter interim-result analysis component; analysis then yields the optimal node set, which is the extracted subject content.
In the above embodiment, the method can, according to the client's needs, use the media detection and compression component to return media information contained in the web page, such as pictures and video. Using the node set provided by step (5), the media detection and compression component detects whether the web page contains media information, locates the media related to the document, compresses it, and caches it locally.
In the above embodiment, the DOM processing component comprises a page type guessing module, a document preprocessing module, and a node computation module. It works as follows:
(3-1) Save a copy of the web data stream obtained, for fault-tolerant processing;
(3-2) Extract title information from the <title> node and the <H1> node of the DOM object;
(3-3) Call the document preprocessing module to filter out the comments contained in the current DOM object, as well as scripts, styles, and interactive nodes such as Flash;
(3-4) Call the page type guessing module to guess the type of the target page; if it is a content-type page, carry out the following steps in order; if it is a directory-type page, go directly to step (3-7);
(3-5) Call the node computation module to traverse the remaining nodes in the DOM object, ignoring interactive nodes such as <applet> and <button> and decorative nodes such as <b> and <u>; compute the text density δ(b) and the link density θ(b) of every other remaining node, and save these results together with each node's text, DOM operation interface, and so on. The formulas are as follows:
[Formula 1: equation image not reproduced in this text]
[Formula 2: equation image not reproduced in this text]
Here L(b) is the number of text lines of the current node; T(b) is the text length of the current node; maxLen is the maximum number of characters one screen line can hold; T'(b) is the text length, excluding the last line, of a node with more than one line; and Ta(b) is the total character length of all <a> nodes in the current node and its descendants;
(3-6) Save the results of (3-5) for use by subsequent components;
(3-7) If the page type is guessed to be the directory type, regenerate the DOM object from the web data stream copy saved in (3-1), traverse the <a> nodes in the object again, and return the directory content.
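Steps (3-2) and (3-3) can be illustrated with Python's standard `html.parser` module. This is only a sketch of the idea, not the patent's DOM processing component: the class name `Preprocessor`, the `SKIP` tag set, and the flat lists it fills are assumptions. Comments are dropped simply because the base class ignores them unless `handle_comment` is overridden.

```python
from html.parser import HTMLParser


class Preprocessor(HTMLParser):
    """Collect title text and body text while skipping scripts, styles, and
    interactive nodes (an illustrative stand-in for steps (3-2) and (3-3))."""

    SKIP = {"script", "style", "object", "applet", "button"}  # assumed tag set

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a skipped element
        self.current = None   # "title" or "h1" while inside one of them
        self.titles = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in ("title", "h1"):
            self.current = tag

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif tag == self.current:
            self.current = None

    def handle_data(self, data):
        data = data.strip()
        if not data or self.skip_depth:
            return
        (self.titles if self.current else self.text).append(data)


page = ("<html><head><title>新闻标题</title><style>p{}</style></head><body>"
        "<!-- ad --><script>var a=1;</script><h1>新闻标题</h1>"
        "<p>正文内容。</p><button>分享</button></body></html>")
parser = Preprocessor()
parser.feed(page)
```

After `feed`, `parser.titles` holds the text of the `<title>` and `<h1>` nodes and `parser.text` holds only the paragraph text; the comment, script, style, and button content are gone.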
In the above embodiment, the node fusion component comprises an atomic node similarity calculation module and a node fusion module. It works as follows: traverse each node in the result saved in step (2), evaluate formula 3 on the text density δ(b) and the link density θ(b) of every two adjacent nodes, and judge whether the two are similar; if the result reaches the empirical threshold ε = 0.1, perform the fusion operation, so that in the end every two adjacent nodes are sufficiently distinguishable. In formula 3, two weights (their symbols appear only as images and are not reproduced in this text) weight the two kinds of values.
[Formula 3: equation images not reproduced in this text]
In the above embodiment, the node feature analysis component of step (4) loops over every node in the sufficiently distinguishable node set produced by step (3) and makes the following judgments on the node itself and on its preceding and following neighbors:
(5-1) Test condition (a); if condition (a) is false, classify the current node as a noise node;
(5-2) If condition (a) is true, test condition (b); if condition (b) is false, test condition (c); if condition (c) is false, classify the current node as a content node;
(5-3) If condition (c) is true, test condition (d); if condition (d) is true, classify the current node as a noise node; otherwise classify it as a content node;
(5-4) If condition (b) is true, test condition (e); if condition (e) is false, classify the current node as a content node; if condition (e) is true, test condition (f); if condition (f) is false, classify the current node as a content node; otherwise test condition (g); if condition (g) holds, classify the current node as a noise node; otherwise classify it as a content node;
where:
Condition (a): the link density of the current node is less than the empirical value 0.353333;
Condition (b): the link density of the previous node is less than the empirical value 0.555556;
Condition (c): the text density of the current node is less than the empirical value 0.555556;
Condition (d): the text density of the next node is less than the empirical value 0.353333;
Condition (e): the text density of the current node is less than the empirical value 0.488889;
Condition (f): the text density of the next node is less than or equal to the empirical value 0.555556;
Condition (g): the text density of the previous node is less than or equal to the empirical value 0.353333.
In the above embodiment, the node filter and the filter interim-result analysis component work as follows:
(6-1) Use node filter A to filter out blank and invalid noise nodes;
(6-2) Use node filter B to filter the <Span> and <TD> nodes among the content nodes: first check whether the character length of the current node is greater than the empirical value 4; if not, filter the node out, otherwise keep it; then check whether the current node contains semantically meaningful punctuation; if not, filter it out, otherwise keep it; the results are saved by the data statistics module of the filter interim-result analysis component;
(6-3) Filter out the nonstandard <P> node information contained in the web data, storing the result in the filter interim-result analysis component; because the W3C standard recommends that a <P> node not contain other container nodes, the criterion of this rule is that the <P> node is a single-level node;
(6-4) Filter out the nonstandard <TD> node information contained in the web data, storing the result in the filter interim-result analysis component; because an unclosed <TD> node often ends up wrongly containing other container nodes, the criterion of this rule is that the <TD> node is a single-level node;
(6-5) Filter out the nonstandard <DIV> node information contained in the web data, storing the result in the filter interim-result analysis component; because an unclosed <DIV> node often ends up wrongly containing other container nodes, the criterion of this rule is that the <DIV> node is a single-level node;
(6-6) The filter interim-result analysis component sorts the result objects produced by the above operations in descending order, first by the separator count of each object and, when that field is equal, by character length; it then traverses the ordered result set and finds the first result that satisfies both of the following conditions: the separator count is greater than or equal to the empirical value 2, and the text density is greater than the empirical value 0.28;
(6-7) If the result is empty, regenerate the DOM object from the web data stream copy saved in (3-1), use the DOM processing component to traverse each node in the object, filter only the <P>, <TD>, <PRE>, and <DIV> nodes accordingly, store them, and return the web page content directly; check the text density of this content: if it is 0, return the media information according to the client's needs; if it is not 0, carry out the next step;
(6-8) Regenerate the DOM object from the web data stream copy saved in (3-1), use the DOM processing component to traverse each node in the object, filter and store only the <a> nodes, and return the directory content.
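The two tests that node filter B applies in step (6-2) can be sketched as a single Python predicate. The length threshold 4 comes from the text; the set of "semantically meaningful" punctuation marks is not listed in the source, so `SEMANTIC_PUNCTUATION` below is an assumed example set, and the function name is hypothetical.

```python
# Assumed example set; the source does not enumerate the punctuation marks.
SEMANTIC_PUNCTUATION = set("，。！？；：、,.!?;:")


def keep_span_or_td(text: str, min_len: int = 4) -> bool:
    """Return True if a <Span> or <TD> node's text survives filter B."""
    text = text.strip()
    if not len(text) > min_len:
        return False  # too short to be body text
    # Body text normally contains sentence punctuation; labels and menus do not
    return any(ch in SEMANTIC_PUNCTUATION for ch in text)
```

Under these assumptions a short label such as "首页" is dropped, a longer label without punctuation is also dropped, and a sentence such as "今天天气很好。" is kept.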

Claims (6)

1. A method for extracting the subject content of a Chinese web page, wherein the hardware components used by the method comprise a DOM generation component, a DOM processing component, a node fusion component, a node feature analysis component, a node element filter, and a filter temporary-result analysis component, characterized in that the method comprises the following steps:
(1) the DOM generation component uses a copy of the web page data stream to generate a DOM object;
(2) the DOM processing component, using page type information, processes the DOM object obtained in step (1) according to the page type, calculates the feature information of each node, and saves the processing result; the feature information comprises the text density δ(b) and the link density θ(b) of the current node;
(3) for the processing result saved in step (2), the node fusion component calculates a similarity from the feature information of adjacent nodes; if the similarity condition is true, the identical fields of the adjacent nodes are merged, the former node is kept, and the latter node is discarded;
(4) the node feature analysis component takes the node set fused in step (3) and, according to the feature information of every three adjacent nodes, divides the nodes into two classes, "content nodes" and "noise nodes";
(5) the node filter applies layered filtering to the "noise nodes" left by step (4) and to the "content nodes" that carry special tags, and each filtering result is saved by the filter temporary-result analysis component; the optimal node set obtained by analysis is taken as the extracted subject content.
2. The method for extracting the subject content of a Chinese web page according to claim 1, characterized in that: the method uses a media detection and compression component to return the picture and video media information contained in the web page; using the node set provided by step (5), the media detection and compression component detects whether the web page contains media information, locates the media information related to the document, compresses it, and caches it locally.
3. The method for extracting the subject content of a Chinese web page according to claim 1, characterized in that the DOM processing component described in step (2) comprises a page type guessing module, a document preprocessing module, and a node calculation module, and its specific working steps are as follows:
(3-1) save a copy of the obtained web page data stream for fault-tolerant processing;
(3-2) extract the title information from the <title> node and the <H1> node of the DOM object;
(3-3) call the document preprocessing module to filter out the comment information contained in the current DOM object, as well as the script, style, and Flash interaction nodes;
(3-4) call the page type guessing module to guess the type of the target page; if it is a content-type page, perform the following steps in order; if it is a directory-type page, go directly to step (3-7);
(3-5) call the node calculation module to traverse the remaining nodes in the DOM object, ignoring interactive nodes such as <applet> and <button> and decorative nodes such as <b> and <u>; calculate the text density δ(b) and the link density θ(b) of each remaining node, and save these results together with the text information and the DOM operation interface of the node; the calculation formulas are as follows:
(formula 1)
(formula 2)
where L(b) denotes the number of text lines of the current node, T(b) denotes the text length of the current node, maxLen denotes the maximum number of characters that one screen line can hold, T'(b) denotes the text length of a node with more than one line (excluding the last line), and Ta(b) denotes the total character length of all <a> nodes in the current node and its descendant nodes;
(3-6) save the result of (3-5) for use by subsequent components;
(3-7) if the guessed page type is the directory type, use the copy of the web page data stream saved in (3-1) to regenerate the DOM object, traverse the <a> nodes in the object again, and return the directory content.
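
Steps (3-2) and (3-3) can be illustrated with Python's standard `html.parser` module. This is a minimal sketch under assumptions, not the claimed component: the names `Preprocessor` and `preprocess` are hypothetical, the sketch works on parser events rather than on a DOM object, and it drops only comments, `<script>`, and `<style>` (Flash interaction nodes are not handled).

```python
from html.parser import HTMLParser
from typing import List, Tuple


class Preprocessor(HTMLParser):
    """Collects <title>/<h1> text and drops comments, scripts, and styles."""
    SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0          # > 0 while inside <script> or <style>
        self.capture = None          # tag currently captured as a title
        self.titles: List[str] = []
        self.text_parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
        elif tag in ("title", "h1"):
            self.capture = tag
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth:
            self.skip_depth -= 1
        elif tag == self.capture:
            self.capture = None

    def handle_data(self, data):
        if self.skip_depth:
            return                   # script or style content is discarded
        if self.capture:
            self.titles[-1] += data
        else:
            self.text_parts.append(data)

    # handle_comment is not overridden, so comments are silently ignored.


def preprocess(html: str) -> Tuple[List[str], str]:
    parser = Preprocessor()
    parser.feed(html)
    parser.close()
    return parser.titles, "".join(parser.text_parts).strip()
```

For example, `preprocess(page_source)` returns the list of title strings and the remaining visible text, which a later stage could use to compute node features.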
4. The method for extracting the subject content of a Chinese web page according to claim 1, characterized in that: the node fusion component described in step (3) comprises an adjacent-node similarity calculation module and a node fusion module, and its specific working steps are as follows: the adjacent-node similarity calculation module traverses each node in the processing result saved in step (2), evaluates formula 3 on the text density δ(b) and the link density θ(b) of every two adjacent nodes, and judges whether the two are similar; if the empirical threshold ε = 0.1 is reached, the node fusion module performs the fusion operation, so that in the end the discrimination between every two adjacent nodes is large enough; the two coefficients in formula 3 (shown as images in the original publication) are the weights of the two kinds of values.
(formula 3)
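
The fusion pass of claim 4 can be sketched as below. Formula 3 is not reproduced in this text, so the distance function here is an assumption made purely for illustration: a weighted sum of absolute differences with hypothetical weights `w_text` and `w_link`. Treating "reaching ε" as "distance below ε", and keeping the former node's densities after a merge, are likewise assumptions. Only the threshold ε = 0.1 and the rule "keep the former node, discard the latter" come from the claims.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Node:
    """Hypothetical node record with the two features named in the claims."""
    text: str
    text_density: float
    link_density: float


def weighted_distance(a: Node, b: Node,
                      w_text: float = 0.5, w_link: float = 0.5) -> float:
    # Assumed stand-in for formula 3 (the real formula is not available here).
    return (w_text * abs(a.text_density - b.text_density)
            + w_link * abs(a.link_density - b.link_density))


def fuse(nodes: List[Node], eps: float = 0.1,
         distance: Callable[[Node, Node], float] = weighted_distance
         ) -> List[Node]:
    fused: List[Node] = []
    for node in nodes:
        if fused and distance(fused[-1], node) < eps:
            prev = fused[-1]
            # Merge into the former node; the latter node is discarded.
            fused[-1] = Node(prev.text + node.text,
                             prev.text_density, prev.link_density)
        else:
            fused.append(node)
    return fused
```

Passing a different `distance` callable lets the same loop be reused once the actual similarity formula is known.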
5. The method for extracting the subject content of a Chinese web page according to claim 1, characterized in that: the node feature analysis component described in step (4) operates on the node set with sufficiently large discrimination produced in step (3), loops over each node in the set, and makes the following judgments on the node itself and on its preceding and following neighbors:
(5-1) judge whether condition (a) holds; if condition (a) is false, classify the current node as a noise node;
(5-2) if condition (a) is true, judge whether condition (b) holds; if condition (b) is false, judge whether condition (c) holds; if condition (c) is false, classify the current node as a content node;
(5-3) if condition (c) is true, judge whether condition (d) holds; if condition (d) is true, classify the current node as a noise node; otherwise, classify the current node as a content node;
(5-4) if condition (b) is true, judge whether condition (e) holds; if condition (e) is false, classify the current node as a content node; if condition (e) is true, judge whether condition (f) holds; if condition (f) is false, classify the current node as a content node; otherwise, judge whether condition (g) holds; if condition (g) holds, classify the current node as a noise node; otherwise, classify it as a content node;
wherein:
Condition (a): whether the link density of the current node is less than the empirical value 0.353333;
Condition (b): whether the link density of the previous node is less than the empirical value 0.555556;
Condition (c): whether the text density of the current node is less than the empirical value 0.555556;
Condition (d): whether the text density of the next node is less than the empirical value 0.353333;
Condition (e): whether the text density of the current node is less than the empirical value 0.488889;
Condition (f): whether the text density of the next node is less than or equal to the empirical value 0.555556;
Condition (g): whether the text density of the previous node is less than or equal to the empirical value 0.353333.
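
The decision procedure of steps (5-1) to (5-4) maps directly onto nested conditionals. The following Python sketch follows the conditions (a) to (g) and their thresholds as listed above; the `Features` record and the `classify` function name are hypothetical, and how the first and last nodes obtain a neighbor is left to the caller because the claims do not specify it.

```python
from dataclasses import dataclass


@dataclass
class Features:
    """Hypothetical record holding the two features of one node."""
    text_density: float
    link_density: float


def classify(prev: Features, curr: Features, nxt: Features) -> str:
    """Return "content" or "noise" for the current node."""
    # (5-1) condition (a): link density of the current node
    if not curr.link_density < 0.353333:
        return "noise"
    # (5-2) condition (b): link density of the previous node
    if not prev.link_density < 0.555556:
        # condition (c): text density of the current node
        if not curr.text_density < 0.555556:
            return "content"
        # (5-3) condition (d): text density of the next node
        return "noise" if nxt.text_density < 0.353333 else "content"
    # (5-4) condition (e): text density of the current node
    if not curr.text_density < 0.488889:
        return "content"
    # condition (f): text density of the next node
    if not nxt.text_density <= 0.555556:
        return "content"
    # condition (g): text density of the previous node
    return "noise" if prev.text_density <= 0.353333 else "content"
```

A caller would slide a window of three adjacent nodes over the fused node set and call `classify` once per window.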
6. The method for extracting the subject content of a Chinese web page according to claim 1, characterized in that the node filter and the filter temporary-result analysis component described in step (5) work in the following specific steps:
(6-1) use node filter A to filter out blank and invalid noise nodes;
(6-2) use node filter B to filter the <Span> and <TD> nodes among the content nodes: first judge whether the character length contained in the current node is greater than the empirical value 4; if false, filter the node out, otherwise keep it; then judge whether the current node contains punctuation marks with a semantic segmentation function; if false, filter the node out, otherwise keep it; the operation result is saved by the data statistics module of the filter temporary-result analysis component;
(6-3) use node filter C to filter out the non-standard <P> node information contained in the web page data, and store the operation result in the filter temporary-result analysis component; because the W3C standard recommends that a <P> node should not contain other container nodes, the criterion of this filtering rule is that the <P> node is a single-layer node;
(6-4) use node filter C to filter out the non-standard <TD> node information contained in the web page data, and store the operation result in the filter temporary-result analysis component; because an unclosed <TD> node often erroneously contains other container nodes, the criterion of this filtering rule is that the <TD> node is a single-layer node;
(6-5) use node filter C to filter out the non-standard <DIV> node information contained in the web page data, and store the operation result in the filter temporary-result analysis component; because an unclosed <DIV> node often erroneously contains other container nodes, the criterion of this filtering rule is that the <DIV> node is a single-layer node;
(6-6) the filter temporary-result analysis component sorts the result objects produced by the above operations in descending order, first by the separator count of each object and, when that field is equal, by character length; it then traverses the ordered result set and finds the first result that satisfies the following conditions: the separator count is greater than or equal to the empirical value 2, and the text density is greater than the empirical value 0.28;
(6-7) if the result is empty, use the copy of the web page data stream saved in (3-1) to regenerate the DOM object, traverse each node in the object with the DOM processing component, apply the corresponding filtering only to <P>, <TD>, <PRE>, and <DIV> nodes, store the results, and take the resulting set as the extracted web page subject content; check the text density of this web page content, and if it is not 0, perform the next operation;
(6-8) use the copy of the web page data stream saved in (3-1) to regenerate the DOM object, traverse each node in the object with the DOM processing component, filter and store only the <a> nodes, and return the directory content.
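
The two checks of node filter B in step (6-2) can be sketched as a small predicate. The claims do not list which punctuation marks count as having a "semantic segmentation function", so the `SEGMENTING_PUNCTUATION` set below is an assumption, as are the function name and the choice to strip surrounding whitespace before measuring length. Only the length threshold 4 comes from the text.

```python
# Assumed set of sentence- and clause-level punctuation (Chinese and ASCII).
SEGMENTING_PUNCTUATION = set("，。；：！？,.;:!?")


def keep_span_or_td(text: str, min_length: int = 4) -> bool:
    """Return True if a <Span> or <TD> node's text survives filter B."""
    stripped = text.strip()
    # First check: character length must be greater than the threshold.
    if len(stripped) <= min_length:
        return False
    # Second check: the text must contain segmenting punctuation.
    return any(ch in SEGMENTING_PUNCTUATION for ch in stripped)
```

Short navigation labels fail the first check, while longer labels without any sentence punctuation fail the second.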
CN 201110090737 2011-04-12 2011-04-12 Method for extracting subject content of Chinese webpage Active CN102156737B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201110090737 CN102156737B (en) 2011-04-12 2011-04-12 Method for extracting subject content of Chinese webpage

Publications (2)

Publication Number Publication Date
CN102156737A true CN102156737A (en) 2011-08-17
CN102156737B CN102156737B (en) 2013-03-20

Family

ID=44438236

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201110090737 Active CN102156737B (en) 2011-04-12 2011-04-12 Method for extracting subject content of Chinese webpage

Country Status (1)

Country Link
CN (1) CN102156737B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7669119B1 (en) * 2005-07-20 2010-02-23 Alexa Internet Correlation-based information extraction from markup language documents
EP1934807A2 (en) * 2005-08-09 2008-06-25 Zalag Corporation Methods and apparatuses to assemble, extract and deploy content from electronic documents
CN101727498A (en) * 2010-01-15 2010-06-09 西安交通大学 Automatic extraction method of web page information based on WEB structure
CN102004805A (en) * 2010-12-30 2011-04-06 上海交通大学 Webpage denoising system and method based on maximum similarity matching

Also Published As

Publication number Publication date
CN102156737B (en) 2013-03-20

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20110817

Assignee: Wuhan Hezhongxing Trading Co.,Ltd.

Assignor: CENTRAL CHINA NORMAL University

Contract record no.: X2023980052458

Denomination of invention: A Method for Extracting Theme Content from Chinese Web Pages

Granted publication date: 20130320

License type: Common License

Record date: 20231219

EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20110817

Assignee: Hubei ZHENGBO Xusheng Technology Co.,Ltd.

Assignor: CENTRAL CHINA NORMAL University

Contract record no.: X2024980001275

Denomination of invention: A Method for Extracting Theme Content from Chinese Web Pages

Granted publication date: 20130320

License type: Common License

Record date: 20240124

EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20110817

Assignee: Hubei Rongzhi Youan Technology Co.,Ltd.

Assignor: CENTRAL CHINA NORMAL University

Contract record no.: X2024980001548

Denomination of invention: A Method for Extracting Theme Content from Chinese Web Pages

Granted publication date: 20130320

License type: Common License

Record date: 20240126

EE01 Entry into force of recordation of patent licensing contract