CN102880707A

CN102880707A - Method and device for webpage body content recognition

Info

Publication number: CN102880707A
Application number: CN2012103713105A
Authority: CN
Inventors: 梁捷; 俞永福; 何小鹏; 朱顺炎; 陈德志
Original assignee: Guangzhou Dongjing Computer Technology Co Ltd
Current assignee: Alibaba China Co Ltd
Priority date: 2012-09-27
Filing date: 2012-09-27
Publication date: 2013-01-16
Anticipated expiration: 2032-09-27
Also published as: CN102880707B

Abstract

The invention provides a method and a device for recognizing the body content of a web page. The method includes: parsing a web page to be loaded and building a DOM (Document Object Model) tree; scoring each node in the DOM tree; determining the scores of all block elements in the web page according to the score of each node in the DOM tree; and finding the block element with the highest score in the DOM tree and taking that block element as the body content of the web page. The method can quickly determine the real body content of a web page, so that a user can read the body content of a requested web page faster and with less data traffic.

Description

Webpage main content identification method and device

Technical Field

The present invention relates to the field of web browsing technologies for wireless networks, and in particular to a method and an apparatus for identifying the main content of a web page.

Background

At present, Internet web pages carry more and more content, their layouts are increasingly complex, and the main content is increasingly interspersed with non-main content such as advertisements, videos, Flash animations and embedded objects. It is becoming increasingly difficult to pick out the desired information from a web page at a glance. This is especially true for terminal devices such as mobile phones and PDAs: because of hardware limitations their screens are small, so only a small amount of web page content can be displayed at a time when a WWW page is viewed in a mobile browser, and the non-main content degrades the user's browsing experience even more severely.

An ordinary web page is displayed on a mobile phone by zooming or even re-typesetting it. Currently popular mobile phone screen resolutions are 480 × 800, 240 × 320 and so on, whereas a web page is usually laid out for 1024 × 768 or 800 × 600, and the width and height differ from page to page. When such a large page is displayed on a phone with a much lower resolution, the page is usually scaled down so that the whole page fits the phone's resolution. However, simple zooming no longer meets users' browsing needs: the user wants to see the desired content, not the page's structure, advertisements and other irrelevant material.

Another way of laying out a web page for display on a mobile phone is the fit-to-screen mode. Although this mode re-typesets the page using the current phone screen as the reference, the resulting page still contains page structure, advertisements and similar information; not everything on it is content the user wants to read, and the browsing experience suffers.

Disclosure of Invention

In view of the above problems, an object of the present invention is to provide a method and an apparatus for identifying the main content of a web page, which can easily identify that main content, so that a user browsing on a mobile terminal device such as a mobile phone can obtain the information of the web page directly, without dragging the page left and right.

According to one aspect of the invention, a method for identifying webpage main content is provided, which comprises the following steps:

analyzing a webpage to be loaded and constructing a DOM tree;

scoring each node in the DOM tree;

determining the scores of all block elements in the webpage according to the score of each node in the DOM tree;

and finding out the block element with the highest score in the DOM tree, and taking the block element with the highest score as the main content of the webpage.

Wherein, in the process of scoring each node in the DOM tree,

the basis for scoring is the sum of the scores of the child nodes under each node, the scores of the different types of child nodes depending on their node type, wherein,

the score of the text node is the length of the character string of the text node;

for an element node, if the element is an inline element, the score of the element node is 0; if the element node is a block element, judging whether the length of the text contained in the block element exceeds a preset threshold value, and if so, determining the real score of the block element according to the length of the text contained in the block element and the node type of the block element.

Wherein, in the process of determining the real score of the block element according to the length of the text contained in the block element and the node type of the block element,

the length T of the text contained in the block element is determined from the innerText attribute of the block element div, T = Length(innerText), and the basic score is T accordingly; for each element whose node type belongs to the main content of the web page, the score is increased on the basis of T; for each element whose node type does not belong to the main content of the web page, the score is decreased on the basis of T.

According to another aspect of the present invention, there is provided a web page main body content recognition apparatus including:

the webpage analyzing unit is used for analyzing the webpage to be loaded and constructing a DOM tree;

the node scoring unit is used for scoring each node in the DOM tree constructed by the webpage analyzing unit;

the block element scoring unit is used for determining scores of all block elements in the webpage according to the score of each node in the DOM tree;

and the webpage main content determining unit is used for finding out the block element with the highest score in the DOM tree and taking the block element with the highest score as the main content of the webpage.

Wherein the node scoring unit further comprises:

a node type judging unit for judging the type of the node;

a node score calculation unit for calculating a score of a node according to a type of the node, wherein,

the node scoring unit scores according to the sum of the scores of the child nodes below each node, and the scores of different types of child nodes are determined according to their node types.

By using the method and the device for identifying the main content of a web page according to the invention, the real main content of a web page can be determined quickly, so that a user can read the main content of the requested web page faster and with less data traffic.

To the accomplishment of the foregoing and related ends, one or more aspects of the invention comprise the features hereinafter fully described and particularly pointed out in the claims. The following description and the annexed drawings set forth in detail certain illustrative aspects of the invention. These aspects are indicative, however, of but a few of the various ways in which the principles of the invention may be employed. Further, the present invention is intended to include all such aspects and their equivalents.

Drawings

Other objects and results of the present invention will become more apparent and more readily appreciated as the same becomes better understood by reference to the following description and appended claims, taken in conjunction with the accompanying drawings. In the drawings:

FIG. 1 is a diagram illustrating the structure of a DOM tree of an HTML web page;

FIG. 2 is a flowchart of a method for identifying main contents of a web page according to the present invention;

FIG. 3 is a block diagram illustrating an apparatus for identifying the main content of a web page according to the present invention.

The same reference numbers in all figures indicate similar or corresponding features or functions.

Detailed Description

In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of one or more embodiments. It may be evident, however, that such embodiment(s) may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to facilitate describing one or more embodiments.

Since the mobile Internet is at present accessed mainly through mobile phones, in the following description of embodiments of the present invention "mobile terminal" and "mobile phone" both refer to the device a user uses to access the mobile Internet; "mobile phone" is to be understood as one specific form of "mobile terminal", not the only one.

The structure of an internet web page can be described by a DOM (Document Object Model) that can access and modify the content and structure of a Document in a platform and language independent manner. The DOM is designed based on the conventions of the Object Management Group (OMG) and can therefore be used in any programming language. The DOM defines the objects needed to represent and modify a document, the behavior and properties of these objects, and the relationships between these objects, and thus can be thought of as a tree representation of the data and structures on a page.

A whole web page is made up of page elements, attributes and text that form a tree structure. Each item in the tree is called a node (Node); each tag corresponds to an element (Element), and the text strings between tags correspond to text (Text). For example, the following HTML page can be represented by the DOM tree shown in FIG. 1:

what is the sun star cloud.</p></body></html>

The elements of a web page can be divided into block elements and inline elements. Block elements can be nested, stacked or tiled, and can be laid out freely. The main content of a web page is the block element that contains the most continuous content information.

Text nodes are usually leaf nodes. Nodes are the smallest units of a web page and include text nodes, element nodes, attribute nodes, comment nodes and so on; an element node is the node at which an element is located.

Based on the above analysis, the invention provides a method for identifying the main content of a web page by scoring nodes. The method parses the DOM tree representing an HTML page and scores each node in it: a node that leans toward main content receives a positive score according to its text length and node type, and a node that leans toward non-main content receives a negative score according to its text length and node type. Finally, the content contained in the block element with the highest score is taken as the identified main content of the web page.
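The element/text tree described above can be illustrated with a minimal sketch. It uses Python's standard `html.parser` module rather than a browser engine, and the names `Node`, `TreeBuilder`, `parse_html`, `dump` and `VOID_TAGS` are hypothetical helpers introduced only for this illustration; attribute and comment nodes are ignored for brevity.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import List, Optional

# Tags that have no closing tag; they never become a parent node.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


@dataclass
class Node:
    tag: Optional[str]              # element name, or None for a text node
    text: str = ""                  # string content of a text node
    children: List["Node"] = field(default_factory=list)


class TreeBuilder(HTMLParser):
    """Builds a simplified element/text tree (attributes and comments are ignored)."""

    def __init__(self) -> None:
        super().__init__()
        self.root = Node(tag="#document")
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag=tag)
        self._stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self._stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open element, if there is one.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i].tag == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        # Whitespace-only text between tags is skipped.
        if data.strip():
            self._stack[-1].children.append(Node(tag=None, text=data))


def parse_html(html: str) -> Node:
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def dump(node: Node, depth: int = 0) -> List[str]:
    """Return one indented line per node, text nodes shown as quoted strings."""
    label = repr(node.text) if node.tag is None else f"<{node.tag}>"
    lines = ["  " * depth + label]
    for child in node.children:
        lines.extend(dump(child, depth + 1))
    return lines


tree = parse_html("<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>")
print("\n".join(dump(tree)))
```

In the printed tree the text nodes appear only as leaves, while the `<p>` element holds both a text node and the inline `<b>` element.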

Fig. 2 is a flowchart illustrating a web page body content identification method according to the present invention.

As shown in fig. 2, when a user inputs a web address and requests to load a web page into a terminal browser, the process of identifying the content of the web page includes the following steps:

S210: The browser parses the web page to be loaded and constructs a DOM tree.

S220: Each node in the DOM tree is scored. The score of a node is the sum of the scores of the child nodes below it; if a node has three child nodes, its score is the sum of the scores of those three child nodes.

S230: The scores of all block elements in the web page are determined according to the score of each node in the DOM tree.
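The summation rule of step S220 can be sketched as a short recursion. This is only an illustration under simplifying assumptions: a text node is modelled as a plain string scored by its length, every other node as a list of children, and the type-dependent rules for inline and block elements are left out. The names `Tree` and `node_score` are hypothetical.

```python
from typing import List, Union

# A text node is modelled as a str; any other node is a list of child nodes.
Tree = Union[str, List["Tree"]]


def node_score(node: Tree) -> int:
    """Score of a node = sum of its children's scores; a text node scores its length."""
    if isinstance(node, str):
        return len(node)
    return sum(node_score(child) for child in node)


# A node with three children: two text nodes and one nested node.
parent = ["abcde", "fgh", ["ij", "k"]]
print(node_score(parent))  # 5 + 3 + (2 + 1) = 11
```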

Since different types of nodes play different roles in an HTML page, they receive different scores in the scoring process. In general, for a node whose content leans toward the main content of the web page (e.g., text, paragraph division), the score is positive, i.e., points are added for the node; for a node whose content leans toward non-main content (e.g., hyperlinks, advertisements), the score is negative, i.e., points are deducted for the node. Whether a node's content leans toward the main content can be judged from the node's type and the length of the text it contains; examples of nodes that lean toward main content are a text node, a paragraph node (p element node) and a line-feed node (br element node).

Specifically, a text node represents the text contained in an HTML element; all of its content is text, which can generally be taken to belong to the main content of the web page, so the score of a text node is the length of its character string. For an element node, the score is determined according to the specific content of the element node.

If the element is an inline element, the node score is 0, because an inline element generally only modifies the style, color and size of the text without adding extra content.

If the element node is a block element, the length of the text contained in the element is calculated first (the length of the text of this element and all its child elements can be obtained via the innerText attribute). A threshold can then be set, for example 200: on a typical news website a line of an article holds about 40 characters, so a threshold of 200 corresponds to about 5 displayed lines. If the text length of innerText is less than 200, the block is considered to contain too little information and is regarded as non-main content.

If the length of the innerText text exceeds 200, it still cannot be concluded that the information contained in the node is the main content of the web page. The block element may also contain embedded objects, embedded Flash, embedded table or form elements and so on, which may be advertising information, or the block element may consist entirely of hyperlink text. The scores of the various tag elements among the child elements of the block element must therefore be deducted to obtain the true score. If the block element contains a tag element that represents non-main content, a proportion of the score is deducted according to how likely that tag element is to be non-main content; the deduction can be determined from the basic score and the preset weight corresponding to the tag element.
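The threshold test can be sketched as follows. The sketch approximates innerText by concatenating the text of all descendants; a browser's innerText also takes rendering into account, which is ignored here. The tuple-based node shape and the names `inner_text`, `is_candidate` and `TEXT_THRESHOLD` are assumptions made for this illustration.

```python
TEXT_THRESHOLD = 200  # about 5 lines at roughly 40 characters per line


def inner_text(node) -> str:
    """Concatenate the text of a node and all of its descendants.

    A text node is a plain str; an element is a (tag, children) tuple.
    """
    if isinstance(node, str):
        return node
    _tag, children = node
    return "".join(inner_text(child) for child in children)


def is_candidate(block, threshold: int = TEXT_THRESHOLD) -> bool:
    """A block stays a main-content candidate only if its text reaches the threshold."""
    return len(inner_text(block)) >= threshold


short_block = ("div", ["Login", ("a", ["Register"])])
long_block = ("div", [("p", ["x" * 120]), ("p", ["y" * 90])])

print(is_candidate(short_block))  # False: only 13 characters
print(is_candidate(long_block))   # True: 210 characters
```

A block that passes this test is not yet accepted as main content; it only qualifies for the further score adjustments.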

For other nodes, such as comment nodes, processing instruction nodes, document type nodes, etc., since these nodes are part of the structure, their scores are all set to 0.
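The per-type rules above can be gathered into one dispatch function. This is a hedged sketch, not the patented implementation: the tag sets `INLINE_TAGS` and `BLOCK_TAGS` are illustrative assumptions (a browser would decide from the computed style), and scoring a below-threshold block or an unlisted tag as 0 is also an assumption, since the text only says such a block is regarded as non-main content.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative tag sets (assumption); a real browser decides this from the computed style.
INLINE_TAGS = {"a", "span", "b", "i", "em", "strong", "font"}
BLOCK_TAGS = {"div", "p", "ul", "li", "dl", "table", "form", "h1"}


@dataclass
class Node:
    kind: str                       # "text", "element", "comment", "doctype", ...
    tag: str = ""                   # element name when kind == "element"
    text: str = ""                  # string content when kind == "text"
    children: List["Node"] = field(default_factory=list)


def inner_text(node: Node) -> str:
    if node.kind == "text":
        return node.text
    return "".join(inner_text(child) for child in node.children)


def base_score(node: Node, threshold: int = 200) -> int:
    if node.kind == "text":
        return len(node.text)       # text node: length of its string
    if node.kind != "element":
        return 0                    # comment, doctype, processing instruction
    if node.tag in INLINE_TAGS:
        return 0                    # inline element: styling only
    if node.tag in BLOCK_TAGS:
        length = len(inner_text(node))
        # Assumption: a block below the threshold is scored 0.
        return length if length >= threshold else 0
    return 0                        # tags in neither set: assumption


article = Node("element", tag="div", children=[Node("text", text="x" * 250)])
link = Node("element", tag="a", children=[Node("text", text="home")])
print(base_score(article), base_score(link), base_score(Node("comment")))
```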

The tag elements in a block element include <div>, <p>, <br>, <a>, <h1>, <form>, <ul>, <li>, <dl> and so on. The score calculation for several tag elements, and the effect of these tag elements on the score of the block element, are illustrated below.

For example, for a block element div, the basic score T is first calculated from the length of the text given by the block element's innerText. On that basis, for elements whose node type belongs to the main content of the web page, the score is increased from T; for elements whose node type does not belong to the main content, the score is decreased from T.

Suppose the number of p elements corresponding to all <p> tags under this block element div is Count(p). Since p elements are usually used to divide text into paragraphs, the more p elements there are, the more likely the content is paragraph-divided body text. The score of this block element div must therefore include the score contributed by the p elements:

T+Count(p)*W_p

where W_p is the preset weight of the p element (W stands for weight). The size of a weight depends on how much the element contributes to the main content of the web page; a p element is a paragraph, which is generally assumed to hold more than two lines of text, so its value is generally 20.

Likewise, the <br> tag is part of the paragraph division of the body text, so the score of this block element div becomes:

T+Count(br)*W_br

where W_br is the preset weight of the br element. The br element produces a line feed, and its value is usually 10 to 20.

The <a> tag serves as an anchor and is also part of the web page content, but an excess of hyperlinks serves only for page navigation. Points are therefore deducted for the a elements corresponding to <a> tags, and the score is:

T-Count(a)*W_a

where W_a is the weight of the a element. The a element is an inline element, and its value is usually 10 to 20.

If a block element is an li element of an <li> tag, it is used for list navigation, so points are likewise deducted for the li elements corresponding to <li> tags in the node's score calculation, and the score is:

T-Count(li)*W_li

where W_li is the weight of the li element. An li is a list item, which typically holds about 10 characters on one line, so its value is usually 10.

It can be seen that the specific value of a preset weight is determined by how much content the tag typically carries, and the preset weights of different tags are not necessarily the same.

The object, form, table, embed, script, style, dl and ul elements and the like below the block element div do not belong to the main content of the web page, so their scores must be subtracted from the basic score T. If the innerText of such an element has text length a (a = Length(innerText)), then for each such element the corresponding score a is subtracted from the basic score T determined by the innerText of the block element, giving T - a.
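As a worked illustration of the four formulas above, the following sketch applies them together to one basic score. Combining the adjustments into a single sum is an assumption, as are the concrete weights chosen for br and a (the text gives a range of 10 to 20) and the names `WEIGHTS`, `SIGNS` and `adjusted_score`.

```python
from typing import Dict, Mapping

# Preset weights; br and a use the lower end of the stated 10-20 range (assumption).
WEIGHTS: Dict[str, int] = {"p": 20, "br": 10, "a": 10, "li": 10}
# +1 for tags that add to the score, -1 for tags that are deducted.
SIGNS: Dict[str, int] = {"p": +1, "br": +1, "a": -1, "li": -1}


def adjusted_score(base: int, tag_counts: Mapping[str, int]) -> int:
    """Apply T +/- Count(tag) * W_tag for every listed tag to the basic score T."""
    score = base
    for tag, weight in WEIGHTS.items():
        score += SIGNS[tag] * tag_counts.get(tag, 0) * weight
    return score


# Article-like block: 1200 characters, 8 paragraphs, 2 line breaks, 3 links.
print(adjusted_score(1200, {"p": 8, "br": 2, "a": 3}))  # 1200 + 160 + 20 - 30 = 1350
# Navigation-like block: 300 characters, 25 links inside 25 list items.
print(adjusted_score(300, {"a": 25, "li": 25}))         # 300 - 250 - 250 = -200
```

With these assumed weights the navigation-like block ends up with a negative score even though its text passes the 200-character threshold.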
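The T - a deduction can be sketched as below. The node shape (a plain string for text, a `(tag, children)` tuple for an element) and the function names are assumptions for illustration. The sketch also counts only the outermost non-content element on each path, so that, for example, a ul nested inside a form is not subtracted twice; the text does not say how nesting should be handled.

```python
NON_CONTENT_TAGS = {"object", "form", "table", "embed", "script", "style", "dl", "ul"}


def inner_text(node) -> str:
    """Text of a node and its descendants; text nodes are str, elements are (tag, children)."""
    if isinstance(node, str):
        return node
    return "".join(inner_text(child) for child in node[1])


def non_content_length(node) -> int:
    """Total text length of the outermost non-content elements below `node`."""
    if isinstance(node, str):
        return 0
    total = 0
    for child in node[1]:
        if not isinstance(child, str) and child[0] in NON_CONTENT_TAGS:
            total += len(inner_text(child))      # a = Length(innerText)
        else:
            total += non_content_length(child)
    return total


def score_without_non_content(block) -> int:
    t = len(inner_text(block))                   # basic score T
    return t - non_content_length(block)         # T - a for each such element


block = ("div", [
    ("p", ["A" * 300]),
    ("form", ["Search", ("ul", ["opt1", "opt2"])]),
    ("script", ["var x = 1;"]),
])
print(score_without_non_content(block))  # 324 - (14 + 10) = 300
```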

Finally, in step S240, the block element with the highest score in the DOM tree is found, and the block element with the highest score is used as the main content of the web page.

By looping over the scores of all block elements div, the block element with the highest score can be determined, and the content it contains is the finally determined main content of the web page. Because only the highest score is selected, the traversal yields a single main content for the web page.

Once the block element with the highest score has been obtained, this block element, which represents the web page main content, can be recorded or marked in various ways. For example, a frame can be drawn around it, or the block element can be extracted directly, and the web page main content can then be opened by clicking the frame or the extracted part. In addition, the background style, font size and page brightness of the content can be adjusted to improve the user experience. Through this preprocessing of the page, the user can skip irrelevant non-main parts such as advertisements, Flash and hyperlinks and browse the main content directly, which is convenient and fast and effectively reduces the user's data traffic costs.

To further improve the experience, the identified main content can be re-typeset on the mobile terminal to fit the screen, so that the user does not need to drag the page left and right and can operate it comfortably with one hand.

In addition, if the method is applied in middleware (a server that provides data processing services for the mobile terminal), only the web page main content needs to be sent to the mobile terminal once it has been identified. In that case the main content received by the mobile terminal is far smaller than the original page the user requested, the data reaches the user's terminal browser faster, and the savings in traffic and response time are even more pronounced.
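The traversal of step S240 amounts to a maximum search over the block scores. In the sketch below the block identifiers, the function name `pick_main_block` and the tie-breaking rule (the first block encountered wins) are assumptions; the text does not say how equal scores are resolved.

```python
from typing import Dict, Optional, Tuple


def pick_main_block(block_scores: Dict[str, int]) -> Optional[Tuple[str, int]]:
    """Return (block id, score) for the highest-scoring block, or None if there are none."""
    best: Optional[Tuple[str, int]] = None
    for block_id, score in block_scores.items():
        # Strict '>' keeps the first block seen when scores tie (assumption).
        if best is None or score > best[1]:
            best = (block_id, score)
    return best


scores = {"div#header": -40, "div#nav": -200, "div#article": 1350, "div#footer": 15}
print(pick_main_block(scores))  # ('div#article', 1350)
```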

The web page body content identification method according to the present invention is described above with reference to fig. 1 and 2. The method for identifying the main content of the webpage can be realized by software, hardware or a combination of software and hardware.

Fig. 3 shows a block schematic diagram of a web page body content recognition apparatus 300 according to the present invention.

As shown in fig. 3, the web page body content recognition apparatus 300 includes a web page parsing unit 310, a node scoring unit 320, a block element scoring unit 330, and a web page body content determination unit 340. The node scoring unit 320 may include a node type determining unit 321 and a node score calculating unit 322.

When a user inputs a website to request to load a webpage into a terminal browser, in order to identify the webpage content, the webpage analyzing unit 310 first analyzes the webpage to be loaded, and constructs a DOM tree; the node scoring unit 320 scores each node in the DOM tree constructed by the web page parsing unit 310 according to the total score of the child nodes below the node; the block element scoring unit 330 determines scores of all block elements in the webpage according to the score of each node in the DOM tree; the web page main content determining unit 340 finds the block element with the highest score in the DOM tree, and takes the block element with the highest score as the main content of the web page.

The node scoring unit 320 further includes a node type determining unit 321 and a node score calculating unit 322, where the node type determining unit 321 is configured to determine the type of the node; the node score calculating unit 322 is used for calculating the score of the node according to the type of the node, wherein the score of the node scoring unit 320 is the sum of the scores of the child nodes under each node, and the scores of the child nodes of different types are determined according to the node types of the child nodes; the score of the text node is the length of the character string of the text node; for an element node, if the element is an inline element, the score of the element node is 0; if the element node is a block element, judging whether the length of the text contained in the block element exceeds a preset threshold value, and if so, determining the real score of the block element according to the length of the text contained in the block element and the node type of the block element.

In addition, the web page main content recognition apparatus 300 may further include a web page main content marking unit (not shown in the figure) for recording or marking the highest-scoring block element, which represents the web page main content, after the web page main content determination unit finds that block element in the DOM tree.

With the web page main content identification method and device, the main content of a web page to be loaded can be determined either in the middleware or on the mobile terminal. If the main content is determined after the page has been loaded on the mobile terminal, it can be outlined with a visible frame, and the user can open it by clicking the framed area. In response to the click, the browser fits the main content to the screen, free of cluttering advertisements, pictures and other information that interferes with browsing, which makes it well suited to reading.

On the other hand, if the main content is identified in the middleware before the mobile terminal loads the page, the identified main content can be sent directly to the user's mobile terminal. Because the main content received by the mobile terminal is far smaller than the original page the user requested, it reaches the user's terminal browser faster and with less data traffic.

The node scoring criterion described in the above embodiments is the sum of the scores of the child nodes under each node, with the scores of different types of child nodes depending on their node types. Since there is more than one way to determine the main content of a web page, the scoring basis may also be derived from other features of the page structure. One optional alternative is to calculate according to element type and child-node density: all child nodes of an element are scored, the scores are summed, and the sum is divided by the number of child nodes to obtain the node density value. A higher node density means denser child nodes and a lower node density means sparser child nodes; the web page main content is the node with the minimum node density.
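The density value of this alternative can be sketched as the sum of the child scores divided by the number of children. The sketch only computes the value; the element-type-specific part of the alternative is not modelled, text nodes are scored by length as an assumption carried over from the main embodiment, and the names `node_score` and `node_density` are hypothetical.

```python
def node_score(node) -> int:
    """A text node (str) scores its length; other nodes (lists) sum their children."""
    if isinstance(node, str):
        return len(node)
    return sum(node_score(child) for child in node)


def node_density(node) -> float:
    """Sum of the child scores divided by the number of children (0.0 if childless)."""
    if isinstance(node, str) or not node:
        return 0.0
    return sum(node_score(child) for child in node) / len(node)


article = ["a" * 400, "b" * 200]           # two long text children
menu = ["Home", "News", "Blog", "Help"]    # four short text children
print(node_density(article))  # 600 / 2 = 300.0
print(node_density(menu))     # 16 / 4 = 4.0
```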

The web page body content recognition method and apparatus according to the present invention are described above by way of example with reference to the accompanying drawings. However, it should be understood by those skilled in the art that various modifications may be made to the method and apparatus for identifying the main contents of the web page provided by the present invention without departing from the scope of the present invention. Therefore, the scope of the present invention should be determined by the contents of the appended claims.

Claims

1. A webpage main content identification method comprises the following steps:

analyzing a webpage to be loaded and constructing a DOM tree;

scoring each node in the DOM tree;

2. The web page body content recognition method according to claim 1, wherein, in the process of scoring each node in the DOM tree,

3. The web page main body content recognition method according to claim 2, wherein in determining the true score of the block element based on the length of the text included in the block element and the node type of the block element,

4. The web page body content identification method according to claim 3,

if the number of p elements corresponding to all <p> tags under the block element div is Count(p), the score of the block element is T + Count(p) * W_p, where W_p is the weight of the p element node;

if the number of br elements corresponding to all <br> tags under the block element div is Count(br), the score of the block element is T + Count(br) * W_br, where W_br is the weight of the br element node;

if the number of a elements corresponding to all <a> tags under the block element div is Count(a), the score of the block element is T - Count(a) * W_a, where W_a is the weight of the a element node;

if the number of li elements corresponding to all <li> tags under the block element div is Count(li), the score of the block element is T - Count(li) * W_li, where W_li is the weight of the li element node;

for an object element, a form element, a table element, an embed element, a script element, a style element, a dl element, a ul element and an li element below the block element div, if the text length of the innerText corresponding to the object element, the form element, the table element, the embed element, the script element, the style element, the ul element and the li element is a, where a = Length(innerText), the score of the block element is T - a.

5. The web page main body content identifying method according to claim 1, wherein, after finding the block element of the highest score in the DOM tree,

recording or marking the highest-scoring block element, which represents the web page main content, and displaying the main content of the web page by displaying that block element.

6. The web page main body content identification method according to claim 1, wherein the method is implemented in middleware which transmits the identified web page main body content to a user terminal after identifying the web page main body content.

7. A web page body content recognition apparatus comprising:

8. The web page body content recognition apparatus as recited in claim 7, wherein the node scoring unit further comprises:

a node type judging unit for judging the type of the node;

9. The web page body content recognition apparatus according to claim 7, further comprising:

a web page main content marking unit for recording or marking the highest-scoring block element, which represents the web page main content, after the web page main content determining unit finds that block element in the DOM tree.