A system and method for identifying body-text floors of a web page
This application is a divisional application of the Chinese invention patent application filed on June 25, 2012, with application number 201210214079.9, entitled "A system and method for identifying body-text floors of a web page".
Technical field
The present invention relates to the Internet field, and in particular to a method for identifying body-text floors of a web page.
Background art
With the development and spread of mobile terminals, people increasingly use them to browse web pages. However, most websites on the Internet do no special processing for display on mobile terminals, so most web pages are distorted when displayed there, and the user's reading experience is extremely poor.
The current way to improve the reading experience is to extract the body text of the web page, re-lay it out, and present it to the user again. This works fairly well for news and information pages that contain large blocks of content, but it discards user comments. For forums and similar sites, where the body text is split into many "floors" (the individual posts of a thread), the result is worse: only the body text of a single floor is identified, or no body text is identified at all. In addition, junk-word information in the source page is not removed, and because page content has no fixed presentation, the generated page can differ in appearance from the source page.
Summary of the invention
The object of the present invention is to overcome the reliance of current body-text extraction techniques on the largest text block and their poor handling of multi-"floor" content, so that when the body text of a web page is extracted and re-laid out, not only the main body text can be identified and extracted, but also the comments attached to news items and the multi-"floor" content of forums.
A system for identifying the body text of a web page, the system comprising:
A web page parsing and layout module, adapted to parse the source code of the web page, perform layout calculation on the parsing result, and generate the DOM tree of the web page;
A node identification module, adapted to traverse the DOM tree starting from its root node and identify the body-text nodes and/or junk-word nodes in the DOM tree;
A floor division module, adapted to divide the identified body-text nodes according to the floors of the web page.
Wherein, the nodes of the DOM tree are divided according to the tags of the web page's markup language.
Wherein, the system comprises a mobile terminal page generation module, adapted to generate the mobile terminal page.
Wherein, the root node is the body node.
Wherein, after it is generated, the DOM tree of the web page retains only the main elements of the web page.
Wherein, the main elements of the web page comprise text, image links and/or text formatting.
Wherein, the node identification module comprises:
A statistical module, adapted to calculate the node distribution value, text density and/or junk-word density of each web page;
An analysis module, adapted to analyze the node distribution value to determine how each node of each web page is composed, and to compare the text density and/or junk-word density with the corresponding preset thresholds;
A body-text recognition module, adapted to identify as body text the content whose text density and/or junk-word density falls within the corresponding thresholds.
Wherein, the node distribution value represents the composition of the child nodes of a node, including the number of each kind of tag and the proportion of the child nodes that each tag accounts for;
The text density represents the average text length obtained by dividing the text length in a node by the number of its child nodes;
The junk-word density represents the total length of all junk words in a node divided by the length of all text in the node.
Wherein, the junk words are identified on the basis of a dictionary.
Wherein, the floor division module comprises:
A position division module, adapted to divide floors according to the positional relationship of the body-text nodes on the DOM tree; and/or
A feature-word division module, adapted to divide floors according to feature words in the web page.
Wherein, the position division module divides according to the following rules:
If two body-text nodes are adjacent on the DOM tree, the two nodes belong to the same floor;
If a body-text node has the same parent node as other body-text nodes that have already been judged to belong to one floor, these body-text nodes belong to the same floor;
If the common parent node of two body-text nodes is the root node, the two body-text nodes are assigned to different floors;
And, if the relationship between body-text nodes is not covered by the above cases, the nodes are assigned to different floors.
Wherein, the feature words comprise author information in body-text nodes and/or the posting time, registration time or news comments in non-body-text nodes.
Wherein, the junk-word nodes, once identified, serve as a basis for dividing the body-text nodes into floors.
Wherein, the mobile terminal page generation module comprises:
A layout generation module, adapted to re-lay out the content of the body-text nodes according to the divided floors and generate the mobile terminal page.
A method for identifying the body text of a web page, the method comprising:
Parsing the source code of the web page, performing layout calculation on the parsing result, and generating the DOM tree of the web page;
Traversing the DOM tree starting from its root node, and identifying the body-text nodes and/or junk-word nodes in the DOM tree;
Dividing the identified body-text nodes according to the floors of the web page.
Wherein, the nodes of the DOM tree are divided according to the tags of the web page's markup language.
Wherein, after the identified body-text nodes are divided according to the floors of the web page, the mobile terminal page is generated.
Wherein, the root node is the body node.
Wherein, after it is generated, the DOM tree of the web page retains only the main elements of the web page.
Wherein, the main elements of the web page comprise text, image links and/or text formatting.
Wherein, the process of identifying the body-text nodes and/or junk-word nodes in the DOM tree comprises:
Calculating the node distribution value, text density and/or junk-word density of each web page;
Analyzing the node distribution value to determine how each node of each web page is composed, and comparing the text density and/or junk-word density with the corresponding preset thresholds;
Identifying as body text the content whose text density and/or junk-word density falls within the corresponding thresholds.
Wherein, the node distribution value represents the composition of the child nodes of a node, including the number of each kind of tag and the proportion of the child nodes that each tag accounts for;
The text density represents the average text length obtained by dividing the text length in a node by the number of its child nodes;
The junk-word density represents the total length of all junk words in a node divided by the length of all text in the node.
Wherein, the junk words are identified on the basis of a dictionary.
Wherein, the identified body-text nodes are divided according to the floors of the web page as follows:
Dividing floors according to the positional relationship of the body-text nodes on the DOM tree; and/or
Dividing floors according to feature words in the web page.
Wherein, the rules for dividing floors according to the positional relationship of the body-text nodes on the DOM tree are as follows:
If two body-text nodes are adjacent on the DOM tree, the two nodes belong to the same floor;
If a body-text node has the same parent node as other body-text nodes that have already been judged to belong to one floor, these body-text nodes belong to the same floor;
If the common parent node of two body-text nodes is the root node, the two body-text nodes are assigned to different floors;
And, if the relationship between body-text nodes is not covered by the above cases, the nodes are assigned to different floors.
Wherein, the feature words comprise author information in body-text nodes and/or the posting time, registration time or news comments in non-body-text nodes.
Wherein, the junk-word nodes, once identified, serve as a basis for dividing the body-text nodes into floors.
Wherein, the process of generating the mobile terminal page is:
Re-laying out the content of the body-text nodes according to the divided floors and generating the mobile terminal page.
After the content of a conventional Internet web page has been identified and extracted according to the present invention, BBS posts, main body text and comments can be extracted effectively, and the way the original page presents its body content in separate floors is restored, so that the displayed result keeps the original multi-"floor" character and gives the user an excellent reading experience.
Brief description of the drawings
Fig. 1 is a structural diagram of the system of the present invention.
Fig. 2 is a flowchart of the method of the present invention.
Fig. 3 is a DOM tree generated according to the present invention.
Fig. 4 is a schematic diagram of the mobile terminal web page generated from the DOM tree of Fig. 3.
Detailed description
Specific embodiments of the present invention are described in further detail below with reference to the drawings and examples. The following examples illustrate the present invention but do not limit its scope.
The structure of the system provided by the present invention is shown in Fig. 1.
The web page parsing and layout module 100 parses the web page source code and performs layout calculation. An HTML parsing engine is used to parse and lay out the HTML source code, for example a commonly used open-source HTML engine such as WebKit. Parsing and layout follow the tags in the web page source code, which may include but are not limited to div tags. The module generates the DOM tree of the web page and calculates the position and height at which each node is displayed when the page is rendered. The generated DOM tree is shown in Fig. 3.
Because the dynamic effects of Internet web pages are difficult to display on mobile terminals, they are discarded while the DOM tree is generated, and only image links and the text with its formatting are retained.
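The parsing step can be illustrated with a minimal Python sketch. It is only an assumption-laden illustration: it uses the standard-library `html.parser` rather than a layout engine such as WebKit, so it builds the tree but performs no layout calculation (no positions or heights). The `Node` and `TreeBuilder` classes, the `parse_html` helper, and the choice of `script` and `style` as stand-ins for "dynamic effects" are all hypothetical.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}  # tags without a closing tag
SKIPPED_TAGS = {"script", "style"}  # assumption: stand-ins for "dynamic effects"


class Node:
    """One node of the simplified DOM tree (hypothetical structure)."""

    def __init__(self, tag, attrs=None, text="", parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.text = text
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a Node tree, dropping the content of SKIPPED_TAGS."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        node = Node(tag, attrs, parent=self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend into non-void elements

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag in VOID_TAGS:
            return
        # Walk up to the nearest open element with this tag, if any.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if self.skip_depth or not data.strip():
            return
        text_node = Node("#text", text=data.strip(), parent=self.current)
        self.current.children.append(text_node)


def parse_html(source):
    """Return the root of the simplified DOM tree for an HTML string."""
    builder = TreeBuilder()
    builder.feed(source)
    builder.close()
    return builder.root
```

In this sketch the body node is reached as the child of the `html` node, and traversal for the later steps would start there.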
The node identification module 200 traverses the whole DOM tree starting from the body node and identifies body-text content and junk-word content. The identification mainly uses an algorithm that can derive classification rules from data, typically a decision tree algorithm.
The node identification module 200 comprises a statistical module, a comparison module and a body-text recognition module. First, the statistical module calculates the node distribution value, text density and junk-word density of each web page. Then the comparison module compares the node distribution value, text density and junk-word density with preset thresholds. Finally, the body-text recognition module 300 identifies as body text the content of those parts of the DOM tree whose node distribution value, text density and junk-word density fall within the thresholds.
Here, the node distribution represents the composition of the child nodes of a node, that is, the number of each kind of tag such as div, img and table, and the proportion of the child nodes that each accounts for. The text density is the average text length obtained by dividing the text length in a node by the number of its child nodes. The junk-word density (for non-body-text vocabulary) is the total length of all junk words, such as advertising words, in a node divided by the length of all text in the node. Junk words are identified on the basis of a manually maintained dictionary and include words and phrases on the page that are unrelated to the body text, such as "Print preview", "Top", "Hot comments" and "No hot comments yet".
Thresholds for the above three features are derived with the decision tree algorithm. Nodes within the threshold range are all identified as body text, and the rest are identified as junk words.
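The three features can be sketched in Python as follows. This is an illustrative reading of the definitions above, not the patented implementation: the `Node` dataclass, the function names and the contents of `JUNK_WORDS` are assumptions, and lengths are counted in characters.

```python
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical junk-word dictionary; in the text it is maintained manually.
JUNK_WORDS = ["Print preview", "Hot comments", "No hot comments yet"]


@dataclass
class Node:
    tag: str
    text: str = ""
    children: list = field(default_factory=list)


def all_text(node):
    """Concatenate the text of a node and of all its descendants."""
    return node.text + "".join(all_text(child) for child in node.children)


def node_distribution(node):
    """Map each child tag to (count, share of all child nodes)."""
    counts = Counter(child.tag for child in node.children)
    total = len(node.children)
    return {tag: (n, n / total) for tag, n in counts.items()}


def text_density(node):
    """Text length in the node divided by its number of child nodes."""
    return len(all_text(node)) / max(len(node.children), 1)


def junk_word_density(node):
    """Total length of junk-word occurrences divided by total text length."""
    text = all_text(node)
    if not text:
        return 0.0
    junk_length = sum(text.count(word) * len(word) for word in JUNK_WORDS)
    return junk_length / len(text)
```

Note that this simple count can double-count dictionary entries that overlap one another, so a real dictionary matcher would need to handle overlaps.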
The floor division module comprises a position division module and a feature-word division module.
The position division module divides and identifies floors according to the paths and positional relationships of the body-text nodes on the DOM tree, following these rules:
1. If two body-text nodes are adjacent on the DOM tree, the two nodes belong to the same floor.
As shown in Fig. 3, br denotes a line break, and the br tag is an empty tag. Text node 1 and text node 2 have the same parent node div1 and are adjacent, so text node 1 and text node 2 can be identified as nodes in the same floor.
2. If a body-text node has the same parent node as other body-text nodes that have already been judged to belong to one floor, these body-text nodes belong to the same floor.
In Fig. 3, text node 3 has the same parent node div1 as text node 2 and text node 1, and text node 2 and text node 1 have already been determined to belong to the same floor, so text node 3 also belongs to that floor.
3. If the common parent node of two body-text nodes is body, the two body-text nodes are assigned to different floors.
For example, the paths of text node 1 and text node 4 of Fig. 3 in the DOM tree are, respectively:
Text 1 → div1 → body
Text 4 → div3 → body
The common parent node of the two paths is body, so the nodes should be identified as belonging to different floors.
4. If the relationship between body-text nodes is not covered by the above cases, the nodes are assigned to different floors.
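The four rules above can be sketched in Python under a simplifying assumption: body-text nodes that share a direct parent other than the root are placed in one floor, and everything else starts a new floor. The function name `divide_by_position`, the representation of a node as a name plus its ancestor path, and the parent `div4` for texts 5 and 6 are hypothetical; Fig. 3 is not reproduced here.

```python
def divide_by_position(text_nodes, root="body"):
    """Group body-text nodes into floors.

    text_nodes: list of (name, ancestors) in document order, where
    ancestors lists the node's ancestors from its parent up to the root.
    Ancestor identifiers are assumed to be unique within the page.
    """
    floors = []            # each floor is a list of node names
    floor_of_parent = {}   # parent identifier -> index into floors
    for name, ancestors in text_nodes:
        parent = ancestors[0]
        if parent != root and parent in floor_of_parent:
            # Rules 1 and 2: same non-root parent as an existing floor.
            floors[floor_of_parent[parent]].append(name)
        else:
            # Rules 3 and 4: common parent is the root, or no rule applies.
            floors.append([name])
            if parent != root:
                floor_of_parent[parent] = len(floors) - 1
    return floors


# Hypothetical tree loosely modelled on the paths quoted in the text.
example = [
    ("text 1", ["div1", "body"]),
    ("text 2", ["div1", "body"]),
    ("text 3", ["div1", "body"]),
    ("text 4", ["div3", "body"]),
    ("text 5", ["div4", "body"]),
    ("text 6", ["div4", "body"]),
]
floors = divide_by_position(example)
```

Treating "adjacent" as "sharing a parent" is a design shortcut in this sketch; a stricter reading of rule 1 would also check sibling order.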
The feature-word division module divides floors mainly according to feature words in the nodes. For example, in BBS posts and news comments, the content that an author posts is displayed together with information about that author, and the two alternate, generally as:
Author information → text → author information → text → author information → text
By identifying keywords in non-body-text nodes that indicate author information (such as posting time and registration time), the body text is further divided into "floors".
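A minimal sketch of this alternation-based division follows. The regular expression, the English keywords `posted` and `registered`, the date format and the function name `divide_by_feature_words` are all assumptions made for illustration; a real page would need its own keyword list.

```python
import re

# Hypothetical author-information pattern: a keyword followed by a date.
AUTHOR_INFO = re.compile(r"(posted|registered)\b.*\d{4}-\d{2}-\d{2}", re.IGNORECASE)


def divide_by_feature_words(blocks):
    """Start a new floor at every block that looks like author information.

    blocks: text blocks of the page in document order.
    Returns a list of floors, each with its author line and body texts.
    """
    floors = []
    for block in blocks:
        if AUTHOR_INFO.search(block):
            floors.append({"author_info": block, "text": []})
        elif floors:
            floors[-1]["text"].append(block)
        else:
            # Body text seen before any author line gets its own floor.
            floors.append({"author_info": None, "text": [block]})
    return floors


thread = [
    "alice posted on 2012-06-25",
    "First post.",
    "More of the first post.",
    "bob registered 2011-01-02",
    "A reply.",
]
result = divide_by_feature_words(thread)
```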
The mobile terminal page generation module comprises a layout generation module, which re-lays out the content of the body-text nodes according to the divided floors and generates the mobile terminal page. In the above process, for the DOM tree shown in Fig. 3, the floor division result for the body-text nodes is shown in Fig. 4, namely:
Floor 1: text 1, text 2, text 3;
Floor 2: text 4;
Floor 3: text 5, text 6.
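As an illustration of the re-layout step, the following Python sketch renders divided floors as a very plain HTML page. The markup (one `div` per floor, one `p` per text), the class and id names, and the function name `render_mobile_page` are hypothetical choices, not taken from the source.

```python
from html import escape


def render_mobile_page(floors):
    """Render floors (lists of body-text strings) as a simple HTML page."""
    parts = ["<html><body>"]
    for index, texts in enumerate(floors, start=1):
        parts.append(f'<div class="floor" id="floor-{index}">')
        # Escape the text so that extracted content cannot inject markup.
        parts.extend(f"<p>{escape(text)}</p>" for text in texts)
        parts.append("</div>")
    parts.append("</body></html>")
    return "\n".join(parts)


page = render_mobile_page([
    ["text 1", "text 2", "text 3"],
    ["text 4"],
    ["text 5", "text 6"],
])
```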
The flowchart of the method provided by the present invention is shown in Fig. 2.
S102: the web page source code is parsed and layout calculation is performed. An HTML parsing engine is used to parse and lay out the HTML source code, for example a commonly used open-source HTML engine such as WebKit. Parsing and layout follow the tags in the web page source code, mainly div tags. The DOM tree of the web page is generated, and the position and height at which each node is displayed when the page is rendered are calculated. The generated DOM tree is shown in Fig. 3.
Because the dynamic effects of Internet web pages are difficult to display on mobile terminals, they are discarded while the DOM tree is generated, and only image links and the text with its formatting are retained.
S104: the whole DOM tree is traversed starting from the body node, and body-text content and junk-word content are identified. The identification mainly uses an algorithm that can derive classification rules from data, typically a decision tree algorithm.
First, the node distribution value, text density and junk-word density of each web page are calculated. Then the node distribution value, text density and junk-word density are compared with preset thresholds. Finally, the content of those parts of the DOM tree that do not exceed the thresholds is identified as body text.
Here, the node distribution represents the composition of the child nodes of a node, that is, the number of each kind of tag such as div, img and table, and the proportion of the child nodes that each accounts for. The text density is the average text length obtained by dividing the text length in a node by the number of its child nodes. The junk-word density (for non-body-text vocabulary) is the total length of all junk words, such as advertising words, in a node divided by the length of all text in the node. Junk words are identified on the basis of a manually maintained dictionary and include words and phrases on the page that are unrelated to the body text, such as "Print preview", "Top", "Hot comments" and "No hot comments yet".
Thresholds for the above three features are derived with the decision tree algorithm. Nodes within the threshold range are all identified as body text, and the rest are identified as junk words.
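The threshold comparison can be pictured as a small hand-written decision rule. The sketch below is only an assumption about what such a rule might look like: the threshold values are invented for illustration, whereas the text states that they are derived with a decision tree algorithm, and `Features`, `classify` and the `text_child_ratio` field (the share of child nodes that are text, taken from the node distribution) are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Features:
    text_density: float       # text length / number of child nodes
    junk_word_density: float  # junk-word length / total text length
    text_child_ratio: float   # share of child nodes that are text nodes


# Invented thresholds, for illustration only.
MIN_TEXT_DENSITY = 10.0
MAX_JUNK_WORD_DENSITY = 0.3
MIN_TEXT_CHILD_RATIO = 0.2


def classify(features):
    """Return "body" if every feature is within its threshold, else "junk"."""
    if features.junk_word_density > MAX_JUNK_WORD_DENSITY:
        return "junk"
    if features.text_density < MIN_TEXT_DENSITY:
        return "junk"
    if features.text_child_ratio < MIN_TEXT_CHILD_RATIO:
        return "junk"
    return "body"
```

In practice the thresholds, and the order in which the features are tested, would come from a decision tree fitted to labelled pages rather than from constants like these.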
S106: the identified body-text nodes are divided according to the floors of the web page, using division by position and division by feature words.
Division by position divides and identifies floors according to the paths and positional relationships of the body-text nodes on the DOM tree, following these rules:
1. If two body-text nodes are adjacent on the DOM tree, the two nodes belong to the same floor.
As shown in Fig. 3, br denotes a line break, and the br tag is an empty tag. Text node 1 and text node 2 have the same parent node div1 and are adjacent, so text node 1 and text node 2 can be identified as nodes in the same floor.
2. If a body-text node has the same parent node as other body-text nodes that have already been judged to belong to one floor, these body-text nodes belong to the same floor.
In Fig. 3, text node 3 has the same parent node div1 as text node 2 and text node 1, and text node 2 and text node 1 have already been determined to belong to the same floor, so text node 3 also belongs to that floor.
3. If the common parent node of two body-text nodes is body, the two body-text nodes are assigned to different floors.
For example, the paths of text node 1 and text node 4 of Fig. 3 in the DOM tree are, respectively:
Text 1 → div1 → body
Text 4 → div3 → body
The common parent node of the two paths is body, so the nodes should be identified as belonging to different floors.
4. If the relationship between body-text nodes is not covered by the above cases, the nodes are assigned to different floors.
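The path comparison used in rule 3 can be sketched as a lowest-common-ancestor check on the two paths. This is an illustrative helper under stated assumptions: paths are written from the node up to body, as in the text; identifiers such as `div1` are assumed unique; and `same_floor` reduces rules 1 and 2 to "same direct parent", which is a simplification.

```python
def lowest_common_ancestor(ancestors_a, ancestors_b):
    """Deepest ancestor shared by two ancestor lists (parent first, root last)."""
    common = None
    for a, b in zip(reversed(ancestors_a), reversed(ancestors_b)):
        if a != b:
            break
        common = a  # keep descending while the paths still agree
    return common


def same_floor(path_a, path_b, root="body"):
    """Apply the position rules to two node-to-root paths."""
    ancestors_a, ancestors_b = path_a[1:], path_b[1:]
    if lowest_common_ancestor(ancestors_a, ancestors_b) == root:
        return False  # rule 3: the common parent is the root
    if ancestors_a and ancestors_b and ancestors_a[0] == ancestors_b[0]:
        return True   # rules 1 and 2, simplified to "same direct parent"
    return False      # rule 4: not covered by the other rules


text1 = ["text 1", "div1", "body"]
text2 = ["text 2", "div1", "body"]
text4 = ["text 4", "div3", "body"]
```

With these paths, text 1 and text 4 meet only at body and fall into different floors, while text 1 and text 2 share div1 and fall into the same floor.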
Division by feature words divides floors according to feature words in the text. For example, in BBS posts and news comments, the content that an author posts is displayed together with information about that author, and the two alternate, generally as:
Author information → text → author information → text → author information → text
By identifying keywords in non-body-text nodes that indicate author information (such as posting time and registration time), the body text is further divided into "floors".
To generate the mobile terminal page, the content of the body-text nodes is re-laid out according to the divided floors. In the above process, for the DOM tree shown in Fig. 3, the floor division result for the body-text nodes is shown in Fig. 4, namely:
Floor 1: text 1, text 2, text 3;
Floor 2: text 4;
Floor 3: text 5, text 6.
It should be noted that the components of the controller of the present invention are divided logically according to the functions they are to implement. The present invention is not limited to this division, however, and the components may be re-divided or combined as needed; for example, some components may be combined into a single component, or some components may be further decomposed into more sub-components.
The component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination of the two. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the controller according to embodiments of the present invention. The present invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the present invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the present invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The present invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third and so on does not indicate any order; these words may be interpreted as names.