CN103064966A - Method for extracting regular noise from single record web pages - Google Patents


Info

Publication number
CN103064966A
Authority
CN
China
Prior art keywords
node
text
tree
dom tree
iterator
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2012105927950A
Other languages
Chinese (zh)
Other versions
CN103064966B (en)
Inventor
程学旗
李海燕
郭岩
万圣贤
郭少华
刘悦
余智华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Application filed by Institute of Computing Technology of CAS
Priority to CN201210592795.0A
Publication of CN103064966A
Application granted
Publication of CN103064966B
Legal status: Active

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method for extracting regular noise from single-record web pages. The method first converts a plurality of single-record web pages into Document Object Model (DOM) trees and classifies the DOM trees by structure. It then aligns and merges the DOM trees belonging to the same class to obtain a site section style tree, locates the approximate positions of the text title node and the text main body node in that tree, and finally extracts the regular noise before the text, within the text, and after the text according to those two nodes. The method reduces the space needed to build the site section style tree, reduces missed extractions, and speeds up extraction. In addition, the extraction results have high accuracy and high reliability.

Description

A method for extracting regular noise from single-record web pages
Technical Field
The present invention relates to the field of network information retrieval and, more specifically, to a method for separately extracting the regular noise before the text, within the text, and after the text of a single-record web page (that is, a web page whose data record has only one style, where the data record refers to the region containing the main part of the page).
Background Art
In the information age there are more and more ways to obtain information. As a carrier of information, the Internet holds an irreplaceable position in terms of propagation efficiency and information capacity, and it has become an important source from which people obtain knowledge and information. However, with the rapid development of Web technology, the mass of data on the Internet grows by orders of magnitude every day, its content is all-encompassing, and its forms are highly varied. Noise also makes up a certain proportion of web page content. For researchers and practitioners, some noise content is unnecessary when processing web data, and some can seriously degrade the results of research and applications. As the forms of noise diversify, it also becomes harder for users to obtain the information they need from the Internet. Removing noise from web pages has therefore become an important preprocessing step for further processing of web data. The question of how to better eliminate web page noise and find meaningful information has made Web denoising a distinct research area within network information retrieval. As techniques such as information retrieval, text classification, and search engines are applied to the Web, removing web page noise becomes even more important.
Noise content on web pages can be divided by granularity into global noise and local noise. Global noise refers to coarse-grained noise, typically web pages with duplicated content (such as mirror sites and copied articles). Local noise refers to content within a web page that is unrelated to the application purpose or the topic. It is generally tied to the page template itself, for example advertisements, navigation bars, site statements, hyperlinks to related articles, copyright information, and noise links. The preprocessing stage of network information retrieval needs to identify and remove parts such as navigation bars and peer links in order to improve retrieval quality; web information mining likewise needs noise removed in advance to improve mining quality. The present invention provides a method for extracting local noise. Hereinafter, unless otherwise specified, "noise" refers to local noise.
In recent years researchers have done a great deal of work on removing local noise. The approach with comparatively good results is multi-model denoising, which applies different models to different web pages. For reasons such as ease of maintenance, most websites generate pages automatically from predefined templates, so apart from the topical content (such as the text) the rest of each page is largely identical. In general, the sections of different websites use different templates; Figs. 1A and 1B show two such web page templates. Multi-model denoising mainly performs template detection for the sections of different websites. Because navigation bars, advertisements, site statements, copyright notices, and the like generally sit in the site template, removing the template amounts to removing part of the local noise. The usual procedure is to first detect the content or structure template of the whole website; then, when a page from that website needs to be processed, the template content of the page is deleted and what remains is the denoised content.
A well-known multi-model denoising method is SST (Site Style Tree). SST merges the HTML DOM (Document Object Model) trees of web pages (an HTML DOM tree presents an HTML document as a tree of elements, attributes, and text) and then judges which nodes are noise. The judgment rests on two assumptions:
(1) the more presentation styles a node has, the higher its importance;
(2) the more content branches a node has, the higher its importance.
The final importance of an element node is composed of its style importance and its content importance; the smaller the value, the more likely the node is noise. SST proceeds roughly as follows:
(1) First, N DOM trees from the same website are aligned and merged. During merging, the subtrees of each node that differ in structure or content are recorded; SST creates a child node of a different style for each differing branch and sets a counter on each node that records how many of the N web pages contain that node with the same style and content. For example, merging Tree1 (the DOM tree of a web page) shown in Fig. 2A with Tree2 shown in Fig. 2B yields the result in Fig. 2C, where the numbers indicate how many times each node occurs.
(2) The style importance and content importance weights of each node are calculated, and the node is judged to be a noise node or not according to the size of the weight.
The SST denoising method is fairly accurate but has several defects:
(1) With the development of web technology, the DOM tree structures of different sections differ more and more even within one website. This gives the SST too many branch nodes during tree construction. When DOM trees of different structures are unevenly distributed in number, a given noise branch may occur too rarely in the SST, so only part of the noise can be extracted from pages with that DOM structure.
(2) Even if pages are classified by DOM structure and the algorithm is applied to structurally similar DOM trees, when a layer of nodes (say, 10 nodes) differs in only one node, SST still creates separate child nodes for the different branches. This wastes a large amount of space and greatly reduces tree-building efficiency.
(3) When SST creates child nodes of different styles for different branches, the branch granularity tends to be too coarse, so some small noise items are missed.
(4) For single-record pages in particular, SST cannot locate the position of noise relative to the topical part, yet at present noise extraction for single-record pages is closer to practical engineering applications. A web page usually contains several data regions, and different data regions hold different data records. Unlike multi-record pages, which present each record in a similar style, the main part of a single-record page has a data record of only one style. Examples of single-record pages are the article pages of news sites and blogs, whose text part has only one style.
Researchers have also carried out work targeting different noise types and different application needs. Text extraction from a single-record web page (such as a news page or a blog article page) means extracting the most important main body of text in the page, which serves as the basic data for much subsequent analysis and mining. For example, Figs. 3C and 3D show the text main body within the text of the web pages shown in Figs. 3A and 3B, respectively. Extracting the text content accurately is the key factor for result quality. However, text that recurs across pages and is unrelated to the news or blog article, namely regular noise (Figs. 3E and 3F show the regular noise of the web pages shown in Figs. 3A and 3B, respectively), clings to the text main body, so the main body is not clearly delimited and its information cannot be located precisely. A text extractor may pick up noise such as navigation that appears before the text, noise such as the author and source that appears between the text title and the text main body, and noise such as related articles, peer links, and comments that appears after the text. Filtering out in advance the noise keywords before the text, those between the text title and the text main body (the in-text noise keywords), and those after the text, or even everything following the noise keywords after the text, therefore greatly helps both the speed and the quality of text extraction from single-record pages. Removing the regular noise before, within, and after the text is thus an important preprocessing step in web text information extraction. The key is not only to extract the regular noise but also to determine its position relative to the text, that is, to extract separately the regular noise before the text, within the text, and after the text of a single-record web page.
In summary, for extracting regular noise from single-record pages, existing noise extraction methods suffer from missed extractions, wasted space, and low efficiency, and they cannot locate the position of the regular noise relative to the topical part.
Summary of the Invention
To address the above problems, one embodiment of the present invention provides a method for extracting regular noise from single-record web pages, the method comprising:
Step 1): converting a plurality of single-record web pages into DOM trees, and classifying the DOM trees by structure;
Step 2): aligning and merging the DOM trees of the same class to obtain a site section style tree;
Step 3): extracting the regular noise before the text, within the text, and after the text according to the approximate positions of the text title node and the text main body node in the site section style tree.
In one embodiment, classifying the DOM trees by structure in step 1) comprises:
Step 11): selecting one DOM tree as a known class;
Step 12): selecting, from the remaining unclassified DOM trees, a DOM tree to be classified, and calculating its similarity to every DOM tree in the known classes;
Step 13): determining whether the maximum similarity calculated in step 12) is greater than or equal to a predefined threshold; if so, assigning the DOM tree to be classified to the known class containing the DOM tree with which it has the maximum similarity; if not, creating a new class that contains the DOM tree to be classified and treating it as a known class;
Step 14): if unclassified DOM trees remain, returning to step 12); otherwise returning the set of classified DOM trees.
In a further embodiment, the similarity between the DOM tree to be classified and a DOM tree of a known class is calculated in step 12) as follows:
Step 121): adding the set consisting of the root-node iterators of the two DOM trees being compared to a queue;
Step 122): popping the iterator set at the head of the queue and matching the child iterators of the two iterators in the set, yielding two aligned sets of iterators;
Step 123): traversing the aligned iterators; each set of matched iterators is added to the queue and the match information of the iterators in it is set to 1, while the match information of unmatched iterators is set to 0; if the queue is not empty, returning to step 122);
Step 124): for the DOM tree to be classified, calculating the similarity weight of each iterator from the bottom up with the formula:
similarity weight of an iterator = match information of the iterator itself + (percentage of its child iterators that are matched) × (sum of the similarity weights of all its child iterators),
and returning the similarity weight of the root iterator as the similarity between the DOM tree to be classified and the known-class DOM tree.
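The formula in step 124) can be read as a post-order computation over iterators whose match information is already known. The following is a minimal sketch of that literal reading; the `MatchedIterator` class is a hypothetical stand-in for an iterator after steps 121) to 123), and the "percentage" is taken as a fraction between 0 and 1, which is an assumption. The source does not say how the result is normalized before it is compared with the threshold, so no normalization is attempted here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MatchedIterator:
    """Hypothetical stand-in for an iterator after the matching steps."""
    matched: int  # match information: 1 if matched, 0 otherwise
    children: List["MatchedIterator"] = field(default_factory=list)


def similarity_weight(it: MatchedIterator) -> float:
    """Own match info + matched-children fraction * sum of child weights."""
    if not it.children:
        return float(it.matched)
    # Fraction of child iterators that were matched (assumed to be in [0, 1]).
    matched_fraction = sum(c.matched for c in it.children) / len(it.children)
    # Children are evaluated first, so the computation runs bottom-up.
    child_sum = sum(similarity_weight(c) for c in it.children)
    return it.matched + matched_fraction * child_sum


root = MatchedIterator(1, [MatchedIterator(1), MatchedIterator(0)])
print(similarity_weight(root))
```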
In a further embodiment, step 122) uses the Needleman-Wunsch algorithm to match the child iterators of the two iterators in the set.
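As an illustration of the matching in step 122), the sketch below aligns two sequences of child tag names with the Needleman-Wunsch global alignment algorithm. Representing each child iterator by its tag name and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for the example; the source does not state them.

```python
from typing import List, Optional, Tuple

Pair = Tuple[Optional[str], Optional[str]]


def needleman_wunsch(a: List[str], b: List[str],
                     match: int = 1, mismatch: int = -1,
                     gap: int = -1) -> List[Pair]:
    """Globally align two tag sequences; None marks a gap."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right cell to recover one optimal alignment.
    pairs: List[Pair] = []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((a[i - 1], None))
            i -= 1
        else:
            pairs.append((None, b[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


print(needleman_wunsch(["div", "p", "span"], ["div", "span"]))
```

Pairs in which both entries are equal would count as matched (match information 1), and pairs containing a gap or two different tags as unmatched (0).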
In one embodiment, the number of DOM trees of the same class in step 2) is greater than or equal to 2. In a further embodiment, aligning and merging the DOM trees of the same class in step 2) comprises:
Step 21): aligning the nodes of each corresponding layer of the DOM trees of the same class;
Step 22): inserting the aligned nodes at each position into the site section style tree, wherein:
if the nodes at the position are all tag nodes, the first tag node is inserted at the corresponding position in the site section style tree;
if the nodes at the position are all text leaf nodes, the number of occurrences of each text leaf node is counted and recorded, and every text leaf node with distinct content is inserted at the corresponding position in the site section style tree;
if some of the nodes at the position are text leaf nodes and some are tag nodes, the first tag node is inserted at the corresponding position in the site section style tree, the number of occurrences of each text leaf node is counted and recorded, and every text leaf node with distinct content is also inserted at the corresponding position in the site section style tree.
In a further embodiment, step 21) uses the center star algorithm to align the nodes of each corresponding layer of the DOM trees of the same class.
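The center star method aligns several sequences by first choosing a center sequence, the one with the smallest total distance to all the others, and then aligning every other sequence against it. The sketch below shows only that first step. Using edit distance over tag names as the pairwise distance, and the function names, are assumptions made for the example.

```python
from typing import List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two tag sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or keep
        prev = cur
    return prev[-1]


def choose_center(layers: List[Sequence[str]]) -> int:
    """Index of the sequence with the smallest total distance to the rest."""
    totals = [sum(edit_distance(s, t) for t in layers) for s in layers]
    return totals.index(min(totals))


layers = [["div", "p", "span"], ["div", "p"], ["div", "p", "span", "a"]]
print(choose_center(layers))
```

The remaining sequences would then each be aligned pairwise with the chosen center, and the gaps introduced in the center propagated to all alignments.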
In one embodiment, locating the approximate position of the text title node in the site section style tree in step 3) comprises:
Step 31): traversing the site section style tree obtained in step 2) and finding the content of the <title> tag node;
Step 32): traversing the site section style tree again and looking for <h1> to <h6> tag nodes;
if any are found, calculating the similarity between the text inside each <h1> to <h6> node and the text inside <title>;
if none are found, searching for tag nodes that cause a line break or carry title features, and, if the class or style attribute of such a tag node contains tit, center, middle, big, biao, head, bt, or topic, calculating the similarity between the text in that tag node and the text inside <title>;
Step 33): finding the tag node with the maximum similarity; if this tag node is not an <h1> to <h6> node, traversing the site section style tree to check whether any other node has the same class or style attribute as this tag node; if the attribute is unique, the content of this node is taken to be the title.
In one embodiment, locating the approximate position of the text main body node in the site section style tree in step 3) comprises:
Step 34): traversing the site section style tree and finding the text nodes that contain a Chinese full stop "。" or an English full stop "." and satisfy the following conditions:
$$X \ge 0; \qquad X < \frac{(X + \mathit{Width})_{\max}}{2}; \qquad (X + \mathit{Width}) > \frac{(X + \mathit{Width})_{\max}}{2}$$
where $X$ is the abscissa of the text node on the page plane, $\mathit{Width}$ is the width of the text node, and $(X + \mathit{Width})_{\max}$ is the width of the web page;
recording the abscissa and width of each full-stop text node;
Step 35): if the number of full-stop text nodes obtained in step 34) is greater than or equal to 2, traversing those full-stop text nodes from back to front and finding the full-stop text node that satisfies the following conditions:
$$PN(i) > PN(j), \qquad PN(j) = \max\bigl(PN(i+1), PN(i+2), \ldots, PN(n-1)\bigr)$$
$$IR(i) > IR(j), \qquad IR(i) = \frac{PN(i) - PN(j)}{PN(j)}$$
where there are $n$ full-stop text nodes, $PN(i)$ is the number of full stops in node $i$, $IR(i)$ is the full-stop growth rate of node $i$, $i \in [0, n-2]$, $j \in [i+1, n-1]$, and $IR(n-1)$ is a predetermined threshold;
then clustering the full-stop text nodes before this node and those after it separately; if the number of full-stop text nodes obtained in step 34) is 1, that full-stop text node is the text node and step 36) is skipped;
Step 36): calculating the difference in ordinate between the first node of a class obtained in step 35) and the text title; if this value is greater than a predetermined threshold, returning to step 35), until the class whose node is closest to the title node on the ordinate is found and taken as the final text node.
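The selection condition in step 35) can be sketched as a backward scan over the full-stop counts. This is a literal reading of the two conditions and rests on several assumptions: the function name is hypothetical, j is taken as the first later node with the largest full-stop count, and the scan stops at the first node that satisfies both conditions. Every node is assumed to contain at least one full stop, as step 34) guarantees.

```python
from typing import List, Optional


def find_growth_node(pn: List[int], last_threshold: float) -> Optional[int]:
    """Scan from back to front for a node i with PN(i) > PN(j), IR(i) > IR(j).

    pn[k] is the number of full stops in node k (each assumed >= 1);
    last_threshold plays the role of the predetermined IR(n-1).
    """
    n = len(pn)
    if n < 2:
        return None
    ir = [0.0] * n
    ir[n - 1] = last_threshold
    for i in range(n - 2, -1, -1):
        # j is the later node with the largest full-stop count (first on ties).
        j = max(range(i + 1, n), key=lambda k: pn[k])
        ir[i] = (pn[i] - pn[j]) / pn[j]
        if pn[i] > pn[j] and ir[i] > ir[j]:
            return i
    return None


print(find_growth_node([1, 9, 2, 2], last_threshold=0.5))
```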
In a further embodiment, clustering the full-stop text nodes before and after said node in step 35) comprises:
for a Chinese web page, clustering the full-stop text nodes whose byte difference from one another does not exceed 400 bytes;
for an English web page, clustering the full-stop text nodes whose byte difference from one another does not exceed 1000 bytes.
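A minimal sketch of this clustering follows. It assumes that the "byte difference" is measured between the byte offsets of consecutive full-stop text nodes in the page source and that clusters are formed by chaining neighbours; the function name and the language codes "zh" and "en" are hypothetical.

```python
from typing import List


def cluster_by_byte_gap(offsets: List[int], language: str) -> List[List[int]]:
    """Group byte offsets whose gap to the previous offset is within the limit."""
    limit = 400 if language == "zh" else 1000  # limits stated in the text
    clusters: List[List[int]] = []
    for off in sorted(offsets):
        if clusters and off - clusters[-1][-1] <= limit:
            clusters[-1].append(off)   # close enough: extend current cluster
        else:
            clusters.append([off])     # gap too large: start a new cluster
    return clusters


print(cluster_by_byte_gap([0, 900, 2500], "en"))
print(cluster_by_byte_gap([0, 900, 2500], "zh"))
```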
In one embodiment, extracting the regular noise before the text, within the text, and after the text in step 3) comprises:
for each text node between the node at which traversal of the site section style tree begins and the text title node, if its number of occurrences is greater than or equal to a predetermined threshold, adding it to the set of regular noise before the text;
for each text node between the text title node and the start node of the text main body in the site section style tree, if its number of occurrences is greater than or equal to a predetermined threshold, adding it to the set of regular noise within the text;
for each text node between the end node of the text main body and the node at which traversal of the site section style tree ends, if its number of occurrences is greater than or equal to a predetermined threshold, adding it to the set of regular noise after the text.
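The three rules above amount to partitioning the frequent text nodes by their position relative to the title and the main body. The sketch below assumes the text nodes are already listed in traversal order with their occurrence counts and that the positions of the title and of the main body's start and end are known as indices into that list. All names are hypothetical.

```python
from typing import Dict, List, Tuple


def split_regular_noise(nodes: List[Tuple[str, int]], title_idx: int,
                        body_start: int, body_end: int,
                        min_count: int) -> Dict[str, List[str]]:
    """nodes holds (text, occurrence count) pairs in traversal order."""
    noise: Dict[str, List[str]] = {"before": [], "within": [], "after": []}
    for idx, (text, count) in enumerate(nodes):
        if count < min_count:
            continue                      # too rare to be regular noise
        if idx < title_idx:
            noise["before"].append(text)  # between traversal start and title
        elif title_idx < idx < body_start:
            noise["within"].append(text)  # between title and main body start
        elif idx > body_end:
            noise["after"].append(text)   # between main body end and traversal end
    return noise


nodes = [("Home", 5), ("Some headline", 1), ("Source: X", 5),
         ("Body sentence.", 1), ("Related articles", 4), ("One-off comment", 1)]
print(split_regular_noise(nodes, title_idx=1, body_start=3, body_end=3, min_count=3))
```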
Compared with existing noise extraction methods, the present invention has the following beneficial effects:
(1) Before merging DOM trees, the method first judges whether their structures are similar and classifies the pages automatically by DOM tree structure, then builds a site section style tree (SBSTree, Site Board Style Tree) for structurally similar pages. Noise extraction is thus more targeted, and the situation in which a class of DOM structures has too few trees and noise is missed does not arise.
(2) A multi-string matching algorithm is used while building the SBSTree, and only unmatched tag nodes and text nodes with differing content are merged in. This reduces the merge granularity, which in turn reduces the space needed to build the site section style tree, reduces possible missed extractions, and speeds up extraction.
(3) Locating the approximate position of the text main body node in the merged site section style tree combines two kinds of rules, visual information and text information. Compared with the SST rules these are more stable and do not change as page structure and content change. Given a suitable number of sampled pages from a website, the extraction results of the present invention have high accuracy and high reliability.
Description of the Drawings
Figs. 1A and 1B schematically show sections of two different websites that use different templates;
Figs. 2A and 2B show two structurally similar DOM trees, Tree1 and Tree2, according to an embodiment of the invention;
Fig. 2C shows the SST built by merging Tree1 of Fig. 2A with Tree2 of Fig. 2B;
Fig. 2D shows the SBSTree built by merging Tree1 of Fig. 2A with Tree2 of Fig. 2B;
Figs. 3A and 3B schematically show two structurally similar web pages;
Figs. 3C and 3D show the positions of the text and the text main body in the web pages of Figs. 3A and 3B, respectively;
Figs. 3E and 3F show the regular noise at each position in the web pages of Figs. 3A and 3B, respectively;
Fig. 3G shows the regular noise obtained by applying the method for extracting regular noise from single-record web pages according to an embodiment of the invention to the web pages of Figs. 3A and 3B;
Fig. 3H shows a summary chart of the regular noise extracted from the web pages of Figs. 3A and 3B using the method according to an embodiment of the invention;
Fig. 4 shows a flowchart of the method for extracting regular noise from single-record web pages according to an embodiment of the invention;
Figs. 5A and 5B schematically show two web pages;
Fig. 5C shows a chart of the classification results obtained by applying the method according to an embodiment of the invention to the web pages of Figs. 3A, 3B, 5A, and 5B;
Fig. 5D shows the regular noise obtained by applying the method according to an embodiment of the invention to the web pages of Figs. 5A and 5B;
Fig. 5E shows a summary chart of the regular noise extracted from the web pages of Figs. 5A and 5B using the method according to an embodiment of the invention.
Detailed Description
The present invention is described below with reference to the drawings and specific embodiments.
One embodiment of the present invention provides a method for extracting regular noise from single-record web pages. Based on the DOM tree structure, the visual information, and the text information of the web pages, and using a multi-template model, the method separately extracts the noise before the text, within the text, and after the text of single-record web pages. During extraction, n web pages (n ≥ 2) are first classified automatically by DOM tree structure; then m web pages of the same class (similar page structure, m ≥ 2) are matched and merged to form the site section style tree SBSTree; on this basis, visual and text rules are used to find the approximate positions of the text title and the text main body in the site section style tree (the merged DOM tree); finally, each text node is judged to be regular noise or not according to its frequency of occurrence.
As shown in Fig. 4, according to one embodiment of the invention, the method for extracting regular noise from single-record web pages comprises the following seven steps:
Step 1: read the content of a plurality of single-record web pages (n pages, n ≥ 2), convert these pages into DOM trees, and delete from each DOM tree the form-type nodes and the invisible nodes whose style.display property is none or whose style.visibility property is hidden.
Here the form-type tags comprise input, noembed, noscript, textarea, marquee, object, select, iframe, style, script, and img.
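The pruning in step 1 can be sketched on a simple dictionary-based tree. The node layout (`tag`, `style`, `children` keys) is a hypothetical representation, and the sketch only inspects an inline `style` string, whereas the source refers to the `style.display` and `style.visibility` properties of the DOM node, so this is a simplification.

```python
import re
from typing import Optional

# Tag names listed in the text as form-type nodes to delete.
REMOVED_TAGS = {"input", "noembed", "noscript", "textarea", "marquee",
                "object", "select", "iframe", "style", "script", "img"}
# Inline-style patterns treated as "invisible" (assumption: inline style only).
HIDDEN = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden", re.I)


def prune(node: dict) -> Optional[dict]:
    """Return a copy of the subtree without removed or invisible nodes."""
    if node["tag"] in REMOVED_TAGS or HIDDEN.search(node.get("style", "")):
        return None
    kept = [c for c in (prune(ch) for ch in node.get("children", []))
            if c is not None]
    return {**node, "children": kept}


tree = {"tag": "div", "children": [
    {"tag": "script", "children": []},
    {"tag": "p", "style": "display: none", "children": []},
    {"tag": "p", "children": [{"tag": "img", "children": []}]},
]}
print(prune(tree))
```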
Step 2: classify the DOM trees obtained in step 1 by structure.
DOM tree classification is carried out by calculating the similarity of two DOM trees. According to one embodiment of the invention, the DOM trees may be classified as follows:
A) from the n converted DOM trees, initially select one DOM tree as a known class;
B) from the remaining unclassified DOM trees, take one DOM tree (the DOM tree to be classified) and calculate its similarity to every DOM tree in the known classes;
C) take the maximum of the calculated similarities (attained between the DOM tree to be classified and a DOM tree in some known class, say class 1); if this maximum is greater than or equal to a predetermined threshold (preferably 0.5), the DOM tree to be classified belongs to that known class and is added to it; otherwise create a new class that contains this DOM tree and treat it as a known class;
repeat steps B) and C) until all DOM trees are classified;
D) return the set of DOM trees classified by structure.
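Steps A) to D) describe a greedy, threshold-based grouping that works with any pairwise similarity function. The sketch below takes the similarity function as a parameter; the Jaccard similarity over tag-name sets in the usage lines is only a toy stand-in for the DOM tree similarity and is not the measure described in the source.

```python
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def classify(trees: List[T], similarity: Callable[[T, T], float],
             threshold: float = 0.5) -> List[List[T]]:
    """Assign each tree to the class of its most similar classified tree."""
    classes: List[List[T]] = []
    for tree in trees:
        best_sim = -1.0
        best_class: Optional[List[T]] = None
        for cls in classes:
            for member in cls:
                s = similarity(tree, member)
                if s > best_sim:
                    best_sim, best_class = s, cls
        if best_class is not None and best_sim >= threshold:
            best_class.append(tree)   # join the class of the closest tree
        else:
            classes.append([tree])    # first tree, or nothing similar enough
    return classes


def jaccard(a: frozenset, b: frozenset) -> float:
    """Toy similarity used only for this example."""
    return len(a & b) / len(a | b)


t1 = frozenset({"div", "p", "h1"})
t2 = frozenset({"div", "p", "h1", "span"})
t3 = frozenset({"table", "tr", "td"})
print(classify([t1, t2, t3], jaccard))
```

Because trees are assigned in input order and never reassigned, the resulting classes can depend on that order.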
It should be understood that this embodiment does not limit the method of classifying DOM trees; other methods of classifying DOM trees by structure are also applicable. For example, the structural similarity of every pair of DOM trees can be computed, each DOM tree represented as a graph node, an edge drawn between DOM tree nodes whose similarity exceeds a specified threshold, and the nodes then classified with a community detection algorithm.
Step 3: perform language detection on the web page DOM trees of the same class.
A portion of text is taken from the web pages for language detection, the encoding of this text is determined, and the language is detected from the text and its encoding. If the language is English or a related language, the approximate position of the text main body is located using English punctuation; if it is Chinese or a related language, the approximate position is located using Chinese punctuation.
Step 4: align the DOM trees of the same class (that is, with similar tree structure) and build the site section style tree SBSTree, as follows:
A) Align the nodes of each corresponding layer of the m (m ≥ 2) DOM trees belonging to the same class. In one embodiment, the center star algorithm proposed by Gusfield can be used to align the nodes of each corresponding layer. The algorithm proceeds as follows:
(1) add the set of root-node iterators of the m DOM trees to a queue named queue;
(2) pop the iterator set at the head of queue, and align the child nodes of each iterator in the set by matching the iterators of the corresponding layer of the m (m ≥ 2) DOM trees;
(3) traverse the aligned iterators and add each set of matched iterators to queue;
(4) repeat steps (2) and (3) until queue is empty;
(5) return the aligned iterators of each corresponding layer of the m DOM trees.
B) Selectively insert the k aligned nodes at each position into the site section style tree (SBSTree, Site Board Style Tree). (For example, merging Tree1 shown in Fig. 2A with Tree2 shown in Fig. 2B into an SBSTree yields the result in Fig. 2D, where the numbers indicate how many times each node occurs.) The insertion rules are as follows:
if the nodes are all tag nodes (and the tag nodes are identical), the first tag node is inserted at the corresponding position in the SBSTree;
if the nodes are all text leaf nodes, the number of occurrences of each text leaf node is counted and recorded, and every text leaf node with distinct content is inserted at the corresponding position in the SBSTree (under the same parent node);
if some of the nodes are text leaf nodes and some are tag nodes, the first tag node is inserted at the corresponding position in the SBSTree, the number of occurrences of each text leaf node is counted and recorded, and every text leaf node with distinct content is also inserted at the corresponding position in the SBSTree (under the same parent node).
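The three insertion rules can be expressed as one function over the nodes aligned at a single position. In this sketch a node is a hypothetical `(kind, value)` pair, with kind `"tag"` or `"text"`, and `None` marks a gap left by the alignment. The function returns the tag to insert, if any, together with the occurrence count of each distinct text leaf.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

Node = Tuple[str, str]  # ("tag", tag name) or ("text", text content)


def merge_position(aligned: List[Optional[Node]]
                   ) -> Tuple[Optional[str], Dict[str, int]]:
    """Apply the insertion rules to the nodes aligned at one position."""
    present = [n for n in aligned if n is not None]
    # Rules 1 and 3: only the first tag node is inserted.
    tag = next((value for kind, value in present if kind == "tag"), None)
    # Rules 2 and 3: each distinct text leaf is inserted once, with its count.
    texts = Counter(value for kind, value in present if kind == "text")
    return tag, dict(texts)


print(merge_position([("text", "Home"), ("text", "Home"), ("text", "Story A")]))
print(merge_position([("tag", "p"), ("text", "Ad"), None]))
```

Text leaves with a high count across the merged pages are the candidates for regular noise in the later steps.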
Step 5: locate the approximate position of the main-text title node using heuristic rules, which comprise the following rules and steps:
A) Traverse the site board style tree obtained in step 4 and find the content of the <title> tag node;
B) Traverse the site board style tree again and look for <h1>-<h6> tag nodes, handling the following two cases:
If such nodes are found, compute the similarity between the text inside each of them and the text inside the <title> tag node;
If none is found, search for tag nodes that cause a line break or carry title characteristics, such as br, big, caption, center, div, font, p, span, strong, and td. If the class or style attribute of such a tag node contains tit, center, middle, big, biao, head, bt, or topic, compute the similarity between the text in that tag node and the text in the <title> tag node.
C) Find the tag node with the greatest similarity. If that node is not one of <h1>-<h6>, traverse the site board style tree to check whether any other node has the same class or style attribute as this tag node. If the class or style attribute of the node is unique, the content of the node is taken to be the title.
It should be understood that the above rules are illustrative only and do not limit how the approximate position of the main-text title node is found.
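The heuristic of step 5 can be sketched as follows, under several assumptions. The text does not name a text similarity measure, so the sketch substitutes `difflib.SequenceMatcher` from the Python standard library. The flat `(tag, class_or_style, text)` tuples, the function name `pick_title`, and returning `None` when the attribute is not unique are likewise hypothetical simplifications.

```python
from difflib import SequenceMatcher

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
FALLBACK_TAGS = {"br", "big", "caption", "center", "div", "font", "p", "span", "strong", "td"}
HINTS = ("tit", "center", "middle", "big", "biao", "head", "bt", "topic")


def pick_title(title_text, nodes):
    """nodes: list of (tag, class_or_style, text) tuples in traversal order."""
    # Case 1: prefer <h1>-<h6> nodes when any exist.
    candidates = [n for n in nodes if n[0] in HEADINGS]
    if not candidates:
        # Case 2: line-breaking or title-like tags whose class/style has a hint.
        candidates = [
            n for n in nodes
            if n[0] in FALLBACK_TAGS and any(h in n[1].lower() for h in HINTS)
        ]
    if not candidates:
        return None
    # Assumed measure: similarity ratio against the <title> text.
    best = max(candidates, key=lambda n: SequenceMatcher(None, n[2], title_text).ratio())
    if best[0] not in HEADINGS:
        # Accept a non-heading node only if its class/style value is unique.
        if sum(1 for n in nodes if n[1] == best[1]) > 1:
            return None
    return best
```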
Step 6: locate the approximate position of the main-text body node.
From a visual standpoint, the left edge of the main-text content block usually starts in the left part of the web page and its right edge usually ends in the right part, so the block spans the middle of the page horizontally; in addition, the first node of the main text lies very close to the main-text title along the vertical axis. (Coordinates are defined with the page as the plane: the X axis is horizontal, the Y axis is vertical, and the origin is the top-left corner of the page.) From a textual standpoint, the body of the main text almost always contains periods (Chinese or English). Visual information and text information are therefore combined to locate the approximate position of the main-text body in the merged web pages.
Step 7: extract the regular noise by traversing the SBSTree, comprising the following steps:
A) For each text node between the node at which the SBSTree traversal begins (i.e., the root node of the tree) and the main-text title node, if its occurrence count is greater than or equal to a given threshold, add it to the set S0 of regular noise appearing before the main text;
B) For each text node between the main-text title node of the SBSTree and the start node of the main-text body, if its occurrence count is greater than or equal to a given threshold, add it to the set S1 of regular noise appearing within the main text;
C) For each text node between the end node of the main-text body and the node at which the SBSTree traversal ends (the last child node of the tree), if its occurrence count is greater than or equal to a given threshold, add it to the set S2 of regular noise appearing after the main text.
S0, S1 and S2 are then the regular noise to be extracted before, within, and after the main text, respectively.
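As a minimal sketch of step 7, assume the SBSTree text nodes have already been flattened into a list of (text, occurrence count) pairs in traversal order. The function name, the index parameters, and the choice to exclude the boundary nodes themselves are assumptions made for illustration.

```python
def extract_regular_noise(text_nodes, title_idx, body_start, body_end, threshold):
    """text_nodes: list of (text, count) pairs in SBSTree traversal order.

    title_idx, body_start and body_end are positions in that list; the
    title node and the body nodes themselves are excluded (an assumption).
    """
    s0 = [t for t, c in text_nodes[:title_idx] if c >= threshold]
    s1 = [t for t, c in text_nodes[title_idx + 1:body_start] if c >= threshold]
    s2 = [t for t, c in text_nodes[body_end + 1:] if c >= threshold]
    return s0, s1, s2
```

With a threshold of 2, for example, a navigation label that occurs in every merged page lands in S0, while a text that occurs in only one page is ignored.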
According to one embodiment of the invention, in step b) of step 2, the similarity between two DOM trees is computed on the basis of matching the two trees, as a similarity weight derived from the number of matched node iterators. The algorithm comprises the following six steps:
(1) Add the set consisting of the root-node iterators of the two DOM trees to a queue, queue;
(2) Pop the iterator set at the head of queue and match the child iterators of the iterators in the set, using the procedure for matching iterators of corresponding layers of two DOM trees. According to one embodiment of the invention, this matching can use the Needleman-Wunsch algorithm, which applies dynamic programming to align two sequences (here, sequences of iterators) globally. It first computes a score matrix (SM, ScoreMatrix) that assigns scores to matched and unmatched iterators; from this score matrix an optimal matching can be obtained.
(3) Traverse the aligned iterators. Add each set of matched iterators to queue and set the match information of those iterators to 1; for iterators that were not matched, do nothing further and set their match information to 0.
(4) Repeat steps (2) and (3) until queue is empty.
(5) Compute the similarity weight of each iterator;
(6) Return the similarity weight of the root iterator; this weight is the similarity between the DOM tree to be classified and the DOM tree of the known class.
According to one embodiment of the invention, in step (2) of the DOM tree similarity computation above, the Needleman-Wunsch algorithm used to match the iterators of corresponding layers of two DOM trees comprises the following six steps:
1) Set the penalty factor for matching against a gap iterator (an empty iterator) to d=0;
2) Create a score matrix SM with p+1 rows and q+1 columns, and initialize its first row and first column to d, where p and q are the numbers of elements in the two sequences (i.e., the numbers of child iterators of the two iterators);
3) Compute the score matrix SM: starting from SM[0,0], compute the score of each pair of iterators according to formula (1), and record which candidate each score came from (the path source):
$$SM[i,j]=\max\begin{cases}SM[i-1,j-1]+S(x_i,y_j)\\ SM[i-1,j]-d\\ SM[i,j-1]-d\end{cases}\qquad(1)$$
where i ∈ [0, p] and j ∈ [0, q], and S(x_i, y_j) is the similarity of iterators x_i and y_j, computed as follows:
① If the nodes pointed to by both iterators are tag nodes, compare their tag names and attributes; return 1 if they are identical, otherwise return 0;
② If the nodes pointed to by both iterators are text leaf nodes, return 1; otherwise return 0;
③ If one of the two iterators points to a text leaf node and the other to a tag node, return 0;
4) Starting from SM[p,q], trace the path back until SM[0,0] is reached; if several traceback paths exist, prefer the path along the diagonal;
5) Starting from SM[0,0], traverse the traceback path and match the iterators of the two corresponding layers as follows:
① If the path goes back from SM[i,j] to SM[i-1,j-1], then x_i and y_j are matched and no further operation is needed;
② If the path goes back from SM[i,j] to SM[i-1,j], insert a gap iterator at y_{j-1};
③ If the path goes back from SM[i,j] to SM[i,j-1], insert a gap iterator at x_{i-1};
6) Return the two aligned iterator sets.
This embodiment uses the Needleman-Wunsch algorithm to match the iterators of corresponding layers of two DOM trees; it should be understood that other text comparison algorithms, such as the LD algorithm, are also applicable here.
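The six steps above follow the standard Needleman-Wunsch recurrence, and a compact Python sketch is given here for illustration. The function name, the use of `None` as the gap iterator, and the caller-supplied `score` function are assumptions. The border is initialized to -i·d, the usual convention, which coincides with the all-d initialization of the text when d=0.

```python
def needleman_wunsch(xs, ys, score, d=0):
    """Globally align sequences xs and ys; None marks an inserted gap."""
    p, q = len(xs), len(ys)
    sm = [[0] * (q + 1) for _ in range(p + 1)]
    for i in range(1, p + 1):
        sm[i][0] = sm[i - 1][0] - d
    for j in range(1, q + 1):
        sm[0][j] = sm[0][j - 1] - d
    # Fill the score matrix with the three-way maximum of formula (1).
    for i in range(1, p + 1):
        for j in range(1, q + 1):
            sm[i][j] = max(
                sm[i - 1][j - 1] + score(xs[i - 1], ys[j - 1]),
                sm[i - 1][j] - d,
                sm[i][j - 1] - d,
            )
    # Trace back from SM[p][q] to SM[0][0], preferring the diagonal.
    ax, ay = [], []
    i, j = p, q
    while i > 0 or j > 0:
        if i > 0 and j > 0 and sm[i][j] == sm[i - 1][j - 1] + score(xs[i - 1], ys[j - 1]):
            ax.append(xs[i - 1]); ay.append(ys[j - 1]); i -= 1; j -= 1
        elif i > 0 and sm[i][j] == sm[i - 1][j] - d:
            ax.append(xs[i - 1]); ay.append(None); i -= 1
        else:
            ax.append(None); ay.append(ys[j - 1]); j -= 1
    return ax[::-1], ay[::-1]
```

Called with the tag sequences `["div", "p", "span"]` and `["div", "span"]` and an equality score of 1 or 0, the sketch aligns `p` against a gap and matches the other two tags.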
According to one embodiment of the invention, in step (5) of the DOM tree similarity computation above, the similarity weight of each iterator is computed bottom-up over the DOM tree:
similarity weight of an iterator = match information of the iterator itself + (percentage of its child iterators that are matched) × (sum of the similarity weights of all its child iterators).
This similarity calculation works from the leaf nodes of the tree up to the root node, so it takes into account both the information of a node itself and that of its child nodes.
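A recursive sketch of this bottom-up formula follows. It assumes that "percentage" means a fraction between 0 and 1 and that a leaf iterator's weight is simply its match information; the `Iter` class and the function name are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Iter:
    matched: int                                   # match information: 1 or 0
    children: List["Iter"] = field(default_factory=list)


def similarity_weight(it: Iter) -> float:
    """Weight = own match info + matched-child fraction * sum of child weights."""
    if not it.children:
        return float(it.matched)
    matched_fraction = sum(c.matched for c in it.children) / len(it.children)
    return it.matched + matched_fraction * sum(similarity_weight(c) for c in it.children)
```

For a matched root with one matched and one unmatched leaf child, the sketch gives 1 + 0.5 × 1 = 1.5.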
According to a further embodiment of the invention, the similarity can also be computed with the following method:
The similarity of two trees i and j is sim(i, j) = (D(i)/T(i) + D(j)/T(j)) / 2, where D(i) is the number of nodes of tree i that have a counterpart (are matched) in the tree alignment, and T(i) is the total number of nodes in tree i. This method considers the proportion of aligned nodes from a global perspective.
According to one embodiment of the invention, in sub-step (2) of step a) of step 4, the iterator set at the head of queue is popped and the child nodes of the iterators in the set are aligned by matching the iterators of corresponding layers of m (m ≥ 2) DOM trees. The algorithm is as follows:
A) Select the iterator of the input layer of one DOM tree as the center iterator;
B) Match each of the iterators of the corresponding layer of the remaining m-1 DOM trees against the center iterator, using the algorithm for matching iterators of corresponding layers of two DOM trees (see step b)-(2) of step 2). Record the final alignment result and compute the similarity between each iterator and the center iterator after matching; the similarity is the ratio of the number of matched iterators on this layer to the total number of iterators after matching;
C) Compute the average similarity and take it as the similarity of this center iterator;
D) Select each of the remaining m-1 iterators in turn as the center iterator and repeat steps B) and C);
E) Select the center iterator with the greatest average similarity as the standard center iterator;
F) Globally align the iterators that have been aligned with the standard center iterator;
G) Return the aligned iterators of this layer of the m (m ≥ 2) DOM trees.
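Steps A) to E), which choose the standard center, can be sketched as below. To stay self-contained the sketch replaces the pairwise Needleman-Wunsch alignment with a longest-common-subsequence count, which gives the same number of matches when the gap penalty is 0 and matches score 1. It also assumes the aligned length is p + q minus the match count, i.e. that mismatched items are never placed opposite each other. All names are hypothetical.

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def layer_similarity(a, b):
    """Matched items divided by the total items after alignment (assumed form)."""
    matched = lcs_len(a, b)
    total = len(a) + len(b) - matched
    return matched / total if total else 1.0


def choose_center(layers):
    """Return the index of the layer with the greatest average similarity."""
    def average(c):
        sims = [layer_similarity(layers[c], layers[k]) for k in range(len(layers)) if k != c]
        return sum(sims) / len(sims)
    return max(range(len(layers)), key=average)
```

The remaining steps F) and G), which merge the pairwise alignments against the chosen center into one global alignment, are not covered by this sketch.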
According to one embodiment of the invention, in step 4 the site board style tree SBSTree is built only when the number of DOM trees in the same class is greater than or equal to 2; if a class contains only one DOM tree, regular noise cannot be extracted with the method provided herein.
According to one embodiment of the invention, locating the approximate position of the main-text body node in step 6 comprises the following steps:
A) Traverse the SBSTree and find the text nodes that contain a period ("。", or "." in English web pages) and whose visual information, namely the abscissa X and the width Width, satisfies the conditions below; these are the period text nodes. (A text node can be regarded as a block on the web page; X is the abscissa of the top-left corner of the block, and Width is the width of the block along the horizontal axis.) Record the displacement (abscissa X) and the offset (width Width) of each period text node:
$$X \ge 0;\qquad X < \frac{(X+\mathrm{Width})_{\max}}{2};\qquad (X+\mathrm{Width}) > \frac{(X+\mathrm{Width})_{\max}}{2}\qquad(2)$$
where $(X+\mathrm{Width})_{\max}$ denotes the width of the web page.
In English web pages, ellipses and decimal points must be distinguished from genuine English periods. A "legal" English period excludes the following: ellipses; decimal points; periods occurring in English names; periods following titles of address such as Mr., Mrs., Ms., and Dr.; the periods in .com, .edu, .org, and the like; and the period in "All rights reserved." and similar phrases.
B) Scan the n (n ≥ 2) period text nodes obtained in A) from back to front and, according to the following formula, find the node T whose growth rate IR (IncreaseRate) of the period count PN (PeriodNum) is greatest, namely
$$\begin{cases}PN(i) > PN(j), & PN(j)=\max\bigl(PN(i+1),\,PN(i+2),\,\ldots,\,PN(n-1)\bigr)\\[4pt] IR(i) > IR(j), & IR(i)=\dfrac{PN(i)-PN(j)}{PN(j)}\end{cases}\qquad(3)$$
where i ∈ [0, n-2], j ∈ [i+1, n-1], and IR(n-1) is a specified threshold.
Then cluster the period nodes before node T and those after node T separately. The clustering criterion is that the byte distance between two period text nodes does not exceed MaxGap (400 bytes by default for Chinese web pages and 1000 bytes by default for English web pages).
C) Using the visual information, compute the vertical position difference |Y - Y_Title| between the first node of the class obtained in step B) and the main-text title (whose ordinates are Y and Y_Title, respectively). If this value is greater than a predetermined threshold, continue forward from this class of nodes and repeat step B) until the class with the smallest distance |Y - Y_Title|_min from the title node along the Y axis is found; that class is the final main-text node.
According to one embodiment of the invention, when only one period node is found, that node is the main-text node.
It should be understood that the method based on visual information and text information used in this embodiment is illustrative only and does not limit how the approximate position of the main-text body node is found.
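Three building blocks of this embodiment are easy to sketch in Python: the visual test of condition (2), a count of "legal" periods, and clustering by byte gap. The regular expression below is an assumption and only approximates the exclusion list; it does not detect periods inside personal names. Selecting node T by formula (3) is deliberately left out, and all function names are hypothetical.

```python
import re

# Patterns treated as "not a real period" in English text (approximation).
_SKIP = re.compile(
    r"\.{2,}|\d\.\d|\b(?:Mr|Mrs|Ms|Dr)\.|\.(?:com|edu|org)\b|All rights reserved\.",
    re.IGNORECASE,
)


def spans_page_middle(x, width, page_width):
    """Condition (2): the block starts left of the middle and ends right of it."""
    return x >= 0 and x < page_width / 2 and (x + width) > page_width / 2


def count_periods(text, english):
    """Count Chinese periods, or English periods outside the skipped patterns."""
    if not english:
        return text.count("。")
    return _SKIP.sub(" ", text).count(".")


def cluster_by_gap(offsets, max_gap):
    """Group sorted byte offsets; a gap larger than max_gap starts a new class."""
    clusters = []
    for off in offsets:
        if clusters and off - clusters[-1][-1] <= max_gap:
            clusters[-1].append(off)
        else:
            clusters.append([off])
    return clusters
```

With the default MaxGap of 400 bytes for Chinese pages, offsets 0, 300, and 650 form one class, and an offset of 2000 starts a new one.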
Processing web page data with this method gave good experimental results. For the two structurally similar web pages shown in Figs. 3A and 3B, applying the method of the invention yields the noise result (shown in Fig. 3G) and a summary chart of the noise at each position (shown in Fig. 3H). Figs. 5A and 5B show two other web pages; the result of classifying the pages of Figs. 5A and 5B together with those of Figs. 3A and 3B is shown in Fig. 5C, in which the pages of Figs. 3A and 3B form one class and the pages of Figs. 5A and 5B form another. Applying the method provided by the invention to the pages of the same class (shown in Figs. 5A and 5B) yields the noise result (shown in Fig. 5D) and the summary chart of the noise at each position (shown in Fig. 5E). Figs. 3G and 5D show that the proposed method for extracting regular noise from single-record web pages accurately extracts the regular noise of structurally similar web pages, which verifies its validity and reliability and shows that it can be used in practical applications.
Finally, it should be noted that the above embodiments only illustrate the technical solution of the invention and do not limit it. Although the invention has been described in detail with reference to the embodiments, those of ordinary skill in the art should understand that modifications or equivalent replacements of the technical solution that do not depart from its spirit and scope shall all be covered by the scope of the claims of the invention.

Claims (13)

1. A method for extracting regular noise from single-record web pages, the method comprising:
Step 1): converting a plurality of single-record web pages into DOM trees and classifying the DOM trees by structure;
Step 2): aligning and merging the DOM trees of the same class to obtain a site board style tree;
Step 3): extracting the regular noise before, within, and after the main text according to the approximate positions of the web page main-text title node and the web page main-text body node in the site board style tree.
2. The method according to claim 1, wherein classifying the DOM trees by structure in step 1) comprises:
Step 11): selecting one DOM tree as a known class;
Step 12): selecting, from the remaining unclassified DOM trees, a DOM tree to be classified, and computing the similarity between this DOM tree and every DOM tree in the known classes;
Step 13): determining whether the maximum similarity computed in step 12) is greater than or equal to a predefined threshold; if so, assigning the DOM tree to be classified to the known class containing the DOM tree with which its similarity is greatest, as a DOM tree of that known class; if not, creating a new class containing the DOM tree to be classified, as a known class;
Step 14): if unclassified DOM trees remain, returning to step 12); otherwise returning the set of classified DOM trees.
3. The method according to claim 2, wherein the similarity between the DOM tree to be classified and a DOM tree of a known class in step 12) is computed as follows:
Step 121): adding the set of root-node iterators of the two DOM trees whose similarity is to be computed to a queue;
Step 122): popping the iterator set at the head of the queue and matching the child iterators of the two iterators in the set, to obtain two aligned iterator sets;
Step 123): traversing the aligned iterators, adding each set of matched iterators to the queue and setting the match information of the iterators in that set to 1; setting the match information of unmatched iterators to 0; if the queue is not empty, returning to step 122);
Step 124): computing, bottom-up, the similarity weight of each iterator of the DOM tree to be classified, using the formula:
similarity weight of an iterator = match information of the iterator itself + (percentage of its child iterators that are matched) × (sum of the similarity weights of all its child iterators),
and returning the similarity weight of the root iterator as the similarity between the DOM tree to be classified and the DOM tree of the known class.
4. The method according to claim 3, wherein in step 122) the Needleman-Wunsch algorithm is used to match the child iterators of the two iterators in the set.
5. The method according to any one of claims 1-4, wherein step 1) further comprises: before classifying the DOM trees by structure, deleting form nodes and invisible nodes from the DOM trees.
6. The method according to any one of claims 1-4, further comprising, after step 1): performing language detection on the DOM trees of the same class to determine whether they are Chinese or English.
7. The method according to any one of claims 1-4, wherein in step 2) the number of DOM trees of the same class is greater than or equal to 2.
8. The method according to claim 7, wherein aligning and merging the DOM trees of the same class in step 2) comprises:
Step 21): aligning the nodes of each corresponding layer of the DOM trees of the same class;
Step 22): inserting the nodes corresponding to each position after alignment into the site board style tree, wherein:
if the nodes corresponding to the position are all tag nodes, the first tag node is inserted at the corresponding position in the site board style tree;
if the nodes corresponding to the position are all text leaf nodes, the number of occurrences of each text leaf node is counted and recorded, and all text leaf nodes with distinct content are inserted at the corresponding position in the site board style tree;
if some of the nodes corresponding to the position are text leaf nodes and the others are tag nodes, the first tag node is inserted at the corresponding position in the site board style tree, the number of occurrences of each text leaf node is counted and recorded, and all text leaf nodes with distinct content are inserted at the corresponding position in the site board style tree.
9. The method according to claim 8, wherein in step 21) the center star algorithm is used to align the nodes of each corresponding layer of the DOM trees of the same class.
10. The method according to any one of claims 1-4, wherein locating the approximate position of the web page main-text title node in the site board style tree in step 3) comprises:
Step 31): traversing the site board style tree obtained in step 2) and finding the content of the <title> tag node;
Step 32): traversing the site board style tree again and looking for <h1>-<h6> tag nodes;
if such nodes are found, computing the similarity between the text inside the <h1>-<h6> node and the text inside the <title>;
if none is found, searching for tag nodes that cause a line break or carry title characteristics; if the class or style attribute of such a tag node contains tit, center, middle, big, biao, head, bt, or topic, computing the similarity between the text in the tag node and the text inside the <title>;
Step 33): finding the tag node with the maximum similarity; if this tag node is not one of <h1>-<h6>, traversing the site board style tree to check whether any other node has the same class or style attribute as this tag node; if the attribute is unique, the content of this node is taken to be the title.
11. The method according to any one of claims 1-4, wherein locating the approximate position of the web page main-text body node in the site board style tree in step 3) comprises:
Step 34): traversing the site board style tree and finding the text nodes that contain a Chinese period "。" or an English period "." and satisfy the following conditions:
$$X \ge 0;\qquad X < \frac{(X+\mathrm{Width})_{\max}}{2};\qquad (X+\mathrm{Width}) > \frac{(X+\mathrm{Width})_{\max}}{2}$$
where X is the abscissa of the text node on the web page plane, Width is the width of the text node, and $(X+\mathrm{Width})_{\max}$ is the width of the web page,
and recording the abscissa and the width of each period text node;
Step 35): if the number of period text nodes obtained in step 34) is greater than or equal to 2, traversing the period text nodes obtained in step 34) from back to front and finding the period text node that satisfies the following conditions:
$$\begin{cases}PN(i) > PN(j), & PN(j)=\max\bigl(PN(i+1),\,PN(i+2),\,\ldots,\,PN(n-1)\bigr)\\[4pt] IR(i) > IR(j), & IR(i)=\dfrac{PN(i)-PN(j)}{PN(j)}\end{cases}$$
where there are n period text nodes, PN(i) is the period count of node i, IR(i) is the growth rate of the period count of node i, i ∈ [0, n-2], j ∈ [i+1, n-1], and IR(n-1) is a predetermined threshold,
and clustering the period text nodes before and after this node separately; if the number of period text nodes obtained in step 34) is 1, that period text node is the main-text node and step 36) is skipped;
Step 36): computing the position difference along the ordinate between the first node of the class obtained in step 35) and the main-text title; if this value is greater than a predetermined threshold, returning to step 35), until the class of nodes with the smallest distance from the title node along the ordinate is found and taken as the final main-text node.
12. The method according to claim 11, wherein clustering the period text nodes before and after said node separately in step 35) comprises:
for a Chinese web page, clustering period text nodes whose byte distance from one another does not exceed 400 bytes;
for an English web page, clustering period text nodes whose byte distance from one another does not exceed 1000 bytes.
13. The method according to any one of claims 1-4, wherein extracting the regular noise before, within, and after the main text in step 3) comprises:
if the occurrence count of a text node between the node at which traversal of the site board style tree begins and the main-text title node is greater than or equal to a predetermined threshold, adding it to the set of regular noise before the main text;
if the occurrence count of a text node between the main-text title node of the site board style tree and the start node of the main-text body is greater than or equal to a predetermined threshold, adding it to the set of regular noise within the main text;
if the occurrence count of a text node between the end node of the main-text body and the node at which traversal of the site board style tree ends is greater than or equal to a predetermined threshold, adding it to the set of regular noise after the main text.
CN201210592795.0A 2012-12-31 2012-12-31 A kind of method extracting rule noise from unirecord webpage Active CN103064966B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210592795.0A CN103064966B (en) 2012-12-31 2012-12-31 A kind of method extracting rule noise from unirecord webpage

Publications (2)

Publication Number Publication Date
CN103064966A true CN103064966A (en) 2013-04-24
CN103064966B CN103064966B (en) 2016-01-27

Family

ID=48107596

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210592795.0A Active CN103064966B (en) 2012-12-31 2012-12-31 A kind of method extracting rule noise from unirecord webpage

Country Status (1)

Country Link
CN (1) CN103064966B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104484451A (en) * 2014-12-25 2015-04-01 北京国双科技有限公司 Web page information extraction method and web page information extraction device
CN105183801A (en) * 2015-08-25 2015-12-23 北京信息科技大学 Web page body text extraction method and apparatus
WO2017080090A1 (en) * 2015-11-14 2017-05-18 孙燕群 Extraction and comparison method for text of webpage
CN110083760A (en) * 2019-04-16 2019-08-02 浙江工业大学 A kind of more recordable type dynamic web page information extracting methods based on visible-block
CN114528811A (en) * 2022-01-21 2022-05-24 北京麦克斯泰科技有限公司 Article content extraction method, device, equipment and storage medium
CN115658993A (en) * 2022-09-27 2023-01-31 观澜网络(杭州)有限公司 Intelligent extraction method and system for core content of webpage

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101833554A (en) * 2009-03-09 2010-09-15 富士通株式会社 Method and equipment for producing extraction template and method and equipment for extracting content on web pages
CN102004805A (en) * 2010-12-30 2011-04-06 上海交通大学 Webpage denoising system and method based on maximum similarity matching
CN102073654A (en) * 2009-11-20 2011-05-25 富士通株式会社 Methods and equipment for generating and maintaining web content extraction template
CN102156737A (en) * 2011-04-12 2011-08-17 华中师范大学 Method for extracting subject content of Chinese webpage
CN102253937A (en) * 2010-05-18 2011-11-23 阿里巴巴集团控股有限公司 Method and related device for acquiring information of interest in webpages
CN102662969A (en) * 2012-03-11 2012-09-12 复旦大学 Internet information object positioning method based on webpage structure semantic meaning

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104484451A (en) * 2014-12-25 2015-04-01 北京国双科技有限公司 Web page information extraction method and web page information extraction device
CN104484451B (en) * 2014-12-25 2017-12-19 北京国双科技有限公司 The extracting method and device of Webpage information
CN105183801A (en) * 2015-08-25 2015-12-23 北京信息科技大学 Web page body text extraction method and apparatus
CN105183801B (en) * 2015-08-25 2018-07-06 北京信息科技大学 web page text extracting method and device
WO2017080090A1 (en) * 2015-11-14 2017-05-18 孙燕群 Extraction and comparison method for text of webpage
CN110083760A (en) * 2019-04-16 2019-08-02 浙江工业大学 A kind of more recordable type dynamic web page information extracting methods based on visible-block
CN114528811A (en) * 2022-01-21 2022-05-24 北京麦克斯泰科技有限公司 Article content extraction method, device, equipment and storage medium
CN114528811B (en) * 2022-01-21 2022-09-02 北京麦克斯泰科技有限公司 Article content extraction method, device, equipment and storage medium
CN115658993A (en) * 2022-09-27 2023-01-31 观澜网络(杭州)有限公司 Intelligent extraction method and system for core content of webpage

Also Published As

Publication number Publication date
CN103064966B (en) 2016-01-27

Similar Documents

Publication Publication Date Title
CN101251855B (en) Equipment, system and method for cleaning internet web page
CN103023714B (en) The liveness of topic Network Based and cluster topology analytical system and method
CN103544255B (en) Text semantic relativity based network public opinion information analysis method
CN103955529B (en) A kind of internet information search polymerize rendering method
CN102663023B (en) Implementation method for extracting web content
CN101593200A (en) Chinese Web page classification method based on the keyword frequency analysis
CN103064966B (en) A kind of method extracting rule noise from unirecord webpage
CN104598577B (en) A kind of extracting method of Web page text
CN104268148A (en) Forum page information auto-extraction method and system based on time strings
Peters et al. Content extraction using diverse feature sets
CN102591612B (en) General webpage text extraction method based on punctuation continuity and system thereof
CN102402566A (en) Web user behavior analysis method based on Chinese webpage automatic classification technology
Shi et al. AutoRM: An effective approach for automatic Web data record mining
Velloso et al. Automatic web page segmentation and noise removal for structured extraction using tag path sequences
CN110083760B (en) Multi-recording dynamic webpage information extraction method based on visual block
Kim et al. Main content extraction from web documents using text block context
Subercaze et al. Mining user-generated comments
Liu et al. Extraction and management of meta information on the domain-oriented Deep Web
Dutta et al. Noise elimination from web page based on regular expressions for web content mining
KR20080008573A (en) Method for extracting association rule from xml data
Dong et al. A generic Web news extraction approach
Keller et al. GRABEX: A graph-based method for web site block classification and its application on mining breadcrumb trails
Zeleny et al. Cluster-based Page Segmentation-a fast and precise method for web page pre-processing
Boddu ELIMINATE THE NOISY DATA FROM WEB PAGES USING DATA MINING TECHNIQUES.
Zeng et al. Layout-tree-based approach for identifying visually similar blocks in a web page

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20130424

Assignee: Branch DNT data Polytron Technologies Inc

Assignor: Institute of Computing Technology, Chinese Academy of Sciences

Contract record no.: 2018110000033

Denomination of invention: Method for extracting regular noise from single record web pages

Granted publication date: 20160127

License type: Common License

Record date: 20180807
