CN103530429A - Webpage content extracting method

Info

Publication number: CN103530429A
Application number: CN201310538575.4A
Authority: CN
Inventors: 涂波
Original assignee: Beijing Zhongsou Network Technology Co ltd
Current assignee: Beijing Zhongsou Cloud Business Network Technology Co ltd
Priority date: 2013-11-04
Filing date: 2013-11-04
Publication date: 2014-01-22
Anticipated expiration: 2033-11-04
Also published as: CN103530429B

Abstract

The invention provides a method for extracting web page content. The method comprises the following steps: I, preprocessing the web page; II, finding the longest string in the web page; III, creating a DOM tree and using it to find the node corresponding to the longest string; IV, determining a start node and an end node according to the tag of the node corresponding to the longest string; V, checking and filtering the start node and the end node; and VI, outputting the text in the filtered start node and end node. The method overcomes the shortcomings of template-based and block-based techniques when applied to news content extraction, finds a seed paragraph based on the longest string, and improves the efficiency and accuracy of web page content extraction.

Description

A method for extracting the body text of a web page

Technical field

The present invention relates to a method in the computer field, and in particular to a method that extracts the body content of news web pages by finding the "longest string" and using it to locate the relevant nodes.

Background art

In news (or information) search, body-text extraction is an essential step, and the quality of the extraction determines the quality of the news search and the user experience.

Current body-text extraction methods take many forms and fall into two broad classes according to whether a template is used: template-based (or wrapper-based) extraction and template-free extraction.

Template-based extraction: a template is first defined, and a program then parses and executes the template to obtain the data. According to how the template is generated, this class is further divided into manual-template extraction and automatic-template extraction. In manual-template extraction, a template is written by hand for each target site; the template may use regular-expression matching or simple string matching. In automatic-template extraction, a machine-learning algorithm is used: a portion of page data is first fetched from the target website for training to obtain a template, and a program then uses the template to extract the data.

Template-free extraction is mostly based on statistics and learning. The main algorithms at present are rule-based, block-based, vision-based and so on. A representative example is Microsoft's vision-based page segmentation algorithm, which determines the main semantic blocks of a page in three steps: page block extraction, separator extraction and semantic block reconstruction.

The drawback of hand-written templates is that writing them consumes enormous human resources, and as target websites change, the cost of maintaining the templates is also high. The drawback of automatic templates is that the algorithm is complex, and the target website must also be monitored periodically so that the templates can be kept up to date. Whether templates are produced manually or automatically, both approaches assume that the website's data is generated from a template. For some large websites this is basically not a problem, apart from different entry points possibly using different templates. Many small and medium-sized websites, however, are poorly templated, so template-based extraction recovers only most of the information and is more likely to include junk information. The vision-based page segmentation algorithm has complex rules and low performance, and is unsuitable for news search engines.

Summary of the invention

To overcome the above deficiencies of the prior art, the present invention provides an effective method for extracting Internet news content. To address the shortcomings of template-based or block-based extraction when applied to news content, the method finds a seed paragraph based on the "longest string" and extracts the news content with a tag-clustering algorithm, avoiding the drawbacks of manual rules and templates.

The solution adopted to achieve the above purpose is:

A method for extracting the body text of a web page, characterized in that the method comprises the following steps:

I, preprocessing the web page;

II, finding the longest string in the web page;

III, creating a DOM tree and using it to find the node corresponding to the longest string;

IV, determining a start node and an end node according to the tag of the node corresponding to the longest string;

V, checking and filtering the start node and the end node;

VI, outputting the text in the filtered start node and end node.

Further, step I comprises: determining whether the web page contains ignorable tags, such as comments, "script" and "meta"; and obtaining and deleting the ignorable tags together with the content within them.

Further, step II comprises the following steps: 1) finding the longest string in the web page by scanning the page text line by line;

2) obtaining and recording the length of the longest string, and further processing the string obtained: when the longest string lies inside particular tags, the recorded length is increased or decreased.

Further, in step III a DOM tree is created for the web page, the information of all nodes is obtained from the DOM tree, the nodes are stored in an array, and the array is searched for the node that contains the longest string;

The information of each node includes the word count, the Chinese character count and the link count.

Further, in step IV similar nodes are found by a tag-clustering method according to the tag of the longest-string node stored in the array, and the start node and end node are thereby determined.

Further, the tag-clustering method comprises searching for the tag feature forward, backward and in both directions.

Further, the start node and end node selected in step IV are checked and filtered to obtain the remaining nodes, and the remaining nodes and the content within them are output.

Further, the body text of the web page is located on the basis that the paragraph with the most continuous text belongs to the body text: a seed node is found in the DOM tree, and the search is extended forward and backward from the seed node to find the whole body-text region.

Compared with the prior art, the present invention has the following advantages:

(1) The method finds a seed paragraph based on the longest string and extracts the news content with a tag-clustering algorithm. This addresses the shortcomings of template-based or block-based extraction when applied to news content and avoids the drawbacks of manual rules and templates.

(2) The method does not need to create a DOM tree in the early stage of looking for the "longest string". The longest string is searched for directly in the page text, line by line; where there is no natural line break, processing of the current line is forcibly terminated. This improves efficiency, and the accuracy is high.

(3) The method analyzes a single web page and needs no template, saving a great deal of manual work. The seed-string finding algorithm is simple and the analysis is efficient. The method is also flexible and handles abnormal cases more conveniently.

(4) The method uses template-free, single-page tag clustering to extract news page content, and its results are more accurate. It provides high-quality data for subsequent fingerprint calculation, content clustering and news-event clustering.

(5) The method finds the body-text region relatively simply and quickly. Because most of the work is done on the DOM tree, it is flexible and makes it easy to add filtering rules and start and end locating rules. The method applies not only to Chinese but also to Western languages.

Brief description of the drawings

Fig. 1 is a flow chart of the web page body-text extraction method.

Detailed description of the embodiments

The embodiments of the present invention are described in further detail below with reference to the accompanying drawing.

A web page contains information such as the article title, source, publication time, body text and author, and may also contain a large amount of advertising and junk information. In news pages the "longest string" mostly appears in the body text. Using this property, one paragraph in the body-text region is found and its tag feature is obtained; the tag feature is then used to search forward, backward and in both directions for nodes with similar tags. This process is called "tag clustering".

A method for extracting the body text of a web page, which extracts the body content of news web pages by finding the "longest string" and using it to locate the relevant nodes, comprises the following steps: I, deleting the ignorable tags in the web page and the content within them; II, finding the longest string in the web page; III, creating a DOM tree and using it to find the node corresponding to the longest string; IV, determining a start node and an end node according to the tag of the node corresponding to the longest string; V, checking and filtering the start node and the end node; VI, outputting the text in the filtered start node and end node.

As shown in Fig. 1, which is a flow chart of the web page body-text extraction method, the method specifically comprises the following steps:

Step 1: delete the ignorable tags in the web page and the content within them.

The source file of the web page is collected, for example by a crawling system;

The source file of the HTML page is preprocessed. Because the data in web pages is varied, the HTML code in the source file needs to be normalized into a uniform form, i.e. preprocessed, as follows:

First, determine whether the tags in the source file are matched; if any tag is unpaired, correct it so that every opening tag has a matching closing tag;

Second, determine whether the web page contains ignorable tags; if so, obtain the ignorable tags and the content within them and delete both.

Ignorable tags: tags whose content is unrelated to the body text, such as comments, "script" and "meta".
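The deletion of ignorable tags described in Step 1 can be sketched as follows. This is a minimal illustration using regular expressions from the Python standard library, not the implementation used in the invention; the function name strip_ignorable is hypothetical, and the tag-balancing repair described above is deliberately left out.

```python
import re

# Patterns for the ignorable content named in the text: comments, script, meta.
# Regular expressions are an assumption of this sketch; the source does not
# say how the deletion is implemented.
_IGNORABLE_PATTERNS = [
    re.compile(r"<!--.*?-->", re.DOTALL),
    re.compile(r"<script\b[^>]*>.*?</script\s*>", re.DOTALL | re.IGNORECASE),
    re.compile(r"<meta\b[^>]*>", re.IGNORECASE),
]


def strip_ignorable(html: str) -> str:
    """Remove comments, <script> elements and <meta> tags from raw HTML."""
    for pattern in _IGNORABLE_PATTERNS:
        html = pattern.sub("", html)
    return html


if __name__ == "__main__":
    page = (
        '<html><head><meta charset="utf-8"><script>var a = 1;</script></head>'
        "<body><!-- ad --><p>Hello</p></body></html>"
    )
    print(strip_ignorable(page))
```

Comments are removed first so that a commented-out script does not leave a dangling closing tag behind.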

Step 2: find the longest string in the web page.

1) Scan the page text line by line, finding and recording the lengths of the continuous strings in the web page.

A continuous string does not include tags. When a tag is encountered, the current length is recorded if it exceeds the longest string length of the current line (that value is initialized to 0 at the start of each line), and the length counter is reset to 0 to start counting a new string. The length is adjusted according to the enclosing tag: for example, it is increased inside a paragraph tag <p></p> (the length is adjusted by multiplying by a proportionality coefficient) and reduced inside tags such as <strong></strong>.

2) From the recorded continuous-string lengths of all lines, obtain the longest string;

A continuous string is a run of consecutive Chinese characters (2 or more) or consecutive words (2 or more Western-language words, separated by spaces).
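Step 2 can be sketched as below. This is an illustrative reading of the description, not the patented implementation: the coefficient values in TAG_WEIGHT are hypothetical (the text only states that <p> increases and <strong> reduces the length), a run is measured in characters, and tags are assumed not to span line breaks. Tracking one overall maximum is equivalent to recording a per-line maximum and then taking the largest.

```python
import re

TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>")
CJK_RUN = re.compile(r"[\u4e00-\u9fff]{2,}")       # 2 or more Chinese characters
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}
# Hypothetical coefficients, chosen only for illustration.
TAG_WEIGHT = {"p": 1.5, "strong": 0.5}


def qualifies(run: str) -> bool:
    """A continuous string needs 2+ Chinese characters or 2+ spaced words."""
    return bool(CJK_RUN.search(run)) or len(run.split()) >= 2


def longest_string(html: str):
    """Return (text, weighted_length) of the longest tag-free run."""
    best_text, best_score = "", 0.0
    open_tags = []  # enclosing tags, kept across lines
    for line in html.splitlines():
        pos = 0
        # The trailing None stands for the end of the line.
        for match in list(TAG_RE.finditer(line)) + [None]:
            end = match.start() if match else len(line)
            run = line[pos:end].strip()
            if qualifies(run):
                weight = TAG_WEIGHT.get(open_tags[-1], 1.0) if open_tags else 1.0
                score = len(run) * weight
                if score > best_score:
                    best_text, best_score = run, score
            if match is None:
                break
            pos = match.end()
            closing, name = match.group(1), match.group(2).lower()
            if closing:
                if name in open_tags:
                    # Pop up to and including the matching opening tag.
                    while open_tags.pop() != name:
                        pass
            elif name not in VOID_TAGS and not match.group(0).endswith("/>"):
                open_tags.append(name)
    return best_text, best_score


if __name__ == "__main__":
    page = (
        "<div><strong>Breaking news headline here today</strong></div>\n"
        "<p>The quick brown fox jumps.</p>\n"
        "<a href='#'>Home</a>"
    )
    print(longest_string(page))
```

In the usage example the headline is longer in raw characters, but the paragraph wins once the tag coefficients are applied, which is the effect the adjustment is meant to have.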

Step 3: create a DOM tree and use it to find the node corresponding to the longest string.

1) Create a DOM tree from the source file of the collected web page, compute the information of each node (including the word count, the Chinese character count, the link count and so on) and store the nodes in an array.

2) Using the longest string obtained in Step 2, find the corresponding node and its position in the array; a simple substring search is used to find the node.

For example, search "<td><div>My test, language</div></td>" for the longest string obtained: if the longest string is "My test" it is found, and if it is "catching a duck" it is not. The node found is the matching node, i.e. the seed node.
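The node array and the substring search of Step 3 can be sketched with the standard-library html.parser module. The class and function names (Node, NodeCollector, build_nodes, find_seed) are hypothetical, each node stores only its own direct text, and "word count" is interpreted here as the number of Western-language words, which is an assumption.

```python
import re
from dataclasses import dataclass
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


@dataclass
class Node:
    tag: str
    path: str       # tag feature string, e.g. "html:body:div:p"
    text: str = ""  # direct text of this node only
    links: int = 0  # number of <a> elements nested inside this node

    @property
    def chinese_chars(self) -> int:
        return sum("\u4e00" <= ch <= "\u9fff" for ch in self.text)

    @property
    def words(self) -> int:
        return len(re.findall(r"[A-Za-z0-9']+", self.text))


class NodeCollector(HTMLParser):
    """Flattens the element tree into an array in document order."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open nodes
        self.nodes = []  # every node seen so far

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if tag == "a":
            for ancestor in self.stack:
                ancestor.links += 1
        parent_path = self.stack[-1].path if self.stack else ""
        node = Node(tag, f"{parent_path}:{tag}" if parent_path else tag)
        self.stack.append(node)
        self.nodes.append(node)

    def handle_endtag(self, tag):
        # Close the innermost open node with this tag, if any.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if self.stack:
            self.stack[-1].text += data


def build_nodes(html: str):
    collector = NodeCollector()
    collector.feed(html)
    collector.close()
    return collector.nodes


def find_seed(nodes, longest: str) -> int:
    """Return the array index of the first node containing `longest`, or -1."""
    for index, node in enumerate(nodes):
        if longest in node.text:
            return index
    return -1
```

Because the longest string from Step 2 contains no tags, it always lies within the direct text of a single node, so matching against direct text is sufficient in this sketch.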

Step 4: find the start node and end node according to the array position of the node corresponding to the longest string and its tag feature string (e.g. html:body:div:p).

Because the nodes of a page's body text share a common parent or grandparent node, similar nodes can be found according to the tag of the longest-string node stored in the array, which determines the start node and end node.
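One possible reading of the tag clustering in Step 4 is sketched below. The source does not define "similar" precisely, so this sketch assumes that a node is similar when its tag feature string equals the seed's or extends it (a descendant of a sibling with the same path); the function name cluster_by_tag_path is hypothetical, and a looser rule based on a shared grandparent would also be consistent with the text.

```python
def cluster_by_tag_path(paths, seed):
    """Expand backward and forward from `seed` over nodes with a similar path.

    `paths` is the node array reduced to tag feature strings, in document
    order. Returns (start, end) as inclusive array indices.
    """
    seed_path = paths[seed]

    def similar(path: str) -> bool:
        # Same path as the seed, or a descendant of such a node.
        return path == seed_path or path.startswith(seed_path + ":")

    start = seed
    while start > 0 and similar(paths[start - 1]):
        start -= 1  # backward search
    end = seed
    while end + 1 < len(paths) and similar(paths[end + 1]):
        end += 1    # forward search
    return start, end


if __name__ == "__main__":
    paths = [
        "html", "html:body", "html:body:div", "html:body:div:h1",
        "html:body:div:p", "html:body:div:p", "html:body:div:p:a",
        "html:body:div:p", "html:body:div:ul", "html:body:div:ul:li",
    ]
    print(cluster_by_tag_path(paths, 5))
```

Running both loops from the seed corresponds to the two-way search; either loop alone gives the forward-only or backward-only variant.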

Step 5: check and filter the nodes obtained (including the start node and end node).

The start node and end node selected in Step 4 are checked and filtered to obtain the remaining nodes, and the remaining nodes and the content within them are output; the content includes text, pictures and so on.

Filtering is based on features of the nodes concerned: for example, a node is deleted if its class equals a particular value, if its id equals a particular value, or if its content matches a certain feature and is immediately followed by a URL. This is mainly used to clear advertising-type information from the middle of the content.

Step 6: output the text in the filtered start node and end node.
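The filtering rules of Step 5 can be sketched as a simple predicate over the selected nodes. Everything concrete here is hypothetical: the Block type, the blocked class and id values, and the pattern used for "content followed by a URL" are placeholders, since the source names the kinds of rule but not their values.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Block:
    text: str
    attrs: dict = field(default_factory=dict)


# Hypothetical rule values, for illustration only.
BLOCKED_CLASSES = {"ad", "share"}
BLOCKED_IDS = {"related-links"}
PROMO_RE = re.compile(r"(click here|more at)\s*:?\s*https?://\S+", re.IGNORECASE)


def keep(block: Block) -> bool:
    """Return False for nodes that match one of the deletion rules."""
    classes = set(block.attrs.get("class", "").split())
    if classes & BLOCKED_CLASSES:
        return False
    if block.attrs.get("id") in BLOCKED_IDS:
        return False
    if PROMO_RE.search(block.text):
        return False
    return True


def filter_blocks(blocks):
    return [block for block in blocks if keep(block)]


if __name__ == "__main__":
    blocks = [
        Block("First paragraph."),
        Block("Buy now", {"class": "ad banner"}),
        Block("More at: http://example.com/x"),
        Block("Related", {"id": "related-links"}),
        Block("Second paragraph.", {"class": "content"}),
    ]
    print([block.text for block in filter_blocks(blocks)])
```

Keeping the rules in plain sets and one pattern makes it easy to add site-specific rules, which matches the flexibility the description claims for working on the node array.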

Finally, it should be noted that the above embodiments merely illustrate the technical solution of the present application and do not limit its scope of protection. Although the application has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that, after reading the application, they may still make various changes, modifications or equivalent substitutions to the embodiments, and that such changes, modifications or equivalent substitutions all fall within the scope of the pending claims.

Claims

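1. A method for extracting the body text of a web page, characterized in that the method comprises the following steps:

I, preprocessing the web page;

II, finding the longest string in the web page;

III, creating a DOM tree and using it to find the node corresponding to the longest string;

IV, determining a start node and an end node according to the tag of the node corresponding to the longest string;

V, checking and filtering the start node and the end node;

VI, outputting the text in the filtered start node and end node.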

2. The method for extracting the body text of a web page according to claim 1, characterized in that step I comprises: determining whether the web page contains ignorable tags, such as comments, "script" and "meta"; and obtaining and deleting the ignorable tags together with the content within them.

3. The method for extracting the body text of a web page according to claim 1, characterized in that step II comprises the following steps: 1) finding the longest string in the web page by scanning the page text line by line;

4. The method for extracting the body text of a web page according to claim 1, characterized in that in step III a DOM tree is created for the web page, the information of all nodes is obtained from the DOM tree, the nodes are stored in an array, and the array is searched for the node that contains the longest string;

5. The method for extracting the body text of a web page according to claim 1, characterized in that similar nodes are found by a tag-clustering method according to the tag of the longest-string node stored in the array, and the start node and end node in step IV are thereby determined.

6. The method for extracting the body text of a web page according to claim 5, characterized in that the tag-clustering method comprises searching for the tag feature forward, backward and in both directions.

7. The method for extracting the body text of a web page according to claim 1, characterized in that the start node and end node selected in step IV are checked and filtered to obtain the remaining nodes, and the remaining nodes and the content within them are output.

8. The method for extracting the body text of a web page according to claim 1, characterized in that the body text of the web page is located on the basis that the paragraph with the most continuous text belongs to the body text: a seed node is found in the DOM tree, and the search is extended forward and backward from the seed node to find the whole body-text region.