CN110069618A - Method and system for extracting web page content

Info

Publication number: CN110069618A
Application number: CN201711135743.XA
Authority: CN
Inventors: 吴远辉
Original assignee: Guangzhou Wanlong Securities Advisory Consultants Co Ltd
Current assignee: Guangzhou Wanlong Securities Advisory Consultants Co Ltd
Priority date: 2017-11-16
Filing date: 2017-11-16
Publication date: 2019-07-30

Abstract

The invention discloses a method and system for extracting web page content. The method comprises the following steps: S1, performing content extraction on a web page based on regular expression matching and, if the extraction is judged successful, executing step S4, otherwise continuing to step S2; S2, performing content extraction on the web page based on CSS style and, if the extraction is judged successful, executing step S4, otherwise continuing to step S3; S3, performing content extraction on the web page based on XPath matching; S4, outputting the extraction result. By combining regular expressions, CSS style and XPath in sequence, the invention extracts web page content at the fastest speed, and combining the three extraction modes substantially improves the accuracy of the extracted content, providing effective and accurate extraction results. The invention can be widely applied in the field of web page information processing.

Description

A method and system for extracting web page content

Technical field

The present invention relates to the fields of computer applications and information extraction, and more particularly to a method and system for extracting web page content.

Background art

Definitions of terms:

CSS style: Cascading Style Sheets, a computer language used to describe the presentation of documents written in HTML (an application of Standard Generalized Markup Language) or XML (a subset of Standard Generalized Markup Language);

XPath: a language for finding information in an XML document, that is, for addressing particular parts of an XML document. XPath is based on the tree structure of XML and provides the ability to find nodes in that tree.

Text mining and analysis generally involve web page content extraction. Web page content is a basic information element of a text and is the basis for understanding the text correctly. Web page content extraction is an important foundational tool for application fields such as machine learning, and it plays an important role in bringing natural language processing technology into practical use.

During extraction, a web page contains, besides its subject content, material unrelated to that content, such as copyright notices, advertisements, navigation bars and decorative elements. This material is referred to as "noise" information, and it makes automatic extraction of the text content more difficult. Removing the noise and extracting the body text of a web page is of great significance today, when Internet technology is developing rapidly. Some methods exist in this field, but their technical means are relatively simple, their extraction speed is slow and their extraction accuracy is relatively low, so they struggle to meet application requirements.

Summary of the invention

To solve the above technical problems, the object of the present invention is to provide a method and system for extracting web page content.

The technical solution adopted by the present invention to solve the technical problems is:

A method for extracting web page content, comprising the following steps:

S1, performing content extraction on a web page based on regular expression matching; if the extraction is judged successful, executing step S4; otherwise, continuing to step S2;

S2, performing content extraction on the web page based on CSS style; if the extraction is judged successful, executing step S4; otherwise, continuing to step S3;

S3, performing content extraction on the web page based on XPath matching;

S4, outputting the extraction result.

As a further preferred embodiment, the step in S1 of performing content extraction on the web page based on regular expression matching specifically comprises:

S11, configuring a regular expression for the web page;

S12, performing content extraction on the web page using the regular expression;

S13, performing data cleansing on the extraction result.

As a further preferred embodiment, the step in S2 of performing content extraction on the web page based on CSS style specifically comprises:

S21, configuring a CSS style expression for the web page;

S22, performing content extraction on the web page using the CSS style expression;

S23, performing data cleansing on the extraction result.

As a further preferred embodiment, step S3 specifically comprises:

S31, configuring an XPath path expression for the web page;

S32, performing content extraction on the web page using the XPath path expression;

S33, performing data cleansing on the extraction result.

As a further preferred embodiment, step S33 specifically comprises:

matching the extraction result against a preset noise lexicon, and then deleting the noise words found in the extraction result.

Another technical solution adopted by the present invention to solve the technical problems is:

A system for extracting web page content, comprising the following modules:

a first extraction module, configured to perform content extraction on a web page based on regular expression matching and, if the extraction is judged successful, to invoke the output module, otherwise to invoke the second extraction module;

a second extraction module, configured to perform content extraction on the web page based on CSS style and, if the extraction is judged successful, to invoke the output module, otherwise to invoke the third extraction module;

a third extraction module, configured to perform content extraction on the web page based on XPath matching;

an output module, configured to output the extraction result.

As a further preferred embodiment, the first extraction module specifically comprises:

a first configuration unit, configured to configure a regular expression for the web page;

a first extraction unit, configured to perform content extraction on the web page using the regular expression;

a first cleaning unit, configured to perform data cleansing on the extraction result;

a first judging unit, configured to invoke the output module if the extraction is judged successful, and otherwise to invoke the second extraction module.

As a further preferred embodiment, the second extraction module specifically comprises:

a second configuration unit, configured to configure a CSS style expression for the web page;

a second extraction unit, configured to perform content extraction on the web page using the CSS style expression;

a second cleaning unit, configured to perform data cleansing on the extraction result;

a second judging unit, configured to invoke the output module if the extraction is judged successful, and otherwise to invoke the third extraction module.

As a further preferred embodiment, the third extraction module specifically comprises:

a third configuration unit, configured to configure an XPath path expression for the web page;

a third extraction unit, configured to perform content extraction on the web page using the XPath path expression;

a third cleaning unit, configured to perform data cleansing on the extraction result.

As a further preferred embodiment, the third submodule is specifically configured to:

The beneficial effects of the method and system of the present invention are as follows: by combining regular expressions, CSS style and XPath in sequence, the invention extracts web page content at the fastest speed, and combining the three extraction modes substantially improves the accuracy of the extracted content, providing effective and accurate extraction results.

Brief description of the drawings

Fig. 1 is a flow chart of the method for extracting web page content of the present invention.

Detailed description of the embodiments

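Referring to Fig. 1, the present invention provides a method for extracting web page content, comprising the following steps:

S1, performing content extraction on a web page based on regular expression matching; if the extraction is judged successful, executing step S4; otherwise, continuing to step S2;

S2, performing content extraction on the web page based on CSS style; if the extraction is judged successful, executing step S4; otherwise, continuing to step S3;

S3, performing content extraction on the web page based on XPath matching;

S4, outputting the extraction result.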

The method first performs content extraction on the web page based on a regular expression. If that extraction is unsuccessful, it performs content extraction based on CSS style, and if that is again unsuccessful, it performs content extraction based on XPath matching. By combining regular expressions, CSS style and XPath in sequence according to the characteristics of the web page, the method extracts web page content at the fastest speed, and combining the three extraction modes substantially improves the accuracy of the extracted content, providing effective and accurate extraction results.
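The S1 to S4 control flow can be sketched as a simple fallback chain. This is a minimal illustration rather than the patented implementation: the function name `extract_with_fallback`, the three stub extractors and the rule that "success" means a non-empty string are all assumptions made for the example.

```python
from typing import Callable, Iterable, Optional

# An extractor takes page source and returns extracted text, or None on failure.
Extractor = Callable[[str], Optional[str]]


def extract_with_fallback(page: str, extractors: Iterable[Extractor]) -> Optional[str]:
    """Try each extractor in order (S1 -> S2 -> S3) and return the first success."""
    for extractor in extractors:
        result = extractor(page)
        # Assumption: an extraction "succeeds" when it yields non-blank text.
        if result and result.strip():
            return result.strip()  # S4: output the extraction result
    return None  # every mode failed


# Hypothetical stand-ins for the regular expression, CSS style and XPath modes.
def regex_stub(page: str) -> Optional[str]:
    return None  # pretend the regular expression did not match


def css_stub(page: str) -> Optional[str]:
    return "body text" if "class='content'" in page else None


def xpath_stub(page: str) -> Optional[str]:
    return "fallback text"


result = extract_with_fallback(
    "<div class='content'>body text</div>", [regex_stub, css_stub, xpath_stub]
)
```

Ordering the extractors from cheapest to most general mirrors the order described above; later modes run only when the earlier ones fail.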

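As a further preferred embodiment, the step in S1 of performing content extraction on the web page based on regular expression matching specifically comprises:

S11, configuring a regular expression for the web page;

S12, performing content extraction on the web page using the regular expression;

S13, performing data cleansing on the extraction result.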

A regular expression is an effective means of extracting web page content. Once the regular expression for the web page has been configured in step S11, content can be extracted effectively. After extraction, the result is matched against a preset noise lexicon and the noise words found in it are deleted. This cleanses the extraction result, removing material that does not belong to the web page content and making the result more accurate.
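Steps S11 and S12 might look like the following sketch. The pattern `CONTENT_PATTERN` is a hypothetical per-site configuration chosen for illustration (the text does not give a concrete regular expression), and the helper name `extract_by_regex` is likewise an assumption.

```python
import re
from typing import Optional

# S11 (assumed configuration): capture everything inside <div class="content">.
CONTENT_PATTERN = re.compile(
    r"<div[^>]*class=['\"]content['\"][^>]*>(.*?)</div>",
    re.DOTALL | re.IGNORECASE,
)
TAG_PATTERN = re.compile(r"<[^>]+>")  # matches any remaining inline markup


def extract_by_regex(html: str, pattern: re.Pattern = CONTENT_PATTERN) -> Optional[str]:
    """S12: apply the configured pattern; return None when nothing matches."""
    match = pattern.search(html)
    if match is None:
        return None  # extraction failed; the caller would fall through to S2
    text = TAG_PATTERN.sub("", match.group(1))  # strip tags inside the capture
    return " ".join(text.split()) or None  # normalise whitespace


sample = "<html><div class='content' id='conmain'><p>Hello <b>world</b></p></div></html>"
extracted = extract_by_regex(sample)
```

Because the capture is non-greedy, it stops at the first closing `</div>`, so this pattern would truncate content that contains nested `div` elements. That kind of page is one plausible case where the later extraction modes take over.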

As a further preferred embodiment, the step in S2 of performing content extraction on the web page based on CSS style specifically comprises:

S21, configuring a CSS style expression for the web page; for example, if the content to be extracted is located in a div tag (<div class='content' id='conmain'>), then by inspecting the id or class attribute of the div tag, the expression for the node holding the content can be configured as "div[@class='content']";

S22, performing content extraction on the web page using the CSS style expression; the CSS style expression is used to find the HTML node that holds the content to be extracted, from which the content can then be extracted;

S23, performing data cleansing on the extraction result to remove irrelevant content.

Once the CSS style expression for the web page has been configured in step S21, web pages that could not be extracted with the regular expression in step S1 are extracted further. After extraction, the result is matched against a preset noise lexicon and the noise words found in it are deleted. This cleanses the extraction result, removing material that does not belong to the web page content and making the result more accurate.
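The class-based lookup of steps S21 and S22 can be approximated with the Python standard library alone. The sketch below is a deliberately tiny stand-in for a CSS selector engine: it matches only a tag name plus one class, which in standard CSS selector syntax would be written `div.content` (the bracketed `@class` form quoted in the text resembles an XPath predicate). The names `ClassTextCollector` and `extract_by_class` are assumptions for illustration.

```python
from html.parser import HTMLParser
from typing import List, Optional


class ClassTextCollector(HTMLParser):
    """Collect the text inside the first <tag class="cls"> element."""

    def __init__(self, tag: str, cls: str) -> None:
        super().__init__()
        self.tag, self.cls = tag, cls
        self.depth = 0      # > 0 while inside the matched element
        self.done = False   # True once the matched element has closed
        self.parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            if tag == self.tag:
                self.depth += 1  # nested element with the same tag name
        elif tag == self.tag and self.cls in (dict(attrs).get("class") or "").split():
            self.depth = 1       # entered the target node

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def extract_by_class(html: str, tag: str = "div", cls: str = "content") -> Optional[str]:
    collector = ClassTextCollector(tag, cls)
    collector.feed(html)
    collector.close()
    text = " ".join(" ".join(collector.parts).split())
    return text or None  # None signals failure, so the caller can try XPath


page = (
    "<body><div class='nav'>Home</div>"
    "<div class='content' id='conmain'><p>Main text</p><div>nested note</div></div>"
    "<div class='footer'>Copyright</div></body>"
)
css_result = extract_by_class(page)
```

Unlike the non-greedy regular expression, this approach tracks nesting depth, so nested `div` elements inside the target node are handled. A production system would more likely rely on a full selector engine, for example the `select` method of Beautiful Soup.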

As a further preferred embodiment, step S3 specifically comprises:

S31, configuring an XPath path expression for the web page;

S32, performing content extraction on the web page using the XPath path expression; the XPath path expression is used to find the HTML node that holds the content to be extracted, from which the content can then be extracted;

S33, performing data cleansing on the extraction result.

The principle of content extraction based on XPath matching is as follows:

An HTML page has a tree structure that can be expanded and located layer by layer, and XPath works on exactly this property. The main forms of expression are as follows: a double slash // starts locating from the root node, and a single slash / searches one layer down, where each HTML tag represents one layer. The expression for extracting text content is /text(), and the content of an attribute is extracted with the expression /@***, where *** is the name of the attribute.
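The path forms described above can be tried with Python's `xml.etree.ElementTree`, which implements only a subset of XPath and requires well-formed markup. In this sketch, `.//div` plays the role of `//div`, while `/text()` and `/@id` are emulated with `itertext()` and `get()`, since ElementTree does not accept those two forms directly. The sample page and variable names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# A small, well-formed sample page (assumed for the example).
PAGE = """<html><body>
  <div class="nav"><a href="/">Home</a></div>
  <div class="content" id="conmain"><p>Main text</p></div>
</body></html>"""

root = ET.fromstring(PAGE)

# Double slash: search all levels below the root for a matching node.
node = root.find(".//div[@class='content']")

# Emulates ".../text()": gather the text of the node and its descendants.
text = " ".join("".join(node.itertext()).split())

# Emulates ".../@id": read a named attribute from the node.
node_id = node.get("id")

# Single slash: step down exactly one layer at a time (html -> body -> div).
top_level_divs = root.findall("./body/div")
```

Real-world HTML is often not well-formed XML, so a tolerant parser with a full XPath engine, such as lxml, would usually be needed in practice; with such an engine the expression `//div[@class='content']/text()` can be used as written.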

Once the XPath path expression for the web page has been configured in step S31, web pages that could be extracted neither in step S1 nor with the CSS style expression in step S2 are extracted further, which ensures the comprehensiveness of web page content extraction; obtaining the content through the XPath path ensures the accuracy of the result. Through the step-by-step extraction and judgment of steps S1 to S3, the method applies different extraction modes in turn, and can obtain the highest extraction accuracy while guaranteeing the fastest extraction speed.

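As a further preferred embodiment, step S33 specifically comprises:

matching the extraction result against a preset noise lexicon, and then deleting the noise words found in the extraction result.

The data cleansing in steps S13 and S23 is identical to this step; its purpose is to remove irrelevant content.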
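The noise-lexicon cleansing shared by steps S13, S23 and S33 could be sketched as below. The lexicon entries and the function name `clean` are hypothetical; the text does not specify the lexicon's contents or the matching strategy, so plain substring matching is assumed here.

```python
from typing import Iterable

# Hypothetical noise lexicon; a real deployment would load a curated list.
NOISE_LEXICON = ("Advertisement", "All rights reserved", "Share this article")


def clean(text: str, noise_lexicon: Iterable[str] = NOISE_LEXICON) -> str:
    """Delete every noise word or phrase found in the extraction result."""
    # Remove longer entries first so a short entry cannot split a longer one.
    for noise in sorted(noise_lexicon, key=len, reverse=True):
        text = text.replace(noise, "")
    return " ".join(text.split())  # collapse whitespace left by the deletions


cleaned = clean("Advertisement Markets rose today. All rights reserved")
```

If cleansing leaves an empty string, the caller can treat the extraction as failed and move on to the next extraction mode.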

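The present invention also provides a system for extracting web page content, comprising the following modules:

a first extraction module, configured to perform content extraction on a web page based on regular expression matching and, if the extraction is judged successful, to invoke the output module, otherwise to invoke the second extraction module;

a second extraction module, configured to perform content extraction on the web page based on CSS style and, if the extraction is judged successful, to invoke the output module, otherwise to invoke the third extraction module;

a third extraction module, configured to perform content extraction on the web page based on XPath matching;

an output module, configured to output the extraction result.

As a further preferred embodiment, the first extraction module specifically comprises:

a first configuration unit, configured to configure a regular expression for the web page;

a first extraction unit, configured to perform content extraction on the web page using the regular expression;

a first cleaning unit, configured to perform data cleansing on the extraction result;

a first judging unit, configured to invoke the output module if the extraction is judged successful, and otherwise to invoke the second extraction module.

As a further preferred embodiment, the second extraction module specifically comprises:

a second configuration unit, configured to configure a CSS style expression for the web page;

a second extraction unit, configured to perform content extraction on the web page using the CSS style expression;

a second cleaning unit, configured to perform data cleansing on the extraction result;

a second judging unit, configured to invoke the output module if the extraction is judged successful, and otherwise to invoke the third extraction module.

As a further preferred embodiment, the third extraction module specifically comprises:

a third configuration unit, configured to configure an XPath path expression for the web page;

a third extraction unit, configured to perform content extraction on the web page using the XPath path expression;

a third cleaning unit, configured to perform data cleansing on the extraction result.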

The system for extracting web page content of the present invention can execute the method for extracting web page content provided above, can execute any combination of the implementation steps of the method embodiments, and has the corresponding functions and beneficial effects of the method.

The above is a description of preferred implementations of the present invention, but the invention is not limited to these embodiments. Those skilled in the art can make various equivalent variations or substitutions without departing from the spirit of the invention, and all such equivalent variations or substitutions fall within the scope defined by the claims of the present application.
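One possible way to map the modules and units onto code is sketched below, with each extraction module bundling a configured expression, an extraction function and a cleaning function. The class names `ExtractionModule` and `ExtractionSystem` and the toy extractor `find_marker` are assumptions for illustration, not structures given in the text.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class ExtractionModule:
    expression: str                               # configuration unit
    extract: Callable[[str, str], Optional[str]]  # extraction unit
    clean: Callable[[str], str]                   # cleaning unit

    def run(self, page: str) -> Optional[str]:
        raw = self.extract(page, self.expression)
        if raw is None:
            return None
        cleaned = self.clean(raw)
        return cleaned or None  # judging unit: an empty result counts as failure


@dataclass
class ExtractionSystem:
    modules: Sequence[ExtractionModule]  # first, second and third modules in order

    def output(self, page: str) -> Optional[str]:
        for module in self.modules:
            result = module.run(page)
            if result is not None:
                return result  # output module: emit the first successful result
        return None


def find_marker(page: str, marker: str) -> Optional[str]:
    """Toy extractor: return whatever follows `marker`, or None if it is absent."""
    start = page.find(marker)
    return page[start + len(marker):] if start != -1 else None


system = ExtractionSystem([
    ExtractionModule("BODY:", find_marker, str.strip),
    ExtractionModule("TEXT:", find_marker, str.strip),
])
system_result = system.output("header TEXT: hello ")
```

In a fuller version, the three modules would wrap regular expression, CSS style and XPath extractors respectively, each sharing the same noise-lexicon cleaning function.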

Claims

1. A method for extracting web page content, characterized by comprising the following steps:

S1, performing content extraction on a web page based on regular expression matching; if the extraction is judged successful, executing step S4; otherwise, continuing to step S2;

S2, performing content extraction on the web page based on CSS style; if the extraction is judged successful, executing step S4; otherwise, continuing to step S3;

S3, performing content extraction on the web page based on XPath matching;

S4, outputting the extraction result.

2. The method for extracting web page content according to claim 1, characterized in that the step in S1 of performing content extraction on the web page based on regular expression matching specifically comprises:

S11, configuring a regular expression for the web page;

S12, performing content extraction on the web page using the regular expression;

S13, performing data cleansing on the extraction result.

3. The method for extracting web page content according to claim 1, characterized in that the step in S2 of performing content extraction on the web page based on CSS style specifically comprises:

S21, configuring a CSS style expression for the web page;

S22, performing content extraction on the web page using the CSS style expression;

S23, performing data cleansing on the extraction result.

4. The method for extracting web page content according to claim 1, characterized in that step S3 specifically comprises:

S31, configuring an XPath path expression for the web page;

S32, performing content extraction on the web page using the XPath path expression;

S33, performing data cleansing on the extraction result.

5. The method for extracting web page content according to claim 4, characterized in that step S33 specifically comprises:

matching the extraction result against a preset noise lexicon, and then deleting the noise words found in the extraction result.

6. A system for extracting web page content, characterized by comprising the following modules:

a first extraction module, configured to perform content extraction on a web page based on regular expression matching and, if the extraction is judged successful, to invoke the output module, otherwise to invoke the second extraction module;

a second extraction module, configured to perform content extraction on the web page based on CSS style and, if the extraction is judged successful, to invoke the output module, otherwise to invoke the third extraction module;

a third extraction module, configured to perform content extraction on the web page based on XPath matching;

an output module, configured to output the extraction result.

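7. The system for extracting web page content according to claim 6, characterized in that the first extraction module specifically comprises:

a first configuration unit, configured to configure a regular expression for the web page;

a first extraction unit, configured to perform content extraction on the web page using the regular expression;

a first cleaning unit, configured to perform data cleansing on the extraction result;

a first judging unit, configured to invoke the output module if the extraction is judged successful, and otherwise to invoke the second extraction module.

8. The system for extracting web page content according to claim 6, characterized in that the second extraction module specifically comprises:

a second configuration unit, configured to configure a CSS style expression for the web page;

a second extraction unit, configured to perform content extraction on the web page using the CSS style expression;

a second cleaning unit, configured to perform data cleansing on the extraction result;

a second judging unit, configured to invoke the output module if the extraction is judged successful, and otherwise to invoke the third extraction module.

9. The system for extracting web page content according to claim 6, characterized in that the third extraction module specifically comprises:

a third configuration unit, configured to configure an XPath path expression for the web page;

a third extraction unit, configured to perform content extraction on the web page using the XPath path expression;

a third cleaning unit, configured to perform data cleansing on the extraction result.

10. The system for extracting web page content according to claim 9, characterized in that the third submodule is specifically configured to: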