CN110069618A - A kind of method and system of extracting content on web pages - Google Patents


Info

Publication number
CN110069618A
Authority
CN
China
Prior art keywords
webpage
content
extraction
web pages
extracting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711135743.XA
Other languages
Chinese (zh)
Inventor
吴远辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Wanlong Securities Advisory Consultants Co Ltd
Original Assignee
Guangzhou Wanlong Securities Advisory Consultants Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Wanlong Securities Advisory Consultants Co Ltd filed Critical Guangzhou Wanlong Securities Advisory Consultants Co Ltd
Priority to CN201711135743.XA priority Critical patent/CN110069618A/en
Publication of CN110069618A publication Critical patent/CN110069618A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/335Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/90335Query processing
    • G06F16/90344Query processing by using string matching techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses a method and system for extracting web page content, comprising the following steps: S1, performing content extraction on a web page based on regular expression matching and, if the extraction is judged successful, executing step S4, otherwise continuing with step S2; S2, performing content extraction on the web page based on CSS styles and, if the extraction is judged successful, executing step S4, otherwise continuing with step S3; S3, performing content extraction on the web page based on XPath matching; S4, outputting the extraction result. By applying regular expressions, CSS styles and XPath in sequence, the invention aims to extract web page content as quickly as possible, and combining the three extraction modes greatly improves the accuracy of the extracted content, so that effective and accurate extraction results can be provided. The invention can be widely applied in the field of web page information processing.

Description

A kind of method and system of extracting content on web pages
Technical field
The present invention relates to the fields of computer applications and information extraction, and more particularly to a method and system for extracting web page content.
Background art
Definitions of terms:
CSS style: Cascading Style Sheets, a computer language used to describe the presentation of documents such as HTML (an application of the Standard Generalized Markup Language) or XML (a subset of the Standard Generalized Markup Language);
XPath: a language for finding information in an XML document, that is, a language for addressing parts of an XML document. XPath is based on the tree structure of XML and provides the ability to find nodes in that tree.
Text mining and analysis generally involves web page content extraction. Web page content is the basic information element of a text and the basis for understanding the text correctly. Web page content extraction is an important basic tool in application fields such as machine learning, and plays an important role as natural language processing technology moves toward practical use.
During extraction, a web page contains, besides its main content, material unrelated to that content, such as copyright notices, advertisements, navigation bars and decorative elements. This material is referred to as "noise", and it makes automatic extraction of the text content more difficult. Removing the noise and extracting the body content of a web page is of great significance given today's rapid development of Internet technology. Some methods exist in this field, but their technical means are relatively simple, their extraction speed is slow and their accuracy is relatively low, so they struggle to meet application requirements.
Summary of the invention
To solve the above technical problems, the object of the present invention is to provide a method and system for extracting web page content.
The technical solution adopted by the present invention to solve the technical problem is:
A method for extracting web page content, comprising the following steps:
S1. Perform content extraction on the web page based on regular expression matching; if the extraction is judged successful, execute step S4; otherwise, continue with step S2;
S2. Perform content extraction on the web page based on CSS styles; if the extraction is judged successful, execute step S4; otherwise, continue with step S3;
S3. Perform content extraction on the web page based on XPath matching;
S4. Output the extraction result.
As a further preferred embodiment, the step in S1 of performing content extraction on the web page based on regular expression matching specifically includes:
S11. Configure the regular expression for the web page;
S12. Perform content extraction on the web page using the regular expression;
S13. Perform data cleansing on the extraction result.
As a further preferred embodiment, the step in S2 of performing content extraction on the web page based on CSS styles specifically includes:
S21. Configure the CSS style expression for the web page;
S22. Perform content extraction on the web page using the CSS style expression;
S23. Perform data cleansing on the extraction result.
As a further preferred embodiment, step S3 specifically includes:
S31. Configure the XPath path expression for the web page;
S32. Perform content extraction on the web page using the XPath path expression;
S33. Perform data cleansing on the extraction result.
As a further preferred embodiment, step S33 is specifically:
Match the extraction result against a preset noise lexicon, then delete the noise words found in the extraction result.
Another technical solution adopted by the present invention to solve the technical problem is:
A system for extracting web page content, comprising the following modules:
A first extraction module, for performing content extraction on the web page based on regular expression matching; if the extraction is judged successful, the output module is invoked; otherwise, the second extraction module is invoked;
A second extraction module, for performing content extraction on the web page based on CSS styles; if the extraction is judged successful, the output module is invoked; otherwise, the third extraction module is invoked;
A third extraction module, for performing content extraction on the web page based on XPath matching;
An output module, for outputting the extraction result.
As a further preferred embodiment, the first extraction module specifically includes:
A first configuration unit, for configuring the regular expression for the web page;
A first extraction unit, for performing content extraction on the web page using the regular expression;
A first cleaning unit, for performing data cleansing on the extraction result;
A first judging unit, for invoking the output module if the extraction is judged successful and otherwise invoking the second extraction module.
As a further preferred embodiment, the second extraction module specifically includes:
A second configuration unit, for configuring the CSS style expression for the web page;
A second extraction unit, for performing content extraction on the web page using the CSS style expression;
A second cleaning unit, for performing data cleansing on the extraction result;
A second judging unit, for invoking the output module if the extraction is judged successful and otherwise invoking the third extraction module.
As a further preferred embodiment, the third extraction module specifically includes:
A third configuration unit, for configuring the XPath path expression for the web page;
A third extraction unit, for performing content extraction on the web page using the XPath path expression;
A third cleaning unit, for performing data cleansing on the extraction result.
As a further preferred embodiment, the third submodule is specifically used for:
Matching the extraction result against a preset noise lexicon, then deleting the noise words found in the extraction result.
The beneficial effects of the method and system of the present invention are: by applying regular expressions, CSS styles and XPath in sequence, the invention aims to extract web page content as quickly as possible, and combining the three extraction modes greatly improves the accuracy of the extracted content, so that effective and accurate extraction results can be provided.
Brief description of the drawings
Fig. 1 is the flow chart of the method for extracting content on web pages of the invention.
Detailed description of the embodiments
Referring to Fig. 1, the present invention provides a method for extracting web page content, comprising the following steps:
S1. Perform content extraction on the web page based on regular expression matching; if the extraction is judged successful, execute step S4; otherwise, continue with step S2;
S2. Perform content extraction on the web page based on CSS styles; if the extraction is judged successful, execute step S4; otherwise, continue with step S3;
S3. Perform content extraction on the web page based on XPath matching;
S4. Output the extraction result.
The method first performs content extraction on the web page based on regular expressions. If that extraction is unsuccessful, it performs content extraction based on CSS styles, and if that is unsuccessful as well, it performs content extraction based on XPath matching. By applying regular expressions, CSS styles and XPath in sequence according to the characteristics of the web page, the method aims to extract web page content as quickly as possible, and combining the three extraction modes greatly improves the accuracy of the extracted content, so that effective and accurate extraction results can be provided.
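The fallback order of steps S1 to S4 can be sketched in Python as follows. This is only an illustrative sketch, not the patent's implementation: the function names, the `<article>` pattern, and the rule that "success" means a non-empty result are all assumptions, and the S2 and S3 stages are crude placeholders.

```python
import re
from typing import Callable, List, Optional

# An extractor takes raw HTML and returns extracted text, or None on failure.
Extractor = Callable[[str], Optional[str]]


def extract_with_fallback(html: str, stages: List[Extractor]) -> Optional[str]:
    """Try each stage in order (S1, S2, S3); return the first success (S4)."""
    for stage in stages:
        result = stage(html)
        # Assumption: an extraction is "successful" if it yields non-empty text.
        if result and result.strip():
            return result.strip()
    return None


def regex_stage(html: str) -> Optional[str]:
    # S1 stand-in: hypothetical pattern for pages that wrap the body in <article>.
    match = re.search(r"<article>(.*?)</article>", html, re.DOTALL)
    return match.group(1) if match else None


def css_stage(html: str) -> Optional[str]:
    # S2 placeholder: always fails in this sketch.
    return None


def xpath_stage(html: str) -> Optional[str]:
    # S3 placeholder: crude tag stripping, used here only as a last resort.
    text = " ".join(re.sub(r"<[^>]+>", " ", html).split())
    return text or None


STAGES: List[Extractor] = [regex_stage, css_stage, xpath_stage]
```

Calling `extract_with_fallback(page, STAGES)` returns the result of the first stage that succeeds, so later and typically more expensive stages run only when the earlier ones fail.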
As a further preferred embodiment, the step in S1 of performing content extraction on the web page based on regular expression matching specifically includes:
S11. Configure the regular expression for the web page;
S12. Perform content extraction on the web page using the regular expression;
S13. Perform data cleansing on the extraction result.
Regular expressions are an effective way to extract web page content. Once the regular expression for the web page has been configured in step S11, content can be extracted effectively. After extraction, the result is matched against a preset noise lexicon and the noise words found in it are deleted. This cleanses the extraction result by removing material that does not belong to the web page content, making the result more accurate.
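A minimal sketch of steps S11 and S12 using Python's `re` module is given below. The pattern, the `article-body` class name and the function name are hypothetical; a real deployment would configure one pattern per site.

```python
import re
from typing import Optional

# S11 (assumed configuration): one capture group around the main content.
CONTENT_PATTERN = re.compile(
    r'<div\s+class="article-body">(.*?)</div>', re.IGNORECASE | re.DOTALL
)
TAG_PATTERN = re.compile(r"<[^>]+>")


def extract_by_regex(html: str) -> Optional[str]:
    """S12: apply the configured pattern; return None when nothing matches."""
    match = CONTENT_PATTERN.search(html)
    if match is None:
        return None
    # Drop inline tags left in the captured fragment and collapse whitespace.
    text = TAG_PATTERN.sub(" ", match.group(1))
    text = " ".join(text.split())
    return text or None
```

Because the pattern stops at the first closing `</div>`, this sketch does not handle nested `div` elements inside the content block, which is one reason a fallback to structure-aware extraction is useful.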
As a further preferred embodiment, the step in S2 of performing content extraction on the web page based on CSS styles specifically includes:
S21. Configure the CSS style expression for the web page. For example, if the content to be extracted is located in a div tag (<div class='content' id='conmain'>), then by inspecting the id and class attributes of that div tag, the expression for the node containing the content can be configured as "div[@class='content']".
S22. Perform content extraction on the web page using the CSS style expression. The CSS style expression is used to find the HTML node that holds the content to be extracted, and the content is then extracted from that node.
S23. Perform data cleansing on the extraction result to remove irrelevant content.
In step S21, once the CSS style expression for the web page has been configured, web pages that could not be extracted by the regular expression in step S1 are extracted further. After extraction, the result is matched against a preset noise lexicon and the noise words found in it are deleted. This cleanses the extraction result by removing material that does not belong to the web page content, making the result more accurate.
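The node lookup of steps S21 and S22 can be illustrated with Python's standard `html.parser`. This is a sketch under stated assumptions, not a CSS selector engine: it matches only a tag name plus a class name, which corresponds to the conventional CSS selector `div.content` for the example tag above (the expression quoted in the text uses an attribute-predicate form). The class and function names are hypothetical.

```python
from html.parser import HTMLParser
from typing import List, Optional

# Elements that never have a closing tag and so must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class ClassTextExtractor(HTMLParser):
    """Collect the text inside the first element with a given tag and class."""

    def __init__(self, tag: str, css_class: str) -> None:
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.depth = 0      # greater than 0 while inside the matched element
        self.done = False   # set once the matched element has been closed
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS or self.done:
            return
        if self.depth > 0:
            self.depth += 1  # nested element inside the match
        elif tag == self.tag:
            classes = (dict(attrs).get("class") or "").split()
            if self.css_class in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.depth == 0:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_by_class(html: str, tag: str, css_class: str) -> Optional[str]:
    """Return the whitespace-normalized text of the first match, or None."""
    parser = ClassTextExtractor(tag, css_class)
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split())
    return text or None
```

Returning `None` when no node matches gives the judging step a simple signal for falling through to the XPath stage.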
As a further preferred embodiment, step S3 specifically includes:
S31. Configure the XPath path expression for the web page;
S32. Perform content extraction on the web page using the XPath path expression. The XPath path expression is used to find the HTML node that holds the content to be extracted, and the content is then extracted from that node.
S33. Perform data cleansing on the extraction result.
The principle of content extraction based on XPath matching is as follows:
An HTML web page has a tree structure that can be expanded and located layer by layer, and XPath works on the basis of this characteristic. The main forms of expression are as follows: a double slash // starts matching from the root node and selects matching nodes at any depth, and a single slash / searches one layer down, where each HTML tag represents one layer. The expression for extracting text content is /text(), and the content of an attribute is extracted with the expression /@***, where *** is the name of the specific attribute.
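The path forms described above can be tried with Python's standard `xml.etree.ElementTree`, as in the sketch below. Two assumptions apply: the input must be well-formed (real-world HTML usually needs a lenient parser), and ElementTree implements only a subset of XPath, so `itertext()` and `get()` stand in for `/text()` and `/@***`. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def _find(xhtml: str, path: str) -> Optional[ET.Element]:
    """Parse well-formed markup and return the first node matching `path`."""
    try:
        root = ET.fromstring(xhtml)
    except ET.ParseError:
        return None  # input is not well-formed XML
    return root.find(path)


def extract_by_xpath(xhtml: str, path: str) -> Optional[str]:
    """Text under the first matching node (stand-in for /text())."""
    node = _find(xhtml, path)
    if node is None:
        return None
    text = " ".join("".join(node.itertext()).split())
    return text or None


def extract_attribute(xhtml: str, path: str, name: str) -> Optional[str]:
    """Value of one attribute of the first matching node (stand-in for /@name)."""
    node = _find(xhtml, path)
    return None if node is None else node.get(name)
```

In ElementTree the any-depth search is written `.//div[@class='content']`, relative to the parsed root element.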
In step S31, once the XPath path expression for the web page has been configured, web pages that could be extracted neither by the regular expression in step S1 nor by the CSS style expression in step S2 are extracted further. This ensures the comprehensiveness of web page content extraction, and obtaining the content through the XPath path keeps the extraction result accurate. Through the step-by-step extraction and judgment of steps S1 to S3, the method applies different extraction modes in turn, aiming for the highest extraction accuracy while keeping extraction as fast as possible.
As a further preferred embodiment, step S33 is specifically:
Match the extraction result against a preset noise lexicon, then delete the noise words found in the extraction result.
The data cleansing in steps S13 and S23 is identical to this step; its purpose is to remove irrelevant content.
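The cleansing step can be sketched as a simple lexicon-based filter. The lexicon entries and the function name below are hypothetical, and case-insensitive matching is an assumption of this sketch.

```python
import re
from typing import Iterable

# Hypothetical preset noise lexicon; a real one would be curated per site.
NOISE_LEXICON = ["All rights reserved", "Advertisement", "Share this article"]


def clean_result(text: str, lexicon: Iterable[str] = NOISE_LEXICON) -> str:
    """Delete every lexicon phrase from the text and normalize whitespace."""
    # Longest phrases first, so a long phrase is removed before any shorter
    # phrase it happens to contain.
    for phrase in sorted(lexicon, key=len, reverse=True):
        text = re.sub(re.escape(phrase), " ", text, flags=re.IGNORECASE)
    return " ".join(text.split())
```

The same function can serve steps S13, S23 and S33, since the text states that the three cleansing steps are identical.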
The present invention also provides a system for extracting web page content, comprising the following modules:
A first extraction module, for performing content extraction on the web page based on regular expression matching; if the extraction is judged successful, the output module is invoked; otherwise, the second extraction module is invoked;
A second extraction module, for performing content extraction on the web page based on CSS styles; if the extraction is judged successful, the output module is invoked; otherwise, the third extraction module is invoked;
A third extraction module, for performing content extraction on the web page based on XPath matching;
An output module, for outputting the extraction result.
As a further preferred embodiment, the first extraction module specifically includes:
A first configuration unit, for configuring the regular expression for the web page;
A first extraction unit, for performing content extraction on the web page using the regular expression;
A first cleaning unit, for performing data cleansing on the extraction result;
A first judging unit, for invoking the output module if the extraction is judged successful and otherwise invoking the second extraction module.
As a further preferred embodiment, the second extraction module specifically includes:
A second configuration unit, for configuring the CSS style expression for the web page;
A second extraction unit, for performing content extraction on the web page using the CSS style expression;
A second cleaning unit, for performing data cleansing on the extraction result;
A second judging unit, for invoking the output module if the extraction is judged successful and otherwise invoking the third extraction module.
As a further preferred embodiment, the third extraction module specifically includes:
A third configuration unit, for configuring the XPath path expression for the web page;
A third extraction unit, for performing content extraction on the web page using the XPath path expression;
A third cleaning unit, for performing data cleansing on the extraction result.
As a further preferred embodiment, the third submodule is specifically used for:
Matching the extraction result against a preset noise lexicon, then deleting the noise words found in the extraction result.
The system for extracting web page content of the present invention can execute the method for extracting web page content provided above, can execute any combination of the implementation steps of the method embodiments, and has the corresponding functions and beneficial effects of the method.
The above is a description of preferred implementations of the present invention, but the invention is not limited to these embodiments. Those skilled in the art can make various equivalent variations or substitutions without departing from the spirit of the invention, and all such equivalent variations or substitutions fall within the scope defined by the claims of this application.

Claims (10)

1. A method for extracting web page content, characterized in that it comprises the following steps:
S1. Perform content extraction on the web page based on regular expression matching; if the extraction is judged successful, execute step S4; otherwise, continue with step S2;
S2. Perform content extraction on the web page based on CSS styles; if the extraction is judged successful, execute step S4; otherwise, continue with step S3;
S3. Perform content extraction on the web page based on XPath matching;
S4. Output the extraction result.
2. The method for extracting web page content according to claim 1, characterized in that the step in S1 of performing content extraction on the web page based on regular expression matching specifically includes:
S11. Configure the regular expression for the web page;
S12. Perform content extraction on the web page using the regular expression;
S13. Perform data cleansing on the extraction result.
3. The method for extracting web page content according to claim 1, characterized in that the step in S2 of performing content extraction on the web page based on CSS styles specifically includes:
S21. Configure the CSS style expression for the web page;
S22. Perform content extraction on the web page using the CSS style expression;
S23. Perform data cleansing on the extraction result.
4. The method for extracting web page content according to claim 1, characterized in that step S3 specifically includes:
S31. Configure the XPath path expression for the web page;
S32. Perform content extraction on the web page using the XPath path expression;
S33. Perform data cleansing on the extraction result.
5. The method for extracting web page content according to claim 4, characterized in that step S33 is specifically:
Match the extraction result against a preset noise lexicon, then delete the noise words found in the extraction result.
6. A system for extracting web page content, characterized in that it comprises the following modules:
A first extraction module, for performing content extraction on the web page based on regular expression matching; if the extraction is judged successful, the output module is invoked; otherwise, the second extraction module is invoked;
A second extraction module, for performing content extraction on the web page based on CSS styles; if the extraction is judged successful, the output module is invoked; otherwise, the third extraction module is invoked;
A third extraction module, for performing content extraction on the web page based on XPath matching;
An output module, for outputting the extraction result.
7. The system for extracting web page content according to claim 6, characterized in that the first extraction module specifically includes:
A first configuration unit, for configuring the regular expression for the web page;
A first extraction unit, for performing content extraction on the web page using the regular expression;
A first cleaning unit, for performing data cleansing on the extraction result;
A first judging unit, for invoking the output module if the extraction is judged successful and otherwise invoking the second extraction module.
8. The system for extracting web page content according to claim 6, characterized in that the second extraction module specifically includes:
A second configuration unit, for configuring the CSS style expression for the web page;
A second extraction unit, for performing content extraction on the web page using the CSS style expression;
A second cleaning unit, for performing data cleansing on the extraction result;
A second judging unit, for invoking the output module if the extraction is judged successful and otherwise invoking the third extraction module.
9. The system for extracting web page content according to claim 6, characterized in that the third extraction module specifically includes:
A third configuration unit, for configuring the XPath path expression for the web page;
A third extraction unit, for performing content extraction on the web page using the XPath path expression;
A third cleaning unit, for performing data cleansing on the extraction result.
10. The system for extracting web page content according to claim 9, characterized in that the third submodule is specifically used for:
Matching the extraction result against a preset noise lexicon, then deleting the noise words found in the extraction result.
CN201711135743.XA 2017-11-16 2017-11-16 A kind of method and system of extracting content on web pages Pending CN110069618A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711135743.XA CN110069618A (en) 2017-11-16 2017-11-16 A kind of method and system of extracting content on web pages

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711135743.XA CN110069618A (en) 2017-11-16 2017-11-16 A kind of method and system of extracting content on web pages

Publications (1)

Publication Number Publication Date
CN110069618A true CN110069618A (en) 2019-07-30

Family

ID=67364559

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711135743.XA Pending CN110069618A (en) 2017-11-16 2017-11-16 A kind of method and system of extracting content on web pages

Country Status (1)

Country Link
CN (1) CN110069618A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113254751A (en) * 2021-06-24 2021-08-13 北森云计算有限公司 Method, equipment and storage medium for accurately extracting complex webpage structured information

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104462268A (en) * 2014-11-24 2015-03-25 深圳市比一比网络科技有限公司 HTML document information extraction expression method and system
US9177060B1 (en) * 2011-03-18 2015-11-03 Michele Bennett Method, system and apparatus for identifying and parsing social media information for providing business intelligence
CN107220250A (en) * 2016-03-21 2017-09-29 北大方正集团有限公司 A kind of template configuration method and system

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9177060B1 (en) * 2011-03-18 2015-11-03 Michele Bennett Method, system and apparatus for identifying and parsing social media information for providing business intelligence
CN104462268A (en) * 2014-11-24 2015-03-25 深圳市比一比网络科技有限公司 HTML document information extraction expression method and system
CN107220250A (en) * 2016-03-21 2017-09-29 北大方正集团有限公司 A kind of template configuration method and system

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113254751A (en) * 2021-06-24 2021-08-13 北森云计算有限公司 Method, equipment and storage medium for accurately extracting complex webpage structured information
CN113254751B (en) * 2021-06-24 2021-09-21 北森云计算有限公司 Method, equipment and storage medium for accurately extracting complex webpage structured information

Similar Documents

Publication Publication Date Title
CN103853834B (en) Text structure analysis-based Web document abstract generation method
CN105022803B (en) A kind of method and system for extracting Web page text content
CN102541874B (en) Webpage text content extracting method and device
CN101727498A (en) Automatic extraction method of web page information based on WEB structure
Zheng et al. Template-independent news extraction based on visual consistency
CN106055667B (en) It is a kind of based on text-label densities web page core content extracting method
CN102087648B (en) Method and system for fetching news comment page
CN102591612B (en) General webpage text extraction method based on punctuation continuity and system thereof
CN106970912A (en) Chinese sentence similarity calculating method, computing device and computer-readable storage medium
CN106021392A (en) News key information extraction method and system
CN104598577A (en) Extraction method for webpage text
CN102693279A (en) Method, device and system for fast calculating comment similarity
CN101571860A (en) Method and device for generating dynamic website as well as method and device for extracting structural data
CN103559234A (en) System and method for automated semantic annotation of RESTful Web services
CN103970898A (en) Method and device for extracting information based on multistage rule base
CN112257462A (en) Hypertext markup language translation method based on neural machine translation technology
CN106372232B (en) Information mining method and device based on artificial intelligence
CN106202007B (en) A kind of appraisal procedure of MATLAB program files similarity
CN110069618A (en) A kind of method and system of extracting content on web pages
CN104778232B (en) Searching result optimizing method and device based on long query
CN104217025B (en) For the entry extraction system and method for more record webpages
CN106528509A (en) Webpage information extracting method and apparatus
CN106897287A (en) Homepage Publishing decimation in time method and the device for Homepage Publishing decimation in time
CN104572874A (en) Webpage information extraction method and device
CN106168947A (en) A kind of related entities method for digging and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190730