CN110069618A - A kind of method and system of extracting content on web pages - Google Patents
- Publication number
- CN110069618A
- Authority
- CN
- China
- Prior art keywords
- webpage
- content
- extraction
- web pages
- extracting
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/335—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/90335—Query processing
- G06F16/90344—Query processing by using string matching techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Electrically Operated Instructional Devices (AREA)
Abstract
The invention discloses a method and system for extracting web page content. The method comprises the following steps: S1, performing content extraction on a web page based on regular expression matching and, when the extraction is judged successful, executing step S4, otherwise continuing to step S2; S2, performing content extraction on the web page based on CSS style and, when the extraction is judged successful, executing step S4, otherwise continuing to step S3; S3, performing content extraction on the web page based on XPath matching; S4, outputting the extraction result. By applying regular expressions, CSS styles, and XPath in sequence, the invention extracts web page content as quickly as possible, and combining the three extraction modes greatly improves the accuracy of the extracted content, so that effective and accurate extraction results can be provided. The invention can be widely applied in the field of web page information processing.
Description
Technical field
The present invention relates to the fields of computer applications and information extraction, and in particular to a method and system for extracting web page content.
Background art
Explanation of terms:
CSS style: Cascading Style Sheets, a computer language used to describe the presentation of documents such as HTML (an application of the Standard Generalized Markup Language) or XML (a subset of the Standard Generalized Markup Language);
XPath: a language for finding information in an XML document, that is, a language for addressing parts of an XML document. XPath is based on the tree structure of XML and provides the ability to locate nodes in that tree.
General text mining and analysis all involve web page content extraction. Web page content is the basic information element of a text and the basis for understanding the text correctly. Web page content extraction is an important foundational tool for application fields such as machine learning, and it plays an important role as natural language processing technology moves toward practical use.
In the extraction of web page content, a web page contains, besides its main content, material unrelated to that content, such as copyright information, advertisements, navigation bars, and decorative information. This material is called "noise" information, and it makes automatic extraction of the text content harder. Removing noise information and extracting the body content of a web page is of great significance today, as Internet technology develops rapidly. Some methods exist in this field, but their technical means are relatively simple, their extraction is slow, and their extraction accuracy is relatively low, so they struggle to meet application requirements.
Summary of the invention
To solve the above technical problems, the object of the present invention is to provide a method and system for extracting web page content.
The technical solution adopted by the present invention to solve the technical problem is:
A method for extracting web page content, comprising the following steps:
S1, performing content extraction on a web page based on regular expression matching and, when the extraction is judged successful, executing step S4, otherwise continuing to step S2;
S2, performing content extraction on the web page based on CSS style and, when the extraction is judged successful, executing step S4, otherwise continuing to step S3;
S3, performing content extraction on the web page based on XPath matching;
S4, outputting the extraction result.
As a further preferred embodiment, the step in S1 of performing content extraction on the web page based on regular expression matching specifically includes:
S11, configuring a regular expression for the web page;
S12, performing content extraction on the web page using the regular expression;
S13, performing data cleaning on the extraction result.
As a further preferred embodiment, the step in S2 of performing content extraction on the web page based on CSS style specifically includes:
S21, configuring a CSS style expression for the web page;
S22, performing content extraction on the web page using the CSS style expression;
S23, performing data cleaning on the extraction result.
As a further preferred embodiment, step S3 specifically includes:
S31, configuring an XPath path expression for the web page;
S32, performing content extraction on the web page using the XPath path expression;
S33, performing data cleaning on the extraction result.
As a further preferred embodiment, step S33 is specifically:
matching the extraction result against a preset noise lexicon, and then deleting the noise words from the extraction result.
Another technical solution adopted by the present invention to solve the technical problem is:
A system for extracting web page content, comprising the following modules:
First extraction module, configured to perform content extraction on a web page based on regular expression matching and, when the extraction is judged successful, to execute the output module, otherwise to execute the second extraction module;
Second extraction module, configured to perform content extraction on the web page based on CSS style and, when the extraction is judged successful, to execute the output module, otherwise to execute the third extraction module;
Third extraction module, configured to perform content extraction on the web page based on XPath matching;
Output module, configured to output the extraction result.
As a further preferred embodiment, the first extraction module specifically includes:
First configuration unit, configured to configure a regular expression for the web page;
First extraction unit, configured to perform content extraction on the web page using the regular expression;
First cleaning unit, configured to perform data cleaning on the extraction result;
First judging unit, configured to execute the output module when the extraction is judged successful, otherwise to execute the second extraction module.
As a further preferred embodiment, the second extraction module specifically includes:
Second configuration unit, configured to configure a CSS style expression for the web page;
Second extraction unit, configured to perform content extraction on the web page using the CSS style expression;
Second cleaning unit, configured to perform data cleaning on the extraction result;
Second judging unit, configured to execute the output module when the extraction is judged successful, otherwise to execute the third extraction module.
As a further preferred embodiment, the third extraction module specifically includes:
Third configuration unit, configured to configure an XPath path expression for the web page;
Third extraction unit, configured to perform content extraction on the web page using the XPath path expression;
Third cleaning unit, configured to perform data cleaning on the extraction result.
As a further preferred embodiment, the third submodule is specifically configured to:
match the extraction result against a preset noise lexicon, and then delete the noise words from the extraction result.
The beneficial effects of the method and system of the present invention are: by applying regular expressions, CSS styles, and XPath in sequence, the invention extracts web page content as quickly as possible, and combining the three extraction modes greatly improves the accuracy of the extracted content, so that effective and accurate extraction results can be provided.
Brief description of the drawings
Fig. 1 is a flow chart of the method for extracting web page content of the present invention.
Detailed description of the embodiments
Referring to Fig. 1, the present invention provides a method for extracting web page content, comprising the following steps:
S1, performing content extraction on a web page based on regular expression matching and, when the extraction is judged successful, executing step S4, otherwise continuing to step S2;
S2, performing content extraction on the web page based on CSS style and, when the extraction is judged successful, executing step S4, otherwise continuing to step S3;
S3, performing content extraction on the web page based on XPath matching;
S4, outputting the extraction result.
The method first performs content extraction on the web page based on regular expressions. If that extraction is unsuccessful, it performs content extraction based on CSS style, and if that extraction is also unsuccessful, it performs content extraction based on XPath matching. By applying regular expressions, CSS styles, and XPath in sequence according to the characteristics of the web page, the method extracts web page content as quickly as possible, and combining the three extraction modes greatly improves the accuracy of the extracted content, so that effective and accurate extraction results can be provided.
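The sequential S1 to S4 flow described above can be sketched as a simple fallback chain. This is a minimal illustration rather than the patented implementation: the function name `extract_with_fallback` is hypothetical, and the success test (a non-empty string) is an assumption, since the text does not state how an extraction is judged successful.

```python
from typing import Callable, List, Optional

# An extractor takes the page source and returns the extracted text, or None on failure.
Extractor = Callable[[str], Optional[str]]


def extract_with_fallback(html: str, extractors: List[Extractor]) -> Optional[str]:
    """Try each extractor in order (S1, S2, S3) and return the first success (S4)."""
    for extract in extractors:
        result = extract(html)
        # Assumed success criterion: any non-empty result counts as a successful extraction.
        if result:
            return result
    # Every extraction mode failed, so there is no result to output.
    return None
```

In use, the list would hold the regular expression, CSS style, and XPath extractors in that order, so that the later modes run only when the earlier ones fail.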
As a further preferred embodiment, the step in S1 of performing content extraction on the web page based on regular expression matching specifically includes:
S11, configuring a regular expression for the web page;
S12, performing content extraction on the web page using the regular expression;
S13, performing data cleaning on the extraction result.
Regular expressions are an effective way to extract web page content. Once the regular expression of the web page has been configured in step S11, web page content can be extracted effectively. After extraction, the extraction result is matched against a preset noise lexicon and the noise words in the result are deleted. This cleans the extraction result, removing extracted material that does not belong to the web page content and making the result more accurate.
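Steps S11 and S12 might look like the following sketch. The pattern and the names `CONTENT_PATTERN`, `TAG_PATTERN`, and `extract_by_regex` are hypothetical and chosen only for illustration; a real deployment would configure one pattern per site. The pattern stops at the first closing `</div>`, so it is only suitable for content blocks without nested div elements.

```python
import re
from typing import Optional

# S11 (assumed example): capture everything inside a div whose class is 'content'.
CONTENT_PATTERN = re.compile(
    r"<div\s+class='content'[^>]*>(.*?)</div>", re.DOTALL | re.IGNORECASE
)
# Used to strip any markup remaining inside the captured block.
TAG_PATTERN = re.compile(r"<[^>]+>")


def extract_by_regex(html: str) -> Optional[str]:
    """S12: apply the configured pattern; return None when nothing matches."""
    match = CONTENT_PATTERN.search(html)
    if match is None:
        return None
    text = TAG_PATTERN.sub("", match.group(1))
    # Collapse runs of whitespace; an empty result is treated as a failure.
    return " ".join(text.split()) or None
```

Returning `None` on failure lets a caller fall through to the next extraction mode.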
As a further preferred embodiment, the step in S2 of performing content extraction on the web page based on CSS style specifically includes:
S21, configuring a CSS style expression for the web page. For example, if the content to be extracted is located in a div tag (<div class='content' id='conmain'>), then by inspecting the id or class attribute of the div tag, the expression for the node holding the content to be extracted can be configured as "div[@class='content']".
S22, performing content extraction on the web page using the CSS style expression. The CSS style expression is used to find the HTML node of the content to be extracted in the web page, after which the content can be extracted from that node.
S23, performing data cleaning on the extraction result to remove unrelated content.
Once the CSS style expression of the web page has been configured in step S21, web pages that could not be extracted by the regular expression in step S1 are further extracted. After extraction, the extraction result is matched against a preset noise lexicon and the noise words in the result are deleted. This cleans the extraction result, removing extracted material that does not belong to the web page content and making the result more accurate.
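The class-based lookup in steps S21 and S22 can be approximated with the Python standard library alone. The example expression in the text, `div[@class='content']`, uses XPath-style predicate syntax; in standard CSS selector syntax the same node would be written `div.content`. The sketch below mimics that selector with `html.parser` instead of a real CSS engine, and the names `ClassTextCollector` and `extract_by_class` are hypothetical.

```python
from html.parser import HTMLParser
from typing import List, Optional


class ClassTextCollector(HTMLParser):
    """Collect the text inside the first element with the given tag and class name."""

    def __init__(self, tag: str, class_name: str) -> None:
        super().__init__()
        self.tag = tag
        self.class_name = class_name
        self.depth = 0      # nesting depth of `tag` inside the matched element
        self.done = False   # set once the matched element has been closed
        self.parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth > 0:
            if tag == self.tag:
                self.depth += 1  # a nested element of the same tag
        elif tag == self.tag:
            classes = (dict(attrs).get("class") or "").split()
            if self.class_name in classes:
                self.depth = 1   # entered the target element

    def handle_endtag(self, tag):
        if self.depth > 0 and tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)


def extract_by_class(html: str, tag: str = "div", class_name: str = "content") -> Optional[str]:
    """Return the whitespace-normalised text of the matched node, or None."""
    parser = ClassTextCollector(tag, class_name)
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split()) or None
```

Unlike the regular expression approach, this walk tracks nesting depth, so nested div elements inside the content block do not cut the result short.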
As a further preferred embodiment, step S3 specifically includes:
S31, configuring an XPath path expression for the web page;
S32, performing content extraction on the web page using the XPath path expression. The XPath path expression is used to find the HTML node of the content to be extracted in the web page, after which the content can be extracted from that node.
S33, performing data cleaning on the extraction result.
The principle of content extraction based on XPath matching is as follows:
An HTML page is a tree structure that can be expanded and located layer by layer, and XPath works on the basis of this property. The main forms of expression are as follows: a double slash (//) starts the search from the root node, and a single slash (/) steps down to the next layer, where each HTML tag represents one layer. The expression for extracting text content is /text(), and the content of an attribute is extracted with the expression /@***, where *** is the name of the specific attribute.
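The path syntax described above can be tried out with `xml.etree.ElementTree` from the Python standard library, which implements only a subset of XPath. In particular it does not support `/text()` or `/@name` steps, so this sketch reads text through `itertext()` and attributes through `Element.get()` instead. It also requires well-formed XML, which real HTML often is not. The function name `extract_by_xpath` and the default path are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def extract_by_xpath(xhtml: str, path: str = ".//div[@class='content']") -> Optional[str]:
    """Locate one node with a (limited) XPath expression and return its text."""
    try:
        root = ET.fromstring(xhtml)
    except ET.ParseError:
        return None  # the input is not well-formed XML
    # './/' searches all descendants of the root; '[@class=...]' filters by attribute.
    node = root.find(path)
    if node is None:
        return None
    # itertext() yields the node's text and that of all its descendants.
    return " ".join("".join(node.itertext()).split()) or None
```

For arbitrary HTML and full XPath 1.0 expressions, a dedicated HTML parser with XPath support would be needed; the standard library is used here only to keep the sketch self-contained.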
Once the XPath path expression of the web page has been configured in step S31, web pages that could be extracted neither by the regular expression in step S1 nor by the CSS style expression in step S2 are further extracted, which ensures the comprehensiveness of web page content extraction. Obtaining the web page content through the XPath path guarantees the accuracy of the extraction result. Through the step-by-step extract-and-judge process of steps S1 to S3, the method applies different extraction modes in turn and can obtain the highest extraction accuracy while guaranteeing the fastest extraction speed.
As a further preferred embodiment, step S33 is specifically:
matching the extraction result against a preset noise lexicon, and then deleting the noise words from the extraction result.
The data cleaning steps in steps S13 and S23 are identical to this step; their purpose is to remove unrelated content.
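The cleaning step shared by S13, S23, and S33 can be sketched as plain substring removal. The text does not specify how matching against the noise lexicon is performed, so exact substring matching is an assumption here, as are the function name `clean_result` and the sample lexicon entries.

```python
from typing import Iterable


def clean_result(text: str, noise_lexicon: Iterable[str]) -> str:
    """Delete every noise entry found in the extraction result."""
    # Remove longer entries first so a short entry cannot break up a longer one.
    for term in sorted(noise_lexicon, key=len, reverse=True):
        if term:  # skip empty entries, which would match everywhere
            text = text.replace(term, "")
    # Collapse the whitespace left behind by the deletions.
    return " ".join(text.split())


# Hypothetical lexicon entries, for illustration only.
SAMPLE_LEXICON = ["Copyright 2017 Example Corp.", "Advertisement"]
```

Applied to an extraction result such as "Main text. Copyright 2017 Example Corp. Advertisement", the function leaves only the body text.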
The present invention also provides a system for extracting web page content, comprising the following modules:
First extraction module, configured to perform content extraction on a web page based on regular expression matching and, when the extraction is judged successful, to execute the output module, otherwise to execute the second extraction module;
Second extraction module, configured to perform content extraction on the web page based on CSS style and, when the extraction is judged successful, to execute the output module, otherwise to execute the third extraction module;
Third extraction module, configured to perform content extraction on the web page based on XPath matching;
Output module, configured to output the extraction result.
As a further preferred embodiment, the first extraction module specifically includes:
First configuration unit, configured to configure a regular expression for the web page;
First extraction unit, configured to perform content extraction on the web page using the regular expression;
First cleaning unit, configured to perform data cleaning on the extraction result;
First judging unit, configured to execute the output module when the extraction is judged successful, otherwise to execute the second extraction module.
As a further preferred embodiment, the second extraction module specifically includes:
Second configuration unit, configured to configure a CSS style expression for the web page;
Second extraction unit, configured to perform content extraction on the web page using the CSS style expression;
Second cleaning unit, configured to perform data cleaning on the extraction result;
Second judging unit, configured to execute the output module when the extraction is judged successful, otherwise to execute the third extraction module.
As a further preferred embodiment, the third extraction module specifically includes:
Third configuration unit, configured to configure an XPath path expression for the web page;
Third extraction unit, configured to perform content extraction on the web page using the XPath path expression;
Third cleaning unit, configured to perform data cleaning on the extraction result.
As a further preferred embodiment, the third submodule is specifically configured to:
match the extraction result against a preset noise lexicon, and then delete the noise words from the extraction result.
The system for extracting web page content of the present invention can execute the method for extracting web page content provided above, can execute any combination of the implementation steps of the method embodiments, and has the corresponding functions and beneficial effects of the method.
The above describes preferred implementations of the present invention, but the invention is not limited to these embodiments. Those skilled in the art can make various equivalent variations or replacements without departing from the spirit of the invention, and all such equivalent variations or replacements fall within the scope defined by the claims of the present application.
Claims (10)
1. A method for extracting web page content, characterized by comprising the following steps:
S1, performing content extraction on a web page based on regular expression matching and, when the extraction is judged successful, executing step S4, otherwise continuing to step S2;
S2, performing content extraction on the web page based on CSS style and, when the extraction is judged successful, executing step S4, otherwise continuing to step S3;
S3, performing content extraction on the web page based on XPath matching;
S4, outputting the extraction result.
2. The method for extracting web page content according to claim 1, characterized in that the step in S1 of performing content extraction on the web page based on regular expression matching specifically includes:
S11, configuring a regular expression for the web page;
S12, performing content extraction on the web page using the regular expression;
S13, performing data cleaning on the extraction result.
3. The method for extracting web page content according to claim 1, characterized in that the step in S2 of performing content extraction on the web page based on CSS style specifically includes:
S21, configuring a CSS style expression for the web page;
S22, performing content extraction on the web page using the CSS style expression;
S23, performing data cleaning on the extraction result.
4. The method for extracting web page content according to claim 1, characterized in that step S3 specifically includes:
S31, configuring an XPath path expression for the web page;
S32, performing content extraction on the web page using the XPath path expression;
S33, performing data cleaning on the extraction result.
5. The method for extracting web page content according to claim 4, characterized in that step S33 is specifically:
matching the extraction result against a preset noise lexicon, and then deleting the noise words from the extraction result.
6. A system for extracting web page content, characterized by comprising the following modules:
First extraction module, configured to perform content extraction on a web page based on regular expression matching and, when the extraction is judged successful, to execute the output module, otherwise to execute the second extraction module;
Second extraction module, configured to perform content extraction on the web page based on CSS style and, when the extraction is judged successful, to execute the output module, otherwise to execute the third extraction module;
Third extraction module, configured to perform content extraction on the web page based on XPath matching;
Output module, configured to output the extraction result.
7. The system for extracting web page content according to claim 6, characterized in that the first extraction module specifically includes:
First configuration unit, configured to configure a regular expression for the web page;
First extraction unit, configured to perform content extraction on the web page using the regular expression;
First cleaning unit, configured to perform data cleaning on the extraction result;
First judging unit, configured to execute the output module when the extraction is judged successful, otherwise to execute the second extraction module.
8. The system for extracting web page content according to claim 6, characterized in that the second extraction module specifically includes:
Second configuration unit, configured to configure a CSS style expression for the web page;
Second extraction unit, configured to perform content extraction on the web page using the CSS style expression;
Second cleaning unit, configured to perform data cleaning on the extraction result;
Second judging unit, configured to execute the output module when the extraction is judged successful, otherwise to execute the third extraction module.
9. The system for extracting web page content according to claim 6, characterized in that the third extraction module specifically includes:
Third configuration unit, configured to configure an XPath path expression for the web page;
Third extraction unit, configured to perform content extraction on the web page using the XPath path expression;
Third cleaning unit, configured to perform data cleaning on the extraction result.
10. The system for extracting web page content according to claim 9, characterized in that the third submodule is specifically configured to:
match the extraction result against a preset noise lexicon, and then delete the noise words from the extraction result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711135743.XA CN110069618A (en) | 2017-11-16 | 2017-11-16 | A kind of method and system of extracting content on web pages |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711135743.XA CN110069618A (en) | 2017-11-16 | 2017-11-16 | A kind of method and system of extracting content on web pages |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110069618A true CN110069618A (en) | 2019-07-30 |
Family
ID=67364559
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711135743.XA Pending CN110069618A (en) | 2017-11-16 | 2017-11-16 | A kind of method and system of extracting content on web pages |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110069618A (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9177060B1 (en) * | 2011-03-18 | 2015-11-03 | Michele Bennett | Method, system and apparatus for identifying and parsing social media information for providing business intelligence |
CN104462268A (en) * | 2014-11-24 | 2015-03-25 | 深圳市比一比网络科技有限公司 | HTML document information extraction expression method and system |
CN107220250A (en) * | 2016-03-21 | 2017-09-29 | 北大方正集团有限公司 | A kind of template configuration method and system |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113254751A (en) * | 2021-06-24 | 2021-08-13 | 北森云计算有限公司 | Method, equipment and storage medium for accurately extracting complex webpage structured information |
CN113254751B (en) * | 2021-06-24 | 2021-09-21 | 北森云计算有限公司 | Method, equipment and storage medium for accurately extracting complex webpage structured information |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103853834B (en) | Text structure analysis-based Web document abstract generation method | |
CN105022803B (en) | A kind of method and system for extracting Web page text content | |
CN102541874B (en) | Webpage text content extracting method and device | |
CN101727498A (en) | Automatic extraction method of web page information based on WEB structure | |
Zheng et al. | Template-independent news extraction based on visual consistency | |
CN106055667B (en) | It is a kind of based on text-label densities web page core content extracting method | |
CN102087648B (en) | Method and system for fetching news comment page | |
CN102591612B (en) | General webpage text extraction method based on punctuation continuity and system thereof | |
CN106970912A (en) | Chinese sentence similarity calculating method, computing device and computer-readable storage medium | |
CN106021392A (en) | News key information extraction method and system | |
CN104598577A (en) | Extraction method for webpage text | |
CN102693279A (en) | Method, device and system for fast calculating comment similarity | |
CN101571860A (en) | Method and device for generating dynamic website as well as method and device for extracting structural data | |
CN103559234A (en) | System and method for automated semantic annotation of RESTful Web services | |
CN103970898A (en) | Method and device for extracting information based on multistage rule base | |
CN112257462A (en) | Hypertext markup language translation method based on neural machine translation technology | |
CN106372232B (en) | Information mining method and device based on artificial intelligence | |
CN106202007B (en) | A kind of appraisal procedure of MATLAB program files similarity | |
CN110069618A (en) | A kind of method and system of extracting content on web pages | |
CN104778232B (en) | Searching result optimizing method and device based on long query | |
CN104217025B (en) | For the entry extraction system and method for more record webpages | |
CN106528509A (en) | Webpage information extracting method and apparatus | |
CN106897287A (en) | Homepage Publishing decimation in time method and the device for Homepage Publishing decimation in time | |
CN104572874A (en) | Webpage information extraction method and device | |
CN106168947A (en) | A kind of related entities method for digging and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | | Application publication date: 20190730 |