CN111966881A

CN111966881A - Webpage information extraction method and system and electronic equipment

Info

Publication number: CN111966881A
Application number: CN202011095088.1A
Authority: CN
Inventors: 何莹瑜; 丁明会; 许杰; 吴桐
Original assignee: Chengdu Business Big Data Technology Co Ltd
Current assignee: Chengdu Business Big Data Technology Co Ltd
Priority date: 2020-10-14
Filing date: 2020-10-14
Publication date: 2020-11-20

Abstract

The invention discloses a webpage information extraction method, a webpage information extraction system, and electronic equipment. The method and system overcome the low efficiency of the traditional approach of manually customizing field extraction rules, as well as the low accuracy and poor stability of extracting webpage information with existing open-source toolkits. They reduce labor cost and resource cost while improving the accuracy and stability of field information extraction, and therefore have obvious technical advantages and technical effects.

Description

Webpage information extraction method and system and electronic equipment

Technical Field

The invention relates to the field of data analysis, in particular to a webpage information extraction method and system and electronic equipment.

Background

In the big data era, web crawlers are useful tools for collecting data from the internet. A crawler often needs to crawl the pages of hundreds of related sites to obtain topic-related information such as title, time, source, content, and author. Because web development technologies and page designs are diverse, the traditional solution is to customize extraction code and extraction rules for each site. This solution achieves a very high field extraction accuracy, but it depends on the stability of the page structure: whenever a site adjusts its page structure, the extraction code and rules must be adjusted accordingly. The solution therefore lacks generality, is costly to develop, and has poor continuity and stability.

To improve efficiency, another existing solution is to use a third-party open-source crawler development kit such as newspaper, an open-source Python library for extracting website content. Extracting webpage content with newspaper or a similar third-party library improves efficiency, but the framework of such a library is not stable enough: various bugs arise during crawling and the accuracy is low (for example, key URLs or field information may not be obtained), so direct commercial use is difficult.

Disclosure of Invention

The invention aims to overcome the defects of the prior art by providing a webpage information extraction method and a webpage information extraction system. They address two technical defects of existing webpage information extraction: the excessive labor cost of customized extraction rules, and the unstable extraction results and excessive system resource usage of third-party open-source libraries.

To achieve this purpose, the webpage information extraction method processes a target webpage through a preprocessing process, obtains the processed target webpage and the corresponding field extraction rules, generates a field extraction rule base, and extracts the corresponding field information from the target webpage based on the field extraction rule base. Because the field extraction rules are determined during preprocessing, the scheme determines the rules more efficiently, is more adaptable and extensible, and reduces labor cost.

Specifically, the method comprises the following steps:

step S1: receiving a target webpage, preprocessing the original text of the target webpage, acquiring the processed target webpage and a field extraction rule, and generating a field extraction rule base; step S2: reading in the processed target webpage based on the field extraction rule base to obtain corresponding field information; step S3: carrying out correctness verification on the acquired field information.

Further, the preprocessing in step S1 specifically includes the following implementation steps: step S1-1: cleaning the target webpage original text, removing interference information, and obtaining a cleaned target webpage; step S1-2: establishing a webpage sample based on the cleaned target webpage, and learning from the webpage sample to obtain a field extraction rule. In this scheme, the original text of the webpage is cleaned and interference information is removed during preprocessing, which further improves the efficiency and accuracy of field information extraction.

Further, the format of the field extraction rule is a regular expression.

Further, the interference information comprises comments, script codes and predefined texts.

The step of obtaining the field extraction rule in step S1-2 specifically includes the following implementation steps: randomly selecting a part of the target webpages as webpage samples, acquiring field information from the webpage samples, and verifying the correctness of the acquired field information; if the field information is wrong, acquiring the corresponding field extraction rule and submitting a request for modifying it; receiving the modified field extraction rule and replacing the original rule with it. In this scheme, preliminary information acquisition and correctness verification are carried out on the webpage samples before formal extraction, and erroneous rules are corrected, which further improves the correctness and efficiency of the subsequent acquisition of field information from the target website.

Further, step S2 includes the following steps: step S2-1: reading in the target webpage and parsing its header part and text part; step S2-2: acquiring meta information from the meta tags in the header part of the target webpage; step S2-3: acquiring profile-related field information from the header part of the target webpage, wherein the profile comprises a title, a source, and time information; step S2-4: acquiring field information from the text part of the target webpage; step S2-5: optimizing and supplementing the field information obtained in steps S2-3 and S2-4 based on the meta information obtained in step S2-2. In this scheme, the meta information of the webpage is read in as backup information and used to optimize the field information obtained in steps S2-3 and S2-4, further improving the accuracy of field information extraction.

Further, in step S3, the method includes cleaning the obtained field information to remove the interference information, where the interference information includes comment information, style labels, and predefined texts. According to the scheme, the step of field information cleaning is added, and the accuracy of field information acquisition is further improved.

Further, step S3 also includes detecting the obtained field information and determining whether it complies with the field extraction rule. By automatically checking whether the format of a field complies with the rule before the field is output, this scheme further improves the validity of the acquired field information.

Based on the same inventive concept, the second aspect of the present invention discloses a website field general extraction system, which specifically comprises:

a: a preprocessing module, used for preprocessing the target website page before field information is extracted, specifically comprising: a01: a rule determination module, used for learning elements of the website page, acquiring field extraction rules of the website page, and storing the field extraction rules to a field extraction rule base; a02: a first cleaning module, used for cleaning the original text or field information of the website page and removing interference information;

b: a field information extraction module, used for reading the content of the website page according to the field extraction rule base, segmenting the page content to obtain field information related to the field extraction rules, and performing preliminary correctness verification on the obtained field information;

c: a field extraction rule base, comprising a plurality of field extraction rules and used for storing the field extraction rules acquired by the rule determination module.

Further, the system includes an output processing module configured to optimize the extracted field information and verify its correctness, specifically comprising: a second cleaning module, used for cleaning the extracted field information and removing interference information; and a detection and verification module, used for detecting the obtained field information and judging whether it complies with the field extraction rule.

Based on the same inventive concept, the invention also provides an electronic device, which is characterized in that the device comprises a processor and a memory, wherein the memory is used for storing an executable program; the processor is used for executing the executable program to realize the webpage information extraction method.

Compared with the prior art, one or more technical schemes disclosed by the invention have the following obvious technical advantages and technical effects:

1) the method disclosed by the invention overcomes the defect of low efficiency of extracting the field by manual customization in the prior art, reduces the labor cost and improves the efficiency of extracting the field information;

2) the method is general: it can extract fields from web pages of many different styles without a separate set of rules for each kind of page, which simplifies the extraction work and reduces the maintenance cost caused by later web page redesigns;

3) according to the scheme, the advantages of the third-party open source library are used for reference, and the functions and the flow are optimized and cut, so that the efficiency is improved, the resource occupation is reduced, and the resource cost is saved;

4) according to the scheme, by adding the preprocessing process and the optimization verification process, interference of the storage of error data on subsequent use of the data is avoided, and the accuracy of field information extraction is improved.

In conclusion, based on the method and the system disclosed by the invention, the labor cost and the resource cost are reduced, and the extraction efficiency and the extraction accuracy are improved, so that the method and the system have obvious technical advantages and technical effects.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.

Fig. 1 is a schematic flowchart of a method for extracting web page information according to an embodiment of the present invention.

Fig. 2 is a flowchart illustrating a process of obtaining a field extraction rule according to an embodiment of the present invention.

Fig. 3 is a schematic flow chart illustrating a process of obtaining corresponding field information based on a field extraction rule base according to an embodiment of the present invention.

Fig. 4 is a schematic structural diagram of a website field general extraction system according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.

Referring to fig. 1, fig. 1 is a schematic flowchart illustrating a method for extracting webpage information according to an embodiment of the present invention, which specifically includes steps S11 to S13:

step S11: receiving a target webpage, preprocessing the original text of the target webpage, acquiring the processed target webpage and a field extraction rule, and generating a field extraction rule base;

During preprocessing, the original text of the webpage is cleaned and interference information, including comments, script code, predefined character strings and the like, is removed, which further improves the efficiency and accuracy of field information extraction.

In addition, preprocessing includes obtaining the field extraction rule; for the specific steps, refer to the flowchart shown in fig. 2.

The rules established during preprocessing are stored in the rule base, either in a database or in a file. An example of file storage is given below; in this embodiment, the field extraction rule base is stored as a json file in the following format (the patterns match Chinese-format dates, in which 年, 月, 日, 时 and 分 mark the year, month, day, hour and minute):

re_pubdate_list = [
    "(\d{4}年\d{1,2}月\d{1,2}日\s[1-24]\d时[0-60]\d分)([1-24]\d时)",
    "(\d{2}年\d{1,2}月\d{1,2}日\s[0-1][0-9]:[0-5][0-9])",
    "(\d{2}年\d{1,2}月\d{1,2}日\s[2][0-3]:[0-5][0-9]:[0-5][0-9])",
    "(\d{2}年\d{1,2}月\d{1,2}日\s[0-1][0-9]:[0-5][0-9])",
    "(\d{2}年\d{1,2}月\d{1,2}日\s[2][0-3]:[0-5][0-9])",
    "(\d{2}年\d{1,2}月\d{1,2}日\s[1-24]\d时[0-60]\d分)([1-24]\d时)"
]
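As a minimal sketch of how a rule list of this kind might be applied, the following Python code tries each regular expression in order and returns the first captured date. The three patterns and the names `PUBDATE_RULES` and `extract_pubdate` are illustrative assumptions, not the rules or functions of this embodiment.

```python
import re
from typing import List, Optional

# Hypothetical rule list: each entry is a regular expression with one
# capture group. More specific patterns come first.
PUBDATE_RULES: List[str] = [
    r"(\d{4}年\d{1,2}月\d{1,2}日\s[0-2]?\d:[0-5]\d:[0-5]\d)",  # date with HH:MM:SS
    r"(\d{4}年\d{1,2}月\d{1,2}日\s[0-2]?\d:[0-5]\d)",          # date with HH:MM
    r"(\d{4}-\d{1,2}-\d{1,2}\s[0-2]?\d:[0-5]\d)",              # numeric date with HH:MM
]


def extract_pubdate(text: str, rules: List[str] = PUBDATE_RULES) -> Optional[str]:
    """Return the text captured by the first rule that matches, else None."""
    for pattern in rules:
        match = re.search(pattern, text)
        if match:
            return match.group(1)
    return None
```

Ordering the rules from most to least specific keeps a shorter pattern from truncating a longer timestamp.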

Step S12: reading in the processed target webpage based on the field extraction rule base to obtain corresponding field information.

Please refer to the flowchart shown in fig. 3 for a specific implementation process of this step.

Step S13: carrying out correctness verification on the acquired field information.

To further improve the accuracy of the acquired field information, and to prevent erroneous content in it from interfering with subsequent use of the data, the field information acquired in the previous step is verified for correctness in this step.

In this embodiment, the correctness verification includes cleaning the obtained field information, and removing the interference information, specifically including the comment information, the style label, and the predefined text. The predefined text can be set through a configuration file, and specifically comprises advertisement information, special character strings and the like.

In this embodiment, the correctness verification further includes detecting the obtained field information and determining whether it complies with the field extraction rule. The detection rule is configurable; as an example, the check rule for the news_site field is:

{"news_site": {"contains": ["web", "newspaper"], "len": [3, 7]}}

The meaning of the above detection rule is: the news_site field must contain the text "web" or "newspaper", and its length must be at least 3 and at most 7.
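A check rule of this kind could be evaluated roughly as follows. This is a sketch under stated assumptions: the key names `contains` and `len` are inferred from the description of the rule, and `CHECK_RULES` and `check_field` are hypothetical names introduced here for illustration.

```python
from typing import Any, Dict

# Hypothetical check-rule table, modelled on the news_site example.
CHECK_RULES: Dict[str, Dict[str, Any]] = {
    "news_site": {"contains": ["web", "newspaper"], "len": [3, 7]},
}


def check_field(name: str, value: str,
                rules: Dict[str, Dict[str, Any]] = CHECK_RULES) -> bool:
    """Return True if the value satisfies the rule configured for the field."""
    rule = rules.get(name)
    if rule is None:
        return True  # no rule configured for this field
    keywords = rule.get("contains", [])
    # The value must contain at least one of the configured keywords.
    if keywords and not any(keyword in value for keyword in keywords):
        return False
    # The length must lie in the inclusive range [low, high].
    if "len" in rule:
        low, high = rule["len"]
        if not low <= len(value) <= high:
            return False
    return True
```

A value that fails the check would be recorded as erroneous rather than output.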

Referring to fig. 2, fig. 2 is a flowchart illustrating a process of obtaining a field extraction rule according to an embodiment of the present invention.

For convenience of explanation, as an example, the text of a target web page in the embodiment of the present invention is as follows:

<!DOCTYPE html>
<html>
<head>
<meta name="news_site" content="this is a news website"/>
<meta name="keywords" content="keyword 1, keyword 2, keyword 3, keyword 4, keyword 5"/>
<title>this is a news title</title>
</head>
<body>
<script>
// writing a content in a page
document.write("here, a sentence");
</script>
<div><h1>this is a news headline</h1></div>
<div>
<span data-role="news_site">source: <a href="" target="_blank">first website</a></span>
</div>
<div>
<p>this is the first segment of an example news text, three segments total</p>
<p>this is the second segment of an example news text, three segments total</p>
<p>this is the third segment of an example news text, three segments total</p>
</div>
</body>
</html>

Obtaining the field extraction rule specifically includes steps S21 to S24:

Step S21: cleaning the target webpage original text, removing interference information, and obtaining the cleaned target webpage.

To improve the efficiency and accuracy of field information extraction, the original text of the target webpage is cleaned and interference information is removed before formal extraction. The interference information mainly removed includes comments in the web page, script code (such as JavaScript code enclosed between a <script> start tag and a </script> end tag), and predefined text such as advertising words and special strings. In one embodiment, these items are all predefined in a configuration file.

In the embodiment of the present invention, after the target web page in the above example is cleaned, the interference information is removed, and the following contents are obtained:

<head>
<meta name="news_site" content="this is a news website"/>
<title>this is a news title</title>
</head>
<body>
<div><h1>this is a news headline</h1></div>
<div>
</div>
<div>
<p>this is the first segment of an example news text, three segments total</p>
</div>
</body>
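The cleaning step can be sketched in Python as follows. This is only an illustration: it uses regular expressions from the standard `re` module, which is a simplification compared with a full HTML parser, and the names `PREDEFINED_TEXTS` and `clean_html` as well as the sample advertising string are assumptions made for the example.

```python
import re
from typing import List

# Hypothetical predefined interference strings; in practice these would be
# read from a configuration file.
PREDEFINED_TEXTS: List[str] = ["ADVERTISEMENT"]


def clean_html(raw: str, predefined: List[str] = PREDEFINED_TEXTS) -> str:
    """Remove comments, script elements and predefined strings from raw HTML."""
    # Remove HTML comments, including multi-line ones.
    text = re.sub(r"<!--.*?-->", "", raw, flags=re.DOTALL)
    # Remove script elements together with the code they enclose.
    text = re.sub(r"<script\b[^>]*>.*?</script\s*>", "", text,
                  flags=re.DOTALL | re.IGNORECASE)
    # Remove predefined interference strings such as advertising words.
    for item in predefined:
        text = text.replace(item, "")
    # Drop the lines that the removals left empty.
    return "\n".join(line for line in text.splitlines() if line.strip())
```

Cleaning before rule learning and extraction means that later patterns never have to account for script code or comments.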

Step S22: randomly selecting a part of webpages from the target webpages as webpage samples, acquiring field information according to the webpage samples, and verifying the correctness of the acquired field information.

In order to improve the efficiency and accuracy of field information extraction, before formal field information extraction is carried out, preliminary extraction is required to be carried out, and whether the field extraction rule is suitable for a target website webpage is verified.

In one embodiment, a part of the target webpages is randomly selected as webpage samples, the sample webpages are preliminarily extracted based on the existing field extraction rules to obtain field information, and the correctness of the field information is verified. The verification is completed automatically according to defined rules, which can be expressed using regular expressions. For example, the check rule for the news_site field is:

{"news_site": {"contains": ["web", "newspaper"], "len": [3, 7]}}

Field information that does not comply with the field check rule is recorded as erroneous.

Step S23: if the field information is wrong, acquiring the corresponding field extraction rule and submitting a request for modifying it.

If erroneous field information is obtained, the field extraction rule that produced it is faulty and needs to be modified. The corresponding rule is therefore identified from the erroneous field information, and a request to modify it is submitted; the modification can be completed by an operator.

Step S24: receiving the modified field extraction rule and replacing the original rule with it.

The above steps complete the verification of the field extraction rules, which can greatly improve the efficiency of the subsequent formal acquisition of field information.
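The sample-and-verify loop of steps S22 to S24 might look like the following sketch. It assumes that each field has a single regular-expression rule with one capture group and that a validity check is supplied by the caller; `find_failing_rules` and `replace_rule` are hypothetical helper names, and the request to an operator is reduced to returning the list of failing fields.

```python
import random
import re
from typing import Callable, Dict, List


def find_failing_rules(pages: List[str],
                       extract_rules: Dict[str, str],
                       is_valid: Callable[[str, str], bool],
                       sample_size: int = 3,
                       seed: int = 0) -> List[str]:
    """Return the fields whose rule fails on at least one sampled page."""
    # Step S22: randomly select some of the target pages as samples.
    sample = random.Random(seed).sample(pages, min(sample_size, len(pages)))
    failing = []
    for field, pattern in extract_rules.items():
        for page in sample:
            match = re.search(pattern, page)
            value = match.group(1) if match else None
            # Step S23: a missing or invalid value marks the rule as faulty.
            if value is None or not is_valid(field, value):
                failing.append(field)
                break
    return failing


def replace_rule(extract_rules: Dict[str, str], field: str,
                 new_pattern: str) -> Dict[str, str]:
    """Step S24: return a copy of the rules with one rule replaced."""
    updated = dict(extract_rules)
    updated[field] = new_pattern
    return updated
```

Returning a new dictionary in `replace_rule` leaves the previous rule set intact, so a faulty modification can be rolled back.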

Referring to fig. 3, fig. 3 is a schematic diagram illustrating a process of obtaining corresponding field information based on a field extraction rule base according to an embodiment of the present invention, which specifically includes steps S31 to S35:

step S31: reading in a target webpage, and analyzing a head part and a text part of the target webpage;

in this embodiment, the processed target webpage is read in, and the header content and the text content of the webpage are obtained by parsing according to the html webpage code specification, where the header content refers to the following:

<head>
<meta name="news_site" content="this is a news website"/>
<title>this is a news title</title>
</head>

The text part content is as follows:

<body>
<div><h1>this is a news headline</h1></div>
<div>
</div>
<div>
<p>this is the first segment of an example news text, three segments total</p>
</div>
</body>

Step S32: acquiring meta information from the meta tags in the header part of the target webpage;

The meta tags may contain information such as title, source, and time, which can be stored temporarily and used to supplement the field information obtained from the body.

In this embodiment, the meta information obtained from the meta tag is:

{
    "dataUpdate": "2020-04-24 19:00",
    "dataPublished": "2020-04-24 19:00",
    "news_site": "this is a news website",
    "keywords": "keyword 1, keyword 2, keyword 3, keyword 4, keyword 5"
}
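One way to collect the meta information is shown below, as a sketch that uses `html.parser.HTMLParser` from the Python standard library. The class name `MetaCollector` and the function `extract_meta` are hypothetical, and only meta tags that carry both a `name` and a `content` attribute are kept.

```python
from html.parser import HTMLParser
from typing import Dict


class MetaCollector(HTMLParser):
    """Collect name/content pairs from meta tags."""

    def __init__(self) -> None:
        super().__init__()
        self.meta: Dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            self.meta[name] = content


def extract_meta(html: str) -> Dict[str, str]:
    """Return the meta information of a page as a dictionary."""
    parser = MetaCollector()
    parser.feed(html)
    parser.close()
    return parser.meta
```

The resulting dictionary can be kept as backup information for the later supplementation step.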

Step S33: acquiring field information related to the profile from a header part of a target webpage, wherein the profile comprises a title, a source and time;

in this embodiment, the field information related to the profile obtained from the header portion of the target web page includes:

{
    "title": "this is a news headline",
    "news-time": "2020-04-24 17:00",
    "news_site": "first website"
}

Step S34: acquiring field information from the text part of the target webpage;

in this embodiment, the field information obtained from the text part of the target web page includes:

"content": [
    "this is the first segment of an example news text, three segments total",
    "this is the second segment of an example news text, three segments total",
    "this is the third segment of an example news text, three segments total"
]
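Extracting the body paragraphs could be sketched as follows, again with `html.parser.HTMLParser` from the Python standard library. The assumption that the news text lives in `<p>` elements comes from the example page only; `ParagraphCollector` and `extract_content` are hypothetical names.

```python
from html.parser import HTMLParser
from typing import List, Optional


class ParagraphCollector(HTMLParser):
    """Collect the text of each <p> element as one content segment."""

    def __init__(self) -> None:
        super().__init__()
        self.paragraphs: List[str] = []
        self._buffer: Optional[List[str]] = None  # None while outside <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            text = "".join(self._buffer).strip()
            if text:  # skip empty paragraphs
                self.paragraphs.append(text)
            self._buffer = None


def extract_content(html: str) -> List[str]:
    """Return the non-empty paragraph texts of a page, in document order."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    return parser.paragraphs
```

Text inside inline elements nested in a paragraph is kept, while headings and other elements outside `<p>` are ignored.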

Step S35: the field information obtained in step S33 and step S34 is optimally supplemented based on the meta information obtained in step S32 described above.

This step uses the valid meta information obtained in step S32 to verify, optimize, and supplement the target field information, further improving its accuracy and validity. In this embodiment, the process is completed automatically according to a preset optimization strategy, which includes strategies based on format, length, keywords, and the like. As an example, when the target field news_site obtained from the body is NULL and the site field obtained from the meta tag in step S32 is not NULL, the strategy uses the site field from the meta tag as the final value of the target field news_site. In another embodiment, if neither value is NULL, their lengths are compared; if the site field obtained from the meta tag is longer than the news_site value obtained from the body, the meta tag value is again used as the final news_site value according to the length priority principle.
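The two strategies described above, fallback on NULL and length priority, can be sketched as one small function. This is an assumption-laden illustration: `supplement` is a hypothetical name, fields are matched by identical key names, and format and keyword strategies are left out.

```python
from typing import Dict, Optional


def supplement(extracted: Dict[str, Optional[str]],
               meta: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Supplement extracted fields with meta information.

    A meta value replaces the extracted value when the extracted value is
    missing (fallback) or shorter than the meta value (length priority).
    """
    result = dict(extracted)
    for name, value in extracted.items():
        candidate = meta.get(name)
        if not candidate:
            continue  # nothing usable in the meta information
        if not value or len(candidate) > len(value):
            result[name] = candidate
    return result
```

Only fields already present in the extraction result are considered, so unrelated meta entries such as keywords do not leak into the output.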

Through the processing of steps S31 to S35, the finally obtained field information is:

{
    "content": [
        "this is the first segment of an example news text, three segments total",
        "this is the second segment of an example news text, three segments total",
        "this is the third segment of an example news text, three segments total"
    ],
    "title": "this is a news headline",
    "news-time": "2020-04-24 17:00",
    "news_site": "first website"
}

Referring to fig. 4, fig. 4 is a schematic structural diagram of a website field general extraction system according to an embodiment of the present invention. The system mainly includes a preprocessing module 01, a field information extraction module 02, an output processing module 03, and a field extraction rule base 04, wherein:

The preprocessing module 01 is used for preprocessing a target website page before field information is extracted. Specifically, it cleans the input target webpage original text, removes interference information, and obtains field extraction rules from the target webpage.

The preprocessing module 01 includes two main functional modules, a first cleaning module 011 and a rule determining module 012, wherein:

The first cleaning module 011 is used for cleaning the website page original text or field information and removing interference information, where the interference information can be comments, script code, or predefined text such as advertising words and special character strings.

The rule determining module 012 is used for learning elements of the website page, acquiring the field extraction rules of the page, and storing them in the field extraction rule base. The field extraction rules take the form of regular expressions; for the process by which this module acquires the rules, refer to the flowchart shown in fig. 2.

The field information extraction module 02 is used for reading the content of the website page according to the field extraction rule base, segmenting the page content to obtain field information related to the field extraction rules, and performing preliminary correctness verification on the obtained field information. For the process of extracting field information, refer to the flowchart of fig. 3.

In one embodiment, the field information extraction module is built by trimming and extending the open-source library newspaper. After this trimming, the field information extraction module has the following advantages:

1) the web page request functionality is improved, which reduces network I/O and shortens processing time;

2) unnecessary element extraction is trimmed away, which reduces the time spent extracting and processing elements and reduces the interference caused by an overly large set of result fields;

3) the existing field extraction rules are extended and supplemented, which enlarges the range of fields that can be extracted.

The output processing module 03 is configured to optimize the extracted field information and verify its correctness, and specifically includes a second cleaning module 031 and a detection and verification module 032, wherein:

The second cleaning module 031 is used for cleaning the extracted field information and removing interference information, which may include comment information, style labels, and predefined text such as advertising words.

The detection and verification module 032 is used for detecting the obtained field information and judging whether it complies with the field extraction rule.

The field extraction rule base 04 comprises a plurality of field extraction rules and is used for storing the field extraction rules acquired by the rule determining module. The rules may be stored in a database or in a file. An example of file storage is given below; in this embodiment, the field extraction rule base is stored as a json file in the following format (the patterns match Chinese-format dates, in which 年, 月, 日, 时 and 分 mark the year, month, day, hour and minute):

re_pubdate_list = [
    "(\d{4}年\d{1,2}月\d{1,2}日\s[1-24]\d时[0-60]\d分)([1-24]\d时)",
    "(\d{2}年\d{1,2}月\d{1,2}日\s[0-1][0-9]:[0-5][0-9])",
    "(\d{2}年\d{1,2}月\d{1,2}日\s[2][0-3]:[0-5][0-9]:[0-5][0-9])",
    "(\d{2}年\d{1,2}月\d{1,2}日\s[0-1][0-9]:[0-5][0-9])",
    "(\d{2}年\d{1,2}月\d{1,2}日\s[2][0-3]:[0-5][0-9])",
    "(\d{2}年\d{1,2}月\d{1,2}日\s[1-24]\d时[0-60]\d分)([1-24]\d时)"
]
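File-based storage of the rule base might be handled as in the sketch below. The file layout (one JSON object that maps a field name to a list of patterns) and the names `save_rule_base` and `load_rule_base` are assumptions made for illustration; the embodiment does not specify them.

```python
import json
import re
from typing import Dict, List

# Assumed layout: field name -> list of regular-expression patterns.
RuleBase = Dict[str, List[str]]


def save_rule_base(path: str, rules: RuleBase) -> None:
    """Write the rule base to a JSON file, keeping non-ASCII characters."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(rules, handle, ensure_ascii=False, indent=2)


def load_rule_base(path: str) -> RuleBase:
    """Read the rule base and check that every pattern compiles."""
    with open(path, encoding="utf-8") as handle:
        rules = json.load(handle)
    for patterns in rules.values():
        for pattern in patterns:
            re.compile(pattern)  # raises re.error for an invalid pattern
    return rules
```

Compiling each pattern at load time surfaces a malformed rule immediately instead of during extraction.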

Those of ordinary skill in the art will appreciate that the various illustrative modules described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

In the several embodiments provided in the present application, it should be understood that the disclosed system may be implemented in other ways. For example, the division into modules is only a logical functional division, and other divisions are possible in an actual implementation: multiple modules or components may be combined or integrated into another system, or some features may be omitted or not implemented.

The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims

1. A webpage information extraction method, characterized in that a target webpage is processed through a preprocessing process, the processed target webpage and corresponding field extraction rules are obtained, a field extraction rule base is generated, and corresponding field information is extracted from the target webpage based on the field extraction rule base.

2. The method for extracting web page information as claimed in claim 1, comprising the steps of:

step S1: receiving a target webpage, preprocessing the original text of the target webpage, acquiring the processed target webpage and a field extraction rule, and generating a field extraction rule base;

step S2: reading in the processed target webpage based on the field extraction rule base to obtain corresponding field information;

step S3: verifying the correctness of the acquired field information.

3. The method for extracting web page information as claimed in claim 2, wherein the preprocessing in step S1 includes the following steps:

step S1-1: cleaning the original text of the target webpage, removing interference information, and obtaining a cleaned target webpage;

step S1-2: establishing a webpage sample based on the cleaned target webpage, and learning from the webpage sample to obtain a field extraction rule.

4. The method for extracting web page information as claimed in claim 2, wherein the format of the field extraction rule is a regular expression.
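
As an illustration of a rule base whose rules are regular expressions, the following minimal Python sketch stores one pattern per field and applies each pattern to a page. The rule base layout, the field names, and the patterns themselves are hypothetical and are not taken from the invention.

```python
import re

# Hypothetical rule base: one regular expression per field,
# each with exactly one capture group holding the field value.
RULE_BASE = {
    "title": re.compile(r"<h1[^>]*>(.*?)</h1>", re.S | re.I),
    "time": re.compile(r"(\d{4}-\d{2}-\d{2}(?: \d{2}:\d{2}(?::\d{2})?)?)"),
    "source": re.compile(r"Source:\s*([^<\s][^<]*)"),
}


def extract_fields(html, rule_base=RULE_BASE):
    """Apply every rule to the page; keep the first match per field, or None."""
    fields = {}
    for name, pattern in rule_base.items():
        match = pattern.search(html)
        fields[name] = match.group(1).strip() if match else None
    return fields


page = ("<h1>Sample headline</h1>"
        "<span>Source: Example News</span>"
        "<span>2020-10-14 09:30</span>")
result = extract_fields(page)
```

Keeping the rules in a plain mapping means a single rule can be replaced without touching the extraction code.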

5. The method for extracting web page information as claimed in claim 3, wherein the interference information includes comments, script code, and predefined text.
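
The cleaning of a page's original text can be sketched as follows. This is a simplified illustration that assumes regular-expression removal is sufficient; the function name and the list of predefined texts are hypothetical, and a production system might prefer a full HTML parser.

```python
import re

# Hypothetical examples of predefined texts treated as interference.
PREDEFINED_TEXTS = ["Click to share", "Scan the QR code"]


def clean_html(raw_html, predefined_texts=PREDEFINED_TEXTS):
    """Remove HTML comments, script blocks, and predefined texts from a page."""
    # Drop <!-- ... --> comments, including multi-line ones.
    cleaned = re.sub(r"<!--.*?-->", "", raw_html, flags=re.S)
    # Drop <script> elements together with their contents.
    cleaned = re.sub(r"<script\b[^>]*>.*?</script\s*>", "", cleaned,
                     flags=re.S | re.I)
    # Drop each predefined text verbatim.
    for text in predefined_texts:
        cleaned = cleaned.replace(text, "")
    return cleaned


raw = "<p>Body</p><!-- ad slot --><script>var a = 1;</script><p>Click to share</p>"
cleaned = clean_html(raw)
```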

6. The method for extracting web page information as claimed in claim 3, wherein the step of obtaining the field extraction rule in the step S1-2 comprises the following steps:

randomly selecting a part of webpages from the target webpages as webpage samples, acquiring field information according to the webpage samples, and verifying the correctness of the acquired field information;

if the field information is wrong, acquiring a corresponding field extraction rule, and submitting a request for modifying the field extraction rule;

receiving the modified field extraction rule and replacing the original field extraction rule with it.
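
The sample-and-correct loop of claim 6 might look like the sketch below. It assumes that manually verified reference values (`expected`) are available for the sampled pages and models the modification request simply as the set of failing fields. All names are hypothetical.

```python
import random
import re


def find_failing_rules(pages, rule_base, expected, sample_size, seed=0):
    """Randomly sample pages, apply each rule, and collect fields that are wrong."""
    rng = random.Random(seed)
    sample = rng.sample(sorted(pages), min(sample_size, len(pages)))
    failing = set()
    for url in sample:
        for field, pattern in rule_base.items():
            match = re.search(pattern, pages[url])
            value = match.group(1) if match else None
            if value != expected[url].get(field):
                failing.add(field)  # this rule needs a modification request
    return failing


def replace_rule(rule_base, field, new_pattern):
    """Return a copy of the rule base with one rule replaced by its modified version."""
    updated = dict(rule_base)
    updated[field] = new_pattern
    return updated


pages = {"a": "<h2>First</h2>", "b": "<h2>Second</h2>"}
expected = {"a": {"title": "First"}, "b": {"title": "Second"}}
rules = {"title": r"<h1>(.*?)</h1>"}

failing = find_failing_rules(pages, rules, expected, sample_size=2)
rules = replace_rule(rules, "title", r"<h2>(.*?)</h2>")
still_failing = find_failing_rules(pages, rules, expected, sample_size=2)
```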

7. The method for extracting web page information as claimed in claim 3, wherein said step S2 specifically includes the following steps:

step S2-1: reading in a target webpage, and analyzing a header part and a text part of the target webpage;

step S2-2: acquiring meta information in a meta tag from a header part of the target webpage;

step S2-3: acquiring field information related to a profile from a header part of the target webpage, wherein the profile comprises a title, a source and time information;

step S2-4: acquiring field information from the text part of the target webpage;

step S2-5: optimizing and supplementing the field information obtained in step S2-3 and step S2-4 based on the meta information obtained in step S2-2.
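
A minimal sketch of steps S2-2 and S2-5, using Python's standard `html.parser` module. It reads `meta` tags and the `title` element, then fills fields that are still empty from the meta information. Treating the supplement step as "fill only empty fields", as well as the meta names and the field mapping, are assumptions made for illustration.

```python
from html.parser import HTMLParser


class HeadParser(HTMLParser):
    """Collect <meta name=... content=...> pairs and the <title> text."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") and attrs.get("content"):
            self.meta[attrs["name"].lower()] = attrs["content"]
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def supplement(fields, meta, mapping):
    """Fill fields that are still empty from the meta information."""
    merged = dict(fields)
    for field, meta_name in mapping.items():
        if not merged.get(field) and meta.get(meta_name):
            merged[field] = meta[meta_name]
    return merged


html = ('<html><head><title>Sample headline</title>'
        '<meta name="author" content="Example Author">'
        '<meta name="publishdate" content="2020-10-14"></head>'
        '<body><p>Body text.</p></body></html>')
parser = HeadParser()
parser.feed(html)
fields = {"title": parser.title, "author": None, "time": None}
fields = supplement(fields, parser.meta, {"author": "author", "time": "publishdate"})
```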

8. The method for extracting web page information as claimed in claim 3, wherein step S3 includes cleaning the obtained field information to remove interference information, the interference information including comment information, style tags, and predefined text.

9. The method for extracting web page information as claimed in claim 3, wherein step S3 further comprises checking the obtained field information to determine whether it conforms to the field extraction rule.
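
The cleaning and checking of extracted field values in step S3 can be sketched as follows. The format patterns, the field names, and the choice to strip any remaining tags are assumptions for illustration. The claims do not specify how conformance with a rule is tested, so this sketch uses a full-match against a per-field pattern.

```python
import re

# Hypothetical format checks applied to extracted values.
FIELD_FORMATS = {
    "time": re.compile(r"\d{4}-\d{2}-\d{2}( \d{2}:\d{2}(:\d{2})?)?"),
    "title": re.compile(r".{1,200}"),
}


def clean_field(value, predefined_texts=()):
    """Strip comments, <style> blocks, remaining tags, and predefined texts."""
    value = re.sub(r"<!--.*?-->", "", value, flags=re.S)
    value = re.sub(r"<style\b[^>]*>.*?</style\s*>", "", value, flags=re.S | re.I)
    value = re.sub(r"<[^>]+>", "", value)
    for text in predefined_texts:
        value = value.replace(text, "")
    return value.strip()


def verify_fields(fields, formats=FIELD_FORMATS):
    """Return the names of fields that are missing or do not match their format."""
    invalid = []
    for name, pattern in formats.items():
        value = fields.get(name)
        if not value or not pattern.fullmatch(value):
            invalid.append(name)
    return invalid


raw_fields = {"title": "<b>Sample headline</b><!-- tracking -->",
              "time": "Published 2020-10-14"}
cleaned = {name: clean_field(value, ["Published"])
           for name, value in raw_fields.items()}
invalid = verify_fields(cleaned)
bad = verify_fields({"title": "Sample headline", "time": "yesterday"})
```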

10. A webpage information extraction system, characterized by comprising:

A: a front-end processing module, configured to preprocess the target website page before field information is extracted, and specifically comprising:

A01: a rule determination module, configured to learn elements of the website page, acquire field extraction rules of the website page, and store the field extraction rules in a field extraction rule base;

A02: a first cleaning module, configured to clean the original text or field information of the website page and remove interference information.

11. The system for extracting web page information according to claim 10, further comprising an output processing module, wherein the output processing module is configured to perform optimization and correctness verification on the extracted field information, and specifically comprises:

a second cleaning module, configured to clean the extracted field information and remove interference information;

a detection and verification module, configured to check the obtained field information and determine whether it conforms to the field extraction rule.

12. An electronic device, characterized in that the device comprises a processor and a memory,

the memory is used for storing an executable program;

the processor is used for executing the executable program to implement the webpage information extraction method of any one of claims 1 to 9.