CN111966881A - Webpage information extraction method and system and electronic equipment - Google Patents

Webpage information extraction method and system and electronic equipment

Info

Publication number
CN111966881A
Authority
CN
China
Prior art keywords
information
field
webpage
extraction rule
extraction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011095088.1A
Other languages
Chinese (zh)
Inventor
何莹瑜
丁明会
许杰
吴桐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Business Big Data Technology Co Ltd
Original Assignee
Chengdu Business Big Data Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Business Big Data Technology Co Ltd
Priority to CN202011095088.1A
Publication of CN111966881A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a webpage information extraction method, a webpage information extraction system, and electronic equipment. The method and system overcome the low efficiency of the traditional approach of manually customizing field extraction rules, and also address the low accuracy and poor stability of extracting webpage information with existing open source toolkits. They reduce labor and resource costs while improving the accuracy and stability of field information extraction, and therefore offer clear technical advantages.

Description

Webpage information extraction method and system and electronic equipment
Technical Field
The invention relates to the field of data analysis, in particular to a webpage information extraction method and system and electronic equipment.
Background
In the big data era, web crawlers are useful tools for collecting data from the internet. A crawler may need to crawl the related pages of hundreds of sites to obtain topic-related information such as titles, times, sources, contents, and authors. Because web page development technologies and style designs are diverse, the traditional solution is to customize extraction code and extraction rules for each site. This solution achieves a very high field extraction accuracy, but it has an obvious drawback: because page styles vary and the rules depend on a stable page structure, any adjustment to the page structure requires a corresponding adjustment to the extraction code and rules. The solution is therefore not general, is costly to develop, and has poor continuity and stability.
To improve efficiency, another existing solution is to use a third-party open source crawler development kit such as newspaper, an open source Python class library for extracting website content. Although a third-party open source library like newspaper improves efficiency, its framework is not stable enough, various bugs appear during crawling, and its accuracy is low; for example, key URLs and field information may fail to be obtained. Direct commercial use is therefore difficult.
Disclosure of Invention
The invention aims to overcome the defects in the prior art, and provides a webpage information extraction method and a webpage information extraction system, which are used for solving the technical defects that in the existing webpage information extraction process, the labor cost consumption is too high by utilizing a customized extraction rule, or the extraction result is unstable and the system resource occupation is too high by utilizing a third-party open source library.
To achieve this purpose, the webpage information extraction method processes a target webpage through a preprocessing stage, obtains the processed target webpage and the corresponding field extraction rules, generates a field extraction rule base, and extracts the corresponding field information from the target webpage based on that rule base. Because the field extraction rules are determined during preprocessing, the scheme makes rule creation more efficient, adapts and extends more readily, and reduces labor cost.
Specifically, the method comprises the following steps:
step S1: receiving a target webpage, preprocessing the original text of the target webpage, acquiring the processed target webpage and field extraction rules, and generating a field extraction rule base; step S2: reading in the processed target webpage based on the field extraction rule base to obtain the corresponding field information; step S3: verifying the correctness of the acquired field information.
Further, the preprocessing in step S1 specifically includes the following steps: step S1-1: cleaning the original text of the target webpage and removing interference information to obtain a cleaned target webpage; step S1-2: establishing webpage samples based on the cleaned target webpage, and learning from the webpage samples to obtain field extraction rules. In this scheme, cleaning the original webpage text and removing interference information during preprocessing further improves the efficiency and accuracy of field information extraction.
Further, the format of the field extraction rule is a regular expression.
Further, the interference information comprises comments, script codes and predefined texts.
Obtaining the field extraction rule in step S1-2 specifically includes the following steps: randomly selecting a part of the target webpages as webpage samples, acquiring field information from the webpage samples, and verifying the correctness of the acquired field information; if the field information is wrong, acquiring the corresponding field extraction rule and submitting a request to modify it; receiving the modified field extraction rule and replacing the erroneous rule with it. Because preliminary extraction and correctness verification are performed on the webpage samples, and erroneous rules are corrected, before formal extraction begins, the accuracy and efficiency of subsequently acquiring field information from the target website's pages are further improved.
Further, the step S2 includes the following steps: step S2-1: reading in a target webpage, and analyzing a header part and a text part of the target webpage; step S2-2: acquiring meta information in a meta tag from a header part of the target webpage; step S2-3: acquiring field information related to a profile from a header part of the target webpage, wherein the profile comprises a title, a source and time information; step S2-4: acquiring field information from the text part of the target webpage; step S2-5: the field information obtained in step S2-3 and step S2-4 is optimally supplemented based on the meta information obtained in the above-described step S2-2. According to the scheme, the meta information of the webpage is read in to serve as backup information, the field information obtained in the step S2-3 and the step S2-4 is optimized, and the accuracy of field information extraction is further improved.
Further, in step S3, the method includes cleaning the obtained field information to remove the interference information, where the interference information includes comment information, style labels, and predefined texts. According to the scheme, the step of field information cleaning is added, and the accuracy of field information acquisition is further improved.
Further, step S3 also includes checking the obtained field information to determine whether it complies with the field extraction rule. This scheme adds an automatic check, before a field is output, of whether its format conforms to the rule, which further improves the validity of the acquired field information.
Based on the same inventive concept, the second aspect of the present invention discloses a website field general extraction system, which specifically comprises:
a: a preprocessing module, used for preprocessing the target website page before field information is extracted, and specifically comprising: a01: a rule determination module, used for learning the elements of the website page, acquiring the field extraction rules of the website page, and storing them in the field extraction rule base; a02: a first cleaning module, used for cleaning the original text or field information of the website page and removing interference information;
b: a field information extraction module, used for reading the content of the website page according to the field extraction rule base, segmenting the page content to obtain the field information related to the field extraction rules, and performing preliminary correctness verification on the obtained field information;
c: a field extraction rule base, which comprises a plurality of field extraction rules and is used for storing the field extraction rules acquired by the rule determination module.
Further, the system includes an output processing module used for optimizing the extracted field information and verifying its correctness, which specifically comprises: a second cleaning module, used for cleaning the extracted field information and removing interference information; and a detection and verification module, used for checking the obtained field information and determining whether it complies with the field extraction rule.
Based on the same inventive concept, the invention also provides an electronic device, which is characterized in that the device comprises a processor and a memory, wherein the memory is used for storing an executable program; the processor is used for executing the executable program to realize the webpage information extraction method.
Compared with the prior art, one or more technical schemes disclosed by the invention have the following obvious technical advantages and technical effects:
1) the method disclosed by the invention overcomes the defect of low efficiency of extracting the field by manual customization in the prior art, reduces the labor cost and improves the efficiency of extracting the field information;
2) the method is general: it can extract fields from web pages of many styles without a separate set of rules for each kind of page, which simplifies the extraction work and reduces the maintenance cost caused by later web page redesigns;
3) the scheme borrows the strengths of the third-party open source library and optimizes and trims its functions and flow, which improves efficiency, reduces resource usage, and saves resource cost;
4) by adding the preprocessing process and the optimization and verification process, the scheme keeps erroneous data from being stored and interfering with later use of the data, and improves the accuracy of field information extraction.
In conclusion, based on the method and the system disclosed by the invention, the labor cost and the resource cost are reduced, and the extraction efficiency and the extraction accuracy are improved, so that the method and the system have obvious technical advantages and technical effects.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
Fig. 1 is a schematic flowchart of a method for extracting web page information according to an embodiment of the present invention.
Fig. 2 is a flowchart illustrating a process of obtaining a field extraction rule according to an embodiment of the present invention.
Fig. 3 is a schematic flow chart illustrating a process of obtaining corresponding field information based on a field extraction rule base according to an embodiment of the present invention.
Fig. 4 is a schematic structural diagram of a website field general extraction system according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, fig. 1 is a schematic flowchart illustrating a method for extracting webpage information according to an embodiment of the present invention, which specifically includes steps S11 to S13:
step S11: receiving a target webpage, preprocessing the original text of the target webpage, acquiring the processed target webpage and field extraction rules, and generating a field extraction rule base;
During preprocessing, the original text of the webpage is cleaned and interference information, including comments, script code, and predefined character strings, is removed, which further improves the efficiency and accuracy of field information extraction.
Preprocessing also includes obtaining the field extraction rules; see the flowchart in fig. 2 for the specific steps.
The rules established during preprocessing are stored in the rule base, either in a database or in a file. An example of file storage is given below; in this embodiment, the field extraction rule base is stored as a json file in the following format:
re_pubdate_list = [
"(\d{4}年\d{1,2}月\d{1,2}日\s[image][1-24]\d时[0-60]\d分)([1-24]\d时)",
"(\d{2}年\d{1,2}月\d{1,2}日\s[image][0-1][image][0-9]:[0-5][image][0-9]:[0-5][image][0-9])",
"(\d{2}年\d{1,2}月\d{1,2}日\s[image][2][0-3]:[0-5][image][0-9]:[0-5][image][0-9])",
"(\d{2}年\d{1,2}月\d{1,2}日\s[image][0-1][image][0-9]:[0-5][image][0-9])",
"(\d{2}年\d{1,2}月\d{1,2}日\s[image][2][0-3]:[0-5][image][0-9])",
"(\d{2}年\d{1,2}月\d{1,2}日\s[image][1-24]\d时[0-60]\d分)([1-24]\d时)"
]
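As a rough illustration of how a list of date rules like this might be applied, the following Python sketch tries each pattern in order and returns the first match. The patterns here are simplified, hypothetical stand-ins rather than the expressions of this embodiment, and the names `RE_PUBDATE_LIST` and `extract_pubdate` are assumptions made for the example.

```python
import re

# Hypothetical, simplified date rules; tried in order, first match wins.
RE_PUBDATE_LIST = [
    r"(\d{4}-\d{1,2}-\d{1,2}\s+\d{1,2}:\d{2}(?::\d{2})?)",  # 2020-04-24 17:00[:00]
    r"(\d{4}年\d{1,2}月\d{1,2}日\s*\d{1,2}:\d{2})",          # 2020年4月24日 17:00
    r"(\d{4}-\d{1,2}-\d{1,2})",                              # date only
]


def extract_pubdate(text, rules=RE_PUBDATE_LIST):
    """Return the first substring captured by any rule, or None if no rule matches."""
    for pattern in rules:
        match = re.search(pattern, text)
        if match:
            return match.group(1)
    return None
```

Ordering the rules from most to least specific means a full timestamp is preferred over a bare date when both would match.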
Step S12: reading in the processed target webpage based on the field extraction rule base to obtain the corresponding field information.
See the flowchart in fig. 3 for the specific implementation of this step.
Step S13: verifying the correctness of the acquired field information.
To further improve the accuracy of the acquired field information, and to keep erroneous content in it from interfering with later use of the data, the field information acquired in the previous step is verified for correctness in this step.
In this embodiment, the correctness verification includes cleaning the obtained field information, and removing the interference information, specifically including the comment information, the style label, and the predefined text. The predefined text can be set through a configuration file, and specifically comprises advertisement information, special character strings and the like.
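A minimal Python sketch of this field-level cleaning follows. The function name `clean_field` and the sample predefined text are assumptions, and the regular expressions are a crude stand-in for whatever tag handling a production system would use.

```python
import re


def clean_field(value, predefined=()):
    """Strip comments, markup tags, and predefined texts from one field value."""
    # Remove HTML comments first so that tags inside comments are not left behind.
    value = re.sub(r"<!--.*?-->", "", value, flags=re.DOTALL)
    # Remove opening and closing tags (style labels such as <a ...> or </span>).
    value = re.sub(r"</?[A-Za-z][^>]*>", "", value)
    # Remove predefined texts, e.g. advertising words loaded from a configuration file.
    for text in predefined:
        value = value.replace(text, "")
    # Collapse runs of whitespace left over from the removals.
    return " ".join(value.split())
```

For example, `clean_field('source: <a href="">first website</a>', ["source:"])` leaves only the site name.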
In this embodiment, the correctness verification further includes checking the obtained field information and determining whether it complies with the field extraction rule. The check rule is configurable; as an example, the check rule for the news_site field is:
{"news_site": {"contains": ["web", "newspaper"], "len": [3, 7]}}
The meaning of the above check rule is: the news_site field must contain the text "web" or "newspaper", and its length must be at least 3 and at most 7.
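The check described above can be sketched in Python as follows. The key names `contains` and `len`, the keyword values, and the function name `check_field` are an assumed reading of the rule, not a confirmed format.

```python
# Assumed rule shape: {field: {"contains": [keywords], "len": [min, max]}}
CHECK_RULES = {"news_site": {"contains": ["web", "newspaper"], "len": [3, 7]}}


def check_field(name, value, rules=CHECK_RULES):
    """Return True if value satisfies the check rule for name, or if no rule exists."""
    rule = rules.get(name)
    if rule is None:
        return True
    if value is None:
        return False
    keywords = rule.get("contains")
    # The value must contain at least one of the listed keywords.
    if keywords and not any(keyword in value for keyword in keywords):
        return False
    bounds = rule.get("len")
    # The length must fall inside the inclusive [min, max] range.
    if bounds and not (bounds[0] <= len(value) <= bounds[1]):
        return False
    return True
```

Fields without a configured rule pass by default, so new checks can be added per field without touching the extraction code.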
Referring to fig. 2, fig. 2 is a flowchart illustrating a process of obtaining a field extraction rule according to an embodiment of the present invention.
For convenience of explanation, the text of an example target web page in this embodiment is as follows:
<!DOCTYPE html>
<html lang="en">
<head>
<meta itemprop="dataUpdate" content="2020-04-24 19:00"/>
<meta itemprop="dataPublished" content="2020-04-24 19:00"/>
<meta name="news_site" content="this is a news website"/>
<meta name="keywords" content="keyword 1, keyword 2, keyword 3, keyword 4, keyword 5"/>
<title>this is a news title</title>
</head>
<body>
<script type="text/javascript">
// write content into the page
document.write("a sentence is written here");
</script>
<div><h1>this is a news headline</h1></div>
<div class="article-info">
<span class="time" id="news-time"> 2020-04-24 17:00 </span>
<span data-role="news_site">source: <a href="" target="_blank">first website</a></span>
</div>
<div class="content">
<p>this is the first segment of an example news text, three segments total</p>
<p>this is the second segment of an example news text, three segments total</p>
<p>this is the third segment of an example news text, three segments total</p>
</div>
</body>
</html>
Obtaining the field extraction rule specifically includes steps S21 to S24:
step S21: cleaning the original text of the target webpage and removing interference information to obtain the cleaned target webpage.
To improve the efficiency and accuracy of field information extraction, the original text of the target webpage is cleaned and interference information is removed before formal field information extraction is carried out. The interference information mainly removed includes comments in the web page, script code (for example, JavaScript code that starts with a <script> tag and ends with a </script> tag), and predefined text such as advertising words and special character strings. In one embodiment, these items are all defined in advance through a configuration file.
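The cleaning step can be sketched in Python as below. This is a minimal illustration under stated assumptions: the list `PREDEFINED_TEXTS` stands in for values that would come from a configuration file, the name `clean_page` is hypothetical, and regular expressions are used for brevity even though a real HTML parser would be more robust.

```python
import re

# Assumed configuration of predefined texts (e.g. advertising phrases) to strip.
PREDEFINED_TEXTS = ["advertisement", "click to subscribe"]


def clean_page(html, predefined=PREDEFINED_TEXTS):
    """Remove HTML comments, script blocks, and predefined texts from raw page source."""
    # Drop <!-- ... --> comments, including multi-line ones.
    html = re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)
    # Drop everything from an opening <script ...> tag to its closing </script> tag.
    html = re.sub(r"<script\b.*?</script\s*>", "", html,
                  flags=re.DOTALL | re.IGNORECASE)
    # Drop each configured predefined text.
    for text in predefined:
        html = html.replace(text, "")
    return html
```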
In this embodiment, after the example target web page above is cleaned and the interference information is removed, the following content is obtained:
<head>
<meta itemprop="dataUpdate" content="2020-04-24 19:00"/>
<meta itemprop="dataPublished" content="2020-04-24 19:00"/>
<meta name="news_site" content="this is a news website"/>
<meta name="keywords" content="keyword 1, keyword 2, keyword 3, keyword 4, keyword 5"/>
<title>this is a news title</title>
</head>
<body>
<div><h1>this is a news headline</h1></div>
<div class="article-info">
<span class="time" id="news-time"> 2020-04-24 17:00 </span>
<span data-role="news_site">source: <a href="" target="_blank">first website</a></span>
</div>
<div class="content">
<p>this is the first segment of an example news text, three segments total</p>
<p>this is the second segment of an example news text, three segments total</p>
<p>this is the third segment of an example news text, three segments total</p>
</div>
</body>
Step S22: randomly selecting a part of webpages from the target webpages as webpage samples, acquiring field information according to the webpage samples, and verifying the correctness of the acquired field information.
In order to improve the efficiency and accuracy of field information extraction, before formal field information extraction is carried out, preliminary extraction is required to be carried out, and whether the field extraction rule is suitable for a target website webpage is verified.
In one embodiment, a part of the target webpages is randomly selected as webpage samples, the sample webpages are preliminarily extracted based on the existing field extraction rules to obtain field information, and the correctness of the field information is verified. The verification is completed automatically according to defined rules, which can be defined using regular expressions; for example, the check rule for the news_site field is:
{"news_site": {"contains": ["web", "newspaper"], "len": [3, 7]}}
Field information that does not comply with the field check rule is recorded as erroneous.
Step S23: if the field information is wrong, acquiring the corresponding field extraction rule and submitting a request to modify it.
Erroneous field information indicates that the corresponding field extraction rule contains an error and needs to be modified. The corresponding field extraction rule is therefore located based on the erroneous field information, and a request to modify it is submitted; the modification can be completed by an operator.
Step S24: receiving the modified field extraction rule and replacing the erroneous rule with it.
The verification of the field extraction rule is completed through the steps, so that the efficiency of acquiring subsequent formal field information can be greatly improved.
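Steps S22 to S24 can be sketched in Python roughly as follows. All names here (`verify_rules`, `apply_modified_rules`, the `check` callback) are hypothetical, and the sketch assumes each extraction rule is a regular expression with one capturing group.

```python
import random
import re


def verify_rules(pages, extraction_rules, check, sample_size=2, seed=None):
    """Extract fields from a random sample of pages and return the names of
    fields whose extracted value is missing or fails `check` (step S22)."""
    rng = random.Random(seed)
    sample = rng.sample(pages, min(sample_size, len(pages)))
    failed = set()
    for page in sample:
        for field, pattern in extraction_rules.items():
            match = re.search(pattern, page)
            value = match.group(1).strip() if match else None
            if value is None or not check(field, value):
                failed.add(field)
    # The returned names identify the rules to submit for modification (step S23).
    return sorted(failed)


def apply_modified_rules(extraction_rules, modified):
    """Return a copy of the rule base with corrected rules swapped in (step S24)."""
    updated = dict(extraction_rules)
    updated.update(modified)
    return updated
```

Returning a new rule dictionary instead of mutating the old one keeps the previous rule base available if a corrected rule turns out to be wrong as well.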
Referring to fig. 3, fig. 3 is a schematic diagram illustrating a process of obtaining corresponding field information based on a field extraction rule base according to an embodiment of the present invention, which specifically includes steps S31 to S35:
step S31: reading in a target webpage, and analyzing a head part and a text part of the target webpage;
In this embodiment, the processed target webpage is read in and parsed according to the HTML specification to obtain the header content and the text content of the webpage. The header content is as follows:
<head>
<meta itemprop="dataUpdate" content="2020-04-24 19:00"/>
<meta itemprop="dataPublished" content="2020-04-24 19:00"/>
<meta name="news_site" content="this is a news website"/>
<meta name="keywords" content="keyword 1, keyword 2, keyword 3, keyword 4, keyword 5"/>
<title>this is a news title</title>
</head>
The text part content is as follows:
<body>
<div><h1>this is a news headline</h1></div>
<div class="article-info">
<span class="time" id="news-time"> 2020-04-24 17:00 </span>
<span data-role="news_site">source: <a href="" target="_blank">first website</a></span>
</div>
<div class="content">
<p>this is the first segment of an example news text, three segments total</p>
<p>this is the second segment of an example news text, three segments total</p>
<p>this is the third segment of an example news text, three segments total</p>
</div>
</body>
Step S32: acquiring meta information in a meta tag from a header part of a target webpage;
The meta tag may contain information such as the title, source, and time, and this information can be temporarily stored as a supplement to the field information obtained from the body.
In this embodiment, the meta information obtained from the meta tag is:
{
"dataUpdate": "2020-04-24 19:00",
"dataPublished": "2020-04-24 19:00",
"news_site": "this is a news website",
"keywords": "keyword 1, keyword 2, keyword 3, keyword 4, keyword 5"
}
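One way to collect meta information of this kind is with Python's standard `html.parser` module, as in the sketch below. The class name `MetaCollector` and the function `extract_meta` are assumptions for illustration; keys are taken from either the `name` or the `itemprop` attribute.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta> content values keyed by their name or itemprop attribute."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing <meta .../> tags also arrive here via handle_startendtag.
        if tag != "meta":
            return
        attrs = dict(attrs)
        key = attrs.get("name") or attrs.get("itemprop")
        if key and attrs.get("content") is not None:
            self.meta[key] = attrs["content"]


def extract_meta(head_html):
    """Return a dict of meta information found in the given header HTML."""
    parser = MetaCollector()
    parser.feed(head_html)
    parser.close()
    return parser.meta
```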
Step S33: acquiring field information related to the profile from a header part of a target webpage, wherein the profile comprises a title, a source and time;
in this embodiment, the field information related to the profile obtained from the header portion of the target web page includes:
{
"title": "this is a news headline",
"news-time": "2020-04-24 17:00",
"news_site": "first website"
}
Step S34: acquiring field information from the text part of the target webpage;
in this embodiment, the field information obtained from the text part of the target web page includes:
"content": [
"this is the first segment of an example news text, three segments total",
"this is the second segment of an example news text, three segments total",
"this is the third segment of an example news text, three segments total"
]
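A minimal Python sketch of collecting body paragraphs into such a list is shown below, again using the standard `html.parser` module. The names `ParagraphCollector` and `extract_content` are hypothetical, and the sketch simply gathers every `<p>` element rather than applying the embodiment's field extraction rules.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text of every <p> element as one list entry."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None while the parser is outside a <p> element

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            text = "".join(self._buffer).strip()
            if text:  # skip empty paragraphs
                self.paragraphs.append(text)
            self._buffer = None


def extract_content(body_html):
    """Return the list of non-empty paragraph texts found in the body HTML."""
    parser = ParagraphCollector()
    parser.feed(body_html)
    parser.close()
    return parser.paragraphs
```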
Step S35: the field information obtained in step S33 and step S34 is optimally supplemented based on the meta information obtained in step S32 described above.
This step uses the valid meta information obtained in step S32 to further improve the accuracy and validity of the obtained field information, providing verification of and a supplement to the target field information. In this embodiment, the process is completed automatically according to a preset optimization strategy, which includes strategies based on format, length, keywords, and the like. As an example, when the target field news_site obtained from the body is NULL and the site field obtained from the meta tag in step S32 is not NULL, the strategy uses the site field from the meta tag as the final value of the target field news_site. In another embodiment, if neither the site field obtained from the meta tag nor the target field news_site obtained from the body is NULL, the lengths of the two are compared; if the site field from the meta tag is longer than the news_site value from the body, the site field from the meta tag is again used as the final value of news_site, following the length priority principle.
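The two strategies described here, the NULL fallback and the length priority principle, can be sketched in Python as follows. The function names and the `field_map` argument (which pairs a target field with the meta key that may supplement it) are assumptions made for the example.

```python
def supplement_field(body_value, meta_value):
    """Choose the final value for one field from its body and meta candidates."""
    if not body_value:
        # Body value is NULL: fall back to the meta value, if there is one.
        return meta_value or None
    if not meta_value:
        return body_value
    # Both present: length priority principle, the longer candidate wins.
    return meta_value if len(meta_value) > len(body_value) else body_value


def supplement_fields(body_fields, meta_info, field_map):
    """Apply supplement_field to every (target field, meta key) pair in field_map."""
    result = dict(body_fields)
    for target, meta_key in field_map.items():
        result[target] = supplement_field(body_fields.get(target),
                                          meta_info.get(meta_key))
    return result
```

Whether length priority is applied at all is a policy choice; an implementation that only wants the NULL fallback could return `body_value` whenever it is present.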
After the processing of steps S31 to S35, the finally obtained field information is:
{
"content": [
"this is the first segment of an example news text, three segments total",
"this is the second segment of an example news text, three segments total",
"this is the third segment of an example news text, three segments total"
],
"title": "this is a news headline",
"news-time": "2020-04-24 17:00",
"news_site": "first website"
}
Referring to fig. 4, fig. 4 is a schematic structural diagram of a website field general extraction system according to an embodiment of the present invention. The system mainly includes a preprocessing module 01, a field information extraction module 02, an output processing module 03, and a field extraction rule base 04, wherein:
the preprocessing module 01 is used for preprocessing a target website page before field information is extracted, which specifically includes cleaning the original text of the input target webpage, removing interference information, and obtaining field extraction rules from the target webpage.
The preprocessing module 01 specifically includes two main functional modules, a first cleaning module 011 and a rule determining module 012, wherein:
the first cleaning module 011 is used for cleaning the original text or field information of the website page and removing interference information, where the interference information can be comments, script code, or predefined text such as advertising words and special character strings.
The rule determining module 012 is used for learning the elements of the website page, acquiring the field extraction rules of the web page, and storing them in the field extraction rule base. The field extraction rules are regular expressions; see the flowchart in fig. 2 for how this module acquires them.
The field information extraction module 02 is used for reading the content of the website page according to the field extraction rule base, segmenting the page content to obtain the field information related to the field extraction rules, and performing preliminary correctness verification on the obtained field information. See the flowchart in fig. 3 for the field information extraction process.
In one embodiment, the field information extraction module is formed by trimming, optimizing, and extending the open source library newspaper. After trimming, the module has the following advantages:
The webpage request functions are refined, which reduces network I/O and shortens processing time.
Unnecessary element extraction is trimmed away, which reduces the time spent extracting and processing elements and reduces the interference caused by an oversized set of result fields.
The existing field extraction rules are extended and supplemented, which widens the range of fields that can be extracted.
The output processing module 03: used to optimize the extracted field information and verify its correctness. It includes a second cleaning module 031 and a detection and verification module 032, wherein:
the second cleaning module 031: used to clean the field information extracted by the field information extraction module and remove interference information, where the interference information may include comment information, style tags, and predefined text such as advertising slogans.
The detection and verification module 032: used to check the obtained field information and determine whether it conforms to the field extraction rules.
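As a non-limiting illustration, the two sub-modules of the output processing module 03 might be sketched in Python as follows. The rule in PUBDATE_RULES, the phrases in AD_PHRASES, and the function names are hypothetical. Treating "conforms to the field extraction rules" as a full match against at least one rule is an assumption of this sketch.

```python
import re

# Hypothetical rule and advertising phrases, for illustration only.
PUBDATE_RULES = [r"\d{4}-\d{1,2}-\d{1,2} \d{1,2}:\d{2}"]
AD_PHRASES = ("Sponsored",)

_COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
_TAG_RE = re.compile(r"<[^>]+>")  # style tags and any other leftover markup


def clean_field(value: str) -> str:
    """Second cleaning: drop comments, markup tags, and predefined phrases."""
    value = _COMMENT_RE.sub("", value)
    value = _TAG_RE.sub("", value)
    for phrase in AD_PHRASES:
        value = value.replace(phrase, "")
    return " ".join(value.split())  # collapse runs of whitespace


def conforms(value: str, rules) -> bool:
    """Detection and verification: does the value fully match at least one rule?"""
    return any(re.fullmatch(rule, value) for rule in rules)


raw = '<span style="color:red">2020-10-14 08:30</span><!-- ad --> Sponsored'
field = clean_field(raw)
print(field, conforms(field, PUBDATE_RULES))
```

Cleaning before verification matters here: the raw value would fail the full match because of the surrounding markup, while the cleaned value passes.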
The field extraction rule base 04: contains a plurality of field extraction rules and is used to store the field extraction rules obtained by the rule determining module. The field extraction rules may be stored in a database or in a file. An example of file storage is given below; in this embodiment, the field extraction rule base is stored as a JSON file in the following format:
re_pubdate_list = [
"(\d{4}年\d{1,2}月\d{1,2}日\s<image 003>[1-24]\d时[0-60]\d分)([1-24]\d时)",
"(\d{2}年\d{1,2}月\d{1,2}日\s<image 003>[0-1]<image 002>[0-9]:[0-5]<image 002>[0-9]:[0-5]<image 002>[0-9])",
"(\d{2}年\d{1,2}月\d{1,2}日\s<image 002>[2][0-3]:[0-5]<image 002>[0-9]:[0-5]<image 002>[0-9])",
"(\d{2}年\d{1,2}月\d{1,2}日\s<image 002>[0-1]<image 002>[0-9]:[0-5]<image 002>[0-9])",
"(\d{2}年\d{1,2}月\d{1,2}日\s<image 002>[2][0-3]:[0-5]<image 002>[0-9])",
"(\d{2}年\d{1,2}月\d{1,2}日\s<image 003>[1-24]\d时[0-60]\d分)([1-24]\d时)"
]
In this listing, <image 002> and <image 003> each stand for a character that the published text renders as an inline figure.
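As a non-limiting illustration, a rule base of this kind could be written to and read from a JSON file and applied as sketched below in Python. The two patterns are illustrative stand-ins and are not the patterns of this embodiment, some characters of which appear only as inline figures in the listing above. They assume the rules target dates written with the Chinese characters 年, 月, 日, 时, 分 (year, month, day, hour, minute). The file name field_rules.json and the function names are hypothetical. Note that inside a regular expression a class such as [1-24] matches one of the characters 1, 2, or 4 and not a number from 1 to 24, so the sketch expresses the hour range with an alternation instead.

```python
import json
import re
import tempfile
from pathlib import Path

# Illustrative rules only; they are NOT the patterns listed in the embodiment.
rules = {
    "re_pubdate_list": [
        # e.g. 2020年10月14日 08:30 or 2020年10月14日 08:30:15
        r"\d{4}年\d{1,2}月\d{1,2}日\s*(?:[01]?\d|2[0-3]):[0-5]\d(?::[0-5]\d)?",
        # e.g. 2020年10月14日 8时 or 2020年10月14日 8时30分
        r"\d{4}年\d{1,2}月\d{1,2}日\s*(?:[01]?\d|2[0-3])时(?:[0-5]?\d分)?",
    ]
}


def save_rules(path: Path, rule_base: dict) -> None:
    """Write the rule base as UTF-8 JSON, keeping non-ASCII characters readable."""
    path.write_text(json.dumps(rule_base, ensure_ascii=False, indent=2), encoding="utf-8")


def load_rules(path: Path) -> dict:
    """Read the rule base back from a JSON file."""
    return json.loads(path.read_text(encoding="utf-8"))


def find_pubdate(text: str, patterns):
    """Return the first substring matched by any pattern, trying patterns in order."""
    for pattern in patterns:
        match = re.search(pattern, text)
        if match:
            return match.group(0)
    return None


with tempfile.TemporaryDirectory() as tmp:
    rule_file = Path(tmp) / "field_rules.json"  # hypothetical file name
    save_rules(rule_file, rules)
    loaded = load_rules(rule_file)

print(find_pubdate("Published: 2020年10月14日 08:30:15", loaded["re_pubdate_list"]))
```

Storing the patterns under a named key lets one file hold rule lists for several fields, and the JSON round trip preserves the backslashes in each pattern.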
Those of ordinary skill in the art will appreciate that the various illustrative modules described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided in the present application, it should be understood that the disclosed system may be implemented in other ways. For example, the division into modules is only a division by logical function, and other divisions are possible in an actual implementation: multiple modules or components may be combined or integrated into another system, or some features may be omitted or not implemented.
The above description covers only specific embodiments of the present invention, but the scope of protection of the present invention is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be determined by the scope of protection of the appended claims.

Claims (12)

1. A webpage information extraction method, characterized in that a target webpage is preprocessed to obtain a processed target webpage and corresponding field extraction rules, a field extraction rule base is generated, and corresponding field information is extracted from the target webpage based on the field extraction rule base.
2. The method for extracting web page information as claimed in claim 1, comprising the steps of:
step S1: receiving a target webpage, preprocessing the original text of the target webpage, obtaining the processed target webpage and field extraction rules, and generating a field extraction rule base;
step S2: reading in the processed target webpage based on the field extraction rule base to obtain corresponding field information;
step S3: verifying the correctness of the obtained field information.
3. The method for extracting web page information as claimed in claim 2, wherein the preprocessing in step S1 includes the following steps:
step S1-1: cleaning the original text of the target webpage and removing interference information to obtain a cleaned target webpage;
step S1-2: establishing webpage samples based on the cleaned target webpage, and learning from the webpage samples to obtain field extraction rules.
4. The method for extracting web page information as claimed in claim 2, wherein the format of the field extraction rule is a regular expression.
5. The method for extracting web page information as claimed in claim 3, wherein said interference information includes comments, script code, and predefined text.
6. The method for extracting web page information as claimed in claim 3, wherein the step of obtaining the field extraction rule in the step S1-2 comprises the following steps:
randomly selecting a part of webpages from the target webpages as webpage samples, acquiring field information according to the webpage samples, and verifying the correctness of the acquired field information;
if the field information is wrong, acquiring a corresponding field extraction rule, and submitting a request for modifying the field extraction rule;
receiving the modified field extraction rule and replacing the original field extraction rule with it.
7. The method for extracting web page information as claimed in claim 3, wherein said step S2 specifically includes the following steps:
step S2-1: reading in a target webpage, and analyzing a header part and a text part of the target webpage;
step S2-2: acquiring meta information in a meta tag from a header part of the target webpage;
step S2-3: acquiring field information related to a profile from a header part of the target webpage, wherein the profile comprises a title, a source and time information;
step S2-4: acquiring field information from the text part of the target webpage;
step S2-5: optimizing and supplementing the field information obtained in step S2-3 and step S2-4 based on the meta information obtained in step S2-2.
8. The method for extracting web page information as claimed in claim 3, wherein said step S3 includes cleaning the obtained field information to remove interference information, said interference information including comment information, style tags, and predefined text.
9. The method for extracting web page information as claimed in claim 3, wherein said step S3 further includes checking the obtained field information to determine whether it satisfies the field extraction rule.
10. A web page information extraction system, characterized by comprising:
a: a preprocessing module, used to preprocess the target website page before field information is extracted, and specifically comprising:
a01: a rule determining module, used to learn the elements of a website page, obtain the field extraction rules of the website page, and store the field extraction rules in a field extraction rule base;
a02: a first cleaning module, used to clean the original text or field information of the website page and remove interference information;
b: a field information extraction module, used to read the content of a website page according to the field extraction rule base, segment the page content to obtain field information related to the field extraction rules, and perform a preliminary correctness check on the obtained field information;
c: a field extraction rule base, comprising a plurality of field extraction rules and used to store the field extraction rules obtained by the rule determining module.
11. The system for extracting web page information according to claim 10, further comprising an output processing module, wherein the output processing module is used to optimize the extracted field information and verify its correctness, and specifically comprises:
a second cleaning module, used to clean the field information extracted by the field information extraction module and remove interference information;
a detection and verification module, used to check the obtained field information and determine whether it conforms to the field extraction rule.
12. An electronic device, characterized in that the device comprises a processor and a memory, wherein
the memory is used to store an executable program;
the processor is used to execute the executable program to implement the webpage information extraction method of any one of claims 1 to 9.
CN202011095088.1A 2020-10-14 2020-10-14 Webpage information extraction method and system and electronic equipment Pending CN111966881A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011095088.1A CN111966881A (en) 2020-10-14 2020-10-14 Webpage information extraction method and system and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011095088.1A CN111966881A (en) 2020-10-14 2020-10-14 Webpage information extraction method and system and electronic equipment

Publications (1)

Publication Number Publication Date
CN111966881A true CN111966881A (en) 2020-11-20

Family

ID=73387085

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011095088.1A Pending CN111966881A (en) 2020-10-14 2020-10-14 Webpage information extraction method and system and electronic equipment

Country Status (1)

Country Link
CN (1) CN111966881A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113254751A (en) * 2021-06-24 2021-08-13 北森云计算有限公司 Method, equipment and storage medium for accurately extracting complex webpage structured information

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103714116A (en) * 2013-10-31 2014-04-09 北京奇虎科技有限公司 Webpage information extracting method and webpage information extracting equipment
CN104050281A (en) * 2014-06-26 2014-09-17 北京思特奇信息技术股份有限公司 Webpage information extraction method and device based on http protocol
CN104537128A (en) * 2015-01-30 2015-04-22 广联达软件股份有限公司 Webpage information extracting method and device
CN108334508A (en) * 2017-01-19 2018-07-27 阿里巴巴集团控股有限公司 The extracting method and device of webpage information
CN108520007A (en) * 2018-03-15 2018-09-11 江河瑞通(北京)技术有限公司 Web page information extracting method, storage medium and computer equipment
US20190362024A1 (en) * 2018-05-24 2019-11-28 Open Text Sa Ulc Systems and methods for intelligent content filtering and persistence
CN111475700A (en) * 2020-03-19 2020-07-31 平安国际智慧城市科技股份有限公司 Data extraction method and related equipment

Similar Documents

Publication Publication Date Title
US10380197B2 (en) Network searching method and network searching system
US8321396B2 (en) Automatically extracting by-line information
CN107704539B (en) Method and device for large-scale text information batch structuring
CN109857956B (en) News webpage key information automatic extraction method based on label and block characteristics
CN105022803B (en) A kind of method and system for extracting Web page text content
CN111125598A (en) Intelligent data query method, device, equipment and storage medium
CN109543126B (en) Webpage text information extraction method based on block character ratio
US20110302486A1 (en) Method and apparatus for obtaining the effective contents of web page
US9514113B1 (en) Methods for automatic footnote generation
CN105335511A (en) Webpage access method and device
CN102073654B (en) Methods and equipment for generating and maintaining web content extraction template
WO2021051869A1 (en) Text data layout arrangement method, device, computer apparatus, and storage medium
CN103530430A (en) Method and system for cross-label processing of html rich text data with format
CN111475700A (en) Data extraction method and related equipment
CN109165373B (en) Data processing method and device
CN115270723A (en) PDF document splitting method, device, equipment and storage medium
CN102467501A (en) Method and system for extracting news record metadata from news list page
CN107145591B (en) Title-based webpage effective metadata content extraction method
CN111966881A (en) Webpage information extraction method and system and electronic equipment
WO2022134577A1 (en) Translation error identification method and apparatus, and computer device and readable storage medium
CN107766384A (en) A kind of method and apparatus for determining page issuing time
CN113221031B (en) Method for automatically identifying website catalog page
CN106897287A (en) Homepage Publishing decimation in time method and the device for Homepage Publishing decimation in time
CN114220113A (en) Paper quality detection method, device and equipment
CN113987320A (en) Real-time information crawler method, device and equipment based on intelligent page analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201120