CN109885743A - Webpage data information extraction method - Google Patents
Webpage data information extraction method
- Publication number
- CN109885743A
- Authority
- CN
- China
- Prior art keywords
- webpage
- data information
- webpage data
- web page
- user
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses a webpage data information extraction method comprising the following steps: step S10, receiving a web page address, input by a user, from which data information is to be obtained, and obtaining, according to the web page address, the rendered HTML content of the webpage corresponding to that address; step S20, performing web data extraction on the obtained HTML content using general webpage extraction rules, and generating the webpage data information required by the user; step S30, formatting the generated webpage data information according to the output format of the user response, and returning the formatted webpage data information to the user. By applying a general extraction approach to different webpages, the invention greatly reduces the cost of webpage data information extraction, improves extraction efficiency, shortens extraction time, and avoids the later secondary development cost caused by changes in webpage structure.
Description
Technical field
The invention relates to the field of computer technology, and in particular to a webpage data information extraction method.
Background art
When data information must be extracted from a large number of different webpages, a corresponding data information extraction rule has to be formulated for each webpage before that webpage's data information can be extracted, as shown in Fig. 1. When the data information structure of a webpage changes, the extraction rule for that webpage has to be modified or replaced so that it again meets the applicable requirements and fulfils the purpose of data information extraction. Existing webpage data information extraction methods therefore have the following problems:
1. A separate set of data extraction rules must be formulated for each webpage, so when the number of webpages to be extracted is large, a great deal of manpower and material resources is consumed in formulating the rules.
2. When the data information structure of a webpage changes, the extraction rule for that webpage must be modified or replaced in time; otherwise data information can no longer be extracted from that webpage.
Summary of the invention
The technical problem solved by the invention is to provide, in view of the deficiencies of the prior art, a webpage data information extraction method that improves extraction efficiency, reduces extraction cost, and avoids the later secondary development cost caused by changes in webpage structure.
The technical problem to be solved by the invention can be solved by the following technical solution:
A webpage data information extraction method, comprising the following steps:
Step S10: receiving a web page address, input by a user, from which data information is to be obtained, and obtaining, according to the web page address, the rendered HTML content of the webpage corresponding to that address;
Step S20: performing web data extraction on the obtained HTML content using general webpage extraction rules, and generating the webpage data information required by the user;
Step S30: formatting the generated webpage data information according to the output format of the user response, and returning the formatted webpage data information to the user.
In a preferred embodiment of the invention, in step S10, the rendered HTML content of the webpage is obtained by dynamically rendering and loading the page with the open-source framework selenium, and the HTML content is given an initial parsing pretreatment using a parsing library that comes with the Python environment, turning it into a structure from which information content can be extracted directly using the general webpage extraction rules.
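As a non-limiting illustration, step S10 could be sketched as follows. The names `Node`, `TreeBuilder`, `fetch_rendered_html` and `preprocess` are hypothetical and do not appear in the original text; the standard-library `html.parser` module is assumed here as the "parsing library that comes with the Python environment", and `driver` is assumed to be a selenium WebDriver (only its `get` method and `page_source` attribute are used).

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not become "current".
VOID_TAGS = {"img", "br", "hr", "meta", "link", "input"}


class Node:
    """One element of the parsed structure (hypothetical helper type)."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a simple Node tree using the standard-library HTML parser."""

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this name, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data


def fetch_rendered_html(url, driver):
    """Load the page in a browser session and return the rendered HTML."""
    driver.get(url)            # navigate and let the browser render the page
    return driver.page_source  # HTML after rendering


def preprocess(html):
    """Initial parsing pretreatment: turn HTML text into a Node tree."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

With selenium installed, a driver could be created with `webdriver.Chrome()` from `selenium.webdriver` and passed to `fetch_rendered_html`; the choice of browser is an assumption, since the original text only names the framework.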
In a preferred embodiment of the invention, step S20 comprises the following sub-steps:
Step S21: parsing the structure using a parsing library of the Python environment;
Step S22: formulating path search rules and matching and looking up all tags in the webpage, so as to obtain the web page body information;
Step S23: judging whether the obtained web page body information is complete; if it is complete, proceeding to step S24; if it is incomplete, progressively obtaining tag content and/or attribute values in the webpage for deep parsing and extraction, and then proceeding to step S24;
Step S24: formulating a large number of regular-expression matching rules to match more refined content information in the webpage.
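By way of illustration only, the regular-expression refinement of step S24 might look like the sketch below. The rule names and patterns in `REFINE_RULES` and the function `refine` are hypothetical examples chosen for this sketch; the original text does not specify which patterns are used, only that a large number of them are formulated.

```python
import re

# Hypothetical refinement rules; a real rule set would be much larger.
REFINE_RULES = {
    "date": re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b"),
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "phone": re.compile(r"\b\d{3}-\d{4}-\d{4}\b"),
}


def refine(text, rules=REFINE_RULES):
    """Return every match of every rule found in the extracted body text."""
    refined = {}
    for name, pattern in rules.items():
        matches = [m.group(0) for m in pattern.finditer(text)]
        if matches:
            refined[name] = matches
    return refined
```

Because the rules are applied to already extracted text rather than to a particular page layout, the same rule set can in principle be reused across different webpages.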
With the above technical solution, the beneficial effects of the invention are as follows: by applying a general extraction approach to different webpages, the invention greatly reduces the cost of webpage data information extraction, improves extraction efficiency, shortens extraction time, and avoids the later secondary development cost caused by changes in webpage structure.
Brief description of the drawings
In order to explain the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are obviously only some embodiments of the invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of a conventional webpage data information extraction method.
Fig. 2 is a flow chart of the webpage data information extraction method of the invention.
Detailed description of the embodiments
In order to make the technical means, creative features, objectives and effects achieved by the invention easy to understand, the invention is further described below with reference to the specific drawings.
Referring to Fig. 2, the figure shows a webpage data information extraction method comprising the following steps:
Step S10: receive the web page address, input by the user, from which data information is to be obtained, and obtain the rendered HTML content of the webpage corresponding to that address. Specifically, the rendered HTML content is obtained by dynamically rendering and loading the page with the open-source framework selenium, and the HTML content is given an initial parsing pretreatment using a parsing library that comes with the Python environment, turning it into a structure from which information content can be extracted directly using the general webpage extraction rules.
Step S20: perform web data extraction on the obtained HTML content using the general webpage extraction rules, and generate the webpage data information required by the user.
Step S30: format the generated webpage data information according to the output format of the user response, and return the formatted webpage data information to the user.
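The formatting of step S30 could, for example, be sketched as below. The function name `format_output` and the two supported formats, JSON and CSV, are assumptions made for this illustration; the original text does not name any particular output format.

```python
import csv
import io
import json


def format_output(records, output_format="json"):
    """Serialize extracted records in the output format chosen for the response."""
    if output_format == "json":
        return json.dumps(records, ensure_ascii=False, indent=2)
    if output_format == "csv":
        # Use the union of all keys so that records with missing fields still fit.
        fieldnames = sorted({key for record in records for key in record})
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
        writer.writeheader()
        writer.writerows(records)
        return buffer.getvalue()
    raise ValueError(f"unsupported output format: {output_format}")
```

Keeping the formatting separate from the extraction means that a new output format can be added without touching the extraction rules of step S20.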
In step S20, performing web data extraction on the obtained HTML content comprises the following sub-steps:
Step S21: parse the structure using a parsing library of the Python environment, for example by means of xpath or beautifulsoup;
Step S22: formulate xpath path search rules and match and look up all tags such as title, p and img in the webpage, so as to obtain the web page body information;
Step S23: judge whether the obtained web page body information is complete; if it is complete, proceed to step S24; if it is incomplete, progressively obtain the content and/or attribute values of tags such as a and li under the div tags in the webpage for deep parsing and extraction, and then proceed to step S24;
Step S24: formulate a large number of regular-expression matching rules to match more refined content information in the webpage, so as to guarantee the accuracy and diversity of the acquired information while webpage information is acquired in a general way.
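Sub-steps S21 to S23 above could be sketched roughly as follows. This sketch substitutes the standard-library `html.parser` for the xpath or beautifulsoup route named in the text, so that it runs without third-party packages. The names `TagCollector`, `collect`, `is_complete` and `extract_body`, and the 50-character completeness threshold, are hypothetical; the original text does not state how completeness is judged.

```python
from html.parser import HTMLParser

VOID_TAGS = {"img", "br", "hr", "meta", "link", "input"}


class TagCollector(HTMLParser):
    """Collects text and attribute values for a chosen set of tags."""

    def __init__(self, wanted, inside=None):
        super().__init__()
        self.wanted = set(wanted)  # tags whose content is collected
        self.inside = inside       # optional ancestor tag that must be open
        self.depth = 0             # number of `inside` tags currently open
        self.current = None        # item that is receiving text, if any
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == self.inside:
            self.depth += 1
        if tag in self.wanted and (self.inside is None or self.depth > 0):
            item = {"tag": tag, "attrs": dict(attrs), "text": ""}
            self.items.append(item)
            if tag not in VOID_TAGS:
                self.current = item

    def handle_endtag(self, tag):
        if tag == self.inside and self.depth > 0:
            self.depth -= 1
        if self.current is not None and tag == self.current["tag"]:
            self.current = None

    def handle_data(self, data):
        if self.current is not None:
            self.current["text"] += data


def collect(html, wanted, inside=None):
    parser = TagCollector(wanted, inside)
    parser.feed(html)
    parser.close()
    # Drop items that carry neither text nor attribute values.
    return [i for i in parser.items if i["text"].strip() or i["attrs"]]


def is_complete(items, min_chars=50):
    """Hypothetical completeness test: enough paragraph text was found."""
    return sum(len(i["text"].strip()) for i in items if i["tag"] == "p") >= min_chars


def extract_body(html):
    items = collect(html, ("title", "p", "img"))           # step S22
    if not is_complete(items):                             # step S23
        items += collect(html, ("a", "li"), inside="div")  # deeper parsing
    return items
```

The fallback only runs when the first pass looks incomplete, which mirrors the branch in step S23; the result of either branch would then be passed on to the refinement of step S24.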
The basic principles, main features and advantages of the invention have been shown and described above. Those skilled in the art should understand that the invention is not limited to the above embodiments; the embodiments and the description only illustrate the principle of the invention. Various changes and improvements may be made without departing from the spirit and scope of the invention, and all such changes and improvements fall within the scope of the claimed invention. The scope of protection claimed is defined by the appended claims and their equivalents.
Claims (3)
1. A webpage data information extraction method, characterized by comprising the following steps:
Step S10: receiving a web page address, input by a user, from which data information is to be obtained, and obtaining, according to the web page address, the rendered HTML content of the webpage corresponding to the web page address;
Step S20: performing web data extraction on the obtained HTML content using general webpage extraction rules, and generating the webpage data information required by the user;
Step S30: formatting the generated webpage data information according to the output format of the user response, and returning the formatted webpage data information to the user.
2. The webpage data information extraction method according to claim 1, characterized in that, in step S10, the rendered HTML content of the webpage is obtained by dynamically rendering and loading the page with the open-source framework selenium, and the HTML content is given an initial parsing pretreatment using a parsing library that comes with the Python environment, turning it into a structure from which information content can be extracted directly using the general webpage extraction rules.
3. The webpage data information extraction method according to claim 2, characterized in that step S20 comprises the following sub-steps:
Step S21: parsing the structure using a parsing library of the Python environment;
Step S22: formulating path search rules and matching and looking up all tags in the webpage, so as to obtain the web page body information;
Step S23: judging whether the obtained web page body information is complete; if it is complete, proceeding to step S24; if it is incomplete, progressively obtaining tag content and/or attribute values in the webpage for deep parsing and extraction, and then proceeding to step S24;
Step S24: formulating a large number of regular-expression matching rules to match more refined content information in the webpage.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910009284.3A CN109885743B (en) | 2019-01-04 | 2019-01-04 | Webpage data information extraction method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910009284.3A CN109885743B (en) | 2019-01-04 | 2019-01-04 | Webpage data information extraction method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109885743A (en) | 2019-06-14 |
CN109885743B CN109885743B (en) | 2024-01-02 |
Family
ID=66925652
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910009284.3A Active CN109885743B (en) | 2019-01-04 | 2019-01-04 | Webpage data information extraction method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109885743B (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102682109A (en) * | 2012-05-09 | 2012-09-19 | 北京彼速信息技术有限公司 | Patent information analysis method and device |
CN103678432A (en) * | 2013-04-07 | 2014-03-26 | 南京邮电大学 | Webpage main body extraction method based on webpage main body features and intermediate true values |
US20160283461A1 (en) * | 2014-02-26 | 2016-09-29 | Tencent Technology (Shenzhen) Company Limited | Method and terminal for extracting webpage content, and non-transitory storage medium |
CN104462532A (en) * | 2014-12-23 | 2015-03-25 | 北京奇虎科技有限公司 | Method and device for extracting webpage text |
CN106547895A (en) * | 2016-11-03 | 2017-03-29 | 北京锐安科技有限公司 | A kind of extracting method and device of info web |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112528205A (en) * | 2020-12-22 | 2021-03-19 | 中科院计算技术研究所大数据研究院 | Webpage main body information extraction method and device and storage medium |
CN112528205B (en) * | 2020-12-22 | 2021-10-29 | 中科院计算技术研究所大数据研究院 | Webpage main body information extraction method and device and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN109885743B (en) | 2024-01-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102253979B (en) | Vision-based web page extracting method | |
CN105022803B (en) | A kind of method and system for extracting Web page text content | |
CN102591612B (en) | General webpage text extraction method based on punctuation continuity and system thereof | |
CN103390051A (en) | Topic detection and tracking method based on microblog data | |
CN102609427A (en) | Public opinion vertical search analysis system and method | |
Ji et al. | Tag tree template for Web information and schema extraction | |
CN101609399A (en) | Intelligent website development system and method based on modeling | |
CN102929902A (en) | Character splitting method and device based on Chinese retrieval | |
CN103049536A (en) | Webpage main text content extracting method and webpage text content extracting system | |
CN103810251A (en) | Method and device for extracting text | |
WO2023155303A1 (en) | Webpage data extraction method and apparatus, computer device, and storage medium | |
CN102411602B (en) | Extensive makeup language (XML) parallel speculation analysis method realized on basis of field programmable gate array (FPGA) | |
CN105740355B (en) | Webpage context extraction method and device based on aggregation text density | |
Xu et al. | Novel approach of semantic annotation by fuzzy ontology based on variable precision rough set and concept lattice | |
CN102591931B (en) | Recognition and extraction method for webpage data records based on tree weight | |
CN109885743A (en) | A kind of webpage data information extracting method | |
KR20130099327A (en) | Apparatus for extracting information from open domains and method for the same | |
Della Penna et al. | A spatial relation-based framework to perform visual information extraction | |
Auger et al. | Probing Semantic Relations: Exploration and identification in specialized texts | |
Pazienza et al. | Combining Ontological Knowledge and Wrapper Induction techniques into an e-retail System1 | |
CN103116448A (en) | Extract method for visualizing information | |
CN108984676B (en) | Electronic book cross-terminal self-adaptive display system and method based on XML | |
Lim et al. | Generalized and lightweight algorithms for automated web forum content extraction | |
Han et al. | Automatic mobile content conversion using semantic image analysis | |
Yanfen et al. | Educational resources metadata automatically extracted strategy study |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||