CN109885743A - Webpage data information extraction method - Google Patents

Webpage data information extraction method

Info

Publication number
CN109885743A
Authority
CN
China
Prior art keywords
webpage
data information
webpage data
web page
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910009284.3A
Other languages
Chinese (zh)
Other versions
CN109885743B (en)
Inventor
胡成红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Seven India Mdt Infotech Ltd
Original Assignee
Shanghai Seven India Mdt Infotech Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Seven India Mdt Infotech Ltd
Priority to CN201910009284.3A
Publication of CN109885743A
Application granted
Publication of CN109885743B
Active
Anticipated expiration

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a webpage data information extraction method comprising the following steps. Step S10: receive the address, entered by the user, of the web page from which data information is to be obtained, and obtain the rendered HTML content of the web page at that address. Step S20: extract web data from the obtained HTML content using a general web page extraction rule, and generate the webpage data information the user requires. Step S30: format the generated webpage data information according to the output format of the response to the user, and return the formatted webpage data information to the user. By extracting webpage data information from different web pages in a general way, the invention greatly reduces extraction cost, improves extraction efficiency, saves extraction time, and avoids the later secondary development cost caused by changes in web page structure.

Description

Webpage data information extraction method
Technical field
The present invention relates to the field of computer technology, and in particular to a webpage data information extraction method.
Background
When data information must be extracted from a large number of different web pages, a corresponding data information extraction rule has to be formulated for each web page before its data information can be extracted, as shown in Fig. 1. When the data information structure of a web page changes, the extraction rule for that web page must be modified or replaced so that it applies again and extraction can continue. Existing webpage data information extraction methods therefore have the following problems: 1. a separate set of data extraction rules must be formulated for each web page, so when the number of web pages to be extracted is large, a great deal of manpower and material resources is consumed in formulating the rules; 2. when the data information structure of a web page changes, the extraction rule for that web page must be modified or replaced promptly, otherwise data information can no longer be extracted from that web page.
Summary of the invention
The technical problem solved by the invention is to provide, in view of the deficiencies of the prior art, a webpage data information extraction method that improves extraction efficiency, reduces extraction cost, and avoids the later secondary development cost caused by changes in web page structure.
The technical problem to be solved by the invention can be solved by the following technical solution:
A webpage data information extraction method, comprising the following steps:
Step S10: receive the address, entered by the user, of the web page from which data information is to be obtained, and obtain the rendered HTML content of the web page at that address;
Step S20: extract web data from the obtained HTML content using a general web page extraction rule, and generate the webpage data information the user requires;
Step S30: format the generated webpage data information according to the output format of the response to the user, and return the formatted webpage data information to the user.
In a preferred embodiment of the invention, in step S10, the rendered HTML content of the web page is obtained by dynamically rendering and loading the page with the open-source framework Selenium, and the HTML content is given an initial parsing pre-processing with a parsing library that ships with the Python environment, so that it becomes a structure from which information content can be extracted directly using the general web page extraction rule.
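Step S10 as described here could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names `fetch_rendered_html` and `preprocess`, the `document` wrapper element, and the choice of Chrome as the browser are hypothetical, and the patent does not say which parsing library it uses. The standard-library `html.parser` and `xml.etree.ElementTree` modules are swapped in so that the pre-processing half runs without third-party packages.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never take a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


def fetch_rendered_html(url, driver=None):
    """Return the HTML of `url` after the browser has rendered it."""
    if driver is None:
        # Imported lazily so the parsing half works without Selenium installed.
        from selenium import webdriver
        driver = webdriver.Chrome()  # assumption: any WebDriver would do
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()


class _TreeParser(HTMLParser):
    """Hypothetical helper: turn HTML parse events into an element tree."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._builder = ET.TreeBuilder()
        self._builder.start("document", {})  # synthetic root element
        self._open = []                      # stack of currently open tags

    def handle_starttag(self, tag, attrs):
        self._builder.start(tag, {k: v or "" for k, v in attrs})
        if tag in VOID_TAGS:
            self._builder.end(tag)
        else:
            self._open.append(tag)

    def handle_endtag(self, tag):
        if tag not in self._open:
            return  # stray closing tag, ignore it
        while self._open:
            top = self._open.pop()
            self._builder.end(top)  # also closes unclosed children
            if top == tag:
                break

    def handle_data(self, data):
        self._builder.data(data)

    def close(self):
        super().close()
        while self._open:
            self._builder.end(self._open.pop())
        self._builder.end("document")
        return self._builder.close()


def preprocess(html):
    """Initial parsing pre-processing: HTML text -> queryable structure."""
    parser = _TreeParser()
    parser.feed(html)
    return parser.close()
```

A call such as `preprocess(fetch_rendered_html(url))` would then yield a tree that later steps can query with `find` and `findall`.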
In a preferred embodiment of the invention, step S20 comprises the following sub-steps:
Step S21: parse the structure using a Python parsing library;
Step S22: formulate path lookup rules and match all tags in the web page to obtain the main body information of the web page;
Step S23: judge whether the obtained main body information is complete; if it is complete, go to step S24; if it is incomplete, progressively obtain the tag content and/or attribute values in the web page for deep parsing and extraction, and then go to step S24;
Step S24: formulate a large number of regular-expression matching rules and match more refined content information in the web page.
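The control flow of sub-steps S21 to S24 can be sketched as a small driver function. The names below (`run_step_s20`, `Record`, and the five callables) are hypothetical and only illustrate the branching in step S23; the patent does not prescribe how the completeness check or the individual rules are implemented.

```python
from typing import Callable, Dict, List

# Assumed record shape: field name -> list of extracted values.
Record = Dict[str, List[str]]


def run_step_s20(
    structure: object,
    parse: Callable[[object], object],
    find_body: Callable[[object], Record],
    is_complete: Callable[[Record], bool],
    deep_parse: Callable[[object], Record],
    refine: Callable[[Record], Record],
) -> Record:
    tree = parse(structure)        # S21: parse the pre-processed structure
    record = find_body(tree)       # S22: path lookup for main body tags
    if not is_complete(record):    # S23: fall back to deep parsing
        for field, values in deep_parse(tree).items():
            record.setdefault(field, []).extend(values)
    return refine(record)          # S24: regular-expression refinement
```

Passing the rules in as callables keeps the flow itself independent of any single site, which is the point of a general extraction rule.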
With the above technical solution, the beneficial effects of the invention are as follows: by extracting webpage data information from different web pages in a general way, the invention greatly reduces extraction cost, improves extraction efficiency, saves extraction time, and avoids the later secondary development cost caused by changes in web page structure.
Brief description of the drawings
In order to explain the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below obviously show only some embodiments of the invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a traditional webpage data information extraction method.
Fig. 2 is a flowchart of the webpage data information extraction method of the invention.
Specific embodiment
In order to make the technical means, creative features, objectives and effects of the invention easy to understand, the invention is further explained below with reference to the drawings.
Referring to Fig. 2, the figure shows a webpage data information extraction method comprising the following steps:
Step S10: receive the address, entered by the user, of the web page from which data information is to be obtained, and obtain the rendered HTML content of the web page at that address. Specifically, the rendered HTML content is obtained by dynamically rendering and loading the page with the open-source framework Selenium, and the HTML content is given an initial parsing pre-processing with a parsing library that ships with the Python environment, so that it becomes a structure from which information content can be extracted directly using the general web page extraction rule.
Step S20: extract web data from the obtained HTML content using the general web page extraction rule, and generate the webpage data information the user requires.
Step S30: format the generated webpage data information according to the output format of the response to the user, and return the formatted webpage data information to the user.
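The formatting in step S30 might look like the minimal sketch below. The patent does not list the supported output formats, so the format names `json`, `csv` and `text`, the function name `format_result`, and the record shape (field name mapped to a list of values) are all assumptions made for illustration.

```python
import csv
import io
import json


def format_result(record, output_format="json"):
    """Serialize an extracted record in the requested output format."""
    if output_format == "json":
        return json.dumps(record, ensure_ascii=False, indent=2)
    if output_format == "csv":
        buffer = io.StringIO()
        writer = csv.writer(buffer, lineterminator="\n")
        writer.writerow(["field", "value"])
        for field, values in record.items():
            for value in values:
                writer.writerow([field, value])
        return buffer.getvalue()
    if output_format == "text":
        return "\n".join(
            f"{field}: {value}"
            for field, values in record.items()
            for value in values
        )
    # Unknown formats are rejected rather than silently defaulted.
    raise ValueError(f"unsupported output format: {output_format!r}")
```

For example, `format_result({"title": ["Hello"]}, "text")` would produce a single `title: Hello` line that can be returned to the user as is.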
In step S20, extracting web data from the obtained HTML content comprises the following sub-steps:
Step S21: parse the structure using a Python parsing library, for example by means of XPath or BeautifulSoup;
Step S22: formulate XPath path lookup rules and match all tags such as title, p and img in the web page to obtain the main body information of the web page;
Step S23: judge whether the obtained main body information is complete; if it is complete, go to step S24; if it is incomplete, progressively obtain the content and/or attribute values of tags such as a and li under the div tags in the web page for deep parsing and extraction, and then go to step S24;
Step S24: formulate a large number of regular-expression matching rules and match more refined content information in the web page, so that the accuracy and diversity of the obtained information are ensured while web page information is obtained in a general way.
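The concrete rules named in these sub-steps could be sketched as below. Several things here are assumptions rather than the patent's own rules: the structure is taken to be an `xml.etree.ElementTree` element (whose `findall` supports only a subset of XPath) instead of the output of the XPath or BeautifulSoup tooling the text mentions, the completeness heuristic in `is_complete` is invented for illustration, and a single date pattern stands in for the large set of regular-expression rules of step S24.

```python
import re
import xml.etree.ElementTree as ET

BODY_TAGS = ("title", "p", "img")               # tags named in step S22
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")  # one illustrative S24 rule


def text_of(element):
    """All text under an element, with whitespace collapsed."""
    return " ".join("".join(element.itertext()).split())


def find_body(root):
    """S22: collect title/p text and img sources via path lookups."""
    record = {}
    for tag in BODY_TAGS:
        for element in root.findall(f".//{tag}"):
            value = element.get("src", "") if tag == "img" else text_of(element)
            if value:
                record.setdefault(tag, []).append(value)
    return record


def is_complete(record):
    """S23 check (assumed heuristic): a title and at least one paragraph."""
    return bool(record.get("title")) and bool(record.get("p"))


def deep_parse(root):
    """S23 fallback: content and href attributes of a/li tags under div tags."""
    record, seen = {}, set()
    for div in root.iter("div"):
        for element in div.iter():
            if element.tag not in ("a", "li") or id(element) in seen:
                continue
            seen.add(id(element))  # nested divs would otherwise repeat it
            text = text_of(element)
            if text:
                record.setdefault(element.tag, []).append(text)
            href = element.get("href")
            if href:
                record.setdefault("href", []).append(href)
    return record


def refine(record):
    """S24: pull finer-grained items (here, dates) out of the collected text."""
    text = " ".join(value for values in record.values() for value in values)
    dates = sorted(set(DATE_RE.findall(text)))
    if dates:
        record["date"] = dates
    return record
```

Because none of these functions refers to a site-specific class name or identifier, the same rules can be tried on any page; that generality is also why a completeness check and a fallback are needed.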
The basic principles, main features and advantages of the invention have been shown and described above. Those skilled in the art should understand that the invention is not limited to the above embodiments; the embodiments and the description only illustrate the principle of the invention. Various changes and improvements may be made without departing from the spirit and scope of the invention, and these changes and improvements all fall within the claimed scope of protection. The scope of protection claimed by the invention is defined by the appended claims and their equivalents.

Claims (3)

1. A webpage data information extraction method, characterized by comprising the following steps:
Step S10: receive the address, entered by the user, of the web page from which data information is to be obtained, and obtain the rendered HTML content of the web page at that address;
Step S20: extract web data from the obtained HTML content using a general web page extraction rule, and generate the webpage data information the user requires;
Step S30: format the generated webpage data information according to the output format of the response to the user, and return the formatted webpage data information to the user.
2. The webpage data information extraction method according to claim 1, characterized in that, in step S10, the rendered HTML content of the web page is obtained by dynamically rendering and loading the page with the open-source framework Selenium, and the HTML content is given an initial parsing pre-processing with a parsing library that ships with the Python environment, so that it becomes a structure from which information content can be extracted directly using the general web page extraction rule.
3. The webpage data information extraction method according to claim 2, characterized in that step S20 comprises the following sub-steps:
Step S21: parse the structure using a Python parsing library;
Step S22: formulate path lookup rules and match all tags in the web page to obtain the main body information of the web page;
Step S23: judge whether the obtained main body information is complete; if it is complete, go to step S24; if it is incomplete, progressively obtain the tag content and/or attribute values in the web page for deep parsing and extraction, and then go to step S24;
Step S24: formulate a large number of regular-expression matching rules and match more refined content information in the web page.
CN201910009284.3A 2019-01-04 2019-01-04 Webpage data information extraction method Active CN109885743B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910009284.3A CN109885743B (en) 2019-01-04 2019-01-04 Webpage data information extraction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910009284.3A CN109885743B (en) 2019-01-04 2019-01-04 Webpage data information extraction method

Publications (2)

Publication Number Publication Date
CN109885743A true CN109885743A (en) 2019-06-14
CN109885743B CN109885743B (en) 2024-01-02

Family

ID=66925652

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910009284.3A Active CN109885743B (en) 2019-01-04 2019-01-04 Webpage data information extraction method

Country Status (1)

Country Link
CN (1) CN109885743B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112528205A (en) * 2020-12-22 2021-03-19 中科院计算技术研究所大数据研究院 Webpage main body information extraction method and device and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102682109A (en) * 2012-05-09 2012-09-19 北京彼速信息技术有限公司 Patent information analysis method and device
CN103678432A (en) * 2013-04-07 2014-03-26 南京邮电大学 Webpage main body extraction method based on webpage main body features and intermediate true values
CN104462532A (en) * 2014-12-23 2015-03-25 北京奇虎科技有限公司 Method and device for extracting webpage text
US20160283461A1 (en) * 2014-02-26 2016-09-29 Tencent Technology (Shenzhen) Company Limited Method and terminal for extracting webpage content, and non-transitory storage medium
CN106547895A (en) * 2016-11-03 2017-03-29 北京锐安科技有限公司 A kind of extracting method and device of info web

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102682109A (en) * 2012-05-09 2012-09-19 北京彼速信息技术有限公司 Patent information analysis method and device
CN103678432A (en) * 2013-04-07 2014-03-26 南京邮电大学 Webpage main body extraction method based on webpage main body features and intermediate true values
US20160283461A1 (en) * 2014-02-26 2016-09-29 Tencent Technology (Shenzhen) Company Limited Method and terminal for extracting webpage content, and non-transitory storage medium
CN104462532A (en) * 2014-12-23 2015-03-25 北京奇虎科技有限公司 Method and device for extracting webpage text
CN106547895A (en) * 2016-11-03 2017-03-29 北京锐安科技有限公司 A kind of extracting method and device of info web

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112528205A (en) * 2020-12-22 2021-03-19 中科院计算技术研究所大数据研究院 Webpage main body information extraction method and device and storage medium
CN112528205B (en) * 2020-12-22 2021-10-29 中科院计算技术研究所大数据研究院 Webpage main body information extraction method and device and storage medium

Also Published As

Publication number Publication date
CN109885743B (en) 2024-01-02

Similar Documents

Publication Publication Date Title
CN102253979B (en) Vision-based web page extracting method
CN105022803B (en) A kind of method and system for extracting Web page text content
CN102591612B (en) General webpage text extraction method based on punctuation continuity and system thereof
CN103390051A (en) Topic detection and tracking method based on microblog data
CN102609427A (en) Public opinion vertical search analysis system and method
Ji et al. Tag tree template for Web information and schema extraction
CN101609399A (en) Intelligent website development system and method based on modeling
CN102929902A (en) Character splitting method and device based on Chinese retrieval
CN103049536A (en) Webpage main text content extracting method and webpage text content extracting system
CN103810251A (en) Method and device for extracting text
WO2023155303A1 (en) Webpage data extraction method and apparatus, computer device, and storage medium
CN102411602B (en) Extensive makeup language (XML) parallel speculation analysis method realized on basis of field programmable gate array (FPGA)
CN105740355B (en) Webpage context extraction method and device based on aggregation text density
Xu et al. Novel approach of semantic annotation by fuzzy ontology based on variable precision rough set and concept lattice
CN102591931B (en) Recognition and extraction method for webpage data records based on tree weight
CN109885743A (en) A kind of webpage data information extracting method
KR20130099327A (en) Apparatus for extracting information from open domains and method for the same
Della Penna et al. A spatial relation-based framework to perform visual information extraction
Auger et al. Probing Semantic Relations: Exploration and identification in specialized texts
Pazienza et al. Combining Ontological Knowledge and Wrapper Induction techniques into an e-retail System1
CN103116448A (en) Extract method for visualizing information
CN108984676B (en) Electronic book cross-terminal self-adaptive display system and method based on XML
Lim et al. Generalized and lightweight algorithms for automated web forum content extraction
Han et al. Automatic mobile content conversion using semantic image analysis
Yanfen et al. Educational resources metadata automatically extracted strategy study

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant