CN109885743B - Webpage data information extraction method - Google Patents


Info

Publication number
CN109885743B
Authority
CN
China
Prior art keywords
webpage
data information
webpage data
user
extraction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910009284.3A
Other languages
Chinese (zh)
Other versions
CN109885743A (en)
Inventor
胡成红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Qiyin Information Technology Co ltd
Original Assignee
Shanghai Qiyin Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Qiyin Information Technology Co ltd
Priority to CN201910009284.3A
Publication of CN109885743A
Application granted
Publication of CN109885743B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a webpage data information extraction method comprising the following steps: step S10, receiving from a user the address of a webpage from which data information is to be acquired, and obtaining the rendered HTML content of the webpage at that address; step S20, extracting webpage data from the obtained HTML content using a general webpage extraction rule, and generating the webpage data information required by the user; and step S30, formatting the generated webpage data information according to the output format requested by the user, and returning the formatted webpage data information to the user. Because a single general extraction approach is applied to different webpages, the invention reduces the cost and time of webpage data information extraction, improves extraction efficiency, and avoids the secondary development cost otherwise incurred when a webpage's structure later changes.

Description

Webpage data information extraction method
Technical Field
The invention relates to the technical field of computers, in particular to a webpage data information extraction method.
Background
When data information must be extracted from a large number of different webpages, the conventional approach is to formulate a separate set of extraction rules for each webpage and then extract that webpage's data information with them, as shown in Fig. 1. When the data information structure of a webpage changes, its extraction rules must be corrected or replaced before they can be applied again. This existing approach has the following problems: 1. A set of data extraction rules must be formulated for each webpage individually, so when the number of webpages is large, formulating the rules consumes a great deal of manpower and material resources. 2. When the data information structure of a webpage changes, its extraction rules must be corrected or replaced promptly; otherwise data information can no longer be extracted from that webpage.
Disclosure of Invention
The technical problem solved by the invention is as follows: in view of the shortcomings of the prior art, a webpage data information extraction method is provided that improves extraction efficiency, reduces extraction cost, and avoids the secondary development cost caused by later changes in webpage structure.
The technical problem is solved by the following technical scheme:
A webpage data information extraction method comprises the following steps:
step S10, receiving from a user the address of a webpage from which data information is to be acquired, and obtaining the rendered HTML content of the webpage at that address;
step S20, extracting webpage data from the obtained HTML content using a general webpage extraction rule, and generating the webpage data information required by the user;
step S30, formatting the generated webpage data information according to the output format requested by the user, and returning the formatted webpage data information to the user.
In a preferred embodiment of the invention, in step S10, the rendered HTML content of the webpage is obtained by dynamic rendering and loading with the open-source framework Selenium, and is then given a first parsing pass using a parsing library that comes with the Python environment, so that the HTML content becomes a structure from which information content can be extracted directly using the general webpage extraction rule.
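A minimal sketch of how step S10 might look in Python. The text names only Selenium and a parsing library available in the Python environment, so everything else here is an assumption made for illustration: the function names `fetch_rendered_html` and `preprocess`, the choice of headless Chrome, and the use of the standard-library `html.parser` and `xml.etree.ElementTree` modules to build the "structure".

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never take a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


def fetch_rendered_html(url: str) -> str:
    """Load the page in a headless browser and return the rendered HTML."""
    # Imported lazily so the parsing half works without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()  # assumption: Chrome is available
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source  # HTML after scripts have run
    finally:
        driver.quit()


class _TreeBuilder(HTMLParser):
    """Turns HTML parser events into an ElementTree."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("document")
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = ET.SubElement(self._stack[-1], tag,
                             {k: v or "" for k, v in attrs})
        if tag not in VOID_TAGS:
            self._stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i].tag == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        parent = self._stack[-1]
        if len(parent):  # text after a child is stored as that child's tail
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def preprocess(html: str) -> ET.Element:
    """First parsing pass: rendered HTML string to a queryable tree."""
    builder = _TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

Under these assumptions, `preprocess(fetch_rendered_html(url))` would yield a tree that the later path-based rules can query, for example with `root.iter("p")`.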
In a preferred embodiment of the invention, step S20 comprises the following sub-steps:
step S21, parsing the structure using a Python parsing library;
step S22, formulating a path search rule, and matching and searching all tags in the webpage to obtain the main body information of the webpage;
step S23, judging whether the acquired main body information is complete; if so, proceeding to step S24; if not, progressively acquiring tag contents and/or attribute values in the webpage for deeper parsing and extraction, and then proceeding to step S24;
step S24, formulating a large number of regular-expression matching rules to match finer-grained content information in the webpage.
Owing to the above technical scheme, the invention has the following beneficial effects: because a single general extraction approach is applied to different webpages, the cost and time of webpage data information extraction are reduced, extraction efficiency is improved, and the secondary development cost caused by later changes in webpage structure is avoided.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a conventional web page data information extraction method.
Fig. 2 is a flowchart of a web page data information extraction method of the present invention.
Detailed Description
The invention is further described with reference to the following detailed drawings in order to make the technical means, the creation characteristics, the achievement of the purpose and the effect of the implementation of the invention easy to understand.
Referring to Fig. 2, a webpage data information extraction method is provided, which comprises the following steps:
Step S10, receiving from a user the address of a webpage from which data information is to be acquired, and obtaining the rendered HTML content of the webpage at that address. Specifically, the rendered HTML content is obtained by dynamic rendering and loading with the open-source framework Selenium, and is then given a first parsing pass using a parsing library that comes with the Python environment, so that the HTML content becomes a structure from which information content can be extracted directly using the general webpage extraction rule.
Step S20, extracting webpage data from the obtained HTML content using the general webpage extraction rule, and generating the webpage data information required by the user.
Step S30, formatting the generated webpage data information according to the output format requested by the user, and returning the formatted webpage data information to the user.
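Step S30 could be sketched as a small serializer. The text does not say which output formats are supported, so the JSON and CSV branches below, the function name `format_output`, and the representation of the extracted information as a list of dictionaries are all assumptions made for illustration.

```python
import csv
import io
import json


def format_output(records, output_format="json"):
    """Serialize a list of dicts in the format the user asked for."""
    if output_format == "json":
        return json.dumps(records, ensure_ascii=False, indent=2)
    if output_format == "csv":
        # The union of keys, in first-seen order, becomes the header row.
        fields = list(dict.fromkeys(k for rec in records for k in rec))
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
        writer.writeheader()
        writer.writerows(records)  # missing keys are written as empty cells
        return buffer.getvalue()
    raise ValueError(f"unsupported output format: {output_format!r}")
```

Rejecting unknown formats with an error, rather than silently falling back to a default, keeps the returned data consistent with what the user requested.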
In step S20, extracting webpage data from the acquired HTML content comprises the following sub-steps:
step S21, parsing the structure using a Python parsing library, for example by an XPath or BeautifulSoup method;
step S22, formulating XPath path search rules, and matching and searching all title, p, img and similar tags in the webpage to obtain the main body information of the webpage;
step S23, judging whether the acquired main body information is complete; if so, proceeding to step S24; if not, progressively acquiring tag contents and/or attribute values, such as those of a and li tags under div tags in the webpage, for deeper parsing and extraction, and then proceeding to step S24;
step S24, formulating a large number of regular-expression matching rules to match finer-grained content information in the webpage, so that the acquired information remains accurate and varied even though the webpage information is acquired in a general way.
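The sub-steps S21 to S24 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: it uses the limited XPath support of the standard-library `xml.etree.ElementTree` in place of a full XPath engine or BeautifulSoup, it treats the main body information as "complete" when a title and at least one paragraph were found (the text does not define completeness), and the two regular expressions stand in for the "large number" of rules the text mentions. The names `MAIN_RULES`, `REFINE_RULES` and `extract` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# S22: generic path rules applied to every page (no per-site rules).
MAIN_RULES = {"title": ".//title", "paragraphs": ".//p", "images": ".//img"}

# S24: illustrative regular expressions for finer-grained fields.
REFINE_RULES = {
    "dates": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "emails": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
}


def _text(node):
    """All text under a node, with whitespace collapsed."""
    return " ".join("".join(node.itertext()).split())


def _a_and_li_under_div(root):
    """Yield each a/li element that sits under some div, once."""
    seen = set()
    for div in root.iter("div"):
        for node in div.iter():
            if node.tag in ("a", "li") and id(node) not in seen:
                seen.add(id(node))
                yield node


def extract(root):
    # S21/S22: query the parsed structure with the generic path rules.
    title = root.find(MAIN_RULES["title"])
    paragraphs = [_text(p) for p in root.iterfind(MAIN_RULES["paragraphs"])]
    images = [i.get("src") for i in root.iterfind(MAIN_RULES["images"])]
    info = {
        "title": _text(title) if title is not None else "",
        "paragraphs": [p for p in paragraphs if p],
        "images": [src for src in images if src],
        "links": [],
    }

    # S23: assumed completeness test; if it fails, dig into a/li under div.
    if not (info["title"] and info["paragraphs"]):
        for node in _a_and_li_under_div(root):
            if node.tag == "li" and _text(node):
                info["paragraphs"].append(_text(node))
            elif node.tag == "a" and node.get("href"):
                info["links"].append(node.get("href"))

    # S24: regular expressions pick finer details out of the gathered text.
    text = " ".join([info["title"], *info["paragraphs"]])
    for name, pattern in REFINE_RULES.items():
        info[name] = pattern.findall(text)
    return info
```

Here `root` is expected to be an `ElementTree` element produced by the first parsing pass of step S10. Because the path rules name only generic tags, the same function can be applied to pages with different layouts, which is the point of the general extraction rule.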
The foregoing has shown and described the basic principles, main features and advantages of the invention. Those skilled in the art will understand that the invention is not limited to the embodiments described above, which merely illustrate its principles, and that various changes and modifications may be made without departing from the spirit and scope of the invention. The scope of the invention is defined by the appended claims and their equivalents.

Claims (1)

1. A webpage data information extraction method, characterized by comprising the following steps:
step S10, receiving from a user the address of a webpage from which data information is to be acquired, and obtaining the rendered HTML content of the webpage at that address;
step S20, extracting webpage data from the obtained HTML content using a general webpage extraction rule, and generating the webpage data information required by the user;
step S30, formatting the generated webpage data information according to the output format requested by the user, and returning the formatted webpage data information to the user;
in step S10, the rendered HTML content of the webpage is obtained by dynamic rendering and loading with the open-source framework Selenium, and is given a first parsing pass using a parsing library that comes with the Python environment, so that the HTML content becomes a structure from which information content can be extracted directly using the general webpage extraction rule;
step S20 comprises the following sub-steps:
step S21, parsing the structure using a Python parsing library;
step S22, formulating a path search rule, and matching and searching all tags in the webpage to obtain the main body information of the webpage;
step S23, judging whether the acquired main body information is complete; if so, proceeding to step S24; if not, progressively acquiring tag contents and/or attribute values in the webpage for deeper parsing and extraction, and then proceeding to step S24;
step S24, formulating a large number of regular-expression matching rules to match finer-grained content information in the webpage.
CN201910009284.3A 2019-01-04 2019-01-04 Webpage data information extraction method Active CN109885743B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910009284.3A CN109885743B (en) 2019-01-04 2019-01-04 Webpage data information extraction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910009284.3A CN109885743B (en) 2019-01-04 2019-01-04 Webpage data information extraction method

Publications (2)

Publication Number Publication Date
CN109885743A CN109885743A (en) 2019-06-14
CN109885743B CN109885743B (en) 2024-01-02

Family

ID=66925652

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910009284.3A Active CN109885743B (en) 2019-01-04 2019-01-04 Webpage data information extraction method

Country Status (1)

Country Link
CN (1) CN109885743B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112528205B (en) * 2020-12-22 2021-10-29 中科院计算技术研究所大数据研究院 Webpage main body information extraction method and device and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102682109A (en) * 2012-05-09 2012-09-19 北京彼速信息技术有限公司 Patent information analysis method and device
CN103678432A (en) * 2013-04-07 2014-03-26 南京邮电大学 Webpage main body extraction method based on webpage main body features and intermediate true values
CN104462532A (en) * 2014-12-23 2015-03-25 北京奇虎科技有限公司 Method and device for extracting webpage text
US20160283461A1 (en) * 2014-02-26 2016-09-29 Tencent Technology (Shenzhen) Company Limited Method and terminal for extracting webpage content, and non-transitory storage medium
CN106547895A (en) * 2016-11-03 2017-03-29 北京锐安科技有限公司 A kind of extracting method and device of info web

Also Published As

Publication number Publication date
CN109885743A (en) 2019-06-14

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant