CN109885743B - Webpage data information extraction method - Google Patents
Webpage data information extraction method
- Publication number
- CN109885743B (application CN201910009284.3A)
- Authority
- CN
- China
- Prior art keywords
- webpage
- data information
- webpage data
- user
- extraction
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses a webpage data information extraction method comprising the following steps: step S10, receiving a webpage address, input by a user, from which data information is to be acquired, and acquiring the rendered HTML content of the webpage at that address; step S20, extracting webpage data from the acquired HTML content using a general webpage extraction rule, and generating the webpage data information required by the user; and step S30, formatting the generated webpage data information according to the output format requested by the user, and returning the formatted webpage data information to the user. Because the invention extracts data information from different webpages in a general manner, it greatly reduces the cost and time of webpage data information extraction, improves extraction efficiency, and avoids the secondary development cost caused by later changes to webpage structure.
Description
Technical Field
The invention relates to the technical field of computers, in particular to a webpage data information extraction method.
Background
When data must be extracted from a large number of different webpages, a dedicated data information extraction rule is formulated for each webpage so that its data information can be extracted, as shown in Fig. 1. When the data information structure of a webpage changes, the extraction rule for that webpage must be corrected or replaced before it can be applied again and data extraction can resume. The existing webpage data information extraction method therefore has the following problems: 1. a separate set of data extraction rules must be formulated for each webpage, so when the number of webpages to be extracted is large, formulating the rules consumes a great deal of manpower and material resources; 2. when the data information structure of a webpage changes, the extraction rule for that webpage must be corrected or replaced promptly, otherwise data information can no longer be extracted from the webpage.
Disclosure of Invention
The technical problem solved by the invention is as follows: in view of the defects of the prior art, a webpage data information extraction method is provided that improves extraction efficiency, reduces extraction cost, and avoids the secondary development cost caused by later changes to webpage structure.
The technical problem addressed by the invention can be solved by the following technical scheme:
A webpage data information extraction method comprises the following steps:
step S10, receiving a webpage address, input by a user, from which data information is to be acquired, and acquiring the rendered HTML content of the webpage at that address;
step S20, extracting webpage data from the acquired HTML content using a general webpage extraction rule, and generating the webpage data information required by the user;
step S30, formatting the generated webpage data information according to the output format requested by the user, and returning the formatted webpage data information to the user.
In a preferred embodiment of the present invention, in step S10, the rendered HTML content, obtained by loading the page with dynamic rendering through the open-source framework Selenium, undergoes a first parsing preprocessing pass using a parsing library that ships with the Python environment, so that the HTML content becomes a structure from which information content can be extracted directly using the general webpage extraction rule.
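As a minimal sketch of step S10, and not the patented implementation itself, the following Python code loads a page through Selenium and then performs a first parsing pass with the standard library's `html.parser`. Treating `html.parser` as the parsing library that ships with Python is an assumption, as are the function names `fetch_rendered_html` and `preprocess`, the headless Chrome configuration, and the choice of an `ElementTree` element as the resulting structure.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never have a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


def fetch_rendered_html(url):
    """Load the page in a browser and return the HTML after rendering.

    Selenium is imported lazily so the parsing code below can be used
    without a browser. Headless Chrome is an assumption of this sketch.
    """
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # page_source reflects the DOM at the time of the call; pages that
        # load data later may need an explicit wait before this line.
        return driver.page_source
    finally:
        driver.quit()


class _TreeParser(HTMLParser):
    """Build an ElementTree from possibly untidy HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("document")
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = ET.SubElement(self._stack[-1], tag,
                             {k: v or "" for k, v in attrs})
        if tag not in VOID_TAGS:
            self._stack.append(node)

    def handle_startendtag(self, tag, attrs):
        ET.SubElement(self._stack[-1], tag, {k: v or "" for k, v in attrs})

    def handle_endtag(self, tag):
        # Close back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i].tag == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        parent = self._stack[-1]
        if len(parent):
            # Text that follows a child element belongs to that child's tail.
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def preprocess(html):
    """First parsing pass: turn an HTML string into a searchable tree."""
    parser = _TreeParser()
    parser.feed(html)
    parser.close()
    return parser.root
```

A caller would typically chain the two functions, for example `tree = preprocess(fetch_rendered_html(url))`, and pass the tree to the extraction step.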
In a preferred embodiment of the invention, said step S20 comprises the sub-steps of:
step S21, parsing the structure using a Python parsing library;
step S22, formulating a path search rule, and matching and searching all tags in the webpage to obtain the main body information of the webpage;
step S23, judging whether the acquired main body information is complete; if so, proceeding to step S24; if not, progressively acquiring tag content and/or attribute values in the webpage for deeper parsing and extraction, and then proceeding to step S24;
step S24, formulating a large number of regular-expression matching rules to match finer-grained content information in the webpage.
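The regular-expression matching of step S24 could look roughly like the sketch below. The three rules and the names `REFINE_RULES` and `refine` are hypothetical illustrations; the patent does not list its rules, and a real deployment would presumably need many more.

```python
import re

# Hypothetical refinement rules, keyed by the kind of detail they capture.
REFINE_RULES = {
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "phone": re.compile(r"\b\d{3}-\d{4}-\d{4}\b"),
}


def refine(text):
    """Apply every rule to the text and collect all matches per rule."""
    return {
        name: [match.group(0) for match in rule.finditer(text)]
        for name, rule in REFINE_RULES.items()
    }
```

Because the rules run on already extracted text rather than on page-specific markup, they do not depend on the layout of any single webpage.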
By adopting the above technical scheme, the invention has the following beneficial effects: because data information is extracted from different webpages in a general manner, the cost and time of webpage data information extraction are greatly reduced, extraction efficiency is improved, and the secondary development cost caused by later changes to webpage structure is avoided.
Drawings
To illustrate the embodiments of the invention and the prior-art technical solutions more clearly, the drawings used in their description are briefly introduced below. The drawings described below show only some embodiments of the invention; a person skilled in the art can derive other drawings from them without inventive effort.
Fig. 1 is a flowchart of a conventional web page data information extraction method.
Fig. 2 is a flowchart of a web page data information extraction method of the present invention.
Detailed Description
To make the technical means, inventive features, objectives, and effects of the invention easy to understand, the invention is further described below with reference to the drawings.
Referring to Fig. 2, a webpage data information extraction method is provided, comprising the following steps:
Step S10, receiving a webpage address, input by a user, from which data information is to be acquired, and acquiring the rendered HTML content of the webpage at that address. Specifically, the rendered HTML content, obtained by loading the page with dynamic rendering through the open-source framework Selenium, undergoes a first parsing preprocessing pass using a parsing library that ships with the Python environment, so that the HTML content becomes a structure from which information content can be extracted directly using the general webpage extraction rule.
Step S20, extracting webpage data from the acquired HTML content using a general webpage extraction rule, and generating the webpage data information required by the user.
Step S30, formatting the generated webpage data information according to the output format requested by the user, and returning the formatted webpage data information to the user.
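The formatting in step S30 can be illustrated with a short sketch. The patent does not name the supported output formats, so the three formats here (JSON, CSV, plain text), the function name `format_output`, and the assumed shape of the extracted data (a dictionary mapping each field name to a list of strings) are all assumptions.

```python
import csv
import io
import json


def format_output(info, output_format="json"):
    """Serialize extracted data in the format the user asked for.

    `info` is assumed to map a field name to a list of string values.
    """
    if output_format == "json":
        return json.dumps(info, ensure_ascii=False, indent=2)
    if output_format == "csv":
        buffer = io.StringIO(newline="")
        writer = csv.writer(buffer)
        writer.writerow(["field", "value"])
        for field, values in info.items():
            for value in values:
                writer.writerow([field, value])
        return buffer.getvalue()
    if output_format == "text":
        return "\n".join(
            f"{field}: {value}"
            for field, values in info.items()
            for value in values
        )
    raise ValueError(f"unsupported output format: {output_format}")
```

Rejecting unknown formats with an error, rather than silently falling back to a default, keeps the returned data consistent with what the user asked for.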
In step S20, web page data extraction is performed on the acquired HTML content, including the following sub-steps:
step S21, parsing the structure using a Python parsing library, for example with XPath or BeautifulSoup;
step S22, formulating an XPath path search rule, and matching and searching all title, p, img, and similar tags in the webpage to obtain the main body information of the webpage;
step S23, judging whether the acquired main body information is complete; if so, proceeding to step S24; if not, progressively acquiring tag content and/or attribute values, such as those of a and li tags under a div tag, for deeper parsing and extraction, and then proceeding to step S24;
step S24, formulating a large number of regular-expression matching rules to match finer-grained content information in the webpage, so that the acquired information remains accurate and diverse even though the webpage information is acquired in a general way.
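Sub-steps S21 to S23 can be sketched as follows. To stay self-contained, the sketch uses the limited XPath subset of the standard library's `xml.etree.ElementTree` in place of a full XPath engine or BeautifulSoup, and it expects an already parsed, well-formed tree. The completeness test in `is_complete` (a title plus at least one paragraph) is a hypothetical criterion, since the patent does not state how completeness is judged; all function and constant names are likewise illustrative.

```python
import xml.etree.ElementTree as ET

MAIN_TAGS = ("title", "p", "img")              # tags named in step S22
FALLBACK_PATHS = (".//div//a", ".//div//li")   # tags named in step S23


def node_value(node):
    """Return the src attribute for images, otherwise the node's full text."""
    if node.tag == "img":
        return node.get("src", "")
    return "".join(node.itertext()).strip()


def extract_main(root):
    """Step S22: search every title, p, and img tag with a path rule."""
    found = {}
    for tag in MAIN_TAGS:
        values = [node_value(node) for node in root.iterfind(f".//{tag}")]
        found[tag] = [value for value in values if value]
    return found


def is_complete(info):
    """Step S23 check (hypothetical): a title and at least one paragraph."""
    return bool(info["title"]) and bool(info["p"])


def extract(root):
    """Run S22, then fall back to deeper extraction when S23 fails."""
    info = extract_main(root)
    if not is_complete(info):
        extra, seen = [], set()
        for path in FALLBACK_PATHS:
            for node in root.iterfind(path):
                if id(node) in seen:   # nested divs can yield a node twice
                    continue
                seen.add(id(node))
                text, href = node_value(node), node.get("href")
                if text or href:
                    extra.append({"tag": node.tag, "text": text, "href": href})
        info["extra"] = extra
    return info
```

For example, a page with a title but no paragraphs fails the completeness check, so `extract` also collects the link and list items found under its `div` tags.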
The foregoing shows and describes the basic principles, main features, and advantages of the invention. Those skilled in the art will understand that the invention is not limited to the embodiments described above; the embodiments and description merely illustrate the principles of the invention, and various changes and modifications may be made without departing from its spirit and scope. The scope of the invention is defined by the appended claims and their equivalents.
Claims (1)
1. A webpage data information extraction method, characterized by comprising the following steps:
step S10, receiving a webpage address, input by a user, from which data information is to be acquired, and acquiring the rendered HTML content of the webpage at that address;
step S20, extracting webpage data from the acquired HTML content using a general webpage extraction rule, and generating the webpage data information required by the user;
step S30, formatting the generated webpage data information according to the output format requested by the user, and returning the formatted webpage data information to the user;
in step S10, the rendered HTML content, obtained by loading the page with dynamic rendering through the open-source framework Selenium, undergoes a first parsing preprocessing pass using a parsing library that ships with the Python environment, so that the HTML content becomes a structure from which information content can be extracted directly using the general webpage extraction rule;
step S20 comprises the following sub-steps:
step S21, parsing the structure using a Python parsing library;
step S22, formulating a path search rule, and matching and searching all tags in the webpage to obtain the main body information of the webpage;
step S23, judging whether the acquired main body information is complete; if so, proceeding to step S24; if not, progressively acquiring tag content and/or attribute values in the webpage for deeper parsing and extraction, and then proceeding to step S24;
step S24, formulating a large number of regular-expression matching rules to match finer-grained content information in the webpage.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910009284.3A CN109885743B (en) | 2019-01-04 | 2019-01-04 | Webpage data information extraction method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910009284.3A CN109885743B (en) | 2019-01-04 | 2019-01-04 | Webpage data information extraction method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109885743A (en) | 2019-06-14 |
CN109885743B (en) | 2024-01-02 |
Family
ID=66925652
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910009284.3A Active CN109885743B (en) | 2019-01-04 | 2019-01-04 | Webpage data information extraction method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109885743B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112528205B (en) * | 2020-12-22 | 2021-10-29 | 中科院计算技术研究所大数据研究院 | Webpage main body information extraction method and device and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102682109A (en) * | 2012-05-09 | 2012-09-19 | 北京彼速信息技术有限公司 | Patent information analysis method and device |
CN103678432A (en) * | 2013-04-07 | 2014-03-26 | 南京邮电大学 | Webpage main body extraction method based on webpage main body features and intermediate true values |
CN104462532A (en) * | 2014-12-23 | 2015-03-25 | 北京奇虎科技有限公司 | Method and device for extracting webpage text |
US20160283461A1 (en) * | 2014-02-26 | 2016-09-29 | Tencent Technology (Shenzhen) Company Limited | Method and terminal for extracting webpage content, and non-transitory storage medium |
CN106547895A (en) * | 2016-11-03 | 2017-03-29 | 北京锐安科技有限公司 | A kind of extracting method and device of info web |
Also Published As
Publication number | Publication date |
---|---|
CN109885743A (en) | 2019-06-14 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |