CN109885743B

CN109885743B - Webpage data information extraction method

Info

Publication number: CN109885743B
Application number: CN201910009284.3A
Authority: CN
Inventors: 胡成红
Original assignee: Shanghai Qiyin Information Technology Co ltd
Current assignee: Shanghai Qiyin Information Technology Co ltd
Priority date: 2019-01-04
Filing date: 2019-01-04
Publication date: 2024-01-02
Anticipated expiration: 2039-01-04
Also published as: CN109885743A

Abstract

The invention discloses a webpage data information extraction method comprising the following steps: step S10, receiving a webpage address, input by a user, from which data information is to be acquired, and acquiring the rendered HTML content of the webpage at that address; step S20, extracting webpage data from the acquired HTML content using a general webpage extraction rule, and generating the webpage data information required by the user; and step S30, formatting the generated webpage data information according to the output format requested by the user, and returning the formatted webpage data information to the user. Because the invention extracts data information from different webpages with one general extraction method, it reduces extraction cost and time, improves extraction efficiency, and avoids the secondary development cost otherwise incurred when the structure of a webpage later changes.

Description

Webpage data information extraction method

Technical Field

The invention relates to the technical field of computers, in particular to a webpage data information extraction method.

Background

When data information must be extracted from a large number of different webpages, a dedicated extraction rule is formulated for each webpage before its data information can be extracted, as shown in Fig. 1. When the data information structure of a webpage changes, the extraction rule for that webpage must be corrected or replaced before it can be applied again. The existing webpage data information extraction method therefore has the following problems: 1. A set of data extraction rules must be formulated for each webpage separately, so when the number of webpages to be extracted is large, formulating the rules consumes a great deal of manpower and material resources. 2. When the data information structure of a webpage changes, the extraction rule for that webpage must be corrected or changed promptly; otherwise, data information can no longer be extracted from that webpage.

Disclosure of Invention

The technical problem solved by the invention is as follows: in view of the deficiencies of the prior art, a webpage data information extraction method is provided that improves extraction efficiency, reduces extraction cost, and avoids the secondary development cost caused by later changes to the structure of a webpage.

The technical problem addressed by the invention can be solved by the following technical solution:

A webpage data information extraction method comprises the following steps:

step S10, receiving a webpage address, input by a user, from which data information is to be acquired, and acquiring the rendered HTML content of the webpage at that address;

step S20, extracting webpage data from the acquired HTML content using a general webpage extraction rule, and generating the webpage data information required by the user;

and step S30, formatting the generated webpage data information according to the output format requested by the user, and returning the formatted webpage data information to the user.
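As an illustration only, the three steps can be sketched as a small pipeline whose last stage performs the formatting of step S30. The function names `format_output` and `extract_page`, the choice of JSON and CSV as output formats, and the representation of extracted data as a list of dictionaries are assumptions made for this sketch; the source does not specify them.

```python
import csv
import io
import json


def format_output(records, output_format="json"):
    """S30 (sketch): serialize extracted records in the requested format."""
    if output_format == "json":
        return json.dumps(records, ensure_ascii=False, indent=2)
    if output_format == "csv":
        # Use the union of keys, in first-seen order, so records with
        # different fields still fit into one table.
        fieldnames = []
        for record in records:
            for key in record:
                if key not in fieldnames:
                    fieldnames.append(key)
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
        writer.writeheader()
        writer.writerows(records)
        return buffer.getvalue()
    raise ValueError(f"unsupported output format: {output_format!r}")


def extract_page(url, fetch_rendered_html, extract, output_format="json"):
    """Chain S10, S20 and S30; both callables are supplied by the caller."""
    html = fetch_rendered_html(url)                # S10: rendered HTML for the address
    records = extract(html)                        # S20: general extraction rule
    return format_output(records, output_format)   # S30: format and return
```

Passing the fetching and extraction stages in as callables keeps the formatting stage independent of how the page is rendered or parsed.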

In a preferred embodiment of the present invention, in step S10, the rendered HTML content, obtained by dynamically rendering and loading the webpage with the open-source framework Selenium, undergoes a first parsing pre-processing using a parsing library built into the Python environment, so that the HTML content becomes a structure from which information content can be extracted directly using the general webpage extraction rule.

In a preferred embodiment of the invention, step S20 comprises the following sub-steps:

step S21, parsing the structure using a Python parsing library;

step S22, formulating a path search rule, and matching and searching all tags in the webpage to obtain the main body information of the webpage;

step S23, judging whether the acquired main body information is complete; if so, proceeding to step S24; if not, progressively acquiring tag contents and/or attribute values in the webpage for deeper parsing and extraction, and then proceeding to step S24;

and step S24, formulating a large number of regular-expression matching rules to match finer-grained content information in the webpage.
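The regular-expression matching of step S24 can be pictured as a table of named patterns applied to the text already extracted. The sketch below is illustrative only: the rule names, the two patterns, and the function `refine` are hypothetical, and a practical rule set would be far larger.

```python
import re

# Hypothetical rule table: each name maps to one compiled pattern.
REFINEMENT_RULES = {
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def refine(text, rules=REFINEMENT_RULES):
    """Apply every rule to the text and keep only rules that matched."""
    found = {}
    for name, pattern in rules.items():
        matches = pattern.findall(text)
        if matches:
            found[name] = matches
    return found
```

Keeping the rules in a dictionary means new patterns can be added without touching the matching loop.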

With the above technical solution, the invention has the following beneficial effects: data information is extracted from different webpages with one general extraction method, which reduces extraction cost and time, improves extraction efficiency, and avoids the secondary development cost otherwise incurred when the structure of a webpage later changes.

Drawings

To illustrate the embodiments of the invention and the technical solutions of the prior art more clearly, the drawings used in describing them are briefly introduced below. The drawings described below show only some embodiments of the invention; a person skilled in the art can derive other drawings from them without inventive effort.

Fig. 1 is a flowchart of a conventional web page data information extraction method.

Fig. 2 is a flowchart of a web page data information extraction method of the present invention.

Detailed Description

To make the technical means, inventive features, objectives, and effects of the invention easy to understand, the invention is further described below with reference to the drawings.

Referring to Fig. 2, a webpage data information extraction method is provided, which comprises the following steps:

Step S10, receiving a webpage address, input by a user, from which data information is to be acquired, and acquiring the rendered HTML content of the webpage at that address. Specifically, the rendered HTML content, obtained by dynamically rendering and loading the webpage with the open-source framework Selenium, undergoes a first parsing pre-processing using a parsing library built into the Python environment, so that the HTML content becomes a structure from which information content can be extracted directly using the general webpage extraction rule.
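A minimal sketch of step S10 follows, assuming headless Chrome as the browser and the standard-library `html.parser` module as the parsing library; the source names neither. The names `fetch_rendered_html`, `NodeCollector`, and `parse_to_structure`, and the flat list-of-nodes structure, are hypothetical choices made for illustration.

```python
from html.parser import HTMLParser

VOID_TAGS = {"img", "br", "hr", "meta", "link", "input"}


def fetch_rendered_html(url, timeout_seconds=30):
    """Load the page in headless Chrome and return the DOM after scripts ran."""
    # Imported lazily so the parsing helper below works without Selenium.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.set_page_load_timeout(timeout_seconds)
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()


class NodeCollector(HTMLParser):
    """Flatten HTML into a list of {tag, attrs, text} nodes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.nodes = []
        self._open = []  # nodes whose end tag has not been seen yet

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "attrs": dict(attrs), "text": ""}
        self.nodes.append(node)
        if tag not in VOID_TAGS:
            self._open.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i]["tag"] == tag:
                del self._open[i:]
                break

    def handle_data(self, data):
        if self._open and data.strip():
            node = self._open[-1]
            node["text"] = (node["text"] + " " + data.strip()).strip()


def parse_to_structure(html):
    """First parsing pass: turn rendered HTML into a searchable node list."""
    parser = NodeCollector()
    parser.feed(html)
    parser.close()
    return parser.nodes
```

Rendering through a real browser matters because content injected by scripts is absent from the raw HTTP response but present in `page_source`.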

Step S20, extracting webpage data from the acquired HTML content using a general webpage extraction rule, and generating the webpage data information required by the user.

In step S20, extracting webpage data from the acquired HTML content comprises the following sub-steps:

step S21, parsing the structure using a Python parsing library, for example with XPath or BeautifulSoup;

step S22, formulating an XPath path search rule, and matching and searching all tags such as title, p, and img in the webpage to obtain the main body information of the webpage;

step S23, judging whether the acquired main body information is complete; if so, proceeding to step S24; if not, progressively acquiring tag contents and/or attribute values, such as those of a and li tags under a div tag in the webpage, for deeper parsing and extraction, and then proceeding to step S24;

step S24, formulating a large number of regular-expression matching rules to match finer-grained content information in the webpage, so that the information acquired is accurate and varied even though the webpage information is acquired in a general way.
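Sub-steps S21 to S23 can be sketched with the standard-library `xml.etree.ElementTree` module, which supports a limited XPath subset and is used here as a stand-in for a full XPath or BeautifulSoup parser. The sketch assumes well-formed markup, which real pages often are not. The completeness criterion in `is_complete` (a title plus at least one paragraph) and all function names are assumptions for illustration; the source does not define them.

```python
import xml.etree.ElementTree as ET


def text_of(element):
    """Concatenate all text inside an element, including nested tags."""
    return "".join(element.itertext()).strip()


def unique(elements):
    """Drop repeats, which nested div tags can produce for './/div//x' paths."""
    seen = set()
    for element in elements:
        if id(element) not in seen:
            seen.add(id(element))
            yield element


def search_main_body(root):
    """S22: match title, p and img tags with path search rules."""
    title = root.find(".//title")
    return {
        "title": text_of(title) if title is not None else "",
        "paragraphs": [text_of(p) for p in root.iterfind(".//p") if text_of(p)],
        "images": [img.get("src") for img in root.iterfind(".//img") if img.get("src")],
    }


def is_complete(info):
    """S23, hypothetical criterion: a title and at least one paragraph."""
    return bool(info["title"] and info["paragraphs"])


def deepen(root, info):
    """S23 fallback: read li text and a href values found under div tags."""
    for li in unique(root.iterfind(".//div//li")):
        if text_of(li):
            info["paragraphs"].append(text_of(li))
    info["links"] = [a.get("href") for a in unique(root.iterfind(".//div//a")) if a.get("href")]
    return info


def extract_main_body(markup):
    root = ET.fromstring(markup)      # S21: parse into a tree
    info = search_main_body(root)     # S22: path search
    if not is_complete(info):         # S23: completeness check
        info = deepen(root, info)
    return info
```

The fallback runs only when the first pass is judged incomplete, which mirrors the branch in step S23; finer-grained matching under step S24 would then operate on the returned text.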

The foregoing shows and describes the basic principles, main features, and advantages of the present invention. Those skilled in the art will understand that the present invention is not limited to the embodiments described above; the embodiments and descriptions merely illustrate the principles of the invention, and various changes and modifications may be made without departing from its spirit and scope. The scope of the invention is defined by the appended claims and their equivalents.

Claims

1. A webpage data information extraction method, characterized by comprising the following steps:

step S30, formatting the generated webpage data information according to the output format requested by the user, and returning the formatted webpage data information to the user;

in step S10, the rendered HTML content, obtained by dynamically rendering and loading the webpage with the open-source framework Selenium, undergoes a first parsing pre-processing using a parsing library built into the Python environment, so that the HTML content becomes a structure from which information content can be extracted directly using the general webpage extraction rule;

the step S20 comprises the following sub-steps: