CN109885743A

CN109885743A - Method for extracting web page data information

Info

Publication number: CN109885743A
Application number: CN201910009284.3A
Authority: CN
Inventors: 胡成红
Original assignee: Shanghai Seven India Mdt Infotech Ltd
Current assignee: Shanghai Seven India Mdt Infotech Ltd
Priority date: 2019-01-04
Filing date: 2019-01-04
Publication date: 2019-06-14
Anticipated expiration: 2039-01-04
Also published as: CN109885743B

Abstract

The invention discloses a method for extracting web page data information, comprising the following steps. Step S10: receive the user-supplied address of the web page from which data information is to be obtained, and obtain the rendered HTML content of the web page corresponding to that address. Step S20: extract web page data from the obtained HTML content using general-purpose web page extraction rules, and generate the web page data information required by the user. Step S30: format the generated web page data information according to the output format of the user response, and return the formatted web page data information to the user. By applying one general-purpose extraction approach to different web pages, the invention greatly reduces the cost of extracting web page data information, improves extraction efficiency, shortens extraction time, and avoids the later secondary development cost caused by changes in web page structure.

Description

A method for extracting web page data information

Technical field

The present invention relates to the field of computer technology, and in particular to a method for extracting web page data information.

Background art

When data information must be extracted from a large number of different web pages, a corresponding data information extraction rule has to be formulated for each web page before that page's data information can be extracted, as shown in Fig. 1. When the data information structure of a web page changes, the extraction rule for that page must be modified or replaced so that it applies again and extraction can continue. Existing methods for extracting web page data information therefore have the following problems: 1. a separate set of data extraction rules must be formulated for every web page, so when the number of web pages to be extracted is large, formulating the rules consumes a great deal of manpower and material resources; 2. when the data information structure of a web page changes, the extraction rule for that page must be modified or replaced promptly, otherwise data information can no longer be extracted from the page.

Summary of the invention

The technical problem solved by the present invention is to address the deficiencies of the prior art by providing a method for extracting web page data information that improves extraction efficiency, reduces extraction cost, and avoids the later secondary development cost caused by changes in web page structure.

The technical problem to be solved by the present invention can be addressed by the following technical solution:

A method for extracting web page data information, comprising the following steps:

Step S10: receive the user-supplied address of the web page from which data information is to be obtained, and obtain the rendered HTML content of the web page corresponding to that address;

Step S20: extract web page data from the obtained HTML content using general-purpose web page extraction rules, and generate the web page data information required by the user;

Step S30: format the generated web page data information according to the output format of the user response, and return the formatted web page data information to the user.
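Step S30 can be illustrated with a minimal sketch. The patent does not say which output formats are supported, so the JSON and CSV formats below, the function name `format_output`, and the list-of-dictionaries record shape are all assumptions made for illustration.

```python
import csv
import io
import json


def format_output(records, output_format="json"):
    """Serialize extracted records in the format requested for the response.

    `records` is a list of dicts; "json" and "csv" are assumed formats.
    """
    if output_format == "json":
        # ensure_ascii=False keeps non-ASCII page text readable
        return json.dumps(records, ensure_ascii=False, indent=2)
    if output_format == "csv":
        # Collect every key so rows with different fields still fit one header
        fieldnames = []
        for record in records:
            for key in record:
                if key not in fieldnames:
                    fieldnames.append(key)
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)  # missing fields are written as empty cells
        return buffer.getvalue()
    raise ValueError(f"unsupported output format: {output_format}")
```

Dispatching on a format name keeps the extraction steps independent of how the result is delivered, so a further format could be added without touching steps S10 and S20.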

In a preferred embodiment of the present invention, in step S10, the rendered HTML content of the web page is obtained by dynamically rendering and loading the page with the open-source framework Selenium, and the HTML content is given a first parsing pass (preprocessing) with a parsing library that ships with the Python environment, so that it becomes a structure from which information content can be extracted directly using the general-purpose web page extraction rules.

In a preferred embodiment of the present invention, step S20 comprises the following sub-steps:

Step S21: parse the structure using a Python parsing library;

Step S22: formulate path search rules and match all tags in the web page, so as to obtain the main body information of the web page;

Step S23: judge whether the obtained main body information is complete; if it is complete, proceed to step S24; if it is incomplete, progressively obtain the tag content and/or attribute values in the web page for deep parsing and extraction, and then proceed to step S24;

Step S24: formulate a large number of regular-expression matching rules to match finer-grained content information in the web page.
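A minimal sketch of the regular-expression refinement in step S24 follows. The rule table, its field names (`dates`, `emails`, `prices`) and the patterns themselves are hypothetical examples; the patent states only that a large number of such rules are formulated.

```python
import re

# Hypothetical rule table: field names and patterns are illustrative only.
REFINEMENT_RULES = {
    "dates": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "emails": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "prices": re.compile(r"[$¥€]\s?\d+(?:\.\d{2})?"),
}


def refine(text, rules=REFINEMENT_RULES):
    """Apply every regex rule to the text and collect the matches per field."""
    return {name: pattern.findall(text) for name, pattern in rules.items()}
```

Because the rules are keyed by field name rather than by web page, the same table can be applied to the text of any page, which matches the general-purpose intent of the method.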

With the technical solution described above, the beneficial effects of the present invention are as follows: by applying one general-purpose extraction approach to different web pages, the invention greatly reduces the cost of extracting web page data information, improves extraction efficiency, shortens extraction time, and avoids the later secondary development cost caused by changes in web page structure.

Brief description of the drawings

To explain the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a flowchart of a conventional method for extracting web page data information.

Fig. 2 is a flowchart of the method for extracting web page data information according to the present invention.

Detailed description of the embodiments

To make the technical means, inventive features, objectives and effects of the present invention easy to understand, the invention is further explained below with reference to the drawings.

Referring to Fig. 2, the figure shows a method for extracting web page data information, comprising the following steps:

Step S10: receive the user-supplied address of the web page from which data information is to be obtained, and obtain the rendered HTML content of the web page corresponding to that address. Specifically, the rendered HTML content is obtained by dynamically rendering and loading the page with the open-source framework Selenium, and the HTML content is given a first parsing pass (preprocessing) with a parsing library that ships with the Python environment, so that it becomes a structure from which information content can be extracted directly using the general-purpose web page extraction rules.
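Step S10 can be sketched as follows. The patent names Selenium but not a browser, so Chrome in headless mode is an assumption, as are the helper names `TreeParser`, `make_headless_chrome`, `fetch_rendered_html` and `preprocess`. For the first parsing pass the sketch uses the standard-library `html.parser` and `xml.etree.ElementTree` modules as one possible reading of "a parsing library that ships with the Python environment".

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never have a closing tag
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TreeParser(HTMLParser):
    """Hypothetical helper: turns HTML text into an ElementTree structure."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.builder = ET.TreeBuilder()
        self.builder.start("document", {})  # synthetic root element
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        self.builder.start(tag, {name: value or "" for name, value in attrs})
        if tag in VOID_TAGS:
            self.builder.end(tag)
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag: ignore it
        while self.open_tags:
            open_tag = self.open_tags.pop()
            self.builder.end(open_tag)
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.builder.data(data)

    def finish(self):
        self.close()
        while self.open_tags:  # close anything the page left open
            self.builder.end(self.open_tags.pop())
        self.builder.end("document")
        return self.builder.close()


def make_headless_chrome():
    """Create a Selenium driver; Chrome and headless mode are assumptions."""
    from selenium import webdriver  # imported lazily; needs selenium installed
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    return webdriver.Chrome(options=options)


def fetch_rendered_html(url, driver):
    """Load the page in the browser and return the HTML after rendering."""
    driver.get(url)
    return driver.page_source


def preprocess(html):
    """First parsing pass: raw HTML text -> element tree."""
    parser = TreeParser()
    parser.feed(html)
    return parser.finish()
```

With a real browser, the call would be `preprocess(fetch_rendered_html(url, make_headless_chrome()))`, and the driver should be released with `driver.quit()` afterwards. Passing the driver in as a parameter also lets the function be exercised with a stand-in object that only provides `get` and `page_source`.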

Step S20: extract web page data from the obtained HTML content using general-purpose web page extraction rules, and generate the web page data information required by the user.

In step S20, extracting web page data from the obtained HTML content comprises the following sub-steps:

Step S21: parse the structure using a Python parsing library, for example by means of XPath or BeautifulSoup;

Step S22: formulate XPath path search rules and match all tags in the web page such as title, p and img, so as to obtain the main body information of the web page;

Step S23: judge whether the obtained main body information is complete; if it is complete, proceed to step S24; if it is incomplete, progressively obtain the content and/or attribute values of tags such as a and li under the div tags in the web page for deep parsing and extraction, and then proceed to step S24;

Step S24: formulate a large number of regular-expression matching rules to match finer-grained content information in the web page, so that web page information is obtained in a general-purpose manner while the accuracy and diversity of the obtained information are preserved.
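Sub-steps S21 to S23 above can be sketched as follows. To stay self-contained, the sketch assumes the first parsing pass has already produced well-formed markup and uses the limited XPath subset of the standard-library `xml.etree.ElementTree` in place of a full XPath engine or BeautifulSoup. The completeness criterion in `is_complete` (a title plus at least one paragraph) is an assumption, since the patent does not define how completeness is judged; all function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# S22: path search rules for the tags named in the embodiment
BODY_PATHS = {"title": ".//title", "paragraphs": ".//p", "images": ".//img"}


def text_of(element):
    """Return all text inside an element, with outer whitespace trimmed."""
    return "".join(element.itertext()).strip()


def extract_body(root):
    """S22: look up title/p/img tags with path expressions."""
    title = root.find(BODY_PATHS["title"])
    paragraphs = [text_of(p) for p in root.findall(BODY_PATHS["paragraphs"])]
    return {
        "title": text_of(title) if title is not None else "",
        "paragraphs": [text for text in paragraphs if text],
        "images": [img.get("src", "") for img in root.findall(BODY_PATHS["images"])],
    }


def is_complete(body):
    """S23: assumed completeness test (a title plus at least one paragraph)."""
    return bool(body["title"]) and bool(body["paragraphs"])


def deep_extract(root):
    """S23 fallback: collect text and attributes of a/li tags under div tags."""
    items, seen = [], set()
    for div in root.iter("div"):
        for element in div.iter():
            # `seen` avoids duplicates when div tags are nested
            if element.tag in ("a", "li") and id(element) not in seen:
                seen.add(id(element))
                items.append({"tag": element.tag,
                              "text": text_of(element),
                              "attrs": dict(element.attrib)})
    return items


def extract(markup):
    """Run S21 to S23 on well-formed markup and return the collected data."""
    root = ET.fromstring(markup)  # S21: parse the structure
    body = extract_body(root)
    if not is_complete(body):
        body["extra"] = deep_extract(root)
    return body
```

The result of `extract` would then be handed to the regular-expression rules of step S24. The path rules depend only on generic tag names, not on any one site's layout, which is what lets the same rules be reused across pages.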

The basic principles, main features and advantages of the present invention have been shown and described above. Those skilled in the art should understand that the present invention is not limited to the above embodiments; the embodiments and the description merely illustrate the principles of the invention. Various changes and improvements may be made to the invention without departing from its spirit and scope, and all such changes and improvements fall within the scope of protection claimed. The scope of protection claimed is defined by the appended claims and their equivalents.

Claims

1. A method for extracting web page data information, characterized by comprising the following steps:

Step S10: receive the user-supplied address of the web page from which data information is to be obtained, and obtain the rendered HTML content of the web page corresponding to that address;

Step S20: extract web page data from the obtained HTML content using general-purpose web page extraction rules, and generate the web page data information required by the user;

Step S30: format the generated web page data information according to the output format of the user response, and return the formatted web page data information to the user.

2. The method for extracting web page data information according to claim 1, characterized in that, in step S10, the rendered HTML content of the web page is obtained by dynamically rendering and loading the page with the open-source framework Selenium, and the HTML content is given a first parsing pass (preprocessing) with a parsing library that ships with the Python environment, so that it becomes a structure from which information content can be extracted directly using the general-purpose web page extraction rules.

3. The method for extracting web page data information according to claim 2, characterized in that step S20 comprises the following sub-steps:

Step S23: judge whether the obtained main body information is complete; if it is complete, proceed to step S24; if it is incomplete, progressively obtain the tag content and/or attribute values in the web page for deep parsing and extraction, and then proceed to step S24;