CN104050281A

CN104050281A - Webpage information extraction method and device based on http protocol

Info

Publication number: CN104050281A
Application number: CN201410299203.5A
Authority: CN
Inventors: 马春新; 董磊
Original assignee: Beijing Si Tech Information Technology Co Ltd
Current assignee: Beijing Si Tech Information Technology Co Ltd
Priority date: 2014-06-26
Filing date: 2014-06-26
Publication date: 2014-09-17

Abstract

The invention relates to a webpage information extraction method and device based on the HTTP protocol. The method comprises the steps of template generation, webpage address analysis, information extraction, information checking and information storage. In the template generation step, a corresponding page analysis template is customized according to the target page from which information is to be extracted, and target fields and checking rules are predefined in the page analysis template. In the webpage address analysis step, the webpage address of the target page is analyzed to obtain the HTML source file of the target page. In the information extraction step, the HTML source file of the target page is read and analyzed, and the page information matching the target fields predefined in the page analysis template is extracted from it. In the information checking step, the extracted page information is checked against the predefined checking rules to determine whether it meets the requirements. In the information storage step, the page information that has passed information checking is stored. With this method and device, page information on the network is effectively filtered, acquired and collected over the open HTTP protocol, templates are customized for different target pages, and extraction of customized information is achieved.

Description

Webpage information extraction method and device based on the HTTP protocol

Technical field

The present invention relates to the field of information crawling and parsing in network technology, and in particular to a webpage information extraction method and device based on the HTTP protocol.

Background art

The Web 2.0 era is an era of information explosion. Massive amounts of data pervade every aspect of work and life, so the demand for data-based analysis and for mining the potential value of data grows more urgent by the day. In practice, however, because data owners control their data very strictly, much valuable data cannot be conveniently collected and extracted. Against this background, the importance of data is increasingly prominent, yet its availability is low or even restricted. How to collect, extract and use the target data of interest, based on the Internet characteristics of the data, has therefore become a problem in urgent need of a solution.

Summary of the invention

The technical problem to be solved by the present invention is to provide an information extraction method and device based on the HTTP protocol, so as to solve the prior-art problem that target information is difficult to obtain.

The technical solution adopted by the present invention to solve the above technical problem is as follows: a webpage information extraction method based on the HTTP protocol, comprising:

Template generation step: according to the target page from which information is to be extracted, customize a corresponding page analysis template, and predefine target fields and checking rules in the page analysis template;

Webpage address analysis step: analyze the webpage address of the target page to obtain the HTML source file of the target page;

Information extraction step: read and analyze the HTML source file of the target page, and extract from it the page information that matches the target fields predefined in the page analysis template;

Information checking step: check, according to the predefined checking rules, whether the extracted page information meets the requirements;

Information storage step: store the page information that has passed information checking.
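The webpage address analysis step amounts to an HTTP GET followed by decoding of the response body. The following is a minimal sketch in Python using the standard-library `urllib`. The function name `fetch_html`, the timeout value, and the UTF-8 fallback are illustrative assumptions; the source does not specify them.

```python
from urllib.parse import urlsplit
from urllib.request import urlopen


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Return the HTML source of the page at `url` as text.

    Hypothetical helper: the source only states that the webpage address
    is analyzed and the HTML source file of the target page is obtained.
    """
    scheme = urlsplit(url).scheme
    # "data" is accepted only so the sketch can be exercised offline.
    if scheme not in ("http", "https", "data"):
        raise ValueError(f"unsupported URL scheme: {scheme!r}")
    with urlopen(url, timeout=timeout) as response:
        # Prefer the charset declared in the response headers;
        # UTF-8 is an assumed fallback, not something the source specifies.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

In the flow described here, the returned text would then be handed to the information extraction step or written to a temporary file.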

On the basis of the above technical solution, the present invention may be further improved as follows.

Further, in the information extraction step, the matching page information is extracted in block mode.

Further, the customized page analysis template is an XML file that contains predefined node information, target field information and checking rule information.
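As an illustration of what such a template might look like, the sketch below defines a small XML template and reads it with Python's `xml.dom.minidom`. The element and attribute names (`template`, `node`, `name`, `field`, `check`) and the example URL are hypothetical; the source states only that the file is XML and carries node, target field and checking rule information.

```python
from xml.dom.minidom import parseString

# Hypothetical page analysis template: one <node> per piece of information,
# with the target field (here an HTML tag name) and a checking rule (regex).
TEMPLATE_XML = r"""
<template url="http://example.com/item/1.html">
  <node name="title" field="h1" check=".{1,200}"/>
  <node name="price" field="span" check="[0-9]+(\.[0-9]{2})?"/>
</template>
""".strip()


def load_template(xml_text: str) -> list:
    """Return one dict per <node> with its name, target field and rule."""
    document = parseString(xml_text)
    return [
        {
            "name": node.getAttribute("name"),
            "field": node.getAttribute("field"),
            "check": node.getAttribute("check"),
        }
        for node in document.getElementsByTagName("node")
    ]
```

Keeping the checking rule next to the target field means a new target page needs only a new template file, not new code.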

Further, the predefined checking rules are regular expressions.

Further, SAX is used to read and analyze the HTML source file of the target page.
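SAX is an event-driven parser interface: the handler receives start-element, character-data and end-element events as the file is read, without building a full tree. The sketch below uses Python's `xml.sax` and assumes the HTML source is well-formed XHTML, which real pages often are not; the class and function names are illustrative, and the source does not say how malformed HTML is handled.

```python
import xml.sax


class FieldTextHandler(xml.sax.ContentHandler):
    """Collect the text of every element whose tag is a target field."""

    def __init__(self, target_tags):
        super().__init__()
        self.target_tags = set(target_tags)
        self.found = {}    # tag -> list of extracted text values
        self._stack = []   # (tag, text chunks) for currently open target elements

    def startElement(self, name, attrs):
        if name in self.target_tags:
            self._stack.append((name, []))

    def characters(self, content):
        # Text belongs to every open target element, so nesting is handled.
        for _, chunks in self._stack:
            chunks.append(content)

    def endElement(self, name):
        if self._stack and self._stack[-1][0] == name:
            tag, chunks = self._stack.pop()
            self.found.setdefault(tag, []).append("".join(chunks).strip())


def extract_fields(xhtml: bytes, target_tags) -> dict:
    """Parse well-formed XHTML and return the text found per target tag."""
    handler = FieldTextHandler(target_tags)
    xml.sax.parseString(xhtml, handler)
    return handler.found
```

Because SAX streams the document, memory use stays flat even for large pages, which may be why it is chosen for the page while DOM is used for the small template.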

Further, extracting from the HTML source file of the target page the page information that matches the target fields predefined in the page analysis template specifically comprises: reading the customized page analysis template using DOM, traversing the nodes contained in the template, matching the target fields in the HTML source file of the target page, extracting the page information that matches the target fields, and saving it to a temporary table.
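The traversal described above can be sketched as follows. This is a simplified illustration under stated assumptions: the template layout (`node` elements with `name` and `field` attributes), the table name `temp_extract`, and the use of SQLite are hypothetical, and the matching is done with a plain regular expression over tag pairs rather than the SAX parsing the source mentions.

```python
import re
import sqlite3
from xml.dom.minidom import parseString

# Hypothetical template: "field" holds the HTML tag to match.
TEMPLATE = (
    '<template>'
    '<node name="title" field="h1"/>'
    '<node name="price" field="span"/>'
    '</template>'
)


def extract_to_temp_table(template_xml: str, html: str, conn: sqlite3.Connection) -> int:
    """Match each template node's target field in the HTML and stage the hits."""
    conn.execute("CREATE TEMP TABLE IF NOT EXISTS temp_extract (name TEXT, value TEXT)")
    staged = 0
    # DOM read of the template, then one pass over the page per template node.
    for node in parseString(template_xml).getElementsByTagName("node"):
        tag = re.escape(node.getAttribute("field"))
        # Assumed matching rule: the text between <tag ...> and </tag>.
        pattern = re.compile(rf"<{tag}\b[^>]*>(.*?)</{tag}\s*>", re.IGNORECASE | re.DOTALL)
        for match in pattern.finditer(html):
            conn.execute(
                "INSERT INTO temp_extract (name, value) VALUES (?, ?)",
                (node.getAttribute("name"), match.group(1).strip()),
            )
            staged += 1
    return staged
```

Staging matches in a temporary table keeps unchecked data separate from the final results until the checking step has run.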

Corresponding to the above webpage information extraction method, the technical solution of the present invention further includes a webpage information extraction device based on the HTTP protocol, comprising:

a template generation module, configured to customize a corresponding page analysis template according to the target page from which information is to be extracted, and to predefine target fields and checking rules in the page analysis template;

a webpage address analysis module, configured to analyze the webpage address of the target page and obtain the HTML source file of the target page;

an information extraction module, configured to read and analyze the HTML source file of the target page, and to extract from it the page information that matches the target fields predefined in the page analysis template;

an information checking module, configured to check whether the page information extracted by the information extraction module meets the requirements;

an information storage module, configured to store the page information that has passed information checking.

Further, the information storage module uses a database server.

The beneficial effects of the present invention are as follows. The invention does not depend on the openness of data owners: data on the Internet can be collected and extracted over the basic Internet communication protocol (HTTP), which facilitates value mining and analysis of the data. Through the open HTTP protocol, the invention effectively filters, acquires and collects the page information that is accessible on the network, and customizes templates for different target pages so that customized information can be extracted. Unlike approaches that generate the template while extracting, the invention customizes the template of the target page in advance, which is more targeted and helps improve the efficiency and accuracy of information extraction.

Brief description of the drawings

Fig. 1 is a schematic flowchart of the information extraction method based on the HTTP protocol according to the present invention;

Fig. 2 is a system architecture diagram of the information extraction method according to an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of the information extraction device based on the HTTP protocol according to the present invention.

Detailed description of the embodiments

The principles and features of the present invention are described below with reference to the accompanying drawings. The examples given serve only to explain the present invention and are not intended to limit its scope.

As shown in Fig. 1, this embodiment provides a webpage information extraction method based on the HTTP protocol, comprising:

Wherein, the template generation step further comprises: configuring, in the customized page analysis template, keywords for extracting the required page information. Correspondingly, in the information extraction step, extracting the required page information from the target page specifically comprises: extracting the required page information from the target page in block mode according to the keywords configured in the page analysis template.
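The source does not define "block mode" precisely. One plausible reading, sketched below, treats each outermost `<div>` of the page as a block and keeps the blocks whose text contains a configured keyword. The block boundary (`div`), the class name `BlockCollector` and the function name `blocks_with_keyword` are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class BlockCollector(HTMLParser):
    """Collect the text of each outermost <div> as one block."""

    def __init__(self):
        super().__init__()
        self.blocks = []    # finished blocks, one string per outermost <div>
        self._depth = 0     # current <div> nesting depth
        self._chunks = []   # text pieces of the block being read

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.blocks.append(" ".join(self._chunks))
                self._chunks = []

    def handle_data(self, data):
        if self._depth > 0 and data.strip():
            self._chunks.append(data.strip())


def blocks_with_keyword(html: str, keyword: str) -> list:
    """Return the text of every block that contains the keyword."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    return [block for block in collector.blocks if keyword in block]
```

Under this reading, the keyword narrows the page down to the relevant region first, and field matching then runs only inside that region.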

As shown in Fig. 2, in a specific implementation the corresponding system has a three-tier architecture consisting of a target page layer, a service layer and a data layer. The target page layer mainly comprises the pages from which information is to be obtained. The service layer hosts several acquisition servers; the corresponding program functions are deployed there as services that model and collect the target pages, and the functions of the template generation, webpage address analysis, information extraction and information checking steps are all implemented in this layer. The data layer hosts several database servers, which store the collected and extracted valid information as data.

The program flow of the corresponding program functions is as follows:

1. First, a page template for the target page is defined in the foreground and saved as an XML file; the file contains node names and checking rule information (usually regular expressions).

2. A request is sent to the target URL, triggered manually from the foreground (or by a timer thread started in the background), and the HTML source file of the target page is obtained and saved as a temporary file.

3. The temporary file is read and parsed using SAX, and the template file is read using DOM; each template node is traversed and its target field is matched in the temporary file, and the matched data is saved temporarily to a temporary table.

4. The data in the temporary table is read and checked against the predefined checking rules, and the data that satisfies the conditions is inserted into the extraction result table.
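The final part of this flow, reading the temporary table, checking each value and inserting the passing rows into the result table, can be sketched as follows. The table names `temp_extract` and `extract_result`, the use of SQLite, the whole-value `re.fullmatch` semantics, and the policy of rejecting fields that have no rule are assumptions; the source names none of them.

```python
import re
import sqlite3


def promote_checked_rows(conn: sqlite3.Connection, rules: dict) -> int:
    """Copy rows from temp_extract to extract_result when they pass their rule.

    `rules` maps a field name to its checking rule (a regular expression).
    The temp_extract table is assumed to exist already.
    """
    conn.execute("CREATE TABLE IF NOT EXISTS extract_result (name TEXT, value TEXT)")
    accepted = 0
    rows = conn.execute("SELECT name, value FROM temp_extract").fetchall()
    for name, value in rows:
        rule = rules.get(name)
        # Fields without a rule are rejected here; that policy is an assumption.
        if rule is not None and re.fullmatch(rule, value):
            conn.execute(
                "INSERT INTO extract_result (name, value) VALUES (?, ?)",
                (name, value),
            )
            accepted += 1
    conn.commit()
    return accepted
```

A value such as `call us` staged for a numeric price field would stay in the temporary table and never reach the result table.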

When the above three-tier architecture is adopted, the above steps are refined in a specific implementation, and the concrete process is as follows:

Step 1: the system performs initialization loading, loading the service-layer programs used for analyzing, checking and storing target URL information.

Step 2: set the webpage address to be extracted and customize the page analysis template. A page analysis template must be customized in advance for the target page, and keywords for the page elements of interest can be configured in the template.

Step 3: start the extraction and analysis services.

Step 4: according to the template and webpage address set in step 2, obtain information from the target URL, load the page information and complete the block-mode extraction; key content can be extracted according to the configured keywords.

Step 5: check the extracted information using regular expressions and replace special characters so that the information can be stored in the database, then complete the storage.
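The source does not list which special characters are replaced in step 5. As a hedged illustration, the sketch below assumes they are ASCII control characters and runs of whitespace, and normalizes a value before it is stored; the function name `clean_for_storage` and both patterns are hypothetical.

```python
import re

# Assumed definition of "special characters": ASCII control characters
# (other than tab, newline and carriage return) and runs of whitespace.
_CONTROL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")
_WHITESPACE = re.compile(r"\s+")


def clean_for_storage(value: str) -> str:
    """Remove control characters and collapse whitespace to single spaces."""
    value = _CONTROL.sub("", value)
    return _WHITESPACE.sub(" ", value).strip()
```

Quote characters are deliberately left alone here: when rows are inserted with parameterized SQL statements, quotes in the data do not need to be escaped by hand.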

As shown in Fig. 3, corresponding to the above information extraction method, this embodiment also provides an information extraction device based on the HTTP protocol, comprising:

an information storage module, configured to store the page information that has passed information checking.

The working principle and specific implementation details of the information extraction device of this embodiment are the same as those of the above information extraction method and are not repeated here.

The foregoing is merely a preferred embodiment of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims

1. A webpage information extraction method based on the HTTP protocol, characterized by comprising:

2. The webpage information extraction method according to claim 1, characterized in that, in the information extraction step, the matching page information is extracted in block mode.

3. The webpage information extraction method according to claim 1, characterized in that the customized page analysis template is an XML file that contains predefined node information, target field information and checking rule information.

4. The webpage information extraction method according to claim 1, characterized in that the predefined checking rules are regular expressions.

5. The webpage information extraction method according to claim 1, characterized in that SAX is used to read and analyze the HTML source file of the target page.

6. The webpage information extraction method according to claim 1, characterized in that extracting from the HTML source file of the target page the page information that matches the target fields predefined in the page analysis template specifically comprises: reading the customized page analysis template using DOM, traversing the nodes contained in the template, matching the target fields in the HTML source file of the target page, extracting the page information that matches the target fields, and saving it to a temporary table.

7. A webpage information extraction device based on the HTTP protocol, characterized by comprising:

8. The webpage information extraction device according to claim 7, characterized in that the information storage module uses a database server.