CN106202348A - Web page table information extraction method

Info

Publication number: CN106202348A
Application number: CN201610524342.2A
Authority: CN
Inventors: 胡生辉; 龙冬阳; 衣杨; 袁野; 杨洋
Original assignee: Sun Yat Sen University
Current assignee: Sun Yat Sen University
Priority date: 2016-07-04
Filing date: 2016-07-04
Publication date: 2016-12-07

Abstract

The invention discloses a web page table information extraction method, comprising: the user configures an operation file in advance; the user enters the URL of the web page to be captured, and a Java system captures the page; the Java system preprocesses the captured page; the Java system parses the page and locates the target data according to information entered manually by the user and, at the same time, generates an extraction rule and stores it in a rule base for maintenance; the Java system extracts the required data from the page according to the extraction rule and stores the extracted data in a MySQL database; and the Java system, according to the configuration of the operation file, uses JSP to present the extracted data to the user as a dynamic page on which the user can operate. The invention reduces the system resources consumed during web page table information extraction, speeds up the extraction, makes it easier for the user to perform secondary processing on the extracted table data, and improves the efficiency of that secondary processing.

Description

A web page table information extraction method

Technical field

The present invention relates to the field of web page information extraction, and in particular to a web page table information extraction method.

Background art

With the rapid advance of web technology, a web page can hold a large amount of data. However, web pages are filled with information that many users do not care about, such as advertising images and promotional links, and this information is often mixed in with the main content of the page, making it hard for users to obtain the important information quickly. Moreover, a user who wants to reuse the target information for another purpose must first copy it out by hand, strip the HTML tags and other noise, and rearrange it before finally presenting it as desired. This approach is not only inaccurate but also time-consuming, labor-intensive, and inefficient.

Because tables (Table) express the relationships between pieces of information concisely and effectively, they are widely used in web pages in every field. Most Table tags are used to present relational data to users, for example in railway timetables, online shopping, online banking, and management information systems. At present, however, web information extraction technology has seen limited application to tables. Most approaches first process the web page into a DOM (Document Object Model) tree and then extract the data. This approach loads the entire content of the page into memory, so processing many pages consumes a great deal of memory. In addition, when handling table data, most applications merely extract the data from the table; they do not abstract out the common operations that may be performed on that data for further processing, which often reduces the efficiency of secondary processing of the data.

In view of this, a new web page table information extraction method is urgently needed to solve the problems that existing web page table information extraction techniques consume considerable system resources and that secondary processing of web page data is inefficient.

Summary of the invention

The technical problem to be solved by the present invention is that existing web page table information extraction techniques consume considerable system resources and that secondary processing of web page data is inefficient.

To solve the above technical problem, the technical solution adopted by the present invention is to provide a web page table information extraction method comprising the following steps:

The user configures an operation file in advance;

The user enters the URL of the web page to be captured, and a Java system captures the page;

The Java system preprocesses the captured page;

The Java system parses the page and locates the target data according to information entered manually by the user and, at the same time, generates an extraction rule and stores it in a rule base for maintenance;

The Java system extracts the required data from the page according to the extraction rule and stores the extracted data in a MySQL database;

The Java system, according to the configuration of the operation file, uses JSP to present the extracted data to the user as a dynamic page on which the user can operate.

In the above technical solution, the user adds the operations to be performed to the operation file in a predefined format, including confirmation, modification, and deletion operations on the web page table data.

In the above technical solution, the predefined format is consistent with the format used by the Java system when it parses the operation file.

In the above technical solution, the preprocessing of the captured page by the Java system includes removing irrelevant pictures, videos, music, large blocks of text, and navigation bars from the page, and formatting the HTML file.

In the above technical solution, when the page has a large amount of content, the page is denoised.

In the above technical solution, the operation file is manipulate.conf.

In the above technical solution, the extraction rule includes information on which table of the page, and which columns of that table, the required data is located in.

In the above technical solution, the Java system uses the open-source Java toolkit Jsoup to parse the page and locate the target data.

The present invention reduces the system resources consumed during web page table information extraction, speeds up the extraction, makes it easier for the user to perform secondary processing on the extracted table data, and improves the efficiency of that secondary processing.

Brief description of the drawings

Fig. 1 is a flowchart of a web page table information extraction method provided by an embodiment of the present invention;

Fig. 2 is a data flow diagram of the secondary processing of the extracted web page table data, provided by an embodiment of the present invention.

Detailed description

The present invention covers several aspects: configuration of the operation file manipulate.conf, web page capture, web page preprocessing, Jsoup parsing, table locating based on a sample table row, rule base maintenance, data extraction, data normalization and data persistence, and the display of data and operations. The target table in the web page is located from an example entered by the user, the data in the target table is stored in a database, and the data is then displayed as the user wishes so that the target data can be reused. This reduces the system resources consumed during web page table information extraction, speeds up the extraction, makes it easier for the user to perform secondary processing on the extracted table data, and improves the efficiency of that secondary processing.

The present invention is described in detail below with reference to the drawings and specific embodiments.

An embodiment of the present invention provides a web page table information extraction method which, as shown in Fig. 1, comprises the following steps:

S1. The user configures the operation file manipulate.conf in advance.

According to the operations that the user ultimately wants to perform on the extracted web page table data, the user configures the operation file manipulate.conf in advance, adding the desired operations to it in a predefined format, including operations such as confirmation, modification, and deletion of the web page table data. The predefined format only needs to be consistent with the format the Java system uses when it parses manipulate.conf.

S2. The user enters the URL (Uniform Resource Locator) of the web page to be captured, and the purpose-written Java system captures the page.

To improve the stability of the Java system, the URL entered by the user must conform to the URL standard; if the entered URL does not conform to the URL standard, a prompt window pops up when it is submitted.
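The URL check in step S2 could look roughly like the following. This is a minimal sketch rather than the implementation of the invention: the class name UrlCheck and the restriction to http and https are assumptions made for illustration, and only the standard java.net.URI parser is used.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper; the name and the http/https restriction are assumptions.
final class UrlCheck {
    /** Returns true if the input parses as an absolute http(s) URL with a host. */
    static boolean isValidUrl(String input) {
        if (input == null || input.trim().isEmpty()) {
            return false;
        }
        try {
            URI uri = new URI(input.trim());
            String scheme = uri.getScheme();
            boolean httpLike = "http".equalsIgnoreCase(scheme)
                    || "https".equalsIgnoreCase(scheme);
            // A missing host means there is no page that could be fetched.
            return httpLike && uri.getHost() != null;
        } catch (URISyntaxException e) {
            // Illegal characters (for example spaces) end up here.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidUrl("https://example.com/tickets?page=1"));
        System.out.println(isValidUrl("not a url"));
    }
}
```

A caller would show the prompt window whenever isValidUrl returns false and start the capture only otherwise.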

S3. The Java system preprocesses the captured page.

To optimize the subsequent extraction, the present invention preprocesses the captured page. This includes removing irrelevant pictures, videos, music, large blocks of text, and navigation bars from the page, and formatting the HTML file (that is, serializing the HTML page as XML). In addition, if the page has a large amount of content, it can be denoised.

S4. According to the information entered manually by the user, the Java system uses the open-source Java toolkit Jsoup to parse the page and locate the target data and, at the same time, generates an extraction rule and stores it in the rule base for maintenance.

The user manually enters the content of one row of interest. For a train ticketing site, for example, this could be one row of ticket data, including the train number, price, origin, destination, and number of remaining tickets. This helps the Java system quickly locate the data the user needs.

The extraction rule includes information such as which table of the page, and which columns, the required data is located in. Storing the rule in the rule base makes it possible to extract information directly from similar pages of the same website later, which saves time.
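The way a sample row could be turned into an extraction rule can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: it assumes the tables of the page have already been parsed (for example by Jsoup) into lists of rows of cell text, it matches cells by exact text after trimming, and the names SampleRowLocator and ExtractionRule are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch: tables are assumed to be pre-parsed into rows of cell text.
final class SampleRowLocator {
    /** Assumed rule shape: the table index plus the column of each sample value. */
    static final class ExtractionRule {
        final int tableIndex;
        final List<Integer> columnIndexes;

        ExtractionRule(int tableIndex, List<Integer> columnIndexes) {
            this.tableIndex = tableIndex;
            this.columnIndexes = columnIndexes;
        }
    }

    /** Finds the first table containing a row that holds every sample value. */
    static Optional<ExtractionRule> locate(List<List<List<String>>> tables,
                                           List<String> sampleRow) {
        if (sampleRow.isEmpty()) {
            return Optional.empty();
        }
        for (int t = 0; t < tables.size(); t++) {
            for (List<String> row : tables.get(t)) {
                List<Integer> columns = matchColumns(row, sampleRow);
                if (columns != null) {
                    return Optional.of(new ExtractionRule(t, columns));
                }
            }
        }
        return Optional.empty();
    }

    /** Maps each sample value to a distinct column of the row, or returns null. */
    private static List<Integer> matchColumns(List<String> row, List<String> sampleRow) {
        List<Integer> columns = new ArrayList<>();
        for (String value : sampleRow) {
            int found = -1;
            for (int c = 0; c < row.size(); c++) {
                if (row.get(c).trim().equals(value.trim()) && !columns.contains(c)) {
                    found = c;
                    break;
                }
            }
            if (found < 0) {
                return null; // this row does not contain the whole sample
            }
            columns.add(found);
        }
        return columns;
    }

    public static void main(String[] args) {
        List<List<List<String>>> tables = Arrays.asList(
                Arrays.asList(Arrays.asList("Home", "About")),
                Arrays.asList(
                        Arrays.asList("Train", "From", "To", "Price"),
                        Arrays.asList("G101", "Beijing", "Shanghai", "553")));
        locate(tables, Arrays.asList("G101", "553")).ifPresent(rule ->
                System.out.println("table " + rule.tableIndex
                        + ", columns " + rule.columnIndexes));
    }
}
```

A rule of this shape is small enough to be stored per website in the rule base and reused on similar pages without asking the user for a new sample row.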

S5. The Java system extracts the required data from the page according to the extraction rule and stores the extracted data in a MySQL database, which makes later secondary processing of the data easier.
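Applying a stored rule in step S5 amounts to reading the recorded columns from every row of the recorded table. The sketch below assumes the table has already been parsed into rows of cell text and that the rule supplies a list of column indexes; the class name RuleExtractor is hypothetical, and header rows are not treated specially.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of applying an extraction rule to one parsed table.
final class RuleExtractor {
    /**
     * Returns, for every row that has all the requested columns,
     * the trimmed cell values in rule order. Shorter rows are skipped.
     */
    static List<List<String>> extract(List<List<String>> table,
                                      List<Integer> columnIndexes) {
        List<List<String>> records = new ArrayList<>();
        for (List<String> row : table) {
            List<String> record = new ArrayList<>();
            boolean complete = true;
            for (int column : columnIndexes) {
                if (column >= row.size()) {
                    complete = false; // e.g. a spanning or separator row
                    break;
                }
                record.add(row.get(column).trim());
            }
            if (complete) {
                records.add(record);
            }
        }
        return records;
    }

    public static void main(String[] args) {
        List<List<String>> table = Arrays.asList(
                Arrays.asList("G101", "Beijing", "Shanghai", " 553 "),
                Arrays.asList("notice"),
                Arrays.asList("G103", "Beijing", "Shanghai", "553"));
        System.out.println(extract(table, Arrays.asList(0, 3)));
    }
}
```

The resulting records could then be written to the MySQL database, for instance through a JDBC PreparedStatement executed as a batch; the table schema is not specified in this description.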

Fig. 2 shows the data flow of the secondary processing of the extracted web page table data, as provided by the embodiment of the present invention.

S6. The Java system, according to the configuration of the operation file manipulate.conf, uses JSP (Java Server Pages) to present the extracted data to the user as a dynamic page on which the user can operate.
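The display step can be pictured as combining each extracted row with the operations configured in manipulate.conf. The sketch below is a plain-Java stand-in for the JSP page, not the page used by the invention: the class name TableView and the choice to render each operation as a button are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the JSP view: one table row per record,
// followed by one button per configured operation.
final class TableView {
    /** Escapes the characters that are significant in HTML text. */
    static String escape(String text) {
        return text.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    static String render(List<List<String>> rows, List<String> operations) {
        StringBuilder html = new StringBuilder("<table>\n");
        for (List<String> row : rows) {
            html.append("  <tr>");
            for (String cell : row) {
                html.append("<td>").append(escape(cell)).append("</td>");
            }
            for (String operation : operations) {
                html.append("<td><button>")
                        .append(escape(operation))
                        .append("</button></td>");
            }
            html.append("</tr>\n");
        }
        return html.append("</table>").toString();
    }

    public static void main(String[] args) {
        List<List<String>> rows = Arrays.asList(Arrays.asList("Alice", "120 & 5"));
        System.out.println(render(rows, Arrays.asList("register", "operation")));
    }
}
```

In a real JSP page the same loop would typically be written with tags rather than string concatenation; the point here is only that the set of buttons is driven by the configuration rather than hard-coded.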

Embodiment 1.

First, the configuration format of the operation file manipulate.conf is defined as follows:

1: dump browses, 0, none;

2: single register, 2, register, operation;

3: double register, 4, employee operation, employee register, supervisor operation, supervisor register.
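A parser for entries in this format might look like the sketch below. The field meanings are inferred from the three example lines and are an assumption: a numeric id before the colon, then a name, then a count, then that many item names. The class name OperationConfig is hypothetical, and a placeholder after a count of 0 is ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser; field meanings are inferred from the example entries.
final class OperationConfig {
    static final class Entry {
        final int id;
        final String name;
        final List<String> items;

        Entry(int id, String name, List<String> items) {
            this.id = id;
            this.name = name;
            this.items = items;
        }
    }

    /** Parses one entry such as "2: single register, 2, register, operation;". */
    static Entry parseLine(String line) {
        // Drop trailing ASCII or full-width terminators (; . and their CJK forms).
        String cleaned = line.trim().replaceAll("[;\\uFF1B.\\u3002]+$", "");
        int colon = cleaned.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("missing ':' in: " + line);
        }
        int id = Integer.parseInt(cleaned.substring(0, colon).trim());
        String[] fields = cleaned.substring(colon + 1).split(",");
        if (fields.length < 2) {
            throw new IllegalArgumentException("expected name and count in: " + line);
        }
        String name = fields[0].trim();
        int count = Integer.parseInt(fields[1].trim());
        if (fields.length < 2 + count) {
            throw new IllegalArgumentException("fewer items than declared in: " + line);
        }
        List<String> items = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            items.add(fields[2 + i].trim());
        }
        return new Entry(id, name, items);
    }

    public static void main(String[] args) {
        Entry entry = parseLine("2: single register, 2, register, operation;");
        System.out.println(entry.id + " / " + entry.name + " / " + entry.items);
    }
}
```

Declaring the item count explicitly lets the parser reject truncated entries early, which is one plausible reason for the count field in the format.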

Suppose that a department manager in a company needs every member of the team to give a single-register confirmation of the team's bills. Because the enterprise application is structurally complex and involves many external resources, employees cannot obtain an interface to this data and can only copy the table data out by hand and publish it; the register confirmation cannot be automated either. This wastes a great deal of manpower and time and cannot guarantee that the data is accurate and usable. With the scheme of the present invention, the department manager first configures the operation file manipulate.conf and enters the URL of the page containing the team's bills, so that the page is captured. The captured page is serialized as XML and, if the page is noisy, denoised. The manager enters the complete data of one person's bill, from which an extraction rule is generated and added to the rule base for maintenance. The page is then processed according to the extraction rule to extract the required data, and the extracted data is stored in the database. Finally, the extracted data and the operations configured in manipulate.conf are presented on a web page. On this page, members of the manager's department can operate on the data, and the manager can see whether each colleague in the department has registered.

The above method specifically comprises the following steps:

S10. The department manager adds the following statement to the operation file manipulate.conf in the defined format: single register, 2, register, operation.

S11. The department manager logs in to the service web page and enters the URL of the web page to be captured there.

S12. If the captured page is larger than 4 MB, then, because billing data is what is to be extracted, information such as audio and pictures in the page can all be removed to speed up table locating and recognition.

S13. The department manager enters one row of bill data in the data format of the original page.

S14. According to the billing data entered by the department manager, the Java system uses the open-source Java toolkit Jsoup to parse the page and locate the target data, and finally stores the resulting extraction rule (for example, which table and which columns the data is located in) in the rule base, so that information can later be extracted directly from similar pages of the same website, which saves time.

S15. The Java system stores the data extracted on the basis of the billing data entered by the department manager in the MySQL database, to make later secondary processing of the data easier.

S16. According to the configuration of the operation file manipulate.conf and the content stored in the MySQL database, the Java system uses JSP to present the extracted data in the browser. On this page, the employees of the department manager's department can perform the register confirmation, and the department manager can see whether a given employee has done so.

The present invention is not limited to the preferred embodiment described above. Any structural variation made by anyone in light of the present invention that has a technical solution identical or similar to that of the present invention falls within the scope of protection of the present invention.

Claims

1. A web page table information extraction method, characterized in that it comprises the following steps:

The user configures an operation file in advance;

The Java system preprocesses the captured page;

The Java system parses the page and locates the target data according to information entered manually by the user and, at the same time, generates an extraction rule and stores it in a rule base for maintenance;

The Java system extracts the required data from the page according to the extraction rule and stores the extracted data in a MySQL database;

The Java system, according to the configuration of the operation file, uses JSP to present the extracted data to the user as a dynamic page on which the user can operate.

2. The method of claim 1, characterized in that the user adds the operations to be performed to the operation file in a predefined format, including confirmation, modification, and deletion operations on the web page table data.

3. The method of claim 2, characterized in that the predefined format is consistent with the format used by the Java system when it parses the operation file.

4. The method of claim 1, characterized in that the preprocessing of the captured page by the Java system includes removing irrelevant pictures, videos, music, large blocks of text, and navigation bars from the page, and formatting the HTML file.

5. The method of claim 4, characterized in that, when the page has a large amount of content, the page is denoised.

6. The method of claim 1, characterized in that the operation file is manipulate.conf.

7. The method of claim 1, characterized in that the extraction rule includes information on which table of the page, and which columns of that table, the required data is located in.

8. The method of claim 1, characterized in that the Java system uses the open-source Java toolkit Jsoup to parse the page and locate the target data.