CN106202348A - Web page table information extraction method - Google Patents


Info

Publication number: CN106202348A
Authority: CN (China)
Prior art keywords: webpage, user, data, web page, java system
Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Application number: CN201610524342.2A
Other languages: Chinese (zh)
Inventors: 胡生辉, 龙冬阳, 衣杨, 袁野, 杨洋
Current Assignee: Sun Yat Sen University (the listed assignees may be inaccurate)
Original Assignee: Sun Yat Sen University
Priority date: 2016-07-04 (the priority date is an assumption and is not a legal conclusion)
Filing date: 2016-07-04
Publication date: 2016-12-07
Application filed by Sun Yat Sen University
Priority to CN201610524342.2A
Publication of CN106202348A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/986 Document structures and storage, e.g. HTML extensions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a web page table information extraction method, comprising: the user configures an operation file in advance; the user enters the URL of the web page to be fetched, and a Java system fetches the page; the Java system preprocesses the fetched page; the Java system parses the page and locates the target table according to information entered manually by the user, and at the same time generates an extraction rule and stores it in a rule base for maintenance; the Java system extracts the required data from the page according to the extraction rule and stores the extracted data in a MySQL database; and the Java system, according to the configuration in the operation file, uses JSP to present the extracted data to the user as a dynamic page on which the user can operate. The invention reduces the system resources consumed during web page table information extraction, speeds up the extraction, makes it easier for the user to carry out secondary processing of the extracted table data, and improves the efficiency of that secondary processing.

Description

Web page table information extraction method
Technical field
The present invention relates to the field of web page information extraction, and in particular to a web page table information extraction method.
Background art
With the rapid progress of web technology, web pages can hold large amounts of data. However, pages are also full of information that many users do not care about, such as advertising images and promotional links, and this information is often mixed in with the main content of the page, making it hard for users to obtain the important information quickly. In addition, a user who wants to reuse the target information for another purpose must first collect it by hand, strip out HTML tags and other noise, rearrange it, and only then present it as desired. This is not only error-prone but also time-consuming, laborious and inefficient.
Because tables (Table) express the relationships between pieces of information concisely and effectively, they are widely used in web pages in every field. Most Table tags are used to present relational data to users, for example in railway timetables, online shopping, online banking and management information systems. Yet web information extraction techniques are currently applied to tables only to a limited extent. Most of them first process the page into a DOM (Document Object Model) tree and then extract the data. This approach loads the entire page into memory, so memory consumption becomes very high when many pages are processed. Moreover, when processing table data, most applications stop at extracting the data: they neither extract the common operations that may be performed on the table nor process the data further, which often reduces the efficiency of secondary processing.
In view of this, a new web page table information extraction method is urgently needed to solve the problems that existing web page table extraction techniques consume substantial system resources and make secondary processing of web page data inefficient.
Summary of the invention
The technical problem to be solved by the present invention is that existing web page table information extraction techniques consume substantial system resources and make secondary processing of web page data inefficient.
To solve this technical problem, the present invention provides a web page table information extraction method comprising the following steps:
The user configures an operation file in advance;
The user enters the URL of the web page to be fetched, and a Java system fetches the page;
The Java system preprocesses the fetched page;
The Java system parses the page and locates the target table according to information entered manually by the user, and at the same time generates an extraction rule and stores it in a rule base for maintenance;
The Java system extracts the required data from the page according to the extraction rule and stores the extracted data in a MySQL database;
The Java system, according to the configuration in the operation file, uses JSP to present the extracted data to the user as a dynamic page on which the user can operate.
In the above technical solution, the user adds the operations to be performed to the operation file in a predefined format, including confirmation, modification and deletion of the web page table data.
In the above technical solution, the predefined format is the same as the format the Java system uses when parsing the operation file.
In the above technical solution, the preprocessing of the fetched page by the Java system includes removing irrelevant pictures, video, music, large blocks of text and navigation bars from the page, and formatting the HTML file.
In the above technical solution, when the page has a lot of content, the page is denoised.
In the above technical solution, the operation file is manipulate.conf.
In the above technical solution, the extraction rule includes information on which table of the page and which columns the required data is located in.
In the above technical solution, the Java system uses the open-source Java toolkit Jsoup to parse the page and locate the target table.
The present invention reduces the system resources consumed during web page table information extraction, speeds up the extraction, makes it easier for the user to carry out secondary processing of the extracted table data, and improves the efficiency of that secondary processing.
Brief description of the drawings
Fig. 1 is a flowchart of a web page table information extraction method provided by an embodiment of the present invention;
Fig. 2 is a data flow diagram of the reprocessing of extracted web page table data provided by an embodiment of the present invention.
Detailed description of the embodiments
The present invention covers configuration of the operation file manipulate.conf, web page fetching, web page preprocessing, Jsoup parsing, table location based on a sample row, maintenance of the rule base, data extraction, data normalization and persistence, and the display of data and operations. The target table on the page is located from an example entered by the user, the data in the target table is stored in a database, and the data is then displayed as the user wishes so that it can be reused. This reduces the system resources consumed during web page table information extraction, speeds up the extraction, makes it easier for the user to carry out secondary processing of the extracted table data, and improves the efficiency of that secondary processing.
The present invention is described in detail below with reference to the drawings and specific embodiments.
An embodiment of the present invention provides a web page table information extraction method which, as shown in Fig. 1, comprises the following steps:
S1. The user configures the operation file manipulate.conf in advance.
According to the operations the user ultimately wants to perform on the extracted web page table data, the user configures manipulate.conf in advance, adding the desired operations in a predefined format, including confirmation, modification and deletion of the table data. The predefined format only needs to be the same as the format the Java system uses when it parses manipulate.conf.
S2. The user enters the URL (Uniform Resource Locator) of the web page to be fetched, and the purpose-written Java system fetches the page.
To improve the stability of the Java system, the URL entered by the user must conform to the URL standard; if it does not, a prompt window pops up on submission.
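The check in S2 can be sketched with the standard `java.net.URI` parser. This is only an illustration: the class and method names are hypothetical, and limiting accepted input to absolute http or https URLs with a host is an assumption, since the text says only that the address must conform to the URL standard.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class UrlInputCheck {
    /**
     * Returns true when the text parses as an absolute http or https URL with a host.
     * Restricting the scheme to http/https is an assumption of this sketch.
     */
    static boolean isAcceptableUrl(String input) {
        if (input == null || input.trim().isEmpty()) {
            return false;
        }
        try {
            URI uri = new URI(input.trim());
            String scheme = uri.getScheme();
            boolean web = "http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme);
            return web && uri.getHost() != null;
        } catch (URISyntaxException e) {
            // Malformed input: the caller would show the prompt window here.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isAcceptableUrl("http://example.com/tickets.html")); // true
        System.out.println(isAcceptableUrl("not a url"));                       // false
    }
}
```

Rejecting bad input before any network request is made keeps the fetching step from failing on addresses it could never resolve.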
S3. The Java system preprocesses the fetched web page.
To optimize subsequent extraction, the invention preprocesses the fetched page. This includes removing irrelevant pictures, video, music, large blocks of text and navigation bars from the page, and formatting the HTML file (that is, serializing the HTML page as XML). In addition, if the page has a lot of content, it can be denoised.
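A minimal sketch of the removal part of this preprocessing follows. It uses regular expressions from the standard library as a stand-in for a real HTML parser, so it assumes simple markup without nested elements of the same kind; the class name and the exact set of tags (`img`, `video`, `audio`, `nav`) are illustrative choices. XML serialization and removal of large text blocks are not shown.

```java
import java.util.regex.Pattern;

final class PageNoiseFilter {
    // Paired elements dropped together with everything inside them.
    private static final Pattern PAIRED =
        Pattern.compile("(?is)<(video|audio|nav)\\b[^>]*>.*?</\\1\\s*>");
    // Standalone image tags.
    private static final Pattern IMAGES = Pattern.compile("(?i)<img\\b[^>]*>");

    /** Removes navigation bars, video, audio and images; everything else is kept. */
    static String stripNoise(String html) {
        String withoutBlocks = PAIRED.matcher(html).replaceAll("");
        return IMAGES.matcher(withoutBlocks).replaceAll("");
    }

    public static void main(String[] args) {
        String page = "<body><nav><a href='/'>Home</a></nav><img src='ad.png'>"
            + "<table><tr><td>G101</td></tr></table></body>";
        // Prints: <body><table><tr><td>G101</td></tr></table></body>
        System.out.println(stripNoise(page));
    }
}
```

Stripping these elements first means the later table search has less markup to scan.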
S4. According to information entered manually by the user, the Java system uses the open-source Java toolkit Jsoup to parse the page and locate the target table, and at the same time generates an extraction rule and stores it in the rule base for maintenance.
The user manually enters the content of one row of interest, for example one row of ticket data from a train ticketing site, including the train number, price, origin, destination and number of remaining tickets, so that the Java system can quickly locate the data the user needs.
The extraction rule includes information such as which table of the page and which columns the required data is in. Storing it in the rule base makes it possible later to extract information directly from similar pages of the same site, saving time.
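The idea of locating a table from a sample row and recording "which table, which columns" as a rule can be sketched as follows. This is not the patented implementation: it replaces Jsoup with a regular-expression scan so that only the standard library is needed, it assumes well-formed, non-nested tables, and the `Rule` class, method names and sample data are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SampleRowLocator {
    /** Hypothetical rule shape: which table on the page, and which columns in it. */
    static final class Rule {
        final int tableIndex;
        final List<Integer> columns;
        Rule(int tableIndex, List<Integer> columns) {
            this.tableIndex = tableIndex;
            this.columns = columns;
        }
    }

    private static final Pattern TABLE = Pattern.compile("(?is)<table\\b[^>]*>(.*?)</table\\s*>");
    private static final Pattern ROW = Pattern.compile("(?is)<tr\\b[^>]*>(.*?)</tr\\s*>");
    private static final Pattern CELL = Pattern.compile("(?is)<t[dh]\\b[^>]*>(.*?)</t[dh]\\s*>");

    /** Parses every table into rows of trimmed cell text (inner tags removed). */
    static List<List<List<String>>> parseTables(String html) {
        List<List<List<String>>> tables = new ArrayList<>();
        Matcher t = TABLE.matcher(html);
        while (t.find()) {
            List<List<String>> rows = new ArrayList<>();
            Matcher r = ROW.matcher(t.group(1));
            while (r.find()) {
                List<String> cells = new ArrayList<>();
                Matcher c = CELL.matcher(r.group(1));
                while (c.find()) {
                    cells.add(c.group(1).replaceAll("<[^>]+>", "").trim());
                }
                rows.add(cells);
            }
            tables.add(rows);
        }
        return tables;
    }

    /** Returns a rule for the first table with a row holding every sample value, or null. */
    static Rule locate(List<List<List<String>>> tables, List<String> sample) {
        if (sample.isEmpty()) {
            return null;
        }
        for (int ti = 0; ti < tables.size(); ti++) {
            for (List<String> row : tables.get(ti)) {
                List<Integer> cols = new ArrayList<>();
                for (String value : sample) {
                    int idx = row.indexOf(value);
                    if (idx < 0) {
                        break;
                    }
                    cols.add(idx);
                }
                if (cols.size() == sample.size()) {
                    return new Rule(ti, cols);
                }
            }
        }
        return null;
    }

    /** Applies a stored rule: keeps the rule's columns from each row of its table. */
    static List<List<String>> extract(List<List<List<String>>> tables, Rule rule) {
        List<List<String>> out = new ArrayList<>();
        for (List<String> row : tables.get(rule.tableIndex)) {
            List<String> picked = new ArrayList<>();
            for (int col : rule.columns) {
                if (col < row.size()) {
                    picked.add(row.get(col));
                }
            }
            if (picked.size() == rule.columns.size()) {
                out.add(picked);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<table><tr><td>Menu</td></tr></table>"
            + "<table><tr><th>Train</th><th>From</th><th>Price</th></tr>"
            + "<tr><td>G101</td><td>Beijing</td><td>553.0</td></tr></table>";
        List<List<List<String>>> tables = parseTables(html);
        Rule rule = locate(tables, Arrays.asList("G101", "553.0"));
        System.out.println("table " + rule.tableIndex + ", columns " + rule.columns);
        System.out.println(extract(tables, rule));
    }
}
```

Because the rule holds only a table index and column indexes, it can be stored and reapplied to a page with the same layout without asking the user for a sample row again.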
S5. The Java system extracts the required data from the page according to the extraction rule and stores the extracted data in a MySQL database, so that the data can be processed further later.
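Storing the extracted rows could look roughly like the JDBC sketch below. The table name, the all-text columns and the class itself are assumptions made for illustration; the source does not describe a schema. Only the SQL-building helper can be run without a database, since `saveAll` needs an open connection supplied by the caller.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Collections;
import java.util.List;

final class ExtractedRowStore {
    /** Builds a parameterized INSERT for a table with the given number of columns. */
    static String insertSql(String table, int columnCount) {
        // The table name cannot be a bind parameter, so it is validated instead.
        if (!table.matches("[A-Za-z_][A-Za-z0-9_]*")) {
            throw new IllegalArgumentException("invalid table name: " + table);
        }
        if (columnCount < 1) {
            throw new IllegalArgumentException("need at least one column");
        }
        String marks = String.join(", ", Collections.nCopies(columnCount, "?"));
        return "INSERT INTO " + table + " VALUES (" + marks + ")";
    }

    /** Inserts all rows in one batch; every row is assumed to have the same width. */
    static void saveAll(Connection conn, String table, List<List<String>> rows) throws SQLException {
        if (rows.isEmpty()) {
            return;
        }
        String sql = insertSql(table, rows.get(0).size());
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (List<String> row : rows) {
                for (int i = 0; i < row.size(); i++) {
                    ps.setString(i + 1, row.get(i)); // JDBC parameters are 1-based
                }
                ps.addBatch();
            }
            ps.executeBatch();
        }
    }

    public static void main(String[] args) {
        // Prints: INSERT INTO ticket_rows VALUES (?, ?, ?)
        System.out.println(insertSql("ticket_rows", 3));
    }
}
```

Bind parameters keep cell text taken from an arbitrary web page from being interpreted as SQL.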
Fig. 2 shows the data flow for reprocessing the extracted web page table data, as provided by an embodiment of the present invention.
S6. According to the configuration in the operation file manipulate.conf, the Java system uses JSP (Java Server Pages) to present the extracted data to the user as a dynamic page on which the user can operate.
Embodiment 1.
First, the configuration format of the operation file manipulate.conf is defined as follows:
1: dump browses, 0, nothing;
2: single register, 2, register, operation;
3: double register, 4, employee operation, employee register, supervisor operation, supervisor register.
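One way to read the entries listed above is "number: mode name, label count, labels", with `nothing` standing for an empty label list. That reading is an inference from the three examples, not something the text states, and the class below is a hypothetical sketch of a parser for it.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class OperationEntry {
    final int id;
    final String name;
    final List<String> labels;

    private OperationEntry(int id, String name, List<String> labels) {
        this.id = id;
        this.name = name;
        this.labels = labels;
    }

    /** Parses one entry such as "2: single register, 2, register, operation;". */
    static OperationEntry parse(String line) {
        // Drop the trailing ';' or '.' that ends each entry.
        String text = line.trim().replaceAll("[;.]$", "");
        int colon = text.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("missing entry number: " + line);
        }
        int id = Integer.parseInt(text.substring(0, colon).trim());
        String[] fields = text.substring(colon + 1).split(",");
        if (fields.length < 2) {
            throw new IllegalArgumentException("expected a name and a count: " + line);
        }
        String name = fields[0].trim();
        int count = Integer.parseInt(fields[1].trim());
        List<String> labels = new ArrayList<>();
        for (int i = 2; i < fields.length; i++) {
            String label = fields[i].trim();
            // "nothing" is treated as the marker for an empty label list (assumption).
            if (!label.isEmpty() && !label.equals("nothing")) {
                labels.add(label);
            }
        }
        if (labels.size() != count) {
            throw new IllegalArgumentException("declared " + count + " labels, found " + labels.size());
        }
        return new OperationEntry(id, name, Collections.unmodifiableList(labels));
    }

    public static void main(String[] args) {
        OperationEntry entry = parse("2: single register, 2, register, operation;");
        // Prints: 2 / single register / [register, operation]
        System.out.println(entry.id + " / " + entry.name + " / " + entry.labels);
    }
}
```

Checking the declared count against the labels actually present catches a mistyped entry when the file is loaded rather than when the page is rendered.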
Suppose a line manager at a company needs every member of the team to give a single-register confirmation of the team's billing records. Because enterprise applications are structurally complex and involve many external resources, employees cannot obtain an interface to this data and can only copy the table data by hand and publish it; the register confirmation cannot be automated either. This wastes a great deal of labor and time and cannot guarantee that the data is accurate and usable. Under the scheme of the present invention, the line manager first configures the operation file manipulate.conf and then enters the URL of the web page containing the team's billing records, and the page is fetched. The fetched page is serialized as XML and, if it contains a lot of noise, denoised. The line manager enters the complete data of one person's bill, from which an extraction rule is generated and added to the rule base for maintenance. The page is processed according to the extraction rule, the required data is extracted, and the extracted data is stored in the database. Finally, the extracted data and the operations configured in manipulate.conf are presented to the line manager. On this page, members of the line manager's department can operate on the data, and the line manager can see whether each colleague in the department has registered.
The method specifically comprises the following steps:
S10. The line manager adds the following statement to the operation file manipulate.conf in the defined format: single register, 2, register, operation.
S11. The line manager logs in to the service web page and enters the URL of the web page to be fetched.
S12. If the fetched page is larger than 4 MB, then, because billing data is what is to be extracted, audio, pictures and similar content can all be removed from the page to speed up table location and identification.
S13. The line manager enters one row of billing data in the data format of the original page.
S14. Based on the billing data entered by the line manager, the Java system uses the open-source Java toolkit Jsoup to parse the page and locate the table, and stores the resulting extraction rule (for example, which table and which columns the data is in) in the rule base, so that information can later be extracted directly from similar pages of the same website, saving time.
S15. The Java system stores the data extracted on the basis of the billing data entered by the line manager in the MySQL database, so that the data can be processed further later.
S16. According to the configuration in the operation file manipulate.conf and the content stored in the MySQL database, the Java system uses JSP to present the extracted data to the line manager in the browser. On this page, employees of the line manager's department can confirm by registering, and the line manager can see whether a given employee has registered.
The present invention is not limited to the preferred embodiments described above. Any structural variation made by anyone in light of the present invention that has a technical solution identical or similar to that of the present invention falls within the scope of protection of the present invention.

Claims (8)

1. A web page table information extraction method, characterized by comprising the following steps:
The user configures an operation file in advance;
The user enters the URL of the web page to be fetched, and a Java system fetches the page;
The Java system preprocesses the fetched page;
The Java system parses the page and locates the target table according to information entered manually by the user, and at the same time generates an extraction rule and stores it in a rule base for maintenance;
The Java system extracts the required data from the page according to the extraction rule and stores the extracted data in a MySQL database;
The Java system, according to the configuration in the operation file, uses JSP to present the extracted data to the user as a dynamic page on which the user can operate.
2. The method of claim 1, characterized in that the user adds the operations to be performed to the operation file in a predefined format, including confirmation, modification and deletion of the web page table data.
3. The method of claim 2, characterized in that the predefined format is the same as the format the Java system uses when parsing the operation file.
4. The method of claim 1, characterized in that the preprocessing of the fetched page by the Java system includes removing irrelevant pictures, video, music, large blocks of text and navigation bars from the page, and formatting the HTML file.
5. The method of claim 4, characterized in that, when the page has a lot of content, the page is denoised.
6. The method of claim 1, characterized in that the operation file is manipulate.conf.
7. The method of claim 1, characterized in that the extraction rule includes information on which table of the page and which columns the required data is located in.
8. The method of claim 1, characterized in that the Java system uses the open-source Java toolkit Jsoup to parse the page and locate the target table.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610524342.2A CN106202348A (en) 2016-07-04 2016-07-04 A kind of web page form information extraction method

Publications (1)

Publication Number Publication Date
CN106202348A true CN106202348A (en) 2016-12-07

Family

ID=57466185

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610524342.2A Pending CN106202348A (en) 2016-07-04 2016-07-04 A kind of web page form information extraction method

Country Status (1)

Country Link
CN (1) CN106202348A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111291298A (en) * 2018-12-10 2020-06-16 航天信息股份有限公司 Page display method and device, electronic equipment and storage medium

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101576891A (en) * 2008-05-05 2009-11-11 北京瑞佳晨科技有限公司 Method for analyzing web page form object nodes
CN101819584A (en) * 2010-03-18 2010-09-01 上海引跑信息科技有限公司 Light weight intelligent webpage content analysis method
US20110276561A1 (en) * 2003-07-03 2011-11-10 Daniel Dulitz Representative Document Selection for Sets of Duplicate Documents in a Web Crawler System
CN102254009A (en) * 2011-07-15 2011-11-23 福建星网锐捷通讯股份有限公司 Method for extracting data of webpage table
CN102360368A (en) * 2011-10-09 2012-02-22 山东大学 Web data extraction method based on visual customization of extraction template
CN102982161A (en) * 2012-12-05 2013-03-20 北京奇虎科技有限公司 Method and device for acquiring webpage information
CN103853845A (en) * 2014-03-24 2014-06-11 南通大学 Dynamic analytic method of complex form
CN103870441A (en) * 2012-12-14 2014-06-18 苏州精易会信息技术有限公司 Method for converting webpage table data into Excel
CN103902684A (en) * 2014-03-25 2014-07-02 浪潮电子信息产业股份有限公司 Method for structuralizing content acquired by crawler
CN105718584A (en) * 2016-01-26 2016-06-29 中国人民解放军国防科学技术大学 Web page content extracting method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
彭文滔,叶飞跃: ""信息抽取中基于DOM树的过滤器方法的研究"", 《微计算机信息》 *

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20161207