CN106202348A - Web page table information extraction method - Google Patents
- Publication number
- CN106202348A (application CN201610524342.2A)
- Authority
- CN
- China
- Prior art keywords
- webpage
- user
- data
- web page
- java system
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
- G06F16/986—Document structures and storage, e.g. HTML extensions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/9566—URL specific, e.g. using aliases, detecting broken or misspelled links
Abstract
The invention discloses a method for extracting information from web page tables, comprising: the user configures an operation file in advance; the user enters the URL of the web page to be fetched, and a Java system fetches the page; the Java system preprocesses the fetched page; the Java system parses the page and locates the target table according to information entered manually by the user, and at the same time generates an extraction rule and stores it in a rule base for maintenance; the Java system extracts the required data from the page according to the extraction rule and stores the extracted data in a MySQL database; and, according to the configuration in the operation file, the Java system uses JSP to present the extracted data to the user as a dynamic page on which the user can operate. The invention reduces the system resources consumed during web page table information extraction, speeds up the extraction, makes it convenient for the user to perform secondary processing on the extracted table data, and improves the efficiency of that secondary processing.
Description
Technical field
The present invention relates to the field of web page information extraction, and in particular to a method for extracting information from web page tables.
Background
With the rapid progress of web technology, a web page can hold a large amount of data. However, pages are also filled with information that many users do not care about, such as advertising images and promotional links, and this information is sometimes mixed in with the main content of the page, which makes it hard for users to obtain the important information quickly. Moreover, a user who wants to reuse target information for another purpose must first copy the information out by hand, strip the HTML tags and other noise, rearrange it, and only then present it as desired. This approach is inaccurate, time-consuming, labor-intensive, and inefficient.
Because tables (the Table element) express the relationships between pieces of information concisely and effectively, they are widely used in web pages in every field. Most Table tags are used to show relational data to users, for example in railway timetables, online shopping, online banking, and management information systems. Yet web information extraction techniques are currently applied to tables only to a limited extent. Most of them first process the page through a DOM (Document Object Model) tree and then extract the data. With this approach the entire content of the page is first loaded into memory, so processing many pages consumes a great deal of memory. In addition, most applications only extract the data from the table; they do not extract the common operations that may be performed on the table data or process the data further, which often reduces the efficiency of secondary processing.
In view of this, a new method of web page table information extraction is urgently needed to solve the problems that existing techniques consume considerable system resources and that secondary processing of web data is inefficient.
Summary of the invention
The technical problem to be solved by the present invention is that existing web page table information extraction techniques consume considerable system resources and that secondary processing of web data is inefficient.
To solve this technical problem, the present invention provides a web page table information extraction method comprising the following steps:
The user configures an operation file in advance;
The user enters the URL of the web page to be fetched, and a Java system fetches the page;
The Java system preprocesses the fetched page;
The Java system parses the page and locates the target table according to information entered manually by the user, and at the same time generates an extraction rule and stores it in a rule base for maintenance;
The Java system extracts the required data from the page according to the extraction rule and stores the extracted data in a MySQL database;
According to the configuration in the operation file, the Java system uses JSP to present the extracted data to the user as a dynamic page on which the user can operate.
In the above technical solution, the user adds the operations to be performed to the operation file in a predefined format; these include confirming, modifying, and deleting web page table data.
In the above technical solution, the predefined format is the same format that the Java system uses when parsing the operation file.
In the above technical solution, preprocessing of the fetched page by the Java system includes removing irrelevant pictures, video, music, large blocks of text, and navigation bars from the page and formatting the HTML file.
In the above technical solution, when the page has a large amount of content, the page is denoised.
In the above technical solution, the operation file is manipulate.conf.
In the above technical solution, the extraction rule includes information on which table of the page and which columns contain the required data.
In the above technical solution, the Java system uses the open-source Java toolkit Jsoup to parse the page and locate the table.
The present invention reduces the system resources consumed during web page table information extraction, speeds up the extraction, makes it convenient for the user to perform secondary processing on the extracted table data, and improves the efficiency of that secondary processing.
Brief description of the drawings
Fig. 1 is a flow chart of a web page table information extraction method provided by an embodiment of the present invention;
Fig. 2 is a data flow chart of the secondary processing of extracted web page table data provided by an embodiment of the present invention.
Detailed description of the embodiments
The present invention covers configuration of the operation file manipulate.conf, web page fetching, web page preprocessing, Jsoup parsing, table locating based on a sample table row, rule base maintenance, data extraction, data normalization and persistence, and display of the data and its operations. The target table of the page is located from an example entered by the user, the data of the target table is stored in a database, and the data is then displayed as the user wishes so that the target data can be reused. This reduces the system resources consumed during web page table information extraction, speeds up the extraction, makes it convenient for the user to perform secondary processing on the extracted table data, and improves the efficiency of that secondary processing.
The present invention is described in detail below with reference to the accompanying drawings and specific embodiments.
An embodiment of the present invention provides a web page table information extraction method which, as shown in Fig. 1, comprises the following steps:
S1. The user configures the operation file manipulate.conf in advance.
According to the operations that the user ultimately wants to perform on the extracted web page table data, the user configures the operation file manipulate.conf in advance, adding the operations to be performed to manipulate.conf in a predefined format. These include confirming, modifying, and deleting web page table data. The predefined format only needs to match the format that the Java system uses when it parses manipulate.conf.
S2. The user enters the URL (Uniform Resource Locator) of the web page to be fetched, and the Java system written for this purpose fetches the page.
To improve the stability of the Java system, the URL entered by the user must conform to the URL standard; if the entered URL does not conform, a prompt window pops up on submission.
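The URL check in S2 can be illustrated with a minimal sketch that uses only the Java standard library. The class name `UrlCheck` and the method `isFetchable` are hypothetical, and restricting input to absolute http or https addresses with a host is an assumption; the patent only says that the input must conform to the URL standard.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical helper: checks a user-supplied address before the page is fetched. */
final class UrlCheck {
    /** Returns true when the input parses as an absolute http(s) URI that has a host. */
    static boolean isFetchable(String input) {
        if (input == null) {
            return false;
        }
        try {
            URI uri = new URI(input.trim());
            String scheme = uri.getScheme();
            // Assumption: only web addresses are accepted by the fetcher.
            boolean web = "http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme);
            return web && uri.getHost() != null;
        } catch (URISyntaxException e) {
            // Malformed input: the caller would show the prompt window here.
            return false;
        }
    }
}
```

A caller would run this check when the form is submitted and show the prompt window whenever it returns false.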
S3. The Java system preprocesses the fetched page.
To optimize the subsequent extraction, the present invention preprocesses the fetched page. This includes removing irrelevant pictures, video, music, large blocks of text, and navigation bars from the page, and formatting the HTML file (that is, serializing the HTML page as XML). In addition, if the page has a large amount of content, it can be denoised.
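The removal step can be sketched as follows. The patent does not say how the elements are removed, so the regular expressions below are a deliberately crude stand-in chosen to keep the example dependency-free; a real HTML parser would be more robust against nested or malformed markup. The class name `PagePreprocessor` and the choice of tags (`img`, `video`, `audio`, `nav`) are assumptions.

```java
import java.util.regex.Pattern;

/** Hypothetical preprocessor: drops media and navigation markup before table locating. */
final class PagePreprocessor {
    // Paired elements are removed together with everything between the tags.
    private static final Pattern PAIRED =
            Pattern.compile("(?is)<(video|audio|nav)\\b[^>]*>.*?</\\1\\s*>");
    // Image tags have no closing tag and are removed on their own.
    private static final Pattern IMAGES = Pattern.compile("(?i)<img\\b[^>]*>");

    /** Returns the HTML with the assumed noise elements stripped out. */
    static String strip(String html) {
        String withoutPaired = PAIRED.matcher(html).replaceAll("");
        return IMAGES.matcher(withoutPaired).replaceAll("");
    }
}
```

Applied to a page fragment that holds a navigation bar, an advertising image, a data table, and a video, `strip` leaves only the table markup.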
S4. According to the information entered manually by the user, the Java system uses the open-source Java toolkit Jsoup to parse the page and locate the target table, and at the same time generates an extraction rule and stores it in the rule base for maintenance.
The user manually enters the content of one row of interest, for example one row of ticket data from a train ticketing site, including the train number, price, origin, destination, and number of remaining tickets, so that the Java system can quickly locate the data the user needs.
The extraction rule includes information such as which table of the page and which columns contain the required data. Storing it in the rule base allows information to be extracted directly from similar pages of the same site later, which saves time.
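Locating a table from a sample row and deriving a rule from the match can be sketched as below. This is an illustration under stated assumptions, not the patented implementation: the page is assumed to be already parsed into tables, rows, and cell texts (the parsing itself is left out), a rule is assumed to hold a table index plus one column index per sample value, and the names `ExtractionRule` and `SampleRowLocator` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical rule: index of the target table plus the column index of each sample value. */
final class ExtractionRule {
    final int tableIndex;
    final int[] columns;

    ExtractionRule(int tableIndex, int[] columns) {
        this.tableIndex = tableIndex;
        this.columns = columns;
    }

    /** Applies the rule: returns the selected columns of every row in the target table. */
    List<List<String>> apply(List<List<List<String>>> tables) {
        List<List<String>> out = new ArrayList<>();
        for (List<String> row : tables.get(tableIndex)) {
            List<String> picked = new ArrayList<>();
            for (int c : columns) {
                picked.add(c < row.size() ? row.get(c).trim() : "");
            }
            out.add(picked);
        }
        return out;
    }
}

/** Hypothetical locator: finds the table containing the row the user typed in. */
final class SampleRowLocator {
    /** Returns a rule for the first row that contains every sample value, or null if none does. */
    static ExtractionRule locate(List<List<List<String>>> tables, List<String> sample) {
        if (sample.isEmpty()) {
            return null;
        }
        for (int t = 0; t < tables.size(); t++) {
            for (List<String> row : tables.get(t)) {
                int[] cols = match(row, sample);
                if (cols != null) {
                    return new ExtractionRule(t, cols);
                }
            }
        }
        return null;
    }

    // Maps each sample value to the first cell with the same trimmed text.
    private static int[] match(List<String> row, List<String> sample) {
        int[] cols = new int[sample.size()];
        for (int i = 0; i < sample.size(); i++) {
            int found = -1;
            for (int c = 0; c < row.size(); c++) {
                if (row.get(c).trim().equals(sample.get(i).trim())) {
                    found = c;
                    break;
                }
            }
            if (found < 0) {
                return null;
            }
            cols[i] = found;
        }
        return cols;
    }
}
```

Because the rule stores only indices, it can be kept in a rule base and reapplied to a similar page without asking the user for another sample row. Repeated values in a sample row would all map to the first matching column, which a fuller implementation would need to handle.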
S5. The Java system extracts the required data from the page according to the extraction rule and stores the extracted data in a MySQL database, which makes later secondary processing of the data convenient.
Fig. 2 shows the data flow chart of the secondary processing of the extracted web page table data provided by an embodiment of the present invention.
S6. According to the configuration in the operation file manipulate.conf, the Java system uses JSP (Java Server Pages) to present the extracted data to the user as a dynamic page on which the user can operate.
Embodiment 1.
First, the configuration format of the operation file manipulate.conf is defined as follows:
1: dump browse, 0, none;
2: single register, 2, register, operation;
3: double register, 4, employee operation, employee register, supervisor operation, supervisor register.
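A parser for entries in this format might look like the following sketch. It rests on an interpretation that the patent does not state outright: the first field is read as the mode name, the second as the number of operation labels, and the remaining fields as the labels themselves, with a count of 0 meaning that the trailing field is only a placeholder. The class name `OperationEntry` is hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical model of one manipulate.conf entry: a mode name and its operation labels. */
final class OperationEntry {
    final String mode;
    final List<String> labels;

    private OperationEntry(String mode, List<String> labels) {
        this.mode = mode;
        this.labels = Collections.unmodifiableList(labels);
    }

    /** Parses a line such as "2: single register, 2, register, operation;". */
    static OperationEntry parse(String line) {
        // Drop a trailing ';' or '.' and an optional numeric prefix such as "2:".
        String body = line.trim().replaceAll("[;.]$", "").replaceFirst("^\\d+\\s*:\\s*", "");
        String[] parts = body.split("\\s*,\\s*");
        if (parts.length < 2) {
            throw new IllegalArgumentException("expected 'mode, count, labels...': " + line);
        }
        int count = Integer.parseInt(parts[1]);
        List<String> labels = new ArrayList<>();
        for (int i = 2; i < parts.length; i++) {
            labels.add(parts[i]);
        }
        if (count == 0) {
            labels.clear(); // assumption: the field after a zero count is a placeholder
        } else if (labels.size() != count) {
            throw new IllegalArgumentException("count does not match labels: " + line);
        }
        return new OperationEntry(parts[0], labels);
    }
}
```

Rejecting entries whose count disagrees with the number of labels is a design choice of this sketch; it catches a mistyped line when the file is loaded instead of when the page is rendered.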
Suppose a line manager at a company needs to give single-register confirmation to the bills submitted by each member of the team. Because enterprise applications are structurally complex and involve many external resources, employees cannot obtain an interface to this data and can only copy it out of the table by hand and pass it on; the register confirmation cannot be automated either. This wastes a great deal of manpower and time and cannot guarantee that the data is accurate and usable. With the scheme of the present invention, the line manager first configures the operation file manipulate.conf and enters the URL of the page containing the team's bills, so that the page is fetched. The fetched page is serialized as XML and, if it contains a lot of noise, denoised. The line manager then enters the complete data of one person's bill, from which an extraction rule is generated and added to the rule base for maintenance. The page is processed according to the extraction rule, the required data is extracted, and the extracted data is stored in the database. Finally, the extracted data and the operations configured in manipulate.conf are presented to the line manager. On this page, the members of the line manager's department can operate on the data, and the line manager can see whether each colleague in the department has registered.
The method specifically comprises the following steps:
S10. Following the defined format, the line manager adds the following statement to the operation file manipulate.conf:
single register, 2, register, operation.
S11. The line manager logs in to the service web page and enters the URL of the page to be fetched.
S12. If the fetched page is larger than 4 MB, then, because the data to be extracted is billing data, information such as audio and pictures can all be removed from the page to speed up table locating and recognition.
S13. The line manager enters one row of billing data in the data format of the original page.
S14. Based on the billing data entered by the line manager, the Java system uses the open-source Java toolkit Jsoup to parse the page and locate the table, and finally stores the resulting extraction rule (for example, which table and which columns hold the data) in the rule base, so that information can later be extracted directly from similar pages of this website, which saves time.
S15. The Java system stores in the MySQL database the data extracted on the basis of the billing data entered by the line manager, which makes later secondary processing convenient.
S16. Based on the configuration in the operation file manipulate.conf and the content stored in the MySQL database, the Java system uses JSP to present the extracted data to the line manager in a browser. On this page, the employees of the line manager's department can confirm their registration, and the line manager can see whether a given employee has confirmed.
The present invention is not limited to the preferred embodiment described above. Any structural variation made by anyone in light of the present invention that has a technical solution identical or similar to that of the present invention falls within the scope of protection of the present invention.
Claims (8)
1. A web page table information extraction method, characterized by comprising the following steps:
The user configures an operation file in advance;
The user enters the URL of the web page to be fetched, and a Java system fetches the page;
The Java system preprocesses the fetched page;
The Java system parses the page and locates the target table according to information entered manually by the user, and at the same time generates an extraction rule and stores it in a rule base for maintenance;
The Java system extracts the required data from the page according to the extraction rule and stores the extracted data in a MySQL database;
According to the configuration in the operation file, the Java system uses JSP to present the extracted data to the user as a dynamic page on which the user can operate.
2. The method of claim 1, characterized in that the user adds the operations to be performed to the operation file in a predefined format, the operations including confirming, modifying, and deleting web page table data.
3. The method of claim 2, characterized in that the predefined format is the same format that the Java system uses when parsing the operation file.
4. The method of claim 1, characterized in that preprocessing of the fetched page by the Java system includes removing irrelevant pictures, video, music, large blocks of text, and navigation bars from the page and formatting the HTML file.
5. The method of claim 4, characterized in that when the page has a large amount of content, the page is denoised.
6. The method of claim 1, characterized in that the operation file is manipulate.conf.
7. The method of claim 1, characterized in that the extraction rule includes information on which table of the page and which columns contain the required data.
8. The method of claim 1, characterized in that the Java system uses the open-source Java toolkit Jsoup to parse the page and locate the table.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610524342.2A CN106202348A (en) | 2016-07-04 | 2016-07-04 | A kind of web page form information extraction method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106202348A (en) | 2016-12-07 |
Family
ID=57466185
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610524342.2A Pending CN106202348A (en) | 2016-07-04 | 2016-07-04 | A kind of web page form information extraction method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106202348A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 20161207 |