CN104050281A - Webpage information extraction method and device based on http protocol - Google Patents

Webpage information extraction method and device based on http protocol

Info

Publication number
CN104050281A
Authority
CN
China
Prior art keywords
page
information
template
target pages
info
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410299203.5A
Other languages
Chinese (zh)
Inventor
马春新
董磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Si Tech Information Technology Co Ltd
Original Assignee
Beijing Si Tech Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Si Tech Information Technology Co Ltd
Priority to CN201410299203.5A
Publication of CN104050281A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation
    • G06F16/9577 Optimising the visualization of content, e.g. distillation of HTML documents

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention relates to a webpage information extraction method and device based on the HTTP protocol. The method comprises five steps: template generation, webpage address analysis, information extraction, information checking and information storage. In the template generation step, a page analysis template is customized for the target page from which information is to be extracted, and target fields and checking rules are predefined in the template. In the webpage address analysis step, the webpage address of the target page is resolved to obtain the HTML source file of the target page. In the information extraction step, the HTML source file is read and parsed, and the page information matching the target fields predefined in the page analysis template is extracted from it. In the information checking step, the extracted page information is checked against the predefined checking rules. In the information storage step, the page information that has undergone information checking is stored. Using the open HTTP protocol, the method and device effectively filter, acquire and collect page information on the network, and templates are customized for different target pages so that specific information can be extracted.

Description

A webpage information extraction method and device based on the HTTP protocol
Technical field
The present invention relates to information crawling and parsing in network technology, and in particular to a webpage information extraction method and device based on the HTTP protocol.
Background art
The Web 2.0 era is an era of information explosion. Massive amounts of data pervade every aspect of work and life, so the demand for analyzing data and mining its potential value grows more urgent by the day. In practice, however, data owners control their data very strictly, and much valuable data cannot easily be collected and extracted. Against this background, data is increasingly important, yet its availability is low or even restricted. How to collect, extract and use the target data of interest, based on the Internet characteristics of that data, has therefore become a problem that urgently needs solving.
Summary of the invention
The technical problem to be solved by the present invention is to provide an information extraction method and device based on the HTTP protocol, addressing the difficulty of obtaining target information in the prior art.
The technical solution of the present invention to this problem is as follows: a webpage information extraction method based on the HTTP protocol, comprising:
Template generation step: a corresponding page analysis template is customized according to the target page from which information is to be extracted, and target fields and checking rules are predefined in the page analysis template;
Webpage address analysis step: the webpage address of the target page is resolved to obtain the HTML source file of the target page;
Information extraction step: the HTML source file of the target page is read and parsed, and the page information matching the target fields predefined in the page analysis template is extracted from it;
Information checking step: the extracted page information is checked against the predefined checking rules to determine whether it meets the requirements;
Information storage step: the page information that has undergone information checking is stored.
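The five steps above can be read as a small pipeline. The following Python sketch is illustrative only and is not the patent's implementation: the template layout (a dictionary of field name to extraction pattern and checking rule), the function names and the sample fields `title` and `price` are all assumptions, and the HTTP request is passed in as a `fetch` callable so that the flow can be followed without network access.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical template layout: field name -> (extraction pattern, checking rule).
Template = Dict[str, Tuple[str, str]]


def generate_template() -> Template:
    """Template generation: predefine target fields and checking rules."""
    return {
        "title": (r"<h1>(.*?)</h1>", r".+"),
        "price": (r'<span class="price">(.*?)</span>', r"\d+(\.\d+)?"),
    }


def extract(html: str, template: Template) -> Dict[str, str]:
    """Information extraction: pull the text matching each target field."""
    found = {}
    for field, (pattern, _rule) in template.items():
        match = re.search(pattern, html, re.S)
        if match:
            found[field] = match.group(1).strip()
    return found


def check(info: Dict[str, str], template: Template) -> bool:
    """Information checking: every field must be present and satisfy its rule."""
    return all(
        field in info and re.fullmatch(rule, info[field]) is not None
        for field, (_pattern, rule) in template.items()
    )


def run(url: str, fetch: Callable[[str], str], store: List[Dict[str, str]]) -> bool:
    """Run the five steps once; `fetch` stands in for webpage address analysis."""
    template = generate_template()
    html = fetch(url)
    info = extract(html, template)
    if not check(info, template):
        return False
    store.append(info)  # information storage (an in-memory list in this sketch)
    return True
```

In this sketch a page whose extracted values fail any checking rule is dropped as a whole; checking and storing field by field would be an equally plausible reading of the text.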
On the basis of the above technical solution, the present invention can also be improved as follows.
Further, in the information extraction step, the matching page information is extracted in block mode.
Further, the customized page analysis template is an XML file containing predefined node information, target field information and checking rule information.
Further, the predefined checking rules are regular expressions.
Further, SAX is used to read and parse the HTML source file of the target page.
Further, extracting from the HTML source file of the target page the page information that matches the target fields predefined in the page analysis template specifically comprises: reading the customized page analysis template using DOM, traversing the nodes contained in the template, matching the target fields in the HTML source file of the target page, extracting the page information that matches the target fields, and saving it to a temporary table.
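To make the last improvement concrete, here is a minimal Python sketch of reading an XML template with DOM, traversing its nodes, matching each target field in the HTML and staging the hits in a temporary table. The XML layout (`node`, `pattern` and `rule` elements), the table name `temp_extract` and the use of SQLite are assumptions made for illustration; the source only states that the template holds node, target field and checking rule information.

```python
import re
import sqlite3
from xml.dom import Node, minidom

# Hypothetical template layout, not taken from the patent.
TEMPLATE_XML = r"""<template>
  <node name="title">
    <pattern><![CDATA[<h1>(.*?)</h1>]]></pattern>
    <rule>.+</rule>
  </node>
  <node name="date">
    <pattern>Posted: ([0-9-]+)</pattern>
    <rule>\d{4}-\d{2}-\d{2}</rule>
  </node>
</template>"""


def text_of(element) -> str:
    """Concatenate the text and CDATA children of a DOM element."""
    return "".join(
        child.data
        for child in element.childNodes
        if child.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE)
    )


def load_template(xml_text: str):
    """Read the template with DOM and return (field, pattern, rule) tuples."""
    doc = minidom.parseString(xml_text)
    rows = []
    for node in doc.getElementsByTagName("node"):
        pattern = text_of(node.getElementsByTagName("pattern")[0])
        rule = text_of(node.getElementsByTagName("rule")[0])
        rows.append((node.getAttribute("name"), pattern, rule))
    return rows


def extract_to_temp_table(html: str, template, conn: sqlite3.Connection) -> None:
    """Match every template node against the HTML and stage the hits."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS temp_extract (field TEXT, value TEXT, rule TEXT)"
    )
    for field, pattern, rule in template:
        match = re.search(pattern, html, re.S)
        if match:
            conn.execute(
                "INSERT INTO temp_extract VALUES (?, ?, ?)",
                (field, match.group(1).strip(), rule),
            )
    conn.commit()
```

Storing the checking rule next to each staged value is a design choice of this sketch; it lets a later checking pass work from the temporary table alone.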
Corresponding to the above webpage information extraction method, the technical solution of the present invention also includes a webpage information extraction device based on the HTTP protocol, comprising:
a template generation module, configured to customize a corresponding page analysis template according to the target page from which information is to be extracted, and to predefine target fields and checking rules in the page analysis template;
a webpage address analysis module, configured to resolve the webpage address of the target page and obtain the HTML source file of the target page;
an information extraction module, configured to read and parse the HTML source file of the target page and to extract from it the page information matching the target fields predefined in the page analysis template;
an information checking module, configured to check whether the page information extracted by the information extraction module meets the requirements;
an information storage module, configured to store the page information that has undergone information checking.
Further, the information storage module uses a database server.
The beneficial effects of the present invention are as follows. The invention does not depend on the openness of data owners: it can collect and extract data on the Internet over the basic Internet communication protocol (HTTP), which facilitates mining and analyzing the value of that data. Using the open HTTP protocol, the invention effectively filters, acquires and collects the page information that can be accessed on the network, and customizes templates for different target pages in order to extract specific information. Unlike approaches that generate the template while extracting, the invention customizes the template for the target page in advance, which is more targeted and helps improve the efficiency and accuracy of information extraction.
Brief description of the drawings
Fig. 1 is a schematic flow chart of the information extraction method based on the HTTP protocol according to the present invention;
Fig. 2 is a system architecture diagram of the information extraction method described in the embodiment of the present invention;
Fig. 3 is a schematic structural diagram of the information extraction device based on the HTTP protocol according to the present invention.
Detailed description of the embodiments
The principles and features of the present invention are described below with reference to the accompanying drawings. The examples serve only to explain the invention and are not intended to limit its scope.
As shown in Fig. 1, this embodiment provides a webpage information extraction method based on the HTTP protocol, comprising:
Template generation step: a corresponding page analysis template is customized according to the target page from which information is to be extracted, and target fields and checking rules are predefined in the page analysis template;
Webpage address analysis step: the webpage address of the target page is resolved to obtain the HTML source file of the target page;
Information extraction step: the HTML source file of the target page is read and parsed, and the page information matching the target fields predefined in the page analysis template is extracted from it;
Information checking step: the extracted page information is checked against the predefined checking rules to determine whether it meets the requirements;
Information storage step: the page information that has undergone information checking is stored.
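The webpage address analysis step listed above amounts to requesting the target URL and decoding the response body. The sketch below uses Python's standard `urllib.request`; the function name `fetch_html`, the timeout and the fallback charset are assumptions for illustration, and the patent does not say which HTTP client is used.

```python
from urllib.request import urlopen


def fetch_html(url: str, timeout: float = 10.0, default_charset: str = "utf-8") -> str:
    """Request the target URL and return the decoded HTML source."""
    with urlopen(url, timeout=timeout) as resp:
        raw = resp.read()
        # Use the charset declared in Content-Type when one is sent;
        # otherwise fall back to the assumed default.
        charset = resp.headers.get_content_charset() or default_charset
    return raw.decode(charset, errors="replace")
```

A production crawler would also need to handle redirects to login pages, HTTP errors and retry policy, none of which the source describes.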
The template generation step further comprises configuring, in the customized page analysis template, the keywords used to extract the required page information. Correspondingly, in the information extraction step, extracting the required page information from the target page specifically comprises: extracting the required page information from the target page in block mode according to the keywords configured in the page analysis template.
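The source does not define "block mode" precisely. One plausible reading, sketched below in Python, is to split the page into blocks at block-level tags and keep the blocks whose text contains a configured keyword. The set of tags treated as block boundaries and the names `BlockCollector` and `extract_blocks` are assumptions.

```python
from html.parser import HTMLParser
from typing import List

# Assumed block boundaries; the source does not list them.
BLOCK_TAGS = {"div", "p", "li", "td", "section", "article"}


class BlockCollector(HTMLParser):
    """Collect the text of each block, splitting at block-level tags."""

    def __init__(self):
        super().__init__()
        self.blocks: List[str] = []
        self._buf: List[str] = []

    def flush(self) -> None:
        # Join buffered text and collapse runs of whitespace.
        text = " ".join("".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        self._buf.append(data)


def extract_blocks(html: str, keywords: List[str]) -> List[str]:
    """Return the text of every block that contains at least one keyword."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush()  # keep any trailing text outside a block tag
    return [b for b in parser.blocks if any(k in b for k in keywords)]
```

For example, with the keywords `Price` and `Stock`, a page containing `<div>Price: <b>19.90</b> USD</div>` yields the block text `Price: 19.90 USD`, with inline markup removed.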
As shown in Fig. 2, in a specific implementation the corresponding system architecture has three tiers: a target page layer, a service layer and a data layer. The target page layer consists mainly of the pages from which information is to be obtained. The service layer hosts a number of acquisition servers; the corresponding program functions are deployed there as services and perform the modeling and acquisition of target pages, and the functions of the template generation, webpage address analysis, information extraction and information checking steps are all implemented in this layer. The data layer hosts a number of database servers, which store the collected and extracted valid information as data.
The program flow for these functions is as follows:
1. First, the target page template is created in the foreground and saved as an XML file, which contains node names and checking rule information (generally regular expressions).
2. A request is sent to the target URL, triggered manually from the foreground (or by a timer thread started in the background), and the HTML source file of the target page is obtained and saved as a temporary file.
3. SAX is used to parse and read the temporary file, and DOM is used to read the template file. Each template node is traversed and its target field is matched against the temporary file; the matched data is saved to a temporary table.
4. The data in the temporary table is read and checked against the predefined checking rules, and the data that satisfies them is inserted into the extraction result table.
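Items 2 and 3 of the flow above (save the source as a temporary file, then read it with an event-driven parser) can be sketched as follows. A strict SAX parser expects well-formed XML, which real HTML often is not, so this sketch substitutes Python's `html.parser.HTMLParser`, which offers the same callback style; that substitution, the helper names and matching fields by tag name are assumptions, not the patent's method.

```python
import tempfile
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # tags with no end tag


class FieldHandler(HTMLParser):
    """Event callbacks in the SAX style: record text found inside wanted tags."""

    def __init__(self, wanted_tags):
        super().__init__()
        self.wanted = set(wanted_tags)
        self.stack = []   # currently open tags
        self.found = {}   # tag name -> list of text fragments

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed inner tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack and self.stack[-1] in self.wanted:
            self.found.setdefault(self.stack[-1], []).append(text)


def save_temp(html: str) -> str:
    """Save the HTML source as a temporary file and return its path."""
    with tempfile.NamedTemporaryFile(
        "w", suffix=".html", delete=False, encoding="utf-8"
    ) as f:
        f.write(html)
        return f.name


def parse_temp(path: str, wanted_tags, chunk_size: int = 4096):
    """Stream the temporary file through the handler in chunks."""
    handler = FieldHandler(wanted_tags)
    with open(path, encoding="utf-8") as f:
        for chunk in iter(lambda: f.read(chunk_size), ""):
            handler.feed(chunk)
    handler.close()
    return handler.found
```

Feeding the file in chunks keeps memory use flat for large pages, which is the usual reason for choosing an event-driven reader over building a full DOM of the HTML.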
With the above three-tier architecture, the steps are refined in a specific implementation as follows:
Step 1: the system performs initialization and loads the service layer programs, which are used to parse, check and store target website information.
Step 2: the webpage addresses to be extracted are set and the page analysis template is customized. The template must be customized for the target page in advance, and the interface element keywords of the content of interest can be configured in it.
Step 3: the extraction and analysis service is started.
Step 4: according to the template and webpage address set in step 2, information is obtained from the target address, the page information is loaded, and block-mode extraction is performed; key content can be extracted according to the configured keywords.
Step 5: the extracted information is checked using regular expressions, and special characters are replaced so that the information can be stored in the database, which completes the storage.
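Step 5 combines three actions: replacing special characters, checking with regular expressions and storing the result. The Python sketch below shows one way to arrange them. The source does not say which characters count as "special", so treating control characters and runs of whitespace as such is an assumption, as are the table name `extract_result` and the use of SQLite in place of a database server.

```python
import re
import sqlite3

# Assumed "special characters": control characters and runs of whitespace.
_CONTROL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")
_SPACES = re.compile(r"\s+")


def clean(value: str) -> str:
    """Replace special characters so the value is tidy before storage."""
    value = _CONTROL.sub("", value)
    # \s also matches tabs, newlines and non-breaking spaces in str patterns.
    return _SPACES.sub(" ", value).strip()


def check_and_save(conn: sqlite3.Connection, rows, rules) -> int:
    """Store only rows whose cleaned value fully matches the field's rule."""
    conn.execute("CREATE TABLE IF NOT EXISTS extract_result (field TEXT, value TEXT)")
    saved = 0
    for field, value in rows:
        value = clean(value)
        rule = rules.get(field)
        if rule and re.fullmatch(rule, value):
            # Parameter binding leaves quoting to the database driver.
            conn.execute("INSERT INTO extract_result VALUES (?, ?)", (field, value))
            saved += 1
    conn.commit()
    return saved
```

This sketch cleans before checking so that stray whitespace does not cause an otherwise valid value to fail its rule; the source lists checking first and does not fix the order.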
As shown in Fig. 3, corresponding to the above information extraction method, this embodiment also provides an information extraction device based on the HTTP protocol, comprising:
a template generation module, configured to customize a corresponding page analysis template according to the target page from which information is to be extracted, and to predefine target fields and checking rules in the page analysis template;
a webpage address analysis module, configured to resolve the webpage address of the target page and obtain the HTML source file of the target page;
an information extraction module, configured to read and parse the HTML source file of the target page and to extract from it the page information matching the target fields predefined in the page analysis template;
an information checking module, configured to check whether the page information extracted by the information extraction module meets the requirements;
an information storage module, configured to store the page information that has undergone information checking.
The working principle and implementation details of the information extraction device of this embodiment are the same as those of the information extraction method described above and are not repeated here.
The foregoing is only a preferred embodiment of the present invention and is not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (8)

1. A webpage information extraction method based on the HTTP protocol, characterized by comprising:
a template generation step: customizing a corresponding page analysis template according to the target page from which information is to be extracted, and predefining target fields and checking rules in the page analysis template;
a webpage address analysis step: resolving the webpage address of the target page to obtain the HTML source file of the target page;
an information extraction step: reading and parsing the HTML source file of the target page, and extracting from it the page information matching the target fields predefined in the page analysis template;
an information checking step: checking, according to the predefined checking rules, whether the extracted page information meets the requirements;
an information storage step: storing the page information that has undergone information checking.
2. The webpage information extraction method according to claim 1, characterized in that, in the information extraction step, the matching page information is extracted in block mode.
3. The webpage information extraction method according to claim 1, characterized in that the customized page analysis template is an XML file containing predefined node information, target field information and checking rule information.
4. The webpage information extraction method according to claim 1, characterized in that the predefined checking rules are regular expressions.
5. The webpage information extraction method according to claim 1, characterized in that SAX is used to read and parse the HTML source file of the target page.
6. The webpage information extraction method according to claim 1, characterized in that extracting from the HTML source file of the target page the page information matching the target fields predefined in the page analysis template specifically comprises: reading the customized page analysis template using DOM, traversing the nodes contained in the template, matching the target fields in the HTML source file of the target page, extracting the page information that matches the target fields, and saving it to a temporary table.
7. A webpage information extraction device based on the HTTP protocol, characterized by comprising:
a template generation module, configured to customize a corresponding page analysis template according to the target page from which information is to be extracted, and to predefine target fields and checking rules in the page analysis template;
a webpage address analysis module, configured to resolve the webpage address of the target page and obtain the HTML source file of the target page;
an information extraction module, configured to read and parse the HTML source file of the target page and to extract from it the page information matching the target fields predefined in the page analysis template;
an information checking module, configured to check whether the page information extracted by the information extraction module meets the requirements;
an information storage module, configured to store the page information that has undergone information checking.
8. The webpage information extraction device according to claim 7, characterized in that the information storage module uses a database server.
CN201410299203.5A 2014-06-26 2014-06-26 Webpage information extraction method and device based on http protocol Pending CN104050281A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410299203.5A CN104050281A (en) 2014-06-26 2014-06-26 Webpage information extraction method and device based on http protocol

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410299203.5A CN104050281A (en) 2014-06-26 2014-06-26 Webpage information extraction method and device based on http protocol

Publications (1)

Publication Number Publication Date
CN104050281A true CN104050281A (en) 2014-09-17

Family

ID=51503113

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410299203.5A Pending CN104050281A (en) 2014-06-26 2014-06-26 Webpage information extraction method and device based on http protocol

Country Status (1)

Country Link
CN (1) CN104050281A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020129067A1 (en) * 2001-03-06 2002-09-12 Dwayne Dames Method and apparatus for repurposing formatted content
CN101561802A (en) * 2008-04-18 2009-10-21 上海复旦光华信息科技股份有限公司 Web page structural data extraction method and system
CN103514189A (en) * 2012-06-25 2014-01-15 上海博腾信息科技有限公司 Implementing method for web crawler based on search engines

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
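Zhang Yanchao et al.: "Web information extraction technology based on automatically generated templates", Journal of Beijing Jiaotong University *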

Cited By (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104267953A (en) * 2014-09-27 2015-01-07 昆明钢铁集团有限责任公司 Control and method for importing Word test questions based on browser
CN104239577A (en) * 2014-10-09 2014-12-24 北京奇虎科技有限公司 Method and device for detecting authenticity of webpage data
CN104317948A (en) * 2014-11-05 2015-01-28 北京中科辅龙信息技术有限公司 Page data capturing method and system
CN104484424A (en) * 2014-12-19 2015-04-01 浪潮通用软件有限公司 Establishing method for resource price information base of construction enterprise based on internet
CN104965783A (en) * 2015-06-16 2015-10-07 百度在线网络技术(北京)有限公司 Method and apparatus for monitoring web content presentation
CN106445950A (en) * 2015-08-10 2017-02-22 刘挺 Personalized distributed data mining system
CN106547749A (en) * 2015-09-16 2017-03-29 北京国双科技有限公司 The method and apparatus of collecting webpage data
CN106547749B (en) * 2015-09-16 2021-02-12 北京国双科技有限公司 Webpage data acquisition method and device
CN106649392A (en) * 2015-11-03 2017-05-10 任子行网络技术股份有限公司 Method and apparatus for obtaining information based on what-you-see-is-what-you-get technology
CN105468730A (en) * 2015-11-20 2016-04-06 广州华多网络科技有限公司 Webpage information extraction method and equipment
CN107623624B (en) * 2016-07-15 2021-03-16 阿里巴巴集团控股有限公司 Method and device for providing notification message
CN107623624A (en) * 2016-07-15 2018-01-23 阿里巴巴集团控股有限公司 The method and device of notification message is provided
CN106570133A (en) * 2016-10-27 2017-04-19 任子行网络技术股份有限公司 Method and device for constructing visual webpage information extracting rule
CN106570133B (en) * 2016-10-27 2019-07-23 任子行网络技术股份有限公司 A kind of construction method and device of visual webpage information extracting rule
CN106845092A (en) * 2017-01-03 2017-06-13 青岛海信医疗设备股份有限公司 A kind of system docking method and device
CN107302584A (en) * 2017-07-11 2017-10-27 上海精数信息科技有限公司 A kind of efficient collecting method
CN107992346B (en) * 2017-10-19 2021-09-03 用友网络科技股份有限公司 Interface display method and interface display system of application program
CN107992346A (en) * 2017-10-19 2018-05-04 用友网络科技股份有限公司 Interface display method, the interface display system of application program
CN110020358B (en) * 2017-11-07 2021-08-17 北京京东尚科信息技术有限公司 Method and device for generating dynamic page
CN110020358A (en) * 2017-11-07 2019-07-16 北京京东尚科信息技术有限公司 Method and apparatus for generating dynamic page
CN108460001A (en) * 2017-12-29 2018-08-28 中国平安财产保险股份有限公司 Interconnection method, device, equipment and storage medium on a kind of partner product line
CN111125589A (en) * 2018-10-31 2020-05-08 北大方正集团有限公司 Data acquisition method and device and computer readable storage medium
CN109474678B (en) * 2018-10-31 2021-04-02 新华三信息安全技术有限公司 Information transmission method and device
CN109474678A (en) * 2018-10-31 2019-03-15 新华三信息安全技术有限公司 A kind of information transferring method and device
CN111125589B (en) * 2018-10-31 2023-09-05 新方正控股发展有限责任公司 Data acquisition method and device and computer readable storage medium
CN109683951A (en) * 2018-12-21 2019-04-26 北京量子保科技有限公司 A kind of code method for automatically releasing, system, medium and electronic equipment
CN111125483A (en) * 2019-12-17 2020-05-08 湖南星汉数智科技有限公司 Method and device for generating webpage data extraction template, computer device and computer readable storage medium
CN111966881A (en) * 2020-10-14 2020-11-20 成都数联铭品科技有限公司 Webpage information extraction method and system and electronic equipment
CN113535568A (en) * 2021-07-22 2021-10-22 工银科技有限公司 Verification method, device, equipment and medium for application deployment version
CN113535568B (en) * 2021-07-22 2023-09-05 工银科技有限公司 Verification method, device, equipment and medium for application deployment version

Similar Documents

Publication Publication Date Title
CN104050281A (en) Webpage information extraction method and device based on http protocol
CN101441629A (en) Automatic acquiring method of non-structured web page information
CN105049247A (en) Network safety log template extraction method and device
CN107085549B (en) Method and device for generating fault information
CN104317948A (en) Page data capturing method and system
CN105335246B (en) A kind of program crashing defect self-repairing method based on question and answer web analytics
CN102214244A (en) Analytic method and system for docx file information
CN106021301B (en) Data comparison system and method for different file formats
CN106341407A (en) Abnormal access log mining method based on website picture and apparatus thereof
CN102571922B (en) Method and device for processing data stream
CN104038821A (en) Method for uniformly gathering fault information of each functional module of Android television
CN103412742A (en) Method and device for application program to be configured with UI
CN104391917A (en) Method for incrementally capturing webpage contents
CN105808417A (en) Automated testing method and proxy server
CN111046000A (en) Government data exchange sharing oriented security supervision metadata organization method
CN105335516A (en) Construction method of universal acquisition system
CN101819584A (en) Light weight intelligent webpage content analysis method
CN105550179A (en) Webpage collection method and browser plug-in
CN103853770A (en) Method and system for abstracting information of posts from forum website
CN108228664B (en) Unstructured data processing method and device
CN105975599B (en) Method and device for monitoring page embedded points of website
CN104166545A (en) Webpage resource sniffing method and device
CN109116828A (en) Model code configuration method and device in a kind of controller
CN104636340A (en) Webpage URL filtering method, device and system
CN103678041A (en) Incremental backup method and system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20140917