CN104050281A - Webpage information extraction method and device based on http protocol - Google Patents
- Publication number
- CN104050281A
- Authority
- CN
- China
- Prior art keywords
- page
- information
- template
- target pages
- info
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/957—Browsing optimisation, e.g. caching or content distillation
- G06F16/9577—Optimising the visualization of content, e.g. distillation of HTML documents
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention relates to a webpage information extraction method and device based on the HTTP protocol. The method comprises the steps of template generation, webpage address analysis, information extraction, information checking and information storage. In the template generation step, a corresponding page analysis template is customized according to the target page from which information is to be extracted, and target fields and checking rules are predefined in the page analysis template. In the webpage address analysis step, the webpage address of the target page is analyzed to obtain the HTML source file of the target page. In the information extraction step, the HTML source file of the target page is read and parsed, and the page information matching the target fields predefined in the page analysis template is extracted from it. In the information checking step, the extracted page information is checked against the predefined checking rules to determine whether it meets the requirements. In the information storage step, the page information that has passed the information check is stored. With this method and device, page information on the network is effectively filtered, acquired and collected through the open HTTP protocol, templates are customized for different target pages, and specific information is extracted.
Description
Technical field
The present invention relates to information crawling and parsing in network technology, and in particular to a web page information extraction method and device based on the HTTP protocol.
Background art
The Web 2.0 era is an era of information explosion. Massive amounts of data pervade every aspect of work and life, so the demand for data-based analysis and for mining the potential value of data grows ever more urgent. In practice, however, because data owners control their data very strictly, much valuable data cannot be conveniently collected and extracted. Against this background, the importance of data stands out while its availability remains low or is even restricted. How to collect, extract and use the target data of interest, based on the Internet characteristics of that data, has therefore become a problem in urgent need of a solution.
Summary of the invention
The technical problem to be solved by the present invention is to provide an information extraction method and device based on the HTTP protocol, addressing the difficulty of obtaining target information in the prior art.
The technical solution by which the present invention solves the above technical problem is as follows: a web page information extraction method based on the HTTP protocol, comprising:
Template generation step: customizing a corresponding page analysis template according to the target page from which information is to be extracted, and predefining target fields and checking rules in the page analysis template;
Web page address analysis step: analyzing the web page address of the target page to obtain the HTML source file of the target page;
Information extraction step: reading and parsing the HTML source file of the target page, and extracting from it the page information that matches the target fields predefined in the page analysis template;
Information checking step: checking, according to the predefined checking rules, whether the extracted page information meets the requirements;
Information storage step: storing the page information that has passed the information check.
On the basis of the above technical solution, the present invention can be further improved as follows.
Further, in the information extraction step, the matching page information is extracted in block mode.
Further, the customized page analysis template is an XML file, and the XML file contains predefined node information, target field information and checking rule information.
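To make the template structure concrete, the sketch below shows what such an XML file could look like and how it could be read in Python. The element and attribute names (`template`, `node`, `field`, `tag`, `rule`) are assumptions chosen for illustration; the source only states that the file holds node information, target field information and checking rule information.

```python
from xml.dom import minidom

# Hypothetical template. The schema is an assumption, not taken from the patent.
TEMPLATE_XML = r"""<template>
  <node name="product">
    <field name="title" tag="h1" rule=".+"/>
    <field name="price" tag="span" rule="\d+\.\d{2}"/>
  </node>
</template>"""


def load_template(xml_text):
    """Return a list of (node name, field name, tag, rule) tuples."""
    doc = minidom.parseString(xml_text)
    entries = []
    for node in doc.getElementsByTagName("node"):
        for field in node.getElementsByTagName("field"):
            entries.append((
                node.getAttribute("name"),    # node information
                field.getAttribute("name"),   # target field
                field.getAttribute("tag"),    # where to look in the page
                field.getAttribute("rule"),   # checking rule (regular expression)
            ))
    return entries


fields = load_template(TEMPLATE_XML)
```

Keeping the checking rule next to the field it applies to means one file describes both what to extract and how to validate it.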
Further, the predefined checking rules are regular expressions.
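A regular-expression checking rule could be applied as in the following minimal Python sketch. The function name `check_value` and the choice to require that the rule match the whole value (rather than a substring) are assumptions; the source does not say how the expression is applied.

```python
import re


def check_value(value, rule):
    """Return True when the whole extracted value satisfies the checking rule."""
    # Assumption: surrounding whitespace is ignored and the rule must cover
    # the entire remaining value.
    return re.fullmatch(rule, value.strip()) is not None


price_rule = r"\d+\.\d{2}"   # hypothetical rule for a price field
ok = check_value("19.90", price_rule)
rejected = check_value("N/A", price_rule)
```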
Further, SAX technology is used to read and parse the HTML source file of the target page.
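Python's `xml.sax` module expects well-formed XML, which real-world HTML often is not, so the sketch below substitutes the standard-library `html.parser.HTMLParser`. It is event-driven in the same way as SAX (callbacks for start tags, text and end tags) but is not SAX itself. The class name `TagTextCollector` and the idea of collecting text by tag name are assumptions made for illustration.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text found inside each occurrence of the wanted tags."""

    def __init__(self, wanted_tags):
        super().__init__()
        self.wanted = set(wanted_tags)
        self.stack = []   # wanted elements that are currently open
        self.found = {}   # tag name -> list of text contents

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted:
            self.stack.append([tag, []])

    def handle_data(self, data):
        # Text belongs to every wanted element that is still open.
        for _, parts in self.stack:
            parts.append(data)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1][0] == tag:
            _, parts = self.stack.pop()
            self.found.setdefault(tag, []).append("".join(parts).strip())


def collect(html_text, wanted_tags):
    parser = TagTextCollector(wanted_tags)
    parser.feed(html_text)
    parser.close()
    return parser.found
```

Because the parser streams through the source once and keeps only the wanted text, it never builds a full tree of the page, which is the usual reason for preferring an event-driven reader on large documents.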
Further, extracting from the HTML source file of the target page the page information that matches the target fields predefined in the page analysis template specifically comprises: reading the customized page analysis template using DOM technology, traversing the nodes contained in the page analysis template, matching the target fields in the HTML source file of the target page, extracting the page information that matches the target fields, and saving it to a temporary table.
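The traversal described here could look roughly like the Python sketch below. It is an illustration under several assumptions: the template schema (`node`, `field`, `tag`) is hypothetical, matching is simplified to "the first element with the configured tag name" using a regular expression, and an in-memory SQLite temporary table named `extract_tmp` stands in for the temporary table.

```python
import re
import sqlite3
from xml.dom import minidom

# Hypothetical template. The schema is an assumption.
TEMPLATE_XML = """<template>
  <node name="product">
    <field name="title" tag="h1"/>
    <field name="price" tag="span"/>
  </node>
</template>"""


def extract_to_temp_table(conn, template_xml, html_text):
    """Match each template field in the HTML and stage the hits in a temp table."""
    conn.execute(
        "CREATE TEMP TABLE IF NOT EXISTS extract_tmp "
        "(node TEXT, field TEXT, value TEXT)"
    )
    doc = minidom.parseString(template_xml)          # DOM read of the template
    for node in doc.getElementsByTagName("node"):    # traverse template nodes
        for field in node.getElementsByTagName("field"):
            tag = re.escape(field.getAttribute("tag"))
            # Simplified matching: first element with the configured tag name.
            hit = re.search(rf"<{tag}\b[^>]*>(.*?)</{tag}>", html_text,
                            re.S | re.I)
            if hit:
                conn.execute(
                    "INSERT INTO extract_tmp VALUES (?, ?, ?)",
                    (node.getAttribute("name"),
                     field.getAttribute("name"),
                     hit.group(1).strip()),
                )
    conn.commit()


conn = sqlite3.connect(":memory:")
extract_to_temp_table(conn, TEMPLATE_XML, "<h1>Widget</h1><span>19.90</span>")
rows = conn.execute(
    "SELECT field, value FROM extract_tmp ORDER BY field"
).fetchall()
```

Staging matches in a temporary table keeps unchecked data separate from the final results until the checking step has run.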
Corresponding to the above web page information extraction method, the technical solution of the present invention also includes a web page information extraction device based on the HTTP protocol, comprising:
Template generation module, configured to customize a corresponding page analysis template according to the target page from which information is to be extracted, and to predefine target fields and checking rules in the page analysis template;
Web page address analysis module, configured to analyze the web page address of the target page and obtain the HTML source file of the target page;
Information extraction module, configured to read and parse the HTML source file of the target page and to extract from it the page information that matches the target fields predefined in the page analysis template;
Information checking module, configured to check whether the page information extracted by the information extraction module meets the requirements;
Information storage module, configured to store the page information that has passed the information check.
Further, the information storage module uses a database server.
The beneficial effects of the present invention are as follows. The invention does not depend on the openness of data owners: it can collect and extract data on the Internet using the basic Internet communication protocol (HTTP), which facilitates mining and analyzing the value of the data. Through the open HTTP protocol, the invention effectively filters, acquires and collects the page information that is accessible on the network, customizes templates for different target pages, and thereby extracts specific information. Unlike approaches that generate the template while extracting, the invention customizes the template for the target page in advance, which is more targeted and helps improve the efficiency and accuracy of information extraction.
Brief description of the drawings
Fig. 1 is a schematic flow chart of the information extraction method based on the HTTP protocol according to the present invention;
Fig. 2 is a system architecture diagram of the information extraction method described in the embodiment of the present invention;
Fig. 3 is a schematic structural diagram of the information extraction device based on the HTTP protocol according to the present invention.
Detailed description of the embodiments
The principles and features of the present invention are described below with reference to the accompanying drawings. The examples given serve only to explain the invention and are not intended to limit its scope.
As shown in Fig. 1, this embodiment provides a web page information extraction method based on the HTTP protocol, comprising:
Template generation step: customizing a corresponding page analysis template according to the target page from which information is to be extracted, and predefining target fields and checking rules in the page analysis template;
Web page address analysis step: analyzing the web page address of the target page to obtain the HTML source file of the target page;
Information extraction step: reading and parsing the HTML source file of the target page, and extracting from it the page information that matches the target fields predefined in the page analysis template;
Information checking step: checking, according to the predefined checking rules, whether the extracted page information meets the requirements;
Information storage step: storing the page information that has passed the information check.
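The five steps above can be wired together as in the following Python sketch. It is a simplified illustration, not the patented implementation: a plain dictionary stands in for the XML template, regular expressions stand in for the parser, a list stands in for the database, and every name (`TEMPLATE`, `fetch_html`, `extract`, `check`, `run`) is hypothetical.

```python
import re
from urllib.request import urlopen

# Step 1 (template generation): a dict stands in for the XML template here.
TEMPLATE = {
    "title": {"extract": r"<h1[^>]*>(.*?)</h1>", "check": r".+"},
    "price": {"extract": r'<span class="price">(.*?)</span>',
              "check": r"\d+\.\d{2}"},
}


def fetch_html(url, timeout=10):
    """Step 2: request the target page over HTTP and return its HTML source."""
    with urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def extract(html_text, template):
    """Step 3: pull out the text that matches each target field."""
    found = {}
    for name, spec in template.items():
        hit = re.search(spec["extract"], html_text, re.S)
        if hit:
            found[name] = hit.group(1).strip()
    return found


def check(found, template):
    """Step 4: keep only the values that satisfy their checking rule."""
    return {name: value for name, value in found.items()
            if re.fullmatch(template[name]["check"], value)}


def run(url, template, store):
    """Steps 2 to 5 in order; `store` is a list standing in for a database."""
    record = check(extract(fetch_html(url), template), template)
    store.append(record)   # Step 5: storage
    return record
```

A call such as `run("http://example.com/item/1", TEMPLATE, [])` (the URL is a placeholder) would fetch the page, extract the two fields, drop any value that fails its rule, and store the rest.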
The template generation step further comprises: configuring, in the customized page analysis template, keywords used for extracting the required page information. Correspondingly, in the information extraction step, extracting the required page information from the target page specifically comprises: extracting the required page information from the target page in block mode according to the keywords configured in the page analysis template.
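The source does not define what a "block" is. One possible reading, sketched below in Python, treats each outermost `<div>` element as a block and keeps the blocks whose text contains a configured keyword. The class `BlockSplitter`, the function `blocks_with_keyword` and the choice of `<div>` as the block boundary are all assumptions.

```python
from html.parser import HTMLParser


class BlockSplitter(HTMLParser):
    """Split a page into text blocks, one per outermost <div> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0     # current nesting depth of <div> elements
        self.parts = []    # text pieces of the block being read
        self.blocks = []   # finished blocks, as whitespace-normalised text

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                text = " ".join("".join(self.parts).split())
                self.blocks.append(text)
                self.parts = []


def blocks_with_keyword(html_text, keyword):
    """Return the text of every block that contains the keyword."""
    splitter = BlockSplitter()
    splitter.feed(html_text)
    splitter.close()
    return [block for block in splitter.blocks if keyword in block]
```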
As shown in Fig. 2, in a specific implementation the corresponding system has a three-tier architecture consisting of a target page layer, a service layer and a data layer. The target page layer consists mainly of the pages from which information is to be obtained. The service layer hosts a number of acquisition servers; the corresponding program functions are deployed in this layer as services to model and collect the target pages, and the functions of the template generation, web page address analysis, information extraction and information checking steps are all implemented here. The data layer hosts a number of database servers, which store the collected and extracted valid information as data.
The program flow of the corresponding program functions is as follows:
1. First, the target page template is defined in the foreground and saved as an XML file; the file contains node names and checking rule information (usually regular expressions).
2. A request is sent to the target URL, triggered manually in the foreground (or by a timer thread started in the background), and the HTML source file of the target page is obtained and saved as a temporary file.
3. SAX is used to parse and read the temporary file, and DOM is used to read the template file; every template node is traversed, the target fields are matched in the temporary file, and the matched data are saved to a temporary table.
4. The data in the temporary table are read and checked against the predefined checking rules, and the data that satisfy the conditions are inserted into the extraction result table.
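The last item of this flow, moving checked data from the temporary table into the result table, might be sketched in Python as follows. SQLite is used only because it ships with Python; the table names `extract_tmp` and `extract_result`, the function name and the per-field rule dictionary are assumptions for illustration.

```python
import re
import sqlite3


def promote_checked_rows(conn, rules):
    """Copy rows from extract_tmp to extract_result when they pass their rule."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS extract_result (field TEXT, value TEXT)"
    )
    rows = conn.execute("SELECT field, value FROM extract_tmp").fetchall()
    # Rows whose field has no rule, or whose value fails the rule, are dropped.
    kept = [(field, value) for field, value in rows
            if field in rules and re.fullmatch(rules[field], value)]
    conn.executemany("INSERT INTO extract_result VALUES (?, ?)", kept)
    conn.execute("DELETE FROM extract_tmp")   # staging data is no longer needed
    conn.commit()
    return len(kept)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TEMP TABLE extract_tmp (field TEXT, value TEXT)")
conn.executemany("INSERT INTO extract_tmp VALUES (?, ?)",
                 [("title", "Widget"), ("price", "n/a")])
inserted = promote_checked_rows(conn, {"title": r".+", "price": r"\d+\.\d{2}"})
```

In this example only the `title` row reaches the result table, because `"n/a"` does not satisfy the price rule.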
When the above three-tier architecture is adopted, the above steps are refined in a specific implementation as follows:
Step 1: the system performs initialization loading and loads the service layer programs used for analyzing, checking and storing target web address information.
Step 2: the web page addresses to be extracted are set and the page analysis templates are customized. A page analysis template must be customized in advance for each target page; keywords for the interface elements that carry the content of interest can be configured in the template.
Step 3: the extraction and analysis services are started.
Step 4: according to the template and web page address set in step 2, information is obtained from the target address, the page information is loaded, and block-mode extraction is completed; key content can be extracted according to the configured keywords.
Step 5: the extracted information is checked using regular expressions, special characters are replaced so that the data can be written to the database, and storage is completed.
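The source does not list which special characters are replaced before storage. As one plausible interpretation, the Python sketch below removes leftover inline tags, decodes HTML entities and collapses whitespace; the function name `clean_value` and this particular set of replacements are assumptions.

```python
import html
import re


def clean_value(raw):
    """Normalise an extracted value before it is stored."""
    text = re.sub(r"<[^>]+>", "", raw)   # drop leftover inline tags
    text = html.unescape(text)           # &amp; -> &, &nbsp; -> no-break space
    text = re.sub(r"\s+", " ", text)     # \s also matches the no-break space
    return text.strip()


cleaned = clean_value("  <b>Tom&nbsp;&amp;&nbsp;Jerry</b>\n")
```

When the cleaned value is then written to a database, passing it as a bound parameter (rather than concatenating it into the SQL text) avoids having to escape quotes by hand.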
As shown in Fig. 3, corresponding to the above information extraction method, this embodiment also provides an information extraction device based on the HTTP protocol, comprising:
Template generation module, configured to customize a corresponding page analysis template according to the target page from which information is to be extracted, and to predefine target fields and checking rules in the page analysis template;
Web page address analysis module, configured to analyze the web page address of the target page and obtain the HTML source file of the target page;
Information extraction module, configured to read and parse the HTML source file of the target page and to extract from it the page information that matches the target fields predefined in the page analysis template;
Information checking module, configured to check whether the page information extracted by the information extraction module meets the requirements;
Information storage module, configured to store the page information that has passed the information check.
The working principle and specific implementation details of the information extraction device of this embodiment are the same as those of the above information extraction method and are not repeated here.
The foregoing is only a preferred embodiment of the present invention and is not intended to limit the invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims (8)
1. A web page information extraction method based on the HTTP protocol, characterized by comprising:
Template generation step: customizing a corresponding page analysis template according to the target page from which information is to be extracted, and predefining target fields and checking rules in the page analysis template;
Web page address analysis step: analyzing the web page address of the target page to obtain the HTML source file of the target page;
Information extraction step: reading and parsing the HTML source file of the target page, and extracting from it the page information that matches the target fields predefined in the page analysis template;
Information checking step: checking, according to the predefined checking rules, whether the extracted page information meets the requirements;
Information storage step: storing the page information that has passed the information check.
2. The web page information extraction method according to claim 1, characterized in that, in the information extraction step, the matching page information is extracted in block mode.
3. The web page information extraction method according to claim 1, characterized in that the customized page analysis template is an XML file, and the XML file contains predefined node information, target field information and checking rule information.
4. The web page information extraction method according to claim 1, characterized in that the predefined checking rules are regular expressions.
5. The web page information extraction method according to claim 1, characterized in that SAX technology is used to read and parse the HTML source file of the target page.
6. The web page information extraction method according to claim 1, characterized in that extracting from the HTML source file of the target page the page information that matches the target fields predefined in the page analysis template specifically comprises: reading the customized page analysis template using DOM technology, traversing the nodes contained in the page analysis template, matching the target fields in the HTML source file of the target page, extracting the page information that matches the target fields, and saving it to a temporary table.
7. A web page information extraction device based on the HTTP protocol, characterized by comprising:
Template generation module, configured to customize a corresponding page analysis template according to the target page from which information is to be extracted, and to predefine target fields and checking rules in the page analysis template;
Web page address analysis module, configured to analyze the web page address of the target page and obtain the HTML source file of the target page;
Information extraction module, configured to read and parse the HTML source file of the target page and to extract from it the page information that matches the target fields predefined in the page analysis template;
Information checking module, configured to check whether the page information extracted by the information extraction module meets the requirements;
Information storage module, configured to store the page information that has passed the information check.
8. The web page information extraction device according to claim 7, characterized in that the information storage module uses a database server.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410299203.5A CN104050281A (en) | 2014-06-26 | 2014-06-26 | Webpage information extraction method and device based on http protocol |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410299203.5A CN104050281A (en) | 2014-06-26 | 2014-06-26 | Webpage information extraction method and device based on http protocol |
Publications (1)
Publication Number | Publication Date |
---|---|
CN104050281A true CN104050281A (en) | 2014-09-17 |
Family
ID=51503113
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410299203.5A Pending CN104050281A (en) | 2014-06-26 | 2014-06-26 | Webpage information extraction method and device based on http protocol |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104050281A (en) |
Cited By (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104239577A (en) * | 2014-10-09 | 2014-12-24 | 北京奇虎科技有限公司 | Method and device for detecting authenticity of webpage data |
CN104267953A (en) * | 2014-09-27 | 2015-01-07 | 昆明钢铁集团有限责任公司 | Control and method for importing Word test questions based on browser |
CN104317948A (en) * | 2014-11-05 | 2015-01-28 | 北京中科辅龙信息技术有限公司 | Page data capturing method and system |
CN104484424A (en) * | 2014-12-19 | 2015-04-01 | 浪潮通用软件有限公司 | Establishing method for resource price information base of construction enterprise based on internet |
CN104965783A (en) * | 2015-06-16 | 2015-10-07 | 百度在线网络技术(北京)有限公司 | Method and apparatus for monitoring web content presentation |
CN105468730A (en) * | 2015-11-20 | 2016-04-06 | 广州华多网络科技有限公司 | Webpage information extraction method and equipment |
CN106445950A (en) * | 2015-08-10 | 2017-02-22 | 刘挺 | Personalized distributed data mining system |
CN106547749A (en) * | 2015-09-16 | 2017-03-29 | 北京国双科技有限公司 | The method and apparatus of collecting webpage data |
CN106570133A (en) * | 2016-10-27 | 2017-04-19 | 任子行网络技术股份有限公司 | Method and device for constructing visual webpage information extracting rule |
CN106649392A (en) * | 2015-11-03 | 2017-05-10 | 任子行网络技术股份有限公司 | Method and apparatus for obtaining information based on what-you-see-is-what-you-get technology |
CN106845092A (en) * | 2017-01-03 | 2017-06-13 | 青岛海信医疗设备股份有限公司 | A kind of system docking method and device |
CN107302584A (en) * | 2017-07-11 | 2017-10-27 | 上海精数信息科技有限公司 | A kind of efficient collecting method |
CN107623624A (en) * | 2016-07-15 | 2018-01-23 | 阿里巴巴集团控股有限公司 | The method and device of notification message is provided |
CN107992346A (en) * | 2017-10-19 | 2018-05-04 | 用友网络科技股份有限公司 | Interface display method, the interface display system of application program |
CN108460001A (en) * | 2017-12-29 | 2018-08-28 | 中国平安财产保险股份有限公司 | Interconnection method, device, equipment and storage medium on a kind of partner product line |
CN109474678A (en) * | 2018-10-31 | 2019-03-15 | 新华三信息安全技术有限公司 | A kind of information transferring method and device |
CN109683951A (en) * | 2018-12-21 | 2019-04-26 | 北京量子保科技有限公司 | A kind of code method for automatically releasing, system, medium and electronic equipment |
CN110020358A (en) * | 2017-11-07 | 2019-07-16 | 北京京东尚科信息技术有限公司 | Method and apparatus for generating dynamic page |
CN111125589A (en) * | 2018-10-31 | 2020-05-08 | 北大方正集团有限公司 | Data acquisition method and device and computer readable storage medium |
CN111125483A (en) * | 2019-12-17 | 2020-05-08 | 湖南星汉数智科技有限公司 | Method and device for generating webpage data extraction template, computer device and computer readable storage medium |
CN111966881A (en) * | 2020-10-14 | 2020-11-20 | 成都数联铭品科技有限公司 | Webpage information extraction method and system and electronic equipment |
CN113535568A (en) * | 2021-07-22 | 2021-10-22 | 工银科技有限公司 | Verification method, device, equipment and medium for application deployment version |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20020129067A1 (en) * | 2001-03-06 | 2002-09-12 | Dwayne Dames | Method and apparatus for repurposing formatted content |
CN101561802A (en) * | 2008-04-18 | 2009-10-21 | 上海复旦光华信息科技股份有限公司 | Web page structural data extraction method and system |
CN103514189A (en) * | 2012-06-25 | 2014-01-15 | 上海博腾信息科技有限公司 | Implementing method for web crawler based on search engines |
- 2014-06-26 CN CN201410299203.5A patent/CN104050281A/en active Pending
Non-Patent Citations (1)
Title |
---|
张彦超 等: "基于自动生成模板的Web信息抽取技术", 《北京交通大学学报》 * |
Cited By (30)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104267953A (en) * | 2014-09-27 | 2015-01-07 | 昆明钢铁集团有限责任公司 | Control and method for importing Word test questions based on browser |
CN104239577A (en) * | 2014-10-09 | 2014-12-24 | 北京奇虎科技有限公司 | Method and device for detecting authenticity of webpage data |
CN104317948A (en) * | 2014-11-05 | 2015-01-28 | 北京中科辅龙信息技术有限公司 | Page data capturing method and system |
CN104484424A (en) * | 2014-12-19 | 2015-04-01 | 浪潮通用软件有限公司 | Establishing method for resource price information base of construction enterprise based on internet |
CN104965783A (en) * | 2015-06-16 | 2015-10-07 | 百度在线网络技术(北京)有限公司 | Method and apparatus for monitoring web content presentation |
CN106445950A (en) * | 2015-08-10 | 2017-02-22 | 刘挺 | Personalized distributed data mining system |
CN106547749A (en) * | 2015-09-16 | 2017-03-29 | 北京国双科技有限公司 | The method and apparatus of collecting webpage data |
CN106547749B (en) * | 2015-09-16 | 2021-02-12 | 北京国双科技有限公司 | Webpage data acquisition method and device |
CN106649392A (en) * | 2015-11-03 | 2017-05-10 | 任子行网络技术股份有限公司 | Method and apparatus for obtaining information based on what-you-see-is-what-you-get technology |
CN105468730A (en) * | 2015-11-20 | 2016-04-06 | 广州华多网络科技有限公司 | Webpage information extraction method and equipment |
CN107623624B (en) * | 2016-07-15 | 2021-03-16 | 阿里巴巴集团控股有限公司 | Method and device for providing notification message |
CN107623624A (en) * | 2016-07-15 | 2018-01-23 | 阿里巴巴集团控股有限公司 | The method and device of notification message is provided |
CN106570133A (en) * | 2016-10-27 | 2017-04-19 | 任子行网络技术股份有限公司 | Method and device for constructing visual webpage information extracting rule |
CN106570133B (en) * | 2016-10-27 | 2019-07-23 | 任子行网络技术股份有限公司 | A kind of construction method and device of visual webpage information extracting rule |
CN106845092A (en) * | 2017-01-03 | 2017-06-13 | 青岛海信医疗设备股份有限公司 | A kind of system docking method and device |
CN107302584A (en) * | 2017-07-11 | 2017-10-27 | 上海精数信息科技有限公司 | A kind of efficient collecting method |
CN107992346B (en) * | 2017-10-19 | 2021-09-03 | 用友网络科技股份有限公司 | Interface display method and interface display system of application program |
CN107992346A (en) * | 2017-10-19 | 2018-05-04 | 用友网络科技股份有限公司 | Interface display method, the interface display system of application program |
CN110020358B (en) * | 2017-11-07 | 2021-08-17 | 北京京东尚科信息技术有限公司 | Method and device for generating dynamic page |
CN110020358A (en) * | 2017-11-07 | 2019-07-16 | 北京京东尚科信息技术有限公司 | Method and apparatus for generating dynamic page |
CN108460001A (en) * | 2017-12-29 | 2018-08-28 | 中国平安财产保险股份有限公司 | Interconnection method, device, equipment and storage medium on a kind of partner product line |
CN111125589A (en) * | 2018-10-31 | 2020-05-08 | 北大方正集团有限公司 | Data acquisition method and device and computer readable storage medium |
CN109474678B (en) * | 2018-10-31 | 2021-04-02 | 新华三信息安全技术有限公司 | Information transmission method and device |
CN109474678A (en) * | 2018-10-31 | 2019-03-15 | 新华三信息安全技术有限公司 | A kind of information transferring method and device |
CN111125589B (en) * | 2018-10-31 | 2023-09-05 | 新方正控股发展有限责任公司 | Data acquisition method and device and computer readable storage medium |
CN109683951A (en) * | 2018-12-21 | 2019-04-26 | 北京量子保科技有限公司 | A kind of code method for automatically releasing, system, medium and electronic equipment |
CN111125483A (en) * | 2019-12-17 | 2020-05-08 | 湖南星汉数智科技有限公司 | Method and device for generating webpage data extraction template, computer device and computer readable storage medium |
CN111966881A (en) * | 2020-10-14 | 2020-11-20 | 成都数联铭品科技有限公司 | Webpage information extraction method and system and electronic equipment |
CN113535568A (en) * | 2021-07-22 | 2021-10-22 | 工银科技有限公司 | Verification method, device, equipment and medium for application deployment version |
CN113535568B (en) * | 2021-07-22 | 2023-09-05 | 工银科技有限公司 | Verification method, device, equipment and medium for application deployment version |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104050281A (en) | Webpage information extraction method and device based on http protocol | |
CN101441629A (en) | Automatic acquiring method of non-structured web page information | |
CN105049247A (en) | Network safety log template extraction method and device | |
CN107085549B (en) | Method and device for generating fault information | |
CN104317948A (en) | Page data capturing method and system | |
CN105335246B (en) | A kind of program crashing defect self-repairing method based on question and answer web analytics | |
CN102214244A (en) | Analytic method and system for docx file information | |
CN106021301B (en) | Data comparison system and method for different file formats | |
CN106341407A (en) | Abnormal access log mining method based on website picture and apparatus thereof | |
CN102571922B (en) | Method and device for processing data stream | |
CN104038821A (en) | Method for uniformly gathering fault information of each functional module of Android television | |
CN103412742A (en) | Method and device for application program to be configured with UI | |
CN104391917A (en) | Method for incrementally capturing webpage contents | |
CN105808417A (en) | Automated testing method and proxy server | |
CN111046000A (en) | Government data exchange sharing oriented security supervision metadata organization method | |
CN105335516A (en) | Construction method of universal acquisition system | |
CN101819584A (en) | Light weight intelligent webpage content analysis method | |
CN105550179A (en) | Webpage collection method and browser plug-in | |
CN103853770A (en) | Method and system for abstracting information of posts from forum website | |
CN108228664B (en) | Unstructured data processing method and device | |
CN105975599B (en) | Method and device for monitoring page embedded points of website | |
CN104166545A (en) | Webpage resource sniffing method and device | |
CN109116828A (en) | Model code configuration method and device in a kind of controller | |
CN104636340A (en) | Webpage URL filtering method, device and system | |
CN103678041A (en) | Incremental backup method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20140917 |