CN104317845A - Method and system for automatic extraction of deep web data - Google Patents

Publication number
CN104317845A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410537825.7A
Other languages
Chinese (zh)
Inventor
贾岩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ANHUI HUAZHEN INFORMATION SCIENCE & TECHNOLOGY Co Ltd
Original Assignee
ANHUI HUAZHEN INFORMATION SCIENCE & TECHNOLOGY Co Ltd
Application filed by ANHUI HUAZHEN INFORMATION SCIENCE & TECHNOLOGY Co Ltd
Priority to CN201410537825.7A
Publication of CN104317845A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a system for automatic extraction of deep web data. The method comprises the following steps: detecting and crawling industry-related data; parsing WEB pages and extracting semantic summaries; and automatically extracting Deep Web data. The method has the advantage that, without reducing the volume of industry data collected, it greatly reduces bandwidth use and the amount of data retrieved, improves the data loading cycle, and improves timeliness.

Description

A method and system for automatic extraction of deep web data
Technical field
The present invention relates to the field of grid computing technology, and in particular to a method and system for automatic extraction of deep web data.
Background
As informatization deepens, enterprises' demand for integrated information grows ever stronger. The continuously growing information resources of the Internet contain a vast amount of commercially valuable information and have become an important information source. At present, few companies provide customized information search and intelligence analysis products, and those products place high demands on the user's own basic information infrastructure, have long implementation cycles, and are costly to build and maintain. Their main customers are very large enterprises and governments; ordinary enterprises cannot afford them.
Summary of the invention
To solve the technical problem described in the background, the present invention proposes a method and system for automatic extraction of deep web data that greatly reduces the system's demands on an enterprise's information infrastructure and can be deployed on widely varying enterprise basic information infrastructure.
The method for automatic extraction of deep web data proposed by the present invention comprises the following steps:
detecting and crawling industry-related data;
parsing WEB pages and extracting semantic summaries;
automatically extracting Deep Web data.
Preferably, detecting and crawling industry-related data specifically comprises fixed-point collection, in which the user configures known data sources to be collected.
Preferably, detecting and crawling industry-related data specifically comprises: using a web industry information probe to find candidate websites through URL (Uniform Resource Locator) links and search engines used as springboards; then verifying whether a website, subsite, or subdirectory contains industry-related information and what its relevance density is; and mining the deep web through website topology, URL structure, and forms, so as to find potential data sources.
Preferably, parsing WEB pages and extracting semantic summaries specifically comprises using the HTML specification and vision-based segmentation technology to extract the page's metadata and body text.
Preferably, detecting and crawling industry-related data specifically comprises:
using network probe technology to probe site pages continuously, filling in forms automatically and testing the returned data to find the most suitable form; after the form is found, submitting it automatically and comparing the web pages obtained;
analyzing the DOM trees of the pages obtained before and after, and extracting the nodes whose content differs between the DOM trees; these nodes are the data to be collected.
Preferably, after the correct data is extracted, the administrator is notified to configure the data format, completing the discovery and collection of the Deep Web site.
The system for automatic extraction of deep web data proposed by the present invention comprises:
an acquisition module, for detecting and crawling industry-related data;
a parsing and extraction module, connected to the acquisition module, for parsing WEB pages and extracting semantic summaries;
an automatic extraction module, connected to the parsing and extraction module, for automatically extracting Deep Web data.
Preferably, the acquisition module is specifically configured to use a web industry information probe to find candidate websites through URL (Uniform Resource Locator) links and search engines used as springboards, then to verify whether a website, subsite, or subdirectory contains industry-related information and what its relevance density is, and to mine the deep web through website topology, URL structure, and forms, so as to find potential data sources.
Preferably, the parsing and extraction module is specifically configured to use a web industry information probe to find candidate websites through URL (Uniform Resource Locator) links and search engines used as springboards, then to verify whether a website, subsite, or subdirectory contains industry-related information and what its relevance density is, and to mine the deep web through website topology, URL structure, and forms, so as to find potential data sources.
Preferably, the automatic extraction module is specifically configured to use network probe technology to probe site pages continuously, filling in forms automatically and testing the returned data to find the most suitable form; after the form is found, to submit it automatically and compare the web pages obtained; and to analyze the DOM trees of the pages obtained before and after and extract the nodes whose content differs between the DOM trees, these nodes being the data to be collected.
In the present invention, without reducing the volume of industry data collected, bandwidth use and the amount of data retrieved are greatly reduced, the data loading cycle is improved, and timeliness is improved.
Brief description of the drawings
Fig. 1 is a flow chart of a method for automatic extraction of deep web data proposed by an embodiment of the present invention;
Fig. 2 is a structural diagram of a system for automatic extraction of deep web data proposed by an embodiment of the present invention.
Detailed description
As shown in Fig. 1, an embodiment of the present invention proposes a method for automatic extraction of deep web data, comprising the following steps:
Step 101: detect and crawl industry-related data. The present invention provides customized search for enterprises. On the one hand, enterprises' informatization foundations vary widely and their resources are relatively limited; on the other hand, only industry-related information is needed, and there is no need to index the entire Internet. The invention therefore detects and crawls industry-related data in two ways. The first is fixed-point collection, in which the user configures known data sources to be collected. The second uses a web industry information probe that draws on an industry ontology: it finds candidate websites by means such as URL (Uniform Resource Locator) links and search engines used as springboards, then verifies whether a website, subsite, or subdirectory contains industry-related information and what its relevance density is, and mines the deep web through website topology, URL structure, forms, and the like to find potential data sources. A URL is a concise representation of the location of a resource available on the Internet and of the method for accessing it; it is the standard address of a resource on the Internet. Each file on the Internet has a unique URL, which indicates where the file is located and how a browser should handle it. Much of the deep web is well-structured data that is easy to analyze and often cannot be found through general-purpose search engines, so it is of great value to customers.
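The text does not define how relevance density is computed. As a minimal sketch, assuming it is the fraction of a page's word tokens that appear in an industry vocabulary, the check could look like the following. The function names, the English tokenizer, and the 0.05 threshold are all illustrative assumptions, not part of the invention as described; Chinese text would need a word segmenter instead of the regular expression used here.

```python
import re


def relevance_density(text: str, industry_terms: set) -> float:
    """Return the fraction of word tokens in `text` found in `industry_terms`.

    Assumption: "relevance density" is modelled as a simple token ratio.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in industry_terms)
    return hits / len(tokens)


def is_candidate_source(text: str, industry_terms: set, threshold: float = 0.05) -> bool:
    """Accept a site, subsite, or subdirectory whose density meets the threshold.

    The threshold value is hypothetical and would need tuning per industry.
    """
    return relevance_density(text, industry_terms) >= threshold
```

A probe could call `is_candidate_source` on the text of each candidate site, subsite, or subdirectory and keep only those that pass as potential data sources.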
Step 102: parse WEB pages and extract semantic summaries. WEB page parsing means parsing an HTML (HyperText Markup Language) page by analyzing its tags and extracting the body content. The invention uses the HTML specification and vision-based segmentation technology to extract the page's metadata (such as title and keywords) and body text, effectively avoiding interference from irrelevant information. The invention also supports other common data formats well, including XML, PDF, and the MS Office family of formats.
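The tag-analysis part of this step can be sketched with Python's standard `html.parser` module. This is a simplified illustration only: it replaces the vision-based segmentation mentioned above with a plain tag filter, and the choice of tags to skip (`script`, `style`, `nav`, `footer`) is an assumption made for the example.

```python
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Collects the title, the keywords meta tag, and visible body text."""

    # Assumption: text inside these tags is treated as irrelevant noise.
    SKIP_TAGS = {"script", "style", "nav", "footer"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.keywords = []
        self.body_parts = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attr.get("name") or "").lower() == "keywords":
            content = attr.get("content") or ""
            self.keywords = [k.strip() for k in content.split(",") if k.strip()]
        elif tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_title:
            self.title += text
        elif self._skip_depth == 0:
            self.body_parts.append(text)


def extract_page(html: str) -> dict:
    """Return the page's metadata and body text as a dictionary."""
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title,
        "keywords": parser.keywords,
        "body": " ".join(parser.body_parts),
    }
```

A production system would need a rendering-aware segmenter to approximate the vision-based approach; the tag filter here only shows where metadata and body text come from in the markup.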
The invention addresses two kinds of semantic summary. The first is a full-text summary produced to make it easier for customers to browse information; the second is the summary shown with a search result. The first kind aims to cover the document's main information as far as possible; the second, on top of that, must also take into account factors such as the density of the user's search terms. The invention uses semantic analysis technology to analyze every sentence of a document, marking verbal semantic points, nominal semantic points, and semantic tendency, and then aggregates these into the semantic focus of each paragraph and of the whole document. Finally, using the semantic focus together with the document's features, and with a word count (for example, 400 words) as the constraint, it selects several "sentence groups" that cover the full-text semantics as far as possible to form the full-text summary. The summary for a search result differs in implementation only in that it adds the density of the search terms (including conceptually close terms) as a further constraint.
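The selection step can be illustrated with a small extractive summarizer. This sketch is not the semantic analysis described above: it substitutes average term frequency for the "semantic focus", selects single sentences rather than sentence groups, and adds search-term density as a weighted bonus rather than a constraint. The function name, the `query_weight` parameter, and the greedy selection are assumptions made for illustration.

```python
import re
from collections import Counter


def summarize(text, max_words=400, query_terms=(), query_weight=2.0):
    """Greedily pick high-scoring sentences within a word budget.

    Score = average document frequency of the sentence's tokens (a stand-in
    for semantic focus) + query_weight * density of query terms.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokenized = [re.findall(r"[a-z0-9]+", s.lower()) for s in sentences]
    doc_freq = Counter(tok for toks in tokenized for tok in toks)
    query = {q.lower() for q in query_terms}

    def score(tokens):
        if not tokens:
            return 0.0
        focus = sum(doc_freq[t] for t in tokens) / len(tokens)
        density = sum(1 for t in tokens if t in query) / len(tokens)
        return focus + query_weight * density

    ranked = sorted(range(len(sentences)), key=lambda i: score(tokenized[i]), reverse=True)
    chosen, used = [], 0
    for i in ranked:
        length = len(tokenized[i])
        if used + length <= max_words:
            chosen.append(i)
            used += length
    # Emit the chosen sentences in their original order.
    return " ".join(sentences[i] for i in sorted(chosen))
```

Calling `summarize(text)` with no `query_terms` corresponds to the full-text summary; passing the user's search terms corresponds to the search-result summary.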
Step 103: automatically extract Deep Web data. The Deep Web refers to the collection of resources that are stored in network databases and that cannot be reached through hyperlinks but must be accessed through dynamic web page technology. In practical applications, Deep Web content is of greater value, and integrating it as structured data is more meaningful. The invention uses network probe technology to probe site pages continuously, filling in forms automatically and testing the returned data to find the most suitable form. After the form is found, it is submitted automatically and the web pages obtained are compared. Experiments conducted for the invention found that the result pages returned for Deep Web resources on the same website differ very little in structure. Exploiting this property, the DOM trees of the pages obtained before and after are analyzed and the nodes whose content differs are extracted; these are the data to be collected. After the correct data is extracted, the administrator is notified to configure the data format, completing the discovery and collection of the Deep Web site.
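The DOM comparison can be sketched as follows, assuming two result pages from the same site with well-formed markup. Each element is identified by a path of tag names and sibling indices, and the nodes whose text differs between the two pages are reported. The path notation, the list of void tags, and the function names are illustrative assumptions; a real implementation would likely use a full DOM parser and tolerate malformed HTML.

```python
from html.parser import HTMLParser

# Assumption: these tags never have a closing tag and carry no text.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class PathTextCollector(HTMLParser):
    """Maps each element path, e.g. /html[0]/body[0]/p[1], to its direct text."""

    def __init__(self):
        super().__init__()
        self.stack = []            # path segments of the currently open elements
        self.child_counts = [{}]   # per-depth counters of child tag names
        self.texts = {}

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        counts = self.child_counts[-1]
        index = counts.get(tag, 0)
        counts[tag] = index + 1
        self.stack.append(f"{tag}[{index}]")
        self.child_counts.append({})

    def handle_endtag(self, tag):
        # Only close the element if it matches the innermost open one.
        if self.stack and self.stack[-1].split("[")[0] == tag:
            self.stack.pop()
            self.child_counts.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            path = "/" + "/".join(self.stack)
            self.texts[path] = (self.texts.get(path, "") + " " + text).strip()


def dom_text_map(html: str) -> dict:
    collector = PathTextCollector()
    collector.feed(html)
    collector.close()
    return collector.texts


def differing_nodes(page_a: str, page_b: str) -> dict:
    """Return {path: (text_in_a, text_in_b)} for nodes whose content differs."""
    a, b = dom_text_map(page_a), dom_text_map(page_b)
    return {
        path: (a.get(path), b.get(path))
        for path in sorted(set(a) | set(b))
        if a.get(path) != b.get(path)
    }
```

Because result pages from the same site share almost all of their structure, the paths returned by `differing_nodes` point at the record fields, while headings, labels, and navigation drop out of the comparison.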
As shown in Fig. 2, an embodiment of the present invention proposes a system for automatic extraction of deep web data, comprising: an acquisition module 10, for detecting and crawling industry-related data; a parsing and extraction module 20, connected to the acquisition module 10, for parsing WEB pages and extracting semantic summaries; and an automatic extraction module 30, connected to the parsing and extraction module 20, for automatically extracting Deep Web data.
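The way the three modules connect can be pictured as a simple pipeline. The sketch below only illustrates the wiring; the `Page` record, the `connect` helper, and the stand-in stage functions (including the in-memory `FAKE_WEB` dictionary used instead of real network access) are hypothetical names introduced for this example.

```python
import re
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Page:
    url: str
    html: str = ""
    metadata: dict = field(default_factory=dict)
    records: list = field(default_factory=list)


Stage = Callable[[Page], Page]


def connect(acquisition: Stage, parsing: Stage, extraction: Stage) -> Callable[[str], Page]:
    """Chain module 10 -> module 20 -> module 30 into one callable."""
    def run(url: str) -> Page:
        return extraction(parsing(acquisition(Page(url=url))))
    return run


# Stand-in stages for demonstration only; no network access is performed.
FAKE_WEB = {"http://example.test/a": "<title>Demo</title><td>42</td>"}


def acquire(page: Page) -> Page:
    page.html = FAKE_WEB.get(page.url, "")
    return page


def parse(page: Page) -> Page:
    match = re.search(r"<title>(.*?)</title>", page.html)
    if match:
        page.metadata["title"] = match.group(1)
    return page


def extract(page: Page) -> Page:
    page.records = re.findall(r"<td>(.*?)</td>", page.html)
    return page
```

With these stand-ins, `connect(acquire, parse, extract)("http://example.test/a")` returns a `Page` whose metadata and records were filled in by the second and third stages respectively.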
The acquisition module is specifically configured to use a web industry information probe to find candidate websites through URL (Uniform Resource Locator) links and search engines used as springboards, then to verify whether a website, subsite, or subdirectory contains industry-related information and what its relevance density is, and to mine the deep web through website topology, URL structure, and forms, so as to find potential data sources.
The parsing and extraction module is specifically configured to use a web industry information probe to find candidate websites through URL (Uniform Resource Locator) links and search engines used as springboards, then to verify whether a website, subsite, or subdirectory contains industry-related information and what its relevance density is, and to mine the deep web through website topology, URL structure, and forms, so as to find potential data sources.
The automatic extraction module is specifically configured to use network probe technology to probe site pages continuously, filling in forms automatically and testing the returned data to find the most suitable form; after the form is found, to submit it automatically and compare the web pages obtained; and to analyze the DOM trees of the pages obtained before and after and extract the nodes whose content differs between the DOM trees, these nodes being the data to be collected.
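The form-probing behaviour of the automatic extraction module can be sketched as below. The text does not say how the "most suitable" form is chosen, so this example assumes it is the form whose response contains the most records. The `submit` and `count_records` callables are hypothetical hooks supplied by the caller (for instance, an HTTP client and a row counter), and skipping hidden fields is a simplification.

```python
from html.parser import HTMLParser


class FormFinder(HTMLParser):
    """Collects each form's action, method, and fillable field names."""

    IGNORED_TYPES = ("submit", "button", "reset", "hidden")

    def __init__(self):
        super().__init__()
        self.forms = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "form":
            self._current = {
                "action": attr.get("action") or "",
                "method": (attr.get("method") or "get").lower(),
                "fields": [],
            }
        elif tag in ("input", "select", "textarea") and self._current is not None:
            name = attr.get("name")
            field_type = (attr.get("type") or "text").lower()
            if name and field_type not in self.IGNORED_TYPES:
                self._current["fields"].append(name)

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def find_forms(html: str) -> list:
    finder = FormFinder()
    finder.feed(html)
    finder.close()
    return finder.forms


def pick_best_form(forms, probe_value, submit, count_records):
    """Fill every field with `probe_value`, submit, and keep the form
    whose response yields the most records (an assumed criterion)."""
    best, best_count = None, 0
    for form in forms:
        if not form["fields"]:
            continue
        payload = {name: probe_value for name in form["fields"]}
        response_html = submit(form, payload)
        count = count_records(response_html)
        if count > best_count:
            best, best_count = form, count
    return best
```

Once a form has been selected this way, submitting it with two different probe values yields the pair of result pages whose DOM trees are then compared.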
The above is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent replacement or modification made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solution and inventive concept, shall fall within the scope of protection of the present invention.

Claims (10)

1. A method for automatic extraction of deep web data, characterized by comprising the following steps:
detecting and crawling industry-related data;
parsing WEB pages and extracting semantic summaries;
automatically extracting Deep Web data.
2. The method for automatic extraction of deep web data according to claim 1, characterized in that detecting and crawling industry-related data specifically comprises fixed-point collection, in which the user configures known data sources to be collected.
3. The method for automatic extraction of deep web data according to claim 1, characterized in that detecting and crawling industry-related data specifically comprises: using a web industry information probe to find candidate websites through URL (Uniform Resource Locator) links and search engines used as springboards; then verifying whether a website, subsite, or subdirectory contains industry-related information and what its relevance density is; and mining the deep web through website topology, URL structure, and forms, so as to find potential data sources.
4. The method for automatic extraction of deep web data according to claim 1, characterized in that parsing WEB pages and extracting semantic summaries specifically comprises using the HTML specification and vision-based segmentation technology to extract the page's metadata and body text.
5. The method for automatic extraction of deep web data according to claim 1, characterized in that detecting and crawling industry-related data specifically comprises:
using network probe technology to probe site pages continuously, filling in forms automatically and testing the returned data to find the most suitable form; after the form is found, submitting it automatically and comparing the web pages obtained;
analyzing the DOM trees of the pages obtained before and after, and extracting the nodes whose content differs between the DOM trees, to obtain the data to be collected.
6. The method for automatic extraction of deep web data according to claim 5, characterized in that, after the correct data is extracted, the administrator is notified to configure the data format, completing the discovery and collection of the Deep Web site.
7. A system for automatic extraction of deep web data, characterized by comprising:
an acquisition module, for detecting and crawling industry-related data;
a parsing and extraction module, connected to the acquisition module, for parsing WEB pages and extracting semantic summaries;
an automatic extraction module, connected to the parsing and extraction module, for automatically extracting Deep Web data.
8. The system for automatic extraction of deep web data according to claim 7, characterized in that the acquisition module is specifically configured to use a web industry information probe to find candidate websites through URL (Uniform Resource Locator) links and search engines used as springboards, then to verify whether a website, subsite, or subdirectory contains industry-related information and what its relevance density is, and to mine the deep web through website topology, URL structure, and forms, so as to find potential data sources.
9. The system for automatic extraction of deep web data according to claim 7, characterized in that the parsing and extraction module is specifically configured to use a web industry information probe to find candidate websites through URL (Uniform Resource Locator) links and search engines used as springboards, then to verify whether a website, subsite, or subdirectory contains industry-related information and what its relevance density is, and to mine the deep web through website topology, URL structure, and forms, so as to find potential data sources.
10. The system for automatic extraction of deep web data according to claim 7, characterized in that the automatic extraction module is specifically configured to use network probe technology to probe site pages continuously, filling in forms automatically and testing the returned data to find the most suitable form; after the form is found, to submit it automatically and compare the web pages obtained; and to analyze the DOM trees of the pages obtained before and after and extract the nodes whose content differs between the DOM trees, these nodes being the data to be collected.
CN201410537825.7A 2014-10-13 2014-10-13 Method and system for automatic extraction of deep web data Pending CN104317845A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410537825.7A CN104317845A (en) 2014-10-13 2014-10-13 Method and system for automatic extraction of deep web data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410537825.7A CN104317845A (en) 2014-10-13 2014-10-13 Method and system for automatic extraction of deep web data

Publications (1)

Publication Number Publication Date
CN104317845A true CN104317845A (en) 2015-01-28

Family

ID=52373077

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410537825.7A Pending CN104317845A (en) 2014-10-13 2014-10-13 Method and system for automatic extraction of deep web data

Country Status (1)

Country Link
CN (1) CN104317845A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101051313A (en) * 2007-05-09 2007-10-10 崔志明 Integrated data source finding method for deep layer net page data source
CN101231661A (en) * 2008-02-19 2008-07-30 上海估家网络科技有限公司 Method and system for digging object grade knowledge
CN103389998A (en) * 2012-05-11 2013-11-13 安徽华贞信息科技有限公司 Novel Internet commercial intelligence information semantic analysis technology based on cloud service
CN103473369A (en) * 2013-09-27 2013-12-25 清华大学 Semantic-based information acquisition method and semantic-based information acquisition system
CN103838785A (en) * 2012-11-27 2014-06-04 大连灵动科技发展有限公司 Vertical search engine in patent field

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
(美)托比等编: "《数据之美》", 31 October 2010 *
寇月等: "D-EEM:一种基于DOM树的Deep Web实体抽取机制", 《计算机研究与发展》 *
赵朋朋等: "DeepSearcher:一个中文Deep Web分类搜索引擎", 《WEB技术与应用》 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105988994A (en) * 2015-02-06 2016-10-05 北京询达数据科技有限公司 Web field distributed real time extraction system
CN106326225A (en) * 2015-06-16 2017-01-11 阿里巴巴集团控股有限公司 Page data acquisition method and device
CN108090200A (en) * 2017-12-22 2018-05-29 中央财经大学 A kind of sequence type hides the acquisition methods of grid database data
CN108446076A (en) * 2018-01-30 2018-08-24 上海天旦网络科技发展有限公司 Index creation method and system based on web feed data
CN113190500A (en) * 2021-04-23 2021-07-30 广东云智安信科技有限公司 Information accumulation filing system and method based on internet report
CN116822472A (en) * 2023-08-31 2023-09-29 青岛诺亚信息技术有限公司 Method and system for rapidly pulling multi-source data to fill complex interface form
CN116822472B (en) * 2023-08-31 2023-11-17 青岛诺亚信息技术有限公司 Method and system for rapidly pulling multi-source data to fill complex interface form

Similar Documents

Publication Publication Date Title
CN104317845A (en) Method and system for automatic extraction of deep web data
US8321396B2 (en) Automatically extracting by-line information
CN101515300B (en) Method and system for grabbing Ajax webpage content
Malik et al. Information extraction using web usage mining, web scrapping and semantic annotation
CN107423391B (en) Information extraction method of webpage structured data
CN103246732B (en) A kind of abstracting method of online Web news content and system
CN103714176A (en) Webpage text extraction method based on maximum text density
CN105045901A (en) Search keyword push method and device
CN104239298A (en) Text message recommendation method, server, browser and system
CN104182412A (en) Webpage crawling method and webpage crawling system
CN105159930A (en) Search keyword pushing method and apparatus
CN110457579B (en) Webpage denoising method and system based on cooperative work of template and classifier
CN102355488A (en) Crawler seed obtaining method and equipment and crawler crawling method and equipment
CN103853760A (en) Method and device for extracting contents of bodies of web pages
CN103530429A (en) Webpage content extracting method
CN104899219A (en) Screening method and system of pseudo-static URL (Uniform Resource Locator) and webpage crawling method and system
CN105528357A (en) Webpage content extraction method based on similarity of URLs and similarity of webpage document structures
CN104572934A (en) Webpage key content extracting method based on DOM
CN104765882A (en) Internet website statistics method based on web page characteristic strings
CN102915361A (en) Webpage text extracting method based on character distribution characteristic
CN106294885A (en) A kind of data collection towards isomery webpage and mask method
CN105022824A (en) Method and device for recognizing invalid link
CN103297498A (en) Relevant content pushing method based on mobile phone client side
CN103309954A (en) Html webpage based data extracting system
CN103729354B (en) web information processing method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
EXSB Decision made by sipo to initiate substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20150128