CN104317845A - Method and system for automatic extraction of deep web data - Google Patents
- Publication number
- CN104317845A (application number CN201410537825.7A)
- Authority
- CN
- China
- Prior art keywords
- data
- automatic extraction
- deep web
- url
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method and a system for the automatic extraction of deep web data. The method comprises the following steps: detecting and crawling industry-related data; parsing web pages and extracting semantic summaries; and automatically extracting deep web data. Without reducing the volume of industry data collected, the method greatly reduces bandwidth use and the amount of data retrieved, improves the data loading cycle, and improves real-time performance.
Description
Technical field
The present invention relates to the field of grid computing technology, and in particular to a method and system for the automatic extraction of deep web data.
Background
As informatization deepens, enterprises' demand for integrated information grows ever stronger. The steadily growing information resources of the internet contain a vast amount of commercially valuable information and have become an important information source. At present, only a few companies offer products for customized information search and intelligence analysis. These products place high demands on the user's own basic information infrastructure, take a long time to implement, and are costly to build and maintain. Their main customers are very large enterprises and governments; ordinary enterprises cannot afford them.
Summary of the invention
To solve the technical problems described in the background, the present invention proposes a method and system for the automatic extraction of deep web data that greatly reduces the demands the system places on an enterprise's information infrastructure and can be deployed on widely varying enterprise infrastructure.
The method for the automatic extraction of deep web data proposed by the present invention comprises the following steps:
detecting and crawling industry-related data;
parsing web pages and extracting semantic summaries;
automatically extracting deep web data.
Preferably, the detection and crawling of industry-related data is specifically fixed-point collection, in which known data sources configured by the user are collected.
Preferably, the detection and crawling of industry-related data specifically uses a web industry information probe to find candidate websites through URL (Uniform Resource Locator) links and search engines used as a springboard, then verifies whether each website, sub-site, or subdirectory contains company-related information and how dense that relevant information is, and mines the deep web through site topology, URL structure, and forms to find potential data sources.
Preferably, the web page parsing and semantic summary extraction specifically uses the HTML specification and vision-based page segmentation to extract the page's meta information and body text.
Preferably, the detection and crawling of industry-related data specifically comprises:
using network probe technology to continually probe a site's pages, automatically filling in forms and testing the returned data in order to find the most suitable form; once the form is found, submitting it automatically and comparing the retrieved pages;
analyzing the DOM trees of the pages retrieved before and after, and extracting the nodes whose content differs between the trees; these are the data to be collected.
Preferably, after the correct data has been extracted, the administrator is notified to configure the data format, which completes the discovery and collection of the deep web site.
The system for the automatic extraction of deep web data proposed by the present invention comprises:
an acquisition module, for detecting and crawling industry-related data;
a parsing and extraction module, connected to the acquisition module, for parsing web pages and extracting semantic summaries;
an automatic extraction module, connected to the parsing and extraction module, for automatically extracting deep web data.
Preferably, the acquisition module is specifically configured to use a web industry information probe to find candidate websites through URL links and search engines used as a springboard, then verify whether each website, sub-site, or subdirectory contains company-related information and how dense that relevant information is, and mine the deep web through site topology, URL structure, and forms to find potential data sources.
Preferably, the parsing and extraction module is specifically configured to use a web industry information probe to find candidate websites through URL links and search engines used as a springboard, then verify whether each website, sub-site, or subdirectory contains company-related information and how dense that relevant information is, and mine the deep web through site topology, URL structure, and forms to find potential data sources.
Preferably, the automatic extraction module is specifically configured to use network probe technology to continually probe a site's pages, automatically filling in forms and testing the returned data in order to find the most suitable form; once the form is found, to submit it automatically and compare the retrieved pages; and to analyze the DOM trees of the pages retrieved before and after and extract the nodes whose content differs between the trees, these being the data to be collected.
With the present invention, without reducing the volume of industry data collected, bandwidth use and the amount of data retrieved are greatly reduced, the data loading cycle is improved, and real-time performance is improved.
Brief description of the drawings
Fig. 1 is a flowchart of a method for the automatic extraction of deep web data proposed by an embodiment of the present invention;
Fig. 2 is a structural diagram of a system for the automatic extraction of deep web data proposed by an embodiment of the present invention.
Detailed description of the embodiments
As shown in Figure 1, an embodiment of the present invention proposes a method for the automatic extraction of deep web data, comprising the following steps:
Step 101: detect and crawl industry-related data. The present invention provides customized search for enterprises. On the one hand, enterprises' IT foundations vary widely and their resources are relatively limited; on the other hand, only industry-related information is needed, so there is no need to index the entire internet. The present invention therefore detects and crawls industry-related data in two ways. The first is fixed-point collection, in which known data sources configured by the user are collected. The second uses a web industry information probe: drawing on an industry ontology, it finds candidate websites by means such as URL (Uniform Resource Locator) links and search engines used as a springboard, then verifies whether each website, sub-site, or subdirectory contains company-related information and how dense that relevant information is, and mines the deep web through site topology, URL structure, forms, and so on to find potential data sources. A URL is a concise representation of the location of, and access method for, a resource available on the internet; it is the standard address of a resource on the internet. Each file on the internet has a unique URL, which indicates where the file is located and how a browser should handle it. Deep web data is often well structured and easy to analyze, yet it usually cannot be found through general-purpose search engines, so it is of great value to customers.
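The relevance check performed by the industry probe in step 101 can be illustrated with a minimal sketch. The patent does not say how "relevance density" is computed; the sketch assumes it is the share of a page's word tokens that belong to an industry vocabulary. The names `INDUSTRY_TERMS`, `relevance_density`, `candidate_links`, and `is_relevant`, as well as the 0.05 threshold, are hypothetical and chosen only for illustration.

```python
import re
from urllib.parse import urljoin, urlparse

# Hypothetical vocabulary standing in for the "industry ontology".
INDUSTRY_TERMS = {"steel", "alloy", "smelting", "furnace"}


def relevance_density(text: str, terms: set[str]) -> float:
    """Return the fraction of word tokens in `text` that are industry terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in terms)
    return hits / len(tokens)


def is_relevant(text: str, terms: set[str], threshold: float = 0.05) -> bool:
    """Treat a page as industry-related when its density reaches the threshold."""
    return relevance_density(text, terms) >= threshold


def candidate_links(base_url: str, hrefs: list[str]) -> list[str]:
    """Resolve hrefs against base_url and keep unique http(s) URLs in order."""
    seen: set[str] = set()
    result: list[str] = []
    for href in hrefs:
        absolute = urljoin(base_url, href)
        if urlparse(absolute).scheme in ("http", "https") and absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result
```

A crawler built on this sketch would fetch each candidate link, score the page text with `relevance_density`, and keep only the sites, sub-sites, or subdirectories that pass `is_relevant`.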
Step 102: parse web pages and extract semantic summaries. Web page parsing means analyzing tags to parse an HTML (HyperText Markup Language) page and extract its body content. The present invention uses the HTML specification and vision-based page segmentation to extract a page's meta information (such as the title and keywords) and body text, effectively avoiding interference from irrelevant information. The present invention also supports other common data formats well, including XML, PDF, and the MS Office family of formats.
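The extraction of meta information and body text in step 102 can be sketched with Python's standard `html.parser` module. This is only a tag-based approximation: it does not implement the vision-based segmentation the patent refers to, and the set of tags treated as irrelevant (`SKIP`) is an assumption. The names `PageExtractor` and `extract_page` are hypothetical.

```python
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Collect the title, meta keywords, and visible body text of a page."""

    # Assumed set of containers whose text counts as irrelevant information.
    SKIP = {"script", "style", "nav", "footer"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.keywords = []
        self.body_parts = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attr.get("name") or "").lower() == "keywords":
            content = attr.get("content") or ""
            self.keywords = [k.strip() for k in content.split(",") if k.strip()]
        elif tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_title:
            self.title += text
        elif self._skip_depth == 0:
            self.body_parts.append(text)


def extract_page(html):
    """Return a dict with the page's title, keyword list, and body text."""
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title,
        "keywords": parser.keywords,
        "body": " ".join(parser.body_parts),
    }
```

Skipping whole containers such as `nav` and `footer` is a crude stand-in for segmentation; a production system would need a rendering-aware approach to separate content blocks reliably.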
The present invention addresses two kinds of semantic summary. The first is a full-text summary, produced so that customers can browse information easily; the second is the summary shown with a search result. The first kind aims to cover the main information of the document as fully as possible; the second kind does the same but must also take into account factors such as the density of the user's search terms. The present invention applies semantic analysis to every sentence of a document, labeling verbal semantic points, nominal semantic points, and semantic tendency, and then aggregates these into the semantic emphasis of each paragraph and of the whole document. Finally, using this semantic emphasis together with features of the document, and with a word count (for example, 400 words) as the constraint, it selects several "sentence groups" that cover the semantics of the full text as completely as possible to form the full-text summary. The summary for a search result is produced in the same way, with the density of the search terms (including conceptually similar words) added as a further constraint.
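The selection of sentences under a word-count constraint can be sketched as a greedy extractive summarizer. This sketch swaps in plain term frequency for the patent's semantic analysis (verbal and nominal semantic points are not modeled), treats single sentences as the "sentence groups", and boosts sentences by search-term density when query terms are given. The function names and the `query_weight` parameter are hypothetical.

```python
import re
from collections import Counter


def split_sentences(text):
    """Split text into sentences at ., !, or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def summarize(text, max_words=400, query_terms=(), query_weight=5.0):
    """Pick the highest-scoring sentences that fit within max_words.

    Score = average term frequency of the sentence's tokens, multiplied by
    (1 + query_weight * density of query terms in the sentence).
    """
    sentences = split_sentences(text)
    tokenized = [re.findall(r"[a-z0-9]+", s.lower()) for s in sentences]
    freq = Counter(tok for toks in tokenized for tok in toks)
    query = {q.lower() for q in query_terms}

    def score(tokens):
        if not tokens:
            return 0.0
        base = sum(freq[t] for t in tokens) / len(tokens)
        density = sum(1 for t in tokens if t in query) / len(tokens)
        return base * (1.0 + query_weight * density)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(tokenized[i]), reverse=True)
    chosen, used = [], 0
    for i in ranked:
        n = len(tokenized[i])
        if n and used + n <= max_words:
            chosen.append(i)
            used += n
    # Emit the selected sentences in their original document order.
    return " ".join(sentences[i] for i in sorted(chosen))
```

Calling `summarize(text)` corresponds to the full-text summary; passing `query_terms` corresponds to the search-result summary, where sentences dense in the search terms are favored.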
Step 103: automatically extract deep web data. The deep web is the collection of resources that are stored in web databases and cannot be reached through hyperlinks, but must instead be accessed through dynamic web page technology. In practical applications, deep web content is of greater value, and it is more meaningful for integrating structured data. The present invention uses network probe technology to continually probe a site's pages, automatically filling in forms and testing the returned data in order to find the most suitable form. Once the form is found, it is submitted automatically and the retrieved pages are compared. Experiments carried out for the invention found that the result pages returned for deep web resources on the same website differ very little in structure. Exploiting this property, the DOM trees of the pages retrieved before and after are analyzed, and the nodes whose content differs between the trees are extracted; these are the data to be collected. After the correct data has been extracted, the administrator is notified to configure the data format, which completes the discovery and collection of the deep web site.
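The DOM comparison described in step 103 can be sketched by mapping every element to a positional path and comparing the text found at each path in two result pages. The sketch assumes well-formed nesting and near-identical page structure, as the patent reports for result pages of the same site. `PathTextCollector`, `text_by_path`, and `differing_nodes` are hypothetical names.

```python
from html.parser import HTMLParser

# HTML void elements never have a closing tag, so they are not pushed.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class PathTextCollector(HTMLParser):
    """Map each element's positional path to the text directly inside it."""

    def __init__(self):
        super().__init__()
        self.stack = []            # paths of the currently open elements
        self.child_counts = [{}]   # per-level counters of child tag names
        self.texts = {}

    def handle_starttag(self, tag, attrs):
        counts = self.child_counts[-1]
        index = counts.get(tag, 0)
        counts[tag] = index + 1
        parent = self.stack[-1] if self.stack else ""
        path = f"{parent}/{tag}[{index}]"
        if tag not in VOID_TAGS:
            self.stack.append(path)
            self.child_counts.append({})

    def handle_endtag(self, tag):
        # Pop only when the closing tag matches the innermost open element.
        if self.stack and self.stack[-1].rsplit("/", 1)[-1].startswith(tag + "["):
            self.stack.pop()
            self.child_counts.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            path = self.stack[-1]
            self.texts[path] = (self.texts.get(path, "") + " " + text).strip()


def text_by_path(html):
    collector = PathTextCollector()
    collector.feed(html)
    collector.close()
    return collector.texts


def differing_nodes(html_a, html_b):
    """Return {path: (text_in_a, text_in_b)} for nodes whose text differs."""
    a, b = text_by_path(html_a), text_by_path(html_b)
    return {path: (a.get(path), b.get(path))
            for path in sorted(set(a) | set(b))
            if a.get(path) != b.get(path)}
```

Nodes that are identical across result pages, such as headings and navigation, drop out of the comparison, leaving only the record fields that vary from query to query.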
As shown in Figure 2, an embodiment of the present invention proposes a system for the automatic extraction of deep web data, comprising: an acquisition module 10, for detecting and crawling industry-related data; a parsing and extraction module 20, connected to the acquisition module 10, for parsing web pages and extracting semantic summaries; and an automatic extraction module 30, connected to the parsing and extraction module 20, for automatically extracting deep web data.
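The chaining of the three modules in Figure 2 can be sketched as a simple pipeline in which each module consumes the output of the one before it. The class names, the `RawPage` and `ParsedPage` types, and the injected `fetch`, `parse`, and `extract` callables are all hypothetical; the patent specifies only that the modules are connected in this order.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RawPage:
    url: str
    html: str


@dataclass
class ParsedPage:
    url: str
    body: str


class AcquisitionModule:
    """Module 10: detects and crawls industry-related data."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self.fetch = fetch

    def run(self, urls: list[str]) -> list[RawPage]:
        return [RawPage(url, self.fetch(url)) for url in urls]


class ParsingModule:
    """Module 20: parses pages; receives the acquisition module's output."""

    def __init__(self, parse: Callable[[str], str]) -> None:
        self.parse = parse

    def run(self, pages: list[RawPage]) -> list[ParsedPage]:
        return [ParsedPage(p.url, self.parse(p.html)) for p in pages]


class ExtractionModule:
    """Module 30: extracts records; receives the parsing module's output."""

    def __init__(self, extract: Callable[[ParsedPage], dict]) -> None:
        self.extract = extract

    def run(self, pages: list[ParsedPage]) -> list[dict]:
        return [self.extract(p) for p in pages]


def run_pipeline(urls, acquisition, parsing, extraction):
    """Pass data through modules 10, 20, and 30 in order."""
    return extraction.run(parsing.run(acquisition.run(urls)))
```

Injecting the fetch, parse, and extract behavior as callables keeps each module testable in isolation, which suits deployment on varied enterprise infrastructure.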
The acquisition module is specifically configured to use a web industry information probe to find candidate websites through URL links and search engines used as a springboard, then verify whether each website, sub-site, or subdirectory contains company-related information and how dense that relevant information is, and mine the deep web through site topology, URL structure, and forms to find potential data sources.
The parsing and extraction module is specifically configured to use a web industry information probe to find candidate websites through URL links and search engines used as a springboard, then verify whether each website, sub-site, or subdirectory contains company-related information and how dense that relevant information is, and mine the deep web through site topology, URL structure, and forms to find potential data sources.
The automatic extraction module is specifically configured to use network probe technology to continually probe a site's pages, automatically filling in forms and testing the returned data in order to find the most suitable form; once the form is found, to submit it automatically and compare the retrieved pages; and to analyze the DOM trees of the pages retrieved before and after and extract the nodes whose content differs between the trees, these being the data to be collected.
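The form probing performed by the automatic extraction module can be sketched as follows. The sketch finds the forms on a page, fills every field with one probe value, and picks the form whose response is longest. Response length is only an assumed proxy for "most suitable"; the patent does not define the criterion. Only GET forms are probed, the `submit` callable is supplied by the caller, and all names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin


class FormFinder(HTMLParser):
    """Collect each form's action, method, and named input fields."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "form":
            self._current = {
                "action": attr.get("action") or "",
                "method": (attr.get("method") or "get").lower(),
                "fields": [],
            }
        elif self._current is not None and tag in ("input", "select", "textarea"):
            name = attr.get("name")
            kind = (attr.get("type") or "text").lower()
            if name and kind not in ("submit", "button", "reset", "hidden"):
                self._current["fields"].append(name)

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def find_forms(html):
    finder = FormFinder()
    finder.feed(html)
    finder.close()
    return finder.forms


def build_probe_url(page_url, form, value):
    """Build a GET URL that fills every field of the form with `value`."""
    query = urlencode({name: value for name in form["fields"]})
    return urljoin(page_url, form["action"]) + "?" + query


def best_form(page_url, forms, probe_value, submit):
    """Return the GET form whose probe response is longest, or None."""
    scored = [
        (len(submit(build_probe_url(page_url, form, probe_value))), i)
        for i, form in enumerate(forms)
        if form["fields"] and form["method"] == "get"
    ]
    if not scored:
        return None
    return forms[max(scored)[1]]
```

In use, `submit` would perform the HTTP request and return the response body; two responses from the chosen form, obtained with different probe values, are then the pages whose DOM trees get compared.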
The above is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited to it. Any equivalent replacement or change made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solution and inventive concept, shall fall within the scope of protection of the present invention.
Claims (10)
1. A method for the automatic extraction of deep web data, characterized in that it comprises the following steps:
detecting and crawling industry-related data;
parsing web pages and extracting semantic summaries;
automatically extracting deep web data.
2. The method for the automatic extraction of deep web data according to claim 1, characterized in that the detection and crawling of industry-related data is specifically fixed-point collection, in which known data sources configured by the user are collected.
3. The method for the automatic extraction of deep web data according to claim 1, characterized in that the detection and crawling of industry-related data specifically uses a web industry information probe to find candidate websites through URL (Uniform Resource Locator) links and search engines used as a springboard, then verifies whether each website, sub-site, or subdirectory contains company-related information and how dense that relevant information is, and mines the deep web through site topology, URL structure, and forms to find potential data sources.
4. The method for the automatic extraction of deep web data according to claim 1, characterized in that the web page parsing and semantic summary extraction specifically uses the HTML specification and vision-based page segmentation to extract the page's meta information and body text.
5. The method for the automatic extraction of deep web data according to claim 1, characterized in that the detection and crawling of industry-related data specifically comprises:
using network probe technology to continually probe a site's pages, automatically filling in forms and testing the returned data in order to find the most suitable form; once the form is found, submitting it automatically and comparing the retrieved pages;
analyzing the DOM trees of the pages retrieved before and after, and extracting the nodes whose content differs between the trees to obtain the data to be collected.
6. The method for the automatic extraction of deep web data according to claim 5, characterized in that, after the correct data has been extracted, the administrator is notified to configure the data format, which completes the discovery and collection of the deep web site.
7. A system for the automatic extraction of deep web data, characterized in that it comprises:
an acquisition module, for detecting and crawling industry-related data;
a parsing and extraction module, connected to the acquisition module, for parsing web pages and extracting semantic summaries;
an automatic extraction module, connected to the parsing and extraction module, for automatically extracting deep web data.
8. The system for the automatic extraction of deep web data according to claim 7, characterized in that the acquisition module is specifically configured to use a web industry information probe to find candidate websites through URL links and search engines used as a springboard, then verify whether each website, sub-site, or subdirectory contains company-related information and how dense that relevant information is, and mine the deep web through site topology, URL structure, and forms to find potential data sources.
9. The system for the automatic extraction of deep web data according to claim 7, characterized in that the parsing and extraction module is specifically configured to use a web industry information probe to find candidate websites through URL links and search engines used as a springboard, then verify whether each website, sub-site, or subdirectory contains company-related information and how dense that relevant information is, and mine the deep web through site topology, URL structure, and forms to find potential data sources.
10. The system for the automatic extraction of deep web data according to claim 7, characterized in that the automatic extraction module is specifically configured to use network probe technology to continually probe a site's pages, automatically filling in forms and testing the returned data in order to find the most suitable form; once the form is found, to submit it automatically and compare the retrieved pages; and to analyze the DOM trees of the pages retrieved before and after and extract the nodes whose content differs between the trees, these being the data to be collected.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410537825.7A CN104317845A (en) | 2014-10-13 | 2014-10-13 | Method and system for automatic extraction of deep web data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410537825.7A CN104317845A (en) | 2014-10-13 | 2014-10-13 | Method and system for automatic extraction of deep web data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN104317845A true CN104317845A (en) | 2015-01-28 |
Family
ID=52373077
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410537825.7A Pending CN104317845A (en) | 2014-10-13 | 2014-10-13 | Method and system for automatic extraction of deep web data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104317845A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105988994A (en) * | 2015-02-06 | 2016-10-05 | 北京询达数据科技有限公司 | Web field distributed real time extraction system |
CN106326225A (en) * | 2015-06-16 | 2017-01-11 | 阿里巴巴集团控股有限公司 | Page data acquisition method and device |
CN108090200A (en) * | 2017-12-22 | 2018-05-29 | 中央财经大学 | A kind of sequence type hides the acquisition methods of grid database data |
CN108446076A (en) * | 2018-01-30 | 2018-08-24 | 上海天旦网络科技发展有限公司 | Index creation method and system based on web feed data |
CN113190500A (en) * | 2021-04-23 | 2021-07-30 | 广东云智安信科技有限公司 | Information accumulation filing system and method based on internet report |
CN116822472A (en) * | 2023-08-31 | 2023-09-29 | 青岛诺亚信息技术有限公司 | Method and system for rapidly pulling multi-source data to fill complex interface form |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101051313A (en) * | 2007-05-09 | 2007-10-10 | 崔志明 | Integrated data source finding method for deep layer net page data source |
CN101231661A (en) * | 2008-02-19 | 2008-07-30 | 上海估家网络科技有限公司 | Method and system for digging object grade knowledge |
CN103389998A (en) * | 2012-05-11 | 2013-11-13 | 安徽华贞信息科技有限公司 | Novel Internet commercial intelligence information semantic analysis technology based on cloud service |
CN103473369A (en) * | 2013-09-27 | 2013-12-25 | 清华大学 | Semantic-based information acquisition method and semantic-based information acquisition system |
CN103838785A (en) * | 2012-11-27 | 2014-06-04 | 大连灵动科技发展有限公司 | Vertical search engine in patent field |
-
2014
- 2014-10-13 CN CN201410537825.7A patent/CN104317845A/en active Pending
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101051313A (en) * | 2007-05-09 | 2007-10-10 | 崔志明 | Integrated data source finding method for deep layer net page data source |
CN101231661A (en) * | 2008-02-19 | 2008-07-30 | 上海估家网络科技有限公司 | Method and system for digging object grade knowledge |
CN103389998A (en) * | 2012-05-11 | 2013-11-13 | 安徽华贞信息科技有限公司 | Novel Internet commercial intelligence information semantic analysis technology based on cloud service |
CN103838785A (en) * | 2012-11-27 | 2014-06-04 | 大连灵动科技发展有限公司 | Vertical search engine in patent field |
CN103473369A (en) * | 2013-09-27 | 2013-12-25 | 清华大学 | Semantic-based information acquisition method and semantic-based information acquisition system |
Non-Patent Citations (3)
Title |
---|
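(US) Toby et al. (eds.): "Beautiful Data" (《数据之美》), 31 October 2010 * |
Kou Yue et al.: "D-EEM: a DOM-tree-based Deep Web entity extraction mechanism" (D-EEM:一种基于DOM树的Deep Web实体抽取机制), 《计算机研究与发展》 (Journal of Computer Research and Development) * |
Zhao Pengpeng et al.: "DeepSearcher: a Chinese Deep Web classified search engine" (DeepSearcher:一个中文Deep Web分类搜索引擎), 《WEB技术与应用》 (Web Technology and Applications) * |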
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105988994A (en) * | 2015-02-06 | 2016-10-05 | 北京询达数据科技有限公司 | Web field distributed real time extraction system |
CN106326225A (en) * | 2015-06-16 | 2017-01-11 | 阿里巴巴集团控股有限公司 | Page data acquisition method and device |
CN108090200A (en) * | 2017-12-22 | 2018-05-29 | 中央财经大学 | A kind of sequence type hides the acquisition methods of grid database data |
CN108446076A (en) * | 2018-01-30 | 2018-08-24 | 上海天旦网络科技发展有限公司 | Index creation method and system based on web feed data |
CN113190500A (en) * | 2021-04-23 | 2021-07-30 | 广东云智安信科技有限公司 | Information accumulation filing system and method based on internet report |
CN116822472A (en) * | 2023-08-31 | 2023-09-29 | 青岛诺亚信息技术有限公司 | Method and system for rapidly pulling multi-source data to fill complex interface form |
CN116822472B (en) * | 2023-08-31 | 2023-11-17 | 青岛诺亚信息技术有限公司 | Method and system for rapidly pulling multi-source data to fill complex interface form |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104317845A (en) | Method and system for automatic extraction of deep web data | |
US8321396B2 (en) | Automatically extracting by-line information | |
CN101515300B (en) | Method and system for grabbing Ajax webpage content | |
Malik et al. | Information extraction using web usage mining, web scrapping and semantic annotation | |
CN107423391B (en) | Information extraction method of webpage structured data | |
CN103246732B (en) | A kind of abstracting method of online Web news content and system | |
CN103714176A (en) | Webpage text extraction method based on maximum text density | |
CN105045901A (en) | Search keyword push method and device | |
CN104239298A (en) | Text message recommendation method, server, browser and system | |
CN104182412A (en) | Webpage crawling method and webpage crawling system | |
CN105159930A (en) | Search keyword pushing method and apparatus | |
CN110457579B (en) | Webpage denoising method and system based on cooperative work of template and classifier | |
CN102355488A (en) | Crawler seed obtaining method and equipment and crawler crawling method and equipment | |
CN103853760A (en) | Method and device for extracting contents of bodies of web pages | |
CN103530429A (en) | Webpage content extracting method | |
CN104899219A (en) | Screening method and system of pseudo-static URL (Uniform Resource Locator) and webpage crawling method and system | |
CN105528357A (en) | Webpage content extraction method based on similarity of URLs and similarity of webpage document structures | |
CN104572934A (en) | Webpage key content extracting method based on DOM | |
CN104765882A (en) | Internet website statistics method based on web page characteristic strings | |
CN102915361A (en) | Webpage text extracting method based on character distribution characteristic | |
CN106294885A (en) | A kind of data collection towards isomery webpage and mask method | |
CN105022824A (en) | Method and device for recognizing invalid link | |
CN103297498A (en) | Relevant content pushing method based on mobile phone client side | |
CN103309954A (en) | Html webpage based data extracting system | |
CN103729354B (en) | web information processing method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
EXSB | Decision made by sipo to initiate substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20150128 |