CN104317845A

CN104317845A - Method and system for automatic extraction of deep web data

Info

Publication number: CN104317845A
Application number: CN201410537825.7A
Authority: CN
Inventors: 贾岩
Original assignee: ANHUI HUAZHEN INFORMATION SCIENCE & TECHNOLOGY Co Ltd
Current assignee: ANHUI HUAZHEN INFORMATION SCIENCE & TECHNOLOGY Co Ltd
Priority date: 2014-10-13
Filing date: 2014-10-13
Publication date: 2015-01-28

Abstract

The invention discloses a method and a system for automatic extraction of deep web data. The method comprises the following steps: detecting and crawling industry-related data; parsing web pages and extracting semantic summaries; and automatically extracting Deep Web data. Without reducing the amount of industry data collected, the method greatly reduces bandwidth use and the amount of data retrieved, improves the data loading cycle, and improves real-time performance.

Description

A method and system for automatic extraction of deep web data

Technical field

The present invention relates to the technical field of grid computing, and in particular to a method and system for automatic extraction of deep web data.

Background art

As informatization deepens, enterprises increasingly want integrated information systems. The steadily growing information resources of the Internet contain a vast amount of commercially valuable information and have become an important information source. At present, few companies provide products for customized information search and intelligence analysis, and those products place high demands on the user's own basic information infrastructure, take a long time to implement, and are costly to build and maintain. Their main customers are very large enterprises and governments; ordinary enterprises cannot afford them.

Summary of the invention

To solve the technical problems described in the background art, the present invention proposes a method and system for automatic extraction of deep web data that greatly reduces the demands the system places on an enterprise's information infrastructure, so that it can be deployed on widely varying enterprise infrastructure.

The method for automatic extraction of deep web data proposed by the present invention comprises the following steps:

detecting and crawling industry-related data;

parsing web pages and extracting semantic summaries;

automatically extracting Deep Web data.

Preferably, the detecting and crawling of industry-related data is specifically fixed-point collection, in which known data sources configured by the user are collected.

Preferably, the detecting and crawling of industry-related data specifically uses a web industry-information probe: candidate websites are found by means of URL (Uniform Resource Locator) links and search engines used as a springboard; each website, sub-site, or subdirectory is then verified to determine whether it carries company-related information and what its relevance density is; and the deep web is mined through website topology, URL structure, and forms in order to find potential data sources.

Preferably, the parsing of web pages and extracting of semantic summaries specifically uses the HTML specification and a vision-based segmentation technique to extract the meta-information and body text of the page.

Preferably, the detecting and crawling of industry-related data specifically comprises:

using a network probe technique to continuously probe the pages of a site, automatically filling in forms and testing the returned data to find the most suitable form; after the form is found, submitting it automatically and comparing the pages obtained;

analyzing the DOM trees of the pages obtained before and after, and extracting the nodes whose content differs between the DOM trees; these nodes are the data to be collected.

Preferably, after the correct data has been extracted, the administrator is notified to configure the data format, which completes the discovery and collection of the Deep Web site.

The system for automatic extraction of deep web data proposed by the present invention comprises:

an acquisition module, configured to detect and crawl industry-related data;

a parsing and extraction module, connected to the acquisition module and configured to parse web pages and extract semantic summaries;

an automatic extraction module, connected to the parsing and extraction module and configured to automatically extract Deep Web data.

Preferably, the acquisition module is specifically configured to use a web industry-information probe: to find candidate websites by means of URL (Uniform Resource Locator) links and search engines used as a springboard; then to verify whether each website, sub-site, or subdirectory carries company-related information and what its relevance density is; and to mine the deep web through website topology, URL structure, and forms in order to find potential data sources.

Preferably, the parsing and extraction module is specifically configured to use a web industry-information probe: to find candidate websites by means of URL (Uniform Resource Locator) links and search engines used as a springboard; then to verify whether each website, sub-site, or subdirectory carries company-related information and what its relevance density is; and to mine the deep web through website topology, URL structure, and forms in order to find potential data sources.

Preferably, the automatic extraction module is specifically configured to use a network probe technique to continuously probe the pages of a site, automatically filling in forms and testing the returned data to find the most suitable form; after the form is found, to submit it automatically and compare the pages obtained; and to analyze the DOM trees of the pages obtained before and after and extract the nodes whose content differs between the DOM trees, these nodes being the data to be collected.

Without reducing the amount of industry data collected, the present invention greatly reduces bandwidth use and the amount of data retrieved, improves the data loading cycle, and improves real-time performance.

Brief description of the drawings

Fig. 1 is a flowchart of a method for automatic extraction of deep web data according to an embodiment of the present invention;

Fig. 2 is a structural diagram of a system for automatic extraction of deep web data according to an embodiment of the present invention.

Detailed description of the embodiments

As shown in Fig. 1, an embodiment of the present invention proposes a method for automatic extraction of deep web data, comprising the following steps:

Step 101: detect and crawl industry-related data. The present invention provides customized search for enterprises. On the one hand, the information infrastructure of enterprises varies widely and their resources are relatively limited; on the other hand, only industry-related information is needed, and there is no need to index the entire Internet. The present invention therefore detects and crawls industry-related data in two ways. The first is fixed-point collection, in which known data sources configured by the user are collected. The second uses a web industry-information probe which, drawing on an industry ontology, finds candidate websites by means such as URL (Uniform Resource Locator) links and search engines used as a springboard; verifies whether each website, sub-site, or subdirectory carries company-related information and what its relevance density is; and mines the deep web through website topology, URL structure, forms, and the like in order to find potential data sources. A URL is a concise representation of the location of a resource available on the Internet and of the method for accessing it; it is the standard address of a resource on the Internet. Each file on the Internet has a unique URL, which indicates where the file is located and how a browser should handle it. Much deep web content is well-structured data that is easy to analyze and often cannot be found through general-purpose search engines, so it is of great value to customers.
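The verification part of the probe can be illustrated with a minimal Python sketch. The text does not define how relevance density is computed, so the sketch assumes it is the share of a page's tokens that appear in a set of industry terms, and it assumes a site, sub-site, or subdirectory can be keyed by host plus first path segment. The names `relevance_density`, `site_section`, `rank_sections`, and the 0.05 threshold are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse


def relevance_density(text, industry_terms):
    """Assumed definition: fraction of tokens that are industry terms."""
    tokens = [t.strip(".,;:!?()\"'").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in industry_terms)
    return hits / len(tokens)


def site_section(url):
    """Key a URL by host plus first directory (empty for top-level pages)."""
    parts = urlparse(url)
    segments = [s for s in parts.path.split("/") if s]
    first = segments[0] if len(segments) > 1 else ""
    return f"{parts.netloc}/{first}"


def rank_sections(pages, industry_terms, threshold=0.05):
    """Average the density per section and keep sections at or above threshold.

    `pages` maps URL to already-fetched page text; fetching is out of scope.
    """
    by_section = defaultdict(list)
    for url, text in pages.items():
        by_section[site_section(url)].append(
            relevance_density(text, industry_terms)
        )
    scored = {s: sum(v) / len(v) for s, v in by_section.items()}
    kept = [(s, d) for s, d in scored.items() if d >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)
```

Sections that pass the threshold would become candidate data sources; an industry ontology, as the text describes, would replace the flat term set used here.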

Step 102: parse web pages and extract semantic summaries. Web page parsing means analyzing tags to parse the HTML (HyperText Markup Language) page and extract its body content. The present invention uses the HTML specification and a vision-based segmentation technique to extract the meta-information (such as the title and keywords) and body text of the page, effectively avoiding interference from irrelevant information. The present invention also supports other common data formats well, including XML, PDF, and the MS Office family of formats.
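The tag-based part of this step can be sketched with Python's standard `html.parser` module. The sketch covers only the extraction of the title, `<meta>` name/content pairs, and visible text; the vision-based segmentation mentioned above is not implemented, and the names `PageExtractor` and `extract_page` are hypothetical.

```python
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Collect the title, <meta> name/content pairs, and visible text."""

    SKIPPED = {"script", "style", "noscript"}  # content that is never body text

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content"):
            self.meta[attrs["name"].lower()] = attrs["content"]
        elif tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data.strip()
        elif self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract_page(html):
    """Return the page's title, meta-information, and body text."""
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title,
        "meta": parser.meta,
        "body": " ".join(parser.text_parts),
    }
```

A fuller version would also discard navigation bars, advertisements, and similar blocks, which is the role the text assigns to visual segmentation.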

The semantic summary problem arises in two forms in the present invention. The first is a full-text summary produced to make it easier for customers to browse information; the second is the summary shown with a search result. The first kind aims to cover the main information of the document as fully as possible; the second kind, on top of the first, must also take into account factors such as the density of the user's search terms. The present invention uses semantic analysis to analyze every sentence of the document, marking verbal semantic points, nominal semantic points, and semantic tendency, and then aggregates these into the semantic emphasis of each paragraph and of the whole document. Finally, using the semantic emphasis together with the features of the document, and with a word count (for example, 400 words) as the constraint, it selects several "sentence groups" that cover the semantics of the full text as fully as possible to form the summary. The summary for a search result differs in implementation in that it adds the density of the search terms (including conceptually similar words) as a further constraint.
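The selection under a word-count constraint can be sketched as follows. This is not the semantic analysis described above: as a stand-in, the sketch scores each sentence by average term frequency, which is a common extractive heuristic, and adds an optional bonus for the density of search terms. The names `summarize`, `query_terms`, and `query_weight` are hypothetical, and the tokenizer assumes English text.

```python
import re
from collections import Counter


def split_sentences(text):
    """Split on whitespace that follows sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    return re.findall(r"[a-z0-9]+", sentence.lower())


def summarize(text, max_words=400, query_terms=(), query_weight=2.0):
    """Greedily pick the best-scoring sentences that fit within max_words."""
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in tokenize(s))
    query = {q.lower() for q in query_terms}

    def score(sentence):
        words = tokenize(sentence)
        if not words:
            return 0.0
        base = sum(freq[w] for w in words) / len(words)
        # Search-result summaries add the density of search terms.
        density = sum(1 for w in words if w in query) / len(words)
        return base + query_weight * density

    ranked = sorted(
        range(len(sentences)),
        key=lambda i: score(sentences[i]),
        reverse=True,
    )
    chosen, used = [], 0
    for i in ranked:
        n = len(sentences[i].split())
        if used + n <= max_words:
            chosen.append(i)
            used += n
    # Emit the chosen sentences in their original document order.
    return " ".join(sentences[i] for i in sorted(chosen))
```

Calling `summarize(text)` corresponds to the full-text summary; passing `query_terms` corresponds to the search-result summary with the added search-term density.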

Step 103: automatically extract Deep Web data. The Deep Web refers to the collection of resources that are stored in network databases and are accessed not through hyperlinks but through dynamic web page technology. In practical applications the content of the Deep Web is of greater value, and integrating this content into structured data is more meaningful. The present invention uses a network probe technique to continuously probe the pages of a site, automatically filling in forms and testing the returned data to find the most suitable form. After the form is found, it is submitted automatically and the pages obtained are compared. Experiments conducted for the invention found that the result pages returned for Deep Web resources of the same website differ very little in structure. Exploiting this property, the DOM trees of the pages obtained before and after are analyzed and the nodes whose content differs between the DOM trees are extracted; these nodes are the data to be collected. After the correct data has been extracted, the administrator is notified to configure the data format, which completes the discovery and collection of the Deep Web site.
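The DOM comparison can be sketched in Python as below. The sketch assumes the two result pages are well-formed and share the same structure, as the experiments described above suggest, so that a node can be identified by its path of tag names and sibling indexes. The names `PathTextCollector`, `text_by_path`, and `differing_nodes` are hypothetical.

```python
from html.parser import HTMLParser

VOID = {"br", "hr", "img", "input", "meta", "link"}  # tags with no end tag


class PathTextCollector(HTMLParser):
    """Map each element path, such as html[0]/body[0]/p[1], to its text."""

    def __init__(self):
        super().__init__()
        self.stack = []        # path segments of the currently open elements
        self.counters = [{}]   # per-depth count of child tags seen so far
        self.texts = {}

    def handle_starttag(self, tag, attrs):
        seen = self.counters[-1]
        index = seen.get(tag, 0)
        seen[tag] = index + 1
        if tag in VOID:
            return
        self.stack.append(f"{tag}[{index}]")
        self.counters.append({})

    def handle_endtag(self, tag):
        if tag not in VOID and self.stack and self.stack[-1].startswith(tag + "["):
            self.stack.pop()
            self.counters.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            path = "/".join(self.stack)
            self.texts[path] = (self.texts.get(path, "") + " " + text).strip()


def text_by_path(html):
    collector = PathTextCollector()
    collector.feed(html)
    collector.close()
    return collector.texts


def differing_nodes(page_a, page_b):
    """Return {path: (text_a, text_b)} for nodes whose text differs."""
    a, b = text_by_path(page_a), text_by_path(page_b)
    return {
        path: (a.get(path, ""), b.get(path, ""))
        for path in sorted(set(a) | set(b))
        if a.get(path, "") != b.get(path, "")
    }
```

Nodes that are identical across the two pages (headings, column labels, navigation) drop out, and what remains is the candidate data that the administrator is then asked to confirm and format.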

As shown in Fig. 2, an embodiment of the present invention proposes a system for automatic extraction of deep web data, comprising: an acquisition module 10, configured to detect and crawl industry-related data; a parsing and extraction module 20, connected to the acquisition module 10 and configured to parse web pages and extract semantic summaries; and an automatic extraction module 30, connected to the parsing and extraction module 20 and configured to automatically extract Deep Web data.
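The chaining of the three modules can be sketched as a simple pipeline. The interfaces are assumptions made for illustration: each module is modeled as a callable that takes a list and returns a list, and the class name `ExtractionSystem` and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExtractionSystem:
    """Three modules connected in sequence, as in Fig. 2."""

    acquire: Callable[[list], list]              # acquisition module 10
    parse_and_summarize: Callable[[list], list]  # parsing and extraction module 20
    extract_deep_web: Callable[[list], list]     # automatic extraction module 30

    def run(self, seeds):
        # The output of each module is the input of the next.
        pages = self.acquire(seeds)
        documents = self.parse_and_summarize(pages)
        return self.extract_deep_web(documents)
```

Keeping the modules behind plain callables lets each one be replaced or tested on its own, which suits deployment on varying enterprise infrastructure.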

The acquisition module is specifically configured to use a web industry-information probe: to find candidate websites by means of URL (Uniform Resource Locator) links and search engines used as a springboard; then to verify whether each website, sub-site, or subdirectory carries company-related information and what its relevance density is; and to mine the deep web through website topology, URL structure, and forms in order to find potential data sources.

The parsing and extraction module is specifically configured to use a web industry-information probe: to find candidate websites by means of URL (Uniform Resource Locator) links and search engines used as a springboard; then to verify whether each website, sub-site, or subdirectory carries company-related information and what its relevance density is; and to mine the deep web through website topology, URL structure, and forms in order to find potential data sources.

The automatic extraction module is specifically configured to use a network probe technique to continuously probe the pages of a site, automatically filling in forms and testing the returned data to find the most suitable form; after the form is found, to submit it automatically and compare the pages obtained; and to analyze the DOM trees of the pages obtained before and after and extract the nodes whose content differs between the DOM trees, these nodes being the data to be collected.
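The form-probing part of this module can be sketched as two steps: finding the forms on a page, and building the requests that would submit trial values. The sketch sends nothing over the network. Filling every field with the same trial value is a simplifying assumption, and the names `FormFinder`, `find_forms`, and `probe_requests` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin


class FormFinder(HTMLParser):
    """Collect each form's action, method, and fillable field names."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "action": attrs.get("action") or "",
                "method": (attrs.get("method") or "get").lower(),
                "fields": [],
            }
        elif self._current is not None and tag in ("input", "select", "textarea"):
            name = attrs.get("name")
            # Buttons carry no query data, so they are not probe targets.
            if name and attrs.get("type", "text") not in ("submit", "button", "reset"):
                self._current["fields"].append(name)

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def find_forms(html):
    finder = FormFinder()
    finder.feed(html)
    finder.close()
    return finder.forms


def probe_requests(page_url, form, probe_values):
    """Yield (method, url, body) for each trial value; nothing is sent."""
    target = urljoin(page_url, form["action"])
    for value in probe_values:
        data = urlencode({name: value for name in form["fields"]})
        if form["method"] == "get":
            yield ("get", f"{target}?{data}", "")
        else:
            yield ("post", target, data)
```

The responses to these requests would then be compared with each other to decide which form is the most suitable one and to locate the nodes that carry data.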

The above is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited to it. Any equivalent replacement or change made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solution and inventive concept, shall fall within the scope of protection of the present invention.

Claims

1. A method for automatic extraction of deep web data, characterized by comprising the following steps:

detecting and crawling industry-related data;

parsing web pages and extracting semantic summaries;

automatically extracting Deep Web data.

2. The method for automatic extraction of deep web data according to claim 1, characterized in that the detecting and crawling of industry-related data is specifically fixed-point collection, in which known data sources configured by the user are collected.

3. The method for automatic extraction of deep web data according to claim 1, characterized in that the detecting and crawling of industry-related data specifically uses a web industry-information probe: candidate websites are found by means of URL (Uniform Resource Locator) links and search engines used as a springboard; each website, sub-site, or subdirectory is then verified to determine whether it carries company-related information and what its relevance density is; and the deep web is mined through website topology, URL structure, and forms in order to find potential data sources.

4. The method for automatic extraction of deep web data according to claim 1, characterized in that the parsing of web pages and extracting of semantic summaries specifically uses the HTML specification and a vision-based segmentation technique to extract the meta-information and body text of the page.

5. The method for automatic extraction of deep web data according to claim 1, characterized in that the detecting and crawling of industry-related data specifically comprises:

using a network probe technique to continuously probe the pages of a site, automatically filling in forms and testing the returned data to find the most suitable form; after the form is found, submitting it automatically and comparing the pages obtained;

analyzing the DOM trees of the pages obtained before and after, and extracting the nodes whose content differs between the DOM trees, to obtain the data to be collected.

6. The method for automatic extraction of deep web data according to claim 5, characterized in that, after the correct data has been extracted, the administrator is notified to configure the data format, which completes the discovery and collection of the Deep Web site.

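7. A system for automatic extraction of deep web data, characterized by comprising:

an acquisition module, configured to detect and crawl industry-related data;

a parsing and extraction module, connected to the acquisition module and configured to parse web pages and extract semantic summaries;

an automatic extraction module, connected to the parsing and extraction module and configured to automatically extract Deep Web data.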

8. The system for automatic extraction of deep web data according to claim 7, characterized in that the acquisition module is specifically configured to use a web industry-information probe: to find candidate websites by means of URL (Uniform Resource Locator) links and search engines used as a springboard; then to verify whether each website, sub-site, or subdirectory carries company-related information and what its relevance density is; and to mine the deep web through website topology, URL structure, and forms in order to find potential data sources.

9. The system for automatic extraction of deep web data according to claim 7, characterized in that the parsing and extraction module is specifically configured to use a web industry-information probe: to find candidate websites by means of URL (Uniform Resource Locator) links and search engines used as a springboard; then to verify whether each website, sub-site, or subdirectory carries company-related information and what its relevance density is; and to mine the deep web through website topology, URL structure, and forms in order to find potential data sources.

10. The system for automatic extraction of deep web data according to claim 7, characterized in that the automatic extraction module is specifically configured to use a network probe technique to continuously probe the pages of a site, automatically filling in forms and testing the returned data to find the most suitable form; after the form is found, to submit it automatically and compare the pages obtained; and to analyze the DOM trees of the pages obtained before and after and extract the nodes whose content differs between the DOM trees, these nodes being the data to be collected.