CN104182467A - Network data source detection method - Google Patents
Network data source detection method
- Publication number
- CN104182467A (application CN201410348451.4A)
- Authority
- CN
- China
- Prior art keywords
- data
- network
- network data
- website
- data source
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Debugging And Monitoring (AREA)
Abstract
The invention provides a network data source detection method that addresses the difficulty of using internet data effectively, which arises because internet data has low value density and contains a large amount of complicated information. The method comprises the following steps: S1, setting up an industry information network probe, automatically detecting network data according to a preset industry ontology, and determining candidate websites; S2, verifying the data of the candidate websites and screening target network data sources; S3, formulating an acquisition strategy and performing customized website acquisition on the target network data sources. The industry information network probe can mine the deep web and analyze potential data sources. It mines the deep web with a heuristic progressive scanning method comprising the following steps: S11, continuously probing the same website and automatically filling in forms; S12, testing the returned data, confirming the form format, and submitting the form; S13, building a DOM (Document Object Model) tree from the submitted form and extracting the nodes whose content differs for data acquisition; S14, notifying an administrator to configure the data format and setting the acquisition mode for the deep website.
Description
Technical field
The present invention relates to the field of network data detection technology, and in particular to a network data source detection method.
Background art
As informatization deepens, enterprises have an increasingly strong appetite for "big data" analysis services. The steadily growing information resources of the internet contain a vast amount of commercially valuable information and have become an important source for business intelligence services. However, the internet, as the main carrier of big data, presents several difficulties: the data volume is huge, the data is hard to obtain, its unit value is relatively low, and it consists almost entirely of unstructured data such as text. As a result, its value has not been fully developed and exploited by industry.
As the internet develops, individuals and enterprises obtain more and more information, but the value density of internet data is relatively low. Faced with such numerous and complicated information, users who lack effective source detection and quality judgment mechanisms often cannot extract truly useful information and therefore cannot use it effectively.
Summary of the invention
To address the problem described in the background art, the present invention proposes a network data source detection method. It solves the problem that internet data has low value density and contains numerous and complicated information, which makes it difficult to use effectively.
The network data source detection method proposed by the present invention detects network data by means of automatic discovery with a network probe, and comprises:
S1, setting up an industry information network probe, automatically detecting network data according to a preset industry ontology, and determining candidate websites;
S2, verifying the data of the candidate websites and screening target network data sources;
S3, formulating an acquisition strategy and performing customized website acquisition on the target network data sources.
Preferably, in step S1, the industry information network probe discovers candidate websites through URL links and/or by using search engines as a springboard.
Preferably, in step S1, the industry information network probe can mine the deep web and analyze potential data sources.
Preferably, the industry information network probe mines the deep web using a heuristic progressive scanning method, comprising:
S11, continuously probing the same website and automatically filling in forms;
S12, testing the returned data, confirming the form format, and submitting the form;
S13, building a DOM tree from the submitted form, and extracting the nodes of the DOM tree whose content differs for data acquisition;
S14, notifying an administrator to configure the data format, and setting the acquisition mode for the deep website.
Preferably, in step S2, the domain name, directories, and URL structure of a website are analyzed and combined with text classification and the distribution density of industry vocabulary to judge whether the data under the website or one of its directories is industry data and to determine its industry information density. The value of the data source is thereby comprehensively assessed, and target network data sources are screened.
Preferably, in step S3, a reliability weight is set for each target network data source, and the acquisition strategy is formulated according to the reliability weight and the value of the target network data source.
Preferably, the network data sources comprise websites, news, blogs, and forums.
The present invention detects network data according to an industry ontology, which narrows the detection scope and improves detection efficiency. By verifying and screening the data of candidate websites, it can cover authoritative, quality-assured websites with a high density of target information, perform targeted network data acquisition, and extract truly useful information. The invention solves the problem of analyzing and extracting big data from the network for enterprises and achieves effective use of internet data.
Brief description of the drawings
Fig. 1 is a flowchart of the network data source detection method proposed by the present invention.
Embodiment
Referring to Fig. 1, the network data source detection method proposed by the present invention detects network data by means of automatic discovery with a network probe, and comprises the following steps:
S1, setting up an industry information network probe, automatically detecting network data according to a preset industry ontology, and determining candidate websites;
S2, verifying the data of the candidate websites and screening target network data sources;
S3, formulating an acquisition strategy and performing customized website acquisition on the target network data sources.
In step S1, the industry information network probe discovers candidate websites through URL (Uniform Resource Locator) links and/or by using search engines as a springboard. The probe can also mine the deep web and analyze potential data sources.
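Discovery through URL links in step S1 could be sketched as follows. This is only an illustrative sketch, not the patent's implementation: the patent does not say how links are parsed, and the names `LinkCollector` and `candidate_hosts`, as well as the sample HTML and host names, are assumptions introduced here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects absolute http(s) URLs from <a href="..."> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL.
        absolute = urljoin(self.base_url, href)
        if urlparse(absolute).scheme in ("http", "https"):
            self.links.append(absolute)


def candidate_hosts(page_url, html):
    """Return the external host names linked from a page, in first-seen order."""
    collector = LinkCollector(page_url)
    collector.feed(html)
    own_host = urlparse(page_url).netloc
    seen = []
    for link in collector.links:
        host = urlparse(link).netloc
        if host and host != own_host and host not in seen:
            seen.append(host)
    return seen


# Made-up page content for illustration.
sample = (
    '<a href="/about">About</a>'
    '<a href="http://cable-news.example/prices">Prices</a>'
    '<a href="mailto:x@example.org">Mail</a>'
    '<a href="http://cable-news.example/forum">Forum</a>'
)
print(candidate_hosts("http://portal.example/index.html", sample))
```

Each host returned this way would only be a candidate; whether it becomes a target data source is decided by the screening in step S2.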
The industry information network probe mines the deep web using a heuristic progressive scanning method, which comprises the following steps:
S11, continuously probing the same website and automatically filling in forms;
S12, testing the returned data, confirming the form format, and submitting the form;
S13, building a DOM (Document Object Model) tree from the submitted form, and extracting the nodes of the DOM tree whose content differs for data acquisition;
S14, notifying an administrator to configure the data format, and setting the acquisition mode for the deep website.
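One possible reading of step S13 is that the same form is submitted with different values and the resulting pages are compared, so that nodes whose text changes are treated as data and unchanged nodes as page template. The sketch below follows that reading. It is an assumption, not a method confirmed by the patent, and it presumes well-formed markup. The names `PathTextParser`, `text_by_path`, and `varying_nodes` are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag and carry no text of their own.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class PathTextParser(HTMLParser):
    """Maps each element path (e.g. '/html[0]/body[0]') to its direct text."""

    def __init__(self):
        super().__init__()
        self.stack = []            # paths of the currently open elements
        self.child_counts = [{}]   # per open element: tag -> children seen so far
        self.texts = {}

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        counts = self.child_counts[-1]
        index = counts.get(tag, 0)
        counts[tag] = index + 1
        parent = self.stack[-1] if self.stack else ""
        path = f"{parent}/{tag}[{index}]"
        self.stack.append(path)
        self.child_counts.append({})
        self.texts[path] = ""

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.stack:
            return
        self.stack.pop()
        self.child_counts.pop()

    def handle_data(self, data):
        if self.stack and data.strip():
            self.texts[self.stack[-1]] += data.strip()


def text_by_path(html):
    parser = PathTextParser()
    parser.feed(html)
    parser.close()
    return parser.texts


def varying_nodes(page_a, page_b):
    """Paths present in both pages whose text differs between them."""
    a, b = text_by_path(page_a), text_by_path(page_b)
    return sorted(p for p in a.keys() & b.keys() if a[p] != b[p])


# Two made-up result pages returned for two different form inputs.
page_a = "<html><body><h1>Search results</h1><div><p>Copper cable</p><p>12.5</p></div></body></html>"
page_b = "<html><body><h1>Search results</h1><div><p>Optical fibre</p><p>7.9</p></div></body></html>"
print(varying_nodes(page_a, page_b))
```

The heading is identical in both pages and is therefore ignored, while the two paragraph nodes differ and would be handed to data acquisition.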
In step S1, detecting network data according to the industry ontology narrows the detection scope and improves detection efficiency. Deep web mining is performed only when the detected network data meets the requirements, so that important data is not missed and no time is wasted on fruitless work. Without reducing the amount of industry data collected, this strategy greatly saves bandwidth and data retrieval volume, shortens the data loading cycle, and improves timeliness.
In step S2, the domain name, directories, and URL structure of a website are analyzed and combined with text classification and the distribution density of industry vocabulary to judge whether the data under the website or one of its directories is industry data and to determine its industry information density. The value of the data source is thereby comprehensively assessed and target network data sources are screened. Low-value network data sources are discarded, which further narrows the range of usable data sources and improves information extraction efficiency.
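The screening in step S2 could be approximated as in the sketch below, which covers only the URL structure and vocabulary density parts and leaves out text classification. The patent gives no formula, so everything here is an assumption made for illustration: the vocabulary in `INDUSTRY_TERMS`, the definition of density as the share of tokens found in that vocabulary, the 0.2 threshold, and the rule that either signal is sufficient.

```python
import re
from urllib.parse import urlparse

# Hypothetical industry vocabulary for the cable industry (illustrative only).
INDUSTRY_TERMS = {"cable", "conductor", "copper", "insulation", "voltage"}


def term_density(text, terms=INDUSTRY_TERMS):
    """Fraction of word tokens in the text that belong to the industry vocabulary."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in terms)
    return hits / len(tokens)


def url_segments(url):
    """Split a URL into host labels and directory names for structure analysis."""
    parsed = urlparse(url)
    host_labels = parsed.netloc.lower().split(".")
    directories = [part for part in parsed.path.lower().split("/") if part]
    return host_labels + directories


def is_industry_source(url, text, threshold=0.2):
    """Accept a source if its URL names an industry term or its text is dense enough."""
    url_hit = any(segment in INDUSTRY_TERMS for segment in url_segments(url))
    return url_hit or term_density(text) >= threshold


# Made-up inputs for illustration.
headline = "Copper cable prices rose as conductor demand grew"
print(term_density(headline))
print(is_industry_source("http://www.example.com/cable/news/", "Weekly update"))
print(is_industry_source("http://blog.example.com/travel/", "We visited the lake"))
```

A real screening step would presumably weight these signals rather than combine them with a plain `or`; the simple rule is used here only to keep the example short.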
In step S3, a reliability weight is set for each target network data source, and the acquisition strategy is formulated according to the reliability weight and the value of the target network data source.
Taken together, steps S2 and S3 cover authoritative, quality-assured websites with a high density of target information, grade the sparser data sources, and formulate a unified acquisition strategy. This lets users determine the value of information and saves extraction time.
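How a reliability weight and an assessed value might be turned into an acquisition strategy is sketched below. The patent does not specify the combination, so the product score, the 0 to 1 scales, the linear mapping to a revisit interval, and the names `DataSource`, `revisit_interval_hours`, and `acquisition_plan` are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class DataSource:
    url: str
    reliability: float  # reliability weight in [0, 1] (assumed scale)
    value: float        # assessed value from screening, in [0, 1] (assumed scale)


def revisit_interval_hours(source, shortest=1.0, longest=168.0):
    """Map a combined score to a revisit interval: higher score, shorter interval."""
    score = source.reliability * source.value
    return longest - score * (longest - shortest)


def acquisition_plan(sources):
    """Order sources by combined score (best first) and attach revisit intervals."""
    ranked = sorted(sources, key=lambda s: s.reliability * s.value, reverse=True)
    return [(s.url, round(revisit_interval_hours(s), 1)) for s in ranked]


# Made-up sources for illustration.
sources = [
    DataSource("http://forum.example/cable", reliability=0.5, value=0.4),
    DataSource("http://news.example/cable", reliability=1.0, value=1.0),
]
print(acquisition_plan(sources))
```

Under these assumptions a fully reliable, high-value source is revisited hourly, while a sparse forum is revisited only every few days.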
The network data sources in the above embodiment include websites, news, blogs, forums, and the like.
The network data source detection method has been verified to complete source detection of the main network data of a specific industry (such as the cable industry) within 24 hours. Moreover, at an acquisition frequency that does not antagonize the target website, it can within 25 minutes identify the sections and entry points of a medium-sized site that have a higher density of target information, as well as the site's industry attributes, data originality, and data source credibility, and automatically formulate a crawler acquisition strategy.
The above is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or change made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solution and inventive concept, shall fall within the scope of protection of the present invention.
Claims (7)
1. A network data source detection method, characterized in that network data is detected by means of automatic discovery with a network probe, the method comprising:
S1, setting up an industry information network probe, automatically detecting network data according to a preset industry ontology, and determining candidate websites;
S2, verifying the data of the candidate websites and screening target network data sources;
S3, formulating an acquisition strategy and performing customized website acquisition on the target network data sources.
2. The network data source detection method as claimed in claim 1, characterized in that, in step S1, the industry information network probe discovers candidate websites through URL links and/or by using search engines as a springboard.
3. The network data source detection method as claimed in claim 1, characterized in that, in step S1, the industry information network probe can mine the deep web and analyze potential data sources.
4. The network data source detection method as claimed in claim 3, characterized in that the industry information network probe mines the deep web using a heuristic progressive scanning method, comprising:
S11, continuously probing the same website and automatically filling in forms;
S12, testing the returned data, confirming the form format, and submitting the form;
S13, building a DOM tree from the submitted form, and extracting the nodes of the DOM tree whose content differs for data acquisition;
S14, notifying an administrator to configure the data format, and setting the acquisition mode for the deep website.
5. The network data source detection method as claimed in claim 1, characterized in that, in step S2, the domain name, directories, and URL structure of a website are analyzed and combined with text classification and the distribution density of industry vocabulary to judge whether the data under the website or one of its directories is industry data and to determine its industry information density, so that the value of the data source is comprehensively assessed and target network data sources are screened.
6. The network data source detection method as claimed in claim 1, characterized in that, in step S3, a reliability weight is set for each target network data source, and the acquisition strategy is formulated according to the reliability weight and the value of the target network data source.
7. The network data source detection method as claimed in any one of claims 1 to 6, characterized in that the network data sources comprise websites, news, blogs, and forums.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410348451.4A CN104182467A (en) | 2014-07-21 | 2014-07-21 | Network data source detection method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN104182467A (en) | 2014-12-03 |
Family ID: 51963507
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410348451.4A Pending CN104182467A (en) | 2014-07-21 | 2014-07-21 | Network data source detection method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104182467A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101414377A (en) * | 2008-10-22 | 2009-04-22 | 中国移动通信集团福建有限公司 | Method for filtrating people's livelihood information for communication vocation based on data storehouse |
CN102332025A (en) * | 2011-09-29 | 2012-01-25 | 奇智软件(北京)有限公司 | Intelligent vertical search method and system |
CN102841898A (en) * | 2011-06-23 | 2012-12-26 | 张家港凯纳信息技术有限公司 | Network information monitoring and analyzing system |
CN103365961A (en) * | 2013-06-19 | 2013-10-23 | 北京时间中国网科技有限公司 | Accurate search-oriented website structurization labeling method and system |
CN103389998A (en) * | 2012-05-11 | 2013-11-13 | 安徽华贞信息科技有限公司 | Novel Internet commercial intelligence information semantic analysis technology based on cloud service |
Non-Patent Citations (2)
Title |
---|
(德)默沙伊恩著: "《大学治理与教师参与决策》", 31 January 2014, 知识产权出版社 * |
高明等: "基于语义支持的Deep Web数据抽取", 《计算机科学》 * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106681973A (en) * | 2016-12-20 | 2017-05-17 | 北京奇虎科技有限公司 | Method and device for achieving automatically filling browser sheet in test |
CN106681973B (en) * | 2016-12-20 | 2020-07-24 | 北京奇虎科技有限公司 | Method and device for automatically filling browser forms in test |
CN108156024A (en) * | 2017-12-11 | 2018-06-12 | 深圳市易聆科信息技术股份有限公司 | One kind is based on distributed website availability detection method, system and storage medium |
CN108156024B (en) * | 2017-12-11 | 2021-06-01 | 深圳市易聆科信息技术股份有限公司 | Method, system and storage medium for detecting availability based on distributed website |
CN111008226A (en) * | 2019-12-24 | 2020-04-14 | 韶关学院 | Novel data mining method |
CN112291121A (en) * | 2020-12-30 | 2021-01-29 | 金锐同创(北京)科技股份有限公司 | Data processing method and related equipment |
CN112291121B (en) * | 2020-12-30 | 2021-08-03 | 金锐同创(北京)科技股份有限公司 | Data processing method and related equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
EXSB | Decision made by sipo to initiate substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20141203 |