CN104182467A - Network data source detection method - Google Patents

Info

Publication number
CN104182467A
Authority
CN
China
Prior art keywords
data
network
network data
website
data source
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410348451.4A
Other languages
Chinese (zh)
Inventor
贾岩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ANHUI HUAZHEN INFORMATION SCIENCE & TECHNOLOGY Co Ltd
Original Assignee
ANHUI HUAZHEN INFORMATION SCIENCE & TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2014-07-21
Filing date
2014-07-21
Publication date
2014-12-03
Application filed by ANHUI HUAZHEN INFORMATION SCIENCE & TECHNOLOGY Co Ltd
Priority to CN201410348451.4A
Publication of CN104182467A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention provides a network data source detection method that addresses the difficulty of using Internet data effectively, which arises because such data have low value density and contain numerous and complicated information. The method comprises the following steps: S1, setting up an industry information network probe, automatically detecting network data according to a preset industry ontology, and determining candidate websites; S2, verifying the data of the candidate websites and screening out network data sources; S3, formulating an acquisition strategy and performing customized website acquisition on the network data sources. The industry information network probe can mine the deep web and analyze potential data sources. It mines the deep web with a heuristic progressive scanning method comprising the following steps: S11, continuing to probe the same website and automatically filling in forms; S12, testing the returned data, confirming the form format and submitting the form; S13, building a DOM (Document Object Model) tree from the submitted form and extracting the nodes whose content differs in the DOM tree for data acquisition; S14, notifying an administrator to configure the data format and setting the acquisition mode for the deep website.

Description

A network data source detection method
Technical field
The present invention relates to the technical field of network data detection, and in particular to a network data source detection method.
Background
As informatization deepens, enterprises' demand for "big data" analysis services grows stronger by the day. The continuously growing information resources of the Internet contain a vast amount of commercially valuable information and have become an important information source for business intelligence services. However, the Internet, as the main carrier of big data, poses several difficulties: the data volume is huge, the data are hard to obtain, their unit value is relatively low, and they consist almost entirely of unstructured data such as text. As a result, their value has not been fully developed and exploited by industry.
With the development of the Internet, individuals and enterprises obtain more and more information, but the value density of Internet data is relatively low. Faced with such numerous and complicated information, users without effective source detection and quality assessment mechanisms often cannot extract genuinely useful information from it and so cannot use it effectively.
Summary of the invention
In view of the problems described in the background, the present invention proposes a network data source detection method, which solves the problem that Internet data are difficult to use effectively because of their low value density and the numerous and complicated information they contain.
The network data source detection method proposed by the present invention detects network data by means of automatic discovery by a network probe, and comprises:
S1, setting up an industry information network probe, automatically detecting network data according to a preset industry ontology, and determining candidate websites;
S2, verifying the data of the candidate websites and screening out target network data sources;
S3, formulating an acquisition strategy and performing customized website acquisition on the target network data sources.
Preferably, in step S1, the industry information network probe discovers candidate websites through URL links and/or by using search engines as a springboard.
Preferably, in step S1, the industry information network probe can mine the deep web and analyze potential data sources.
Preferably, the industry information network probe mines the deep web using a heuristic progressive scanning method, comprising:
S11, continuing to probe the same website and automatically filling in forms;
S12, testing the returned data, confirming the form format and submitting the form;
S13, building a DOM tree from the submitted form, and extracting the nodes whose content differs in the DOM tree for data acquisition;
S14, notifying an administrator to configure the data format, and setting the acquisition mode for the deep website.
Preferably, in step S2, based on an analysis of the website domain name, directories and URL structure, combined with text classification and the distribution density of industry vocabulary, it is judged whether the data under a website or a website directory are industry data, and their industry information density is determined, so that the value of the data source is assessed comprehensively and target network data sources are screened out.
Preferably, in step S3, a reliability weight is set for each target network data source, and the acquisition strategy is formulated according to the reliability weight and the value of the target network data source.
Preferably, the network data sources include websites, news, blogs and forums.
The present invention detects network data according to an industry ontology, which narrows the detection range and improves detection efficiency. By verifying and screening the data of candidate websites, it can cover websites that have high target-information density, are authoritative, and are of assured quality, carry out targeted network data acquisition, and extract genuinely useful information. The invention solves the problem of analyzing and extracting network big data for enterprises and achieves effective use of Internet data.
Brief description of the drawings
Fig. 1 is a flowchart of the network data source detection method proposed by the present invention.
Detailed description of the embodiments
Referring to Fig. 1, the network data source detection method proposed by the present invention detects network data by means of automatic discovery by a network probe, and comprises the following steps:
S1, setting up an industry information network probe, automatically detecting network data according to a preset industry ontology, and determining candidate websites;
S2, verifying the data of the candidate websites and screening out target network data sources;
S3, formulating an acquisition strategy and performing customized website acquisition on the target network data sources.
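The three steps can be read as a simple pipeline. The following is a minimal sketch under stated assumptions, not the patented implementation: the function name `detect_sources`, the three callables `probe`, `assess` and `plan`, and the threshold `min_value` are all hypothetical names introduced here for illustration.

```python
from typing import Callable, Dict, Iterable, List


def detect_sources(
    seeds: Iterable[str],
    probe: Callable[[str], Iterable[str]],   # S1: seed URL -> candidate website URLs
    assess: Callable[[str], float],          # S2: candidate URL -> value score
    plan: Callable[[str, float], Dict],      # S3: (URL, value) -> acquisition strategy
    min_value: float = 0.5,                  # hypothetical screening threshold
) -> List[Dict]:
    """Run S1 -> S2 -> S3 over a list of seed URLs."""
    # S1: the probe expands each seed into candidates (deduplicated, order kept).
    candidates = list(dict.fromkeys(url for seed in seeds for url in probe(seed)))

    strategies = []
    for url in candidates:
        # S2: verify the candidate and keep it only if its value is high enough.
        value = assess(url)
        if value < min_value:
            continue
        # S3: formulate a customized acquisition strategy for the target source.
        strategies.append(plan(url, value))
    return strategies
```

Passing the stages in as callables keeps the sketch independent of how probing, verification and strategy formulation are actually carried out.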
In step S1, the industry information network probe discovers candidate websites through URL (Uniform Resource Locator) links and/or by using search engines as a springboard. The probe can also mine the deep web and analyze potential data sources.
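As an illustration of discovery through URL links, the sketch below collects the links on a fetched page and keeps the hosts whose link text or URL mentions a term from the industry ontology. It assumes the ontology can be approximated by a flat list of terms, which the source does not state; `LinkCollector` and `candidate_sites` are hypothetical names, and only the Python standard library is used.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects (absolute_url, anchor_text) pairs from <a href=...> elements."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None   # href of the <a> element currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def candidate_sites(html, base_url, ontology_terms):
    """Return hosts whose link text or URL contains an ontology term."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    sites = []
    for url, text in parser.links:
        host = urlparse(url).netloc
        haystack = (text + " " + url).lower()
        matches = any(term.lower() in haystack for term in ontology_terms)
        if host and matches and host not in sites:
            sites.append(host)
    return sites
```

Substring matching is a deliberately crude stand-in for ontology-based matching; a real probe would also need fetching, error handling and search engine queries, none of which are shown.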
The industry information network probe mines the deep web using a heuristic progressive scanning method, which comprises the following steps:
S11, continuing to probe the same website and automatically filling in forms;
S12, testing the returned data, confirming the form format and submitting the form;
S13, building a DOM (Document Object Model) tree from the submitted form, and extracting the nodes whose content differs in the DOM tree for data acquisition;
S14, notifying an administrator to configure the data format, and setting the acquisition mode for the deep website.
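One way to read step S13 is to compare the result pages returned for two different form submissions: nodes whose text is identical are template, and nodes whose text differs carry data. The sketch below follows that reading, which is an interpretation rather than something the source spells out. It assumes well-formed HTML, and `TextNodeIndexer`, `index_text_nodes` and `varying_nodes` are hypothetical names.

```python
from html.parser import HTMLParser

# Elements without a closing tag; they never contain text, so they are skipped.
VOID_TAGS = {"br", "img", "input", "hr", "meta", "link"}


class TextNodeIndexer(HTMLParser):
    """Maps a structural path such as 'html[0]/body[0]/ul[0]/li[1]' to its text."""

    def __init__(self):
        super().__init__()
        self._stack = []            # path components of the open elements
        self._child_counts = [{}]   # per open element: how many children of each tag
        self.texts = {}

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        counts = self._child_counts[-1]
        index = counts.get(tag, 0)
        counts[tag] = index + 1
        self._stack.append(f"{tag}[{index}]")
        self._child_counts.append({})

    def handle_endtag(self, tag):
        # Assumes well-formed markup: each end tag closes the innermost element.
        if tag in VOID_TAGS or not self._stack:
            return
        self._stack.pop()
        self._child_counts.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            path = "/".join(self._stack)
            self.texts[path] = (self.texts.get(path, "") + " " + text).strip()


def index_text_nodes(html):
    parser = TextNodeIndexer()
    parser.feed(html)
    parser.close()
    return parser.texts


def varying_nodes(page_a, page_b):
    """Paths whose text differs between two result pages (candidate data nodes)."""
    a, b = index_text_nodes(page_a), index_text_nodes(page_b)
    return sorted(path for path in a.keys() | b.keys() if a.get(path) != b.get(path))
```

The paths returned by `varying_nodes` could then be shown to an administrator as a starting point for the data-format configuration in step S14.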
In step S1, detecting network data according to the industry ontology narrows the detection range and improves detection efficiency. Deep web mining is carried out only when the detected network data meet the requirements, so that important data are not missed and time is not wasted on fruitless work. Without reducing the amount of industry data covered, this strategy greatly saves bandwidth and data retrieval volume, improves the data-loading cycle, and improves timeliness.
In step S2, based on an analysis of the website domain name, directories and URL structure, combined with text classification and the distribution density of industry vocabulary, it is judged whether the data under a website or a website directory are industry data, and their industry information density is determined. The value of the data source is thereby assessed comprehensively and target network data sources are screened out. Low-value network data sources are discarded, which further narrows the range of usable data sources and improves the efficiency of information extraction.
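A minimal sketch of the scoring described above might look as follows. It covers only the vocabulary-density and URL-structure signals; the text classification step is omitted. The way the two signals are combined (mean page density plus a fixed bonus per URL hit, capped at 1.0) and the names `vocabulary_density`, `url_structure_hits`, `source_value` and `url_bonus` are assumptions made for illustration and do not come from the source.

```python
import re
from urllib.parse import urlparse


def vocabulary_density(text, industry_terms):
    """Fraction of word tokens in the text that belong to the industry vocabulary."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    vocab = {term.lower() for term in industry_terms}
    return sum(1 for token in tokens if token in vocab) / len(tokens)


def url_structure_hits(url, industry_terms):
    """Number of industry terms that appear in the domain name or directory path."""
    parts = urlparse(url)
    haystack = (parts.netloc + parts.path).lower()
    return sum(1 for term in industry_terms if term.lower() in haystack)


def source_value(url, pages, industry_terms, url_bonus=0.1):
    """Hypothetical value score in [0, 1] for a website or website directory."""
    if not pages:
        return 0.0
    density = sum(vocabulary_density(p, industry_terms) for p in pages) / len(pages)
    return min(1.0, density + url_bonus * url_structure_hits(url, industry_terms))
```

Whitespace-and-punctuation tokenization works for English text only; Chinese pages would need a word segmenter before the density can be computed.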
In step S3, a reliability weight is set for each target network data source, and the acquisition strategy is formulated according to the reliability weight and the value of the target network data source.
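The source does not say how the reliability weight and the source value are turned into a strategy. The sketch below assumes one simple possibility: their product sets a priority, higher priority shortens the revisit interval, and higher value allows a deeper crawl. `TargetSource`, `acquisition_strategy`, `plan_all` and all parameter names and defaults are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TargetSource:
    url: str
    value: float        # value assessed in step S2, assumed to lie in [0, 1]
    reliability: float  # reliability weight, assumed to lie in [0, 1]


def acquisition_strategy(source, base_interval_hours=24.0, max_depth=5):
    """Derive a hypothetical crawl plan from the value and reliability weight."""
    priority = source.value * source.reliability
    # Higher priority means more frequent revisits, but never more than hourly.
    revisit_hours = max(1.0, base_interval_hours * (1.0 - priority))
    # More valuable sources are crawled deeper; every source gets at least depth 1.
    crawl_depth = max(1, round(max_depth * source.value))
    return {
        "url": source.url,
        "priority": priority,
        "revisit_hours": revisit_hours,
        "crawl_depth": crawl_depth,
    }


def plan_all(sources):
    """Plans for all sources, highest priority first."""
    plans = (acquisition_strategy(s) for s in sources)
    return sorted(plans, key=lambda plan: plan["priority"], reverse=True)
```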
Taken together, steps S2 and S3 cover websites that have high target-information density, are authoritative, and are of assured quality. Sparser data sources are graded, and acquisition strategies are formulated uniformly, so that users can judge the value of the information and save extraction time.
The network data sources in the above embodiment include websites, news, blogs, forums and the like.
This network data source detection method has been verified to complete source detection of the main network data of a specific industry (for example, the cable industry) within 24 hours. Moreover, at an acquisition frequency that does not antagonize the target website, it can, within 25 minutes, identify the sections and entry points of a medium-sized site that have higher target-information density, along with the industry attributes, data originality and credibility of the data source, and automatically formulate a crawler acquisition strategy.
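The source gives no figure for an acquisition frequency that a target website would tolerate. As a hedged illustration, the sketch below enforces a per-host minimum delay between requests; the class name `PoliteScheduler`, the 5-second default and the injectable `clock` parameter are all assumptions made for this example.

```python
import time
from urllib.parse import urlparse


class PoliteScheduler:
    """Spaces out requests to the same host by at least min_delay_seconds."""

    def __init__(self, min_delay_seconds=5.0, clock=time.monotonic):
        self.min_delay = min_delay_seconds
        self._clock = clock          # injectable so the class can be tested
        self._next_allowed = {}      # host -> earliest time of the next request

    def wait_time(self, url):
        """Reserve the next slot for this URL's host and return how long to wait."""
        host = urlparse(url).netloc
        now = self._clock()
        start = max(now, self._next_allowed.get(host, now))
        self._next_allowed[host] = start + self.min_delay
        return start - now
```

A crawler would call `time.sleep(scheduler.wait_time(url))` before each request, so requests to different hosts are not delayed by one another.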
The above is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent replacement or change made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solution and inventive concept, shall fall within the scope of protection of the present invention.

Claims (7)

1. A network data source detection method, characterized in that network data are detected by means of automatic discovery by a network probe, the method comprising:
S1, setting up an industry information network probe, automatically detecting network data according to a preset industry ontology, and determining candidate websites;
S2, verifying the data of the candidate websites and screening out target network data sources;
S3, formulating an acquisition strategy and performing customized website acquisition on the target network data sources.
2. The network data source detection method according to claim 1, characterized in that, in step S1, the industry information network probe discovers candidate websites through URL links and/or by using search engines as a springboard.
3. The network data source detection method according to claim 1, characterized in that, in step S1, the industry information network probe can mine the deep web and analyze potential data sources.
4. The network data source detection method according to claim 3, characterized in that the industry information network probe mines the deep web using a heuristic progressive scanning method, comprising:
S11, continuing to probe the same website and automatically filling in forms;
S12, testing the returned data, confirming the form format and submitting the form;
S13, building a DOM tree from the submitted form, and extracting the nodes whose content differs in the DOM tree for data acquisition;
S14, notifying an administrator to configure the data format, and setting the acquisition mode for the deep website.
5. The network data source detection method according to claim 1, characterized in that, in step S2, based on an analysis of the website domain name, directories and URL structure, combined with text classification and the distribution density of industry vocabulary, it is judged whether the data under a website or a website directory are industry data, and their industry information density is determined, so that the value of the data source is assessed comprehensively and target network data sources are screened out.
6. The network data source detection method according to claim 1, characterized in that, in step S3, a reliability weight is set for each target network data source, and the acquisition strategy is formulated according to the reliability weight and the value of the target network data source.
7. The network data source detection method according to any one of claims 1 to 6, characterized in that the network data sources include websites, news, blogs and forums.
CN201410348451.4A 2014-07-21 2014-07-21 Network data source detection method Pending CN104182467A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410348451.4A CN104182467A (en) 2014-07-21 2014-07-21 Network data source detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410348451.4A CN104182467A (en) 2014-07-21 2014-07-21 Network data source detection method

Publications (1)

Publication Number Publication Date
CN104182467A (en) 2014-12-03

Family

ID=51963507

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410348451.4A Pending CN104182467A (en) 2014-07-21 2014-07-21 Network data source detection method

Country Status (1)

Country Link
CN (1) CN104182467A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101414377A (en) * 2008-10-22 2009-04-22 中国移动通信集团福建有限公司 Method for filtrating people's livelihood information for communication vocation based on data storehouse
CN102841898A (en) * 2011-06-23 2012-12-26 张家港凯纳信息技术有限公司 Network information monitoring and analyzing system
CN102332025A (en) * 2011-09-29 2012-01-25 奇智软件(北京)有限公司 Intelligent vertical search method and system
CN103389998A (en) * 2012-05-11 2013-11-13 安徽华贞信息科技有限公司 Novel Internet commercial intelligence information semantic analysis technology based on cloud service
CN103365961A (en) * 2013-06-19 2013-10-23 北京时间中国网科技有限公司 Accurate search-oriented website structurization labeling method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
(德)默沙伊恩著: "《大学治理与教师参与决策》", 31 January 2014, 知识产权出版社 *
高明等: "基于语义支持的Deep Web数据抽取", 《计算机科学》 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106681973A (en) * 2016-12-20 2017-05-17 北京奇虎科技有限公司 Method and device for achieving automatically filling browser sheet in test
CN106681973B (en) * 2016-12-20 2020-07-24 北京奇虎科技有限公司 Method and device for automatically filling browser forms in test
CN108156024A (en) * 2017-12-11 2018-06-12 深圳市易聆科信息技术股份有限公司 One kind is based on distributed website availability detection method, system and storage medium
CN108156024B (en) * 2017-12-11 2021-06-01 深圳市易聆科信息技术股份有限公司 Method, system and storage medium for detecting availability based on distributed website
CN111008226A (en) * 2019-12-24 2020-04-14 韶关学院 Novel data mining method
CN112291121A (en) * 2020-12-30 2021-01-29 金锐同创(北京)科技股份有限公司 Data processing method and related equipment
CN112291121B (en) * 2020-12-30 2021-08-03 金锐同创(北京)科技股份有限公司 Data processing method and related equipment

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
EXSB Decision made by sipo to initiate substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20141203
