CN104951539B - Internet data center's harmful information monitoring system - Google Patents


Publication number
CN104951539B
Authority
CN
China
Prior art keywords
module
search
crawler
web
harmful information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201510343226.6A
Other languages
Chinese (zh)
Other versions
CN104951539A (en
Inventor
彭光辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Aierpu Science & Technology Co Ltd
Original Assignee
Chengdu Aierpu Science & Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Aierpu Science & Technology Co Ltd
Priority to CN201510343226.6A
Publication of CN104951539A
Application granted
Publication of CN104951539B
Legal status: Expired - Fee Related


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G06F16/951 Indexing; Web crawling techniques
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links

Abstract

The invention discloses a harmful information monitoring system for an Internet data center, comprising a crawler system and a harmful information monitoring system. The harmful information monitoring system obtains web page data in the Internet data center through the crawler system and analyzes it for harmful content. The crawler system comprises multiple crawler clusters, each made up of crawler nodes and a crawler root node. Each crawler node comprises a multithreaded web page collection module, a web page library, an encoding identification and processing module, an automatic web page content extraction module, a URL filter, a URL deduplication module and a URL scheduling module. The harmful information monitoring system achieves more precise search through a harmful information search unit, an automatic word segmentation unit, a keyword processing unit and a fuzzy matching unit. The invention provides a powerful data collection function: multiple crawler clusters monitor dynamic and static web pages comprehensively and in real time, and data related to sensitive words can be collected from massive data, so that harmful web pages are discovered proactively.

Description

Harmful information monitoring system for an Internet data center
Technical field
The present invention relates to a harmful information monitoring system for an Internet data center.
Background
With the rapid development of networks, the World Wide Web has become the carrier of a vast amount of information, and extracting and using that information effectively has become a major challenge. Search engines, as tools that help people retrieve information, have become the entry point and guide for users accessing the World Wide Web. However, general-purpose search engines have certain limitations.
In an increasingly active online community environment, every Internet user may become a publisher and disseminator of harmful information, and harmful content spreads through ever more channels, including blogs, news sites, forums, microblogs and others. Web crawlers are the precursor technology that makes search engines possible, and the arrival of the big data era and the rapid development of Internet technology have given web crawlers even greater research significance. Efficient, continuously running web crawlers that can cope with challenges such as rapidly growing volumes of web data, short update cycles for online text and dynamically changing page structures have become a research hotspot in harmful information mining.
Summary of the invention
The object of the invention is to overcome the deficiencies of the prior art and provide a harmful information monitoring system for an Internet data center. The system provides a powerful data collection function and monitors dynamic and static web pages comprehensively and in real time through multiple crawler clusters; it collects data related to sensitive words from massive data, so that harmful web pages are discovered proactively; and it achieves more precise search through a harmful information search unit, an automatic word segmentation unit, a keyword processing unit and a fuzzy matching unit.
The object of the invention is achieved through the following technical solution: a harmful information monitoring system for an Internet data center, comprising a crawler system and a harmful information monitoring system, wherein the harmful information monitoring system obtains web page data in the Internet data center through the crawler system and analyzes it for harmful content.
The crawler system comprises one or more crawler clusters. Each crawler cluster comprises multiple crawler nodes and one crawler root node, which together form a distributed data collection network. The crawler root node controls and manages the crawler nodes in its cluster and communicates with the harmful information monitoring system; the crawler nodes collect harmful information from the network. Each crawler node consists of the following modules:
1. Multithreaded web page collection module: comprises several web page collection channels and web page parsing modules; each type of web page is collected through the collection channel and parsing module that match it.
2. Web page library: stores the web pages collected by the multithreaded web page collection module.
3. Encoding identification and processing module: automatically identifies the encoding of a web page and converts it.
4. Automatic web page content extraction module: comprises a dynamic web page content extraction module and a static web page content extraction module; after encoding conversion, it uses the sensitive-word dictionary to capture the URLs of pages that contain harmful information.
5. URL filter: filters out URLs that do not need to be downloaded.
6. URL deduplication module: determines whether a filtered URL matches a URL already held in the URL store; if so, the URL receives no further processing.
7. URL scheduling module: directs the multithreaded web page collection module to download the corresponding pages according to the deduplicated URL queue.
8. Web page deduplication module: determines whether the content of a web page matches content that has already been downloaded; if so, the page receives no further processing and is deleted from the web page library.
The web page deduplication module comprises a fingerprint computation module, a fingerprint library and a fingerprint deduplication module. The fingerprint computation module computes a fingerprint from the page content using a web page fingerprint algorithm. The fingerprint deduplication module compares the generated fingerprint with the fingerprints in the fingerprint library; if an identical or similar fingerprint exists, the page content is judged to have been downloaded already. The fingerprint library stores the fingerprint data, and the fingerprint libraries of all crawler nodes are updated synchronously.
9. Interval crawl module: automatically generates interval rules from page scores and website weights, and directs the automatic web page content extraction module to crawl pages at the corresponding intervals.
10. Crawl rule setting module: directs the automatic web page content extraction module to crawl pages according to the configured crawl rules.
11. Anti-crawler handling module: when a web page is protected by an anti-crawler program, this module is started to force collection of the target page.
12. Collection monitoring module: sends the crawler node's working status, collection tasks, collection depth and log information to the crawler root node for aggregation, and accepts control from the crawler root node.
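As an illustration of how the URL filter, URL deduplication module and URL scheduling module in the list above might cooperate, the following is a minimal Python sketch. It is not taken from the patent: the filter rule (HTTP schemes only, a few skipped file suffixes), the in-memory set standing in for the URL store, and the names `should_download` and `UrlFrontier` are all assumptions made for the example.

```python
from collections import deque
from urllib.parse import urldefrag, urlparse

# Hypothetical filter rule: skip non-HTTP schemes and common binary file types.
SKIPPED_SUFFIXES = (".jpg", ".png", ".gif", ".zip", ".exe")


def should_download(url):
    """URL filter: reject URLs that do not need to be downloaded."""
    parts = urlparse(url)
    if parts.scheme not in ("http", "https"):
        return False
    return not parts.path.lower().endswith(SKIPPED_SUFFIXES)


class UrlFrontier:
    """URL deduplication plus a simple first-in, first-out scheduler."""

    def __init__(self):
        self.seen = set()     # stands in for the URL store (assumption)
        self.queue = deque()  # deduplicated URL queue

    def offer(self, url):
        """Queue a URL if it passes the filter and has not been seen before."""
        url, _fragment = urldefrag(url)  # drop "#fragment" before comparing
        if not should_download(url) or url in self.seen:
            return False
        self.seen.add(url)
        self.queue.append(url)
        return True

    def next_url(self):
        """Hand the next URL to a collection thread, or None if the queue is empty."""
        return self.queue.popleft() if self.queue else None
```

A real node would persist the URL store and share the queue between collection threads; the sketch keeps both in memory for brevity.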
The harmful information monitoring system comprises a harmful information search unit, an automatic word segmentation unit, a keyword processing unit and a fuzzy matching unit.
The harmful information search unit comprises a local search port and a network search port. The local search port starts the search engine of the local crawler node, so that the harmful information search task is executed locally. The network search port starts the search engines of multiple crawler nodes, so that the harmful information search task is executed by multiple crawler nodes simultaneously; the search results are also returned to the local crawler node through the network search port.
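The difference between the local and the network search port can be sketched as follows. This is only an illustration under assumptions: each node's index is modeled as a plain dictionary from URL to page text, and the function names `local_search` and `network_search` are hypothetical. The patent does not specify how the nodes are contacted, so a thread pool stands in for the network calls.

```python
from concurrent.futures import ThreadPoolExecutor


def local_search(node_index, keyword):
    """Local search port: search one node's own index (URL -> page text)."""
    return sorted(url for url, text in node_index.items() if keyword in text)


def network_search(node_indexes, keyword):
    """Network search port: run the same task on several nodes at once
    and merge the results for the node that issued the search."""
    merged = set()
    # One worker per node; "or 1" keeps the pool valid for an empty node list.
    with ThreadPoolExecutor(max_workers=len(node_indexes) or 1) as pool:
        for urls in pool.map(lambda index: local_search(index, keyword), node_indexes):
            merged.update(urls)
    return sorted(merged)
```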
The harmful information search unit further comprises one or more of a keyword filter, a tag field filter, a metadata field filter and a time filter; precise search is achieved through these filters and their combinations.
The keyword processing unit generates a keyword search instruction, and the harmful information search unit executes the harmful information search task according to that instruction.
The fuzzy matching unit matches approximate terms that are similar to the input search string, so that while the harmful information search unit searches for the search string it also searches for the approximate terms and returns their search results.
The automatic word segmentation unit automatically extracts keywords from the input search string, so that the harmful information search unit completes a precise search based on the extracted keywords.
The keyword search instruction includes a category ID, an event title, a keyword option, an excluded keyword option, a weight and a start time. The excluded keyword option ensures that a web page containing any keyword listed in it is not matched and is not identified as a harmful information page.
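The fields of the keyword search instruction can be pictured as a small record type. The sketch below is a hedged illustration, not the patented format: the field names are English renderings chosen for the example, and the matching rule (any keyword hit flags the page unless an excluded keyword is present) is an assumption consistent with the description of the excluded keyword option.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class KeywordSearchInstruction:
    # Field names are illustrative translations of the fields listed above.
    category_id: int
    event_title: str
    keywords: List[str] = field(default_factory=list)
    excluded_keywords: List[str] = field(default_factory=list)
    weight: float = 1.0
    start_time: Optional[datetime] = None

    def matches(self, page_text):
        """Return True if the page should be flagged under this instruction."""
        # Any excluded keyword vetoes the match outright.
        if any(word in page_text for word in self.excluded_keywords):
            return False
        # Assumption: a single keyword hit is enough to flag the page.
        return any(word in page_text for word in self.keywords)
```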
The harmful information monitoring system further comprises an automatic summary generation unit, which dynamically generates a summary of the target web page according to the input search string and its approximate terms.
The automatic summary generation unit also analyzes the keywords of a web page through the keyword processing unit and automatically extracts key fields to generate the page summary.
The harmful information monitoring system further comprises a result statistics and analysis unit, which analyzes and summarizes the returned search results. The unit comprises a task public-opinion chart generation module, a report generation module, a task article statistics module, a task trend analysis module and a task distribution analysis module.
The task public-opinion chart generation module generates a task public-opinion chart from the search conditions and search results, including statistics on the amount of harmful information, statistics on the number of keyword hits and per-category statistics on the number of web pages.
The report generation module generates reports from the search result information.
The task trend analysis module generates increment charts.
The task distribution analysis module generates task lists, site distribution maps and media distribution maps.
The harmful information monitoring system further comprises a firewall, through which the crawler system safely crawls web page data in the Internet data center.
The beneficial effects of the invention are as follows. The proposed system can collect data related to sensitive words from massive data and so discover harmful content proactively. It records related information such as the sites where harmful content is distributed, its propagation channels, reply rate, click-through rate and participants, which helps analyze the popularity, importance and development trend of harmful web pages and supports accurate analysis of harmful content. The virtual identities of suspects can be placed under focused monitoring, and the collected data used to analyze their scope of activity, the content they spread, their active hours and so on. Qualitative analysis of posted statements can be configured. Event popularity can be located and analyzed quickly.
The invention also has the following functional features:
1) Multithreaded collection: different strategies are customized for different types of websites, and collection supports multithreading for fast information gathering;
2) Distributed collection: large-scale data collection is carried out by multiple crawler clusters and many crawler nodes;
3) Collection monitoring: crawler node working status, collection tasks, collection depth, logs, system operation reports and so on are monitored and managed;
4) Automatic web page content extraction: a variety of dynamic and static pages can be collected, for example HTM, HTML, SHTML, XML, PHP, ASP, JSP and JavaScript pages;
5) Automatic encoding identification and conversion: encodings such as GBK, GB2312, BIG5, UTF-8, UTF-16, BIGENDIAN and ISO8859-1 are identified automatically, and the system automatically converts the content to UTF;
6) Incremental update: crawler nodes collect only pages that are new or changed since the last update and do not re-collect pages already downloaded, which keeps information updates efficient; users can also configure full collection when needed;
7) Anti-crawler handling: corresponding strategies are set for websites that deploy anti-crawler programs, so that their pages can still be crawled;
8) Interval crawling: interval rules are generated automatically from page scores, website weights and similar factors, and pages are crawled at the corresponding intervals;
9) Custom crawl rules: users can also set crawl rules themselves.
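Feature 5) above, automatic encoding identification and conversion, can be approximated with a trial-decoding loop. The sketch below is an assumption-laden simplification rather than the patented method: it tries a fixed candidate order (a declared encoding first, if one is known), uses `gb18030` as a superset of GBK and GB2312, and falls back to ISO-8859-1, which accepts any byte sequence. Trial decoding cannot reliably tell GBK from BIG5, so a production system would need statistical detection as well.

```python
# Candidate order is an assumption; ISO-8859-1 goes last because it never fails.
CANDIDATE_ENCODINGS = ("utf-8", "gb18030", "big5", "iso-8859-1")


def to_utf8(raw, declared=None):
    """Decode raw page bytes and re-encode them as UTF-8.

    Returns a (utf8_bytes, encoding_used) pair.
    """
    candidates = ([declared] if declared else []) + list(CANDIDATE_ENCODINGS)
    for name in candidates:
        try:
            text = raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown codec name: try the next candidate.
            continue
        return text.encode("utf-8"), name
    # Not reachable in practice, since ISO-8859-1 decodes every byte string.
    raise ValueError("no candidate encoding could decode the input")
```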
Brief description of the drawings
Fig. 1 is a structural block diagram of the crawler system of the invention;
Fig. 2 is a structural block diagram of a crawler node of the invention;
Fig. 3 is a structural block diagram of the harmful information monitoring system of the invention.
Detailed description
The technical solution of the invention is described in further detail below with reference to the drawings, but the scope of protection of the invention is not limited to the following description.
A harmful information monitoring system for an Internet data center comprises a crawler system and a harmful information monitoring system. The harmful information monitoring system obtains web page data in the Internet data center through the crawler system and analyzes it for harmful content.
(1) Crawler system
As shown in Fig. 1, the crawler system is responsible for discovering raw data on the Internet, crawling it and normalizing it. Depending on the Internet application, it comprises one or more crawler clusters. Each crawler cluster comprises multiple crawler nodes and one crawler root node, which together form a distributed data collection network. The crawler root node controls and manages the crawler nodes in its cluster and communicates with the harmful information monitoring system; the crawler nodes collect harmful information from the network.
As shown in Fig. 2, in the invention each crawler node consists of the following modules:
1. Multithreaded web page collection module: comprises several web page collection channels and web page parsing modules; each type of web page is collected through the collection channel and parsing module that match it. The web page parsing modules include a DNS parsing module, an HTTP parsing module, an FTP parsing module, a GOPHER parsing module and so on.
This module implements multithreaded collection: different strategies can be customized for different types of websites, and collection supports multithreading for fast information gathering.
2. Web page library: stores the web pages collected by the multithreaded web page collection module.
3. Encoding identification and processing module: automatically identifies the encoding of a web page and converts it. Encodings such as GBK, GB2312, BIG5, UTF-8, UTF-16, BIGENDIAN and ISO8859-1 are identified automatically, and the system automatically converts the content to UTF.
4. Automatic web page content extraction module: comprises a dynamic web page content extraction module and a static web page content extraction module; after encoding conversion, it uses the sensitive-word dictionary to capture the URLs of pages that contain harmful information. A variety of dynamic and static pages can be collected, for example HTM, HTML, SHTML, XML, PHP, ASP, JSP and JavaScript pages.
5. URL filter: filters out URLs that do not need to be downloaded.
6. URL deduplication module: determines whether a filtered URL matches a URL already held in the URL store; if so, the URL receives no further processing. This implements incremental update: crawler nodes collect only pages that are new or changed since the last update and do not re-collect pages already downloaded, which keeps information updates efficient; users can also configure full collection when needed.
7. URL scheduling module: directs the multithreaded web page collection module to download the corresponding pages according to the deduplicated URL queue.
8. Web page deduplication module: determines whether the content of a web page matches content that has already been downloaded; if so, the page receives no further processing and is deleted from the web page library.
9. Fingerprint computation module, fingerprint library and fingerprint deduplication module: the fingerprint computation module computes a fingerprint from the page content using a web page fingerprint algorithm. The fingerprint deduplication module compares the generated fingerprint with the fingerprints in the fingerprint library; if an identical or similar fingerprint exists, the page content is judged to have been downloaded already. The fingerprint library stores the fingerprint data, and the fingerprint libraries of all crawler nodes are updated synchronously.
10. Interval crawl module: automatically generates interval rules from page scores and website weights, and directs the automatic web page content extraction module to crawl pages at the corresponding intervals.
11. Crawl rule setting module: directs the automatic web page content extraction module to crawl pages according to the configured crawl rules.
12. Anti-crawler handling module: when a web page is protected by an anti-crawler program, this module is started to force collection of the target page.
13. Collection monitoring module: sends the crawler node's working status, collection tasks, collection depth and log information to the crawler root node for aggregation, and accepts control from the crawler root node.
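The fingerprint-based deduplication described in items 8 and 9 above compares each new page against stored fingerprints and treats identical or similar fingerprints as duplicates. The patent does not name the web page fingerprint algorithm; the sketch below substitutes SimHash with a Hamming-distance threshold as one commonly used choice. The threshold of 3 bits, the whitespace tokenization and the class name `FingerprintStore` are assumptions for illustration, and synchronization between nodes is left out.

```python
import hashlib

BITS = 64  # fingerprint width; uses the first 8 bytes of each MD5 digest


def simhash(text):
    """Compute a 64-bit SimHash fingerprint over whitespace-separated tokens."""
    counts = [0] * BITS
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        value = int.from_bytes(digest[:8], "big")
        for i in range(BITS):
            # Each token votes +1 or -1 on every bit position.
            counts[i] += 1 if (value >> i) & 1 else -1
    return sum(1 << i for i in range(BITS) if counts[i] > 0)


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


class FingerprintStore:
    """In-memory stand-in for the fingerprint library of one crawler node."""

    def __init__(self, max_distance=3):
        self.fingerprints = []
        self.max_distance = max_distance  # assumed similarity threshold

    def is_duplicate(self, text):
        """Return True if an identical or similar fingerprint is already stored;
        otherwise store the new fingerprint and return False."""
        fp = simhash(text)
        if any(hamming(fp, known) <= self.max_distance for known in self.fingerprints):
            return True
        self.fingerprints.append(fp)
        return False
```

The linear scan over stored fingerprints is fine for a sketch; at crawl scale an index over fingerprint segments would normally replace it.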
The crawler node further comprises a tag counter and a tag count log file; the tag counter records the number of downloads in the web page library and writes the data to the tag count log file.
The crawler system further comprises a full-text database, an index database and a sorting database, each of which is connected to the crawler nodes and the crawler root node.
(2) Harmful information monitoring system
As shown in Fig. 3, the harmful information monitoring system comprises a harmful information search unit, an automatic word segmentation unit, a keyword processing unit and a fuzzy matching unit.
1. Harmful information search unit: comprises a local search port and a network search port. The local search port starts the search engine of the local crawler node, so that the harmful information search task is executed locally. The network search port starts the search engines of multiple crawler nodes, so that the harmful information search task is executed by multiple crawler nodes simultaneously; the search results are also returned to the local crawler node through the network search port.
The harmful information search unit further comprises one or more of a keyword filter, a tag field filter, a metadata field filter and a time filter; precise search is achieved through these filters and their combinations, for example by assigning weights to search keywords or by weighted combined search over multiple metadata fields.
Keyword filter: supports combinations of keyword logical expressions, including AND, OR, NOT and so on.
Tag field filter: supports restricting a search with logical AND, OR and NOT combinations of multiple tag fields.
Metadata field filter: multiple metadata fields can be defined, and search results are selected by parameter.
Time filter: supports sorting by date, relevance and combinations of other fields.
Field tag search works by building tag fields for the indexed text; users can select tag combinations as needed and so obtain correspondingly restricted results.
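How several filters might be combined into one search restriction is sketched below. This is an illustrative reading, not the patent's implementation: a page is modeled as a dictionary with hypothetical `text`, `tags` and `published` keys, the keyword filter covers only flat AND, OR and NOT lists rather than arbitrary nested expressions, and the time filter is shown as a date-range restriction.

```python
from datetime import datetime


def keyword_filter(all_of=(), any_of=(), none_of=()):
    """Keyword filter: AND (all_of), OR (any_of) and NOT (none_of) conditions."""
    def accept(page):
        text = page["text"]
        return (all(word in text for word in all_of)
                and (not any_of or any(word in text for word in any_of))
                and not any(word in text for word in none_of))
    return accept


def tag_filter(required_tags):
    """Tag field filter: the page must carry every required tag."""
    required = set(required_tags)
    return lambda page: required <= set(page["tags"])


def time_filter(start, end):
    """Time filter: keep pages published within [start, end]."""
    return lambda page: start <= page["published"] <= end


def combine(*filters):
    """A page passes the combination only if it passes every filter."""
    return lambda page: all(f(page) for f in filters)
```

For example, `combine(keyword_filter(all_of=["lottery"]), tag_filter(["forum"]))` accepts only forum pages that mention "lottery".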
The harmful information search unit performs a network-wide search based on hot words associated with sudden harmful events, and quickly retrieves the amount of harmful content, the distribution sites and the popularity of harmful content for the incident.
2. Keyword processing unit: generates a keyword search instruction. The harmful information search unit uses Boolean logic expressions and executes the harmful information search task according to the keyword search instruction.
The keyword search instruction includes a category ID, an event title, a keyword option, an excluded keyword option, a weight and a start time. The excluded keyword option ensures that a web page containing any keyword listed in it is not matched and is not identified as a harmful information page.
3. Fuzzy matching unit: matches approximate terms that are similar to the input search string, so that while the harmful information search unit searches for the search string it also searches for the approximate terms and returns their search results.
The user can enter a word, a passage or even a whole article; the system analyzes the conceptual content of the user's search condition and then finds the results the user cares about by conceptual relevance. If the user does not know how to spell the query, fuzzy search can be used: besides the corresponding search results, the system also returns other terms similar to the input string, allowing the user to find other related results.
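A rough picture of the fuzzy matching step: look up terms similar to the query in a known vocabulary, then search for the query and those terms together. The sketch below uses Python's standard `difflib.get_close_matches`, which ranks candidates by character-sequence similarity; this is a stand-in chosen for the example, not the matching method of the patent, and it does not attempt the concept-level analysis described above. The vocabulary, the 0.75 cutoff and the function names are assumptions.

```python
import difflib


def approximate_terms(query, vocabulary, limit=3, cutoff=0.75):
    """Return vocabulary terms similar to the query, excluding the query itself."""
    close = difflib.get_close_matches(query, vocabulary, n=limit + 1, cutoff=cutoff)
    return [term for term in close if term != query][:limit]


def fuzzy_search(query, pages, vocabulary):
    """Search for the query and its approximate terms.

    `pages` maps a page id to its text. The result maps every searched
    term to the ids of the pages that contain it.
    """
    terms = [query] + approximate_terms(query, vocabulary)
    return {term: [pid for pid, text in pages.items() if term in text]
            for term in terms}
```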
4. Automatic word segmentation unit: automatically extracts keywords from the input search string, so that the harmful information search unit completes a precise search based on the extracted keywords. Automatic word segmentation is the foundation of Chinese information processing and analysis. The unit is based on dictionaries and rules, also uses a language model method based on probabilistic analysis, and can segment text to suit the particular requirements of different applications.
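The dictionary-based part of word segmentation can be illustrated with forward maximum matching, a classic baseline for Chinese text: at each position, take the longest dictionary word that starts there. The sketch covers only this dictionary step; the rule-based and probabilistic language model components mentioned above are not modeled, and the tiny dictionary in the usage note is made up for the example.

```python
def segment(text, dictionary, max_word_len=4):
    """Forward maximum matching: at each position, take the longest
    dictionary word; fall back to a single character if none matches."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words
```

With the dictionary `{"有害", "信息", "监测", "系统"}`, the string `"有害信息监测系统"` is split into those four two-character words.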
5. Automatic summary generation unit: dynamically generates a summary of the target web page according to the input search string and its approximate terms. Different search strings produce different summaries of the same page. Users can consult the summary to decide whether the page needs to be opened, and can use the dynamic summaries to understand how the pages in the returned results relate to one another.
The automatic summary generation unit also analyzes the keywords of a web page through the keyword processing unit and automatically extracts key fields to generate the page summary. When the user views the full content of a page, the unit can also generate a summary of the article automatically; in this case the page does not need to be analyzed against the search string and its approximate terms.
The automatic summary generation unit takes word frequency, part of speech and position information into account together, so that keywords are extracted accurately and automatically, and it generates the page summary from the keywords it has extracted.
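Keyword-driven summarization of the kind described above can be sketched as a two-step process: score words by frequency with a small bonus for early positions, then keep the sentences that contain the most keywords. The example below is a simplified assumption, not the patented unit: it works on English text, ignores part of speech, and uses an illustrative stop-word list and scoring formula.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "to", "in", "is"}  # illustrative only


def extract_keywords(text, top_n=3):
    """Score words by frequency, with a small bonus for early positions."""
    tokens = re.findall(r"[a-z]+", text.lower())
    scores = Counter()
    for position, token in enumerate(tokens):
        if token in STOPWORDS:
            continue
        # Each occurrence counts 1, plus a bonus that shrinks with position.
        scores[token] += 1.0 + 1.0 / (1 + position)
    return [word for word, _score in scores.most_common(top_n)]


def summarize(text, max_sentences=1):
    """Keep the sentences containing the most keywords, in original order."""
    keywords = set(extract_keywords(text))
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    ranked = sorted(sentences, key=lambda s: -sum(k in s.lower() for k in keywords))
    chosen = set(ranked[:max_sentences])
    return " ".join(s for s in sentences if s in chosen)
```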
6. Result statistics and analysis unit: analyzes and summarizes the returned search results. The unit comprises a task public-opinion chart generation module, a report generation module, a task article statistics module, a task trend analysis module and a task distribution analysis module.
The task public-opinion chart generation module generates a task public-opinion chart from the search conditions and search results, including statistics on the amount of harmful information, statistics on the number of keyword hits and per-category statistics on the number of web pages.
The report generation module generates reports from the search result information, including bar charts, line charts, single-bar, double-bar and triple-bar charts, composite charts and X-Y plots.
The task trend analysis module generates increment charts, including daily, weekly and monthly increment charts.
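The data behind a daily, weekly or monthly increment chart is simply a count of newly found items per period. The following sketch shows one way to compute those counts; the function name and the bucket label formats are assumptions for illustration, and the charting itself is left out.

```python
from collections import Counter
from datetime import date


def increments(found_dates, period="day"):
    """Count newly found harmful pages per day, ISO week or month."""
    def bucket(d):
        if period == "day":
            return d.isoformat()
        if period == "week":
            year, week, _weekday = d.isocalendar()
            return f"{year}-W{week:02d}"
        if period == "month":
            return f"{d.year}-{d.month:02d}"
        raise ValueError(f"unknown period: {period}")

    counts = Counter(bucket(d) for d in found_dates)
    # Sort by label so the series is in chronological order.
    return dict(sorted(counts.items()))
```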
The task distribution analysis module generates graphical task lists, site distribution maps and media distribution maps.
The search results include the sites where harmful content is distributed, propagation channels, reply rate, click-through rate and participant information.
The statistics and analysis unit gives users a powerful query function. It analyzes and displays real-time and historical data and performs data mining on historical data, including historical data, inspection data, network data and monitoring node data. Query conditions can be set flexibly as needed, and a variety of statistical forms are provided, such as single-bar, double-bar and triple-bar charts, composite charts and X-Y plots (drawn from coordinate points). Combined with the scheduling service, the unit can generate reports in several output formats, such as Word, PDF and Excel, and send them to designated users. This enriches the decision analysis function and makes it easier for users to query data, analyze trends and draw up adjustment plans. The system is also extensible and supports editing display screens for users.
The harmful information monitoring system of the invention further comprises a firewall, through which the crawler system safely crawls web page data in the Internet data center.

Claims (9)

1. An Internet data center harmful information monitoring system, comprising a crawler system and a harmful information monitoring system, wherein the harmful information monitoring system acquires web page data in the Internet data center through the crawler system and performs harmfulness analysis on it, characterized in that: the crawler system comprises one or more crawler clusters, each crawler cluster comprising multiple crawler nodes and one crawler root node, which together form a distributed data acquisition network, wherein the crawler root node is used to control and manage the crawler nodes in the crawler cluster and to communicate with the harmful information monitoring system, the crawler nodes are used to collect harmful information on the network, and each crawler node consists of the following modules:
A multithreaded web page acquisition module, comprising multiple web page acquisition channels and web page parsing modules, wherein each type of web page is acquired through the acquisition channel and parsing module that match it;
A web page library, which stores the web pages collected by the multithreaded web page acquisition module;
An encoding identification and processing module, which automatically identifies the encoding type of a web page and performs encoding conversion on it;
An automatic web page content extraction module, comprising a dynamic web page content extraction module and a static web page content extraction module, which, according to a sensitive word dictionary, captures the URLs of web pages that are found to contain harmful information after encoding conversion;
A URL filter, which filters out URLs that do not need to be downloaded;
A URL deduplication module, which judges whether a filtered URL matches a URL stored in the URL memory and, if it does, performs no further processing on that URL;
A URL scheduling module, which, according to the deduplicated URL queue, controls the multithreaded web page acquisition module to download the corresponding web pages;
The harmful information monitoring system comprises a harmful information search unit, an automatic word segmentation unit, a keyword processing unit and a fuzzy matching unit;
The harmful information search unit comprises a local search port and a network search port; the local search port is used to start the search engine of the local crawler node so that the harmful information search task is executed locally; the network search port is used to start the search engines of multiple crawler nodes so that the harmful information search task is executed by multiple crawler nodes simultaneously, and the search results are returned to the local crawler node through the network search port;
The harmful information search unit further comprises one or more of a keyword filter, a tag field filter, a metadata field filter and a time filter, and precise search is accomplished through these filters and combinations thereof;
The keyword processing unit is used to generate a keyword search instruction, and the harmful information search unit executes the harmful information search task according to the keyword search instruction;
The fuzzy matching unit is used to match approximate words that are similar to the input search string, so that the harmful information search unit, while searching for the search string, also searches for the approximate words and returns the approximate word search results;
The automatic word segmentation unit is used to automatically extract keywords from the input search string, so that the harmful information search unit completes a precise search based on the automatically extracted keywords.
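The URL handling chain in claim 1 (filter, then deduplicate against a URL memory, then schedule for download) can be illustrated with a minimal sketch. The claim does not specify filter rules or data structures, so everything named here is an assumption made for illustration: the `SKIPPED_SUFFIXES` rule, the `should_download` function, the `UrlScheduler` class, and the use of an in-memory set as the URL memory.

```python
from collections import deque
from typing import Optional
from urllib.parse import urldefrag, urlsplit

# Hypothetical filter rule: skip non-HTTP schemes and common static assets.
SKIPPED_SUFFIXES = (".jpg", ".png", ".gif", ".css", ".js", ".zip")


def should_download(url: str) -> bool:
    """URL filter: reject URLs that do not need to be downloaded."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    return not parts.path.lower().endswith(SKIPPED_SUFFIXES)


class UrlScheduler:
    """Filter, deduplicate and queue URLs for the download workers."""

    def __init__(self) -> None:
        self.seen = set()     # stands in for the URL memory
        self.queue = deque()  # deduplicated URL queue

    def submit(self, url: str) -> bool:
        """Return True if the URL was queued, False if filtered or duplicate."""
        url, _fragment = urldefrag(url)  # "#section" does not change the page
        if not should_download(url) or url in self.seen:
            return False
        self.seen.add(url)
        self.queue.append(url)
        return True

    def next_url(self) -> Optional[str]:
        """Hand the next URL to a download thread, or None if the queue is empty."""
        return self.queue.popleft() if self.queue else None
```

In a multithreaded acquisition module, worker threads would call `next_url()` under a lock or use a thread-safe queue; that part is omitted here to keep the sketch short.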
2. The Internet data center harmful information monitoring system according to claim 1, characterized in that: the crawler node further comprises a web page deduplication module, which judges whether the content of a web page matches the content of a web page that has already been downloaded and, if it does, performs no further processing on the web page and deletes it from the web page library.
3. The Internet data center harmful information monitoring system according to claim 2, characterized in that: the web page deduplication module comprises a fingerprint calculation module, a fingerprint library and a fingerprint deduplication module; the fingerprint calculation module computes a fingerprint from the content of the web page according to a web page fingerprint algorithm; the fingerprint deduplication module compares the generated fingerprint with the fingerprints in the fingerprint library and, if an identical or similar fingerprint exists, judges that the web page content has already been downloaded; the fingerprint library is used to store fingerprint data, and the fingerprint libraries of the crawler nodes are updated synchronously.
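As an illustration of matching on "identical or similar" fingerprints, the sketch below uses SimHash with a Hamming distance threshold. The claim does not name a fingerprint algorithm, so SimHash, the 64-bit width, the `max_distance` threshold and all function and class names are assumptions; synchronization of fingerprint libraries between crawler nodes is not shown.

```python
import hashlib
import re

FINGERPRINT_BITS = 64


def simhash(text: str) -> int:
    """Compute a 64-bit SimHash over the word tokens of a page."""
    weights = [0] * FINGERPRINT_BITS
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        value = int.from_bytes(digest[:8], "big")
        for bit in range(FINGERPRINT_BITS):
            # Each token votes +1 or -1 on every bit position.
            weights[bit] += 1 if (value >> bit) & 1 else -1
    return sum(1 << bit for bit in range(FINGERPRINT_BITS) if weights[bit] > 0)


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


class FingerprintLibrary:
    """Stores fingerprints and reports identical or similar ones."""

    def __init__(self, max_distance: int = 3) -> None:
        self.max_distance = max_distance  # assumed similarity threshold
        self.fingerprints = []

    def is_duplicate(self, text: str) -> bool:
        """Return True if a close fingerprint exists; otherwise record this one."""
        fp = simhash(text)
        for known in self.fingerprints:
            if hamming_distance(fp, known) <= self.max_distance:
                return True
        self.fingerprints.append(fp)
        return False
```

The linear scan over stored fingerprints is adequate only for small libraries; a production system would typically index fingerprints by bit blocks so that near matches can be found without comparing against every entry.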
4. The Internet data center harmful information monitoring system according to claim 1, characterized in that: the crawler node further comprises an interval crawling module, which automatically generates interval rules from web page scores and website weights, and controls the automatic web page content extraction module to crawl web pages at the corresponding intervals;
The crawler node further comprises a crawl rule setting module, which, according to the crawl rules that have been set, controls the automatic web page content extraction module to perform the corresponding crawl actions on web pages;
The crawler node further comprises an anti-crawler capture module; when a web page is protected by an anti-crawler program, the anti-crawler capture module is started to perform forced collection of the target web page;
The crawler node further comprises a collection monitoring module, which transmits the working status, collection tasks, collection depth and log information of the crawler node to the crawler root node for aggregation, and accepts control from the crawler root node.
5. The Internet data center harmful information monitoring system according to claim 1, characterized in that: the keyword search instruction comprises a category ID number, an event title, a keyword option, an exclusion keyword option, a weight and a start time; the exclusion keyword option is used so that a matched web page containing any keyword in the exclusion keyword option is regarded as a harmful information web page.
6. The Internet data center harmful information monitoring system according to claim 1, characterized in that: the harmful information monitoring system further comprises an automatic summary generation unit, which dynamically generates a web page summary for a target web page according to the input search string and its approximate words;
The automatic summary generation unit also performs keyword analysis on the web page through the keyword processing unit and automatically extracts key fields to generate the web page summary.
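One way to picture a summary driven by the search string and its approximate words is a snippet centered on the first word in the page that closely matches the query. The sketch below uses `difflib.get_close_matches` from the Python standard library as a stand-in for the fuzzy matching unit; the 0.8 similarity cutoff, the window width and the `make_summary` name are assumptions, and the regex tokenizer assumes space-delimited text rather than the automatic word segmentation the claims call for.

```python
import difflib
import re


def make_summary(page_text: str, query: str, width: int = 30) -> str:
    """Return a snippet around the first word that approximately matches the query."""
    lowered = page_text.lower()
    # Vocabulary of the page; the query is matched against these words.
    vocabulary = sorted(set(re.findall(r"\w+", lowered)))
    # Approximate words: exact hits score 1.0, near misses must reach the cutoff.
    targets = difflib.get_close_matches(query.lower(), vocabulary, n=5, cutoff=0.8)
    if not targets:
        return ""
    # Every target came from the page, so find() always succeeds here.
    start = min(lowered.find(target) for target in targets)
    begin = max(0, start - width)
    end = min(len(page_text), start + width)
    return page_text[begin:end].strip()
```

For example, with the page text "The forum post advertised gambling sites to new users." the misspelled query "gamblng" still yields a snippet containing "gambling", while a query with no close match yields an empty summary.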
7. The Internet data center harmful information monitoring system according to claim 1, characterized in that: the harmful information monitoring system further comprises a result statistical analysis unit, which is used to analyze and compile statistics on the returned search results, the result statistical analysis unit comprising a task public opinion chart generation module, a report generation module, a task document statistics module, a task trend analysis module and a task distribution analysis module.
8. The Internet data center harmful information monitoring system according to claim 7, characterized in that: the task public opinion chart generation module generates a task public opinion chart according to the search conditions and search results, including harmful information content statistics, hit keyword count statistics and web page count classification statistics;
The report generation module is used to generate reports from the search result information;
The task trend analysis module is used to generate an increment chart;
The task distribution analysis module is used to generate a task list, a website distribution chart and a media distribution chart.
9. The Internet data center harmful information monitoring system according to claim 1, characterized in that: the harmful information monitoring system further comprises a firewall, and the crawler system crawls the web page data in the Internet data center securely through the firewall.
CN201510343226.6A 2015-06-19 2015-06-19 Internet data center's harmful information monitoring system Expired - Fee Related CN104951539B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510343226.6A CN104951539B (en) 2015-06-19 2015-06-19 Internet data center's harmful information monitoring system

Publications (2)

Publication Number Publication Date
CN104951539A CN104951539A (en) 2015-09-30
CN104951539B true CN104951539B (en) 2017-12-22

Family

ID=54166197

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510343226.6A Expired - Fee Related CN104951539B (en) 2015-06-19 2015-06-19 Internet data center's harmful information monitoring system

Country Status (1)

Country Link
CN (1) CN104951539B (en)

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105447081A (en) * 2015-11-04 2016-03-30 国云科技股份有限公司 Cloud platform-oriented government affair and public opinion monitoring method
CN105743901B (en) * 2016-03-07 2019-04-09 携程计算机技术(上海)有限公司 Server, anti-crawler system and anti-crawler verification method
CN107291778B (en) * 2016-04-11 2023-05-30 中兴通讯股份有限公司 Data collection method and device
CN105974811A (en) * 2016-07-05 2016-09-28 无锡市华东电力设备有限公司 Smart home control method and system
CN106302797B (en) * 2016-08-31 2019-08-13 北京锐安科技有限公司 Cookie access deduplication method and device
US10447738B2 (en) 2016-09-16 2019-10-15 Oracle International Corporation Dynamic policy injection and access visualization for threat detection
US10721239B2 (en) 2017-03-31 2020-07-21 Oracle International Corporation Mechanisms for anomaly detection and access management
CN109886764B (en) * 2017-12-06 2021-01-26 航天信息股份有限公司 Commodity deduplication method and system based on material combination
CN108304481A (en) * 2017-12-29 2018-07-20 成都三零凯天通信实业有限公司 Visual image content supervision method for multichannel Internet new media data
CN110020256A (en) * 2017-12-30 2019-07-16 惠州学院 Method and system for identifying harmful videos based on user ID and trailer content
CN108536788A (en) * 2018-03-29 2018-09-14 合肥俊刚机械科技有限公司 Data capture method and system based on a distributed crawler
CN108550380A (en) * 2018-04-12 2018-09-18 北京深度智耀科技有限公司 A kind of drug safety information monitoring method and device based on public network
CN109145233A (en) * 2018-08-27 2019-01-04 山东浪潮商用系统有限公司 internet information acquisition system
CN109286613A (en) * 2018-08-28 2019-01-29 刘琦 Control system is led in a kind of monitoring of network public-opinion
CN109783619A (en) * 2018-12-14 2019-05-21 广东创我科技发展有限公司 A kind of data filtering method for digging
CN110399554A (en) * 2019-07-12 2019-11-01 苏州浪潮智能科技有限公司 A kind of detection method, device and the storage system of web site contents specific information
CN110543595B (en) * 2019-08-12 2023-07-04 南京莱斯信息技术股份有限公司 In-station searching system and method
CN111191098B (en) * 2019-12-25 2022-10-18 山石网科通信技术股份有限公司 Data filtering method and device
CN112131462A (en) * 2020-09-10 2020-12-25 中数通信息有限公司 Keyword discovery method and system based on information monitoring and electronic equipment
CN112148956A (en) * 2020-09-30 2020-12-29 上海交通大学 Hidden net threat information mining system and method based on machine learning
CN112632355A (en) * 2020-11-26 2021-04-09 武汉虹旭信息技术有限责任公司 Fragment content processing method and device for harmful information
CN114238962A (en) * 2021-09-29 2022-03-25 睿贸恒诚(山东)科技发展有限责任公司 Harmful information filtering system and method based on mobile internet

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7743045B2 (en) * 2005-08-10 2010-06-22 Google Inc. Detecting spam related and biased contexts for programmable search engines
CN101231661A (en) * 2008-02-19 2008-07-30 上海估家网络科技有限公司 Method and system for digging object grade knowledge
US8782037B1 (en) * 2010-06-20 2014-07-15 Remeztech Ltd. System and method for mark-up language document rank analysis
CN102841898A (en) * 2011-06-23 2012-12-26 张家港凯纳信息技术有限公司 Network information monitoring and analyzing system
CN103023714A (en) * 2012-11-21 2013-04-03 上海交通大学 Activeness and cluster structure analyzing system and method based on network topics
CN103310026A (en) * 2013-07-08 2013-09-18 焦点科技股份有限公司 Lightweight common webpage topic crawler method based on search engine
CN103902667A (en) * 2014-03-14 2014-07-02 浪潮电子信息产业股份有限公司 Simple network information collector achieving method based on meta-search

Also Published As

Publication number Publication date
CN104951539A (en) 2015-09-30

Similar Documents

Publication Publication Date Title
CN104951539B (en) Internet data center's harmful information monitoring system
CN104899324B (en) Sample training system for an IDC harmful information monitoring system
Wu et al. Modeling method of internet public information data mining based on probabilistic topic model
CN110597981B (en) Network news summary system for automatically generating summary by adopting multiple strategies
CN106096056A (en) A kind of based on distributed public sentiment data real-time collecting method and system
CN102063488A (en) Code searching method based on semantics
CN104899323A (en) Crawler system used for IDC harmful information monitoring platform
CN103593336A (en) Knowledge pushing system and method based on semantic analysis
CN112749284A (en) Knowledge graph construction method, device, equipment and storage medium
Das et al. A CV parser model using entity extraction process and big data tools
CN104615627A (en) Event public sentiment information extracting method and system based on micro-blog platform
Nikhil et al. A survey on text mining and sentiment analysis for unstructured web data
CN115757689A (en) Information query system, method and equipment
Nigam et al. Web scraping: from tools to related legislation and implementation using python
KR102107474B1 (en) Social issue deduction system and method using crawling
Nadee et al. Towards data extraction of dynamic content from JavaScript Web applications
Pandya et al. Mated: metadata-assisted twitter event detection system
CN103886033B (en) Intelligent vertical searching device and method for safety industry chain
CN116226494A (en) Crawler system and method for information search
CN115269862A (en) Electric power question-answering and visualization system based on knowledge graph
Sun et al. Big data analysis on social networking
CN114117242A (en) Data query method and device, computer equipment and storage medium
KR102252096B1 (en) System for providing bigdata based minutes process service
Preetha et al. Personalized search engines on mining user preferences using clickthrough data
Singh et al. Semantic web mining: survey and analysis

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20171222

Termination date: 20180619
