CN104951539B

CN104951539B - Harmful information monitoring system for an Internet data center

Info

Publication number: CN104951539B
Application number: CN201510343226.6A
Authority: CN
Inventors: 彭光辉
Original assignee: Chengdu Aierpu Science & Technology Co Ltd
Current assignee: Chengdu Aierpu Science & Technology Co Ltd
Priority date: 2015-06-19
Filing date: 2015-06-19
Publication date: 2017-12-22
Anticipated expiration: 2035-06-19
Also published as: CN104951539A

Abstract

The invention discloses a harmful information monitoring system for an Internet data center, comprising a crawler system and a harmful information monitoring system. The harmful information monitoring system obtains web page data in the Internet data center through the crawler system and analyzes it for harmful content. The crawler system comprises multiple crawler clusters, each made up of crawler nodes and a crawler root node. Each crawler node comprises a multithreaded web page collection module, a web page library, an encoding identification and processing module, an automatic web page content extraction module, a URL filter, a URL deduplication module, and a URL scheduling module. The harmful information monitoring system achieves more accurate search through a harmful information search unit, an automatic word segmentation unit, a keyword processing unit, and a fuzzy matching unit. The invention provides a powerful data collection function: multiple crawler clusters monitor dynamic and static web pages comprehensively and in real time, and data related to sensitive words can be collected from massive data, so that harmful web pages are discovered proactively.

Description

Harmful information monitoring system for an Internet data center

Technical field

The present invention relates to a harmful information monitoring system for an Internet data center.

Background art

With the rapid development of networks, the World Wide Web has become the carrier of a vast amount of information, and extracting and using this information efficiently has become a huge challenge. Search engines, as tools that help people retrieve information, have become users' entrance and guide to the World Wide Web. However, these general-purpose search engines have certain limitations.

In an increasingly active online community environment, every Internet user may become a publisher and disseminator of harmful information, and the channels through which harmful content spreads keep widening, including blogs, news, forums, microblogs, and other channels. Web crawlers are the precursor technology that makes search engines possible, and the arrival of the big data era and the rapid development of Internet technology give web crawlers even greater research significance. Efficient, continuously running web crawlers that can cope with challenges such as the sharp growth in web data volume, the short update cycle of online text, and dynamically changing page structure have become a research hotspot in harmful information mining.

Summary of the invention

The object of the invention is to overcome the deficiencies of the prior art and provide a harmful information monitoring system for an Internet data center. The system provides a powerful data collection function, monitoring dynamic and static web pages comprehensively and in real time through multiple crawler clusters; it collects data related to sensitive words from massive data, so that harmful web pages are discovered proactively; and it achieves more accurate search through a harmful information search unit, an automatic word segmentation unit, a keyword processing unit, and a fuzzy matching unit.

The object of the invention is achieved through the following technical solution: a harmful information monitoring system for an Internet data center, comprising a crawler system and a harmful information monitoring system, wherein the harmful information monitoring system obtains web page data in the Internet data center through the crawler system and analyzes it for harmful content.

The crawler system comprises one or more crawler clusters. Each crawler cluster comprises multiple crawler nodes and one crawler root node, forming a distributed data collection network. The crawler root node controls and manages the crawler nodes in its cluster and communicates with the harmful information monitoring system; the crawler nodes collect harmful information from the network. Each crawler node consists of the following modules:

1. Multithreaded web page collection module, comprising multiple web page collection channels and web page parsing modules; each type of web page is collected through the collection channel and parsing module that match it.

2. Web page library, which stores the web pages collected by the multithreaded web page collection module.

3. Encoding identification and processing module, which automatically identifies the encoding of a web page and converts it.

4. Automatic web page content extraction module, comprising a dynamic web page content extraction module and a static web page content extraction module; based on a sensitive-word dictionary, it captures the URLs of encoding-converted pages that contain harmful information.

5. URL filter, which filters out URLs that do not need to be downloaded.

6. URL deduplication module, which determines whether a filtered URL matches a URL already stored in the URL store; if it does, the URL receives no further processing.

7. URL scheduling module, which directs the multithreaded web page collection module to download the corresponding pages according to the deduplicated URL queue.

8. Web page deduplication module, which determines whether a page's content matches the content of an already downloaded page; if it does, the page receives no further processing and is deleted from the web page library.

The web page deduplication module comprises a fingerprint computation module, a fingerprint library, and a fingerprint deduplication module. The fingerprint computation module computes a fingerprint from the page content using a web page fingerprint algorithm. The fingerprint deduplication module compares the generated fingerprint with the fingerprints in the fingerprint library; if an identical or similar fingerprint exists, the page content is judged to have been downloaded already. The fingerprint library stores fingerprint data, and the fingerprint libraries of all crawler nodes are updated synchronously.

9. Interval crawl module, which automatically generates interval rules from page scores and site weights and directs the automatic web page content extraction module to crawl pages at the corresponding intervals.

10. Crawl rule setting module, which directs the automatic web page content extraction module to perform the corresponding crawl actions according to the configured crawl rules.

11. Anti-crawler capture module: when a web page is protected by an anti-crawler program, the anti-crawler capture module is started and performs forced collection of the target page.

12. Collection monitoring module, which sends the crawler node's working status, collection tasks, collection depth, and log information to the crawler root node for aggregation, and accepts control from the crawler root node.
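As an illustration of how modules 5 to 7 above (URL filtering, URL deduplication, and URL scheduling) might cooperate, the following is a minimal Python sketch. The patent does not specify filter rules or data structures, so the suffix list, the in-memory `set` standing in for the URL store, and the names `should_download` and `UrlScheduler` are all assumptions made for this example.

```python
from collections import deque
from typing import Optional
from urllib.parse import urldefrag, urlsplit

# Hypothetical filter rule: skip non-HTTP schemes and common binary assets.
SKIPPED_SUFFIXES = (".jpg", ".png", ".gif", ".zip", ".exe")


def should_download(url: str) -> bool:
    """URL filter: reject URLs that do not need to be downloaded."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    return not parts.path.lower().endswith(SKIPPED_SUFFIXES)


class UrlScheduler:
    """Queues URLs that pass the filter and have not been seen before."""

    def __init__(self) -> None:
        self.seen = set()     # assumed in-memory stand-in for the URL store
        self.queue = deque()  # deduplicated URL queue, first in first out

    def submit(self, url: str) -> bool:
        """Return True if the URL was queued, False if filtered or repeated."""
        url, _fragment = urldefrag(url)  # page#a and page#b are the same page
        if not should_download(url) or url in self.seen:
            return False
        self.seen.add(url)
        self.queue.append(url)
        return True

    def next_url(self) -> Optional[str]:
        """Hand the next URL to a download worker, or None if the queue is empty."""
        return self.queue.popleft() if self.queue else None
```

In this sketch a download worker thread would call `next_url()` in a loop; a real multithreaded crawler would also need a lock or a thread-safe queue around the shared state.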

The harmful information monitoring system comprises a harmful information search unit, an automatic word segmentation unit, a keyword processing unit, and a fuzzy matching unit.

The harmful information search unit comprises a local search port and a network search port. The local search port starts the search engine of the local crawler node, which performs the harmful information search task locally. The network search port starts the search engines of multiple crawler nodes, which perform the harmful information search task simultaneously, and the search results are returned to the local crawler node through the network search port.

The harmful information search unit further comprises one or more of a keyword filter, a tag field filter, a metadata field filter, and a time filter; precise search is achieved through these filters and their combinations.

The keyword processing unit generates keyword search instructions, and the harmful information search unit performs harmful information search tasks according to those instructions.

The fuzzy matching unit matches approximate words similar to the input search string, so that the harmful information search unit, while searching for the search string, also searches for the approximate words and returns their search results.

The automatic word segmentation unit automatically extracts keywords from the input search string, and the harmful information search unit performs a precise search based on the extracted keywords.

A keyword search instruction comprises a category ID, an event title, keyword options, excluded-keyword options, a weight, and a start time. The excluded-keyword options ensure that a web page containing any keyword listed in them is not matched and identified as a harmful information page.
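The fields of the keyword search instruction can be pictured as a small record type. The sketch below is an assumption-laden illustration: the field names, their Python types, and the rule that a single keyword hit is enough to match are not stated in the patent; only the exclusion behavior follows the text above.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class KeywordSearchInstruction:
    """Hypothetical record mirroring the fields listed in the text."""
    category_id: int
    event_title: str
    keywords: List[str]
    excluded_keywords: List[str] = field(default_factory=list)
    weight: float = 1.0
    start_time: Optional[datetime] = None

    def matches(self, page_text: str) -> bool:
        """True if the page hits a keyword and contains no excluded keyword."""
        # Any excluded keyword vetoes the page, as the text describes.
        if any(word in page_text for word in self.excluded_keywords):
            return False
        # Assumption: one keyword hit is sufficient to flag the page.
        return any(word in page_text for word in self.keywords)
```

For example, an instruction with keyword `"scam"` and excluded keyword `"awareness"` would flag "a new scam site" but not "scam awareness campaign".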

The harmful information monitoring system further comprises an automatic summary generation unit, which dynamically generates a summary of a target web page according to the input search string and its approximate words.

The automatic summary generation unit also performs keyword analysis on the page through the keyword processing unit and automatically extracts key fields to generate the page summary.

The harmful information monitoring system further comprises a result statistical analysis unit, which analyzes and compiles statistics on the returned search results. The statistical analysis unit comprises a task public-opinion chart generation module, a report generation module, a task article statistics module, a task trend analysis module, and a task distribution analysis module.

The task public-opinion chart generation module generates a task public-opinion chart from the search conditions and search results, including harmful information count statistics, matched-keyword count statistics, and web page count statistics by category.

The report generation module generates reports from the search result information.

The task trend analysis module generates increment charts.

The task distribution analysis module generates task lists, site distribution charts, and media distribution charts.

The harmful information monitoring system further comprises a firewall, through which the crawler system crawls web page data in the Internet data center securely.

The beneficial effects of the invention are as follows. The proposed harmful information monitoring system for an Internet data center can collect data related to sensitive words from massive data and proactively discover harmful content. It captures related information such as the sites where harmful content is distributed, propagation channels, reply rate, click-through rate, and participants, which assists in analyzing the popularity, importance, and development trend of harmful web pages, so that harmful content can be analyzed accurately. A suspect's virtual identity can be designated for focused monitoring, and the collected data is used to analyze the scope of activity, the content disseminated, the time of activity, and so on. Qualitative analysis of speech can be configured, and event popularity can be located and analyzed quickly.

The invention also has the following functional features:

1) Multithreaded collection: different strategies are customized for different types of websites, and collection supports multithreading for fast information gathering;

2) Distributed collection: large-scale data collection is carried out by multiple crawler clusters and numerous crawler nodes;

3) Collection monitoring: the working status of crawler nodes, collection tasks, collection depth, logs, system operation reports, and so on are monitored and managed;

4) Automatic web page content extraction: a variety of dynamic and static web pages can be collected, such as HTM, HTML, SHTML, XML, PHP, ASP, JSP, and JavaScript pages;

5) Automatic encoding identification and conversion: automatic identification of multiple encodings such as GBK, GB2312, BIG5, UTF-8, UTF-16, BIGENDIAN, and ISO8859-1 is supported, and the system automatically converts content to UTF;

6) Incremental update: crawler nodes collect only pages that were newly generated or changed since the last update and do not re-collect pages already downloaded, which keeps information updates efficient; users can also configure full collection as needed;

7) Anti-crawler capture: corresponding strategies are set for those websites that deploy anti-crawler programs, so that their pages can still be crawled;

8) Interval crawling: interval rules are generated automatically from page scores, site weights, and so on, and pages are crawled at the corresponding intervals;

9) Custom crawl rules: users can also set crawl rules themselves.
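The interval crawling feature (item 8 above) derives a revisit interval from a page score and a site weight, but the patent gives no formula. The sketch below shows one plausible rule purely as an assumption: the interval shrinks as the product of score and weight grows, and is clamped to a minimum. The function name, the default values, and the formula itself are hypothetical.

```python
def crawl_interval_seconds(page_score: float, site_weight: float,
                           base: float = 86400.0,
                           minimum: float = 300.0) -> float:
    """Return a revisit interval; higher score and weight give a shorter one.

    Assumed rule: interval = base / (1 + page_score * site_weight),
    never shorter than `minimum` seconds.
    """
    if page_score < 0 or site_weight < 0:
        raise ValueError("score and weight must be non-negative")
    interval = base / (1.0 + page_score * site_weight)
    return max(minimum, interval)
```

Under these assumed defaults, a page with score 0 is revisited once a day, while a page with score 1 on a site of weight 1 is revisited every twelve hours.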

Brief description of the drawings

Fig. 1 is a structural block diagram of the crawler system of the invention;

Fig. 2 is a structural block diagram of a crawler node in the invention;

Fig. 3 is a structural block diagram of the harmful information monitoring system in the invention.

Embodiment

The technical solution of the invention is described in further detail below with reference to the accompanying drawings, but the scope of protection of the invention is not limited to the following description.

The harmful information monitoring system for an Internet data center comprises a crawler system and a harmful information monitoring system. The harmful information monitoring system obtains web page data in the Internet data center through the crawler system and analyzes it for harmful content.

(1) Crawler system

As shown in Fig. 1, the crawler system is responsible for discovering, crawling, and normalizing raw data from the Internet. Depending on the Internet application, it comprises one or more crawler clusters. Each crawler cluster comprises multiple crawler nodes and one crawler root node, forming a distributed data collection network. The crawler root node controls and manages the crawler nodes in its cluster and communicates with the harmful information monitoring system; the crawler nodes collect harmful information from the network.

As shown in Fig. 2 in the present invention, each reptile node forms by following multiple module：

1st, multithreading web retrieval module, including a variety of web retrieval passages and web analysis module, for different type Webpage, it is acquired by the web retrieval passage and web analysis module that match with it；The web analysis mould Block includes dns resolution module, HTTP parsing modules, FTP parsing modules, GOPHER parsing modules etc.；

Realize multithreading acquisition function：Different types of website can be directed to and customize different strategies, collection is supported multi-thread Journey, realize that snap information gathers；

2nd, web page library, the webpage that storage multithreading web retrieval module is gathered；

3rd, code identification processing module, the type of coding of automatic identification webpage, and code conversion processing is carried out to it；Support A variety of coding automatic identifications such as GBK, GB2312, BIG5, UTF-8, UTF-16, BIGENDIAN, ISO8859-1, system are entered automatically Row code conversion is UTF；

4th, web page contents automatically extract module, including dynamic web content extraction module and static web contents extraction mould Block, the URL that harmful Intelligence Page after code conversion is handled be present is captured according to sensitive dictionary；A variety of dynamics and static network can be gathered Page, such as the webpage such as HTM, HTML, SHTML, XML, PHP, ASP, JSP, JavaScript；

5th, url filtering device, the URL that need not be downloaded is filtered；

6th, URL deduplication modules, for judging whether the URL after filtering is consistent with the URL stored in URL memories, if Consistent then no longer follow-up to URL progress processing；Incremental update function is realized, after ensureing that reptile node only gathers last time renewal Newly-generated or change webpage, ensures the efficiency of information updating, user can also root without resurveying the webpage downloaded Whole collections can be also set according to needs；

9th, fingerprint computing module, fingerprint base and fingerprint deduplication module, fingerprint computing module is according to web page fingerprint algorithm, by net The content of page is contrasted the generation fingerprint with the fingerprint in fingerprint base by calculating generation fingerprint, fingerprint deduplication module, if Exist it is same or like as fingerprint, then judge that the web page contents had been downloaded, fingerprint base is used to store finger print data, and each The fingerprint base of reptile node synchronizes renewal.

10th, handling module is spaced, interval handling module automatically generates interval rule by webpage scoring and weight of website, and Control web page contents automatically extract module and corresponding interval crawl are carried out to webpage.

11st, rules for grasping setup module, rules for grasping setup module control web page contents according to set rules for grasping Automatically extract module and corresponding grasping movement is carried out to webpage.

12nd, anti-crawler capturing module, when webpage is provided with anti-crawlers, anti-crawler capturing module is started, to target Webpage carries out pressure collection.

13rd, acquisition monitoring module, acquisition monitoring module by the working condition of reptile node, acquisition tasks, sampling depth and Log information is transmitted to reptile root node and carries out convergence processing, and receives the control of reptile root node.
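The fingerprint-based page deduplication described in the list above detects pages whose fingerprint is identical or similar to a stored one. The patent does not name the fingerprint algorithm, so the following sketch substitutes SimHash with a Hamming-distance threshold as one well-known way to get "similar fingerprint" behavior. The class name `FingerprintLibrary`, the 64-bit size, and the threshold of 3 bits are assumptions, and the synchronization of libraries between crawler nodes is not modeled.

```python
import hashlib


def simhash(text: str, bits: int = 64) -> int:
    """Compute a SimHash fingerprint over whitespace-separated tokens.

    `bits` is assumed to be a multiple of 8 and at most 256.
    """
    counts = [0] * bits
    for token in text.split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        value = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            counts[i] += 1 if (value >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


class FingerprintLibrary:
    """In-memory fingerprint store for a single crawler node (assumed)."""

    def __init__(self, max_distance: int = 3) -> None:
        self.fingerprints = []
        self.max_distance = max_distance

    def is_duplicate(self, text: str) -> bool:
        """True if an identical or similar fingerprint exists; else store it."""
        fp = simhash(text)
        if any(hamming_distance(fp, known) <= self.max_distance
               for known in self.fingerprints):
            return True
        self.fingerprints.append(fp)
        return False
```

The linear scan over stored fingerprints keeps the sketch short; at crawler scale an index over fingerprint bit blocks would typically replace it.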

The crawler node further comprises a tag counter and a tag count log file. The tag counter records the number of downloads in the web page library and writes this data to the tag count log file.

The crawler system further comprises a full-text database, an index database, and a ranking database, each of which is connected to the crawler nodes and the crawler root node.

(2) Harmful information monitoring system

As shown in Fig. 3, the harmful information monitoring system comprises a harmful information search unit, an automatic word segmentation unit, a keyword processing unit, and a fuzzy matching unit.

1. Harmful information search unit, comprising a local search port and a network search port. The local search port starts the search engine of the local crawler node, which performs the harmful information search task locally. The network search port starts the search engines of multiple crawler nodes, which perform the harmful information search task simultaneously, and the search results are returned to the local crawler node through the network search port.

The harmful information search unit further comprises one or more of a keyword filter, a tag field filter, a metadata field filter, and a time filter. Precise search is achieved through these filters and their combinations, for example by providing weights for search keywords and weighted combined search over multiple metadata fields.

Keyword filter: supports combinations of keywords in logical expressions, including AND, OR, NOT, and so on.

Tag field filter: supports restricted searches that combine multiple tag fields with logical AND, OR, and NOT.

Metadata field filter: multiple metadata fields can be defined, and search results are selected by parameter.

Time filter: supports sorting by date, relevance, and combinations of other fields.
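The keyword filter's AND, OR, and NOT combinations can be illustrated with a small evaluator. This is only a sketch: the nested-tuple expression format and the function name `evaluate` are assumptions chosen for brevity, and substring containment stands in for whatever matching the real search engine performs.

```python
def evaluate(expr, text: str) -> bool:
    """Evaluate a Boolean keyword expression against a page's text.

    An expression is either a keyword string or a tuple such as
    ("AND", e1, e2, ...), ("OR", e1, e2, ...), or ("NOT", e).
    """
    if isinstance(expr, str):
        return expr in text  # leaf: plain substring match (assumed)
    op, *args = expr
    if op == "AND":
        return all(evaluate(a, text) for a in args)
    if op == "OR":
        return any(evaluate(a, text) for a in args)
    if op == "NOT":
        (operand,) = args  # NOT takes exactly one operand
        return not evaluate(operand, text)
    raise ValueError(f"unknown operator: {op}")
```

For instance, `("AND", "lottery", ("OR", "prize", "winner"), ("NOT", "official"))` matches "lottery prize claim" but not "official lottery prize".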

Field tag search works by building tag fields over the indexed text; users can select tag combinations as needed, and the correspondingly restricted results are returned.

The harmful information search unit searches the entire network for suddenly emerging harmful hot words, quickly determining the amount of harmful content, its distribution sites, and its popularity for an emergent event.

2. Keyword processing unit, which generates keyword search instructions; the harmful information search unit uses Boolean logical expressions and performs harmful information search tasks according to the keyword search instructions.

3. Fuzzy matching unit, which matches approximate words similar to the input search string, so that the harmful information search unit, while searching for the search string, also searches for the approximate words and returns their search results.

A user can input a word, a passage, or even an entire article; the system analyzes the conceptual content of the user's search condition and then finds the results the user cares about based on conceptual relevance. If the user does not know how to spell the query, fuzzy search can be used: in addition to the corresponding search results, the system returns other words similar to the input string, allowing the user to find other related results.
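The fuzzy matching behavior, returning words similar to the input string, can be approximated with Python's standard `difflib` module. This is a sketch under assumptions: the patent does not say how similarity is measured, and the function name, the vocabulary list, and the cutoff of 0.75 are illustrative choices.

```python
import difflib
from typing import List


def approximate_words(query: str, vocabulary: List[str],
                      limit: int = 3, cutoff: float = 0.75) -> List[str]:
    """Return up to `limit` vocabulary words similar to, but not equal to, the query."""
    # Ask for one extra candidate because the query itself may be in the vocabulary.
    candidates = difflib.get_close_matches(
        query, vocabulary, n=limit + 1, cutoff=cutoff)
    return [word for word in candidates if word != query][:limit]
```

A search front end could run the original query first and then issue additional searches for each word this function returns, which corresponds to returning approximate-word results alongside the exact ones.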

4. Automatic word segmentation unit, which automatically extracts keywords from the input search string, so that the harmful information search unit performs a precise search based on the extracted keywords. Automatic word segmentation is the foundation of Chinese information processing and analysis. It is based on dictionaries and rules, combined with language-model methods based on probabilistic analysis, and can perform segmentation tailored to the specific requirements of different applications.
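The dictionary-based part of word segmentation can be illustrated with forward maximum matching, a classic technique for Chinese text. This is a sketch only: the patent does not name a specific algorithm, the probabilistic language-model component it mentions is not shown, and the function name and sample dictionary are assumptions.

```python
from typing import List, Set


def segment(text: str, dictionary: Set[str], max_len: int = 4) -> List[str]:
    """Forward maximum matching: at each position take the longest dictionary word.

    Characters that start no dictionary word are emitted as single-character tokens.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words
```

With a dictionary containing 有害信息, 监控, and 系统, the string 有害信息监控系统 is split into those three words, which could then be fed to the search unit as keywords.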

5. Automatic summary generation unit, which dynamically generates a summary of a target web page according to the input search string and its approximate words. Different summaries of a page can be generated dynamically for different input search strings. Users can review the summary to decide whether they need to open the page, and the dynamic summaries also reveal the relationships among the pages in the returned results.

The automatic summary generation unit also performs keyword analysis on the page through the keyword processing unit and automatically extracts key fields to generate the summary. When a user views the full content of a page, the automatic summary generation unit can also generate a summary of the article content automatically; in this case the page does not need to be analyzed against the search string and its approximate words.

The automatic summary generation unit can take word frequency, part of speech, and position information into account to extract keywords accurately and automatically, and generates the page summary from the keywords it extracts.
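A frequency-and-position approach to summary generation can be sketched as a simple extractive summarizer. The code below is an assumption-based illustration for English text: it scores sentences by the frequency of their words and gives the first sentence a position bonus. Part-of-speech information, which the text also mentions, is not modeled, and the length-based stop-word filter and the bonus value of 1.5 are arbitrary choices.

```python
import re
from collections import Counter


def summarize(text: str, max_sentences: int = 2) -> str:
    """Pick the highest-scoring sentences and return them in document order."""
    sentences = [s.strip()
                 for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = re.findall(r"[a-z]+", text.lower())
    # Crude stop-word filter (assumed): ignore words of three letters or fewer.
    freq = Counter(w for w in words if len(w) > 3)

    def score(index: int) -> float:
        tokens = re.findall(r"[a-z]+", sentences[index].lower())
        frequency_score = sum(freq[t] for t in tokens)
        position_bonus = 1.5 if index == 0 else 1.0  # lead sentence weighs more
        return frequency_score * position_bonus

    ranked = sorted(range(len(sentences)), key=score, reverse=True)
    chosen = sorted(ranked[:max_sentences])  # restore document order
    return " ".join(sentences[i] for i in chosen)
```

A query-dependent variant, closer to the dynamic summaries described above, could add a further bonus to sentences that contain the search string or its approximate words.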

6. Result statistical analysis unit, which analyzes and compiles statistics on the returned search results. The statistical analysis unit comprises a task public-opinion chart generation module, a report generation module, a task article statistics module, a task trend analysis module, and a task distribution analysis module.

The report generation module generates reports from the search result information, including column charts, line charts, single-bar charts, double-bar charts, triple-bar charts, multi-charts, and X-Y charts.

The task trend analysis module generates increment charts, including daily, weekly, and monthly increment charts.
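The data behind daily, weekly, and monthly increment charts is a count of new hits per period. The sketch below shows one way to compute it, assuming each search hit carries a date; the function name `increments` and the period key formats are assumptions, and the charting itself is left out.

```python
from collections import Counter
from datetime import date
from typing import Dict, Iterable


def increments(hit_dates: Iterable[date], period: str = "day") -> Dict[str, int]:
    """Count new hits per day, ISO week, or month, sorted by period key."""

    def key(d: date) -> str:
        if period == "day":
            return d.isoformat()
        if period == "week":
            year, week, _weekday = d.isocalendar()
            return f"{year}-W{week:02d}"
        if period == "month":
            return f"{d.year}-{d.month:02d}"
        raise ValueError(f"unknown period: {period}")

    counts = Counter(key(d) for d in hit_dates)
    return dict(sorted(counts.items()))
```

The resulting dictionary maps each period label to its increment and can be passed directly to a plotting library as the x and y series of an increment chart.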

The task distribution analysis module generates graphical task lists, site distribution charts, and media distribution charts.

The search results include the distribution sites of harmful content, propagation channels, reply rate, click-through rate, and participant information.

The statistical analysis unit provides users with powerful query functions for analyzing and displaying real-time and historical data, and applies data mining to historical data, including historical data, inspection data, network data, and monitoring node data. Query conditions can be set flexibly as needed, and a variety of statistical presentations are provided, such as single-bar charts, double-bar charts, triple-bar charts, multi-charts, and X-Y charts (plotted coordinate points). Combined with a scheduling service, reports can be generated in a variety of output formats, such as Word, PDF, and Excel, and sent to designated users. This enriches decision analysis and makes it easier for users to query data, analyze trends, and formulate adjustment plans. The system is also extensible and lets users edit display screens.

The harmful information monitoring system of the invention further comprises a firewall, through which the crawler system crawls web page data in the Internet data center securely.

Claims

1. A harmful information monitoring system for an Internet data center, comprising a crawler system and a harmful information monitoring system, wherein the harmful information monitoring system obtains web page data in the Internet data center through the crawler system and analyzes it for harmful content, characterized in that: the crawler system comprises one or more crawler clusters, each crawler cluster comprising multiple crawler nodes and one crawler root node and forming a distributed data collection network, wherein the crawler root node controls and manages the crawler nodes in its cluster and communicates with the harmful information monitoring system, the crawler nodes collect harmful information from the network, and each crawler node consists of the following modules:

a multithreaded web page collection module, comprising multiple web page collection channels and web page parsing modules, wherein each type of web page is collected through the collection channel and parsing module that match it;

a web page library, which stores the web pages collected by the multithreaded web page collection module;

an encoding identification and processing module, which automatically identifies the encoding of a web page and converts it;

an automatic web page content extraction module, comprising a dynamic web page content extraction module and a static web page content extraction module, which, based on a sensitive-word dictionary, captures the URLs of encoding-converted pages that contain harmful information;

a URL filter, which filters out URLs that do not need to be downloaded;

a URL deduplication module, which determines whether a filtered URL matches a URL already stored in the URL store, and if it does, the URL receives no further processing;

a URL scheduling module, which directs the multithreaded web page collection module to download the corresponding pages according to the deduplicated URL queue;

the harmful information monitoring system comprises a harmful information search unit, an automatic word segmentation unit, a keyword processing unit, and a fuzzy matching unit;

the harmful information search unit comprises a local search port and a network search port, wherein the local search port starts the search engine of the local crawler node, which performs the harmful information search task locally; the network search port starts the search engines of multiple crawler nodes, which perform the harmful information search task simultaneously, and the search results are returned to the local crawler node through the network search port;

the harmful information search unit further comprises one or more of a keyword filter, a tag field filter, a metadata field filter, and a time filter, and precise search is achieved through these filters and their combinations;

the keyword processing unit generates keyword search instructions, and the harmful information search unit performs harmful information search tasks according to the keyword search instructions;

the fuzzy matching unit matches approximate words similar to the input search string, so that the harmful information search unit, while searching for the search string, also searches for the approximate words and returns their search results;

the automatic word segmentation unit automatically extracts keywords from the input search string, so that the harmful information search unit performs a precise search based on the extracted keywords.

2. The harmful information monitoring system for an Internet data center according to claim 1, characterized in that: the crawler node further comprises a web page deduplication module, which determines whether a page's content matches the content of an already downloaded page; if it does, the page receives no further processing and is deleted from the web page library.

3. The harmful information monitoring system for an Internet data center according to claim 2, characterized in that: the web page deduplication module comprises a fingerprint computation module, a fingerprint library, and a fingerprint deduplication module; the fingerprint computation module computes a fingerprint from the page content using a web page fingerprint algorithm; the fingerprint deduplication module compares the generated fingerprint with the fingerprints in the fingerprint library, and if an identical or similar fingerprint exists, the page content is judged to have been downloaded already; the fingerprint library stores fingerprint data, and the fingerprint libraries of all crawler nodes are updated synchronously.

4. The harmful information monitoring system for an Internet data center according to claim 1, characterized in that: the crawler node further comprises an interval crawl module, which automatically generates interval rules from page scores and site weights and directs the automatic web page content extraction module to crawl pages at the corresponding intervals;

the crawler node further comprises a crawl rule setting module, which directs the automatic web page content extraction module to perform the corresponding crawl actions according to the configured crawl rules;

the crawler node further comprises an anti-crawler capture module, which is started when a web page is protected by an anti-crawler program and performs forced collection of the target page;

the crawler node further comprises a collection monitoring module, which sends the crawler node's working status, collection tasks, collection depth, and log information to the crawler root node for aggregation, and accepts control from the crawler root node.

5. The harmful information monitoring system for an Internet data center according to claim 1, characterized in that: the keyword search instruction comprises a category ID, an event title, keyword options, excluded-keyword options, a weight, and a start time; the excluded-keyword options ensure that a web page containing any keyword listed in them is not matched and identified as a harmful information page.

6. The harmful information monitoring system for an Internet data center according to claim 1, characterized in that: the harmful information monitoring system further comprises an automatic summary generation unit, which dynamically generates a summary of a target web page according to the input search string and its approximate words;

the automatic summary generation unit also performs keyword analysis on the page through the keyword processing unit and automatically extracts key fields to generate the page summary.

7. The harmful information monitoring system for an Internet data center according to claim 1, characterized in that: the harmful information monitoring system further comprises a result statistical analysis unit, which analyzes and compiles statistics on the returned search results; the statistical analysis unit comprises a task public-opinion chart generation module, a report generation module, a task article statistics module, a task trend analysis module, and a task distribution analysis module.

8. The harmful information monitoring system for an Internet data center according to claim 7, characterized in that: the task public-opinion chart generation module generates a task public-opinion chart from the search conditions and search results, including harmful information count statistics, matched-keyword count statistics, and web page count statistics by category;

the report generation module generates reports from the search result information;

the task trend analysis module generates increment charts;

9. The harmful information monitoring system for an Internet data center according to claim 1, characterized in that: the harmful information monitoring system further comprises a firewall, through which the crawler system crawls web page data in the Internet data center securely.