CN109460393A - Big-data-based pre-inspection and pre-repair visualization system - Google Patents

Big-data-based pre-inspection and pre-repair visualization system

Info

Publication number
CN109460393A
CN109460393A (application CN201811322934.1A; granted publication CN109460393B)
Authority
CN
China
Prior art keywords
data
module
file
repaired
cleansing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811322934.1A
Other languages
Chinese (zh)
Other versions
CN109460393B (en)
Inventor
郭淑琴
贾翼
任宏亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN201811322934.1A priority Critical patent/CN109460393B/en
Publication of CN109460393A publication Critical patent/CN109460393A/en
Application granted granted Critical
Publication of CN109460393B publication Critical patent/CN109460393B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A big-data-based pre-inspection and pre-repair visualization system includes an intelligent data acquisition module, a data cleansing early-warning module, a data cleansing repair module, a high-risk data alarm module, a data fast-storage module and a GIS data dynamic loading module. The intelligent data acquisition module classifies data intelligently, which improves the cleansing efficiency for data files. A prediction policy places high-risk data on a blacklist, and a PLRU algorithm updates the blacklist iteratively, which substantially reduces the system's false alarms. A pre-repair strategy repairs incomplete data, greatly raising the utilization rate of the data. The data fast-storage module stores safe data quickly, improving both the real-time loading rate of the visualization and the loading speed of historical data. Finally, the pre-inspected and pre-repaired data stream is displayed as a dynamic GIS map, which makes it easier and more direct for managers to perform risk-control scheduling and system optimization.

Description

A big-data-based pre-inspection and pre-repair visualization system
Technical field
The present invention relates to the fields of data processing and data storage, and in particular to a big-data-based pre-inspection and pre-repair visualization system.
Background technique
With the development of high and new technology, big data has become an important tool of national development. Promoting the development and application of big data helps to build a new governance model of precise administration and multi-party collaboration, a new economic operating mechanism that runs smoothly, safely and efficiently, a new people-oriented livelihood service system that benefits everyone, a new framework that opens up mass entrepreneurship and innovation, and a high-end, intelligent and flourishing new industrial ecology.
With the arrival of the DT era, people can collect far richer data than ever before. An IDC report shows that by 2020 the global data volume is expected to exceed 40 ZB (equivalent to 40 trillion GB), 22 times the volume of 2011. How to turn this explosively growing data into high-value information for decision-making, prediction and strategic development has become a new research hotspot.
After an acquisition system has collected a large amount of raw data, only by integrating and computing the data can it be used to examine business rules and mine latent information, thereby realizing the value of big data and achieving the goal of empowering business and creating value. Facing massive data and complex computation, the data computation layer consists of two major systems: data storage and the computing platform. The development of data mining technology and of data warehousing and computing technology are complementary: without the development of data infrastructure and distributed parallel computing there would be no deep learning, let alone the marvel of AlphaGo. The development of cloud computing platforms allows massive, high-speed, varied, multi-terminal structured and unstructured data to be stored and computed efficiently. In e-commerce, for example, global profiles of massive numbers of members and commodities, universal ID-mapping, precise advertising platforms, personalized search and recommendation that present a different view to every user, identification of non-human traffic and rogue devices, and automatic mining of competitive business intelligence have reached every link of enterprise development. "No data, no intelligence; no intelligence, no commerce": the new commercial revolution brought by the fusion of big data and machine learning has already arrived.
Data quality is the basis and premise of the validity and accuracy of data-analysis conclusions. Guaranteeing data quality and ensuring data availability is a link that data-warehouse construction cannot neglect. Data has become an important factor of production, and the value of data applications such as search, recommendation, advertising, finance, credit, insurance, entertainment and logistics should be maximized. Served to merchants, data can guide their digital operation and provide diversified, inclusive data empowerment; it can also deliver a better search experience, more accurate personalized recommendation, an optimized shopping experience, more precise advertising placement and more inclusive financial services. Served to employees, data can support digital operation and decision-making.
The big data processing platforms in general use today lack a pre-cleansing strategy at the point where data sources are accessed, so large quantities of missing, invalid, high-risk and duplicate data enter data analysis, seriously affecting the analysis results and the accuracy of prediction and regression models.
Distributed file systems rely on their high fault tolerance, scalability and inexpensive storage to support large-scale data sets, but they are not efficient at receiving and storing massive, highly concurrent, continuous, high-speed small data files: every insert, lookup, delete or update operation causes a large amount of IO exchange with the distributed file system, which greatly degrades its performance.
Current data-visualization solutions mainly rely on commercial products. While these meet many customer needs, their fixed design and highly modelled data mean they cannot offer more personalized services, their cost is excessive, and they cannot be matched effectively to each specific business. Developing a solution in-house according to one's own business characteristics, on the other hand, is out of reach because the development cycle is too long and the cost too high.
Current big-data visualization applications usually only display results after processing and analysis; they lack the necessary early-warning and alarm indications, so decision makers generally have to optimize by industry experience. Decisions that depend on personnel suffer from staff turnover and cannot be sustained, a problem that urgently needs to be solved.
Summary of the invention
In order to overcome the facts that big data acquisition filters out most of the potentially valuable data, that frequent IO interaction with the distributed file system creates a performance bottleneck, and that the integrated information on visualization interfaces cannot be customized, the present invention provides a big-data-based pre-inspection and pre-repair dynamic visualization system. The pre-inspection and pre-repair strategy cleanses and corrects the data; after structural transformation the data possess structured or semi-structured characteristics and can conveniently be loaded and used by a relational database, which raises the utilization rate of the source data and improves the safety and stability of the system. A PLRU algorithm rapidly and iteratively updates the blacklist database, which raises storage efficiency and data-filtering efficiency, improves system safety and stability, and enhances the robustness of the whole system. A consistent hashing algorithm spreads the data-file queues evenly over the servers of the cluster, solving data skew and achieving a good cluster load balance. The meta-information produced by the data-file storage policy is cached in a relational database, which relieves the metadata-storage pressure of the distributed file system (for example the NameNode in HDFS). High-frequency data files are cached on a hot hard-disk block, and a field added to each data file is used to judge whether the file resides on the hard disk of the file cache server, in which case the data are read directly from the data-file cache server. This reduces query and update interaction with the distributed file system, improves access speed, and greatly improves the overall read-write performance of the system. The method is especially fast in the storage of, and query response to, massive small data files.
To achieve the above goals, the technical solution adopted by the present invention is as follows:
A big-data-based pre-inspection and pre-repair dynamic visualization system includes an intelligent data acquisition module, a data cleansing early-warning module, a data cleansing repair module, a high-risk data alarm module, a data fast-storage module and a GIS data dynamic loading module;
The intelligent data acquisition module classifies and labels different data sources by adding data buffer queues on the data cache server, and stores and manages the meta-information of the data. Collected messages are sent to the data cache server; taking into account the data characteristics of their own fields and the diversity of data-file sizes, a critical data-file size T is set according to the BLOCK size of the distributed file system. The cache server judges the size of each file: a data file smaller than T is given a data identifier, i.e. a KEY, while a data file larger than the given T is sent directly to the distributed file system once data processing is complete. Labelled files are stored in the corresponding data queues until the merge threshold TH2 is triggered;
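The size-threshold routing just described can be pictured with a short sketch. This is only an illustration under assumed names and values: T is taken equal to a typical DFS BLOCK size, and TH2, send_to_dfs and merge_and_upload are hypothetical placeholders, not identifiers defined by the patent.

```python
# Illustrative sketch of the acquisition module's size-threshold routing.
# T is assumed to equal the distributed file system's BLOCK size; all names
# and constants here are hypothetical, not taken from the patent text.
import uuid
from collections import defaultdict

BLOCK_SIZE = 128 * 1024 * 1024      # e.g. a common HDFS block size
T = BLOCK_SIZE                      # critical data-file size threshold
TH2 = 64                            # merge-trigger threshold per queue

queues = defaultdict(list)          # per-source data queues

def send_to_dfs(payload: bytes) -> None:
    """Placeholder for a direct write to the distributed file system."""

def merge_and_upload(batch) -> None:
    """Placeholder: hand the queued batch to the fast-storage module."""

def ingest(source: str, payload: bytes) -> None:
    if len(payload) >= T:
        send_to_dfs(payload)            # files at or above T bypass the queues
        return
    key = uuid.uuid4().hex              # the KEY identifier for a small file
    queues[source].append((key, payload))
    if len(queues[source]) >= TH2:      # merge threshold TH2 triggered
        merge_and_upload(queues.pop(source))
```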
The data cleansing early-warning module parses the data sources, relies on algorithms to identify abnormal traffic and data, and summarizes the corresponding filtering rules so that such data are filtered out and the remaining data are used downstream.
The data cleansing repair module corrects data gaps with a data dictionary and rejects invalid data;
The high-risk data alarm module establishes a blacklist and uses the PLRU algorithm to dynamically load and update the blacklist data, and establishes a whitelist to reduce the false-alarm rate of the PLRU algorithm.
In the high-risk data alarm module, the PLRU algorithm is used;
The data fast-storage module stores the labelled data output by the data cleansing module, which substantially alleviates the system bottleneck that small files cause through frequent IO operations on the distributed file system; a consistent hashing algorithm achieves a good cluster load balance;
The GIS data visualization module dynamically displays the legitimate and safe data after cleansing. The module encapsulates the open-source library ECharts, so a component suited to the business at hand can be chosen according to the data type. It provides more precise spatial geographic information, is intuitive and rich in interaction, supports highly personalized customization, and allows a personalized front-end UI theme to be developed. High-risk data information and repair data information are presented on the front-end page, where more integrated information can be analysed.
Further, the intelligent data acquisition module includes the following steps:
1.1.1 Data are hashed and stored using the consistent hashing algorithm of the data fast-storage module;
1.1.2 Meta-information management: the pre-cleansing early-warning module identifies traffic attacks, web crawlers and traffic fraud; data marked as missing are sent to the data cleansing repair module, and high-risk data that have been marked are sent to the malicious-data alarm module;
1.1.3 A black/white list database is built with a relational database, and the meta-information labelled in 1.1.2 is written into the relational database.
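A minimal sketch of how steps 1.1.2-1.1.3 might store labels and black/white lists in a relational database. SQLite and all table and column names here are assumptions chosen only for illustration.

```python
# Hypothetical sketch of steps 1.1.2-1.1.3: keep black/white lists and file
# meta-information in a relational database (SQLite is used only for brevity).
import sqlite3

conn = sqlite3.connect("meta.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS blacklist (source TEXT PRIMARY KEY, hits INTEGER);
CREATE TABLE IF NOT EXISTS whitelist (source TEXT PRIMARY KEY);
CREATE TABLE IF NOT EXISTS file_meta (
    key TEXT PRIMARY KEY, source TEXT, size INTEGER, label TEXT
);
""")

def record_meta(key: str, source: str, size: int, label: str) -> None:
    # `label` distinguishes 'missing' (routed to the repair module) from
    # 'high_risk' (routed to the alarm module), mirroring step 1.1.2.
    conn.execute(
        "INSERT OR REPLACE INTO file_meta VALUES (?, ?, ?, ?)",
        (key, source, size, label),
    )
    conn.commit()
```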
In the data cleansing early-warning module, the black/white list database of step 1.1.3 is used to decide where the data flow, and the metadata of step 1.1.2 are merged.
The data cleansing repair module includes the following steps:
1.3.1 In the cleansing early-warning module, missing data show up as empty cells or as NAN (not a number), N/A or None. For a categorical column that may contain meaningful missing data, a new category called Missing can be created, and the column is then handled like an ordinary column;
1.3.2 Following step 1.3.1, if a representative value is needed, the pre-repaired data are converted into a meaningful numerical value, for example the median of the business data.
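A small sketch of the repair rules 1.3.1-1.3.2 using pandas; the column names and sample values are illustrative assumptions, not fields defined by the patent.

```python
import numpy as np
import pandas as pd

# Illustrative frame: "category" and "amount" are hypothetical business columns.
df = pd.DataFrame({
    "category": ["A", None, "B", "N/A"],
    "amount":   [10.0, np.nan, 7.5, None],
})

# 1.3.1: treat empty cells / NAN / N/A / None in a categorical column as an
# explicit "Missing" class, then handle the column like any ordinary column.
df["category"] = df["category"].replace("N/A", pd.NA).fillna("Missing")

# 1.3.2: where a representative value is needed, convert the pre-repaired data
# to a meaningful number, e.g. the median of the business data.
df["amount"] = df["amount"].fillna(df["amount"].median())

print(df)
```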
In the high-risk data alarm module, the PLRU algorithm proceeds as follows:
1.4.1 A group of hash functions W = {W1, W2, ..., Wn} is formed whose output domain is X. Each qi in the data-source set Q = {q1, q2, ..., qn} is mapped by the n independent hash functions of W to n numbers in the interval [1, M];
1.4.2 If a is an input object that has been seen, the PLRU algorithm can map it to those n numbers; otherwise a is judged to be a new object. Within one detection period, the data-stream sizes obey a Pareto distribution with parameter 1 and skew parameter α;
1.4.3 Assume the remote server cluster receives K data packets within the monitoring period; on average PLRU then establishes one new data flag every J data packets and evicts some data from the bottom of the blacklist;
1.4.4 Assume a large stream E whose size is exactly equal to the threshold TH; the probability that the large data file E does not appear in J consecutive data files obeys a hypergeometric distribution, and when K >> J the probability that E is evicted can be derived from that distribution;
1.4.5 The blacklist database is updated according to steps 1.4.3 and 1.4.4;
1.4.6 Since the PLRU algorithm produces false alarms, a whitelist of the false-alarm samples already found is established to prevent repeated false alarms;
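A rough sketch of a PLRU-style blacklist update in the spirit of steps 1.4.1-1.4.6. The Bloom-style membership test, the eviction rule and all constants are simplifying assumptions; the probability analysis of step 1.4.4 is not reproduced here.

```python
import hashlib
from collections import OrderedDict

N_HASH, M = 4, 1 << 20          # n hash functions mapping into [1, M]
J, BLACKLIST_CAP = 1000, 256    # flag interval and assumed blacklist capacity

def hash_family(item: str) -> list[int]:
    """n independent hash values of `item`, each in [1, M] (step 1.4.1)."""
    return [
        int(hashlib.sha256(f"{i}:{item}".encode()).hexdigest(), 16) % M + 1
        for i in range(N_HASH)
    ]

seen_bits = bytearray(M + 1)                        # membership sketch (1.4.2)
blacklist: "OrderedDict[str, int]" = OrderedDict()  # source -> hit count
whitelist: set[str] = set()                         # confirmed false alarms (1.4.6)
packets = 0

def observe(source: str, flagged_high_risk: bool) -> None:
    """Feed one data packet's source through the PLRU-style blacklist update."""
    global packets
    packets += 1
    positions = hash_family(source)
    is_new = not all(seen_bits[p] for p in positions)   # new vs. seen object
    for p in positions:
        seen_bits[p] = 1
    if source in whitelist:                             # suppress known false alarms
        return
    if flagged_high_risk and (is_new or packets % J == 0):
        blacklist[source] = blacklist.get(source, 0) + 1
        blacklist.move_to_end(source)                   # most recently flagged
        if len(blacklist) > BLACKLIST_CAP:
            blacklist.popitem(last=False)               # evict the bottom entry (1.4.3)
```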
The data fast-storage module includes the following steps:
1.5.1 A relational database is introduced to store the metadata generated by the small-data-file merging process;
1.5.2 The hash values HS = {hs1, hs2, ..., hsn} of the processing servers are obtained by appending a number or port number to each machine IP or host name, and the HS set is mapped onto a closed ring in space;
1.5.3 The window data of the message-queue cache server are taken out and placed in the to-be-merged set G = {g1, g2, ..., gn}, where n is the number of files to be merged and gi is the i-th data file to be merged; data files that meet the trigger condition of the intelligent data acquisition module go through operation 1.5.4;
1.5.4 The data files that triggered TH2 are taken out of the sliding window Wn; Wn is merged using multiple threads, the merged data are uploaded to the distributed storage system, and the meta-information generated by the merge operation is stored in the relational database;
1.5.5 The meta-information Di of the i-th data file generated during merging is written into the relational database, where Di = {f1, f2, ..., fn} and fi is a data characteristic of the meta-information set;
1.5.6 When a client sends a request to read the small-data-file message queue, the relational database is accessed to obtain the meta-information Di of the data file;
1.5.7 The large data file in the distributed file system that holds the small-file data is accessed according to the characteristic fields in Di;
1.5.8 The corresponding small data file is parsed out of the large data file according to its characteristic fields;
1.5.9 A field identifier F is added to each data file to record its access frequency;
1.5.10 High-frequency data files are cached on a hot hard-disk block; the field added to each data file is used to judge whether the file resides on the hard disk of the file cache server, in which case the data are read directly from the data-file cache server.
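Steps 1.5.2-1.5.5 can be sketched as follows. The ring construction, the number of virtual nodes, the server addresses and the shape of the meta-information record Di are illustrative assumptions rather than the patent's prescription.

```python
# Illustrative sketch: place merge servers on a consistent-hash ring (1.5.2),
# pick the server for a batch of small files (1.5.3), merge them (1.5.4), and
# build the per-file meta-information for a relational database (1.5.5).
import bisect
import hashlib

def h(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes, replicas: int = 50):
        self._ring = sorted(
            (h(f"{node}#{i}"), node) for node in nodes for i in range(replicas)
        )
        self._keys = [k for k, _ in self._ring]

    def node_for(self, key: str) -> str:
        idx = bisect.bisect(self._keys, h(key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["10.0.0.1:9000", "10.0.0.2:9000", "10.0.0.3:9000"])

def merge_batch(batch_id: str, small_files: list[bytes]) -> dict:
    server = ring.node_for(batch_id)    # server chosen on the closed ring
    merged = b"".join(small_files)      # merged payload, then uploaded to the DFS
    offsets, pos = [], 0
    for f in small_files:               # one meta-information record Di per file
        offsets.append({"offset": pos, "length": len(f), "server": server})
        pos += len(f)
    return {"batch": batch_id, "size": len(merged), "files": offsets}
```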
The beneficial effects of the present invention are mainly: the utilization rate of the source data is raised and the safety and stability of the system are improved; the PLRU algorithm rapidly and iteratively updates the blacklist database, which raises storage efficiency and data-filtering efficiency, improves system safety and stability, and enhances the robustness of the whole system.
Detailed description of the invention
Fig. 1 shows the model diagram of the big-data-based pre-inspection and pre-repair visualization system.
Fig. 2 shows the flow chart of the big-data-based pre-inspection and pre-repair visualization system.
Fig. 3 shows the data preview and pre-repair process diagram of the big-data-based pre-inspection and pre-repair visualization system.
Fig. 4 shows the model diagram of the data fast-storage module of the big-data-based pre-inspection and pre-repair visualization system.
Specific embodiment
The invention will be further described below in conjunction with the accompanying drawings.
The system operation and method of the invention are described in detail below. Clearly, the described implementation cases are only some of the cases of the present invention rather than all of them; all other examples obtained by persons of ordinary skill in the art without creative change or substantive optimization shall fall within the protection scope of the present invention.
Referring to Figs. 1 to 4, a big-data-based pre-inspection and pre-repair dynamic visualization system includes an intelligent data acquisition module 1, a data cleansing early-warning module 2, a data cleansing repair module 3, a high-risk data alarm module 4, a data fast-storage module 5 and a GIS data visualization module 6. The intelligent data acquisition module is connected with the data cleansing early-warning module; the data cleansing early-warning module is connected with the data cleansing repair module and with the high-risk data alarm module; the data cleansing early-warning module and the high-risk data alarm module are connected with the GIS data visualization module, as shown in Fig. 1.
The intelligent data acquisition module classifies and labels different data sources by adding data buffer queues on the data cache server, and stores and manages the meta-information of the data. Collected messages are sent to the data cache server; taking into account the data characteristics of their own fields and the diversity of data-file sizes, a critical data-file size T is set according to the BLOCK size of the distributed file system. The cache server judges the size of each file: a data file smaller than T is given a data identifier, i.e. a KEY, while a data file larger than the given T is sent directly to the distributed file system once data processing is complete. Labelled files are stored in the corresponding data queues until the merge threshold TH2 is triggered. The module includes the following steps:
1.1.1 Data are hashed and stored using the consistent hashing algorithm of the data fast-storage module.
1.1.2 Meta-information management: the pre-cleansing early-warning module identifies traffic attacks, web crawlers and traffic fraud (fake traffic); data marked as missing are sent to the data cleansing repair module, and high-risk data that have been marked are sent to the malicious-data alarm module;
1.1.3 A black/white list database is built with a relational database, and the meta-information labelled in 1.1.2 is written into the relational database;
The data cleansing early-warning module parses the data sources, uses the black/white list database of step 1.1.3 to decide where the data flow, and merges the metadata of step 1.1.2; it relies on algorithms to identify abnormal traffic and data and summarizes the corresponding filtering rules so that such data are filtered out and the remaining data are used downstream.
The data cleansing repair module corrects data gaps with a data dictionary and rejects invalid data. It includes the following steps:
1.3.1 In the cleansing early-warning module, missing data show up as empty cells or as NAN (not a number), N/A or None. For a categorical column that may contain meaningful missing data, a new category called Missing can be created, and the column is then handled like an ordinary column;
1.3.2 Following step 1.3.1, if a representative value is needed, the pre-repaired data are converted into a meaningful numerical value, for example the median of the business data.
The high-risk data alarm module establishes a blacklist and uses the PLRU algorithm to dynamically load and update the blacklist data, and establishes a whitelist to reduce the false-alarm rate of the PLRU algorithm.
In the high-risk data alarm module, the PLRU algorithm proceeds as follows:
1.4.1 A group of hash functions W = {W1, W2, ..., Wn} is formed whose output domain is X. Each qi in the data-source set Q = {q1, q2, ..., qn} is mapped by the n independent hash functions of W to n numbers in the interval [1, M];
1.4.2 If a is an input object that has been seen, the PLRU algorithm can map it to those n numbers; otherwise a is judged to be a new object. Within one detection period, the data-stream sizes obey a Pareto distribution with parameter 1 and skew parameter α;
1.4.3 Assume the remote server cluster receives K data packets within the monitoring period; on average PLRU then establishes one new data flag every J data packets and evicts some data from the bottom of the blacklist;
1.4.4 Assume a large stream E whose size is exactly equal to the threshold TH; the probability that the large data file E does not appear in J consecutive data files obeys a hypergeometric distribution, and when K >> J the probability that E is evicted can be derived from that distribution;
1.4.5 The blacklist database is updated according to steps 1.4.3 and 1.4.4;
1.4.6 Since the PLRU algorithm produces false alarms, a whitelist of the false-alarm samples already found is established to prevent repeated false alarms;
The data fast-storage module stores the labelled data output by the data cleansing module, which substantially alleviates the system bottleneck that small files cause through frequent IO operations on the distributed file system; a consistent hashing algorithm achieves a good cluster load balance. The module includes the following steps:
1.5.1 A relational database is introduced to store the metadata generated by the small-data-file merging process;
1.5.2 The hash values HS = {hs1, hs2, ..., hsn} of the processing servers are obtained by appending a number or port number to each machine IP or host name, and the HS set is mapped onto a closed ring in space;
1.5.3 The window data of the message-queue cache server are taken out and placed in the to-be-merged set G = {g1, g2, ..., gn}, where n is the number of files to be merged and gi is the i-th data file to be merged; data files that meet the trigger condition of the intelligent data acquisition module go through operation 1.5.4;
1.5.4 The data files that triggered TH2 are taken out of the sliding window Wn; Wn is merged using multiple threads, the merged data are uploaded to the distributed storage system, and the meta-information generated by the merge operation is stored in the relational database;
1.5.5 The meta-information Di of the i-th data file generated during merging is written into the relational database, where Di = {f1, f2, ..., fn} and fi is a data characteristic of the meta-information set;
1.5.6 When a client sends a request to read the small-data-file message queue, the relational database is accessed to obtain the meta-information Di of the data file;
1.5.7 The large data file in the distributed file system that holds the small-file data is accessed according to the characteristic fields in Di;
1.5.8 The corresponding small data file is parsed out of the large data file according to its characteristic fields;
1.5.9 A field identifier F is added to each data file to record its access frequency;
1.5.10 High-frequency data files are cached on a hot hard-disk block; the field added to each data file is used to judge whether the file resides on the hard disk of the file cache server, in which case the data are read directly from the data-file cache server;
The GIS data visualization module dynamically displays the legitimate and safe data after cleansing. The module encapsulates the open-source library ECharts, so a component suited to the business at hand can be chosen according to the data type. It provides more precise spatial geographic information, is intuitive and rich in interaction, supports highly personalized customization, and allows a personalized front-end UI theme to be developed. High-risk data information and repair data information are presented on the front-end page, where more integrated information can be analysed.
Fig. 2 is now explained. When data are acquired into the system they enter the message cache 11. The message cache 11 stores the data by classification 14 according to the classification requirements and then sends them to the pre-cleansing module 12, which regularizes missing data and high-risk data and writes the processing meta-information into the relational database 17. After pre-cleansing, the data are distributed according to actual need to the real-time computing engine and the offline computing engine 13; a synchronization program synchronizes the computed results with the queue server 14 and the relational database 17, and finally the data results are saved in the distributed file system 15. On the management platform 19 both real-time data and offline data can be operated on and shown in the visualization 20.
The workflow of the big-data-based pre-inspection and pre-repair visualization system of this embodiment includes the following steps:
Step S000: in module 1, the threshold TH1 of the data-file size is preset according to the storage capacity and computing power of the cluster itself.
Step S001: identification information is added to the files acquired from the different data sources, as in operating mode 101 of Fig. 3. The message cache server 11 judges the size of the received file; if it is smaller than the threshold TH1 of S000, the field identifier KEY is added, as shown at 201 in Fig. 4.
Step S002: the data of S001 are first filtered against the blacklist 104.
Step S003: according to the data sources Q = {q1, q2, ..., qn} of S001, the composition of the hash functions W = {W1, W2, ..., Wn} is determined; the output domain of the hash functions is X.
Step S004: if a data source conforms to the composite customized data rules, each qi of the data-source set Q = {q1, q2, ..., qn} is mapped by the n independent hash functions of W to n numbers in the interval [1, M].
Step S005: assume the remote server cluster receives K data packets within the monitoring period; on average PLRU then establishes one new data flag every J data packets and evicts some data from the bottom of the blacklist.
Step S006: assume a data file E whose size is exactly equal to the threshold TH; the probability that the large data file E does not appear in J consecutive data files obeys a hypergeometric distribution.
Step S007: when K >> J, the probability that E is evicted can be derived from that distribution.
Step S009: the blacklist database 104 is updated iteratively according to S007.
Step S010: the data acquired in S001 are loaded into the corresponding data queues.
Step S011: in S010, a data queue is allocated on request only when a data file sends a request; if that data queue is empty while the data cache server is not empty, a FIFO operation is performed; otherwise the data queue releases its memory.
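A small sketch of the on-demand queue behaviour of step S011; the queue keys and the FIFO hand-off shown here are assumptions, since the step only outlines the policy.

```python
# Rough sketch of step S011: a data queue is only allocated when a file-send
# request arrives; if that queue is empty while the cache server still holds
# data, data are handed over FIFO, otherwise the empty queue is released.
from collections import deque

cache_server: deque = deque()            # data held by the message cache server
data_queues: dict[str, deque] = {}       # queues allocated per request

def on_send_request(request_id: str) -> deque:
    q = data_queues.setdefault(request_id, deque())   # allocate on demand
    if not q:
        if cache_server:
            q.append(cache_server.popleft())          # FIFO hand-off
        else:
            data_queues.pop(request_id)               # free the empty queue
    return q
```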
Step S012: the data in the data queues of S010 are processed by the pre-repair strategy. Certain missing data usually show up as empty cells or as NAN (not a number), N/A or None; for a categorical column that may contain meaningful missing data a new category called Missing can be created and the column is then handled like an ordinary column, and if a representative value is needed the pre-repaired data are converted into a meaningful numerical value. See module 3 in Fig. 1 and 12 in Fig. 2.
Step S013: traffic attacks, web crawlers and traffic fraud (fake traffic) are identified, and the data updated iteratively in S009 are synchronized to the high-risk data alarm module. See module 4 in Fig. 1.
Step S104: the labelled data are stored in the corresponding message-queue cache server until the merge threshold TH2 is triggered.
Step S105: the data files that triggered TH2 are taken out of the sliding window Wn, and Wn is merged using multiple threads.
Step S106: the merged data are uploaded to the distributed storage system 15, and the meta-information generated by the merge operation is stored in the relational database 17 according to the configuration rules 16.
Step S107: the hash values HS = {hs1, hs2, ..., hsn} of the processing servers are obtained by appending a number or port number to each machine IP or host name, and the HS set is mapped onto a closed ring in space, as shown at 202 in Fig. 4.
Step S108: the window data Wn of the data queue are taken out and placed in the to-be-merged set G = {g1, g2, ..., gn}, where n is the number of files to be merged and gi is the i-th data file to be merged.
Step S109: the meta-information Di of the i-th data file generated during merging is written into the relational database, where Di = {f1, f2, ..., fn} and fi is a data characteristic of the meta-information set.
Step S110: when a client sends a request to read the message queue, the relational database is accessed to obtain the meta-information Di of the data file.
Step S111: the large data file in the distributed file system that holds the small-file data is accessed according to the characteristic fields in Di.
Step S112: the corresponding small data file is parsed out of the large data file according to its characteristic fields.
Step S113: a field identifier F is added to each data file to record its access frequency.
Step S114: high-frequency data files are cached on a hot hard-disk block; the field added to each data file is used to judge whether the file resides on the hard disk of the file cache server, in which case the data are read directly from the data-file cache server.
Step S115: the processed data are accumulated and displayed according to their heat, showing the geographic information of the merged data and the real-time transactions of the data.
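The hot-read path of steps S113-S114 could look roughly like this; HOT_THRESHOLD and read_from_dfs are hypothetical placeholders rather than anything specified by the patent.

```python
# Sketch of the hot-read path: every file carries a field F recording its
# access frequency; frequently read files are cached on the hot disk of the
# file cache server and served from there instead of the distributed FS.
HOT_THRESHOLD = 100
access_freq: dict[str, int] = {}      # field F per data-file key
hot_cache: dict[str, bytes] = {}      # files cached on the hot disk

def read_from_dfs(key: str) -> bytes:
    """Placeholder for a read via the distributed file system."""
    return b""

def read_file(key: str) -> bytes:
    access_freq[key] = access_freq.get(key, 0) + 1
    if key in hot_cache:
        return hot_cache[key]                      # direct read from the cache server
    data = read_from_dfs(key)
    if access_freq[key] >= HOT_THRESHOLD:
        hot_cache[key] = data                      # promote a high-frequency file
    return data
```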

Claims (6)

1. A big-data-based pre-inspection and pre-repair visualization system, characterized in that the system comprises an intelligent data acquisition module, a data cleansing early-warning module, a data cleansing repair module, a high-risk data alarm module, a data fast-storage module and a GIS data dynamic loading module;
the intelligent data acquisition module classifies and labels different data sources by adding data buffer queues on the data cache server, and stores and manages the meta-information of the data; collected messages are sent to the data cache server; taking into account the data characteristics of their own fields and the diversity of data-file sizes, a critical data-file size T is set according to the BLOCK size of the distributed file system; the cache server judges the size of each file: a data file smaller than T is given a data identifier, i.e. a KEY, while a data file larger than the given T is sent directly to the distributed file system once data processing is complete; labelled files are stored in the corresponding data queues until the merge threshold TH2 is triggered;
the data cleansing early-warning module parses the data sources, relies on algorithms to identify abnormal traffic and data, and summarizes the corresponding filtering rules so that such data are filtered out and the remaining data are used downstream;
the data cleansing repair module corrects data gaps with a data dictionary and rejects invalid data;
the high-risk data alarm module establishes a blacklist and uses the PLRU algorithm to dynamically load and update the blacklist data, and establishes a whitelist to reduce the false-alarm rate of the PLRU algorithm;
in the high-risk data alarm module, the PLRU algorithm is used;
the data fast-storage module stores the labelled data output by the data cleansing module, which substantially alleviates the system bottleneck that small files cause through frequent IO operations on the distributed file system; a consistent hashing algorithm achieves a good cluster load balance;
the GIS data visualization module dynamically displays the legitimate and safe data after cleansing; the module encapsulates the open-source library ECharts, so a component suited to the business at hand can be chosen according to the data type; it provides more precise spatial geographic information, is intuitive and rich in interaction, supports highly personalized customization, and allows a personalized front-end UI theme to be developed; high-risk data information and repair data information are presented on the front-end page, where more integrated information can be analysed.
2. The big-data-based pre-inspection and pre-repair dynamic visualization system according to claim 1, characterized in that the intelligent data acquisition module includes the following steps:
1.1.1 data are hashed and stored using the consistent hashing algorithm of the data fast-storage module;
1.1.2 meta-information management: the pre-cleansing early-warning module identifies traffic attacks, web crawlers and traffic fraud; data marked as missing are sent to the data cleansing repair module, and high-risk data that have been marked are sent to the malicious-data alarm module;
1.1.3 a black/white list database is built with a relational database, and the meta-information labelled in 1.1.2 is written into the relational database.
3. The big-data-based pre-inspection and pre-repair dynamic visualization system according to claim 2, characterized in that in the data cleansing early-warning module, the black/white list database of step 1.1.3 is used to decide where the data flow, and the metadata of step 1.1.2 are merged.
4. The big-data-based pre-inspection and pre-repair dynamic visualization system according to any one of claims 1 to 3, characterized in that the data cleansing repair module includes the following steps:
1.3.1 in the cleansing early-warning module, missing data show up as empty cells or as NAN (not a number), N/A or None; for a categorical column that may contain meaningful missing data, a new category called Missing can be created, and the column is then handled like an ordinary column;
1.3.2 following step 1.3.1, if a representative value is needed, the pre-repaired data are converted into a meaningful numerical value, for example the median of the business data.
5. The big-data-based pre-inspection and pre-repair dynamic visualization system according to any one of claims 1 to 3, characterized in that in the high-risk data alarm module the PLRU algorithm proceeds as follows:
1.4.1 a group of hash functions W = {W1, W2, ..., Wn} is formed whose output domain is X; each qi in the data-source set Q = {q1, q2, ..., qn} is mapped by the n independent hash functions of W to n numbers in the interval [1, M];
1.4.2 if a is an input object that has been seen, the PLRU algorithm can map it to those n numbers; otherwise a is judged to be a new object; within one detection period, the data-stream sizes obey a Pareto distribution with parameter 1 and skew parameter α;
1.4.3 assume the remote server cluster receives K data packets within the monitoring period; on average PLRU then establishes one new data flag every J data packets and evicts some data from the bottom of the blacklist;
1.4.4 assume a large stream E whose size is exactly equal to the threshold TH; the probability that the large data file E does not appear in J consecutive data files obeys a hypergeometric distribution, and when K >> J the probability that E is evicted can be derived from that distribution;
1.4.5 the blacklist database is updated according to steps 1.4.3 and 1.4.4;
1.4.6 since the PLRU algorithm produces false alarms, a whitelist of the false-alarm samples already found is established to prevent repeated false alarms.
6. The big-data-based pre-inspection and pre-repair dynamic visualization system according to any one of claims 1 to 3, characterized in that the data fast-storage module includes the following steps:
1.5.1 a relational database is introduced to store the metadata generated by the small-data-file merging process;
1.5.2 the hash values HS = {hs1, hs2, ..., hsn} of the processing servers are obtained by appending a number or port number to each machine IP or host name, and the HS set is mapped onto a closed ring in space;
1.5.3 the window data of the message-queue cache server are taken out and placed in the to-be-merged set G = {g1, g2, ..., gn}, where n is the number of files to be merged and gi is the i-th data file to be merged; data files that meet the trigger condition of the intelligent data acquisition module go through operation 1.5.4;
1.5.4 the data files that triggered TH2 are taken out of the sliding window Wn; Wn is merged using multiple threads, the merged data are uploaded to the distributed storage system, and the meta-information generated by the merge operation is stored in the relational database;
1.5.5 the meta-information Di of the i-th data file generated during merging is written into the relational database, where Di = {f1, f2, ..., fn} and fi is a data characteristic of the meta-information set;
1.5.6 when a client sends a request to read the small-data-file message queue, the relational database is accessed to obtain the meta-information Di of the data file;
1.5.7 the large data file in the distributed file system that holds the small-file data is accessed according to the characteristic fields in Di;
1.5.8 the corresponding small data file is parsed out of the large data file according to its characteristic fields;
1.5.9 a field identifier F is added to each data file to record its access frequency;
1.5.10 high-frequency data files are cached on a hot hard-disk block; the field added to each data file is used to judge whether the file resides on the hard disk of the file cache server, in which case the data are read directly from the data-file cache server.
CN201811322934.1A 2018-11-08 2018-11-08 Big data-based visual system for pre-inspection and pre-repair Active CN109460393B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811322934.1A CN109460393B (en) 2018-11-08 2018-11-08 Big data-based visual system for pre-inspection and pre-repair

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811322934.1A CN109460393B (en) 2018-11-08 2018-11-08 Big data-based visual system for pre-inspection and pre-repair

Publications (2)

Publication Number Publication Date
CN109460393A true CN109460393A (en) 2019-03-12
CN109460393B CN109460393B (en) 2022-04-08

Family

ID=65609667

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811322934.1A Active CN109460393B (en) 2018-11-08 2018-11-08 Big data-based visual system for pre-inspection and pre-repair

Country Status (1)

Country Link
CN (1) CN109460393B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110502571A (en) * 2019-08-29 2019-11-26 智洋创新科技股份有限公司 A kind of recognition methods of the high-incidence line segment of electric transmission line channel visual alerts
CN111090646A (en) * 2019-10-21 2020-05-01 中国科学院信息工程研究所 Electromagnetic data processing platform
CN113163353A (en) * 2020-04-15 2021-07-23 贵州电网有限责任公司 Intelligent health service system of power supply vehicle and data transmission method thereof
CN113448946A (en) * 2021-07-05 2021-09-28 星辰天合(北京)数据科技有限公司 Data migration method and device and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160180088A1 (en) * 2014-12-23 2016-06-23 Mcafee, Inc. Discovery of malicious strings
CN106484855A (en) * 2016-09-30 2017-03-08 广州特道信息科技有限公司 A kind of big data concerning taxes intelligence analysis system
CN107273409A (en) * 2017-05-03 2017-10-20 广州赫炎大数据科技有限公司 A kind of network data acquisition, storage and processing method and system
CN108228830A (en) * 2018-01-03 2018-06-29 广东工业大学 A kind of data processing system
US10353870B2 (en) * 2016-02-17 2019-07-16 Netapp Inc. Tracking structure for data replication synchronization

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160180088A1 (en) * 2014-12-23 2016-06-23 Mcafee, Inc. Discovery of malicious strings
US10353870B2 (en) * 2016-02-17 2019-07-16 Netapp Inc. Tracking structure for data replication synchronization
CN106484855A (en) * 2016-09-30 2017-03-08 广州特道信息科技有限公司 A kind of big data concerning taxes intelligence analysis system
CN107273409A (en) * 2017-05-03 2017-10-20 广州赫炎大数据科技有限公司 A kind of network data acquisition, storage and processing method and system
CN108228830A (en) * 2018-01-03 2018-06-29 广东工业大学 A kind of data processing system

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110502571A (en) * 2019-08-29 2019-11-26 智洋创新科技股份有限公司 A kind of recognition methods of the high-incidence line segment of electric transmission line channel visual alerts
CN110502571B (en) * 2019-08-29 2020-05-08 智洋创新科技股份有限公司 Method for identifying visible alarm high-power-generation line segment of power transmission line channel
CN111090646A (en) * 2019-10-21 2020-05-01 中国科学院信息工程研究所 Electromagnetic data processing platform
CN111090646B (en) * 2019-10-21 2023-07-28 中国科学院信息工程研究所 Electromagnetic data processing platform
CN113163353A (en) * 2020-04-15 2021-07-23 贵州电网有限责任公司 Intelligent health service system of power supply vehicle and data transmission method thereof
CN113163353B (en) * 2020-04-15 2022-12-27 贵州电网有限责任公司 Intelligent health service system of power supply vehicle and data transmission method thereof
CN113448946A (en) * 2021-07-05 2021-09-28 星辰天合(北京)数据科技有限公司 Data migration method and device and electronic equipment
CN113448946B (en) * 2021-07-05 2024-01-12 北京星辰天合科技股份有限公司 Data migration method and device and electronic equipment

Also Published As

Publication number Publication date
CN109460393B (en) 2022-04-08

Similar Documents

Publication Publication Date Title
CN109460393A (en) Visualization system is repaired in a kind of preliminary examination based on big data in advance
CN108628929B (en) Method and apparatus for intelligent archiving and analysis
CN107256219B (en) Big data fusion analysis method applied to mass logs of automatic train control system
US10599684B2 (en) Data relationships storage platform
CN109918511B (en) BFS and LPA based knowledge graph anti-fraud feature extraction method
CN105045820B (en) Method for processing video image information of high-level data and database system
CN111885040A (en) Distributed network situation perception method, system, server and node equipment
CN102982097B (en) Domain for Knowledge based engineering data quality solution
CN109213752A (en) A kind of data cleansing conversion method based on CIM
CN104636751A (en) Crowd abnormity detection and positioning system and method based on time recurrent neural network
CN105069025A (en) Intelligent aggregation visualization and management control system for big data
CN113849483A (en) Real-time database system architecture for intelligent factory
CN106534784A (en) Acquisition analysis storage statistical system for video analysis data result set
CN109308290B (en) Efficient data cleaning and converting method based on CIM
CN112181960A (en) Intelligent operation and maintenance framework system based on AIOps
CN116049454A (en) Intelligent searching method and system based on multi-source heterogeneous data
CN110197708A (en) A kind of migration of block chain and storage method towards electron medical treatment case history
CN112883001A (en) Data processing method, device and medium based on marketing and distribution through data visualization platform
CN109542846A (en) A kind of Internet of Things vulnerability information management system based on data virtualization
CN109800133A (en) A kind of method, one-stop monitoring alarm platform and the system of unified monitoring alarm
WO2022188646A1 (en) Graph data processing method and apparatus, and device, storage medium and program product
CN111125450A (en) Management method of multilayer topology network resource object
CN110674168A (en) Cache key abnormity detection method, device, storage medium and terminal
CN113254517A (en) Service providing method based on internet big data
CN106528682A (en) Big-data text mining system of call center

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant