CN109460393A - Big-data-based pre-inspection and pre-repair visualization system - Google Patents

Big-data-based pre-inspection and pre-repair visualization system

Info

Publication number
CN109460393A
CN109460393A (application CN201811322934.1A; granted publication CN109460393B)
Authority
CN
China
Prior art keywords
data
module
file
repaired
cleansing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811322934.1A
Other languages
Chinese (zh)
Other versions
CN109460393B (en)
Inventor
郭淑琴
贾翼
任宏亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN201811322934.1A priority Critical patent/CN109460393B/en
Publication of CN109460393A publication Critical patent/CN109460393A/en
Application granted granted Critical
Publication of CN109460393B publication Critical patent/CN109460393B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A big-data-based pre-inspection and pre-repair visualization system includes an intelligent data acquisition module, a data cleansing early-warning module, a data cleansing repair module, a high-risk data alarm module, a data fast-storage module and a GIS data dynamic loading module. The intelligent data acquisition module classifies data intelligently, which improves the cleansing efficiency for data files. A prediction policy places high-risk data on a blacklist, and a PLRU algorithm updates the blacklist iteratively, which substantially reduces the system's false alarms. A pre-repair strategy repairs incomplete data, greatly raising the utilization rate of the data. The data fast-storage module stores safe data quickly, improving both the real-time loading rate of the visualization and the loading speed of historical data. Finally, the pre-inspected and pre-repaired data stream is displayed as a dynamic GIS map, which makes it easier and more direct for managers to perform risk-control scheduling and system optimization.

Description

A big-data-based pre-inspection and pre-repair visualization system
Technical field
The present invention relates to the fields of data processing and data storage, and in particular to a big-data-based pre-inspection and pre-repair visualization system.
Background technique
With the development of high and new technology, big data has become an important tool of national development. Promoting the development and application of big data helps to build a new governance model of precise administration and multi-party collaboration, a new economic operating mechanism that runs smoothly, safely and efficiently, a new people-oriented livelihood service system that benefits everyone, a new framework that opens up mass entrepreneurship and innovation, and a high-end, intelligent and flourishing new industrial ecology.
With the arrival of the DT era, people can collect far richer data than ever before. An IDC report shows that by 2020 the global data volume is expected to exceed 40 ZB (equivalent to 40 trillion GB), 22 times the volume of 2011. How to turn this explosively growing data into high-value information for decision-making, prediction and strategic development has become a new research hotspot.
After an acquisition system has collected a large amount of raw data, only by integrating and computing the data can it be used to examine business rules and mine latent information, thereby realizing the value of big data and achieving the goal of empowering business and creating value. Facing massive data and complex computation, the data computation layer consists of two major systems: data storage and the computing platform. The development of data mining technology and of data warehousing and computing technology are complementary: without the development of data infrastructure and distributed parallel computing there would be no deep learning, let alone the marvel of AlphaGo. The development of cloud computing platforms allows massive, high-speed, varied, multi-terminal structured and unstructured data to be stored and computed efficiently. In e-commerce, for example, global profiles of massive numbers of members and commodities, universal ID-mapping, precise advertising platforms, personalized search and recommendation that present a different view to every user, identification of non-human traffic and rogue devices, and automatic mining of competitive business intelligence have reached every link of enterprise development. "No data, no intelligence; no intelligence, no commerce": the new commercial revolution brought by the fusion of big data and machine learning has already arrived.
Data quality is the basis and premise of the validity and accuracy of data-analysis conclusions. Guaranteeing data quality and ensuring data availability is a link that data-warehouse construction cannot neglect. Data has become an important factor of production, and the value of data applications such as search, recommendation, advertising, finance, credit, insurance, entertainment and logistics should be maximized. Served to merchants, data can guide their digital operation and provide diversified, inclusive data empowerment; it can also deliver a better search experience, more accurate personalized recommendation, an optimized shopping experience, more precise advertising placement and more inclusive financial services. Served to employees, data can support digital operation and decision-making.
The big data processing platforms in general use today lack a pre-cleansing strategy at the point where data sources are accessed, so large quantities of missing, invalid, high-risk and duplicate data enter data analysis, seriously affecting the analysis results and the accuracy of prediction and regression models.
Distributed file systems rely on their high fault tolerance, scalability and inexpensive storage to support large-scale data sets, but they are not efficient at receiving and storing massive, highly concurrent, continuous, high-speed small data files: every insert, lookup, delete or update operation causes a large amount of IO exchange with the distributed file system, which greatly degrades its performance.
Current data-visualization solutions mainly rely on commercial products. While these meet many customer needs, their fixed design and highly modelled data mean they cannot offer more personalized services, their cost is excessive, and they cannot be matched effectively to each specific business. Developing a solution in-house according to one's own business characteristics, on the other hand, is out of reach because the development cycle is too long and the cost too high.
Current big-data visualization applications usually only display results after processing and analysis; they lack the necessary early-warning and alarm indications, so decision makers generally have to optimize by industry experience. Decisions that depend on personnel suffer from staff turnover and cannot be sustained, a problem that urgently needs to be solved.
Summary of the invention
In order to overcome the facts that big data acquisition filters out most of the potentially valuable data, that frequent IO interaction with the distributed file system creates a performance bottleneck, and that the integrated information on visualization interfaces cannot be customized, the present invention provides a big-data-based pre-inspection and pre-repair dynamic visualization system. The pre-inspection and pre-repair strategy cleanses and corrects the data; after structural transformation the data possess structured or semi-structured characteristics and can conveniently be loaded and used by a relational database, which raises the utilization rate of the source data and improves the safety and stability of the system. A PLRU algorithm rapidly and iteratively updates the blacklist database, which raises storage efficiency and data-filtering efficiency, improves system safety and stability, and enhances the robustness of the whole system. A consistent hashing algorithm spreads the data-file queues evenly over the servers of the cluster, solving data skew and achieving a good cluster load balance. The meta-information produced by the data-file storage policy is cached in a relational database, which relieves the metadata-storage pressure of the distributed file system (for example the NameNode in HDFS). High-frequency data files are cached on a hot hard-disk block, and a field added to each data file is used to judge whether the file resides on the hard disk of the file cache server, in which case the data are read directly from the data-file cache server. This reduces query and update interaction with the distributed file system, improves access speed, and greatly improves the overall read-write performance of the system. The method is especially fast in the storage of, and query response to, massive small data files.
To achieve the above goals, the technical solution adopted by the present invention is as follows:
A big-data-based pre-inspection and pre-repair dynamic visualization system includes an intelligent data acquisition module, a data cleansing early-warning module, a data cleansing repair module, a high-risk data alarm module, a data fast-storage module and a GIS data dynamic loading module;
The intelligent data acquisition module classifies and labels different data sources by adding data buffer queues on the data cache server, and stores and manages the meta-information of the data. Collected messages are sent to the data cache server; taking into account the data characteristics of their own fields and the diversity of data-file sizes, a critical data-file size T is set according to the BLOCK size of the distributed file system. The cache server judges the size of each file: a data file smaller than T is given a data identifier, i.e. a KEY, while a data file larger than the given T is sent directly to the distributed file system once data processing is complete. Labelled files are stored in the corresponding data queues until the merge threshold TH2 is triggered;
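The size-threshold routing just described can be pictured with a short sketch. This is only an illustration under assumed names and values: T is taken equal to a typical DFS BLOCK size, and TH2, send_to_dfs and merge_and_upload are hypothetical placeholders, not identifiers defined by the patent.

```python
# Illustrative sketch of the acquisition module's size-threshold routing.
# T is assumed to equal the distributed file system's BLOCK size; all names
# and constants here are hypothetical, not taken from the patent text.
import uuid
from collections import defaultdict

BLOCK_SIZE = 128 * 1024 * 1024      # e.g. a common HDFS block size
T = BLOCK_SIZE                      # critical data-file size threshold
TH2 = 64                            # merge-trigger threshold per queue

queues = defaultdict(list)          # per-source data queues

def send_to_dfs(payload: bytes) -> None:
    """Placeholder for a direct write to the distributed file system."""

def merge_and_upload(batch) -> None:
    """Placeholder: hand the queued batch to the fast-storage module."""

def ingest(source: str, payload: bytes) -> None:
    if len(payload) >= T:
        send_to_dfs(payload)            # files at or above T bypass the queues
        return
    key = uuid.uuid4().hex              # the KEY identifier for a small file
    queues[source].append((key, payload))
    if len(queues[source]) >= TH2:      # merge threshold TH2 triggered
        merge_and_upload(queues.pop(source))
```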
The data cleansing early-warning module parses the data sources, relies on algorithms to identify abnormal traffic and data, and summarizes the corresponding filtering rules so that such data are filtered out and the remaining data are used downstream.
The data cleansing repair module corrects data gaps with a data dictionary and rejects invalid data;
The high-risk data alarm module establishes a blacklist and uses the PLRU algorithm to dynamically load and update the blacklist data, and establishes a whitelist to reduce the false-alarm rate of the PLRU algorithm.
In the high-risk data alarm module, the PLRU algorithm is used;
The data fast-storage module stores the labelled data output by the data cleansing module, which substantially alleviates the system bottleneck that small files cause through frequent IO operations on the distributed file system; a consistent hashing algorithm achieves a good cluster load balance;
The GIS data visualization module dynamically displays the legitimate and safe data after cleansing. The module encapsulates the open-source library ECharts, so a component suited to the business at hand can be chosen according to the data type. It provides more precise spatial geographic information, is intuitive and rich in interaction, supports highly personalized customization, and allows a personalized front-end UI theme to be developed. High-risk data information and repair data information are presented on the front-end page, where more integrated information can be analysed.
Further, the intelligent data acquisition module includes the following steps:
1.1.1 Data are hashed and stored using the consistent hashing algorithm of the data fast-storage module;
1.1.2 Meta-information management: the pre-cleansing early-warning module identifies traffic attacks, web crawlers and traffic fraud; data marked as missing are sent to the data cleansing repair module, and high-risk data that have been marked are sent to the malicious-data alarm module;
1.1.3 A black/white list database is built with a relational database, and the meta-information labelled in 1.1.2 is written into the relational database.
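A minimal sketch of how steps 1.1.2-1.1.3 might store labels and black/white lists in a relational database. SQLite and all table and column names here are assumptions chosen only for illustration.

```python
# Hypothetical sketch of steps 1.1.2-1.1.3: keep black/white lists and file
# meta-information in a relational database (SQLite is used only for brevity).
import sqlite3

conn = sqlite3.connect("meta.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS blacklist (source TEXT PRIMARY KEY, hits INTEGER);
CREATE TABLE IF NOT EXISTS whitelist (source TEXT PRIMARY KEY);
CREATE TABLE IF NOT EXISTS file_meta (
    key TEXT PRIMARY KEY, source TEXT, size INTEGER, label TEXT
);
""")

def record_meta(key: str, source: str, size: int, label: str) -> None:
    # `label` distinguishes 'missing' (routed to the repair module) from
    # 'high_risk' (routed to the alarm module), mirroring step 1.1.2.
    conn.execute(
        "INSERT OR REPLACE INTO file_meta VALUES (?, ?, ?, ?)",
        (key, source, size, label),
    )
    conn.commit()
```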
In the data cleansing early-warning module, the black/white list database of step 1.1.3 is used to decide where the data flow, and the metadata of step 1.1.2 are merged.
The data cleansing repair module includes the following steps:
1.3.1 In the cleansing early-warning module, missing data show up as empty cells or as NAN (not a number), N/A or None. For a categorical column that may contain meaningful missing data, a new category called Missing can be created, and the column is then handled like an ordinary column;
1.3.2 Following step 1.3.1, if a representative value is needed, the pre-repaired data are converted into a meaningful numerical value, for example the median of the business data.
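A small sketch of the repair rules 1.3.1-1.3.2 using pandas; the column names and sample values are illustrative assumptions, not fields defined by the patent.

```python
import numpy as np
import pandas as pd

# Illustrative frame: "category" and "amount" are hypothetical business columns.
df = pd.DataFrame({
    "category": ["A", None, "B", "N/A"],
    "amount":   [10.0, np.nan, 7.5, None],
})

# 1.3.1: treat empty cells / NAN / N/A / None in a categorical column as an
# explicit "Missing" class, then handle the column like any ordinary column.
df["category"] = df["category"].replace("N/A", pd.NA).fillna("Missing")

# 1.3.2: where a representative value is needed, convert the pre-repaired data
# to a meaningful number, e.g. the median of the business data.
df["amount"] = df["amount"].fillna(df["amount"].median())

print(df)
```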
In the high-risk data alarm module, the PLRU algorithm proceeds as follows:
1.4.1 A group of hash functions W = {W1, W2, ..., Wn} is formed whose output domain is X. Each qi in the data-source set Q = {q1, q2, ..., qn} is mapped by the n independent hash functions of W to n numbers in the interval [1, M];
1.4.2 If a is an input object that has been seen, the PLRU algorithm can map it to those n numbers; otherwise a is judged to be a new object. Within one detection period, the data-stream sizes obey a Pareto distribution with parameter 1 and skew parameter α;
1.4.3 Assume the remote server cluster receives K data packets within the monitoring period; on average PLRU then establishes one new data flag every J data packets and evicts some data from the bottom of the blacklist;
1.4.4 Assume a large stream E whose size is exactly equal to the threshold TH; the probability that the large data file E does not appear in J consecutive data files obeys a hypergeometric distribution, and when K >> J the probability that E is evicted can be derived from that distribution;
1.4.5 The blacklist database is updated according to steps 1.4.3 and 1.4.4;
1.4.6 Since the PLRU algorithm produces false alarms, a whitelist of the false-alarm samples already found is established to prevent repeated false alarms;
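A rough sketch of a PLRU-style blacklist update in the spirit of steps 1.4.1-1.4.6. The Bloom-style membership test, the eviction rule and all constants are simplifying assumptions; the probability analysis of step 1.4.4 is not reproduced here.

```python
import hashlib
from collections import OrderedDict

N_HASH, M = 4, 1 << 20          # n hash functions mapping into [1, M]
J, BLACKLIST_CAP = 1000, 256    # flag interval and assumed blacklist capacity

def hash_family(item: str) -> list[int]:
    """n independent hash values of `item`, each in [1, M] (step 1.4.1)."""
    return [
        int(hashlib.sha256(f"{i}:{item}".encode()).hexdigest(), 16) % M + 1
        for i in range(N_HASH)
    ]

seen_bits = bytearray(M + 1)                        # membership sketch (1.4.2)
blacklist: "OrderedDict[str, int]" = OrderedDict()  # source -> hit count
whitelist: set[str] = set()                         # confirmed false alarms (1.4.6)
packets = 0

def observe(source: str, flagged_high_risk: bool) -> None:
    """Feed one data packet's source through the PLRU-style blacklist update."""
    global packets
    packets += 1
    positions = hash_family(source)
    is_new = not all(seen_bits[p] for p in positions)   # new vs. seen object
    for p in positions:
        seen_bits[p] = 1
    if source in whitelist:                             # suppress known false alarms
        return
    if flagged_high_risk and (is_new or packets % J == 0):
        blacklist[source] = blacklist.get(source, 0) + 1
        blacklist.move_to_end(source)                   # most recently flagged
        if len(blacklist) > BLACKLIST_CAP:
            blacklist.popitem(last=False)               # evict the bottom entry (1.4.3)
```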
The data fast-storage module includes the following steps:
1.5.1 A relational database is introduced to store the metadata generated by the small-data-file merging process;
1.5.2 The hash values HS = {hs1, hs2, ..., hsn} of the processing servers are obtained by appending a number or port number to each machine IP or host name, and the HS set is mapped onto a closed ring in space;
1.5.3 The window data of the message-queue cache server are taken out and placed in the to-be-merged set G = {g1, g2, ..., gn}, where n is the number of files to be merged and gi is the i-th data file to be merged; data files that meet the trigger condition of the intelligent data acquisition module go through operation 1.5.4;
1.5.4 The data files that triggered TH2 are taken out of the sliding window Wn; Wn is merged using multiple threads, the merged data are uploaded to the distributed storage system, and the meta-information generated by the merge operation is stored in the relational database;
1.5.5 The meta-information Di of the i-th data file generated during merging is written into the relational database, where Di = {f1, f2, ..., fn} and fi is a data characteristic of the meta-information set;
1.5.6 When a client sends a request to read the small-data-file message queue, the relational database is accessed to obtain the meta-information Di of the data file;
1.5.7 The large data file in the distributed file system that holds the small-file data is accessed according to the characteristic fields in Di;
1.5.8 The corresponding small data file is parsed out of the large data file according to its characteristic fields;
1.5.9 A field identifier F is added to each data file to record its access frequency;
1.5.10 High-frequency data files are cached on a hot hard-disk block; the field added to each data file is used to judge whether the file resides on the hard disk of the file cache server, in which case the data are read directly from the data-file cache server.
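Steps 1.5.2-1.5.5 can be sketched as follows. The ring construction, the number of virtual nodes, the server addresses and the shape of the meta-information record Di are illustrative assumptions rather than the patent's prescription.

```python
# Illustrative sketch: place merge servers on a consistent-hash ring (1.5.2),
# pick the server for a batch of small files (1.5.3), merge them (1.5.4), and
# build the per-file meta-information for a relational database (1.5.5).
import bisect
import hashlib

def h(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes, replicas: int = 50):
        self._ring = sorted(
            (h(f"{node}#{i}"), node) for node in nodes for i in range(replicas)
        )
        self._keys = [k for k, _ in self._ring]

    def node_for(self, key: str) -> str:
        idx = bisect.bisect(self._keys, h(key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["10.0.0.1:9000", "10.0.0.2:9000", "10.0.0.3:9000"])

def merge_batch(batch_id: str, small_files: list[bytes]) -> dict:
    server = ring.node_for(batch_id)    # server chosen on the closed ring
    merged = b"".join(small_files)      # merged payload, then uploaded to the DFS
    offsets, pos = [], 0
    for f in small_files:               # one meta-information record Di per file
        offsets.append({"offset": pos, "length": len(f), "server": server})
        pos += len(f)
    return {"batch": batch_id, "size": len(merged), "files": offsets}
```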
The beneficial effects of the present invention are mainly: the utilization rate of the source data is raised and the safety and stability of the system are improved; the PLRU algorithm rapidly and iteratively updates the blacklist database, which raises storage efficiency and data-filtering efficiency, improves system safety and stability, and enhances the robustness of the whole system.
Detailed description of the invention
Fig. 1 shows the model diagram of the big-data-based pre-inspection and pre-repair visualization system.
Fig. 2 shows the flow chart of the big-data-based pre-inspection and pre-repair visualization system.
Fig. 3 shows the data preview and pre-repair process diagram of the big-data-based pre-inspection and pre-repair visualization system.
Fig. 4 shows the model diagram of the data fast-storage module of the big-data-based pre-inspection and pre-repair visualization system.
Specific embodiment
The invention will be further described below in conjunction with the accompanying drawings.
The system operation and method of the invention are described in detail below. Clearly, the described implementation cases are only some of the cases of the present invention rather than all of them; all other examples obtained by persons of ordinary skill in the art without creative change or substantive optimization shall fall within the protection scope of the present invention.
Referring to Figs. 1 to 4, a big-data-based pre-inspection and pre-repair dynamic visualization system includes an intelligent data acquisition module 1, a data cleansing early-warning module 2, a data cleansing repair module 3, a high-risk data alarm module 4, a data fast-storage module 5 and a GIS data visualization module 6. The intelligent data acquisition module is connected with the data cleansing early-warning module; the data cleansing early-warning module is connected with the data cleansing repair module and with the high-risk data alarm module; the data cleansing early-warning module and the high-risk data alarm module are connected with the GIS data visualization module, as shown in Fig. 1.
The intelligent data acquisition module classifies and labels different data sources by adding data buffer queues on the data cache server, and stores and manages the meta-information of the data. Collected messages are sent to the data cache server; taking into account the data characteristics of their own fields and the diversity of data-file sizes, a critical data-file size T is set according to the BLOCK size of the distributed file system. The cache server judges the size of each file: a data file smaller than T is given a data identifier, i.e. a KEY, while a data file larger than the given T is sent directly to the distributed file system once data processing is complete. Labelled files are stored in the corresponding data queues until the merge threshold TH2 is triggered. The module includes the following steps:
1.1.1 Data are hashed and stored using the consistent hashing algorithm of the data fast-storage module.
1.1.2 Meta-information management: the pre-cleansing early-warning module identifies traffic attacks, web crawlers and traffic fraud (fake traffic); data marked as missing are sent to the data cleansing repair module, and high-risk data that have been marked are sent to the malicious-data alarm module;
1.1.3 A black/white list database is built with a relational database, and the meta-information labelled in 1.1.2 is written into the relational database;
The data cleansing early-warning module parses the data sources, uses the black/white list database of step 1.1.3 to decide where the data flow, and merges the metadata of step 1.1.2; it relies on algorithms to identify abnormal traffic and data and summarizes the corresponding filtering rules so that such data are filtered out and the remaining data are used downstream.
The data cleansing repair module corrects data gaps with a data dictionary and rejects invalid data. It includes the following steps:
1.3.1 In the cleansing early-warning module, missing data show up as empty cells or as NAN (not a number), N/A or None. For a categorical column that may contain meaningful missing data, a new category called Missing can be created, and the column is then handled like an ordinary column;
1.3.2 Following step 1.3.1, if a representative value is needed, the pre-repaired data are converted into a meaningful numerical value, for example the median of the business data.
The high-risk data alarm module establishes a blacklist and uses the PLRU algorithm to dynamically load and update the blacklist data, and establishes a whitelist to reduce the false-alarm rate of the PLRU algorithm.
In the high-risk data alarm module, the PLRU algorithm proceeds as follows:
1.4.1 A group of hash functions W = {W1, W2, ..., Wn} is formed whose output domain is X. Each qi in the data-source set Q = {q1, q2, ..., qn} is mapped by the n independent hash functions of W to n numbers in the interval [1, M];
1.4.2 If a is an input object that has been seen, the PLRU algorithm can map it to those n numbers; otherwise a is judged to be a new object. Within one detection period, the data-stream sizes obey a Pareto distribution with parameter 1 and skew parameter α;
1.4.3 Assume the remote server cluster receives K data packets within the monitoring period; on average PLRU then establishes one new data flag every J data packets and evicts some data from the bottom of the blacklist;
1.4.4 Assume a large stream E whose size is exactly equal to the threshold TH; the probability that the large data file E does not appear in J consecutive data files obeys a hypergeometric distribution, and when K >> J the probability that E is evicted can be derived from that distribution;
1.4.5 The blacklist database is updated according to steps 1.4.3 and 1.4.4;
1.4.6 Since the PLRU algorithm produces false alarms, a whitelist of the false-alarm samples already found is established to prevent repeated false alarms;
The data fast-storage module stores the labelled data output by the data cleansing module, which substantially alleviates the system bottleneck that small files cause through frequent IO operations on the distributed file system; a consistent hashing algorithm achieves a good cluster load balance. The module includes the following steps:
1.5.1 A relational database is introduced to store the metadata generated by the small-data-file merging process;
1.5.2 The hash values HS = {hs1, hs2, ..., hsn} of the processing servers are obtained by appending a number or port number to each machine IP or host name, and the HS set is mapped onto a closed ring in space;
1.5.3 The window data of the message-queue cache server are taken out and placed in the to-be-merged set G = {g1, g2, ..., gn}, where n is the number of files to be merged and gi is the i-th data file to be merged; data files that meet the trigger condition of the intelligent data acquisition module go through operation 1.5.4;
1.5.4 The data files that triggered TH2 are taken out of the sliding window Wn; Wn is merged using multiple threads, the merged data are uploaded to the distributed storage system, and the meta-information generated by the merge operation is stored in the relational database;
1.5.5 The meta-information Di of the i-th data file generated during merging is written into the relational database, where Di = {f1, f2, ..., fn} and fi is a data characteristic of the meta-information set;
1.5.6 When a client sends a request to read the small-data-file message queue, the relational database is accessed to obtain the meta-information Di of the data file;
1.5.7 The large data file in the distributed file system that holds the small-file data is accessed according to the characteristic fields in Di;
1.5.8 The corresponding small data file is parsed out of the large data file according to its characteristic fields;
1.5.9 A field identifier F is added to each data file to record its access frequency;
1.5.10 High-frequency data files are cached on a hot hard-disk block; the field added to each data file is used to judge whether the file resides on the hard disk of the file cache server, in which case the data are read directly from the data-file cache server;
The GIS data visualization module dynamically displays the legitimate and safe data after cleansing. The module encapsulates the open-source library ECharts, so a component suited to the business at hand can be chosen according to the data type. It provides more precise spatial geographic information, is intuitive and rich in interaction, supports highly personalized customization, and allows a personalized front-end UI theme to be developed. High-risk data information and repair data information are presented on the front-end page, where more integrated information can be analysed.
Fig. 2 is now explained. When data are acquired into the system they enter the message cache 11. The message cache 11 stores the data by classification 14 according to the classification requirements and then sends them to the pre-cleansing module 12, which regularizes missing data and high-risk data and writes the processing meta-information into the relational database 17. After pre-cleansing, the data are distributed according to actual need to the real-time computing engine and the offline computing engine 13; a synchronization program synchronizes the computed results with the queue server 14 and the relational database 17, and finally the data results are saved in the distributed file system 15. On the management platform 19 both real-time data and offline data can be operated on and shown in the visualization 20.
The workflow of the big-data-based pre-inspection and pre-repair visualization system of this embodiment includes the following steps:
Step S000: in module 1, the threshold TH1 of the data-file size is preset according to the storage capacity and computing power of the cluster itself.
Step S001: identification information is added to the files acquired from the different data sources, as in operating mode 101 of Fig. 3. The message cache server 11 judges the size of the received file; if it is smaller than the threshold TH1 of S000, the field identifier KEY is added, as shown at 201 in Fig. 4.
Step S002: the data of S001 are first filtered against the blacklist 104.
Step S003: according to the data sources Q = {q1, q2, ..., qn} of S001, the composition of the hash functions W = {W1, W2, ..., Wn} is determined; the output domain of the hash functions is X.
Step S004: if a data source conforms to the composite customized data rules, each qi of the data-source set Q = {q1, q2, ..., qn} is mapped by the n independent hash functions of W to n numbers in the interval [1, M].
Step S005: assume the remote server cluster receives K data packets within the monitoring period; on average PLRU then establishes one new data flag every J data packets and evicts some data from the bottom of the blacklist.
Step S006: assume a data file E whose size is exactly equal to the threshold TH; the probability that the large data file E does not appear in J consecutive data files obeys a hypergeometric distribution.
Step S007: when K >> J, the probability that E is evicted can be derived from that distribution.
Step S009: the blacklist database 104 is updated iteratively according to S007.
Step S010: the data acquired in S001 are loaded into the corresponding data queues.
Step S011: in S010, a data queue is allocated on request only when a data file sends a request; if that data queue is empty while the data cache server is not empty, a FIFO operation is performed; otherwise the data queue releases its memory.
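A small sketch of the on-demand queue behaviour of step S011; the queue keys and the FIFO hand-off shown here are assumptions, since the step only outlines the policy.

```python
# Rough sketch of step S011: a data queue is only allocated when a file-send
# request arrives; if that queue is empty while the cache server still holds
# data, data are handed over FIFO, otherwise the empty queue is released.
from collections import deque

cache_server: deque = deque()            # data held by the message cache server
data_queues: dict[str, deque] = {}       # queues allocated per request

def on_send_request(request_id: str) -> deque:
    q = data_queues.setdefault(request_id, deque())   # allocate on demand
    if not q:
        if cache_server:
            q.append(cache_server.popleft())          # FIFO hand-off
        else:
            data_queues.pop(request_id)               # free the empty queue
    return q
```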
Step S012: the data in the data queues of S010 are processed by the pre-repair strategy. Certain missing data usually show up as empty cells or as NAN (not a number), N/A or None; for a categorical column that may contain meaningful missing data a new category called Missing can be created and the column is then handled like an ordinary column, and if a representative value is needed the pre-repaired data are converted into a meaningful numerical value. See module 3 in Fig. 1 and 12 in Fig. 2.
Step S013: traffic attacks, web crawlers and traffic fraud (fake traffic) are identified, and the data updated iteratively in S009 are synchronized to the high-risk data alarm module. See module 4 in Fig. 1.
Step S104: the labelled data are stored in the corresponding message-queue cache server until the merge threshold TH2 is triggered.
Step S105: the data files that triggered TH2 are taken out of the sliding window Wn, and Wn is merged using multiple threads.
Step S106: the merged data are uploaded to the distributed storage system 15, and the meta-information generated by the merge operation is stored in the relational database 17 according to the configuration rules 16.
Step S107: the hash values HS = {hs1, hs2, ..., hsn} of the processing servers are obtained by appending a number or port number to each machine IP or host name, and the HS set is mapped onto a closed ring in space, as shown at 202 in Fig. 4.
Step S108: the window data Wn of the data queue are taken out and placed in the to-be-merged set G = {g1, g2, ..., gn}, where n is the number of files to be merged and gi is the i-th data file to be merged.
Step S109: the meta-information Di of the i-th data file generated during merging is written into the relational database, where Di = {f1, f2, ..., fn} and fi is a data characteristic of the meta-information set.
Step S110: when a client sends a request to read the message queue, the relational database is accessed to obtain the meta-information Di of the data file.
Step S111: the large data file in the distributed file system that holds the small-file data is accessed according to the characteristic fields in Di.
Step S112: the corresponding small data file is parsed out of the large data file according to its characteristic fields.
Step S113: a field identifier F is added to each data file to record its access frequency.
Step S114: high-frequency data files are cached on a hot hard-disk block; the field added to each data file is used to judge whether the file resides on the hard disk of the file cache server, in which case the data are read directly from the data-file cache server.
Step S115: the processed data are accumulated and displayed according to their heat, showing the geographic information of the merged data and the real-time transactions of the data.
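The hot-read path of steps S113-S114 could look roughly like this; HOT_THRESHOLD and read_from_dfs are hypothetical placeholders rather than anything specified by the patent.

```python
# Sketch of the hot-read path: every file carries a field F recording its
# access frequency; frequently read files are cached on the hot disk of the
# file cache server and served from there instead of the distributed FS.
HOT_THRESHOLD = 100
access_freq: dict[str, int] = {}      # field F per data-file key
hot_cache: dict[str, bytes] = {}      # files cached on the hot disk

def read_from_dfs(key: str) -> bytes:
    """Placeholder for a read via the distributed file system."""
    return b""

def read_file(key: str) -> bytes:
    access_freq[key] = access_freq.get(key, 0) + 1
    if key in hot_cache:
        return hot_cache[key]                      # direct read from the cache server
    data = read_from_dfs(key)
    if access_freq[key] >= HOT_THRESHOLD:
        hot_cache[key] = data                      # promote a high-frequency file
    return data
```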

Claims (6)

1. A big-data-based pre-inspection and pre-repair visualization system, characterized in that the system comprises an intelligent data acquisition module, a data cleansing early-warning module, a data cleansing repair module, a high-risk data alarm module, a data fast-storage module and a GIS data dynamic loading module;
the intelligent data acquisition module classifies and labels different data sources by adding data buffer queues on the data cache server, and stores and manages the meta-information of the data; collected messages are sent to the data cache server; taking into account the data characteristics of their own fields and the diversity of data-file sizes, a critical data-file size T is set according to the BLOCK size of the distributed file system; the cache server judges the size of each file: a data file smaller than T is given a data identifier, i.e. a KEY, while a data file larger than the given T is sent directly to the distributed file system once data processing is complete; labelled files are stored in the corresponding data queues until the merge threshold TH2 is triggered;
the data cleansing early-warning module parses the data sources, relies on algorithms to identify abnormal traffic and data, and summarizes the corresponding filtering rules so that such data are filtered out and the remaining data are used downstream;
the data cleansing repair module corrects data gaps with a data dictionary and rejects invalid data;
the high-risk data alarm module establishes a blacklist and uses the PLRU algorithm to dynamically load and update the blacklist data, and establishes a whitelist to reduce the false-alarm rate of the PLRU algorithm;
in the high-risk data alarm module, the PLRU algorithm is used;
the data fast-storage module stores the labelled data output by the data cleansing module, which substantially alleviates the system bottleneck that small files cause through frequent IO operations on the distributed file system; a consistent hashing algorithm achieves a good cluster load balance;
the GIS data visualization module dynamically displays the legitimate and safe data after cleansing; the module encapsulates the open-source library ECharts, so a component suited to the business at hand can be chosen according to the data type; it provides more precise spatial geographic information, is intuitive and rich in interaction, supports highly personalized customization, and allows a personalized front-end UI theme to be developed; high-risk data information and repair data information are presented on the front-end page, where more integrated information can be analysed.
2. The big-data-based pre-inspection and pre-repair dynamic visualization system according to claim 1, characterized in that the intelligent data acquisition module includes the following steps:
1.1.1 data are hashed and stored using the consistent hashing algorithm of the data fast-storage module;
1.1.2 meta-information management: the pre-cleansing early-warning module identifies traffic attacks, web crawlers and traffic fraud; data marked as missing are sent to the data cleansing repair module, and high-risk data that have been marked are sent to the malicious-data alarm module;
1.1.3 a black/white list database is built with a relational database, and the meta-information labelled in 1.1.2 is written into the relational database.
3. The big-data-based pre-inspection and pre-repair dynamic visualization system according to claim 2, characterized in that in the data cleansing early-warning module, the black/white list database of step 1.1.3 is used to decide where the data flow, and the metadata of step 1.1.2 are merged.
4. The big-data-based pre-inspection and pre-repair dynamic visualization system according to any one of claims 1 to 3, characterized in that the data cleansing repair module includes the following steps:
1.3.1 in the cleansing early-warning module, missing data show up as empty cells or as NAN (not a number), N/A or None; for a categorical column that may contain meaningful missing data, a new category called Missing can be created, and the column is then handled like an ordinary column;
1.3.2 following step 1.3.1, if a representative value is needed, the pre-repaired data are converted into a meaningful numerical value, for example the median of the business data.
5. The big-data-based pre-inspection and pre-repair dynamic visualization system according to any one of claims 1 to 3, characterized in that in the high-risk data alarm module the PLRU algorithm proceeds as follows:
1.4.1 a group of hash functions W = {W1, W2, ..., Wn} is formed whose output domain is X; each qi in the data-source set Q = {q1, q2, ..., qn} is mapped by the n independent hash functions of W to n numbers in the interval [1, M];
1.4.2 if a is an input object that has been seen, the PLRU algorithm can map it to those n numbers; otherwise a is judged to be a new object; within one detection period, the data-stream sizes obey a Pareto distribution with parameter 1 and skew parameter α;
1.4.3 assume the remote server cluster receives K data packets within the monitoring period; on average PLRU then establishes one new data flag every J data packets and evicts some data from the bottom of the blacklist;
1.4.4 assume a large stream E whose size is exactly equal to the threshold TH; the probability that the large data file E does not appear in J consecutive data files obeys a hypergeometric distribution, and when K >> J the probability that E is evicted can be derived from that distribution;
1.4.5 the blacklist database is updated according to steps 1.4.3 and 1.4.4;
1.4.6 since the PLRU algorithm produces false alarms, a whitelist of the false-alarm samples already found is established to prevent repeated false alarms.
6. The big-data-based pre-inspection and pre-repair dynamic visualization system according to any one of claims 1 to 3, characterized in that the data fast-storage module includes the following steps:
1.5.1 a relational database is introduced to store the metadata generated by the small-data-file merging process;
1.5.2 the hash values HS = {hs1, hs2, ..., hsn} of the processing servers are obtained by appending a number or port number to each machine IP or host name, and the HS set is mapped onto a closed ring in space;
1.5.3 the window data of the message-queue cache server are taken out and placed in the to-be-merged set G = {g1, g2, ..., gn}, where n is the number of files to be merged and gi is the i-th data file to be merged; data files that meet the trigger condition of the intelligent data acquisition module go through operation 1.5.4;
1.5.4 the data files that triggered TH2 are taken out of the sliding window Wn; Wn is merged using multiple threads, the merged data are uploaded to the distributed storage system, and the meta-information generated by the merge operation is stored in the relational database;
1.5.5 the meta-information Di of the i-th data file generated during merging is written into the relational database, where Di = {f1, f2, ..., fn} and fi is a data characteristic of the meta-information set;
1.5.6 when a client sends a request to read the small-data-file message queue, the relational database is accessed to obtain the meta-information Di of the data file;
1.5.7 the large data file in the distributed file system that holds the small-file data is accessed according to the characteristic fields in Di;
1.5.8 the corresponding small data file is parsed out of the large data file according to its characteristic fields;
1.5.9 a field identifier F is added to each data file to record its access frequency;
1.5.10 high-frequency data files are cached on a hot hard-disk block; the field added to each data file is used to judge whether the file resides on the hard disk of the file cache server, in which case the data are read directly from the data-file cache server.
CN201811322934.1A 2018-11-08 2018-11-08 Big data-based visual system for pre-inspection and pre-repair Active CN109460393B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811322934.1A CN109460393B (en) 2018-11-08 2018-11-08 Big data-based visual system for pre-inspection and pre-repair

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811322934.1A CN109460393B (en) 2018-11-08 2018-11-08 Big data-based visual system for pre-inspection and pre-repair

Publications (2)

Publication Number Publication Date
CN109460393A true CN109460393A (en) 2019-03-12
CN109460393B CN109460393B (en) 2022-04-08

Family

ID=65609667

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811322934.1A Active CN109460393B (en) 2018-11-08 2018-11-08 Big data-based visual system for pre-inspection and pre-repair

Country Status (1)

Country Link
CN (1) CN109460393B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110502571A (en) * 2019-08-29 2019-11-26 智洋创新科技股份有限公司 A kind of recognition methods of the high-incidence line segment of electric transmission line channel visual alerts
CN111090646A (en) * 2019-10-21 2020-05-01 中国科学院信息工程研究所 Electromagnetic data processing platform
CN113163353A (en) * 2020-04-15 2021-07-23 贵州电网有限责任公司 Intelligent health service system of power supply vehicle and data transmission method thereof
CN113448946A (en) * 2021-07-05 2021-09-28 星辰天合(北京)数据科技有限公司 Data migration method and device and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160180088A1 (en) * 2014-12-23 2016-06-23 Mcafee, Inc. Discovery of malicious strings
CN106484855A (en) * 2016-09-30 2017-03-08 广州特道信息科技有限公司 A kind of big data concerning taxes intelligence analysis system
CN107273409A (en) * 2017-05-03 2017-10-20 广州赫炎大数据科技有限公司 A kind of network data acquisition, storage and processing method and system
CN108228830A (en) * 2018-01-03 2018-06-29 广东工业大学 A kind of data processing system
US10353870B2 (en) * 2016-02-17 2019-07-16 Netapp Inc. Tracking structure for data replication synchronization

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160180088A1 (en) * 2014-12-23 2016-06-23 Mcafee, Inc. Discovery of malicious strings
US10353870B2 (en) * 2016-02-17 2019-07-16 Netapp Inc. Tracking structure for data replication synchronization
CN106484855A (en) * 2016-09-30 2017-03-08 广州特道信息科技有限公司 A kind of big data concerning taxes intelligence analysis system
CN107273409A (en) * 2017-05-03 2017-10-20 广州赫炎大数据科技有限公司 A kind of network data acquisition, storage and processing method and system
CN108228830A (en) * 2018-01-03 2018-06-29 广东工业大学 A kind of data processing system

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110502571A (en) * 2019-08-29 2019-11-26 智洋创新科技股份有限公司 A kind of recognition methods of the high-incidence line segment of electric transmission line channel visual alerts
CN110502571B (en) * 2019-08-29 2020-05-08 智洋创新科技股份有限公司 Method for identifying visible alarm high-power-generation line segment of power transmission line channel
CN111090646A (en) * 2019-10-21 2020-05-01 中国科学院信息工程研究所 Electromagnetic data processing platform
CN111090646B (en) * 2019-10-21 2023-07-28 中国科学院信息工程研究所 Electromagnetic data processing platform
CN113163353A (en) * 2020-04-15 2021-07-23 贵州电网有限责任公司 Intelligent health service system of power supply vehicle and data transmission method thereof
CN113163353B (en) * 2020-04-15 2022-12-27 贵州电网有限责任公司 Intelligent health service system of power supply vehicle and data transmission method thereof
CN113448946A (en) * 2021-07-05 2021-09-28 星辰天合(北京)数据科技有限公司 Data migration method and device and electronic equipment
CN113448946B (en) * 2021-07-05 2024-01-12 北京星辰天合科技股份有限公司 Data migration method and device and electronic equipment

Also Published As

Publication number Publication date
CN109460393B (en) 2022-04-08

Similar Documents

Publication Publication Date Title
CN109460393A (en) Visualization system is repaired in a kind of preliminary examination based on big data in advance
CN108628929B (en) Method and apparatus for intelligent archiving and analysis
CN107256219B (en) Big data fusion analysis method applied to mass logs of automatic train control system
US10599684B2 (en) Data relationships storage platform
CN109918511B (en) BFS and LPA based knowledge graph anti-fraud feature extraction method
CN105045820B (en) Method for processing video image information of high-level data and database system
CN111885040A (en) Distributed network situation perception method, system, server and node equipment
CN102982097B (en) Domain for Knowledge based engineering data quality solution
CN109213752A (en) A kind of data cleansing conversion method based on CIM
CN104636751A (en) Crowd abnormity detection and positioning system and method based on time recurrent neural network
CN105069025A (en) Intelligent aggregation visualization and management control system for big data
CN113849483A (en) Real-time database system architecture for intelligent factory
CN106534784A (en) Acquisition analysis storage statistical system for video analysis data result set
CN109308290B (en) Efficient data cleaning and converting method based on CIM
CN112181960A (en) Intelligent operation and maintenance framework system based on AIOps
CN116049454A (en) Intelligent searching method and system based on multi-source heterogeneous data
CN110197708A (en) A kind of migration of block chain and storage method towards electron medical treatment case history
CN112883001A (en) Data processing method, device and medium based on marketing and distribution through data visualization platform
CN109542846A (en) A kind of Internet of Things vulnerability information management system based on data virtualization
CN109800133A (en) A kind of method, one-stop monitoring alarm platform and the system of unified monitoring alarm
WO2022188646A1 (en) Graph data processing method and apparatus, and device, storage medium and program product
CN111125450A (en) Management method of multilayer topology network resource object
CN110674168A (en) Cache key abnormity detection method, device, storage medium and terminal
CN113254517A (en) Service providing method based on internet big data
CN106528682A (en) Big-data text mining system of call center

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant