CN109460393A - Pre-inspection and pre-repair visualization system based on big data - Google Patents
- Publication number
- CN109460393A (application CN201811322934.1A)
- Authority
- CN
- China
- Prior art keywords
- data
- module
- file
- repaired
- cleansing
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A pre-inspection and pre-repair visualization system based on big data comprises an intelligent data acquisition module, a data cleansing early-warning module, a data cleansing repair module, a high-risk data alarm module, a fast data storage module and a GIS data dynamic loading module. The intelligent data acquisition module classifies incoming data, which improves the efficiency of cleansing data files. A pre-inspection policy marks high-risk data on a blacklist, and the blacklist is iteratively updated with the PLRU algorithm, which substantially improves the system's handling of false positives. A pre-repair policy repairs incomplete data, which greatly improves data utilization. The fast data storage module stores safe data quickly, which improves both the real-time loading rate of the visualization and the loading speed of historical data. Finally, the pre-inspected and pre-repaired data stream is displayed on a dynamic GIS map, so that managers can carry out risk-control scheduling and system optimization more directly.
Description
Technical field
The present invention relates to the fields of data processing and data storage, and in particular to a visualization system based on big-data pre-inspection and pre-repair.
Background art
With the development of new and high technology, big data has become an important tool for national development. Promoting the development and application of big data will create a new model of social governance with precise governance and multi-party cooperation; establish a new mechanism of economic operation that is smooth, safe and efficient; build a new people-oriented system of livelihood services that benefits the whole population; open up a new pattern of innovation-driven development with mass entrepreneurship and innovation; and cultivate a new ecosystem of high-end, intelligent and thriving industrial development.
With the arrival of the DT (data technology) era, people can collect richer data than ever before. An IDC report predicts that by 2020 the total volume of global data will exceed 40 ZB (equivalent to 40 trillion GB), 22 times the volume in 2011. How to turn this explosively growing data into high-value information for decision-making, prediction and strategic development has become a new research hotspot.
After an acquisition system has collected a large amount of raw data, the data can be used to gain insight into business rules and to mine latent information only once it has been integrated and computed; only then is the value of big data realized and the goal of empowering business and creating value achieved. To cope with massive data and complex computation, the data computing layer comprises two major systems: data storage and the computing platform. The development of data mining technology and that of data storage and computing technology are complementary. Without the development of data infrastructure and distributed parallel computing there would be no deep learning, let alone the feats of AlphaGo. The development of cloud computing platforms allows massive, high-velocity, highly variable, multi-terminal structured and unstructured data to be stored and computed efficiently. In e-commerce, for example, global profiles of massive numbers of members and commodities, cross-domain ID-Mapping at the level of the natural person, precise advertising delivery platforms, personalized search and recommendation tailored to each user, identification of non-human traffic and malicious devices, and automatic mining of competitive business intelligence have penetrated every link of enterprise development. "No data, no intelligence; no intelligence, no business": the new commercial revolution brought about by the fusion of big data and machine learning has already arrived.
Data quality is the basis of the validity and accuracy of any conclusion drawn from data analysis, and the precondition for everything else. Guaranteeing data quality and ensuring data availability is a step in data warehouse construction that cannot be neglected. Data has become an important factor of production, and the aim is to maximize its value in applications such as search, recommendation, advertising, finance, credit, insurance, entertainment and logistics. Data provided to merchants can guide their data-driven operations and empower them in diversified and inclusive ways; it can be used to deliver a better search experience, more accurate personalized recommendation, an optimized shopping experience, more precise advertising and more inclusive financial services. Data provided to employees can be used for data-driven operations and decision-making.
The big data processing platforms in general use today lack a pre-cleaning strategy for data source access. As a result, large amounts of missing, invalid, high-risk and duplicate data enter data analysis, which seriously affects the analysis results and the accuracy of prediction and regression models.
Distributed file systems support the storage of large-scale data sets thanks to their high fault tolerance, scalability and low-cost storage. However, they are inefficient at receiving and storing massive, highly concurrent, continuous, high-speed streams of small data files: every insert, lookup, delete or update operation involves a large amount of I/O with the distributed file system, which greatly reduces its performance.
Current data visualization mainly relies on commercial solutions. While these meet many customer needs, their fixed designs and highly templated data prevent them from offering more personalized services; visualization costs are too high, and they cannot effectively match each specific business. Developing a solution around the characteristics of one's own business, on the other hand, is out of reach because the development cycle is too long and the cost too high.
Current big data visualization applications usually show only the results of processing and analysis and lack the necessary early-warning and alarm indications. Optimization generally depends on the industry experience of decision-makers, and such personnel-driven decisions suffer from staff turnover and a lack of continuity. This problem urgently needs to be solved.
Summary of the invention
Existing big data systems have three shortcomings: most potentially valuable data is filtered out during acquisition, frequent I/O interaction with the distributed file system causes a performance bottleneck, and integrated information in the visualization interface is hard to customize. To overcome them, the present invention provides a dynamic visualization system based on big-data pre-inspection and pre-repair. The pre-inspection and pre-repair policy cleans and corrects the data and transforms its structure. After this processing the data has structured or semi-structured characteristics and can conveniently be loaded and used by a relational database, which raises the utilization of the source data and improves the safety and stability of the system. The blacklist database is updated quickly and iteratively by the PLRU algorithm, which improves storage efficiency and data filtering efficiency and strengthens the robustness of the whole system. A consistent hashing algorithm distributes the data file queues evenly over the servers of the cluster, which addresses data skew and gives good cluster load balancing. The metadata produced under the data file storage policy is cached in a relational database, which relieves the metadata storage pressure on the distributed file system (for example, on the NameNode in HDFS). Frequently accessed data files are cached in a hot block on hard disk; a field added to each data file is used to judge whether the file is on the hard disk of the file cache server, and if so the data is read directly from the data file cache server. This reduces query and update interactions with the distributed file system, speeds up access and greatly improves the overall read/write performance of the system. The method responds especially quickly when storing and querying massive numbers of small data files.
To achieve the above object, the technical solution adopted by the present invention is as follows:
A dynamic visualization system based on big-data pre-inspection and pre-repair, comprising an intelligent data acquisition module, a data cleansing early-warning module, a data cleansing repair module, a high-risk data alarm module, a fast data storage module and a GIS data dynamic loading module;
The intelligent data acquisition module classifies and labels the different data sources by means of a data cache server with data buffer queues, and stores and manages the metadata of the data. Collected messages are sent to the data cache server. Taking into account the data characteristics of the field concerned and the diversity of data file sizes, a critical size T for data files is set according to the BLOCK size of the distributed file system, and the cache server judges the size of each file. A data identifier, the KEY, is added to every data file smaller than T; a data file larger than T is sent directly to the distributed file system once data processing is complete. Files are stored in the corresponding data queues according to their labels until the merge threshold TH2 is triggered;
The data cleansing early-warning module parses the data sources, relies on algorithms to identify abnormal traffic and data, and derives the corresponding filtering rules so that such data is filtered out before downstream use;
The data cleansing repair module uses a data dictionary to fill in missing data and rejects invalid data;
The high-risk data alarm module maintains a blacklist whose data is dynamically loaded and updated by the PLRU algorithm, and reduces the error rate of the PLRU algorithm by maintaining a whitelist;
The fast data storage module stores the labelled data that has passed through the data cleansing modules, which greatly relieves the system bottleneck caused by small files triggering frequent I/O operations on the distributed file system; it uses a consistent hashing algorithm, which gives good cluster load balancing;
The GIS data visualization module dynamically displays the cleaned, legitimate and safe data. The module wraps the open-source library ECharts, so a display component suited to the business can be chosen according to the data type. It provides more accurate spatial geographic information, is intuitive and richly interactive, is highly customizable and allows personalized front-end UI themes to be developed. High-risk data information and repaired data information are presented on the front-end page, so that analysis of more integrated information can be carried out from the front end.
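The size-based routing described for the intelligent data acquisition module can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the class and field names, the 128 MiB block size, the treatment of a file exactly equal to T as "large", and the reading of TH2 as a per-queue file count are all hypothetical, since the text does not specify them.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size in bytes; not given in the text
T = BLOCK_SIZE                  # critical file size T, derived from the block size
TH2 = 4                         # merge threshold TH2, assumed here to be a file count


@dataclass
class DataFile:
    name: str
    source: str                 # label of the originating data source
    size: int                   # size in bytes
    key: Optional[str] = None   # the KEY identifier added to small files


class AcquisitionRouter:
    """Routes large files to the DFS and queues small files per source label."""

    def __init__(self, threshold=T, merge_threshold=TH2):
        self.threshold = threshold
        self.merge_threshold = merge_threshold
        self.queues = defaultdict(list)  # source label -> queued small files
        self.sent_to_dfs = []            # stand-in for direct upload to the DFS
        self.merge_batches = []          # batches handed over for merging

    def accept(self, f: DataFile) -> str:
        # Assumption: a file of exactly T bytes is treated as large.
        if f.size >= self.threshold:
            self.sent_to_dfs.append(f)
            return "dfs"
        f.key = f"{f.source}:{f.name}"   # hypothetical KEY format
        queue = self.queues[f.source]
        queue.append(f)
        if len(queue) >= self.merge_threshold:
            self.merge_batches.append(list(queue))
            queue.clear()
            return "merge"
        return "queued"
```

A caller would feed every collected file to `accept` and pass each entry of `merge_batches` on to the merge step of the fast data storage module.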
Further, processing in the intelligent data acquisition module comprises the following steps:
1.1.1 Data is hashed and stored using the consistent hashing algorithm of the fast data storage module;
1.1.2 Metadata management: the pre-cleaning early-warning module identifies traffic attacks, web crawlers and traffic cheating; data marked as missing is sent to the data cleansing repair module, and data marked as high-risk is sent to the high-risk (malicious) data alarm module;
1.1.3 A black/white list database is built on a relational database, and the metadata labelled in step 1.1.2 is written into the relational database.
In the data cleansing early-warning module, the black/white list database of step 1.1.3 is used to decide where the data flows, and the metadata of step 1.1.2 is merged.
Processing in the data cleansing repair module comprises the following steps:
1.3.1 Missing data identified by the cleansing early-warning module appears as empty cells or as NaN (not a number), N/A or None. For a category column that may contain meaningful missing data, a new category called Missing can be created, after which the column is processed like an ordinary column;
1.3.2 If a representative value is needed for the data of step 1.3.1, the pre-repaired data is converted into a meaningful numerical value, for example the median of the business data.
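Steps 1.3.1 and 1.3.2 can be illustrated with a short Python sketch. The function names and the exact set of tokens treated as missing are assumptions made for illustration; the text only names empty cells, NaN, N/A and None.

```python
import math
from statistics import median

# Tokens treated as missing (assumed spelling variants, compared case-insensitively).
MISSING_TOKENS = {"", "nan", "n/a", "none"}


def is_missing(value) -> bool:
    """True for None, float NaN, and the textual markers listed above."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip().lower() in MISSING_TOKENS


def repair_category(column):
    """Step 1.3.1: map missing entries of a category column to a 'Missing' class."""
    return ["Missing" if is_missing(v) else v for v in column]


def repair_numeric(column):
    """Step 1.3.2: fill missing numeric entries with the median of the present ones."""
    present = [float(v) for v in column if not is_missing(v)]
    if not present:
        return list(column)  # nothing to derive a median from; leave unchanged
    fill = median(present)
    return [fill if is_missing(v) else float(v) for v in column]
```

The median is used here because the text gives it as the example of a representative value; another statistic could be substituted where the business data calls for it.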
In the high-risk data alarm module the PLRU algorithm is used, with the following steps:
1.4.1 A group of hash functions W = {W1, W2, ..., Wn} is used, whose output range is X. For each qi in the data source Q = {q1, q2, ..., qn}, the n independent hash functions of W map qi to n numbers in the interval [1, M];
1.4.2 If a is an input object, it is mapped to n numbers when the PLRU algorithm runs; otherwise a is judged to be a new object. Within a detection period, the data stream size follows a Pareto distribution with parameter 1 and skew (shape) parameter α;
1.4.3 Assume that the number of data packets from the remote server cluster within the monitoring period is K. On average PLRU creates a new data flag every J data packets and evicts some entries at the bottom of the blacklist;
1.4.4 Assume that the size of a large stream E is exactly equal to the threshold TH. The probability that the large data file E does not appear in J consecutive data files then follows a hypergeometric distribution: [formula not reproduced in this text]. When K >> J, the probability that E is evicted is: [formula not reproduced in this text], where [definition not reproduced in this text];
1.4.5 The blacklist database is updated according to steps 1.4.3 and 1.4.4;
1.4.6 Because the PLRU algorithm produces false positives, samples already identified as false positives are placed on a whitelist to prevent them from being reported again;
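The interplay of blacklist eviction and whitelist override in steps 1.4.3 to 1.4.6 can be sketched as follows. Note that this is not the PLRU algorithm itself, whose details are not fully recoverable from the text: the sketch substitutes a plain least-recently-used ordering with a fixed capacity, and the class name, method names and capacity parameter are hypothetical.

```python
from collections import OrderedDict


class Blacklist:
    """Bounded blacklist with LRU-style eviction and a whitelist override.

    Assumption: plain LRU stands in for PLRU; entries at the "bottom"
    (least recently flagged) are evicted when capacity is exceeded.
    """

    def __init__(self, capacity, whitelist=()):
        self.capacity = capacity
        self.whitelist = set(whitelist)
        self._entries = OrderedDict()  # oldest (bottom) entry first

    def flag(self, source_id) -> bool:
        """Mark a source as high-risk; returns False if it is whitelisted."""
        if source_id in self.whitelist:
            return False
        if source_id in self._entries:
            self._entries.move_to_end(source_id)    # refresh recency
        else:
            self._entries[source_id] = True
            if len(self._entries) > self.capacity:
                self._entries.popitem(last=False)   # evict the bottom entry
        return True

    def is_blocked(self, source_id) -> bool:
        return source_id not in self.whitelist and source_id in self._entries

    def allow(self, source_id) -> None:
        """Record a known false positive so it is never blocked again."""
        self.whitelist.add(source_id)
        self._entries.pop(source_id, None)
```

Calling `allow` for a source that turned out to be a false positive corresponds to the whitelist of step 1.4.6.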
Processing in the fast data storage module comprises the following steps:
1.5.1 A relational database is introduced to store the metadata generated when small data files are merged;
1.5.2 The hash values HS = {hs1, hs2, ..., hsn} of the current processing servers are obtained by appending a number or port number to the machine IP or host name, and the set HS is mapped onto a closed ring in the hash space;
1.5.3 The window data of the message queue cache server is taken out and put into the set to be merged, G = {g1, g2, ..., gn}, where n is the number of files to be merged and gi is the i-th data file to be merged; the data files that satisfy the trigger condition of the intelligent data acquisition module proceed to step 1.5.4;
1.5.4 The data files that triggered TH2 are taken out of the sliding window Wn and merged using multiple threads; the merged data is uploaded to the distributed storage system, and the metadata generated by the merge operation is stored in the relational database;
1.5.5 The metadata Di of the i-th data file generated in the merge is written to the relational database, where Di = {f1, f2, ..., fn} and fi is a data feature in the metadata set;
1.5.6 When a client sends a request to read a small data file from the message queue, the relational database is accessed to obtain the metadata Di of the data file;
1.5.7 The large data file in the distributed file system that contains the small file's data is accessed according to the feature fields in Di;
1.5.8 The corresponding small data file is parsed out of the large data file according to the feature fields;
1.5.9 A field identifier F is added to each data file to record the access frequency of the data file;
1.5.10 Frequently accessed data files are cached in a hot block on hard disk; the field added to the data file is used to judge whether the file is on the hard disk of the file cache server, and if so the data is read directly from the data file cache server.
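The closed hash ring of step 1.5.2 can be sketched in Python as follows. The choice of MD5 as the hash function, the number of numbered replicas per server and the `server#i` naming scheme are assumptions made for illustration; the text states only that a number or port number is appended to the machine IP or host name.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # MD5 is used only as a well-spread hash; any stable hash would do.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Consistent hashing ring with numbered virtual nodes per server."""

    def __init__(self, servers, replicas=100):
        ring = []
        for server in servers:
            for i in range(replicas):
                # "IP or host name plus a number" yields one point on the ring.
                ring.append((_hash(f"{server}#{i}"), server))
        ring.sort()
        self._ring = ring
        self._hashes = [h for h, _ in ring]

    def server_for(self, file_key: str) -> str:
        """Return the first server clockwise from the key's position."""
        if not self._ring:
            raise ValueError("ring has no servers")
        idx = bisect.bisect_right(self._hashes, _hash(file_key)) % len(self._ring)
        return self._ring[idx][1]
```

Because each key goes to the next point clockwise on the ring, removing one server moves only the keys that server held, which is the property that keeps the load balanced as the cluster changes.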
The beneficial effects of the present invention are mainly as follows: the utilization of source data is raised and the safety and stability of the system are improved; the blacklist database is updated quickly and iteratively by the PLRU algorithm, which improves storage efficiency and data filtering efficiency and strengthens the robustness of the whole system.
Brief description of the drawings
Fig. 1 is a model diagram of the pre-inspection and pre-repair visualization system based on big data.
Fig. 2 is a flow chart of the pre-inspection and pre-repair visualization system based on big data.
Fig. 3 is a diagram of the data pre-inspection and pre-repair process of the system.
Fig. 4 is a model diagram of the fast data storage module of the system.
Detailed description of the embodiments
The invention is further described below with reference to the accompanying drawings.
The operation and method of the system of the invention are described in detail below. The described embodiments are only some, not all, of the embodiments of the invention; all other embodiments obtained by a person of ordinary skill in the art without creative change or substantive optimization shall fall within the scope of protection of the present invention.
Referring to Figs. 1 to 4, a dynamic visualization system based on big-data pre-inspection and pre-repair comprises intelligent data acquisition module 1, data cleansing early-warning module 2, data cleansing repair module 3, high-risk data alarm module 4, fast data storage module 5 and GIS data visualization module 6. The intelligent data acquisition module is connected to the data cleansing early-warning module; the data cleansing early-warning module is connected to the data cleansing repair module and to the high-risk data alarm module; the data cleansing early-warning module and the high-risk data alarm module are connected to the GIS data visualization module. See Fig. 1 for details.
The intelligent data acquisition module classifies and labels the different data sources by means of a data cache server with data buffer queues, and stores and manages the metadata of the data. Collected messages are sent to the data cache server. Taking into account the data characteristics of the field concerned and the diversity of data file sizes, a critical size T for data files is set according to the BLOCK size of the distributed file system, and the cache server judges the size of each file. A data identifier, the KEY, is added to every data file smaller than T; a data file larger than T is sent directly to the distributed file system once data processing is complete. Files are stored in the corresponding data queues according to their labels until the merge threshold TH2 is triggered. The processing comprises the following steps:
1.1.1 Data is hashed and stored using the consistent hashing algorithm of the fast data storage module;
1.1.2 Metadata management: the pre-cleaning early-warning module identifies traffic attacks, web crawlers and traffic cheating (fake traffic); data marked as missing is sent to the data cleansing repair module, and data marked as high-risk is sent to the high-risk (malicious) data alarm module;
1.1.3 A black/white list database is built on a relational database, and the metadata labelled in step 1.1.2 is written into the relational database;
The data cleansing early-warning module parses the data sources, uses the black/white list database of step 1.1.3 to decide where the data flows, and merges the metadata of step 1.1.2. It relies on algorithms to identify abnormal traffic and data and derives the corresponding filtering rules so that such data is filtered out before downstream use.
The data cleansing repair module uses a data dictionary to fill in missing data and rejects invalid data. The processing comprises the following steps:
1.3.1 Missing data identified by the cleansing early-warning module appears as empty cells or as NaN (not a number), N/A or None. For a category column that may contain meaningful missing data, a new category called Missing can be created, after which the column is processed like an ordinary column;
1.3.2 If a representative value is needed for the data of step 1.3.1, the pre-repaired data is converted into a meaningful numerical value, for example the median of the business data.
The high-risk data alarm module maintains a blacklist whose data is dynamically loaded and updated by the PLRU algorithm, and reduces the error rate of the PLRU algorithm by maintaining a whitelist.
In the high-risk data alarm module the PLRU algorithm is used, with the following steps:
1.4.1 A group of hash functions W = {W1, W2, ..., Wn} is used, whose output range is X. For each qi in the data source Q = {q1, q2, ..., qn}, the n independent hash functions of W map qi to n numbers in the interval [1, M];
1.4.2 If a is an input object, it is mapped to n numbers when the PLRU algorithm runs; otherwise a is judged to be a new object. Within a detection period, the data stream size follows a Pareto distribution with parameter 1 and skew (shape) parameter α;
1.4.3 Assume that the number of data packets from the remote server cluster within the monitoring period is K. On average PLRU creates a new data flag every J data packets and evicts some entries at the bottom of the blacklist;
1.4.4 Assume that the size of a large stream E is exactly equal to the threshold TH. The probability that the large data file E does not appear in J consecutive data files then follows a hypergeometric distribution: [formula not reproduced in this text]. When K >> J, the probability that E is evicted is: [formula not reproduced in this text], where [definition not reproduced in this text];
1.4.5 The blacklist database is updated according to steps 1.4.3 and 1.4.4;
1.4.6 Because the PLRU algorithm produces false positives, samples already identified as false positives are placed on a whitelist to prevent them from being reported again;
The fast data storage module stores the labelled data that has passed through the data cleansing modules, which greatly relieves the system bottleneck caused by small files triggering frequent I/O operations on the distributed file system; it uses a consistent hashing algorithm, which gives good cluster load balancing. The processing comprises the following steps:
1.5.1 A relational database is introduced to store the metadata generated when small data files are merged;
1.5.2 The hash values HS = {hs1, hs2, ..., hsn} of the current processing servers are obtained by appending a number or port number to the machine IP or host name, and the set HS is mapped onto a closed ring in the hash space;
1.5.3 The window data of the message queue cache server is taken out and put into the set to be merged, G = {g1, g2, ..., gn}, where n is the number of files to be merged and gi is the i-th data file to be merged; the data files that satisfy the trigger condition of the intelligent data acquisition module proceed to step 1.5.4;
1.5.4 The data files that triggered TH2 are taken out of the sliding window Wn and merged using multiple threads; the merged data is uploaded to the distributed storage system, and the metadata generated by the merge operation is stored in the relational database;
1.5.5 The metadata Di of the i-th data file generated in the merge is written to the relational database, where Di = {f1, f2, ..., fn} and fi is a data feature in the metadata set;
1.5.6 When a client sends a request to read a small data file from the message queue, the relational database is accessed to obtain the metadata Di of the data file;
1.5.7 The large data file in the distributed file system that contains the small file's data is accessed according to the feature fields in Di;
1.5.8 The corresponding small data file is parsed out of the large data file according to the feature fields;
1.5.9 A field identifier F is added to each data file to record the access frequency of the data file;
1.5.10 Frequently accessed data files are cached in a hot block on hard disk; the field added to the data file is used to judge whether the file is on the hard disk of the file cache server, and if so the data is read directly from the data file cache server;
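The merge, lookup and hot-cache path of steps 1.5.4 to 1.5.10 can be sketched as follows. This is a single-threaded illustration under stated assumptions: in-memory dictionaries stand in for the distributed file system, the relational database and the cache server; the metadata fields (merged file name, offset, length, hit count) and the hot threshold are hypothetical choices for the feature set Di and the field F, and the multithreaded merge of step 1.5.4 is omitted.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Meta:
    """Metadata D_i for one small file (field choice is an assumption)."""
    merged_name: str   # large file that holds the small file
    offset: int        # byte offset inside the large file
    length: int        # byte length of the small file
    hits: int = 0      # access-frequency field F


class SmallFileStore:
    def __init__(self, hot_threshold: int = 3):
        self.dfs: Dict[str, bytes] = {}        # stand-in for the distributed file system
        self.meta: Dict[str, Meta] = {}        # stand-in for the relational database
        self.hot_cache: Dict[str, bytes] = {}  # stand-in for the hot block on the cache server
        self.hot_threshold = hot_threshold

    def merge(self, merged_name: str, files: Dict[str, bytes]) -> None:
        """Concatenate small files into one large file and record their metadata."""
        buf = bytearray()
        for key, payload in files.items():
            self.meta[key] = Meta(merged_name, len(buf), len(payload))
            buf += payload
        self.dfs[merged_name] = bytes(buf)

    def read(self, key: str) -> bytes:
        """Look up metadata, count the access, and serve from cache when hot."""
        m = self.meta[key]
        m.hits += 1
        if key in self.hot_cache:
            return self.hot_cache[key]
        blob = self.dfs[m.merged_name]
        data = blob[m.offset:m.offset + m.length]
        if m.hits >= self.hot_threshold:
            self.hot_cache[key] = data   # promote a frequently read file
        return data
```

Keeping offsets and lengths in the metadata store means a read touches the distributed file system only once per small file, and not at all once the file is hot.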
The GIS data visualization module dynamically displays the cleaned, legitimate and safe data. The module wraps the open-source library ECharts, so a display component suited to the business can be chosen according to the data type. It provides more accurate spatial geographic information, is intuitive and richly interactive, is highly customizable and allows personalized front-end UI themes to be developed. High-risk data information and repaired data information are presented on the front-end page, so that analysis of more integrated information can be carried out from the front end.
Fig. 2 is now explained. When data is acquired into the system it enters message cache 11, which stores the data by category (14) according to the classification requirements. The data is then sent to pre-cleaning module 12, which regularizes missing data and high-risk data and writes the processing metadata to relational database 17. After pre-cleaning, the data is distributed as required to the real-time computing engine 13 and the offline computing engine, and a synchronization program synchronizes the computed results to queue server 14 and relational database 17. Finally the results are saved to distributed file system 15, and on management platform 19 the real-time data and offline data can be shown in visualization 20.
The workflow of the visualization system based on big-data pre-inspection and pre-repair in this embodiment comprises the following steps:
Step S000: In module 1, the threshold TH1 for data file size is preset according to the storage capacity and computing power of the cluster.
Step S001: Identification information is added to the files acquired from different data sources, as in operating mode 101 of Fig. 3. Message cache server 11 judges the size of each received file; if it is smaller than the threshold TH1 of S000, the field identifier KEY is added. See 201 in Fig. 4.
Step S002: The data of S001 is first filtered against blacklist 104.
Step S003: According to the data source Q = {q1, q2, ..., qn} of S001, a group of hash functions W = {W1, W2, ..., Wn} is determined, whose output range is X.
Step S004: If the data source conforms to the customized data rules, then for each qi in the data source Q = {q1, q2, ..., qn}, the n independent hash functions of W map qi to n numbers in the interval [1, M].
Step S005: Assume that the number of data packets from the remote server cluster within the monitoring period is K. On average PLRU creates a new data flag every J data packets and evicts some entries at the bottom of the blacklist.
Step S006: Assume that the size of a data file E is exactly equal to the threshold TH. The probability that the large data file E does not appear in J consecutive data files then follows a hypergeometric distribution: [formula not reproduced in this text]
Step S007: When K >> J, the probability that E is evicted is: [formula not reproduced in this text], where [definition not reproduced in this text]
Step S009: Blacklist database 104 is iteratively updated according to S007.
Step S010: The data collected in S001 is loaded into the corresponding data queues.
Step S011: In S010, a data queue is requested only when a data file sends a request. If that data queue is empty while the data cache server is not, a FIFO operation is performed; otherwise the memory of the data queue is released.
Step S012: The data in the data queues of S010 is processed by the pre-repair policy. Missing data usually appears as empty cells or as NaN (not a number), N/A or None. For a category column that may contain meaningful missing data, a new category called Missing can be created, after which the column is processed like an ordinary column; if a representative value is needed, the pre-repaired data is converted into a meaningful numerical value. See module 3 in Fig. 1 and 12 in Fig. 2.
Step S013: identification flow attacking, web crawlers and flow cheating (false flow) update iteration in S009
Data are synchronized to high-risk data alarm module.Module 4 referring to Fig.1.
Step S104: being stored in corresponding message queue cache server according to label point, until triggering merger threshold value
TH2。
Step S105: the data file for triggering TH2 is taken out from sliding window Wn, carries out merger to Wn using multithreading
Operation.
Step S106: the data after merging are uploaded into distributed memory system 15, while the member that merger operation is generated
Information is stored into relevant database 17 according to configuration rule 16.
Step S107: currently processed service is obtained by adding number or port numbers behind machine IP or host name
The cryptographic Hash HS={ hs1, hs2 ... ..., hsn } of device, and be the closed loop configuration in space by HS compound mapping.Such as 202 institutes of Fig. 4
Show.
Step S108: being put into set G=to be combined { g1 ... ... g2, gn } for the window data Wn taking-up of data queue,
N indicates the number of file to be combined, and gi indicates i-th of data file to be combined.
Step S109: relevant database is written in the metamessage Di of i-th of the data file generated in merging process.Wherein
Di=f1, f2 ..., and fn }, wherein fi is the data characteristics of metamessage set.
Step S110: when client sends the request of reading message queue, access relational database obtains data file
Metamessage Di.
Step S111: the big data text where distributed file system small documents data is accessed according to the feature field in Di
Part.
Step S112: corresponding small data file is parsed according to the feature field in large data files.
Step S113: field identification F, the access frequency of recording data files are added to each data file.
Step S114: high-frequency data file cache judges in hard disk heat block according to the added field to data file
Whether on the hard disk of file caching server, directly reads and read data in data file cache server
Step S115: the processed data are accumulated and displayed according to their heat (popularity), showing the geographic information merged with the data together with the real-time transactions of the data.
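Steps S105 to S112 (merging small files, recording their meta-information, and reading a small file back) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the table layout (`file_key`, `big_file`, `start`, `size`), the function names, the in-memory SQLite database standing in for the relational database, and the plain dictionary standing in for the distributed file system. The merge is single-threaded for brevity, whereas the text describes a multi-threaded merge.

```python
import sqlite3


def make_meta_db():
    """Create an in-memory relational store for merge meta-information (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE meta (file_key TEXT PRIMARY KEY, big_file TEXT, "
        "start INTEGER, size INTEGER)"
    )
    return db


def merge_small_files(files, meta_db, big_name):
    """Concatenate small files into one large blob and record where each one starts."""
    blob = bytearray()
    for file_key, payload in files.items():
        meta_db.execute(
            "INSERT INTO meta (file_key, big_file, start, size) VALUES (?, ?, ?, ?)",
            (file_key, big_name, len(blob), len(payload)),
        )
        blob.extend(payload)
    meta_db.commit()
    return bytes(blob)


def read_small_file(file_key, meta_db, storage):
    """Look up the meta-information, open the large file, and slice out the small file."""
    row = meta_db.execute(
        "SELECT big_file, start, size FROM meta WHERE file_key = ?", (file_key,)
    ).fetchone()
    if row is None:
        raise KeyError(file_key)
    big_file, start, size = row
    return storage[big_file][start:start + size]
```

In this sketch a read costs one metadata query plus one slice of the large file, which is the property the merge is meant to provide: the distributed file system only ever sees large files.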
Claims (6)
1. A pre-inspection and pre-repair visualization system based on big data, characterized in that the system comprises an intelligent data sampling module, a data cleansing warning module, a data cleansing maintenance module, a high-risk data alarm module, a data quick storage module and a GIS data dynamic loading module;
The intelligent data sampling module classifies and labels the different data sources by adding data buffer queues to the data cache server, and stores and manages the meta-information of the data; collected messages are sent to the data cache server; combining the data characteristics in the message fields and taking into account the diversity of data file sizes, a critical value T for a data file is set according to the BLOCK size of the distributed file system; the cache server judges the size of each file and adds a data identifier, i.e. a KEY, to every data file smaller than T; when the size of a data file is greater than the given T, the file is sent directly to the distributed file system after data processing is complete; the data are stored in the corresponding data queue according to their labels until the merge threshold TH2 is triggered;
The data cleansing warning module parses the data sources, relies on algorithms to identify abnormal traffic and data, and derives corresponding filtering rules, so that the filtered data can be used downstream;
The data cleansing maintenance module repairs gaps in the data using a data dictionary and rejects invalid data;
The high-risk data alarm module establishes a blacklist, uses the PLRU algorithm to dynamically load and update the blacklist data, and reduces the false-positive rate of the PLRU algorithm by establishing a whitelist;
The data quick storage module stores the labelled data output by the data cleansing modules, greatly relieving the system bottleneck caused by the frequent I/O operations of small files on the distributed file system, and uses a consistent hashing algorithm to achieve good cluster load balancing;
The GIS data visualization module dynamically displays the cleaned, legitimate and safe data; the module encapsulates the open-source library ECharts, so that a component suited to the business can be chosen according to the data type; it provides more accurate spatial geographic information, is intuitive, richly interactive and highly customizable, supports the development of a customized front-end UI theme, and presents high-risk data information and repaired data information on the front-end page, so that more comprehensive analysis can be carried out from the front end.
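The size-based routing performed by the intelligent data sampling module in claim 1 can be sketched in Python as follows. The block size, the choice of T as half a block, the reading of TH2 as a count of queued files, the handling of a file exactly equal to T as a small file, and all class and attribute names are assumptions made for illustration; the claim does not fix any of them.

```python
from collections import defaultdict

BLOCK_SIZE = 128 * 1024 * 1024   # assumed BLOCK size of the distributed file system
T = BLOCK_SIZE // 2              # hypothetical critical value derived from BLOCK_SIZE
TH2 = 3                          # hypothetical merge threshold: queued files per label


class Sampler:
    """Route files by size: large ones go straight to storage, small ones are queued."""

    def __init__(self, threshold=T, merge_trigger=TH2):
        self.threshold = threshold
        self.merge_trigger = merge_trigger
        self.queues = defaultdict(list)   # label -> queued (key, name, size) tuples
        self.direct_uploads = []          # names sent directly to the file system
        self.next_key = 0

    def submit(self, label, name, size):
        """Return a batch to merge when the queue for `label` reaches TH2, else None."""
        if size > self.threshold:
            self.direct_uploads.append(name)
            return None
        # Small file: attach a data identifier (KEY) and queue it under its label.
        self.next_key += 1
        key = f"{label}-{self.next_key}"
        queue = self.queues[label]
        queue.append((key, name, size))
        if len(queue) >= self.merge_trigger:
            batch = list(queue)
            queue.clear()
            return batch
        return None
```

A caller would hand each returned batch to the merge step of the data quick storage module; files in `direct_uploads` bypass merging entirely.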
2. The dynamic visualization system for pre-inspection and pre-repair based on big data as claimed in claim 1, characterized in that the intelligent data sampling module performs the following steps:
1.1.1 data are hashed and stored using the consistent hashing algorithm of the data quick storage module;
1.1.2 meta-information management: the pre-cleaning warning module is used to identify traffic attacks, web crawlers and traffic cheating; data marked as missing are sent to the data cleansing maintenance module, and data marked as high-risk are sent to the malicious data alarm module;
1.1.3 a blacklist and whitelist database is built on a relational database, and the meta-information labelled in 1.1.2 is written into the relational database.
3. The dynamic visualization system for pre-inspection and pre-repair based on big data as claimed in claim 2, characterized in that, in the data cleansing warning module, the blacklist and whitelist database of step 1.1.3 is used to decide where the data flow, and the metadata of step 1.1.2 are merged.
4. The dynamic visualization system for pre-inspection and pre-repair based on big data as claimed in any one of claims 1 to 3, characterized in that the data cleansing maintenance module performs the following steps:
1.3.1 in the cleansing warning module, missing values appear as empty cells or are displayed as NaN (not a number), N/A or None; for a categorical column that may contain meaningful missing data, a new category called Missing can be created and the column then handled like an ordinary column;
1.3.2 in step 1.3.1, if a representative value is needed, the data to be pre-repaired are converted to a meaningful numerical value, for example the median of the business data.
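The two repair strategies of claim 4 (a `Missing` category for categorical columns and a median fill for numeric columns) can be sketched with the Python standard library. The set of markers treated as missing and the function names are assumptions for illustration; a real deployment would take them from its data dictionary.

```python
from statistics import median

# Textual markers treated as missing (assumed spellings).
MISSING_MARKERS = {"", "NaN", "NAN", "N/A", "None"}


def is_missing(value):
    """True for None, float NaN, and the textual markers listed above."""
    if value is None:
        return True
    if isinstance(value, float) and value != value:   # NaN is not equal to itself
        return True
    return isinstance(value, str) and value.strip() in MISSING_MARKERS


def repair_categorical(column):
    """Step 1.3.1: replace missing entries with an explicit 'Missing' category."""
    return ["Missing" if is_missing(v) else v for v in column]


def repair_numeric(column):
    """Step 1.3.2: replace missing entries with the median of the present values."""
    present = [float(v) for v in column if not is_missing(v)]
    if not present:
        return list(column)   # nothing to derive a representative value from
    fill = median(present)
    return [fill if is_missing(v) else float(v) for v in column]
```

The median is used here because the claim names it as an example; it is less sensitive to outliers than the mean, which matters when the surrounding data have not yet been cleaned.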
5. The dynamic visualization system for pre-inspection and pre-repair based on big data as claimed in any one of claims 1 to 3, characterized in that the high-risk data alarm module uses the PLRU algorithm with the following steps:
1.4.1 a group of hash functions W = {W1, W2, ..., Wn} is used, the output range of the hash functions being X; each qi in the data source Q = {q1, q2, ..., qn} is mapped by the n independent hash functions of W to n numbers in [1, M];
1.4.2 if a is an input object that has already been seen when the PLRU algorithm runs, it maps to the n numbers; otherwise a is determined to be a new object; within one detection period, the data flow sizes obey a Pareto distribution with parameter 1 and distortion (shape) parameter α;
1.4.3 assume that the number of data packets of the remote server cluster within the measurement period is K; then on average PLRU creates one new data flag every J data packets and evicts some data at the bottom of the blacklist;
1.4.4 assume that the size of a large flow E is exactly equal to the threshold TH; then the probability that the large data file E does not appear in J consecutive data files obeys a hypergeometric distribution (formula not reproduced in this text); when K >> J, the probability that E is evicted is given by a second expression (also not reproduced in this text);
1.4.5 the blacklist database is updated according to step 1.4.3 and step 1.4.4;
1.4.6 because the PLRU algorithm produces false positives, false-positive samples that have already been found are placed on a whitelist to prevent them from being reported again.
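The blacklist maintenance of claim 5 can be illustrated with a short Python sketch. The claim does not spell out the PLRU algorithm in enough detail to reproduce it, so the class below is a plain LRU-ordered blacklist with a whitelist override, swapped in as a stand-in and not the patented method. The two probability helpers are likewise only one reading of step 1.4.4: they assume that J packets are drawn without replacement from K packets, TH of which belong to flow E. All names are hypothetical.

```python
from collections import OrderedDict
from math import comb


class LruBlacklist:
    """Bounded blacklist ordered by recency; whitelisted sources are never listed."""

    def __init__(self, capacity, whitelist=()):
        self.capacity = capacity
        self.whitelist = set(whitelist)
        self.entries = OrderedDict()   # least recently reported source first

    def report(self, source):
        """Record a high-risk source; return the source evicted from the bottom, if any."""
        if source in self.whitelist:
            return None
        if source in self.entries:
            self.entries.move_to_end(source)   # refresh recency
            return None
        self.entries[source] = True
        if len(self.entries) > self.capacity:
            evicted, _ = self.entries.popitem(last=False)
            return evicted
        return None

    def is_blocked(self, source):
        return source in self.entries and source not in self.whitelist

    def mark_false_positive(self, source):
        """Step 1.4.6: move a known false positive to the whitelist."""
        self.whitelist.add(source)
        self.entries.pop(source, None)


def prob_absent(K, TH, J):
    """Hypergeometric probability that none of E's TH packets fall among J of K packets."""
    return comb(K - TH, J) / comb(K, J)


def prob_absent_approx(K, TH, J):
    """Approximation of prob_absent for K >> J: each packet misses E with 1 - TH/K."""
    return (1 - TH / K) ** J
```

Under these assumptions the approximation and the exact value agree closely once K is much larger than J, which is the regime step 1.4.4 refers to.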
6. The dynamic visualization system for pre-inspection and pre-repair based on big data as claimed in any one of claims 1 to 3, characterized in that the data quick storage module performs the following steps:
1.5.1 a relational database is introduced to store the metadata generated while small data files are merged;
1.5.2 the hash values HS = {hs1, hs2, ..., hsn} of the currently processing servers are obtained by appending a number or port number to the machine IP or host name, and the set HS is mapped onto a closed ring in the hash space;
1.5.3 the window data of the message queue cache server are taken out and put into the set to be merged G = {g1, g2, ..., gn}, where n is the number of files to be merged and gi is the i-th data file to be merged; operation 1.5.4 is performed on the data files that satisfy the trigger condition of the intelligent data sampling module;
1.5.4 the data files that triggered TH2 are taken out of the sliding window Wn and a merge operation is performed on Wn using multiple threads; the merged data are uploaded to the distributed storage system, and the meta-information generated by the merge operation is stored in the relational database;
1.5.5 the meta-information Di of the i-th data file generated during merging is written to the relational database, where Di = {f1, f2, ..., fn} and fi is a data feature in the meta-information set;
1.5.6 when a client sends a request to read the small-data-file message queue, the relational database is accessed to obtain the meta-information Di of the data file;
1.5.7 the large data file in the distributed file system that contains the small file's data is accessed according to the feature fields in Di;
1.5.8 the corresponding small data file is parsed out of the large data file according to the feature fields;
1.5.9 a field identifier F is added to each data file to record the access frequency of the data file;
1.5.10 high-frequency data files are cached in the hard-disk hot block; the field added to the data file is used to judge whether the file is on the hard disk of the file cache server, and if so the data are read directly from the data file cache server.
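Step 1.5.2 describes a consistent-hashing ring in which each server contributes several positions derived from its IP or host name plus an appended number. A minimal Python sketch follows; the use of MD5 truncated to 64 bits, the `#i` suffix format, the default of 100 positions per server, and the class and method names are all assumptions for illustration.

```python
import bisect
import hashlib


def _hash(text):
    """Map a string to a 64-bit position on the ring (MD5 is an arbitrary choice here)."""
    return int.from_bytes(hashlib.md5(text.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Consistent-hashing ring with `replicas` virtual positions per server."""

    def __init__(self, servers, replicas=100):
        self.replicas = replicas
        self._points = []   # sorted ring positions
        self._owner = {}    # ring position -> server
        for server in servers:
            self.add(server)

    def add(self, server):
        for i in range(self.replicas):
            point = _hash(f"{server}#{i}")   # host name or IP plus an appended number
            if point in self._owner:
                continue                     # skip the (very unlikely) collision
            self._owner[point] = server
            bisect.insort(self._points, point)

    def remove(self, server):
        for i in range(self.replicas):
            point = _hash(f"{server}#{i}")
            if self._owner.get(point) == server:
                del self._owner[point]
                self._points.remove(point)

    def locate(self, key):
        """Return the server owning the first ring position clockwise from the key."""
        if not self._points:
            raise LookupError("ring is empty")
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]
```

The point of the closed ring is that removing or adding one server only remaps the keys adjacent to that server's positions; keys owned by the other servers stay where they are, which keeps the cluster load balanced without a full reshuffle.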
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811322934.1A CN109460393B (en) | 2018-11-08 | 2018-11-08 | Big data-based visual system for pre-inspection and pre-repair |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109460393A true CN109460393A (en) | 2019-03-12 |
CN109460393B CN109460393B (en) | 2022-04-08 |
Family
ID=65609667
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811322934.1A Active CN109460393B (en) | 2018-11-08 | 2018-11-08 | Big data-based visual system for pre-inspection and pre-repair |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109460393B (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||