CN109460393B - Big data-based visual system for pre-inspection and pre-repair - Google Patents


Info

Publication number
CN109460393B
CN109460393B (application CN201811322934.1A; published as CN201811322934A, CN109460393B)
Authority
CN
China
Prior art keywords
data
module
file
repair
cleaning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811322934.1A
Other languages
Chinese (zh)
Other versions
CN109460393A (en)
Inventor
郭淑琴
贾翼
任宏亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT
Priority to CN201811322934.1A
Publication of CN109460393A
Application granted
Publication of CN109460393B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A big data-based pre-inspection and pre-repair visualization system comprises an intelligent data acquisition module, a data cleaning early warning module, a data cleaning and repair module, a high-risk data alarm module, a data rapid storage module, and a GIS data dynamic loading module. The intelligent data acquisition module classifies incoming data intelligently to raise the cleaning efficiency of data files. An early warning strategy marks high-risk data against a blacklist, and the blacklist is iteratively updated with a PLRU algorithm, which greatly reduces the system's false-alarm rate. A pre-repair strategy repairs incomplete data, greatly improving data utilization. The data rapid storage module stores validated data quickly, improving both the real-time loading rate of visualized data and the loading rate of historical data. Finally, the pre-inspected and pre-repaired data stream is displayed on a dynamic GIS map, so that managers can perform risk-control scheduling and system optimization more directly.

Description

Big data-based visual system for pre-inspection and pre-repair
Technical Field
The invention relates to the field of data processing and data storage, in particular to a visualization system based on big data pre-inspection and pre-repair.
Background
With the development of high and new technology, big data has become an important instrument of national development. Promoting the development and application of big data will create new modes of social governance based on precise management and multi-party cooperation, establish a stable, safe, and efficient economic operating mechanism, build a people-oriented public service system, open a new innovation-driven landscape of mass entrepreneurship and innovation, and cultivate a high-end, intelligent, and prosperous industrial ecology.
With the advent of the DT (data technology) era, far richer data can be collected than ever before. An IDC report projects that by 2020 the total volume of global data will exceed 40 ZB (about 40 trillion GB), 22 times the 2011 figure. How to base decisions, analysis, prediction, and strategic development on high-value information extracted from such explosively growing data has become a new research focus.
After a large volume of raw data is collected by the acquisition system, it must be integrated and computed before it can reveal business patterns and latent information; only then is the value of big data realized for empowering business and creating value. Facing massive data and complex computation, the data computing layer comprises two major systems: the data storage platform and the computing platform. Data mining and data warehousing and computing technologies develop in tandem: without data infrastructure and distributed parallel computing there would be no deep learning and no remarkable feats such as AlphaGo, and the growth of cloud computing platforms allows massive, high-velocity, highly variable, multi-terminal structured and unstructured data to be stored and computed efficiently.
Data quality is the foundation and precondition of valid, accurate data-analysis conclusions, so guaranteeing data quality and availability is an indispensable step in data-warehouse construction. Data has become a key production factor, and the value of its application should be maximized across businesses such as search, recommendation, advertising, finance, credit, insurance, entertainment, and logistics. Supplied to merchants, data can guide their operations and provide diversified, affordable data empowerment; it enables a better search experience, more accurate personalized recommendation, optimized shopping, more precise advertising, and more inclusive financial services. Supplied to employees, data supports data-driven operation and decision-making.
the conventional general big data processing platform lacks a precleaning strategy for data source access, particularly enables a large amount of missing, invalid, high-risk and repeated missing, invalid and high-risk data to enter data analysis, seriously influences the result of the data analysis and the accuracy of a prediction and regression model.
The distributed file system supports the storage of large-scale data sets by virtue of high fault tolerance, scalability, and low-cost storage, but it is inefficient at receiving and storing massive, highly concurrent, continuous streams of small data files: each insert, lookup, delete, or update operation triggers heavy IO exchange with the distributed file system, greatly reducing its performance.
Moreover, current data-visualization products are mainly commercial solutions. Although they satisfy many common customer needs, their fixed designs and heavyweight data models cannot offer more personalized services; customization costs are too high and cannot effectively match each specific business, while developing a solution in-house for one's own business characteristics takes too long and costs too much.
Current big data visualization applications usually display only the results of processing and analysis, lacking the necessary early warning prompts and alarm indications. Decision makers must therefore rely on industry experience to optimize, and as personnel turn over, this work becomes unsustainable, a problem that urgently needs solving.
Disclosure of Invention
In order to overcome the defects that most potentially valuable data are filtered out during big data acquisition, that frequent IO interaction with the distributed file system causes performance bottlenecks, and that comprehensive information on the visual interface is hard to customize, the invention provides a dynamic visualization system based on big data pre-inspection and pre-repair. Cleaning and correction of the data are completed by a pre-inspection and pre-repair strategy; after structural transformation, the data have structured or semi-structured characteristics and can be conveniently loaded and used by a relational database, which raises the utilization of source data and the safety and stability of the system. The blacklist database is rapidly and iteratively updated by a PLRU algorithm to improve storage efficiency, data-filtering efficiency, system safety and stability, and the robustness of the whole system. A consistent hashing algorithm distributes data-file queues evenly over the servers of the cluster, resolving data skew and achieving good cluster load balance. Meta-information of the data-file storage strategy is cached in a relational database, relieving the pressure of NameNode metadata storage in a distributed file system such as HDFS. High-frequency data files are cached on hard disk; an additional field of each data file indicates whether the file is on the hard disk of the file cache server, so the data can be read directly from the cache server, reducing query and update interaction with the distributed file system, improving access speed, and greatly improving the system's overall read-write performance. The system is especially fast in storing and answering queries over massive numbers of small data files.
In order to achieve the purpose, the invention adopts the technical scheme that:
a dynamic visualization system based on big data pre-inspection and pre-repair comprises an intelligent data acquisition module, a data cleaning early warning module, a data cleaning and repairing module, a high-risk data warning module, a data rapid storage module and a GIS data dynamic loading module;
The intelligent data acquisition module classifies, marks, and stores data and manages its meta-information by means of a data cache server with data cache queues. Collected information is sent to the data cache server. Considering the diversity of data-file sizes together with the data characteristics of the cache server's domain, a critical value T for data-file size is set according to the BLOCK size of the distributed file system. The cache server judges the file size: data files smaller than T receive a data identifier (KEY), while data files larger than the given T are sent directly to the distributed file system once data processing is finished. Marked data are stored in the corresponding data queue until the merge threshold TH2 is triggered.
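A minimal sketch of the size-threshold routing just described, assuming illustrative values for the critical value T and the queue layout (the KEY format, queue structure, and function names are not from the patent):

```python
import uuid

# T would in practice be derived from the DFS BLOCK size; the value and the
# per-category queue layout here are illustrative assumptions.
T = 128 * 1024 * 1024           # small-file critical value, in bytes
queues = {}                     # per-category data queues on the cache server

def send_to_dfs(data: bytes) -> str:
    # Stand-in for the real distributed-file-system client.
    return f"dfs://block/{len(data)}"

def route(data: bytes, category: str, threshold: int = T):
    """Files smaller than the critical value get a data identifier (KEY)
    and wait in a queue; larger files go straight to the DFS."""
    if len(data) < threshold:
        key = f"{category}:{uuid.uuid4().hex}"
        queues.setdefault(category, []).append((key, data))
        return ("queued", key)
    return ("dfs", send_to_dfs(data))
```

In a full implementation the queued entries would later be drained in batches once the merge threshold TH2 fires.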
The data cleaning early warning module analyzes the data sources, identifies abnormal traffic and data algorithmically, and induces the corresponding filtering rules for filtering and downstream use.
The data cleaning and repair module applies a data dictionary to correct missing items and eliminate invalid data.
The high-risk data alarm module maintains a blacklist and uses a PLRU algorithm to dynamically load and update the blacklist data, while a whitelist is maintained to reduce the false-alarm rate of the PLRU algorithm.
The data fast storage module stores the identified data cleaned by the data processing module, greatly relieving the system bottleneck caused by frequent small-file IO against the distributed file system, and uses a consistent hashing algorithm to achieve good cluster load balance.
the GIS data visualization module is used for dynamically displaying cleaned legal and safe data, encapsulates open source databases EChats, can select modules suitable for the business according to different data types, provides more accurate space geographic information, is visual and rich in interaction, can be customized in a highly personalized manner, develops and completes personalized theme customization of a front-end UI, displays high-risk data information and maintenance data information on a front-end page, and can analyze more comprehensive information from the front end.
Further, the intelligent data acquisition module performs the following steps:
1.1.1 hash and store the data with the consistent hashing algorithm of the data fast storage module;
1.1.2 manage the meta-information: use the pre-cleaning early warning module to identify traffic attacks, web crawlers, and traffic cheating; send unmarked data to the data cleaning and repair module and marked high-risk data to the malicious-data alarm module;
1.1.3 build the black/white-list database on the relational database and write the meta-information marked in 1.1.2 into it.
In the data cleaning early warning module, the flow direction of data is decided with the black/white-list database of step 1.1.3, and the metadata of step 1.1.2 are merged.
The data cleaning and repair module performs the following steps:
1.3.1 missing data are usually represented as empty cells or shown as NaN (not a number), N/A, or None; for categorical columns that may contain meaningful missing data, a new category called "Missing" can be created and then treated like a normal column;
1.3.2 following step 1.3.1, if typical values are needed, the pre-repaired data are converted into meaningful values, for example by taking the median of the traffic data.
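The missing-item handling of steps 1.3.1 and 1.3.2 can be sketched as follows; the missing-token set and function names are illustrative assumptions, not from the patent:

```python
from statistics import median

# Tokens treated as missing, per the cleaning rules above:
# empty cell, NaN (not a number), N/A, None.
MISSING = {None, "", "NaN", "N/A", "None"}

def repair_categorical(column):
    """Map missing entries of a categorical column to a new 'Missing'
    category so the column can then be treated like a normal one."""
    return ["Missing" if v in MISSING else v for v in column]

def repair_numeric(column):
    """Replace missing entries of a numeric column with a typical value,
    here the median of the observed values (e.g. median traffic)."""
    typical = median(v for v in column if v not in MISSING)
    return [typical if v in MISSING else v for v in column]
```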
In the high-risk data alarm module, the PLRU algorithm proceeds as follows:
1.4.1 a set of hash functions W = {w1, w2, ..., wn} is used, each with output domain X; for each qi of the data sources Q = {q1, q2, ..., qn}, n numbers in [1, M] are obtained under the n independent hash-function mappings of W;
1.4.2 if an input object a maps to n already-recorded numbers when the PLRU algorithm runs, it is recognized as seen before; otherwise a is judged a new object. Within a detection period, the sizes of the data streams are taken to follow a Pareto distribution with scale parameter 1 and shape parameter alpha;
1.4.3 assuming the remote server cluster receives K data packets within the measurement period, the PLRU on average creates a new data identifier every J packets and evicts an entry from the bottom of the blacklist;
1.4.4 assuming the size of a large stream E is exactly equal to the threshold TH, the probability that no data file of E appears in J consecutive data files follows a hypergeometric distribution:
P0 = C(K - TH, J) / C(K, J), where C(.,.) denotes the binomial coefficient.
When K >> J, the probability that E is removed is approximately
P(E removed) ≈ (1 - p)^J,
wherein p = TH / K.
1.4.5 update the blacklist database according to steps 1.4.3 and 1.4.4;
1.4.6 because the PLRU algorithm produces false alarms, false-alarm samples are suppressed by maintaining a whitelist.
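A toy sketch of the blacklist-plus-whitelist mechanism of steps 1.4.1 to 1.4.6; it keeps only the recency and bottom-eviction behavior and the whitelist veto, not the hash-based PLRU counters, and all names and the capacity are illustrative assumptions:

```python
from collections import OrderedDict

class Blacklist:
    """Capacity-bounded blacklist in the PLRU spirit: newly flagged
    sources enter at the top, the least-recently flagged entry is
    evicted from the bottom, and a whitelist suppresses false alarms."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()     # source -> hit count, recency-ordered
        self.whitelist = set()

    def flag(self, source):
        if source in self.whitelist:     # known-good source: never blacklisted
            return False
        if source in self.entries:
            self.entries.move_to_end(source)          # refresh recency
            self.entries[source] += 1
        else:
            if len(self.entries) >= self.capacity:
                self.entries.popitem(last=False)      # evict bottom entry
            self.entries[source] = 1
        return True

    def is_blocked(self, source):
        return source in self.entries
```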
The data fast storage module performs the following steps:
1.5.1 introduce a relational database to store the metadata generated while merging small data files;
1.5.2 obtain each processing server's hash value by appending a number or port number to its machine IP or host name, giving HS = {hs1, hs2, ..., hsn}, and map the HS set onto a closed ring in hash space;
1.5.3 take the window data out of the message-queue cache server and put it into the set to be merged G = {g1, g2, ..., gn}, where n is the number of files to be merged and gi is the i-th data file to be merged; apply step 1.5.4 to the data files that meet the trigger condition of the intelligent data acquisition module;
1.5.4 take the data files that triggered TH2 out of the sliding window Wn, merge Wn with multiple threads, upload the merged data to the distributed storage system, and store the meta-information generated by the merge in the relational database;
1.5.5 write the meta-information Di = {f1, f2, ..., fn} of the i-th data file generated during merging into the relational database, where fi is a data feature of the meta-information set;
1.5.6 when a client requests a read of the small-data-file message queue, access the relational database to obtain the meta-information Di of the data file;
1.5.7 access, via the feature fields in Di, the big data file in which the small-file data reside in the distributed file system;
1.5.8 parse the corresponding small data file out of the big data file according to the feature fields;
1.5.9 add a field identifier F to each data file and record its access frequency;
1.5.10 cache high-frequency data files on the hard disk; judge from the additional field of a data file whether it is on the hard disk of the file cache server, and if so read the data directly from the file cache server.
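The consistent hashing of step 1.5.2 can be sketched as follows; the virtual-node count, hash choice, and "IP:port" naming are illustrative assumptions:

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Any stable hash works; MD5 is used here purely for illustration.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Servers are hashed onto a closed ring (several virtual nodes each,
    named by appending a number to "IP:port"); each file key is stored on
    the first server clockwise from the key's hash."""
    def __init__(self, servers, replicas=3):
        self.ring = sorted(
            (_hash(f"{s}#{i}"), s) for s in servers for i in range(replicas)
        )
        self.points = [p for p, _ in self.ring]

    def server_for(self, key: str) -> str:
        idx = bisect.bisect(self.points, _hash(key)) % len(self.points)
        return self.ring[idx][1]
```

Because only the ring segment adjacent to a joining or leaving server moves, rebalancing touches a small fraction of the keys, which is what gives the claimed load-balance behavior.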
The invention has the following beneficial effects: it raises the utilization of source data and the safety and stability of the system; the blacklist database is rapidly and iteratively updated by the PLRU algorithm, improving storage efficiency, data-filtering efficiency, system safety and stability, and the robustness of the whole system.
Drawings
Fig. 1 shows a model diagram of a big data-based pre-inspection and pre-repair visualization system.
Fig. 2 shows a flow chart of a big data-based pre-inspection and pre-repair visualization system.
Fig. 3 shows the data pre-inspection flow of the big data-based pre-inspection and pre-repair visualization system.
Fig. 4 shows a model diagram of a data fast storage module of a big data-based pre-inspection and pre-repair visualization system.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
The operation of the system and the method of the invention are described in detail below. The embodiments described are only a part of the embodiments of the invention, not all of them; all other embodiments obtained by those skilled in the art without inventive changes or substantial optimization fall within the scope of the invention.
Referring to figs. 1-4, a dynamic visualization system based on big data pre-inspection and pre-repair comprises an intelligent data acquisition module 1, a data cleaning early warning module 2, a data cleaning and repair module 3, a high-risk data alarm module 4, a data fast storage module 5, and a GIS data visualization module 6. The intelligent data acquisition module is connected with the data cleaning early warning module; the data cleaning early warning module is connected with the data cleaning and repair module and with the high-risk data alarm module; and the data cleaning and repair module and the high-risk data alarm module are both connected with the GIS data visualization module, as shown specifically in fig. 1.
The intelligent data acquisition module classifies, marks, and stores data and manages its meta-information by means of a data cache server with data cache queues. Collected information is sent to the data cache server. Considering the diversity of data-file sizes together with the data characteristics of the cache server's domain, a critical value T for data-file size is set according to the BLOCK size of the distributed file system. The cache server judges the file size: data files smaller than T receive a data identifier (KEY), while data files larger than the given T are sent directly to the distributed file system once data processing is finished. Marked data are stored in the corresponding data queue until the merge threshold TH2 is triggered. The steps are as follows:
1.1.1 hash and store the data with the consistent hashing algorithm of the data fast storage module;
1.1.2 manage the meta-information: use the pre-cleaning early warning module to identify traffic attacks, web crawlers, and traffic cheating (fake traffic); send unmarked data to the data cleaning and repair module and marked high-risk data to the malicious-data alarm module;
1.1.3 build the black/white-list database on the relational database and write the meta-information marked in 1.1.2 into it.
the data cleaning and early warning module is used for analyzing a data source and deciding the data flow direction by utilizing the black and white list database in the step 1.1.3; merging the metadata in the step 1.1.2; abnormal flow and data are identified by means of an algorithm, and corresponding filtering rules are summarized for filtering and downstream use.
The data cleaning and repair module applies a data dictionary to correct missing items and eliminate invalid data. The steps are as follows:
1.3.1 missing data are usually represented as empty cells or shown as NaN (not a number), N/A, or None; for categorical columns that may contain meaningful missing data, a new category called "Missing" can be created and then treated like a normal column;
1.3.2 following step 1.3.1, if typical values are needed, the pre-repaired data are converted into meaningful values, for example by taking the median of the traffic data.
The high-risk data alarm module maintains a blacklist and uses a PLRU algorithm to dynamically load and update the blacklist data, while a whitelist is maintained to reduce the false-alarm rate of the PLRU algorithm.
In the high-risk data alarm module, the PLRU algorithm proceeds as follows:
1.4.1 a set of hash functions W = {w1, w2, ..., wn} is used, each with output domain X; for each qi of the data sources Q = {q1, q2, ..., qn}, n numbers in [1, M] are obtained under the n independent hash-function mappings of W;
1.4.2 if an input object a maps to n already-recorded numbers when the PLRU algorithm runs, it is recognized as seen before; otherwise a is judged a new object. Within a detection period, the sizes of the data streams are taken to follow a Pareto distribution with scale parameter 1 and shape parameter alpha;
1.4.3 assuming the remote server cluster receives K data packets within the measurement period, the PLRU on average creates a new data identifier every J packets and evicts an entry from the bottom of the blacklist;
1.4.4 assuming the size of a large stream E is exactly equal to the threshold TH, the probability that no data file of E appears in J consecutive data files follows a hypergeometric distribution:
P0 = C(K - TH, J) / C(K, J), where C(.,.) denotes the binomial coefficient.
When K >> J, the probability that E is removed is approximately
P(E removed) ≈ (1 - p)^J,
wherein p = TH / K.
1.4.5 update the blacklist database according to steps 1.4.3 and 1.4.4;
1.4.6 because the PLRU algorithm produces false alarms, false-alarm samples are suppressed by maintaining a whitelist.
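The eviction probability of step 1.4.4 can be checked numerically, assuming the standard hypergeometric no-hit form C(K - TH, J) / C(K, J) and its K >> J approximation (1 - TH/K)^J; the concrete K, TH, and J values are illustrative:

```python
from math import comb

def p_absent(K, TH, J):
    """Probability that a stream of size TH contributes none of J
    consecutive packets out of K, under the assumed hypergeometric
    form C(K - TH, J) / C(K, J)."""
    return comb(K - TH, J) / comb(K, J)

def p_absent_approx(K, TH, J):
    """K >> J approximation: (1 - p)**J with p = TH / K."""
    return (1 - TH / K) ** J

# With K = 100_000 packets, TH = 500, J = 50, the exact hypergeometric
# value and the approximation agree closely (both near e**(-J*TH/K)).
exact = p_absent(100_000, 500, 50)
approx = p_absent_approx(100_000, 500, 50)
```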
The data fast storage module stores the identified data cleaned by the data processing module, greatly relieving the system bottleneck caused by frequent small-file IO against the distributed file system, and uses a consistent hashing algorithm to achieve good cluster load balance. The steps are as follows:
1.5.1 introduce a relational database to store the metadata generated while merging small data files;
1.5.2 obtain each processing server's hash value by appending a number or port number to its machine IP or host name, giving HS = {hs1, hs2, ..., hsn}, and map the HS set onto a closed ring in hash space;
1.5.3 take the window data out of the message-queue cache server and put it into the set to be merged G = {g1, g2, ..., gn}, where n is the number of files to be merged and gi is the i-th data file to be merged; apply step 1.5.4 to the data files that meet the trigger condition of the intelligent data acquisition module;
1.5.4 take the data files that triggered TH2 out of the sliding window Wn, merge Wn with multiple threads, upload the merged data to the distributed storage system, and store the meta-information generated by the merge in the relational database;
1.5.5 write the meta-information Di = {f1, f2, ..., fn} of the i-th data file generated during merging into the relational database, where fi is a data feature of the meta-information set;
1.5.6 when a client requests a read of the small-data-file message queue, access the relational database to obtain the meta-information Di of the data file;
1.5.7 access, via the feature fields in Di, the big data file in which the small-file data reside in the distributed file system;
1.5.8 parse the corresponding small data file out of the big data file according to the feature fields;
1.5.9 add a field identifier F to each data file and record its access frequency;
1.5.10 cache high-frequency data files on the hard disk; judge from the additional field of a data file whether it is on the hard disk of the file cache server, and if so read the data directly from the file cache server.
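The hot-file disk cache of steps 1.5.9 and 1.5.10 can be sketched as follows; the promotion threshold, directory layout, and all names are illustrative assumptions:

```python
import os
import tempfile

class FileCache:
    """Hard-disk cache for high-frequency data files: each file carries an
    access counter (the additional field); once the counter reaches
    HOT_THRESHOLD the bytes are kept on the cache server's disk, and
    later reads skip the distributed file system entirely."""
    HOT_THRESHOLD = 3

    def __init__(self, cache_dir):
        self.cache_dir = cache_dir
        self.access_count = {}                   # additional field F per file

    def read(self, key, fetch_from_dfs):
        path = os.path.join(self.cache_dir, key)
        if os.path.exists(path):                 # cached: no DFS round trip
            with open(path, "rb") as f:
                return f.read()
        data = fetch_from_dfs(key)
        self.access_count[key] = self.access_count.get(key, 0) + 1
        if self.access_count[key] >= self.HOT_THRESHOLD:
            with open(path, "wb") as f:          # promote hot file to disk
                f.write(data)
        return data

cache = FileCache(tempfile.mkdtemp())
dfs_calls = []

def fetch_stub(key):                             # stand-in for a real DFS read
    dfs_calls.append(key)
    return b"payload"

reads = [cache.read("f1", fetch_stub) for _ in range(4)]
```

Here the third read promotes the file to disk, so the fourth read never touches the (stubbed) DFS.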
the GIS data visualization module is used for dynamically displaying cleaned legal and safe data, the module encapsulates open source databases EChats, and can select and provide more accurate spatial geographic information according to different data types, so that the GIS data visualization module is visual and rich in interaction, can be customized in a highly personalized manner, can develop and finish personalized theme customization of a front-end UI, displays high-risk data information and overhaul data information on a front-end page, and can analyze more comprehensive information from the front end.
Fig. 2 is now described. When data are collected into the system, they enter the message cache 11, which stores them by classification 14 according to classification requirements. The data are then sent to the pre-cleaning module 12, which regularizes missing and high-risk data and writes the processing meta-information into the relational database 17. The pre-cleaned data are distributed to the real-time computing engine or the offline computing engine 13 according to actual requirements; a synchronization program keeps the queue server 14 and the relational database 17 holding the computed results in sync. The data results are finally stored in the distributed file system 15, and both real-time and offline data can be operated on from the management platform 19 for visual display 20.
The workflow of the visualization system based on big data pre-inspection and pre-repair of the embodiment comprises the following steps:
step S000: in block 1, a threshold TH1 for the size of the data file is preset based on the self-cluster storage capacity and the computing capacity.
Step S001: and acquiring file addition identification information of different data sources, such as 101 working modes in fig. 3. The message cache server 11 determines the size of the received file, and if the size is smaller than the threshold TH1 in S000, adds a field identifier KEY. Reference is made to 201 in fig. 4.
Step S002: the data in S001 is filtered by the blacklist 104.
Step S003: determine the composition of the hash-function set W = {w1, w2, ..., wn} according to the data sources Q = {q1, q2, ..., qn} of S001; the output domain of the hash functions is X.
Step S004: if the data source conforms to the composite customized data rule, then for each qi of Q = {q1, q2, ..., qn}, n numbers in [1, M] are obtained under the n independent hash-function mappings of W.
Step S005: assuming the remote server cluster receives K data packets within the measurement period, the PLRU on average creates a new data identifier every J packets and evicts an entry from the bottom of the blacklist.
Step S006: assuming the size of a data file E is exactly equal to the threshold TH, the probability that no data file of E appears in J consecutive data files follows a hypergeometric distribution:
P0 = C(K - TH, J) / C(K, J), where C(.,.) denotes the binomial coefficient.
Step S007: when K >> J, the probability that E is removed is approximately
P(E removed) ≈ (1 - p)^J,
wherein p = TH / K.
Step S009: the blacklist database 104 is iteratively updated according to S007.
Step S010: load the data collected in S001 into the corresponding data queues.
Step S011: in S010, a data queue is applied for only when a data file issues a request; if the queue is empty while the data cache server is not, a FIFO operation is performed; otherwise the data queue releases its space.
Step S012: the data in the queues of S010 are pre-repaired by the pre-repair strategy. Missing data are usually represented as empty cells or shown as NaN (not a number), N/A, or None; for a categorical column that may contain meaningful missing data, a new category called "Missing" can be created and then processed like a normal column; if a typical value is required, the pre-repaired data are converted into a meaningful value. See module 3 of fig. 1 and 12 of fig. 2.
Step S013: identify traffic attacks, web crawlers, and traffic cheating (fake traffic), and synchronize the data iteratively updated in S009 to the high-risk data alarm module. See module 4 of fig. 1.
Step S104: store each message into the corresponding message-queue cache server according to its mark until the merge threshold TH2 is triggered.
Step S105: take the data files that triggered TH2 out of the sliding window Wn and merge Wn with multiple threads.
Step S106: upload the merged data to the distributed storage system 15 and store the meta-information produced by the merge into the relational database 17 according to the configuration rules 16.
Step S107: obtain the hash value of the current processing server by appending a number or port number to the machine IP or host name, giving HS = {hs1, hs2, ..., hsn}, and map the HS set onto a closed ring in hash space, as shown at 202 in fig. 4.
Step S108: take the window data Wn out of the data queue and put it into the set to be merged G = {g1, g2, ..., gn}, where n is the number of files to be merged and gi is the i-th data file to be merged.
Step S109: write the meta-information Di = {f1, f2, ..., fn} of the i-th data file generated during merging into the relational database, where fi is a data feature of the meta-information set.
Step S110: and when the client sends a request for reading the message queue, accessing the relational database to obtain the meta information Di of the data file.
Step S111: and accessing a big data file in which the distributed file system small file data is located according to the characteristic field in the Di.
Step S112: and analyzing the corresponding small data file according to the characteristic field in the large data file.
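The small-file merge and read-back path of steps S108–S112 can be sketched as follows, with a per-file offset and length playing the role of the characteristic fields in the meta information Di (all names and the exact meta-information layout are illustrative assumptions):

```python
import io

def merge_small_files(files: dict) -> tuple:
    """Merge small files into one big blob; return (blob, meta), where meta
    records each file's offset and length as its characteristic fields."""
    blob, meta, offset = io.BytesIO(), {}, 0
    for name, data in files.items():
        blob.write(data)
        meta[name] = {"offset": offset, "length": len(data)}
        offset += len(data)
    return blob.getvalue(), meta

def read_small_file(blob: bytes, meta: dict, name: str) -> bytes:
    """Parse one small file back out of the big file using its meta information."""
    info = meta[name]
    return blob[info["offset"]: info["offset"] + info["length"]]
```

In the system described here, `meta` would live in the relational database and `blob` in the distributed file system, so a client read first fetches `meta[name]` and then slices only that region of the big file.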
Step S113: a field identifier F is added to each data file to record its access frequency.
Step S114: high-frequency data files are cached in a hot block of the hard disk; whether a data file is on the hard disk of the file cache server is judged from its additional field, and if so the data is read directly from the data file cache server.
Step S115: the processed data is displayed according to accumulated heat, together with the geographic information fused with the data and a real-time transaction display of the data.

Claims (5)

1. A big data-based pre-inspection and pre-repair visualization system, characterized in that: the system comprises an intelligent data acquisition module, a data cleaning early-warning module, a data cleaning maintenance module, a high-risk data alarm module, a data fast storage module and a GIS data dynamic loading module;
the intelligent data acquisition module is used for classifying, marking, storing and managing the meta information of the data, using a data cache queue together with a data cache server; the collected information is sent to the data cache server, and a critical value T for data files is set according to the BLOCK size of the distributed file system; the cache server judges the file size, adds a data identifier (KEY) to data files smaller than T, and sends files larger than the given T directly to the distributed file system after data processing is finished; the data is stored into the corresponding data queue according to its mark until the merging threshold TH2 is triggered;
the data cleaning early-warning module is used for analyzing the data source, identifying abnormal traffic and data by means of an algorithm, and deriving corresponding filtering rules for filtering and downstream use;
the data cleaning maintenance module is used for correcting missing items and eliminating invalid data by applying a digital dictionary;
the high-risk data alarm module is used for dynamically loading and updating blacklist data with the PLRU algorithm by establishing a blacklist, and for reducing the false-alarm rate of the PLRU algorithm by establishing a white list;
in the high-risk data alarm module, a PLRU algorithm is adopted, and the steps are as follows:
1.4.1 a set of hash functions W = {W1, W2, …, Wn} is used, whose output domain is X; for each qi of the data source Q = {q1, q2, …, qn}, n numbers between [1, M] are obtained under the n independent hash function mappings of W;
1.4.2 if a is a known input object, its n numbers are mapped when the PLRU algorithm is carried out; otherwise a is determined to be a new object; within a period of detection time, the size of the data stream is determined to obey a Pareto distribution with parameter 1 and skew parameter α;
1.4.3 assuming that the remote server cluster receives K data packets within the measurement time, the PLRU on average establishes a new data identifier every J data packets and eliminates certain data at the bottom of the blacklist;
1.4.4 assuming that the size of a certain large stream E is exactly equal to the threshold TH, the probability that no data packet of the large data file E appears in J consecutive data files obeys a hypergeometric distribution:

P = C(K − TH, J) / C(K, J)

when K >> J, the probability that E is removed is approximately:

P ≈ (1 − TH/K)^J

wherein C(n, k) = n! / (k!(n − k)!) denotes the number of combinations;
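The eviction probability of step 1.4.4 and its large-K approximation can be checked numerically (a sketch; the formulas used here are a reconstruction from the surrounding text, with C denoting the binomial coefficient):

```python
from math import comb

def eviction_prob_exact(K: int, TH: int, J: int) -> float:
    """Hypergeometric probability that no packet of flow E (size TH)
    appears among J consecutive packets out of K total packets."""
    return comb(K - TH, J) / comb(K, J)

def eviction_prob_approx(K: int, TH: int, J: int) -> float:
    """Approximation valid when K >> J: (1 - TH/K)^J."""
    return (1 - TH / K) ** J

p_exact = eviction_prob_exact(100_000, 500, 50)
p_approx = eviction_prob_approx(100_000, 500, 50)
```

With K = 100000, TH = 500 and J = 50 the two values agree to several decimal places, illustrating why the simpler closed form can be used when K dominates J.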
1.4.5 updating the blacklist database according to step 1.4.3 and step 1.4.4;
1.4.6 because the PLRU algorithm produces false alarms, discovered false-alarm samples are prevented from alarming again by establishing a white list;
the data fast storage module is used for storing the identification data cleaned by the data processing module by adopting a consistent hash algorithm;
the GIS data visualization module is used for dynamically displaying the cleaned legal and safe data; the module encapsulates the open-source chart library ECharts, selects the sub-module suitable for the business according to the data type, displays high-risk data information and maintenance data information on a front-end page, and performs comprehensive information analysis from the front end.
2. The big data-based pre-inspection and pre-repair visualization system as recited in claim 1, wherein: the intelligent data acquisition module comprises the following steps:
1.1.1, hashing and storing the data by using a consistency hash algorithm of a data quick storage module;
1.1.2 meta information management: the pre-cleaning early-warning module is used to identify traffic attacks, web crawlers and traffic cheating; unmarked data is sent to the data cleaning maintenance module, and marked high-risk data is sent to the malicious data alarm module;
1.1.3, constructing a black and white list database by using the relational database, and writing the meta information marked by 1.1.2 into the relational database.
3. The big data-based pre-inspection and pre-repair visualization system as claimed in claim 2, wherein: in the data cleaning early-warning module, the flow direction of data is decided using the black and white list database of step 1.1.3, and the merging of metadata in step 1.1.2 is performed.
4. A big data based pre-inspection and pre-repair visualization system as claimed in any one of claims 1 to 3, wherein: the data cleaning and overhauling module comprises the following steps:
1.3.1 missing data identified in the cleaning early-warning module is represented as empty cells or displayed as NaN, N/A or None; for classification columns that may contain meaningful missing data, a new class called Missing is created and then treated like a regular column;
1.3.2 in step 1.3.1, if typical values are needed, the pre-modified data is converted into meaningful values, such as taking the median of the traffic data.
5. A big data based pre-inspection and pre-repair visualization system as claimed in any one of claims 1 to 3, wherein: the data rapid storage module comprises the following steps:
1.5.1 introducing a relational database for storing metadata generated in the merging process of the small data files;
1.5.2 the hash value set HS = {HS1, HS2, …, HSn} of the current processing server is obtained by appending a number or port number to the machine IP or host name, and the HS set is mapped into a spatial closed-ring structure;
1.5.3 the window data of the message queue cache server is taken out and put into the set of files to be merged G = {g1, g2, …, gn}, where n is the number of files to be merged and gi is the i-th data file to be merged; the operation of 1.5.4 is performed on data files meeting the triggering condition of the intelligent data acquisition module;
1.5.4, taking out the data file triggering TH2 from the sliding window Wn, merging the Wn by adopting multiple threads, uploading the merged data to a distributed storage system, and storing the meta information generated by merging operation into a relational database;
1.5.5 the meta information Di of the i-th data file generated during merging is written into the relational database, where Di = {f1, f2, …, fn} and fi is one data characteristic of the meta information set;
1.5.6, when the client sends a request for reading the small data file message queue, accessing the relational database to obtain the meta information Di of the data file;
1.5.7, accessing a big data file where the distributed file system small file data is located according to the characteristic field in the Di;
1.5.8 parsing out the corresponding small data file according to the characteristic field in the large data file;
1.5.9 adding a field identifier F to each data file and recording the access frequency of the data files;
1.5.10 high-frequency data files are cached in a hot block of the hard disk; whether a data file is on the hard disk of the file cache server is judged from its additional field, and if so the data is read directly from the data file cache server.
CN201811322934.1A 2018-11-08 2018-11-08 Big data-based visual system for pre-inspection and pre-repair Active CN109460393B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811322934.1A CN109460393B (en) 2018-11-08 2018-11-08 Big data-based visual system for pre-inspection and pre-repair

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811322934.1A CN109460393B (en) 2018-11-08 2018-11-08 Big data-based visual system for pre-inspection and pre-repair

Publications (2)

Publication Number Publication Date
CN109460393A CN109460393A (en) 2019-03-12
CN109460393B true CN109460393B (en) 2022-04-08

Family

ID=65609667

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811322934.1A Active CN109460393B (en) 2018-11-08 2018-11-08 Big data-based visual system for pre-inspection and pre-repair

Country Status (1)

Country Link
CN (1) CN109460393B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110502571B (en) * 2019-08-29 2020-05-08 智洋创新科技股份有限公司 Method for identifying visible alarm high-power-generation line segment of power transmission line channel
CN111090646B (en) * 2019-10-21 2023-07-28 中国科学院信息工程研究所 Electromagnetic data processing platform
CN111586608A (en) * 2020-04-15 2020-08-25 贵州电网有限责任公司 Intelligent health service system of power supply vehicle and data transmission method thereof
CN113448946B (en) * 2021-07-05 2024-01-12 北京星辰天合科技股份有限公司 Data migration method and device and electronic equipment

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9665716B2 (en) * 2014-12-23 2017-05-30 Mcafee, Inc. Discovery of malicious strings
US10353870B2 (en) * 2016-02-17 2019-07-16 Netapp Inc. Tracking structure for data replication synchronization
CN106484855A (en) * 2016-09-30 2017-03-08 广州特道信息科技有限公司 A kind of big data concerning taxes intelligence analysis system
CN107273409B (en) * 2017-05-03 2020-12-15 广州赫炎大数据科技有限公司 Network data acquisition, storage and processing method and system
CN108228830A (en) * 2018-01-03 2018-06-29 广东工业大学 A kind of data processing system

Also Published As

Publication number Publication date
CN109460393A (en) 2019-03-12


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant