CN110543464B - Big data platform applied to intelligent park and operation method - Google Patents


Info

Publication number
CN110543464B
Authority
CN
China
Prior art keywords: data, module, unit, file, platform
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910774850.XA
Other languages
Chinese (zh)
Other versions
CN110543464A (en)
Inventor
任菁倩
杨嘉欣
杜莎
Current Assignee
Guangdong Dingyi Interconnection Technology Co ltd
Original Assignee
Guangdong Dingyi Interconnection Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Guangdong Dingyi Interconnection Technology Co ltd filed Critical Guangdong Dingyi Interconnection Technology Co ltd
Publication of CN110543464A publication Critical patent/CN110543464A/en
Application granted granted Critical
Publication of CN110543464B publication Critical patent/CN110543464B/en
Legal status: Active



Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/18: File system types
    • G06F16/182: Distributed file systems
    • G06F16/20: Information retrieval of structured data, e.g. relational data
    • G06F16/21: Design, administration or maintenance of databases
    • G06F16/215: Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G06F16/24: Querying
    • G06F16/245: Query processing
    • G06F16/2455: Query execution
    • G06F16/24552: Database cache management
    • G06F16/2458: Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465: Query processing support for facilitating data mining operations in structured databases
    • G06F16/2471: Distributed queries
    • G06F16/25: Integrating or interfacing systems involving database management systems
    • G06F16/258: Data format conversion from or to a database
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT]
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a big data platform applied to an intelligent park. The technical architecture of the project adopts a hybrid Hadoop + MPP + in-memory database model and uses Storm to support the acquisition and computation of real-time data, realizing a high-concurrency, scalable, high-performance big data system. The platform supports data sharing and processing through databases, messages, files and other channels, and simultaneously supports MapReduce jobs, SQL operations, stream computing and in-memory computing. A rule engine reduces the complexity of the components that implement complex business logic, increases the flexibility of marketing-scenario configuration, lowers application maintenance cost, and improves extensibility. The scheme scales well: the processing capacity of the cluster can be enhanced in the future by horizontal expansion, meeting the needs of business growth.

Description

Big data platform applied to intelligent park and operation method
Technical Field
The invention relates to the technical field of big data platforms, in particular to a big data platform applied to an intelligent park and an operation method.
Background
Although big data technology is currently booming, the big data industry in China is still at an early stage and its industry chain is not yet mature. Even after a big data industrial park is established, enough enterprises will not necessarily settle there, and a complete big data ecosystem may fail to form.
Existing big data applications also carry security risks and hidden dangers, chiefly the following. First, threats to big data itself: when big data technologies, systems and applications aggregate great value, they inevitably become targets of attack. Second, problems and side effects caused by abuse of big data, most typically the leakage of personal privacy, and also the leakage of trade secrets and state secrets enabled by big data analysis capability. Third, issues of mental and ideological security. Threats to big data, the side effects of big data, and extreme attitudes toward big data all hamper and undermine its development.
Therefore, how to provide a secure big data platform and operation method that realize efficient utilization of intelligent parks is a problem to be solved by those skilled in the art.
Disclosure of Invention
In view of this, the present invention provides a big data platform and an operation method applied to an intelligent park. Building on advanced big data technology, it offers an effective solution to the drawbacks of the prior art and enlarges the coverage of big data application technology so that more industrial parks can apply it. The platform realizes efficient utilization of the intelligent park through big data application technology, ensures the efficiency and security of the park's data, and helps the intelligent park form a complete big data ecosystem.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
a big data platform for an intelligent campus, comprising: the system comprises a data acquisition module, a data storage module, a data calculation module, a data application module and a platform management and control module;
the data acquisition module is connected with the data storage module and is used for storing acquired data into the data storage module;
the data computing module is connected with the data storage module and is used for processing the data in the data storage module;
the data application module is connected with the data calculation module to establish business logic and encapsulate business objects and business services;
the platform management and control module is connected with the data acquisition module, the data storage module, the data calculation module and the data application module and is used for monitoring.
Preferably, in the foregoing big data platform for an intelligent park, the data acquisition module includes: the data extraction unit, the data input end and the data output end; the data input end is connected with a data source; the data extraction unit is connected with the data input end, classifies the collected data and transmits the classified data to the data storage module.
Preferably, in the foregoing big data platform applied to the intelligent park, the data storage module includes a distributed file unit, a distributed database, and a distributed cache unit; the distributed file unit is provided with an uploading channel and a downloading channel and performs data interaction with the distributed database; and the distributed cache unit is connected with the distributed database for cache processing.
Preferably, in the foregoing big data platform applied to the intelligent park, the data calculation module includes: a MapReduce unit, a data warehouse unit, a machine learning and data mining library, and a rule knowledge base; the data warehouse unit converts data files and runs on the MapReduce unit; the machine learning and data mining library stores classical algorithms from the machine learning field; the rule knowledge base matches rules via a rule engine.
Preferably, in the foregoing big data platform for an intelligent park, the platform management and control module includes: the system comprises a cluster management unit, a host management unit, a user management unit and a cluster log management unit; the cluster management unit is connected with the data calculation module; the host management unit is connected with the host node; the user management unit manages platform users; the cluster log management unit is respectively connected with the data acquisition module, the data storage module, the data calculation module and the data application module.
Preferably, the above large data platform applied to the intelligent park further comprises a data security module; the data security module comprises an identity verification and authorization unit; the authentication and authorization unit is connected with the user management unit.
An operation method of a big data platform for an intelligent park comprises the following specific steps:
step one: the data acquisition module extracts the collected data from the data source, processes and stores it; access data are uniformly processed through file decompression, file merging and splitting, file-level verification, data-level verification, cleaning, conversion, association and summarization, and loaded into the data storage module;
step two: the data storage module adopts a distributed scheme: Hadoop handles semi-structured and unstructured data, MPP processes high-quality structured data, and the data are stored;
step three: the stored data are transmitted to the data warehouse unit of the data calculation module, where data files are converted and run on the MapReduce unit; the results are transmitted to the data application module for traffic statistics, service recommendation, trend analysis, user behavior analysis, data mining, offline analysis, online analysis and ad-hoc query;
step four: the platform management and control module monitors the data acquisition module, the data storage module, the data calculation module and the data application module.
Compared with the prior art, the technical scheme provided by the invention offers an effective solution to the defects of the prior art on the basis of advanced big data technology. It enlarges the coverage of big data application technology so that more industrial parks can apply it, realizes efficient utilization of the intelligent park, ensures the efficiency and security of the park's data, and helps the intelligent park form a complete big data ecosystem.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
Fig. 1 is a structural frame diagram of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The embodiment of the invention discloses a big data platform applied to an intelligent park. On the basis of advanced big data technology, it provides an effective solution to the defects of the prior art, enlarges the coverage of big data application technology so that more industrial parks can apply it, realizes efficient utilization of the intelligent park, ensures the efficiency and security of the park's data, and enables the intelligent park to form a complete big data ecosystem.
As shown in fig. 1, a big data platform applied to an intelligent park includes: the system comprises a data acquisition module, a data storage module, a data calculation module, a data application module and a platform management and control module;
The data acquisition module is connected with the data storage module and is used for storing acquired data into the data storage module;
the data computing module is connected with the data storage module and is used for processing the data in the data storage module;
the data application module is connected with the data calculation module to establish business logic and encapsulate business objects and business services;
the platform management and control module is connected with the data acquisition module, the data storage module, the data calculation module and the data application module and is used for monitoring.
In order to further optimize the above technical solution, the data acquisition module includes: the data extraction unit, the data input end and the data output end; the data input end is connected with a data source; the data extraction unit is connected with the data input end, classifies the collected data and transmits the classified data to the data storage module.
Further, data in current intelligent-park information systems suffer from inconsistencies such as heterogeneous data structures, inconsistent data lengths, differing data formats and even erroneous data, so the raw data is difficult to use directly. Data from a particular data source must be further processed by the system into a useful, compliant data format.
A big data extraction and conversion unit is constructed to collect and adapt government-affairs information of the project, such as the government-affairs shared information base and online office information. It helps integrate the data held in various systems, so that the integrated data can support further mining and knowledge discovery.
The process mainly comprises extracting data from a data source, processing the data and storing it, thereby completing data reconstruction and meeting the data-format requirements of big data search and other data mining applications. Various collection sources are supported: files on local disks, LAN shared folders, FTP server folders and HTTP servers can all be collected.
Unified data acquisition and scheduling serve as the data-flow hub of the project: the data input end connects to data sources as required; access data are uniformly processed through file decompression, file merging and splitting, file-level verification, data-level verification, cleaning, conversion, association, summarization and other steps, loaded into the system, and centrally managed by platform scheduling.
The data output end provides data services to upper-layer applications; service interfaces call the various data output ends to access data as needed.
The data extraction unit completes the extraction of files from the file interface; the ETL subsystem provides an SFTP plug-in that extracts data over the SFTP protocol. Breakpoint resume of transfers is supported, as is wildcard matching of file names. The main functions involved are: file download; file integrity checking; breakpoint resume; source file deletion; and so on.
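The breakpoint-resume idea mentioned for the SFTP plug-in can be illustrated without any network: record the byte offset already transferred, and continue from it after a failure. The sketch below is purely hypothetical (a bytes object stands in for the remote file; real code would use an SFTP client):

```python
# Hypothetical sketch of breakpoint resume: persist the fetched byte offset
# and continue from it after a failure. No real SFTP is involved.

def fetch_chunk(remote, offset, size):
    """Read at most `size` bytes of the 'remote' file starting at `offset`."""
    return remote[offset:offset + size]

def resume_download(remote, state, size=4):
    """Continue a download from the last checkpointed offset until EOF."""
    while state["offset"] < len(remote):
        chunk = fetch_chunk(remote, state["offset"], size)
        state["data"] += chunk
        state["offset"] += len(chunk)  # checkpoint after every chunk
    return state["data"]

remote = b"park-sensor-readings"
state = {"offset": 0, "data": b""}
state["data"] = fetch_chunk(remote, 0, 4)   # partial transfer...
state["offset"] = 4                         # ...then the connection "fails"
result = resume_download(remote, state)     # resumes at byte 4, not byte 0
```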
Furthermore, the data acquisition module also comprises a library table data synchronization plug-in which supports acquisition of source data from various heterogeneous library table data sources, and the source data is loaded into a target database table after conversion and formatting.
The data conversion function sets the cleaning and conversion rules for the extracted data files according to the data specification requirements of the target interface table, and then cleans and converts the data according to the rules to form formatted files and form related cleaning and conversion quality data.
The data verification function uses the data verification plug-in to verify the data unit files uploaded by a data source, and the resulting quality data are stored in the management metadata base. The verification levels are as follows:
File-level verification: check the number of data files and the number of records against the check file.
Record-level verification: check the value ranges of record fields against the agreed verification rules.
Index-level verification: verify the key service indexes of the interface unit against the check file.
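A toy illustration of the three verification levels, with an invented energy-reading schema: the file-level check compares counts against the check file, the record-level check enforces an agreed value range, and the index-level check compares a key business metric.

```python
# Invented schema for illustration: each record is an energy reading in kWh.

def file_level(records, check):
    """File-level: record count must match the check file."""
    return len(records) == check["record_count"]

def record_level(records, rule):
    """Record-level: every 'kwh' field must fall inside the agreed range."""
    return all(rule["min"] <= r["kwh"] <= rule["max"] for r in records)

def index_level(records, check, tolerance=1e-9):
    """Index-level: total energy use must match the check-file value."""
    return abs(sum(r["kwh"] for r in records) - check["total_kwh"]) <= tolerance

records = [{"kwh": 10.0}, {"kwh": 5.5}]
check = {"record_count": 2, "total_kwh": 15.5}
ok = (file_level(records, check)
      and record_level(records, {"min": 0, "max": 100})
      and index_level(records, check))
```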
Further, the data loading plug-in obtains the converted data files, assembles SQL according to the target format, loads them into the data warehouse target tables in batches, and generates quality data about the loading process.
The data aggregation plug-in calls a procedure or function to complete a specific aggregation process. Aggregation refers to preprocessing of the underlying data (record-row compression, table joins, attribute combination and the like) according to different dimension granularities, indexes, calculation elements and the actual analysis requirements; it is a form of statistical processing of the detailed underlying data, including summation, averaging and so on.
The result of aggregate calculation is summary data pre-computed for the queries users are likely to make. Summaries take many forms and can be computed along any one or more dimensions of the multidimensional data in the data warehouse; if a dimension is hierarchical, aggregation can also be done at any level of it. The aggregate data corresponding to a particular combination of dimensions is called a cuboid (Cube), and the set of all cuboids over a given set of dimensions is called the data cube (Data Cube) of that dimension set. The data cube is created by aggregation.
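The cuboid/data-cube construction described above can be sketched as pre-aggregating a fact table along every subset of its dimensions, so that later queries read a summary instead of the detail rows. The fact-table schema below is invented for illustration:

```python
from itertools import combinations

def build_data_cube(rows, dims, measure):
    """Return {dimension-subset: {group-key: summed measure}} for all cuboids."""
    cube = {}
    for r in range(len(dims) + 1):            # every subset size, 0..len(dims)
        for subset in combinations(dims, r):  # one cuboid per dimension subset
            groups = {}
            for row in rows:
                key = tuple(row[d] for d in subset)
                groups[key] = groups.get(key, 0) + row[measure]
            cube[subset] = groups
    return cube

rows = [
    {"building": "A", "month": "Jan", "kwh": 100},
    {"building": "A", "month": "Feb", "kwh": 120},
    {"building": "B", "month": "Jan", "kwh": 80},
]
cube = build_data_cube(rows, ("building", "month"), "kwh")
```

The empty subset `()` is the grand total; each other subset is a pre-computed answer to a likely group-by query.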
Data aggregation is used to improve the performance of data warehouse units in online analytical processing. It shortens query response time by preparing answers before questions are raised, and it is the basis on which OLAP achieves fast response. Its main characteristics are the following:
aggregation reduces the impact of direct access to underlying data on front-end applications
Online analytical processing generally needs summary data derived from detail data; querying and computing statistics directly over massive base data would greatly affect system efficiency. Aggregation pre-computes the required summary data, thereby avoiding direct access to the underlying data.
Aggregation reduces duplicate computation of underlying data
Different online analytical processing operations may all require the same processing of the same portion of the underlying data. The summary data is pre-computed by aggregation, thereby avoiding repeated computation of relevant underlying data.
Data consistency can be ensured to a certain extent by using aggregation
On one hand, the underlying data in the data warehouse units is not updated in real time, and aggregates derived from this relatively stable underlying data reflect summary information over a period of time. On the other hand, the data in the data warehouse units does change over time, and new data is added periodically. Aggregation can therefore ensure, to a certain extent, the consistency of the data accessed during analysis, avoiding the inconsistencies that arise when basic data collected at different times is used directly.
Collection of event-stream data: the value of data decreases with time, so events must be handled as soon as possible after they occur, preferably immediately, and one at a time rather than cached and processed as a batch. In the data-flow model, the incoming data to be processed (in whole or in part) is not stored on randomly accessible disk or in memory; it arrives in the form of one or more "continuous data streams".
Operators in a data-flow system are divided into stateless and stateful. Stateless operators include map, filter, etc.; stateful operators include sort, join, aggregate, etc. If a stateful operator fails, the state it maintained is lost, and the state and output produced by replaying the data stream are not necessarily consistent with those before the failure; after a stateless operator fails, replaying the data stream reconstructs output consistent with the previous run.
Data flow computation can be seen as a data flow graph consisting of operators (nodes) and data flows (edges).
Apache Kafka is an open-source system that aims to provide a unified, high-throughput, low-latency distributed message-processing platform for real-time data. It was originally developed by LinkedIn, open-sourced in 2011, and contributed to Apache. Kafka differs from traditional message systems such as RabbitMQ and Apache ActiveMQ mainly in that: it is designed as a distributed system and is easy to scale out; it provides high throughput for publishing and subscribing; it supports multiple subscriptions and automatically balances consumers; and messages can be persisted to disk and used for batch consumption such as ETL.
Storm is a real-time data-stream computing system open-sourced by Twitter and developed in the functional language Clojure. Storm provides a set of general primitives for distributed real-time computation, usable for "stream processing" (processing messages and updating databases in real time) and as another way of managing queues and worker clusters. Storm borrows Hadoop's computational model: Hadoop runs a Job, whereas Storm runs a Topology. A Job has a finite lifecycle, while a Topology is a long-running service that does not stop on its own.
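The topology idea, together with the stateless/stateful distinction drawn earlier, can be imitated in a few lines. This is not Storm code; it is a single-process sketch in which generator functions play the roles of a spout, a stateless filter bolt, and a stateful counting bolt (all names invented):

```python
# Single-process imitation of a spout -> filter bolt -> count bolt topology.

def spout(events):
    for e in events:            # in Storm this would be a continuous stream
        yield e

def filter_bolt(stream, predicate):
    for e in stream:            # stateless: output depends only on each tuple
        if predicate(e):
            yield e

def count_bolt(stream):
    counts = {}                 # stateful: this state is lost if the bolt fails
    for e in stream:
        counts[e["gate"]] = counts.get(e["gate"], 0) + 1
    return counts

events = [
    {"gate": "north", "ok": True},
    {"gate": "north", "ok": False},
    {"gate": "south", "ok": True},
]
result = count_bolt(filter_bolt(spout(events), lambda e: e["ok"]))
```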
In order to further optimize the technical scheme, the data storage module comprises a distributed file unit, a distributed database and a distributed cache unit; the distributed file unit is provided with an uploading channel and a downloading channel and performs data interaction with the distributed database; and the distributed cache unit is connected with the distributed database for cache processing.
Furthermore, the data storage module adopts a hybrid Hadoop + MPP + in-memory database architecture and a distributed scheme: Hadoop realizes semi-structured and unstructured data processing, while MPP processes high-quality structured data and provides applications with rich SQL and transaction support. Breaking through the key technologies of storing, managing and efficiently accessing big data, the module can support a big data platform with PB-level storage capacity and provide users with a transparent data management platform.
The distributed file unit is highly fault-tolerant and designed to be deployed on low-cost hardware; it provides high-throughput access to application data and suits applications with very large data sets.
The distributed database is used as a non-shared architecture, each node runs an own operating system, a database and the like, and information interaction between the nodes can only be realized through network connection.
The distributed cache unit is a high-performance key-value in-memory database. It supports a relatively rich set of stored value types, including string, list, set and zset (sorted set). These data types support push/pop, add/remove, intersection, union, difference and richer operations, all of which are atomic. On this basis, the distributed cache unit supports several different orderings. To ensure efficiency, data is cached in memory; the distributed cache unit periodically writes updated data to disk, or appends modification operations to a log file, and implements master-slave synchronization on this basis. In some scenarios the distributed cache unit is a good complement to a relational database.
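A toy in-memory cache illustrating the value types the text attributes to the distributed cache unit (string, list, set, sorted set), plus a crude append-only operation log standing in for the periodic persistence it describes. This is an illustrative sketch, not the actual cache:

```python
# Minimal imitation of a key-value cache with string/list/set/sorted-set types.

class MiniCache:
    def __init__(self):
        self.data = {}
        self.log = []           # append-only record of write operations

    def set(self, key, value):           # string type
        self.data[key] = value
        self.log.append(("set", key, value))

    def lpush(self, key, value):         # list type: push onto the head
        self.data.setdefault(key, []).insert(0, value)
        self.log.append(("lpush", key, value))

    def sadd(self, key, value):          # set type
        self.data.setdefault(key, set()).add(value)
        self.log.append(("sadd", key, value))

    def zadd(self, key, score, member):  # sorted-set type
        self.data.setdefault(key, {})[member] = score
        self.log.append(("zadd", key, score, member))

    def zrange(self, key):
        """Members of a sorted set, ordered by score."""
        members = self.data.get(key, {})
        return sorted(members, key=members.get)

cache = MiniCache()
cache.set("park", "smart")
cache.lpush("events", "entry")
cache.sadd("tags", "iot")
cache.zadd("ranks", 2, "b")
cache.zadd("ranks", 1, "a")
```

Replaying `cache.log` against a fresh instance would rebuild the same state, which is the essence of the append-only persistence mentioned above.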
By adopting a layered, open, sharing-oriented technical architecture, the applications of the performance management system are decoupled from the data, forming a stable and open data sharing platform that supports upper-layer integration of applications from multiple vendors. The data platform supports diversified internal and external applications, has the data processing and storage capacity required for the relevant work, and stores data in categories according to its importance and timeliness.
The data storage module has the characteristics that:
1) Data openness
In order to ensure that data remains effective and performance stable, the system provides shared-interface management, access control, load control and other functions, and can realize one-to-many application expansion:
The shared-interface management function uniformly manages the interfaces of the data sharing platform, including query, subscription, message exchange and database interfaces.
The access control management function implements: access-rights judgment, session management, access-frequency management, request-queue management, security control, and so on.
2) Extensibility
The smooth evolution of the platform is supported, including hardware capacity expansion, data configuration, system management, software upgrading and the like, so as to be suitable for continuous development of services and expansion of user scale.
The system is based on X86 PC server hardware, so that the system is easy to horizontally expand;
it does not depend on particular source or target data and is compatible with various data sources;
and providing application, scheduling, management and monitoring of Hadoop storage and computing resources for third-party applications.
In order to further optimize the above technical solution, the data calculation module includes: the system comprises a MapReduce unit, a data warehouse unit, a machine learning and data mining library and a rule knowledge library; the data warehouse unit converts data files and runs on the MapReduce unit; the machine learning and data mining store a classical algorithm in the machine learning field; the rule knowledge base matches rules via a rule engine.
The data calculation module further supports a variety of parallel-processing workflows, algorithms and tools, adopting batch computation (at which Hadoop excels), iterative computation as used by various machine learning algorithms, stream computing, SQL relational queries, interactive ad-hoc queries and other technologies to realize data fusion, statistics, offline analysis, online analysis, data mining and so on.
Hadoop is used for massive data processing and offline analysis, with irreplaceable advantages in scalability, robustness, computing performance and cost. Through the distributed processing framework of the MapReduce unit, Hadoop processes large-scale data with good scalability.
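The MapReduce model itself is easy to demonstrate in-process. The classic word count below simulates the map, shuffle and reduce phases in plain Python; a real job would distribute these phases across the cluster:

```python
# Word count via simulated map / shuffle / reduce phases.

def map_phase(docs):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in docs:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all values by key."""
    groups = {}
    for key, value in pairs:
        groups.setdefault(key, []).append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the grouped values for each key."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["smart park data", "big data platform", "data"]
counts = reduce_phase(shuffle(map_phase(docs)))
```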
The data warehouse unit adopts Hive, the data warehouse tool of Hadoop, which facilitates data summarization, ad-hoc queries and the analysis of large-scale data sets.
Data mining employs Mahout, an extensible machine learning and data mining library that supports four main use cases: recommendation mining, clustering, classification, and frequent itemset mining.
The rule engine adopts Drools, an inference engine that matches rules from the rule knowledge base against existing facts, resolves conflicting rules, and executes the rules that survive filtering. The rule engine releases complex and changeable rules from hard-coded logic and stores them as rule scripts in files, so that rule changes take effect immediately in the online environment without modifying code or restarting machines.
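The behaviour attributed to the rule engine (match rules from a knowledge base against known facts, resolve conflicts, fire the winner) can be sketched in miniature. This is not Drools; the rule names, the priority-based conflict resolution, and the marketing facts are all invented for illustration:

```python
# Miniature rule matcher: pick the highest-priority rule whose condition holds.

def match(rules, facts):
    """Return the highest-priority rule whose 'when' condition is satisfied."""
    candidates = [r for r in rules if r["when"](facts)]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r["priority"])  # conflict resolution

# Rules live in data, not in application code, so they can change independently.
rules = [
    {"name": "vip-offer",   "priority": 2,
     "when": lambda f: f["visits"] > 10,
     "then": lambda f: "send VIP coupon"},
    {"name": "basic-offer", "priority": 1,
     "when": lambda f: f["visits"] > 0,
     "then": lambda f: "send welcome mail"},
]

facts = {"visits": 12}
winner = match(rules, facts)
action = winner["then"](facts) if winner else None
```

Because both rules match these facts, the priority field decides the conflict, much as salience does in a production rule engine.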
The applications supported by the data calculation module are very wide-ranging, including traffic statistics, service recommendation, trend analysis, user behavior analysis, data mining, offline analysis, online analysis, ad-hoc queries, and so on.
In order to further optimize the technical scheme, the data application module adopts J2EE and Ajax technologies to realize application functions on a WEB interface, and establishes business logic by encapsulating business objects and business services; the application services realize the business logic in a centralized way. With this approach, business logic is implemented outside the business objects, which reduces the coupling between them. The use of application services allows higher-level business logic to be packaged in a separate component that invokes the underlying business objects and business services. The main functions of the application layer are: four-network coordinated precision marketing support, user behavior trajectory extraction and scenario production, and precision outdoor advertising marketing.
In order to further optimize the above technical solution, the platform management and control module includes: the system comprises a cluster management unit, a host management unit, a user management unit and a cluster log management unit; the cluster management unit is connected with the data calculation module; the host management unit is connected with the host node; the user management unit manages platform users; the cluster log management unit is respectively connected with the data acquisition module, the data storage module, the data calculation module and the data application module.
The platform management and control module realizes the following functions:
1) Visual management of big data platform
Cloudera Manager is used to implement management and configuration. Cloudera Manager is a component that facilitates the installation, monitoring and management of big-data-processing services such as Hadoop in a cluster, and greatly simplifies the installation and configuration management of the hosts and of services such as Hadoop, Hive and Spark in the cluster.
Cloudera Manager provides a visual management interface;
Cloudera Manager provides cluster management functions;
Cloudera Manager provides host management, application authorization and other functions;
Cloudera Manager provides cluster user management functions;
Cloudera Manager provides cluster log management functions.
2) Big data platform configuration management
The big data platform provides the functions of installation, parameter configuration and management of the Hadoop cluster.
Functions such as HDFS, HBase, MapReduce, Hive and ZooKeeper may be provided.
Installation and deployment are supported in a wizard-guided mode; a system administrator can complete the installation and deployment task with only a small amount of input, following the wizard's prompts.
And supporting HA automatic deployment of the master node.
And supporting the automatic installation and deployment tasks of more than 300 nodes.
Operations of adding, deleting, modifying, searching and the like are provided for the system configuration information, and each operation of an administrator needs to be recorded in a log.
Supporting the function of dynamically adding and deleting system nodes;
and the cluster configuration of the heterogeneous servers is supported, and the configuration tuning of the operation resources under the heterogeneous servers is supported.
3) Big data platform cluster monitoring
The big data platform supports visual monitoring and alarming of each cluster resource in the general Hadoop system, and supports unified monitoring of multiple clusters. Multi-level, multi-dimensional visual monitoring of the Hadoop cluster is realized through a WEB interface tool. Multi-level refers to five levels: cluster level, service level, node level, process level and job level. Multi-dimensional refers to dimensions such as CPU occupancy; memory capacity and occupancy; disk capacity or HDFS capacity and occupancy; disk I/O traffic and occupancy; and network bandwidth and occupancy.
Visual display of the storage and computing resources of each cluster node is supported, such as racks, the network topology graph, network segments and server configurations;
visual display of the resource usage of each cluster node is supported, such as the number of data blocks, the number of running jobs and node health status, and periodic health inspections are supported;
visual monitoring of the system services of each node is supported, such as the distributed file unit, MapReduce, HBase and ZooKeeper;
visual monitoring of the running states of each node's jobs (success, failure, cancellation, etc.) is supported, and the corresponding log information is captured.
The monitoring content comprises:
Host node: host name, percentage of idle CPU, percentage of CPU occupied by user space, percentage of CPU occupied by user processes, percentage of CPU occupied by prioritized processes, percentage of CPU occupied by kernel space, cache memory size, idle memory size, shared memory size, total amount of memory in the kernel cache, total size of the swap partition, total size of disk, remaining disk space, total number of running processes, total number of processes, average system load per minute, average system load per 5 minutes, average system load per 15 minutes, incoming packets per second, outgoing packets per second, network ingress bandwidth speed, network egress bandwidth speed.
Distributed file unit: total number of file system blocks, total size, total number of files, remaining capacity, corrupted blocks, blocks awaiting replication, JVM thread states, and the like.
MapReduce: task running status, task resource occupancy, and the like.
HBase: cluster status, number of requests per RegionServer, number of Regions per RegionServer, and the like.
Monitoring and recovery of cluster software and hardware faults are supported, such as a restart mechanism for downed nodes and a restart mechanism for abnormally terminated service processes.
When a fault or abnormality occurs, alarm information is displayed at a prominent position.
When the fault or abnormality is resolved, the alarm is automatically released from the user interface and the alarm record can be retrieved from the history information.
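The alarm lifecycle described above (raise prominently on fault, auto-release on resolution, keep a retrievable history) can be sketched minimally as follows. This is an illustrative Python toy, not the platform's monitoring code; the class and metric names are invented for the example:

```python
import time

class AlarmManager:
    """Raise alarms when a metric crosses its threshold, clear them
    automatically when the fault is resolved, and keep a history that
    survives the release of the alarm."""
    def __init__(self):
        self.active = {}    # (node, metric) -> message shown prominently
        self.history = []   # every raised alarm, retrievable later

    def check(self, node, metric, value, threshold):
        key = (node, metric)
        if value > threshold and key not in self.active:
            msg = f"{node}: {metric}={value} exceeds {threshold}"
            self.active[key] = msg
            self.history.append((time.time(), msg))
        elif value <= threshold and key in self.active:
            del self.active[key]   # auto-release; history is kept

mgr = AlarmManager()
mgr.check("node-1", "cpu", 95, threshold=90)   # fault: alarm raised
mgr.check("node-1", "cpu", 40, threshold=90)   # resolved: auto-released
# mgr.active is now empty, but mgr.history retains the alarm record
```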
4) Big data platform safety management (rights isolation)
The big data platform supports rights management for system users and security authentication of nodes. Roles can be created according to combinations of organizational structure, operation authority, data authority and the like, realizing flexible configuration management. Each user can only see the execution of the applications they are authorized for. Before a user performs any operation on a job, the unified authentication service judges whether the user holds the corresponding operation authority. For files stored in the distributed file unit, a Linux-like file and directory security control model is supported. Access authentication and security control for clients accessing the Hadoop system are supported, as is the Kerberos security authentication mechanism for network connections. Security access control of the Hadoop system is provided, and illegal access can be interrupted by formulating a security policy.
SSL encryption: through different certificate policies, SSL clients can connect securely to the servers of the cluster, using trusted certificates or certificates issued by a trusted authority. The certificate requirements depend on the configured certificate policy. Common policies are: certificate per host (one certificate per machine), certificate for multiple hosts (one certificate shared by several machines), and wildcard certificate. SSL must be enabled for all core Hadoop services (HDFS, MapReduce, YARN, etc.).
Kerberos authentication: Kerberos uses the Needham-Schroeder protocol as its basis. It relies on a trusted third party called the Key Distribution Center (KDC), which consists of two logically independent parts: an authentication server and a ticket-granting server. Kerberos works on the basis of "tickets" that prove a user's identity. The KDC holds a key database; each network entity, whether a client or a server, shares a key known only to itself and the KDC. Knowledge of this key serves to prove the entity's identity. For communication between two entities, the KDC generates a session key that is used to encrypt the interaction between them.
The Kerberos authentication mechanism ensures that the nodes in a cluster are nodes it acknowledges and trusts. Authentication keys are placed on the trusted nodes in advance, at cluster deployment time. When the cluster runs, the nodes authenticate each other using these keys; only authenticated nodes can be used normally. A node attempting to impersonate a cluster member cannot communicate with the nodes inside the cluster, because it lacks the key information obtained in advance. This prevents malicious use or tampering of the Hadoop cluster and ensures its reliability and security.
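The core idea — a KDC that shares one long-term key with each entity and issues a fresh session key encrypted under both parties' keys — can be illustrated with the following deliberately simplified Python toy. The XOR "cipher", class names and entity names are invented for illustration; real Kerberos uses proper symmetric ciphers (e.g. AES) and a full ticket exchange:

```python
import hashlib
import secrets

def toy_encrypt(key: bytes, data: bytes) -> bytes:
    # Illustrative XOR "encryption" with a key-derived keystream;
    # NOT secure, used only to show who can recover what.
    stream = hashlib.sha256(key).digest()
    return bytes(d ^ stream[i % len(stream)] for i, d in enumerate(data))

toy_decrypt = toy_encrypt  # XOR is its own inverse

class ToyKDC:
    """Trusted third party holding one long-term key per entity."""
    def __init__(self):
        self.keys = {}  # entity name -> long-term key shared with the KDC

    def register(self, name):
        self.keys[name] = secrets.token_bytes(16)
        return self.keys[name]  # in reality, installed at deployment time

    def issue_session_key(self, client, server):
        # One fresh session key, delivered encrypted under each party's
        # long-term key, so only those two entities can recover it.
        sk = secrets.token_bytes(16)
        return (toy_encrypt(self.keys[client], sk),
                toy_encrypt(self.keys[server], sk))

kdc = ToyKDC()
k_client = kdc.register("worker-node")
k_server = kdc.register("name-node")
for_client, for_server = kdc.issue_session_key("worker-node", "name-node")
# Both sides recover the same session key with their own long-term keys;
# an impostor without a registered key cannot.
assert toy_decrypt(k_client, for_client) == toy_decrypt(k_server, for_server)
```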
Sentry service: Sentry is an open-source Hadoop component published by Cloudera; it is the Hadoop authorization module. In order to provide the correct users and applications with precisely the right access levels, Sentry provides fine-grained, role-based authorization and a multi-tenant management mode. By introducing Sentry, Hadoop can currently meet the RBAC (role-based access control) requirements of enterprise and government users in the following ways:
security authorization: sentry may control data access and provide data access privileges to authenticated users.
Fine granularity access control: sentry supports fine-grained Hadoop data and metadata access control.
Role-based management: Sentry simplifies management through role-based authorization; you can easily grant different privilege levels to multiple groups that access the same data set. For example, for a particular data set, you can give the anti-fraud team the privilege to view all columns, give analysts the privilege to view non-sensitive or non-PII (personally identifiable information) columns, and give the data ingestion stream the privilege to insert new data into HDFS.
Multi-tenant management: sentry allows permissions to be set for different data sets delegated to different administrators. In the case of Hive/Impala, sentry can perform rights management at the database/schema level.
And (3) unifying a platform: the Sentry provides a unified platform for ensuring data security, and the existing Hadoop Kerberos is used for realizing security authentication. Meanwhile, the same Sentry protocol can be used when accessing data through Hive or Impala. In the future, the Sentry protocol will be extended to other components.
Sentry architecture: the authorization core of Sentry comprises two parts: a binding layer (Hive bindings and Impala bindings) and a core authorization provider (policy engine and policy abstractions). The binding layer provides a pluggable interface for interacting with the policy engine. The policy engine cooperates with the bindings to evaluate and verify access requests, and accesses the underlying data through the policy abstractions when access is allowed.
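The role-based grant model described in this section can be condensed into a small sketch: roles aggregate privileges on resources, groups are granted roles, and a user's effective access is the union over their groups. This is a hypothetical minimal model in Python (the role, group and table names are invented), not Sentry's actual policy engine:

```python
# Roles hold (resource, action) privileges; groups are granted roles.
GRANTS = {
    "analyst_role": {("sales_db.orders", "SELECT")},
    "etl_role":     {("sales_db.orders", "SELECT"),
                     ("sales_db.orders", "INSERT")},
}
GROUP_ROLES = {"analysts": {"analyst_role"}, "ingestion": {"etl_role"}}

def allowed(user_groups, resource, action):
    # A request passes if any role of any of the user's groups
    # carries the required privilege on the resource.
    return any((resource, action) in GRANTS[role]
               for g in user_groups
               for role in GROUP_ROLES.get(g, ()))

assert allowed({"analysts"}, "sales_db.orders", "SELECT")       # read OK
assert not allowed({"analysts"}, "sales_db.orders", "INSERT")   # no write
assert allowed({"ingestion"}, "sales_db.orders", "INSERT")      # ETL writes
```

Granting a team a new privilege level then means editing one role's grant set, not touching every user — which is the management simplification claimed above.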
The cluster log management module is connected with the data acquisition module, the data storage module, the data calculation module and the data application module; the log information comprises a time stamp, a level, a user, module information and the log text. Recording and viewing of system running logs and audit logs are supported. Recording, querying and display of the system running log and the user access operation log are supported. Recording and viewing of the running logs of HDFS, MapReduce, HBase, Hive and ZooKeeper are supported. System log levels including INFO, DEBUG, WARN, ERROR and FATAL are supported. Recording and viewing of the system audit logs of HDFS, MapReduce and Hive are supported.
In order to further optimize the technical scheme, the system also comprises a data security module; the data security module comprises an identity verification and authorization unit; the authentication and authorization unit is connected with the user management unit.
Further, authentication and authorization are two core processes that are typically involved in attempting to interact with an IT system. These core flows can ensure security of the system against attacks:
Authentication is the process of confirming that an entity has the identity it declares. In the human world, people typically authenticate by providing a user name and password pair. More advanced and complex mechanisms are available to perform authentication, including biometric authentication, multi-factor authentication, and so on. The object (a person or a specific subsystem) being authenticated is often referred to as a principal.
The authorization mechanism is used to determine which operations a principal is allowed to perform on the system, or which resources the principal has access to. The authorization procedure is typically triggered after the authentication procedure. Typically, after a principal passes authentication, the principal's information is provided to help determine which operations the principal is able and unable to perform.
In monolithic applications, authentication and authorization are simple and commonplace, because they are handled entirely by the application itself; no advanced mechanisms are needed to provide a secure user experience. However, in a microservice architecture with typical distributed features, a more advanced mode must be employed to avoid repeatedly passing and checking credentials between service calls. Ideally, a principal's identity is verified once and then propagated; a single identity simplifies the authentication and authorization process, enables automation, and improves scalability.
Further, when a security policy is established for the microservice architecture, inter-service authentication and authorization are adopted:
trust boundary: containment techniques (such as Docker) are used to reduce risk. The many functions provided by Docker enable developers to flexibly and maximally increase security of micro services and entire applications at different levels. In building the service code, the developer is free to use the penetration test tool to perform stress tests on any part of the build cycle. Because the source code that builds a Docker image has been explicitly described in declarative form in the Docker distribution component (Docker and Docker compound files), developers can easily handle the image supply chain and enforce security policies when needed. In addition, the services can be easily consolidated by putting them into a Docker container, making them invariable, adding a strong security to the services.
Further, by employing a software defined infrastructure, a private network can be quickly created and configured using a scripting language, and powerful security policies can be enforced at the network level.
SSO is used for internal interaction between services in a micro-service architecture, which can use existing infrastructure, can simplify access control to services, and centralizes all access control operations in one enterprise access directory server.
HTTP-based Hash Message Authentication Code (HMAC)
In HMAC, the requested content is hashed with a private key, and the resulting hash value is sent with the request. The other end of the communication then uses its copy of the private key and the received requested content to recreate the hash value. If the hash values match, the request is allowed to pass. If the request has been tampered with, the hash values do not match and the other end knows and reacts appropriately.
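The signing and verification flow just described can be shown concretely with Python's standard `hmac` module. The key value and request body below are invented placeholders; in practice the private key is provisioned to both ends out of band and never transmitted:

```python
import hashlib
import hmac

SECRET = b"shared-private-key"  # held by both ends, never sent on the wire

def sign(body: bytes) -> str:
    # Hash the request content together with the private key;
    # the hex digest travels alongside the request.
    return hmac.new(SECRET, body, hashlib.sha256).hexdigest()

def verify(body: bytes, signature: str) -> bool:
    # The receiver recreates the digest from its own copy of the key
    # and compares in constant time; a tampered body will not match.
    return hmac.compare_digest(sign(body), signature)

body = b'{"device": "gate-07", "action": "open"}'
sig = sign(body)
assert verify(body, sig)             # untouched request passes
assert not verify(body + b"x", sig)  # tampered request is rejected
```

`hmac.compare_digest` is used instead of `==` so that the comparison time does not leak how many leading characters of the forged signature were correct.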
Managing keys using special purpose services
To eliminate the credential management overhead in a distributed model such as a micro-service architecture and to benefit from the high security of the constructed system, one option is to use a comprehensive key management tool. This tool allows keys (e.g., passwords, API keys, and certificates) to be stored, dynamically rented, updated, and revoked. These operations are important in micro services due to the automation principles specified in micro services.
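The store/lease/revoke lifecycle such a key management tool provides can be sketched as follows. This is a hypothetical minimal interface in Python, loosely mimicking tools of this kind — the class, method names and TTL semantics are invented for illustration, not any real product's API:

```python
import secrets
import time

class ToyKeyService:
    """Store secrets and hand out time-limited leases that can be
    revoked at any time, so services never hold long-lived credentials."""
    def __init__(self):
        self._secrets = {}   # secret name -> value
        self._leases = {}    # lease id -> (secret name, expiry time)

    def put(self, name, value):
        self._secrets[name] = value

    def lease(self, name, ttl=60.0):
        # Dynamically "rent" a secret: the caller gets an opaque lease id
        # valid only until the TTL elapses or the lease is revoked.
        lease_id = secrets.token_hex(8)
        self._leases[lease_id] = (name, time.time() + ttl)
        return lease_id

    def read(self, lease_id):
        name, expires = self._leases.get(lease_id, (None, 0.0))
        if time.time() >= expires:
            raise PermissionError("lease expired or revoked")
        return self._secrets[name]

    def revoke(self, lease_id):
        self._leases.pop(lease_id, None)

svc = ToyKeyService()
svc.put("db-password", "s3cr3t")
lease = svc.lease("db-password", ttl=60.0)
assert svc.read(lease) == "s3cr3t"
svc.revoke(lease)
# after revocation, svc.read(lease) raises PermissionError
```

The automation principle mentioned above shows up here as the absence of manual key distribution: services request a lease at startup and simply re-lease when it expires.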
It is to be understood that although, in theory, no data encryption method exists that cannot be broken, there are well-established, proven and commonly used mechanisms (such as AES-128 or AES-256). When security is a consideration, these mechanisms should be used rather than inventing one's own methods internally. In addition, the libraries implementing these mechanisms should be updated and patched in time.
Key management tool: first, do not store the keys and the data they protect in the same location. Key management must not become so complex that it violates the flexibility principle of the microservice architecture; use comprehensive tools designed with microservice concepts in mind, which do not disrupt your continuous integration and continuous delivery pipeline.
Adjusting security policies to business demand: security policies are formulated according to business needs and continually adjusted, since strategic goals may change constantly, as may the technologies involved in the solution.
Establishment of big data security system
1. Security architecture
The safety guarantee system comprises a safety protection system and a safety management system. Wherein the safety protection system comprises: network security, system security, application security, and data security; the security management system comprises a security policy management specification, a security organization model and a security regulation system.
2. Safety protection system
The network security protection system mainly provides the network-level protection means necessary for the data application access modes; some applications can adopt Virtual Private Network (VPN) technology to ensure the safe and reliable transmission of shared exchange data. The network-layer security protection platform encrypts key applications and data, enhances data transmission efficiency, and supports the rapid creation of new secure application environments to meet new application traffic requirements. It mainly comprises four sub-functions: boundary protection, regional protection, node protection and network high availability.
The system operation security system mainly comprises system operation security, system information security design, trust service system and authority management design, and ensures the security of the system from various layers.
The data security system mainly realizes the security of data exchange through four functions of data security encryption transmission (VPN), security assurance of data exchange process, security design of data exchange interface and data audit and protection.
3. Security management system
In the construction of a safety guarantee system, all potential safety hazards are difficult to prevent by technical means, and a corresponding safety management system is also required to be established. Safety management is a core link of the whole safety construction. An effective safety organization can guarantee the simplicity and high efficiency of daily safety guarantee work under the guidance of a safety strategy and the guarantee of a safety technology and a safety product.
The security management system mainly comprises: security policies, security organizations, and security regimes. In order to enhance the security management of the customer network and ensure the security of key facilities, the construction of a security management system should be enhanced.
The invention relates to a big data platform applied to an intelligent park, which is mainly characterized in that the technical architecture of the whole project adopts a mixed mode of Hadoop+MPP+memory database, and simultaneously adopts Storm technology to support the acquisition and calculation of real-time data, thereby realizing a high-concurrency, scalable and high-performance big data system. Support data sharing and processing capabilities in a variety of ways, such as databases, messages, files, and the like. And simultaneously, mapReduce operation, SQL operation, stream calculation and memory calculation are supported. The rule engine is used for reducing the complexity of components for realizing complex business logic, increasing the flexibility of marketing scene configuration, reducing the maintenance cost of an application program and enhancing the expandability of the program. The scheme has good expansibility, and can enhance the processing capacity of the cluster in the future in a horizontal expansion mode, thereby meeting the requirement of service development.
By adopting the big data technology Hadoop and a distributed architecture, single points of failure are avoided, and the system has high scalability and availability. Indexing and searching of large amounts of information can be done in near real time, enabling fast real-time searching of billions of files and PB-level data, while providing a full set of options that can be tailored to nearly every aspect of the engine.
And executing data acquisition tasks in parallel through a MapReduce technology, primarily sorting the captured data, submitting the data to a data storage layer, and extracting structural information through a data processing layer for data mining analysis.
The original content of the webpage is stored by adopting a distributed database, and the distributed database is constructed on Hadoop+Hbase, so that an online real-time random read-write architecture is realized. The system has extremely strong horizontal scalability, supports billions of rows and millions of columns, and supports real-time data acquisition.
The platform runs on a cluster formed by common commercial hardware, adopts a distributed architecture, can be expanded to thousands of machines, has a fault tolerance mechanism, and can not cause data loss and calculation task failure when partial machine nodes are in failure. Not only is high availability, but also has high scalability, and can be used for horizontally expanding and improving data, storage capacity and calculation speed by simply increasing the machine.
Meanwhile, the security protection system of the big data platform combines technical safeguards with offline personnel security management, overcoming the potential security risks that exist in the prior art when only technical protection is relied upon, and providing a higher security guarantee for the big data platform applied to the intelligent park. The security guarantee system comprises a security protection system and a security management system. The security protection system mainly realizes security through technology and includes: network security, system security, application security and data security. The security management system mainly establishes a security organization under unified leadership and a security protection regime, realizing the data security of the big data platform; it includes security policy management specifications, a security organization model and a security regulation system.
In the present specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different point from other embodiments, and identical and similar parts between the embodiments are all enough to refer to each other. For the device disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant points refer to the description of the method section.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (5)

1. A big data platform for an intelligent campus, comprising: the system comprises a data acquisition module, a data storage module, a data calculation module, a data application module and a platform management and control module;
the data acquisition module is connected with the data storage module and is used for storing acquired data into the data storage module;
The data computing module is connected with the data storage module and is used for processing the data in the data storage module; the data acquisition module comprises: a data extraction unit, a data input end and a data output end; the data input end is connected with a data source; the data extraction unit is connected with the data input end, classifies the acquired data and transmits the classified data to the data storage module; the data extraction unit completes the extraction of file interface files, performs data extraction via the SFTP protocol, and comprises the following functions: file downloading, file integrity checking, resumable file transfer after interruption, and source file deletion;
the data application module is connected with the data calculation module to establish business logic encapsulation business objects and business services;
the platform management and control module is connected with the data acquisition module, the data storage module, the data calculation module and the data application module and is used for monitoring; the platform management and control module comprises: the system comprises a cluster management unit, a host management unit, a user management unit and a cluster log management unit; the cluster management unit is connected with the data calculation module; the host management unit is connected with the host node; the user management unit manages platform users; the cluster log management unit is respectively connected with the data acquisition module, the data storage module, the data calculation module and the data application module; the cluster management unit is used for carrying out cluster monitoring, and carrying out multi-level multi-dimensional visual monitoring and alarming on each cluster resource in the general Hadoop system; the monitoring content comprises a host node, a distributed file unit and MapReduce, HBASE; the system also supports monitoring and recovering of the software and hardware faults of the cluster; when a fault or abnormality occurs, alarm information is displayed at a salient position; when the fault or abnormality is resolved, the alarm is automatically released from the user interface and the alarm record is retrieved from the history information.
2. The big data platform for use in intelligent parks according to claim 1, wherein the data storage module comprises a distributed file unit, a distributed database, and a distributed cache unit; the distributed file unit is provided with an uploading channel and a downloading channel and performs data interaction with the distributed database; and the distributed cache unit is connected with the distributed database for cache processing.
3. The big data platform for use in an intelligent campus of claim 1, wherein the data computing module comprises: the system comprises a MapReduce unit, a data warehouse unit, a machine learning and data mining library and a rule knowledge library; the data warehouse unit converts data files and runs on the MapReduce unit; the machine learning and data mining store a classical algorithm in the machine learning field; the rule knowledge base matches rules via a rule engine.
4. The big data platform for use in an intelligent campus of claim 1, further comprising a data security module; the data security module comprises an identity verification and authorization unit; the authentication and authorization unit is connected with the user management unit.
5. A method for operating a big data platform applied to an intelligent park, characterized in that the big data platform applied to an intelligent park according to any one of claims 1-4 is used, the specific steps comprising:
step one: the data acquisition module extracts the acquired data from the data source, processes the data and stores the data, and uniformly processes the access data through file decompression, file merging and splitting, file-level verification, data-level verification, cleaning, conversion, association and summarization, and loads the access data to the data storage module;
step two: the data storage module adopts a distributed scheme, and semi-structured and unstructured data processing is realized by Hadoop; processing high-quality structured data by using MPP, and storing the data;
step three: transmitting the stored data to the data warehouse unit of the data calculation module, converting the data files and running them on the MapReduce unit; the MapReduce unit transmits the results to the data application module for traffic statistics, service recommendation, trend analysis, user behavior analysis, data mining, offline analysis, online analysis and ad-hoc queries;
step four: and monitoring the data acquisition module, the data storage module, the data calculation module and the data application module on the platform management and control module.
CN201910774850.XA 2018-12-12 2019-08-21 Big data platform applied to intelligent park and operation method Active CN110543464B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201822083697 2018-12-12
CN201822083697X 2018-12-12

Publications (2)

Publication Number Publication Date
CN110543464A CN110543464A (en) 2019-12-06
CN110543464B true CN110543464B (en) 2023-06-23

Family

ID=68731627

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910774850.XA Active CN110543464B (en) 2018-12-12 2019-08-21 Big data platform applied to intelligent park and operation method

Country Status (1)

Country Link
CN (1) CN110543464B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110995869B (en) * 2019-12-23 2022-11-11 杭州雷数科技有限公司 Machine data collection method, device, equipment and medium
CN111541542B (en) * 2019-12-31 2023-09-15 远景智能国际私人投资有限公司 Request sending and verifying method, device and equipment
CN111400326B (en) * 2020-02-28 2023-09-12 深圳市赛为智能股份有限公司 Smart city data management system and method thereof
CN111708645A (en) * 2020-06-12 2020-09-25 北京思特奇信息技术股份有限公司 Event processing method and system based on stream processing
CN111666559A (en) * 2020-06-19 2020-09-15 中信银行股份有限公司 Data bus management method and device supporting authority management, electronic equipment and storage medium
CN111861016B (en) * 2020-07-24 2024-03-29 北京合众伟奇科技股份有限公司 Summarized analysis management method and system for predicted sales amount of power grid
CN112181940A (en) * 2020-08-25 2021-01-05 天津农学院 Method for constructing national industrial and commercial big data processing system
CN111950809B (en) * 2020-08-26 2022-03-25 华北电力大学(保定) Master-slave game-based hierarchical and partitioned optimized operation method for comprehensive energy system
CN112508677A (en) * 2020-11-06 2021-03-16 无锡艺界科技有限公司 Financial system based on big data risk control
CN112906907B (en) * 2021-03-24 2024-02-23 成都工业学院 Method and system for layering management and distribution of machine learning pipeline model
CN113254548A (en) * 2021-06-04 2021-08-13 深圳市智慧空间平台技术开发有限公司 Method for integrating multidimensional data of park
CN113810272A (en) * 2021-09-29 2021-12-17 周明升 Wisdom garden data access gateway
CN114253519B (en) * 2022-03-01 2022-06-24 中国电子信息产业集团有限公司第六研究所 Wisdom garden security protection management system and electronic equipment
CN115225730A (en) * 2022-07-05 2022-10-21 北京赛思信安技术股份有限公司 High-concurrency offline data packet analysis method supporting multiple tasks

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104021194A (en) * 2014-06-13 2014-09-03 浪潮(北京)电子信息产业有限公司 Mixed type processing system and method oriented to industry big data diversity application
CN104820670A (en) * 2015-03-13 2015-08-05 国家电网公司 Method for acquiring and storing big data of power information
CN107945086A (en) * 2017-11-17 2018-04-20 广州葵翼信息科技有限公司 A kind of big data resource management system applied to smart city

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9626361B2 (en) * 2014-05-09 2017-04-18 Webusal Llc User-trained searching application system and method
US9767149B2 (en) * 2014-10-10 2017-09-19 International Business Machines Corporation Joining data across a parallel database and a distributed processing system
CN105631764A (en) * 2015-12-31 2016-06-01 国网电力科学研究院武汉南瑞有限责任公司 Smart power grid big data application system orienting smart city
US20170270165A1 (en) * 2016-03-16 2017-09-21 Futurewei Technologies, Inc. Data streaming broadcasts in massively parallel processing databases
CN106126601A (en) * 2016-06-20 2016-11-16 华南理工大学 A kind of social security distributed preprocess method of big data and system
CN106547882A (en) * 2016-11-03 2017-03-29 国网重庆市电力公司电力科学研究院 A kind of real-time processing method and system of big data of marketing in intelligent grid
US20180181621A1 (en) * 2016-12-22 2018-06-28 Teradata Us, Inc. Multi-level reservoir sampling over distributed databases and distributed streams
CN107515927A (en) * 2017-08-24 2017-12-26 深圳市云房网络科技有限公司 A kind of real estate user behavioural analysis platform
CN107679192B (en) * 2017-10-09 2020-09-22 中国工商银行股份有限公司 Multi-cluster cooperative data processing method, system, storage medium and equipment
CN108197261A (en) * 2017-12-30 2018-06-22 北京通途永久科技有限公司 A kind of wisdom traffic operating system


Also Published As

Publication number Publication date
CN110543464A (en) 2019-12-06

Similar Documents

Publication Publication Date Title
CN110543464B (en) Big data platform applied to intelligent park and operation method
US20220247769A1 (en) Learning from similar cloud deployments
EP3262815B1 (en) System and method for securing an enterprise computing environment
US20220329616A1 (en) Using static analysis for vulnerability detection
Awaysheh et al. Next-generation big data federation access control: A reference model
US11741238B2 (en) Dynamically generating monitoring tools for software applications
CN112765245A (en) Electronic government affair big data processing platform
US20220311794A1 (en) Monitoring a software development pipeline
US11895135B2 (en) Detecting anomalous behavior of a device
US20220232024A1 (en) Detecting deviations from typical user behavior
US11894984B2 (en) Configuring cloud deployments based on learnings obtained by monitoring other cloud deployments
US11792284B1 (en) Using data transformations for monitoring a cloud compute environment
US20230075355A1 (en) Monitoring a Cloud Environment
US11770398B1 (en) Guided anomaly detection framework
US20220303295A1 (en) Annotating changes in software across computing environments
US20170279720A1 (en) Real-Time Logs
US20220224707A1 (en) Establishing a location profile for a user device
US20220360600A1 (en) Agentless Workload Assessment by a Data Platform
Sicari et al. Security&privacy issues and challenges in NoSQL databases
KR20200084136A (en) System for auditing data access based on block chain and the method thereof
US20220294816A1 (en) Ingesting event data into a data warehouse
CN112527873B (en) Big data management application system based on chain number cube
Wang et al. Research on data security in big data cloud computing environment
US20230319092A1 (en) Offline Workflows In An Edge-Based Data Platform
CN112837194A (en) Intelligent system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant