CN110543464A

CN110543464A - Big data platform applied to smart park and operation method

Info

Publication number: CN110543464A
Application number: CN201910774850.XA
Authority: CN
Inventors: 任菁倩; 杨嘉欣; 杜莎
Original assignee: Guangdong Dingyi Interconnection Technology Co Ltd
Current assignee: Guangdong Dingyi Interconnection Technology Co Ltd
Priority date: 2018-12-12
Filing date: 2019-08-21
Publication date: 2019-12-06
Anticipated expiration: 2039-08-21
Also published as: CN110543464B

Abstract

The invention discloses a big data platform applied to a smart park. Its technical architecture adopts a hybrid mode of Hadoop + MPP + in-memory database and uses Storm to support the acquisition and computation of real-time data, thereby realizing a big data system with high concurrency, scalability and high performance. The platform supports data sharing and processing across multiple modes such as databases, messages and files, and supports MapReduce jobs, SQL jobs, stream computing and in-memory computing. A rule engine is used to reduce the complexity of components that implement complex business logic, increase the flexibility of marketing scenario configuration, reduce the maintenance cost of application programs and improve program extensibility. The scheme scales well: cluster processing capacity can later be increased by horizontal expansion to meet the needs of service development.

Description

Big data platform applied to smart park and operation method

Technical Field

The invention relates to the technical field of big data platforms, in particular to a big data platform applied to an intelligent park and an operation method.

Background

Although big data technology is currently booming, the big data industry in China is still at an early stage, and its industry chain is not yet mature. After a big data industrial park is established, it may not attract enough enterprises, and a complete big data ecosystem may not form.

Existing big data applications face security problems and hidden dangers, mainly the following. First, threats to big data itself, that is, security in the usual sense: when big data technologies, systems and applications gather a large amount of value, they inevitably become targets of attack. Second, problems and side effects caused by the excessive use or abuse of big data, typically the disclosure of personal privacy, but also the disclosure of commercial and state secrets enabled by big data analysis capability. Third, security issues at the level of mindset and awareness. Threats to big data, side effects of big data, and extreme attitudes toward big data can all hinder and disrupt the development of big data.

Therefore, how to provide a secure big data platform and operation method that enable efficient utilization of a smart park is a problem to be solved by those skilled in the art.

Disclosure of Invention

In view of the above, the present invention provides a big data platform applied to a smart park and an operation method thereof. Building on advanced big data technology, the invention provides and implements an effective solution to the defects of the prior art and widens the coverage of big data application technology so that more industrial parks can apply it. The platform enables efficient utilization of the smart park while keeping its data efficient and secure, so that the smart park forms a more complete big data ecosystem.

In order to achieve the purpose, the invention adopts the following technical scheme:

A big data platform for a smart park, comprising: a data acquisition module, a data storage module, a data calculation module, a data application module and a platform management and control module;

The data acquisition module is connected with the data storage module and stores acquired data into the data storage module;

The data calculation module is connected with the data storage module and is used for processing data in the data storage module;

The data application module is connected with the data calculation module and establishes business logic that encapsulates business objects and business services;

The platform management and control module is used for connecting and monitoring the data acquisition module, the data storage module, the data calculation module and the data application module.

Preferably, in the above big data platform applied to the smart park, the data acquisition module includes: the data extraction unit, the data input end and the data output end; the data input end is connected with a data source; the data extraction unit is connected with the data input end and classifies the acquired data and transmits the data to the data storage module.

Preferably, in the above big data platform applied to the smart park, the data storage module includes a distributed file unit, a distributed database, and a distributed cache unit; the distributed file unit is provided with an uploading channel and a downloading channel and performs data interaction with the distributed database; and the distributed cache unit is connected with the distributed database for cache processing.

Preferably, in the big data platform applied to the smart park, the data calculation module includes a MapReduce unit, a data warehouse unit, a machine learning and data mining library and a rule knowledge base; the data warehouse unit converts a data file and runs it on the MapReduce unit; the machine learning and data mining library stores classic algorithms of the machine learning field; the rule knowledge base matches rules through a rule engine.

Preferably, in the above big data platform applied to the smart park, the platform management and control module includes a cluster management unit, a host management unit, a user management unit and a cluster log management unit; the cluster management unit is connected with the data calculation module; the host management unit is connected with a host node; the user management unit manages the platform users; the cluster log management unit is connected with the data acquisition module, the data storage module, the data calculation module and the data application module respectively.

Preferably, in the big data platform applied to the smart park, the big data platform further comprises a data security module; the data security module comprises an identity authentication and authorization unit; the identity authentication and authorization unit is connected with the user management unit.

An operation method of a big data platform for a smart park comprises the following specific steps:

Step one: the data acquisition module extracts data from a data source, processes it and stores it; the accessed data is processed uniformly through file decompression, file merging and splitting, file-level verification, data-level verification, cleaning, conversion, association and summarization, and is then loaded into the data storage module;

Step two: the data storage module adopts a distributed scheme, using Hadoop to process semi-structured and unstructured data and MPP (massively parallel processing) to process high-quality structured data, and stores the data;

Step three: the stored data is transmitted to the data warehouse unit of the data calculation module, which converts the data file and runs it on the MapReduce unit; the MapReduce unit transmits the results to the data application module for traffic statistics, service recommendation, trend analysis, user behavior analysis, data mining, offline analysis, online analysis and ad hoc query;

Step four: the platform management and control module monitors the data acquisition module, the data storage module, the data calculation module and the data application module.

According to the above technical scheme, compared with the prior art, the invention provides a big data platform applied to a smart park. Building on advanced big data technology, it provides and implements an effective solution to the defects of the prior art and widens the coverage of big data application technology so that more industrial parks can apply it. The platform enables efficient utilization of the smart park while keeping its data efficient and secure, so that the smart park forms a more complete big data ecosystem.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in describing them are briefly introduced below. The drawings described below are only embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.

Figure 1 is a structural framework diagram of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The embodiment of the invention discloses a big data platform applied to a smart park. Building on advanced big data technology, it provides and implements an effective solution to the defects of the prior art and widens the coverage of big data application technology so that more industrial parks can apply it. The platform enables efficient utilization of the smart park while keeping its data efficient and secure, so that the smart park forms a more complete big data ecosystem.

As shown in Fig. 1, a big data platform applied to a smart park includes a data acquisition module, a data storage module, a data calculation module, a data application module and a platform management and control module;

In order to further optimize the above technical solution, the data acquisition module includes a data extraction unit, a data input end and a data output end; the data input end is connected with a data source; the data extraction unit is connected with the data input end, classifies the acquired data and transmits it to the data storage module.

Furthermore, data in current smart park information systems is inconsistent: data structures are heterogeneous, data lengths and formats differ, and some data is simply erroneous, so the original data is difficult to use directly. The system must therefore further process the data from each data source into a useful, desired format.

By constructing the big data extraction and conversion unit, the project collects and adapts government affairs information such as the shared government affairs information base and online service handling information. The unit helps integrate data from various systems so that the integrated data can support further data mining and knowledge discovery.

Its main functions are to extract data from a data source, process it and store it, thereby completing data reconstruction and meeting the data format requirements of big data search and other data mining applications. It supports multiple file collection sources: local folders, folders shared over a local area network, folders on an FTP server and folders on an HTTP server.

Unified data acquisition and scheduling serves as the data transfer hub of the project. The data input end connects to data sources as required; the accessed data is processed uniformly through file decompression, file merging and splitting, file-level verification, data-level verification, cleaning, conversion, association, summarization and similar steps, and is then loaded into the system. The hub is also responsible for centralized management of platform scheduling.

The data output end also provides data services to upper-layer applications: service interfaces call the various data output ends as required to access data.

The data extraction unit extracts data from file-type interface files. The ETL subsystem provides an SFTP plug-in and extracts data over the SFTP protocol, with support for resumable (breakpoint) transfer and file wildcards. The main functions involved are: file download; file integrity check; resumable file transfer; source file deletion; and so on.
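The resume-and-verify behaviour described above can be illustrated with a minimal sketch. It does not use SFTP: local files stand in for the remote source, and the function names `resume_copy` and `sha256_of` are hypothetical. The assumption is that a transfer resumes from the byte offset already present at the destination and that integrity is confirmed afterwards by comparing digests.

```python
import hashlib
import os
import tempfile


def resume_copy(src_path, dst_path, chunk_size=64 * 1024):
    """Continue a partial copy from the number of bytes already at dst_path."""
    offset = os.path.getsize(dst_path) if os.path.exists(dst_path) else 0
    with open(src_path, "rb") as src, open(dst_path, "ab") as dst:
        src.seek(offset)  # skip what was transferred before the interruption
        while chunk := src.read(chunk_size):
            dst.write(chunk)
    return offset


def sha256_of(path):
    """Digest used for the file integrity check."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(64 * 1024), b""):
            digest.update(block)
    return digest.hexdigest()


# Simulate an interrupted transfer: only the first 400 bytes arrived.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "source.dat")
    dst = os.path.join(tmp, "partial.dat")
    payload = bytes(range(256)) * 4
    with open(src, "wb") as f:
        f.write(payload)
    with open(dst, "wb") as f:
        f.write(payload[:400])
    resumed_from = resume_copy(src, dst)
    intact = sha256_of(src) == sha256_of(dst)
```

In a real SFTP plug-in the same offset logic would apply to the remote file handle; that part is omitted here.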

Furthermore, the data acquisition module also includes a database table data synchronization plug-in, which acquires source data from various heterogeneous database table data sources and loads it into a target database table after conversion and formatting.

The data conversion function sets cleaning and conversion rules for the extracted data files according to the data specification requirements of the target interface table, and then performs cleaning and conversion of data according to the rules to form formatted files and related cleaning and conversion quality data.
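As a rough illustration of rule-driven cleaning and conversion, the sketch below applies one conversion function per target field and reports simple quality data. The rule format, the field names and the function `clean_and_convert` are assumptions made for the example, not details from the source.

```python
def clean_and_convert(rows, rules):
    """Apply per-field conversion rules and report cleaning quality data.

    rules maps a target field name to a callable that cleans or converts
    the raw value; rows that fail any rule are rejected.
    """
    cleaned = []
    rejected = 0
    for row in rows:
        try:
            cleaned.append({field: convert(row[field]) for field, convert in rules.items()})
        except (KeyError, ValueError):
            rejected += 1  # missing field or unconvertible value
    quality = {"input": len(rows), "output": len(cleaned), "rejected": rejected}
    return cleaned, quality


# Hypothetical target specification: trimmed visitor id, integer age.
rules = {"visitor_id": str.strip, "age": int}
rows = [
    {"visitor_id": " v1 ", "age": "30"},
    {"visitor_id": "v2", "age": "n/a"},  # fails the integer conversion
]
cleaned, quality = clean_and_convert(rows, rules)
```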

The data verification function verifies the data unit file uploaded by the data source by using the data verification plug-in, and verified quality data is stored in the management metadata base. The hierarchy of the check is as follows:

File-level check: the number of data files and the number of records are checked against the check file.

Record-level check: the value ranges of record fields are checked against agreed check rules.

Index-level check: the key service indicators of the interface units are verified against the check file.
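The three check levels can be sketched as follows. This is a minimal illustration under stated assumptions: the layout of the check file (`CheckFile`), the `amount` indicator and all function names are hypothetical and are not taken from the source.

```python
from dataclasses import dataclass


@dataclass
class CheckFile:
    """Hypothetical check file delivered alongside the data files."""
    expected_files: int
    expected_records: int
    expected_amount_total: float  # assumed key service indicator


def file_level_check(files, check):
    """Compare the number of data files and records with the check file."""
    record_count = sum(len(records) for records in files.values())
    return len(files) == check.expected_files and record_count == check.expected_records


def record_level_check(record, rules):
    """Check each field value against an agreed (min, max) range."""
    return all(lo <= record[field] <= hi for field, (lo, hi) in rules.items())


def index_level_check(files, check, tolerance=1e-6):
    """Recompute a key indicator and compare it with the check file."""
    total = sum(r["amount"] for records in files.values() for r in records)
    return abs(total - check.expected_amount_total) <= tolerance


files = {
    "part-0.csv": [{"age": 30, "amount": 10.0}, {"age": 41, "amount": 5.5}],
    "part-1.csv": [{"age": 25, "amount": 4.5}],
}
check = CheckFile(expected_files=2, expected_records=3, expected_amount_total=20.0)
rules = {"age": (0, 150), "amount": (0.0, 1e9)}
```

The boolean results of each level would be the "verified quality data" that the text says is stored in the management metadata base.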

The data loading plug-in obtains the converted data files, assembles SQL according to the target format, loads the data in batches into a target table of the data warehouse, and generates quality data for the loading process.
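A minimal sketch of assembling SQL and loading in batches is shown below. An in-memory SQLite database stands in for the data warehouse target, and the table, the columns and the function `load_batch` are assumptions for the example.

```python
import sqlite3


def load_batch(conn, table, columns, rows):
    """Assemble a parameterized INSERT and load the rows as one batch.

    table and columns must come from trusted configuration, because they
    are interpolated into the SQL text; row values are bound as parameters.
    Returns simple load-quality data.
    """
    placeholders = ", ".join("?" for _ in columns)
    sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
    before = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    with conn:  # commit on success, roll back on error
        conn.executemany(sql, rows)
    after = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    return {"offered": len(rows), "loaded": after - before}


conn = sqlite3.connect(":memory:")  # stand-in for the warehouse target
conn.execute("CREATE TABLE visits (visitor_id TEXT, gate TEXT, ts INTEGER)")
quality = load_batch(
    conn,
    "visits",
    ["visitor_id", "gate", "ts"],
    [("v1", "north", 1), ("v2", "south", 2)],
)
```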

The data aggregation plug-in calls a procedure or function to complete a specific data aggregation process. Aggregation means preprocessing the underlying data, for example by compressing record rows, joining tables and merging attributes, according to differences in dimension granularity, indicators and computation elements and according to actual analysis needs. It is a form of data processing that computes statistics such as sums and averages over the underlying detail data.

The result of the aggregate calculation is summary data pre-computed for the queries a user is likely to issue. Aggregation takes many forms and can be performed along any one or more dimensions of the multidimensional data in the data warehouse. If a dimension is hierarchical, aggregation may also be performed at any level of that hierarchy. The aggregated data for a given combination of dimensions is called a cube, and the lattice formed by all cubes of a given dimension set is called the data cube of that dimension set. The data cube is built by aggregation.
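To make the cube lattice concrete, the sketch below pre-computes a sum for every subset of a dimension set. The sample rows, the dimension names and the function `build_cube` are assumptions for illustration; a production warehouse would compute these aggregates inside the database rather than in application code.

```python
from collections import defaultdict
from itertools import combinations


def build_cube(rows, dimensions, measure):
    """Pre-compute SUM(measure) for every subset of the dimensions.

    Each subset of dimensions yields one cube of aggregated data; the
    collection of all of them is the data cube for the dimension set.
    """
    cube = {}
    for k in range(len(dimensions) + 1):
        for dims in combinations(dimensions, k):
            totals = defaultdict(float)
            for row in rows:
                key = tuple(row[d] for d in dims)
                totals[key] += row[measure]
            cube[dims] = dict(totals)
    return cube


rows = [
    {"building": "A", "month": "2019-01", "kwh": 120.0},
    {"building": "A", "month": "2019-02", "kwh": 100.0},
    {"building": "B", "month": "2019-01", "kwh": 80.0},
]
cube = build_cube(rows, ["building", "month"], "kwh")

# A query such as "total for building A" is now a dictionary lookup
# instead of a scan over the detail rows.
total_a = cube[("building",)][("A",)]
```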

Data aggregation improves the performance of the data warehouse unit in online analytical processing (OLAP). By preparing answers before questions are raised, it shortens query response time and is therefore the basis for fast OLAP responses. This is mainly reflected in the following aspects:

Aggregation reduces the impact of direct access to underlying data on front-end applications

On-line analysis processing usually requires summarized data derived from detailed data, and directly performing query statistics on massive basic data greatly affects system efficiency. The required summary data is pre-computed by aggregation, thereby avoiding direct access to the underlying data.

Aggregation reduces duplicate computations on underlying data

Different on-line analytical processing operations may all require the same processing of the same portion of underlying data. The summary data is pre-computed by aggregation, thereby avoiding duplicate computations on relevant underlying data.

Aggregation helps guarantee data consistency to a certain extent

On the one hand, the underlying data in the data warehouse unit is not updated in real time, so aggregates derived from this relatively stable data reflect summary information over a period of time. On the other hand, the data in the data warehouse unit does change over time, since new data is added periodically. Aggregation ensures, to a certain extent, the consistency of the data accessed during analysis and avoids inconsistencies between successive summaries computed directly from the underlying data.

Acquiring event stream data: the value of data decreases over time, so events must be processed as soon as possible after they occur, preferably immediately and one event at a time rather than buffered into a batch. In the data flow model, the input data to be processed (in whole or in part) is not stored on randomly accessible disk or memory; instead it arrives as one or more "continuous data streams".

A data flow system involves two kinds of operators: stateless and stateful. Stateless operators include union, filter and the like; stateful operators include sort, join, aggregate and the like. If execution fails, the state maintained by a stateful operator is lost, and the state and output produced by replaying the data stream are not necessarily consistent with those before the failure. After a stateless operator fails, by contrast, replaying the data stream reproduces output consistent with that before the failure.

A data flow computation can be viewed as a data flow graph composed of operators (nodes) and data streams (edges).
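The distinction between stateless and stateful operators can be sketched with Python generators, chained into a small data flow graph. This is an illustration only and does not use Storm or Kafka; the event fields and the operator names `filter_op` and `running_count` are assumptions.

```python
def filter_op(stream, predicate):
    """Stateless operator: each event is handled independently."""
    for event in stream:
        if predicate(event):
            yield event


def running_count(stream, key):
    """Stateful operator: keeps a per-key count across events."""
    counts = {}  # this state is lost if the operator fails
    for event in stream:
        k = event[key]
        counts[k] = counts.get(k, 0) + 1
        yield k, counts[k]


events = [
    {"gate": "north", "ok": True},
    {"gate": "south", "ok": False},
    {"gate": "north", "ok": True},
]

# Data flow graph: source -> filter_op -> running_count.
# Events flow through one at a time; nothing is buffered as a batch.
result = list(running_count(filter_op(events, lambda e: e["ok"]), "gate"))
```

Replaying `events` through `filter_op` always gives the same output, whereas `running_count` gives the same output only if its `counts` dictionary is restored to the value it had before the failure.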

Apache Kafka is an open source system that aims to provide a unified, high-throughput, low-latency distributed message processing platform for real-time data. It was originally developed by LinkedIn, open-sourced in 2011 and contributed to Apache. Kafka differs from traditional message systems such as RabbitMQ and Apache ActiveMQ mainly in the following: it is designed as a distributed system and is easy to scale out; it provides high throughput for both publishing and subscribing; it supports multiple subscribers and automatically balances consumers; and it persists messages to disk, so they can also be consumed in batches, for example by ETL jobs.

Storm is a real-time data stream computing system open-sourced by Twitter and developed in the functional language Clojure. Storm provides a set of general primitives for distributed real-time computing. It can be used for "stream processing", handling messages and updating databases in real time, as an alternative to managing queues and worker clusters yourself. Storm borrows from the Hadoop computing model: Hadoop runs a Job, whereas Storm runs a Topology. A Job has a finite life cycle, whereas a Topology is a service, a job that does not stop.

In order to further optimize the technical scheme, the data storage module comprises a distributed file unit, a distributed database and a distributed cache unit; the distributed file unit is provided with an uploading channel and a downloading channel and performs data interaction with the distributed database; and the distributed cache unit is connected with the distributed database for cache processing.

Further, the data storage module adopts a hybrid architecture of Hadoop + MPP + in-memory database with a distributed scheme. Hadoop handles semi-structured and unstructured data; MPP handles high-quality structured data and provides applications with rich SQL and transaction support. The scheme addresses the key technologies of storing, managing and efficiently accessing big data, can support a big data platform with PB-level storage capacity, and provides users with a transparent data management platform.

The distributed file unit has the characteristic of high fault tolerance and is designed to be deployed on cheap hardware; and it provides high throughput access to application data, suitable for applications with very large data sets.

The distributed database uses a shared-nothing architecture: each node runs its own operating system, database and so on, and nodes exchange information only through network connections.

The distributed cache unit is a high-performance key-value in-memory database. It supports a relatively rich set of value types, including string, list, set and zset (sorted set). These types support operations such as push/pop, add/remove, intersection, union and difference, as well as richer operations, all of which are atomic. On this basis, the distributed cache unit supports sorting in various ways. To ensure efficiency, data is cached in memory; in addition, the distributed cache unit periodically writes updated data to disk or appends modification operations to an append-only log file, and implements master-slave synchronization on this basis. In some scenarios the distributed cache unit is a good complement to a relational database.

By adopting a layered, open, sharing-oriented technical framework, the applications and data of the performance management system are decoupled to form a stable, open data sharing platform. The platform supports application integration from multiple upper-layer vendors, serves diverse internal and external applications, provides the data processing and storage capacity needed for related work, and stores data by class and grade according to its importance and timeliness.

The data storage module has the following characteristics:

1) Data openness

In order to ensure the effectiveness of data and stable performance, the system has the functions of shared interface management, access control, load control and the like, and can realize one-to-many application expansion:

The sharing interface management function uniformly manages the interfaces of the data sharing platform, including interfaces of inquiry, subscription, message exchange, database and the like.

The access control management function should implement: judgment of access authority, session management, access frequency management, request queue management, security control and the like.

2) Expansibility

The platform supports smooth evolution, including hardware expansion, data configuration, system management and software upgrades, so that it can adapt to continuing service growth and an expanding user base.

The system is based on x86 PC server hardware and is easy to expand horizontally;

It has no dependence on source or target data and is compatible with various data sources;

It provides Hadoop storage and the application, scheduling, management and monitoring of computing resources for third-party applications.

In order to further optimize the above technical solution, the data calculation module includes a MapReduce unit, a data warehouse unit, a machine learning and data mining library and a rule knowledge base; the data warehouse unit converts a data file and runs it on the MapReduce unit; the machine learning and data mining library stores classic algorithms of the machine learning field; the rule knowledge base matches rules through a rule engine.

The data calculation module supports multiple parallel-processing workflows, algorithms and tools. It covers the batch computing at which Hadoop excels, iterative computing as represented by various machine learning algorithms, stream computing, SQL relational queries, interactive ad hoc queries and the like, and implements data fusion, statistics, offline analysis, online analysis, data mining and similar techniques.

Hadoop is currently used for processing and offline analysis of massive data, and has irreplaceable advantages in scalability, robustness, computing performance and cost. Hadoop processes large-scale data through the distributed processing framework of the MapReduce unit and is very flexible.
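The MapReduce processing model can be sketched in a single process as three phases: map, shuffle (group by key) and reduce. This is an illustration of the model only, not Hadoop code; the log format and the names `visits_mapper` and `sum_reducer` are assumptions.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and emit (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its grouped values."""
    return {key: reducer(key, values) for key, values in groups.items()}


def visits_mapper(line):
    """Emit (gate, 1) for each hypothetical 'gate,visitor' log line."""
    gate, _visitor = line.split(",")
    yield gate, 1


def sum_reducer(_key, values):
    return sum(values)


log_lines = ["north,v1", "south,v2", "north,v3"]
counts = reduce_phase(shuffle(map_phase(log_lines, visits_mapper)), sum_reducer)
```

In Hadoop the map and reduce functions run in parallel across nodes and the shuffle moves data over the network; the structure of the computation is the same.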

The data warehouse unit adopts Hive, a data warehouse tool built on Hadoop that facilitates data summarization, ad hoc query and analysis of large data sets.

Data mining adopts Mahout, an extensible machine learning and data mining library that supports four main use cases: recommendation mining, clustering, classification and frequent itemset mining.

The rule engine adopts Drools, an inference engine that matches rules from the rule knowledge base against existing facts, resolves conflicting rules, and executes the rules finally selected. The rule engine frees complex, changeable rules from hard-coding and stores them in files as rule scripts, so that rule changes take effect immediately in the online environment without modifying code or restarting machines.
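The match, resolve and execute cycle can be sketched in plain Python. This is not the Drools API: in Drools, rules are written in rule files and priority is expressed with the `salience` attribute, and the sketch only mimics that idea. The `Rule` class, the `fire_rules` function and the marketing facts are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    salience: int                       # higher value fires first
    condition: Callable[[dict], bool]   # matched against the facts
    action: Callable[[dict], None]      # may modify the facts


def fire_rules(rules, facts):
    """Match rules against facts, order conflicts by salience, then execute.

    A condition is re-checked just before firing, so a higher-priority
    rule can invalidate a lower-priority rule that matched initially.
    """
    matched = [r for r in rules if r.condition(facts)]
    matched.sort(key=lambda r: r.salience, reverse=True)
    fired = []
    for rule in matched:
        if rule.condition(facts):
            rule.action(facts)
            fired.append(rule.name)
    return fired


rules = [
    Rule("regular", 1,
         lambda f: f["visits"] >= 1 and f["offer"] is None,
         lambda f: f.update(offer="welcome-coupon")),
    Rule("vip", 10,
         lambda f: f["visits"] >= 10 and f["offer"] is None,
         lambda f: f.update(offer="vip-coupon")),
]
facts = {"visits": 12, "offer": None}
fired = fire_rules(rules, facts)
```

Because the rules are data rather than hard-coded branches, the list could be reloaded from a file at run time, which is the flexibility the paragraph above attributes to the rule engine.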

The data calculation module supports a wide range of applications, including traffic statistics, service recommendation, trend analysis, user behavior analysis, data mining, offline analysis, online analysis, ad hoc query, and the like.

In order to further optimize the technical scheme, the data application module uses J2EE and Ajax technologies to implement application functions through a web interface, and establishes business logic that encapsulates business objects and business services, with application services implementing the business logic centrally. In this way business logic is implemented outside the business objects, which reduces coupling between them. Application services encapsulate business logic at a higher level of abstraction in a separate component that calls the underlying business objects and business services. The main functions of the application layer are: four-network collaborative precision marketing support, extraction of user behavior trajectories and scenario generation, and precision marketing for outdoor advertising.

In order to further optimize the above technical solution, the platform management and control module includes a cluster management unit, a host management unit, a user management unit and a cluster log management unit; the cluster management unit is connected with the data calculation module; the host management unit is connected with a host node; the user management unit manages the platform users; the cluster log management unit is connected with the data acquisition module, the data storage module, the data calculation module and the data application module respectively.

The platform management and control module realizes the following functions:

1) Visual management of the big data platform

Administration and configuration are implemented using Cloudera Manager. Cloudera Manager is a component that facilitates the installation, monitoring and management of Hadoop and other big data processing services in a cluster, and greatly simplifies the installation, configuration and management of hosts and of services such as Hadoop, Hive and Spark.

Cloudera Manager provides a visual management interface;

Cloudera Manager provides cluster management functions;

Cloudera Manager provides host management, application authorization and similar functions;

Cloudera Manager provides cluster user management functions;

Cloudera Manager provides cluster log management functions;

2) Big data platform configuration management

The big data platform provides installation, parameter configuration and management functions for the Hadoop cluster.

It provides the functions of components such as HDFS, HBase, MapReduce, Hive and ZooKeeper.

Installation and deployment are wizard-driven: a system administrator completes installation and deployment tasks with only a small amount of input in response to the wizard's prompts.

Automatic HA (high availability) deployment of the master node is supported.

Automatic installation and deployment of more than 300 nodes is supported.

System configuration information can be added, deleted, modified and queried, and every administrator operation must be recorded in a log.

Dynamic addition and removal of system nodes is supported;

Cluster configuration across heterogeneous servers is supported, as is tuning of job resource configuration on heterogeneous servers.

3) Big data platform cluster monitoring

The big data platform supports visual monitoring and alarming of all cluster resources in the general Hadoop system and supports unified monitoring of multiple clusters. Multi-level, multi-dimensional visual monitoring of the Hadoop cluster is provided through a web interface tool. Multi-level refers to five levels: cluster, service, node, process and job. Multi-dimensional refers to dimensions such as CPU usage, memory capacity and usage, disk or HDFS capacity and usage, disk I/O throughput and usage, and network bandwidth and usage.

Visual display of the storage and computing resources of each cluster node is supported, such as racks, network topology diagrams, network segments and server configuration;

Visual display of resource usage on each cluster node is supported, such as the number of data blocks, the number of running jobs and node health status, and periodic health checks are supported.

Visual monitoring of system services on each node is supported, such as distributed file units, MapReduce, HBase and ZooKeeper.

Visual monitoring of the job running state (success, failure, cancellation and the like) on each node is supported, and the corresponding log information is captured.

The monitored content includes the following:

Host node: host name, idle CPU percentage, CPU percentage used by user space, CPU percentage used by user-space processes with changed priority, CPU percentage used by kernel space, cached memory size, free memory size, shared memory size, total memory used for kernel buffers, total swap space, total disk size, remaining disk space, number of running processes, total number of processes, 1-minute load average, 5-minute load average, 15-minute load average, incoming packets per second, outgoing packets per second, inbound network bandwidth, outbound network bandwidth.

Distributed file unit: total number of file system blocks, total size, total number of files, remaining capacity, corrupt blocks, blocks awaiting replication, JVM thread state and the like.

MapReduce: task running status, task resource usage and the like.

HBase: the number of requests to the cluster and to each RegionServer, the number of RegionServer registers and the like.

Monitoring of and recovery from software and hardware faults of the cluster are supported, such as a restart mechanism for nodes that go down and a restart mechanism for abnormally terminated service processes.

When a fault or an abnormality occurs, alarm information is displayed in a prominent position.

When the fault or abnormality is resolved, the alarm is automatically removed from the user interface, and the alarm record can still be retrieved from the historical information.
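As a non-limiting illustration, the alarm behaviour described above (raise on a fault, clear automatically on recovery, keep a retrievable history) could be sketched as follows. The class name `AlarmBoard`, the metric name `cpu_percent` and the threshold value are assumptions made for this example only and are not part of the disclosed platform.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class AlarmBoard:
    """Tracks active alarms and keeps a history of every alarm event."""
    thresholds: Dict[str, float]  # metric name -> upper limit (hypothetical values)
    active: Dict[Tuple[str, str], float] = field(default_factory=dict)
    history: List[Tuple[str, str, float, str]] = field(default_factory=list)

    def report(self, node: str, metric: str, value: float) -> None:
        """Process one sample; raise or clear the alarm for (node, metric)."""
        limit = self.thresholds.get(metric)
        if limit is None:
            return  # this metric is not monitored
        key = (node, metric)
        if value > limit and key not in self.active:
            self.active[key] = value  # shown on the user interface
            self.history.append((node, metric, value, "RAISED"))
        elif value <= limit and key in self.active:
            del self.active[key]  # removed from the user interface
            self.history.append((node, metric, value, "CLEARED"))


board = AlarmBoard(thresholds={"cpu_percent": 90.0})
board.report("node-1", "cpu_percent", 97.5)  # fault: alarm becomes active
board.report("node-1", "cpu_percent", 42.0)  # recovery: alarm is cleared
```

After the second call the set of active alarms is empty, while both events remain available in `board.history`.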

4) Big data platform security management (permission isolation)

The big data platform supports permission management for system users and security authentication of nodes. Roles are established from combinations of organizational structure, operation permissions, data permissions and the like, which allows flexible configuration management. Each user can see only the execution of the applications for which the user is authorized. Before a user performs any operation on a job, a unified authentication service determines whether the user holds the corresponding operation permission. For files stored in the distributed file unit, a Linux-like security control model for files and directories is supported. The system supports access authentication and security control for clients that access the Hadoop system, and supports the Kerberos security authentication mechanism for network connections. It provides secure access control for the Hadoop system and can block illegal access by means of configured security policies.
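As a non-limiting illustration of a Linux-like file and directory permission model, the following sketch checks a requested access against owner, group and other permission bits. The function name `may_access` and the user and group names are hypothetical; the sketch does not reproduce the permission code of any particular file system.

```python
import stat
from typing import Set


def may_access(mode: int, owner: str, group: str,
               user: str, user_groups: Set[str], want: str) -> bool:
    """Return True if `user` may perform `want` ('r', 'w' or 'x').

    As in the POSIX model, exactly one permission class applies:
    owner first, then group, then other.
    """
    owner_bit, group_bit, other_bit = {
        "r": (stat.S_IRUSR, stat.S_IRGRP, stat.S_IROTH),
        "w": (stat.S_IWUSR, stat.S_IWGRP, stat.S_IWOTH),
        "x": (stat.S_IXUSR, stat.S_IXGRP, stat.S_IXOTH),
    }[want]
    if user == owner:
        return bool(mode & owner_bit)
    if group in user_groups:
        return bool(mode & group_bit)
    return bool(mode & other_bit)
```

For a file with mode `0o640`, the owner may read and write, members of the file's group may only read, and all other users are refused.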

SSL encryption: SSL can be used on the cluster, with different certificate policies, to allow SSL clients to connect securely to servers, using either trusted certificates or certificates issued by a trusted authority. The certificate requirements depend on the configured certificate policy. The common policies are: one certificate per host, one certificate shared by multiple hosts, and a wildcard certificate. SSL must be enabled for all core Hadoop services (HDFS, MapReduce, YARN and the like).

Kerberos authentication: Kerberos is based on the Needham-Schroeder protocol. It relies on a "trusted third party", termed the Key Distribution Center (KDC), which consists of two logically separate parts: the authentication server and the ticket-granting server. Kerberos works on the basis of "tickets" that are used to prove the identity of a user. The KDC holds a key database; each network entity, whether a client or a server, shares with the KDC a secret key known only to itself and the KDC. Knowledge of this key is used to prove the identity of the entity. For communication between two entities, the KDC generates a session key that is used to encrypt the information exchanged between them.

The Kerberos authentication mechanism ensures that the nodes in the cluster are nodes that the cluster recognizes and trusts. The authentication keys are placed on the trusted nodes in advance, when the cluster is deployed. When the cluster runs, the nodes in the cluster are authenticated using these keys. Only authenticated nodes can be used normally. Nodes attempting to impersonate cluster members cannot communicate with nodes within the cluster because they do not hold the previously distributed key information. This prevents malicious use of or tampering with the Hadoop cluster and ensures its reliability and security.

Sentry service: Sentry is an open-source Hadoop component released by Cloudera that serves as a Hadoop authorization module. To give the right users and applications the right level of access, Sentry provides fine-grained, role-based authorization and a multi-tenant management mode. By introducing Sentry, Hadoop can meet the RBAC (role-based access control) requirements of enterprise and government users in the following respects:

Secure authorization: Sentry can control data access and grant data access privileges to authenticated users.

Fine-grained access control: Sentry supports fine-grained access control for Hadoop data and metadata.

Role-based management: Sentry simplifies management through role-based authorization, so that different privilege levels on the same data set can easily be granted to multiple groups. For example, for a particular data set, the anti-fraud group can be given the privilege to view all columns, analysts the privilege to view non-sensitive or non-PII (personally identifiable information) columns, and data ingestion streams the privilege to insert new data into HDFS.

Multi-tenant management: Sentry allows permissions on different data sets to be delegated to different administrators. In the case of Hive/Impala, Sentry can manage permissions at the database/schema level.

Unified platform: Sentry provides a unified platform for securing data and uses the existing Hadoop Kerberos for security authentication. The same Sentry policy can be used when data is accessed through Hive or Impala. In the future, Sentry policies will be extended to other components.

Sentry architecture: the authorization core of Sentry is essentially divided into two parts, the binding layer (Hive bindings and Impala bindings) and the core authorization providers (policy engine and policy abstractions). The binding layer provides a pluggable interface for talking to the policy engine. The policy engine works with the bindings to evaluate access requests and, if access is allowed, the underlying data is accessed through the policy abstractions.
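As a non-limiting illustration of role-based, column-level authorization of the kind described above, the following sketch maps groups to roles and roles to the columns they may read. It does not use or reproduce the Sentry API; all names (`ROLE_PRIVILEGES`, `GROUP_ROLES`, the `transactions` data set and its columns) are hypothetical.

```python
from typing import Dict, Iterable, Set

# Hypothetical policy data: role -> {data set: columns the role may read}
ROLE_PRIVILEGES: Dict[str, Dict[str, Set[str]]] = {
    "anti_fraud": {"transactions": {"account_id", "name", "amount", "ts"}},
    "analyst": {"transactions": {"amount", "ts"}},  # non-PII columns only
}

# Hypothetical mapping: group -> roles granted to that group
GROUP_ROLES: Dict[str, Set[str]] = {
    "fraud_team": {"anti_fraud"},
    "bi_team": {"analyst"},
}


def allowed_columns(groups: Iterable[str], dataset: str) -> Set[str]:
    """Union of the columns granted by every role of every group."""
    columns: Set[str] = set()
    for group in groups:
        for role in GROUP_ROLES.get(group, set()):
            columns |= ROLE_PRIVILEGES.get(role, {}).get(dataset, set())
    return columns


def authorize(groups: Iterable[str], dataset: str, requested: Iterable[str]) -> bool:
    """Allow the request only if every requested column has been granted."""
    wanted = set(requested)
    return bool(wanted) and wanted <= allowed_columns(groups, dataset)
```

Under these assumed policies, a member of `bi_team` may read `amount` but is refused `name`, whereas a member of `fraud_team` may read both.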

The cluster log management module is connected with the data acquisition module, the data storage module, the data calculation module and the data application module. The log information comprises a timestamp, a level, user and module information, and the log text. Recording and viewing of system operation logs and audit logs are supported, as are recording, query and presentation of system running logs and user access operation logs. Recording and viewing of the running logs of HDFS, MapReduce, HBase, Hive and ZooKeeper are supported. System operation logs are graded into levels including INFO, DEBUG, WARN, ERROR, FATAL and the like. Recording and viewing of system audit logs of HDFS, MapReduce and Hive are supported.
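As a non-limiting illustration, a log entry carrying the fields listed above (timestamp, level, user, module, text) and a query by minimum level could be sketched as follows. The names `LogRecord` and `at_least` are hypothetical, and the ordering of the levels by ascending severity is an assumption of this example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List

# Assumed ordering by ascending severity
LEVELS = ["DEBUG", "INFO", "WARN", "ERROR", "FATAL"]


@dataclass(frozen=True)
class LogRecord:
    timestamp: datetime
    level: str   # one of LEVELS
    user: str
    module: str  # e.g. "HDFS", "MapReduce", "HBase", "Hive", "ZooKeeper"
    text: str


def at_least(records: Iterable[LogRecord], min_level: str) -> List[LogRecord]:
    """Return the records whose level is min_level or more severe."""
    floor = LEVELS.index(min_level)
    return [r for r in records if LEVELS.index(r.level) >= floor]
```

A viewer could, for example, call `at_least(records, "WARN")` to present only warnings and errors.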

In order to further optimize the technical scheme, the system also comprises a data security module; the data security module comprises an identity authentication and authorization unit; the identity authentication and authorization unit is connected with the user management unit.

Further, authentication and authorization are the two core processes typically involved whenever a party attempts to interact with an IT system. These core processes help ensure the security of the system in the face of attacks:

Authentication is the process of confirming that a party interacting with the system has the identity it claims. A human party is typically authenticated by providing a username and password pair. A number of more advanced and sophisticated mechanisms are available to perform authentication, including biometric authentication, multi-factor authentication and the like. The object being authenticated (a person or a particular subsystem) is often referred to as the principal.

The authorization mechanism is used to determine which operations a principal is allowed to perform on the system or which resources the principal may access. The authorization flow is typically triggered after the authentication flow. Typically, when a principal passes authentication, information about the principal is provided to help determine which operations the principal can and cannot perform.

In monolithic applications, authentication and authorization are simple and commonplace because they are handled by the application itself; no advanced mechanism is needed to provide a secure user experience. In a microservice architecture, which is distributed by nature, more advanced schemes must be employed to avoid repeatedly requesting credentials between service calls. The identity of a principal should be verified only once. Such a single identity simplifies the authentication and authorization process, makes use of automation, and improves scalability.

Further, when a security policy is established for the microservice architecture, inter-service authentication and authorization are adopted as follows:

Trust boundary: containerization techniques (such as Docker) are used to reduce risk. The many features provided by Docker allow developers to flexibly maximize the security of microservices and of the entire application at different levels. When building service code, developers are free to use penetration testing tools to stress-test any part of the build cycle. Because the sources from which a Docker image is built are described explicitly, in declarative form, in the Docker build files (Dockerfile and Docker Compose files), developers can easily manage the image supply chain and enforce security policies when needed. In addition, services can easily be hardened by placing them into Docker containers and making them immutable, which adds a strong safeguard to the service.

Further, by employing a software defined infrastructure, private networks can be quickly created and configured using scripting languages, and strong security policies can be enforced at the network level.

Single sign-on (SSO) is used for internal interaction between services in the microservice architecture. This approach can reuse the existing infrastructure and simplifies access control for services by consolidating all access control operations in one enterprise access directory server.

Hash-based message authentication code (HMAC) over HTTP

In HMAC, the request content is hashed together with a secret key, and the resulting hash value is sent with the request. The other end of the communication then recomputes the hash value using its own copy of the secret key and the received request content. If the hash values match, the request is allowed to pass. If the request has been tampered with, the hash values do not match, and the receiving end detects this and reacts appropriately.
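As a non-limiting illustration, the signing and verification steps described above can be sketched with the Python standard library. The choice of SHA-256 as the hash function and the function names `sign` and `verify` are assumptions of this example; how the signature is carried in the HTTP request (for instance in a header) is left open.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Sender side: compute the HMAC of the request body with the shared key."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(secret: bytes, body: bytes, received_signature: str) -> bool:
    """Receiver side: recompute the HMAC and compare in constant time."""
    expected = sign(secret, body)
    return hmac.compare_digest(expected, received_signature)
```

A request whose body is altered in transit no longer matches its signature, so `verify` returns `False` and the receiver can reject it. The constant-time comparison avoids leaking information about the expected value through timing.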

Managing keys using a dedicated service

To eliminate the credential management overhead in a distributed model such as a microservice architecture, while still benefiting from the high security of the constructed system, one option is to use a comprehensive key management tool. Such a tool allows secrets (for example passwords, API keys and certificates) to be stored, dynamically leased, updated and revoked. These operations are very important in microservices because of the automation principles that microservices follow.
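As a non-limiting illustration of the four operations named above (storage, dynamic leasing, updating and revocation), the following in-memory sketch issues time-limited leases on stored secrets. It is a toy model rather than any real key management product; the class name `SecretStore`, its methods and the injected clock are assumptions of this example, and a real tool would also encrypt secrets at rest and authenticate callers.

```python
import secrets
import time
from typing import Callable, Dict, Optional, Tuple


class SecretStore:
    """Toy key management service with expiring leases."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._secrets: Dict[str, str] = {}
        self._leases: Dict[str, Tuple[str, float]] = {}  # lease id -> (name, expiry)

    def put(self, name: str, value: str) -> None:
        """Store a new secret or update an existing one."""
        self._secrets[name] = value

    def lease(self, name: str, ttl_seconds: float) -> str:
        """Grant time-limited access to a secret and return the lease id."""
        if name not in self._secrets:
            raise KeyError(name)
        lease_id = secrets.token_hex(8)
        self._leases[lease_id] = (name, self._clock() + ttl_seconds)
        return lease_id

    def read(self, lease_id: str) -> Optional[str]:
        """Return the secret for a valid lease, or None if expired or unknown."""
        entry = self._leases.get(lease_id)
        if entry is None:
            return None
        name, expiry = entry
        if self._clock() >= expiry:
            del self._leases[lease_id]  # drop the expired lease
            return None
        return self._secrets.get(name)

    def revoke(self, name: str) -> None:
        """Delete a secret and invalidate every lease that refers to it."""
        self._secrets.pop(name, None)
        self._leases = {k: v for k, v in self._leases.items() if v[0] != name}
```

Because a lease refers to the secret by name, a service holding a valid lease sees an updated value on its next read, and loses access as soon as the lease expires or the secret is revoked.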

It is to be understood that, although in theory no data encryption method is unbreakable, there are mature, proven and commonly used mechanisms (for example AES-128 or AES-256). These mechanisms should be used when security is being designed, rather than inventing in-house methods. In addition, the libraries that implement these mechanisms should be updated and patched promptly.

Key management tool: the first best practice is not to store keys and data in the same location. The complexity of key management should not violate the flexibility principle of the microservice architecture. A comprehensive tool designed with microservices in mind should be used, one that does not disrupt the continuous integration and continuous delivery pipeline.

Adjusting the security policy to business needs: security policies are developed on the basis of business needs and are adjusted continually, because the strategic goals may change constantly, as may the technologies involved in the solution.

Establishment of big data security guarantee system

1. Security architecture

The security guarantee system comprises a security protection system and a security management system. The security protection system comprises network security, system security, application security and data security; the security management system comprises security policy management specifications, a security organization model, and security rules and regulations.

2. Security protection system

The network security protection system mainly provides the network security protection means required by the access modes of data applications; some applications can use Virtual Private Network (VPN) technology to ensure the secure and reliable transmission of shared and exchanged data. Network-layer security protects the key applications and encrypted data of the platform, improves data transmission efficiency, and supports the rapid creation of new secure application environments to meet the requirements of new application processes. It mainly comprises four major functions: boundary protection, area protection, node protection and high network availability.

The system security system mainly comprises system operation security, system information security design, a trust service system and permission management design, ensuring the security of the system at every level.

The data security system mainly secures data exchange through four functions: encrypted data transmission (VPN), security of the data exchange process, secure design of the data exchange interfaces, and data auditing and protection.

3. Security management system

In building the security guarantee system, technical means alone can hardly prevent all potential security hazards, so a corresponding security management system needs to be established. Security management is the core link of the whole security construction. Under the guidance of security policies and with the support of security technologies and security products, an effective security organization can keep day-to-day security assurance simple and efficient.

The security management system mainly comprises security policies, a security organization and security regulations. To strengthen the security management of the client network and ensure the security of key facilities, the construction of the security management system should be strengthened.

The invention relates to a big data platform applied to an intelligent park. Its main feature is that the technical architecture of the whole project adopts a hybrid mode of Hadoop + MPP + in-memory database and also adopts Storm technology to support the acquisition and computation of real-time data, thereby realizing a highly concurrent, scalable and high-performance big data system. Data sharing and processing capabilities for databases, messages, files and other modes are supported. MapReduce jobs, SQL jobs, stream computing and in-memory computing are also supported. The rule engine is used to reduce the complexity of the components that implement complex business logic, increase the flexibility of marketing scenario configuration, reduce the maintenance cost of the application program and enhance the extensibility of the program. The scheme scales well: the processing capacity of the cluster can later be increased by horizontal expansion to meet the requirements of service development.

By adopting the big data technology Hadoop and a distributed architecture, the system has no single point of failure and offers high scalability and high availability. Large amounts of information can be indexed and searched in near real time, so that billions of files and PB-level data can be searched quickly, and comprehensive options are provided so that almost every aspect of the engine can be customized.

Data acquisition tasks are executed in parallel using MapReduce; the captured data is preliminarily sorted and submitted to the data storage layer, after which the data processing layer extracts structured information for data mining and analysis.
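As a non-limiting illustration of the map, shuffle and reduce stages on captured data, the following single-process sketch totals the size of captured page bodies per host. It imitates the programming model only and does not use the Hadoop API; the record fields `url` and `body` and all function names are assumptions of this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple
from urllib.parse import urlparse


def map_phase(record: Dict[str, str]) -> Iterator[Tuple[str, int]]:
    """Emit (host, body length) for one captured page."""
    host = urlparse(record["url"]).netloc
    yield host, len(record["body"])


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework does between stages."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Sum the body lengths collected for one host."""
    return key, sum(values)


def run_job(records: Iterable[Dict[str, str]]) -> Dict[str, int]:
    """Run the three stages sequentially over the captured records."""
    pairs = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

In a real cluster the map calls would run in parallel on different nodes and the grouped values would be routed to reducers over the network; the sequential version above only shows the data flow.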

A distributed database is used to store the original content of web pages; it is built on Hadoop + HBase and provides an online framework for real-time random reads and writes. It has very strong horizontal scalability, supports billions of rows and millions of columns, and supports real-time data acquisition.

The platform runs on a cluster of ordinary commodity hardware, adopts a distributed architecture, can be scaled out to thousands of machines, and has a fault-tolerance mechanism, so that the failure of some machine nodes does not cause data loss or the failure of computing tasks. It offers high availability, with rapid failover when a node fails, and high scalability: it can be expanded horizontally, increasing data storage capacity and computing speed simply by adding machines.

At the same time, the security guarantee system of the big data platform combines the protection of a technical security system with offline personnel security measures. This overcomes the potential security hazards of relying on technical protection alone and provides a higher level of security for the big data platform applied to the intelligent park. The security guarantee system comprises a security protection system and a security management system. The security protection system provides security mainly by technical means and covers network security, system security, application security and data security. The security management system mainly establishes a security organization under the guidance of leadership and formulates security protection regulations to achieve data security for the big data platform; it comprises security policy management specifications, a security organization model, and security rules and regulations.

The embodiments in this description are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Because the device disclosed in an embodiment corresponds to the method disclosed in the embodiment, its description is relatively brief, and the relevant points can be found in the description of the method.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A big data platform applied to an intelligent park, characterized by comprising: a data acquisition module, a data storage module, a data calculation module, a data application module and a platform management and control module;

2. The big data platform applied to the intelligent park according to claim 1, wherein the data acquisition module comprises: a data extraction unit, a data input end and a data output end; the data input end is connected with a data source; the data extraction unit is connected with the data input end, classifies the acquired data and transmits the data to the data storage module.

3. The big data platform applied to the intelligent park according to claim 1, wherein the data storage module comprises a distributed file unit, a distributed database and a distributed cache unit; the distributed file unit is provided with an uploading channel and a downloading channel and performs data interaction with the distributed database; and the distributed cache unit is connected with the distributed database for cache processing.

4. The big data platform applied to the intelligent park according to claim 1, wherein the data calculation module comprises: a MapReduce unit, a data warehouse unit, a machine learning and data mining library and a rule knowledge base; the data warehouse unit converts data files and runs them on the MapReduce unit; the machine learning and data mining library stores classic algorithms from the field of machine learning; the rule knowledge base matches rules through a rule engine.

5. The big data platform applied to the intelligent park according to claim 1, wherein the platform management and control module comprises: a cluster management unit, a host management unit, a user management unit and a cluster log management unit; the cluster management unit is connected with the data calculation module; the host management unit is connected with a host node; the user management unit manages the platform users; the cluster log management unit is respectively connected with the data acquisition module, the data storage module, the data calculation module and the data application module.

6. The big data platform applied to the intelligent park according to claim 5, further comprising a data security module; the data security module comprises an identity authentication and authorization unit; the identity authentication and authorization unit is connected with the user management unit.

7. An operation method for a big data platform of an intelligent park, characterized by comprising the following specific steps: