CN114896584B

CN114896584B - A Hive data permission control proxy layer method and system

Info

Publication number: CN114896584B
Application number: CN202210818903.5A
Authority: CN
Inventors: 卢薇
Original assignee: Hangzhou Bizhi Technology Co ltd
Current assignee: Hangzhou Bizhi Technology Co ltd
Priority date: 2022-07-13
Filing date: 2022-07-13
Publication date: 2022-10-11
Anticipated expiration: 2042-07-13
Also published as: CN114896584A

Abstract

The invention discloses a Hive data authority control agent layer method and a Hive data authority control agent layer system, which comprise the following steps: s1: hive data authority application; s2: resolving HQL; s3: HQL rewriting; s4: checking HQL authority; s5: HQL table, field blood margin analysis. The method and the system can meet the requirement of fine-grained data authority control. The method can meet the requirement of fine-grained data authority control in an enterprise, HQL analysis, rewriting and authority verification are realized on the basis of Hive prototypes, accuracy and stability are guaranteed to a certain extent, and the requirement of production on-line is met to a certain degree; the Java class Datacanthus authority and Datacanthus authority provided by the invention are both specifically realized for the Hive interface, and can realize data authentication at employee/user level; the method and the system do not need to invade a group big data cluster management platform during installation and deployment, and can meet the requirements of a large part of actual production scenes.

Description

A Hive data permission control proxy layer method and system

技术领域technical field

本发明涉及网络及数据处理技术领域，尤其是涉及一种Hive数据权限控制代理层方法及系统。The invention relates to the technical field of network and data processing, in particular to a Hive data authority control agent layer method and system.

背景技术Background technique

随着互联网开源技术和云计算等技术的广泛应用，性能和扩展能力等基础技术不再成为企业业务发展瓶颈。打通数据孤岛、最大化数据智能、数据驱动业务成为企业未来发展的核心竞争力。当下，随着大数据开源技术和社区的蓬勃发展，企业搭建数据中台通常选用以“Hadoop”为核心的大数据技术栈，比如HDFS（注：“Hadoop”生态中一种分布式文件存储系统）/HBase（注：“Hadoop”生态中一种基于列式存储的非关系性数据库系统）进行分布式存储、Hive（注：“Hadoop”生态中一种离线查询引擎）/Spark（注：“Hadoop”生态中一种基于内存的分布式计算框架和离线计算引擎）进行离线数据分析、Flink（注：“Hadoop”生态中一种能够实现流批一体化的实时计算引擎）进行实时数据分析、Yarn（注：“Hadoop”生态中一种通用资源管理系统）进行资源管理和分布式任务调度、Ranger/Sentry（注：“Hadoop”生态中两种不同的数据权限管控框架和组件）进行大数据集群的数据权限管控。然而，企业不同部门的数据汇入到数据中台后必须要解决的一个难题是数据权限控制问题。With the wide application of technologies such as Internet open source technology and cloud computing, basic technologies such as performance and scalability are no longer the bottleneck of enterprise business development. Breaking through data silos, maximizing data intelligence, and data-driven business have become the core competitiveness of an enterprise's future development. At present, with the vigorous development of big data open source technologies and communities, enterprises usually choose a big data technology stack with "Hadoop" as the core, such as HDFS (Note: a distributed file storage system in the "Hadoop" ecosystem )/HBase (Note: a non-relational database system based on columnar storage in the "Hadoop" ecosystem) for distributed storage, Hive (Note: an offline query engine in the "Hadoop" ecosystem)/Spark (Note: " A memory-based distributed computing framework and offline computing engine in the Hadoop ecosystem) for offline data analysis, Flink (Note: a real-time computing engine capable of stream-batch integration in the "Hadoop" ecosystem) for real-time data analysis, Yarn (note: a general resource management system in the "Hadoop" ecosystem) for resource management and distributed task scheduling, Ranger/Sentry (note: two different data permission control frameworks and components in the "Hadoop" ecosystem) for big data Data access control for clusters. However, a problem that must be solved after the data of different departments of the enterprise is imported into the data center is the problem of data permission control.

与此同时，为了整合资源，企业通常会有专门的大数据与安全部门负责搭建和管理以CDH（Cloudera’s Distribution Including Apache Hadoop, CDH）/HDP(HortonWorks Data Platform, HDP)/EMR（Elastic MapReduce, EMR）为代表的大数据集群管理平台，其他部门则以一个租户的身份向集团大数据集群管理平台申请Hive数据库、HDFS文件目录等资源。大数据与安全部门通过CDH/HDP/EMR自带的Ranger/Sentry组件为租户分配资源权限，从而实现租户之间的数据隔离。这种租户之间的数据隔离是一种粗粒度的数据权限控制方案，不能满足部门内部更细粒度的数据权限控制需求，比如员工A作为核心数据开发人员拥有所有表的的操作权限，而员工B作为普通访客只能拥有特定表的查看权限。这主要是因为部门中所有员工都是通过同一个租户与大数据集群进行交互，Ranger和Sentry的身份认证和数据鉴权只能是针对租户，不能再细化到租户底下的不同员工。At the same time, in order to integrate resources, enterprises usually have a dedicated big data and security department responsible for building and managing CDH (Cloudera's Distribution Including Apache Hadoop, CDH)/HDP (HortonWorks Data Platform, HDP)/EMR (Elastic MapReduce, EMR) ) as the representative of the big data cluster management platform, and other departments apply for Hive database, HDFS file directory and other resources to the group's big data cluster management platform as a tenant. The big data and security department allocates resource permissions to tenants through the Ranger/Sentry components that come with CDH/HDP/EMR, thereby realizing data isolation between tenants. This kind of data isolation between tenants is a coarse-grained data permission control scheme, which cannot meet the more fine-grained data permission control requirements within the department. B, as a common visitor, can only have viewing rights for specific tables. This is mainly because all employees in the department interact with the big data cluster through the same tenant. The identity authentication and data authentication of Ranger and Sentry can only be for the tenant, and cannot be further refined to different employees under the tenant.

目前常被提出的Hive数据权限方案大多还是基于Ranger和Sentry。它们通过Ranger和Sentry提供的对外接口，向Ranger Admin创建权限策略或者向Sentry创建角色并且为角色分配权限。除此之外，还有一部分方案是对Ranger和Sentry的插件进行二次改造，实现自定义的数据权限控制。由于Ranger和Sentry都是采用插件化的方式对Hive、HDFS等大数据组件进行数据权限管控，也即需要将插件内嵌到这些大数据组件中，这些方案都是侵入式的数据权限管控方案。对于大数据集群管理平台由集团进行统一管理的场景，部门想要通过自研权限插件并将权限插件部署到集团大数据集群管理平台是不被允许的。因此，在现实许多场景中，不侵入大数据集群的Hive数据权限控制代理层方案成为企业部门内部实现细粒度的数据权限控制的唯一选择。Most of the Hive data permission schemes that are often proposed are still based on Ranger and Sentry. They create permission policies to Ranger Admin or create roles to Sentry and assign permissions to roles through the external interfaces provided by Ranger and Sentry. In addition, another part of the solution is to re-engineer the plugins of Ranger and Sentry to realize custom data permission control. Because Ranger and Sentry both use a plug-in method to control data permissions for big data components such as Hive and HDFS, that is, plug-ins need to be embedded in these big data components. These solutions are intrusive data permission control solutions. For the scenario where the big data cluster management platform is managed by the group in a unified manner, it is not allowed for the department to develop and deploy the permission plug-in to the group's big data cluster management platform through self-developed permission plug-ins. Therefore, in many real scenarios, the Hive data permission control agent layer solution that does not invade the big data cluster has become the only choice for implementing fine-grained data permission control within the enterprise department.

如前所述，目前已被提出的Hive数据权限方案中，大多还是基于Ranger和Sentry的插件方案。它们通过Ranger和Sentry提供的对外接口，向Ranger Admin（注：Ranger的策略管理中心）创建权限策略或者向Sentry创建角色并且为角色分配权限，然后利用Ranger和Sentry原生或经过二次改造后的Hive插件从Ranger Admin和Sentry拉取权限策略，完成数据权限鉴权。这些方案都需要将Ranger和Sentry的Hive插件放置在CDH/HDP/EMR Hive的依赖包目录下，并且必须配置和重启Hive才能生效。一方面，Hive插件会侵入HiveServer2（注：Hive的服务端）执行HQL（或HiveQL，一种Hive提供的SQL方言）过程，Hive插件的安全性对于集团大数据集群管理平台通常是不被信任的。这也是为什么在许多现实场景，特别是大数据集群管理平台由集团某特定部门统一管理和维护，其他部门共享使用的场景，第三方提供的Hive插件是不被允许放入到大数据集群管理平台。另一方面，Hive插件的安装和更新需要配置和重启Hive，这会导致Hive服务的中断，进而导致Hive任务执行结果不能正常返回，从而造成生产事故。由于这两方面原因，现有的Hive数据权限方案只能适用于部门独立部署和维护CDH/HDP/EMR场景，Hive插件的安全性保障、安装和更新都由部门自己负责和承担。因此，为数据中台必不可少的大数据查询引擎Hive实现数据权限代理层方案具有十分重要的生产价值。所谓“数据权限代理层”是指在不侵入大数据集群管理平台的前提下，在大数据集群管理平台之外代理提供数据鉴权服务。针对大多数大数据集群管理平台由集团集中管控的现实场景，本发明提出一种Hive数据权限控制代理层方案，它能在不使用任何插件侵入集团大数据集群管理平台的前提下，满足部门内部更细粒度的数据权限控制需求。As mentioned above, most of the proposed Hive data permission schemes are based on Ranger and Sentry plug-in schemes. Through the external interface provided by Ranger and Sentry, they create permission policies to Ranger Admin (Note: Ranger's policy management center) or create roles to Sentry and assign permissions to roles, and then use Ranger and Sentry's native or re-transformed Hive The plugin pulls permission policies from Ranger Admin and Sentry to complete data permission authentication. These solutions all require the Hive plugins of Ranger and Sentry to be placed in the CDH/HDP/EMR Hive dependency package directory, and Hive must be configured and restarted to take effect. On the one hand, the Hive plug-in will invade HiveServer2 (Note: Hive's server) to execute the HQL (or HiveQL, a SQL dialect provided by Hive) process. The security of the Hive plug-in is usually not trusted for the group big data cluster management platform. . This is why in many realistic scenarios, especially the big data cluster management platform is managed and maintained by a specific department of the group and shared by other departments, Hive plug-ins provided by third parties are not allowed to be put into the big data cluster management platform . On the other hand, the installation and update of Hive plug-ins require Hive to be configured and restarted, which will lead to interruption of Hive services, resulting in the failure of Hive task execution results to return normally, resulting in production accidents. Due to these two reasons, the existing Hive data permission scheme can only be applied to the CDH/HDP/EMR scenario where the department independently deploys and maintains. Therefore, it is of great production value to realize the data authority agent layer scheme for Hive, an indispensable big data query engine in the data center. The so-called "data authority agent layer" refers to the provision of data authentication services by an agent outside the big data cluster management platform without invading the big data cluster management platform. Aiming at the realistic scenario that most big data cluster management platforms are centrally managed and controlled by the group, the present invention proposes a Hive data authority control agent layer scheme, which can meet the requirements of internal departments without using any plug-ins to invade the group big data cluster management platform. More fine-grained data permission control requirements.

发明内容SUMMARY OF THE INVENTION

针对现有技术存在的问题，本发明的目的在于提供一种Hive数据权限控制代理层方法及系统，能在不使用任何插件侵入集团大数据集群管理平台的前提下，满足部门内部更细粒度的数据权限控制需求。In view of the problems existing in the prior art, the purpose of the present invention is to provide a Hive data authority control agent layer method and system, which can satisfy the needs of more fine-grained internal departments without using any plug-ins to invade the group big data cluster management platform. Data access control requirements.

为实现上述目的，本发明提供一种Hive数据权限控制代理层方法，所述方法包括以下步骤：In order to achieve the above object, the present invention provides a Hive data authority control proxy layer method, the method comprises the following steps:

S1：Hive数据权限申请；当部门的数据权限审批服务批准员工的Hive数据权限申请时，数据权限管理中心同步创建相应的数据权限策略，存储在表及字段权限模块和行过滤及字段脱敏模块中，并且更新用户权限管理中员工与数据权限策略之间的映射关系；S1: Hive data permission application; when the department's data permission approval service approves the employee's Hive data permission application, the data permission management center synchronously creates the corresponding data permission policy, which is stored in the table and field permission module and row filtering and field masking module , and update the mapping relationship between employees and data permission policies in user rights management;

S2：HQL解析；在HQL解析之前，需要利用部门与集团大数据集群管理平台交互的租户keytab进行kerberos认证，从而实现租户之间的数据隔离；S2: HQL parsing; before HQL parsing, it is necessary to use the tenant keytab that interacts with the group's big data cluster management platform for kerberos authentication, so as to achieve data isolation between tenants;

S3：HQL改写；SemanticAnalyzer在对HQL进行解析的过程中，通过Hive的TableMask对象对HQL进行行过滤和字段脱敏改写；S3: HQL rewriting; SemanticAnalyzer performs row filtering and field desensitization rewriting of HQL through Hive's TableMask object in the process of parsing HQL;

S4：HQL权限校验；基于QueryState、SemanticAnalyzer和HQL，调用Driver的静态doAuthorization方法，实现HQL数据权限校验；S4: HQL authorization verification; based on QueryState, SemanticAnalyzer and HQL, call the Driver's static doAuthorization method to realize HQL data authorization verification;

S5：HQL表、字段血缘分析；当HQL鉴权通过后，进行HQL表、字段血缘分析。S5: HQL table and field blood relationship analysis; after HQL authentication is passed, HQL table and field blood relationship analysis is performed.

进一步，当员工提交Hive任务，基于JDBC的HQL提交模块将HQL和员工的身份信息一并发送给Hive数据鉴权代理单元进行鉴权，HQL鉴权的过程为上述HQL解析、HQL改写、HQL权限校验和HQL表、字段血缘分析；若HQL鉴权通过，基于JDBC的HQL提交模块将改写后HQL提交到集团大数据集群管理平台上去执行。Further, when an employee submits a Hive task, the JDBC-based HQL submission module sends the HQL and the employee's identity information to the Hive data authentication agency unit for authentication. The process of HQL authentication is the above-mentioned HQL analysis, HQL rewriting, and HQL authority. Checksum HQL table and field blood relationship analysis; if HQL authentication is passed, the JDBC-based HQL submission module submits the rewritten HQL to the group big data cluster management platform for execution.

进一步，步骤S2中，kerberos认证通过后，创建HiveConf；HiveConf的创建依赖集团大数据集群管理平台提供的Hadoop和Hive配置文件，所述配置文件包括core-site.xml、hdfs-site.xml、mapred-site.xml、yarn-site.xml和hive-site.xml。Further, in step S2, after the kerberos authentication is passed, a HiveConf is created; the creation of the HiveConf relies on the Hadoop and Hive configuration files provided by the group's big data cluster management platform, and the configuration files include core-site.xml, hdfs-site.xml, mapred -site.xml, yarn-site.xml and hive-site.xml.

进一步，步骤S2中，HQL解析包括如下子步骤：Further, in step S2, the HQL parsing includes the following sub-steps:

S201，利用HiveConf创建SessionState对象，设置SessionState对象的userName为提交HQL的员工账户；S201, use HiveConf to create a SessionState object, and set the userName of the SessionState object as the employee account submitting HQL;

S202，启动SessionState对象，设置当前数据库为部门在大数据集群管理平台申请的Hive数据库，初始化事务管理器；SessionState对象创建和启动后，有效且保持唯一，能够与Hadoop进行通信来提交分布式任务，也能够连接Hive的元数据库来查询元数据信息；S202, start the SessionState object, set the current database as the Hive database applied by the department on the big data cluster management platform, and initialize the transaction manager; after the SessionState object is created and started, it is valid and unique, and can communicate with Hadoop to submit distributed tasks. It can also connect to Hive's metadata database to query metadata information;

S203，依次创建QueryState、Context、ParseDriver对象；调用ParseDriver对象的parse方法将原始HQL解析成为抽象语法树节点；利用Hive的SemanticAnalyzerFactory的get方法生成QueryState和ASTNode对应的SemanticAnalyzer；S203, create QueryState, Context, and ParseDriver objects in sequence; call the parse method of the ParseDriver object to parse the original HQL into an abstract syntax tree node; use the get method of Hive's SemanticAnalyzerFactory to generate the SemanticAnalyzer corresponding to QueryState and ASTNode;

S204，调用SemanticAnalyzer的analyze方法对HQL进行解析。S204, call the analyze method of SemanticAnalyzer to analyze the HQL.

进一步，步骤S3中，HQL改写的过程包括如下子步骤：Further, in step S3, the process of HQL rewriting includes the following sub-steps:

S301，对HQL的ASTNode进行遍历和解析，获取表及字段信息；S301, traverse and parse the ASTNode of HQL to obtain table and field information;

S302，通过DatablackHiveAuthorizer从数据权限管理中心拉取与SessionState的userName对应的表及字段的行过滤和字段脱敏权限策略，并且调用applyRowFilterAndColumnMasking方法以便TableMask对象正确获取行过滤和字段脱敏表达式；S302, pull the row filtering and field masking permission policies of the table and field corresponding to the userName of SessionState from the data rights management center through DatablackHiveAuthorizer, and call the applyRowFilterAndColumnMasking method so that the TableMask object can correctly obtain the row filtering and field masking expressions;

S303,根据行过滤和字段脱敏表达式对原始HQL的Token流进行改写，将最新HQL的Token流保存在Context对象中。S303, rewrite the Token stream of the original HQL according to the row filtering and field desensitization expressions, and save the Token stream of the latest HQL in the Context object.

进一步，步骤S4中，若HQL权限校验成功，该方法正常返回，否则，抛出权限校验失败异常；Driver的权限校验底层依赖DatablackHiveAuthorizer类；DatablackHiveAuthorizer类是本发明提供的一个对Hive的HiveAuthorizer接口的实现，实现了checkPrivileges、applyRowFilterAndColumnMasking和needTransform方法；Driver的doAuthorization解析出HQL的HiveOperationType、输入和输出HivePrivilegeObject和认证上下文，然后调用DatablackHiveAuthorizer的checkPrivileges方法；checkPrivileges方法向数据权限管理中心拉取用户权限策略，解析输入输出HivePrivilegeObject中涉及的表和字段以及操作类型，然后匹配用户权限策略；若所有输入输出HivePrivilegeObject对象都权限验证通过，方法正常返回，否则抛出权限校验失败异常。Further, in step S4, if the HQL permission verification is successful, the method returns normally, otherwise, an exception of permission verification failure is thrown; the bottom layer of the driver's permission verification depends on the DatablackHiveAuthorizer class; the DatablackHiveAuthorizer class is a HiveAuthorizer for Hive provided by the present invention The implementation of the interface implements the checkPrivileges, applyRowFilterAndColumnMasking and needTransform methods; the driver's doAuthorization parses the HQL HiveOperationType, input and output HivePrivilegeObject and authentication context, and then calls the checkPrivileges method of the DatablackHiveAuthorizer; the checkPrivileges method pulls the user permission policy from the data rights management center, Parse the tables, fields and operation types involved in the input and output HivePrivilegeObject, and then match the user permission policy; if all input and output HivePrivilegeObject objects pass the permission verification, the method returns normally, otherwise, the permission verification failure exception is thrown.

进一步，步骤S5中，HQL表、字段血缘分析具体包括以下步骤：Further, in step S5, the HQL table and field blood relationship analysis specifically includes the following steps:

S501，基于HiveConf、QueryState、SemanticAnalyzer以及HQL，创建QueryPlan和HookContext对象；S501, based on HiveConf, QueryState, SemanticAnalyzer and HQL, create QueryPlan and HookContext objects;

S502，调用本发明提供的Java类ColumnLineageAnalysis的run方法，返回HQL中表、字段血缘关系；ColumnLineageAnalysis是Hive的LineageLogger的继承类，重写了其中的run方法，能够返回HQL的表、字段血缘关系；S502, call the run method of the Java class ColumnLineageAnalysis provided by the present invention, and return the blood relationship of tables and fields in HQL; ColumnLineageAnalysis is an inheritance class of LineageLogger of Hive, and rewrites the run method therein to return the blood relationship of tables and fields in HQL;

S503，Hive数据鉴权代理将HQL权限校验结果、改写后的HQL以及表、字段血缘分析结果发送给基于JDBC的HQL提交模块。S503, the Hive data authentication agent sends the HQL authority verification result, the rewritten HQL, and the table and field blood relationship analysis results to the JDBC-based HQL submission module.

另一方面，本发明还提供一种Hive数据权限控制代理层系统，所述系统用于实现根据本发明所述的Hive数据权限控制代理层方法。In another aspect, the present invention also provides a Hive data permission control proxy layer system, which is used to implement the Hive data permission control proxy layer method according to the present invention.

进一步，所述系统包括集团大数据集群管理平台,用于为使用平台的各个部门开通和注册租户，并且通过Ranger Admin提前配置各租户使用Hive数据库和HDFS文件目录权限；还包括嵌入到HiveServer2的Hive插件和HDFS的HDFS插件；用于周期地从Ranger Admin拉取权限策略，保存在本地策略仓库。Further, the system includes a group big data cluster management platform, which is used to activate and register tenants for various departments using the platform, and configure the permissions of each tenant to use the Hive database and HDFS file directory in advance through the Ranger Admin; it also includes Hive embedded in HiveServer2. Plugin and HDFS plugin for HDFS; used to periodically pull permission policies from Ranger Admin and save them in the local policy repository.

进一步，还包括数据权限管理中心，所述数据权限管理中心设有表及字段权限模块、行过滤及字段脱敏模块和用户权限管理模块；其中，表及字段权限是从元数据维度定义数据权限；行过滤及字段脱敏是从数据维度定义数据权限；Hive数据鉴权代理单元进行HQL解析、HQL改写、HQL权限校验以及HQL表、字段血缘分析。Further, it also includes a data authority management center, which is provided with a table and field authority module, a row filtering and field desensitization module, and a user authority management module; wherein, the table and field authority defines data authority from the metadata dimension ;Row filtering and field desensitization define data permissions from the data dimension; Hive data authentication agent unit performs HQL parsing, HQL rewriting, HQL permission verification, and HQL table and field bloodline analysis.

本发明的有益效果如下：1）本发明的技术方案架构设计能够满足企业内部细粒度的数据权限控制需求，2）本发明的技术方案实现是基于Hive原生类实现HQL解析、改写和权限校验，准确度和稳定性有一定保障，达到生产上线要求有一定把握，3）本发明提供的Java类DatablackHiveAuthorizerFactory和DatablackHiveAuthorizer都是对Hive接口的具体实现，能够实现员工/用户级别的数据鉴权，4）本发明的安装和部署都不需要侵入集团大数据集群管理平台，能够满足很大一部分实际生产场景需求。The beneficial effects of the present invention are as follows: 1) the technical solution architecture design of the present invention can meet the fine-grained data permission control requirements within the enterprise; 2) the technical solution implementation of the present invention is to realize HQL parsing, rewriting and permission verification based on the Hive native class , the accuracy and stability are guaranteed to a certain extent, and there is a certain degree of confidence in meeting the production and online requirements. 3) The Java classes DatablackHiveAuthorizerFactory and DatablackHiveAuthorizer provided by the present invention are both specific implementations of the Hive interface, which can realize data authentication at the employee/user level. 4 ) The installation and deployment of the present invention do not need to invade the group big data cluster management platform, and can meet the needs of a large part of actual production scenarios.

附图说明Description of drawings

图1示出了根据本发明实施例中Hive数据权限控制代理层方法及系统的架构设计示意图；1 shows a schematic diagram of the architecture design of the Hive data permission control proxy layer method and system according to an embodiment of the present invention;

图2示出了根据本发明实施例中测试表结构及其数据示意图；2 shows a schematic diagram of a test table structure and its data according to an embodiment of the present invention;

图3示出了根据本发明实施例中数据权限策略表示意图；3 shows a schematic diagram of a data permission policy table according to an embodiment of the present invention;

图4示出了根据本发明实施例中员工与数据权限策略映射关系表示意图；FIG. 4 shows a schematic diagram of a mapping relationship table between employees and data permission policies according to an embodiment of the present invention;

图5示出了根据本发明实施例中HQL鉴权流程示意图；5 shows a schematic diagram of an HQL authentication process according to an embodiment of the present invention;

图6示出了根据本发明实施例中HQL安全认证、鉴权、血缘分析关键配置参数示意图；6 shows a schematic diagram of key configuration parameters of HQL security authentication, authentication, and blood relationship analysis according to an embodiment of the present invention;

图7中的（a）为根据本发明实施例中HQL改写前后对比图，（b）为查询结果示意图；(a) in FIG. 7 is a comparison diagram before and after HQL rewriting according to an embodiment of the present invention, and (b) is a schematic diagram of a query result;

图8示出了根据本发明实施例中HQL鉴权过程示意图；8 shows a schematic diagram of an HQL authentication process according to an embodiment of the present invention;

图9示出了根据本发明实施例中HQL的表、字段血缘分析过程示意图。FIG. 9 shows a schematic diagram of a table and field blood relationship analysis process of HQL according to an embodiment of the present invention.

具体实施方式Detailed ways

下面将结合附图，对本发明的技术方案进行清楚、完整地描述，显然，所描述的实施例是本发明一部分实施例，而不是全部的实施例。基于本发明中的实施例，本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例，都属于本发明保护的范围。The technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are a part of the embodiments of the present invention, but not all of the embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative efforts shall fall within the protection scope of the present invention.

在本发明的描述中，需要说明的是，术语“中心”、“上”、“下”、“左”、“右”、“竖直”、“水平”、“内”、“外”等指示的方位或位置关系为基于附图所示的方位或位置关系，仅是为了便于描述本发明和简化描述，而不是指示或暗示所指的装置或元件必须具有特定的方位、以特定的方位构造和操作，因此不能理解为对本发明的限制。此外，术语“第一”、“第二”、“第三”仅用于描述目的，而不能理解为指示或暗示相对重要性。In the description of the present invention, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc. The indicated orientation or positional relationship is based on the orientation or positional relationship shown in the accompanying drawings, which is only for the convenience of describing the present invention and simplifying the description, rather than indicating or implying that the indicated device or element must have a specific orientation or a specific orientation. construction and operation, and therefore should not be construed as limiting the invention. Furthermore, the terms "first", "second", and "third" are used for descriptive purposes only and should not be construed to indicate or imply relative importance.

在本发明的描述中，需要说明的是，除非另有明确的规定和限定，术语“安装”、“相连”、“连接”应做广义理解，例如，可以是固定连接，也可以是可拆卸连接，或一体地连接；可以是机械连接，也可以是电连接；可以是直接相连，也可以通过中间媒介间接相连，可以是两个元件内部的连通。对于本领域的普通技术人员而言，可以根据具体情况理解上述术语在本发明中的具体含义。In the description of the present invention, it should be noted that the terms "installed", "connected" and "connected" should be understood in a broad sense, unless otherwise expressly specified and limited, for example, it may be a fixed connection or a detachable connection Connection, or integral connection; can be mechanical connection, can also be electrical connection; can be directly connected, can also be indirectly connected through an intermediate medium, can be internal communication between two elements. For those of ordinary skill in the art, the specific meanings of the above terms in the present invention can be understood according to specific situations.

以下结合图1-图9对本发明的具体实施方式进行详细说明。应当理解的是，此处所描述的具体实施方式仅用于说明和解释本发明，并不用于限制本发明。Specific embodiments of the present invention will be described in detail below with reference to FIGS. 1 to 9 . It should be understood that the specific embodiments described herein are only used to illustrate and explain the present invention, but not to limit the present invention.

文中涉及术语的释义：Definitions of terms involved in the text:

Hadoop：狭义上指Apache基金会所开发的一个开源分布式计算平台，广义上指以“Hadoop”为核心的大数据组件生态；Hadoop: In a narrow sense, it refers to an open source distributed computing platform developed by the Apache Foundation, and in a broad sense, it refers to the ecology of big data components with "Hadoop" as the core;

HDFS：“Hadoop”生态中一种分布式文件存储系统；HDFS: a distributed file storage system in the "Hadoop" ecosystem;

HBase：“Hadoop”生态中一种基于列式存储的非关系型数据库系统；HBase: a non-relational database system based on columnar storage in the "Hadoop" ecosystem;

Hive：“Hadoop”生态中一种离线查询引擎，提供一种被称为“HiveQL（HQL）”的SQL方言来存储、查询和分析存储在Hadoop中的大规模数据；Hive: An offline query engine in the "Hadoop" ecosystem, providing a SQL dialect called "HiveQL (HQL)" to store, query and analyze large-scale data stored in Hadoop;

Spark：“Hadoop”生态中一种基于内存的分布式计算框架和离线计算引擎；Spark: a memory-based distributed computing framework and offline computing engine in the "Hadoop" ecosystem;

Flink：“Hadoop”生态中一种能够实现流批一体化的实时计算引擎；Flink: A real-time computing engine that can realize stream-batch integration in the "Hadoop" ecosystem;

MapReduce：“Hadoop”生态中一种最基础的分布式计算框架和编程模型；MapReduce: One of the most basic distributed computing frameworks and programming models in the "Hadoop" ecosystem;

Yarn：“Hadoop”生态中一种通用资源管理系统；Yarn: a general resource management system in the "Hadoop" ecosystem;

CDH： Hadoop稳定发行版以及大数据集群管理平台；CDH: Hadoop stable release and big data cluster management platform;

HDP：开源大数据集群管理平台；HDP: open source big data cluster management platform;

EMR： Hadoop稳定发行版以及大数据集群管理平台；EMR: Hadoop stable release and big data cluster management platform;

Ranger：“Hadoop”生态中一种基于权限策略的数据权限管控框架和组件，已被HDP和EMR集成；Ranger: A data permission management and control framework and components based on permission policy in the "Hadoop" ecosystem, which has been integrated by HDP and EMR;

Sentry：“Hadoop”生态中一种基于角色的数据权限管控框架和组件，已被CDH集成；Sentry: a role-based data permission control framework and components in the "Hadoop" ecosystem, which has been integrated by CDH;

Ranger Admin：Ranger的权限策略和组件管理中心；Ranger Admin: Ranger's permission policy and component management center;

HiveServer2：Hive提供的一种能使客户端执行HQL的服务；HiveServer2: A service provided by Hive that enables clients to execute HQL;

Kerberos：“Hadoop”生态支持的一种计算机网络授权协议，用来在非安全网络中，对个人通信以安全的手段进行身份认证；Kerberos: A computer network authorization protocol supported by the "Hadoop" ecosystem, used to authenticate personal communications in a secure manner in a non-secure network;

Keytab：Kerberos服务器颁发给租户的身份认证票据；Keytab: The authentication ticket issued by the Kerberos server to the tenant;

JDBC：Java语言中用来规范客户端程序访问数据库的应用程序接口，一般指Java数据库连接；JDBC: An application program interface used in the Java language to standardize client programs accessing databases, generally referring to Java database connections;

HiveConf：Hive源码中提供的Java类，用于在内存中存储Hadoop和Hive配置，继承自Hadoop的配置类Configuration；HiveConf: Java class provided in Hive source code, used to store Hadoop and Hive configuration in memory, inherited from Hadoop configuration class Configuration;

SessionState：Hive源码中提供的Java类，用于创建一个Hive会话并且维护会话中的状态；SessionState: The Java class provided in the Hive source code to create a Hive session and maintain the state in the session;

QueryState：Hive源码中提供的Java类，用于维护一个HQL查询的状态信息，比如HiveConf和HQL对应的操作类型；QueryState: Java class provided in the Hive source code to maintain the state information of an HQL query, such as the operation type corresponding to HiveConf and HQL;

ParseDriver：Hive源码中提供的Java类，提供的parse方法能够将一个HQL解析成相应的抽象语法树节点ASTNode；ParseDriver: The Java class provided in the Hive source code, the provided parse method can parse an HQL into the corresponding abstract syntax tree node ASTNode;

ASTNode：Hive源码中提供的Java类，用于表示一个HQL的抽象语法树；ASTNode: Java class provided in Hive source code, used to represent an abstract syntax tree of HQL;

SemanticAnalyzerFactory：Hive源码中提供的Java类，提供的get方法能够根据ASTNode和QueryState生成相应的语法分析器SemanticAnalyzer；SemanticAnalyzerFactory: The Java class provided in the Hive source code, the provided get method can generate the corresponding SemanticAnalyzer according to ASTNode and QueryState;

SemanticAnalyzer：Hive源码中提供的Java类，提供的analyze方法能够对ASTNode进行语义分析和优化，解析其中涉及的表、字段等元数据信息，并且对HQL进行行过滤和字段脱敏改写；SemanticAnalyzer: The Java class provided in the Hive source code. The analyze method provided can perform semantic analysis and optimization on ASTNode, parse metadata information such as tables and fields involved, and perform row filtering and field desensitization rewriting for HQL;

Context：Hive源码中提供的Java类，为SemanticAnalyzer提供上下文环境，维护HQL中的Token流，SemanticAnalyzer改写后HQL的Token流保存在Context中；Context: The Java class provided in the Hive source code, which provides the context for SemanticAnalyzer and maintains the Token stream in HQL. After the SemanticAnalyzer rewrites, the HQL Token stream is stored in the Context;

TableMask：Hive源码中提供的Java类，通过从HiveAuthorizer中获取的行过滤和字段脱敏表达式，对HQL进行改写，并且将改写后的HQL保存在Context中，TableMask对象在SemanticAnalyzer对象生成的时候被生成；TableMask: The Java class provided in the Hive source code rewrites the HQL through the row filtering and field desensitization expressions obtained from the HiveAuthorizer, and saves the rewritten HQL in the Context. The TableMask object is generated when the SemanticAnalyzer object is generated. generate;

HiveAuthorizer：Hive源码中提供的Java接口，定义了一系列未实现方法，其中checkPrivileges方法用于用户数据权限校验，needTransform方法用于标注是否HQL改写和applyRowFilterAndColumnMasking方法用于获取表的行过滤和字段脱敏表达式；HiveAuthorizer: The Java interface provided in the Hive source code defines a series of unimplemented methods, among which the checkPrivileges method is used for user data permission verification, the needTransform method is used to mark whether HQL is rewritten, and the applyRowFilterAndColumnMasking method is used to obtain table row filtering and field removal. sensitive expression;

DatablackHiveAuthorizer：Hive源码中HiveAuthorizer接口类的一个实现类，提供了checkPrivileges、needTransform和applyRowFilterAndColumnMasking三个方法的具体实现；DatablackHiveAuthorizer: An implementation class of the HiveAuthorizer interface class in the Hive source code, providing specific implementations of three methods: checkPrivileges, needTransform and applyRowFilterAndColumnMasking;

HiveAuthorizerFactory：Hive源码中提供的Java接口，定义了一个未实现的createHiveAuthorizer方法，是HiveAuthorizer的工厂类；HiveAuthorizerFactory: The Java interface provided in the Hive source code defines an unimplemented createHiveAuthorizer method, which is the factory class of HiveAuthorizer;

DatablackHiveAuthorizerFactory：Hive源码中HiveAuthorizerFactory接口类的一个实现类，实现了createHiveAuthorizer方法，来生成DatablackHiveAuthorizer的实例对象；DatablackHiveAuthorizerFactory: An implementation class of the HiveAuthorizerFactory interface class in the Hive source code, which implements the createHiveAuthorizer method to generate an instance object of DatablackHiveAuthorizer;

Driver：Hive源码中提供的Java类，是Hive的驱动，用于对HQL进行解析、编译、优化和执行，提供一个静态的doAuthorization方法能对用户数据权限进行校验；Driver: The Java class provided in the Hive source code is the driver of Hive, which is used to parse, compile, optimize and execute HQL, and provide a static doAuthorization method to verify user data permissions;

HiveOperationType：Hive源码中提供的Java枚举类，用于表示HQL的操作类型，比如新增数据库、表查询、表插入等；HiveOperationType: The Java enumeration class provided in the Hive source code is used to represent the operation type of HQL, such as adding a database, table query, table insertion, etc.;

HivePrivilegeObject：Hive源码中提供的Java类，用于表示一个待鉴权对象，其中记录了数据库、表、字段、分区以及操作类型等信息；HivePrivilegeObject: A Java class provided in the Hive source code to represent an object to be authenticated, which records information such as database, table, field, partition, and operation type;

QueryPlan：Hive源码中提供的Java类，用于记录一个HQL所对应的输入输出格式以及查询计划；QueryPlan: Java class provided in Hive source code, used to record the input and output format and query plan corresponding to an HQL;

HookContext：Hive源码中提供的Java类，用于在HQL执行前后为钩子类执行run方法提供上下文环境，比如QueryPlan和QueryState；HookContext: The Java class provided in the Hive source code is used to provide context for the hook class to execute the run method before and after HQL is executed, such as QueryPlan and QueryState;

LineageLogger：Hive源码中提供的Java类，是一种HQL执行后钩子类，用于对HQL的表、字段的血缘进行解析，然后打印为日志；LineageLogger: The Java class provided in the Hive source code is a HQL post-execution hook class, which is used to parse the lineage of HQL tables and fields, and then print them as logs;

ColumnLineageAnalysis：Hive源码中LineageLogger类的继承类，重写了LineageLogger类的run方法，能够以特定格式输出HQL的表、字段血缘关系。ColumnLineageAnalysis: The inheritance class of the LineageLogger class in the Hive source code, rewrites the run method of the LineageLogger class, and can output the blood relationship of HQL tables and fields in a specific format.

图1为本发明的细粒度Hive数据权限控制代理层方法及系统的总体技术方案架构设计图。如图1下半部分所示，在所述系统中，安全中心负责管理和维护集团大数据集群管理平台100,为使用平台的各个部门开通和注册租户，并且通过Ranger Admin 120提前配置各租户使用Hive数据库和HDFS文件目录权限。所述安全中心位于图1中所示的大数据与安全部门110。嵌入到HiveServer2 130的Hive插件131和HDFS 140的HDFS插件141周期地从Ranger Admin 120拉取权限策略，保存在本地策略仓库。上层租户提交HQL给HiveServer2130时会经过HQL解析、编译、优化和执行四个阶段。在HQL编译阶段，HiveServer2 130能够触发Hive插件131对HQL解析的输入/输出权限对象（Hive Privilege Object）进行鉴权。Hive插件131将权限对象与本地缓存的权限策略进行逐条核对，如果某一权限对象被鉴定为不被允许，则立即中断HiveServer2 130的HQL执行，并且向Ranger Admin 120发送鉴权审计日志，返回租户鉴权失败明细。只有当Hive插件131鉴权通过后，HiveServer2 130向Yarn 150提交MapReduce任务来执行HQL。由于Hive的数据库和表存储在HDFS 140，HDFS插件141对绕过HiveServer2 130直接访问HDFS 140的租户进行数据鉴权。这里，Hive数据权限策略在创建时会同步创建对应的HDFS 140数据权限策略。FIG. 1 is an overall technical solution architecture design diagram of the fine-grained Hive data permission control proxy layer method and system of the present invention. As shown in the lower part of Figure 1, in the system, the security center is responsible for managing and maintaining the group big data cluster management platform 100, opening and registering tenants for each department using the platform, and configuring each tenant to use the Ranger Admin 120 in advance Hive database and HDFS file directory permissions. The security center is located in the big data and security department 110 shown in FIG. 1 . The Hive plug-in 131 embedded in the HiveServer2 130 and the HDFS plug-in 141 of the HDFS 140 periodically pull the permission policy from the Ranger Admin 120 and save it in the local policy repository. When an upper-level tenant submits HQL to HiveServer2130, it will go through four stages of HQL parsing, compilation, optimization, and execution. In the HQL compilation stage, the HiveServer2 130 can trigger the Hive plug-in 131 to authenticate the input/output privilege object (Hive Privilege Object) parsed by the HQL. The Hive plug-in 131 checks the permission objects with the locally cached permission policies one by one. If a permission object is identified as not allowed, it immediately interrupts the HQL execution of HiveServer2 130, and sends the authentication audit log to the Ranger Admin 120, and returns to the tenant Authentication failure details. Only after the Hive plug-in 131 is authenticated, the HiveServer2 130 submits a MapReduce task to the Yarn 150 to execute HQL. Since Hive's database and tables are stored in HDFS 140, HDFS plug-in 141 performs data authentication for tenants who bypass HiveServer2 130 and directly access HDFS 140. Here, when the Hive data permission policy is created, the corresponding HDFS 140 data permission policy will be created synchronously.

如图1中间部分所示，部门A数据分析平台200和部门B数据分析平台300分别凭借租户A.keytab和租户B.keytab，通过kerberos认证后，与集团大数据集群管理平台100上大数据组件进行交互。需要注意的是，虽然集团大数据集群管理平台100上的Ranger Admin120能够管理租户资源权限，实现租户之间的资源隔离，但租户之间的数据隔离是一种粗粒度的数据权限控制方案，无法满足部门A和部门B对内部员工更细粒度的数据权限需求。为此，本发明的创新和贡献主要集中在图1的上半部分关于数据权限管理中心400和Hive数据鉴权代理模块500的技术实现。As shown in the middle part of Figure 1, the data analysis platform 200 of department A and the data analysis platform 300 of department B rely on tenant A.keytab and tenant B.keytab, respectively, after passing kerberos authentication, and the big data components on the group big data cluster management platform 100 interact. It should be noted that although the Ranger Admin120 on the Group's big data cluster management platform 100 can manage tenant resource permissions and realize resource isolation between tenants, data isolation between tenants is a coarse-grained data permission control scheme that cannot be used. Meet the more fine-grained data permission requirements of department A and department B for internal employees. Therefore, the innovation and contribution of the present invention mainly focus on the technical implementation of the data rights management center 400 and the Hive data authentication agent module 500 in the upper part of FIG. 1 .

根据本发明的改进技术原理，在数据权限管理中心400，将Hive数据权限类别分为表及字段权限和行过滤及字段脱敏两类，所述数据权限管理中心400相应地设有表及字段权限模块410和行过滤及字段脱敏模块420。其中，表及字段权限是从元数据维度定义数据权限，比如表删除、修改、清空和字段查询权限；行过滤及字段脱敏是从数据维度定义数据权限。用户权限管理模块430负责维护部门员工与数据权限策略之间的映射关系。具体地讲，在部门A 200内部，员工A.b 221通过数据权限审批服务210向部门A 200申请Hive数据权限，数据权限审批通过后，数据权限管理中心400同步创建相应的数据权限策略，存储在表及字段权限模块410和行过滤及字段脱敏模块420中，并且更新用户权限管理430中员工与数据权限策略之间的映射关系。According to the improved technical principle of the present invention, in the data authority management center 400, the Hive data authority categories are divided into table and field authority, row filtering and field desensitization, and the data authority management center 400 is correspondingly provided with tables and fields Permissions module 410 and row filtering and field desensitization module 420. Among them, table and field permissions define data permissions from the metadata dimension, such as table deletion, modification, clearing, and field query permissions; row filtering and field desensitization define data permissions from the data dimension. The user rights management module 430 is responsible for maintaining the mapping relationship between department employees and data rights policies. Specifically, within department A 200, employee A.b 221 applies for Hive data permission to department A 200 through the data permission approval service 210. After the data permission is approved, the data permission management center 400 synchronously creates a corresponding data permission policy, which is stored in the table and field permission module 410 and row filtering and field desensitization module 420, and update the mapping relationship between employees and data permission policies in user permission management 430.

Hive数据鉴权代理单元500是本发明数据权限代理层方案的核心，负责进行HQL改写510、HQL权限校验520以及提供HQL表、字段血缘分析服务530。如图1的上半部分所示，部门A 200的员工A.b 221在申请Hive表权限后，提交Hive任务，基于JDBC的HQL提交模块230会将HQL和员工A.b的身份信息一并发送给Hive数据鉴权代理单元500进行鉴权。首先，Hive数据鉴权代理单元500向数据权限管理中心400拉取用户权限策略。然后，如果针对员工A.b存在行过滤和字段脱敏权限策略，Hive数据鉴权代理500在HQL解析时会对HQL进行改写，比如添加行过滤和字段脱敏表达式，从而实现行过滤和字段脱敏。类似于Ranger的Hive插件，在解析出HQL的输入/输出权限对象后，Hive数据鉴权代理单元500将权限对象与本地缓存的权限策略进行逐条核对，如果某一权限对象被鉴定为不被允许，则立即通知基于JDBC的HQL提交模块230中止HQL任务提交和用户鉴权失败明细，并且向数据权限管理中心400发送鉴权审计日志。最后，当鉴权通过后，Hive数据鉴权代理单元500对HQL进行表、字段血缘分析，将鉴权结果、改写后HQL以及表、字段血缘信息发送给基于JDBC的HQL提交模块230。基于JDBC的HQL提交模块230将改写后HQL提交到大数据管理平台100上的HiveServer2 130进行执行。The Hive data authentication agent unit 500 is the core of the data authorization agent layer scheme of the present invention, and is responsible for performing HQL rewriting 510, HQL authorization verification 520, and providing HQL table and field lineage analysis services 530. As shown in the upper part of Figure 1, employee A.b 221 of department A 200 submits a Hive task after applying for Hive table permission, and the JDBC-based HQL submission module 230 will send HQL and employee A.b's identity information to Hive data together The authentication proxy unit 500 performs authentication. First, the Hive data authentication proxy unit 500 pulls the user rights policy from the data rights management center 400 . Then, if there is a row filtering and field desensitization permission policy for employee A.b, the Hive data authentication agent 500 will rewrite the HQL during HQL parsing, such as adding row filtering and field desensitization expressions, so as to realize row filtering and field desensitization sensitive. Similar to the Hive plug-in of Ranger, after parsing the input/output permission objects of HQL, the Hive data authentication proxy unit 500 checks the permission objects and the locally cached permission policies one by one, if a certain permission object is identified as not allowed , then immediately notify the JDBC-based HQL submission module 230 to suspend HQL task submission and user authentication failure details, and send the authentication audit log to the data rights management center 400 . Finally, after the authentication is passed, the Hive data authentication proxy unit 500 performs table and field bloodline analysis on HQL, and sends the authentication result, the rewritten HQL, and table and field bloodline information to the JDBC-based HQL submission module 230. The JDBC-based HQL submission module 230 submits the rewritten HQL to the HiveServer2 130 on the big data management platform 100 for execution.

由图1所示，数据权限管理中心400和Hive数据鉴权代理单元500都是在集团大数据集群管理平台100之外，它们的安装和部署都不需要侵入集团大数据集群管理平台100。除此之外，Hive数据鉴权是在HQL提交到大数据集群管理平台100之前进行的，不干扰HiveServer2 130的执行过程，避免了安全问题的发生。最重要的是，Hive数据鉴权的粒度能够细化到部门内部员工，满足部门内部更细粒度的数据权限控制需求。因此，数据权限管理中心400和Hive数据鉴权代理单元500就是本发明的数据权限控制代理层。As shown in FIG. 1 , the data rights management center 400 and the Hive data authentication agent unit 500 are both outside the group big data cluster management platform 100 , and their installation and deployment do not need to invade the group big data cluster management platform 100 . In addition, Hive data authentication is performed before HQL is submitted to the big data cluster management platform 100, which does not interfere with the execution process of HiveServer2 130 and avoids the occurrence of security problems. The most important thing is that the granularity of Hive data authentication can be refined to employees within the department to meet the needs of finer-grained data permission control within the department. Therefore, the data rights management center 400 and the Hive data authentication proxy unit 500 are the data rights control proxy layers of the present invention.

本发明细粒度Hive数据权限控制代理层方法及系统的技术方案如下，所述方法包括以下步骤：The technical solutions of the fine-grained Hive data authority control proxy layer method and system of the present invention are as follows, and the method includes the following steps:

步骤S1：Hive数据权限申请。Step S1: Hive data permission application.

如图1所示，当部门的数据权限审批服务批准员工的Hive数据权限申请时，数据权限管理中心400同步创建相应的数据权限策略，存储在表及字段权限模块410和行过滤及字段脱敏模块420中，并且更新用户权限管理430中员工与数据权限策略之间的映射关系。As shown in Fig. 1, when the department's data permission approval service approves the employee's Hive data permission application, the data permission management center 400 creates the corresponding data permission policy synchronously, and stores it in the table and field permission module 410 and row filtering and field masking In module 420, the mapping relationship between employees and data permission policies in user rights management 430 is updated.

在一个具体实施例中，部门A 200在大数据集群管理平台100申请的Hive数据库为“hive_test”,其中有一张名为“ods_tbl_test”的Hive表，其表结构和数据如图2所示。员工A.b 221向数据权限审批服务210申请了表“ods_tbl_test”的查看权限，并且设置了行过滤和字段脱敏条件。审批通过后，数据权限管理中心400存储的数据权限策略如图3所示，其中有三条权限策略，分别是1）对库“hive_test”的表“ods_tbl_test”的全字段查询权限，2）对库“hive_test”的表“ods_tbl_test”的行过滤，和3）对库“hive_test”的表“ods_tbl_test”的字段脱敏。相应地，员工与数据权限策略之间的映射关系如图4所示。员工A.a 222还未申请任何权限，而员工A.b 221的权限策略ID集为{1,2,3}，对应着图3中的三条权限策略。当员工提交Hive任务，基于JDBC的HQL提交模块230会将HQL和员工的身份信息一并发送给Hive数据鉴权代理单元500进行鉴权。HQL鉴权的过程分为以下HQL解析、改写、权限校验和表、字段血缘分析四个步骤，对应技术实现流程如图5所示。技术实现流程图中的HiveConf、SessionState、QueryState、ParseDriver、ASTNode、SemanticAnalyzer、Driver、QueryPlan、HookContext均为Hive提供的Java类。以下的步骤S2-S5将详细介绍图5中的HQL鉴权过程。In a specific embodiment, the Hive database applied by the department A 200 in the big data cluster management platform 100 is "hive_test", which has a Hive table named "ods_tbl_test", and its table structure and data are shown in FIG. 2 . Employee A.b 221 applies to the data permission approval service 210 for the viewing permission of the table "ods_tbl_test", and sets row filtering and field masking conditions. After the approval is passed, the data permission policy stored in the data permission management center 400 is shown in Figure 3, in which there are three permission policies, namely 1) full field query permission for the table "ods_tbl_test" of the database "hive_test", 2) permission for the database Row filtering of table "ods_tbl_test" of "hive_test", and 3) desensitization of fields of table "ods_tbl_test" of library "hive_test". Correspondingly, the mapping relationship between employees and data permission policies is shown in Figure 4. Employee A.a 222 has not applied for any permission yet, while the permission policy ID set of employee A.b 221 is {1, 2, 3}, which corresponds to the three permission policies in FIG. 3 . When an employee submits a Hive task, the JDBC-based HQL submission module 230 will send the HQL together with the employee's identity information to the Hive data authentication agent unit 500 for authentication. The process of HQL authentication is divided into the following four steps: HQL parsing, rewriting, authority checksum table, and field blood relationship analysis. The corresponding technical implementation process is shown in Figure 5. HiveConf, SessionState, QueryState, ParseDriver, ASTNode, SemanticAnalyzer, Driver, QueryPlan, and HookContext in the technical implementation flowchart are all Java classes provided by Hive. The following steps S2-S5 will describe the HQL authentication process in FIG. 5 in detail.

步骤S2：HQL解析。Step S2: HQL parsing.

在HQL解析之前，首先需要利用部门与集团大数据集群管理平台100交互的租户keytab（注：keytab文件是kerberos服务器颁发给租户的身份认证票据）进行kerberos认证，从而实现租户之间的数据隔离。认证通过后，创建HiveConf。HiveConf的创建依赖大数据集群管理平台100提供的Hadoop和Hive配置文件。另外，在HiveConf创建的过程中必须设置一些关键参数来实现HQL的行过滤、字段脱敏以及数据鉴权。这些关键参数在图6中有详细说明，其中HQL安全鉴权管理器是本发明提供的Java类，主要作用是进行权限策略拉取以及HQL数据鉴权，也是本发明的技术核心。Before HQL analysis, it is necessary to use the tenant keytab (Note: the keytab file is the identity authentication ticket issued by the Kerberos server to the tenant) that the department interacts with the group's big data cluster management platform 100 for kerberos authentication, so as to achieve data isolation between tenants. After the authentication is passed, create a HiveConf. The creation of HiveConf relies on the Hadoop and Hive configuration files provided by the big data cluster management platform 100. In addition, some key parameters must be set in the process of HiveConf creation to realize HQL row filtering, field desensitization and data authentication. These key parameters are described in detail in FIG. 6 , wherein the HQL security authentication manager is a Java class provided by the present invention, and its main function is to perform authority policy pulling and HQL data authentication, which is also the technical core of the present invention.

HQL解析包括如下子步骤：HQL parsing includes the following sub-steps:

S201，利用HiveConf创建SessionState对象，设置SessionState对象的userName为提交HQL的员工账户。S201, use HiveConf to create a SessionState object, and set the userName of the SessionState object to the employee account submitting the HQL.

S202，启动SessionState对象，设置当前数据库为部门在大数据集群管理平台申请的Hive数据库，例如步骤S1示例中的“hive_test”，初始化事务管理器。SessionState对象创建和启动后，会在全局范围内有效且保持唯一，它能够与Hadoop进行通信来提交分布式任务，也能够连接Hive的元数据库来查询元数据信息。S202, start the SessionState object, set the current database to be the Hive database applied for by the department on the big data cluster management platform, such as "hive_test" in the example of step S1, and initialize the transaction manager. After the SessionState object is created and started, it will be valid and unique in the global scope. It can communicate with Hadoop to submit distributed tasks, and it can also connect to Hive's metadata database to query metadata information.

S203，依次创建QueryState、Context、ParseDriver对象。调用ParseDriver对象的parse方法将原始HQL解析成为抽象语法树节点（Abstract Syntax Tree Node, ASTNode）。利用Hive的SemanticAnalyzerFactory的get方法生成QueryState和ASTNode对应的SemanticAnalyzer。S203, QueryState, Context, and ParseDriver objects are created in sequence. Call the parse method of the ParseDriver object to parse the original HQL into an abstract syntax tree node (Abstract Syntax Tree Node, ASTNode). Use the get method of Hive's SemanticAnalyzerFactory to generate the SemanticAnalyzer corresponding to QueryState and ASTNode.

步骤S3：HQL改写。Step S3: HQL rewriting.

SemanticAnalyzer在对HQL进行解析的过程中，会通过Hive的TableMask对象对HQL进行行过滤和字段脱敏改写。HQL改写详细过程包括如下子步骤：In the process of parsing HQL, SemanticAnalyzer will perform row filtering and field desensitization rewriting of HQL through Hive's TableMask object. The detailed process of HQL rewriting includes the following sub-steps:

S302，通过本发明提供的DatablackHiveAuthorizer从数据权限管理中心拉取与SessionState的userName对应的表及字段的行过滤和字段脱敏权限策略，并且调用applyRowFilterAndColumnMasking方法以便TableMask对象正确获取行过滤和字段脱敏表达式；S302, pull the row filtering and field desensitization permission policies of the table and field corresponding to the userName of SessionState from the data authority management center through the DatablackHiveAuthorizer provided by the present invention, and call the applyRowFilterAndColumnMasking method so that the TableMask object can correctly obtain the row filtering and field desensitization expressions Mode;

S303，根据行过滤和字段脱敏表达式对原始HQL的Token流进行改写，将最新HQL的Token流保存在Context对象中。针对步骤S1中的示例，假设员工A.b提交的HQL为“selectid, principal_part, note from ods_tbl_test”，在拉取图3中的权限策略后，TaskMask改写前后的HQL如图7中的（a）所示，改写后的数据查询结果如图7中的（b）所示。通过对比图2和图7中的（b）的表数据，可以看出数据已经按照权限策略进行了行过滤和字段脱敏。具体地说，原本图2中表ods_tbl_test有5行记录。图3中的第1行表及字段数据权限策略赋予了员工A.b对表ods_tbl_test的全字段查询权限。与此同时，图3中第2行行过滤策略使得图7中的（a）改写后HQL相比原始HQL添加了以where开头的过滤条件，最终使得图7中的（b）的查询结果只有图2中最后2行记录，前3行记录因为不满足过滤条件在查询时被过滤。另外，图3中的第3行字段脱敏策略substr(note, 3)（注：从第3字符开始裁剪note字段的数据）被应用到图7中的（a）改写后HQL的note字段，使得图7中的（b）两条记录的note是图2中note的从第三个字符开始的裁剪版本。S303, rewrite the Token stream of the original HQL according to the row filtering and field desensitization expressions, and save the Token stream of the latest HQL in the Context object. For the example in step S1, assuming that the HQL submitted by employee A.b is "selectid, principal_part, note from ods_tbl_test", after pulling the permission policy in Figure 3, the HQL before and after TaskMask rewrite is shown in (a) in Figure 7 , the rewritten data query result is shown in (b) in Figure 7. By comparing the table data in Figure 2 and Figure 7 (b), it can be seen that the data has been row filtered and field desensitized according to the permission policy. Specifically, the original table ods_tbl_test in Figure 2 has 5 rows of records. The table and field data permission policy in the first row in Figure 3 grants employee A.b the full field query permission of the table ods_tbl_test. At the same time, the filtering strategy of the second row in Figure 3 makes the rewritten HQL in Figure 7 add a filter condition starting with where compared to the original HQL, and finally the query result of (b) in Figure 7 is only The last 2 rows of records in Figure 2, the first 3 rows of records are filtered during query because they do not meet the filtering conditions. In addition, the field desensitization strategy substr(note, 3) in line 3 in Figure 3 (Note: the data of the note field is cropped from the 3rd character) is applied to the note field of HQL after rewriting (a) in Figure 7, Make the note of the two records in (b) in Figure 7 a cropped version of the note in Figure 2 starting from the third character.

步骤S4：HQL权限校验。Step S4: HQL permission verification.

基于QueryState、SemanticAnalyzer和HQL，调用Driver的静态doAuthorization方法，实现HQL权限校验。HQL权限校验成功，该方法正常返回，否则，抛出权限校验失败异常。Driver的权限校验底层依赖本发明提供的DatablackHiveAuthorizer类。DatablackHiveAuthorizer类是Hive的HiveAuthorizer接口的实现类，实现了checkPrivileges、applyRowFilterAndColumnMasking和needTransform三个方法。图8展示了Driver权限校验的底层流程细节。Driver的doAuthorization会解析出HQL的HiveOperationType、输入输出HivePrivilegeObject以及认证上下文，然后调用DatablackHiveAuthorizer的checkPrivileges方法。checkPrivileges方法会向数据权限管理中心拉取用户权限策略，解析输入输出HivePrivilegeObject中涉及的表和字段以及操作类型，然后匹配用户权限策略。若所有输入输出HivePrivilegeObject对象都权限校验通过，方法正常返回，否则抛出权限校验失败异常。最终，checkPrivileges方法还会将鉴权审计的详细日志发送给数据权限管理中心，以备审计分析。Based on QueryState, SemanticAnalyzer, and HQL, call the Driver's static doAuthorization method to implement HQL authorization verification. If the HQL permission verification succeeds, the method returns normally; otherwise, a permission verification failure exception is thrown. The bottom layer of the driver's authority verification relies on the DatablackHiveAuthorizer class provided by the present invention. The DatablackHiveAuthorizer class is the implementation class of Hive's HiveAuthorizer interface, which implements three methods: checkPrivileges, applyRowFilterAndColumnMasking, and needTransform. Figure 8 shows the underlying process details of Driver permission verification. Driver's doAuthorization will parse out HQL's HiveOperationType, input and output HivePrivilegeObject, and authentication context, and then call the checkPrivileges method of DatablackHiveAuthorizer. The checkPrivileges method will pull the user permission policy from the data permission management center, parse the tables, fields and operation types involved in the input and output HivePrivilegeObject, and then match the user permission policy. If all the input and output HivePrivilegeObject objects pass the permission check, the method returns normally, otherwise the permission check failure exception is thrown. Finally, the checkPrivileges method will also send the detailed log of the authentication audit to the data rights management center for audit analysis.

步骤S5：HQL表、字段血缘分析。Step S5: HQL table and field blood relationship analysis.

当HQL权限校验通过后，可以进行HQL表、字段血缘分析。具体包括以下步骤：After the HQL permission verification is passed, the blood relationship analysis of HQL tables and fields can be performed. Specifically include the following steps:

S501，基于HiveConf、QueryState、SemanticAnalyzer以及HQL，创建QueryPlan和HookContext对象。S501, based on HiveConf, QueryState, SemanticAnalyzer and HQL, create QueryPlan and HookContext objects.

S502，调用本发明提供的Java类ColumnLineageAnalysis的run方法，返回HQL中表、字段血缘关系。ColumnLineageAnalysis是Hive的LineageLogger的继承类，重写了其中的run方法，能够返回HQL的表、字段血缘关系。例如，图7中的（a）中的HQL的表、字段血缘关系如图9所示，也即在图7中的HQL查询结果中，目标字段id，principal_part和note分别来自Hive数据库hive_test中表ods_tbl_test的字段id，principal_part和note，其中id和principal_part是源字段到目标字段的直接映射，而note通过了substr(note, 3)的函数变换。S502, call the run method of the Java class ColumnLineageAnalysis provided by the present invention, and return the blood relationship of the table and the field in the HQL. ColumnLineageAnalysis is an inheritance class of Hive's LineageLogger. It rewrites the run method and can return HQL table and field blood relationship. For example, the blood relationship of the HQL table and field in (a) in Figure 7 is shown in Figure 9, that is, in the HQL query result in Figure 7, the target field id, principal_part and note come from the table in the Hive database hive_test respectively The fields id, principal_part and note of ods_tbl_test, where id and principal_part are the direct mapping from the source field to the target field, and note is transformed by the function of substr(note, 3).

S503，Hive数据鉴权代理将HQL鉴权结果、改写后的HQL以及表、字段血缘分析结果发送给基于JDBC的HQL提交模块。若HQL鉴权通过，基于JDBC的HQL提交模块会将改写后HQL提交到集团大数据集群管理平台100上去执行。由于Hive数据鉴权代理单元500是针对员工/用户进行数据鉴权，因此能够满足企业内部更细粒度的数据权限控制需求。S503, the Hive data authentication agent sends the HQL authentication result, the rewritten HQL, and the table and field blood relationship analysis results to the JDBC-based HQL submission module. If the HQL authentication is passed, the JDBC-based HQL submission module will submit the rewritten HQL to the group big data cluster management platform 100 for execution. Since the Hive data authentication proxy unit 500 performs data authentication for employees/users, it can meet the needs of more fine-grained data permission control within the enterprise.

本发明关键技术重点包括：1）本发明的技术方案架构设计，如数据权限管理中心、Hive数据鉴权代理和基于JDBC的HQL提交之间的交互逻辑和功能职责；2）本发明提出的基于Hive原生类的HQL解析、改写、鉴权以及表、字段血缘分析的技术实现方案；3）本发明实现的安全鉴权相关的Java类DatablackHiveAuthorizer的checkPrivileges方法逻辑。The key technical points of the present invention include: 1) the technical solution architecture design of the present invention, such as the interaction logic and functional responsibilities between the data rights management center, the Hive data authentication agent and the JDBC-based HQL submission; The technical implementation scheme of HQL parsing, rewriting, authentication and table and field blood relationship analysis of Hive native classes; 3) The checkPrivileges method logic of the Java class DatablackHiveAuthorizer related to security authentication implemented by the present invention.

其中步骤S3中的HQL解析、改写也可以通过ANTLR4的访问者模式对ASTNode进行遍历，分析HQL中涉及的表、字段资源以及相应的访问方式。但是这种方式的准确度十分有限，不能覆盖复杂的HQL场景，因而通常不能达到生产上线标准。The HQL parsing and rewriting in step S3 can also be used to traverse the ASTNode through the visitor mode of ANTLR4, and analyze the tables, field resources and corresponding access methods involved in the HQL. However, the accuracy of this method is very limited, and it cannot cover complex HQL scenarios, so it usually cannot meet the production and launch standards.

另一方面，本发明还提供一种细粒度Hive数据权限控制代理层系统，所述系统用于实施根据本发明的细粒度Hive数据权限控制代理层方法。In another aspect, the present invention also provides a fine-grained Hive data permission control proxy layer system, which is used to implement the fine-grained Hive data permission control proxy layer method according to the present invention.

在本说明书的描述中，参考术语“实施例”、“示例”等的描述意指结合该实施例或示例描述的具体特征、结构、材料或者特点包含于本发明的至少一个实施例或示例中。在本说明书中，对上述术语的示意性表述不必须针对的是相同的实施例或示例。此外，本领域的技术人员可以在不产生矛盾的情况下，将本说明书中描述的不同实施例或示例以及其中的特征进行结合或组合。In the description of this specification, description with reference to the terms "embodiment," "example," etc. means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention . In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, those skilled in the art may combine or combine the different embodiments or examples described in this specification, as well as the features therein, without conflicting situations.

上述内容虽然已经示出和描述了本发明的实施例，可以理解的是，上述实施例是示例性的，不能理解为对本发明的限制，本领域的普通技术人员在本发明的范围内可以对上述实施例进行变化、修改、替换和变型等更新操作。Although the foregoing content has shown and described the embodiments of the present invention, it should be understood that the foregoing embodiments are exemplary and should not be construed as limitations of the present invention. The above embodiments carry out update operations such as changes, modifications, substitutions, and modifications.

Claims

1. A Hive data right control agent layer method, which is characterized in that the method comprises the following steps:

s1: hive data authority application; when the data authority approval service of a department approves the Hive data authority application of the employee, the data authority management center synchronously creates a corresponding data authority strategy, stores the corresponding data authority strategy in the table and field authority module and the row filtering and field desensitization module, and updates the mapping relation between the employee and the data authority strategy in the user authority management;

s2: resolving HQL; before HQL analysis, the Kerberos authentication of tenants needing to be carried out by utilizing Keytab of tenants interacting with a group big data cluster management platform by departments is needed, so that data isolation among the tenants is realized;

s3: HQL rewriting; in the process of analyzing the HQL, the SemanticAnalyzer performs line filtering and field desensitization rewriting on the HQL through a Hive TableMask object;

s4: checking HQL authority; based on QueryState, semanticAnalyzer and HQL, calling a static doAuthorization method of Driver to realize HQL data authority verification;

s5: HQL table, field blood margin analysis; when the HQL authority passes the verification, carrying out HQL table and field blood margin analysis;

after applying for the authority of the Hive table, the employee submits the Hive task, an HQL submitting module based on JDBC sends HQL and identity information of the employee to a Hive data authentication agent unit for authentication, and the HQL authentication process comprises HQL analysis, HQL rewriting, HQL authority verification, HQL table and field blood margin analysis;

in the step S2, after the kerberos authentication is passed, a HiveConf is created; the creation of HiveConf depends on Hadoop and Hive configuration files provided by a group big data cluster management platform, wherein the configuration files comprise core-site.xml, hdfs-site.xml, map-site.xml, yarn-site.xml and Hive-site.xml;

in step S2, HQL parsing includes the following sub-steps:

s201, creating a Session State object by using HiveConf, and setting the userName of the Session State object as an employee account submitted with HQL;

s202, starting a Session State object, setting a current database as a Hive database applied by a department on a big data cluster management platform, and initializing a transaction manager; after the sessionState object is created and started, the sessionState object is effective and unique, can communicate with Hadoop to submit distributed tasks, and can also be connected with a Hive metadata base to inquire metadata information;

s203, sequentially creating QueryState, context and ParseDriver objects; calling a parse method of the ParseDriver object to analyze the original HQL into abstract syntax tree nodes; generating a SemanticAnalyzer corresponding to QueryState and ASTNode by using a get method of Hive SemanticAnalyzer factor;

s204, calling an analyze method of the SemanticAnalyzer to analyze the HQL;

in step S3, the HQL rewriting process includes the following substeps:

s301, traversing and analyzing ASTNode of HQL to obtain table and field information;

s302, a table corresponding to the userName of the Sessionstate and a row filtering and field desensitization authority strategy of a field are pulled from a data authority management center through a DatablackHiveAuthorizer, and an applyRowFilterAndcolumnaging method is called so that a TableMask object can correctly acquire a row filtering and field desensitization expression;

s303, rewriting the Token stream of the original HQL according to the line filtering and field desensitization expression, and storing the latest Token stream of the HQL in a Context object;

in the step S4, if the HQL authority verification is successful, returning normally, otherwise, throwing out the abnormal authority verification failure; wherein, the authority verification bottom layer of the HiveDriver depends on a DatabeackHiveAuthorizer class; the DataBlackHiveAuthorizer class is an implementation class of a HiveAuthorizer interface of Hive, and realizes methods of checkPrivileges, applyRowFilterAndColumnMasking and needTransform; the doAuthorization of the Driver analyzes the HiveOperationType, the input and output HivePrivilegObjects and the authentication context of the HQL and then calls a checkPrivileges method of a DatablackHiveAuthorizer; the checkPrivileges method pulls a user authority policy to the data authority management center, analyzes tables and fields related in the input and output HiveprivileObject and operation types, and then matches the user authority policy; if all the input and output HivePrivilegObject objects pass the permission verification, the method returns normally, otherwise, the permission verification failure exception is thrown out;

in step S5, the HQL table and field blood relationship analysis specifically includes the following steps:

s501, based on HiveConf, queryState, semanticAnalyzer and HQL, creating QueryPlan and HookContext objects;

s502, calling a run method of Java ColumnLinearanalysis, and returning the table and field blood relationship in the HQL; columnLineageAnalysis is an inheritance class of the LineageLogger of Hive, a run method in the inheritance class is rewritten, and the table and field blood relationship of HQL can be returned;

and S503, the Hive data authentication agent sends the HQL authority verification result, the rewritten HQL, the table and field blood margin analysis result to an HQL submission module based on JDBC.

2. The Hive data authority control agent layer method of claim 1, wherein if HQL authentication passes, the JDBC-based HQL submission module submits the rewritten HQL to a group big data cluster management platform for execution.

3. A Hive data authority control proxy layer system, characterized in that the system is used for realizing the Hive data authority control proxy layer method according to any one of claims 1-2.

4. The Hive data authority control agent layer system according to claim 3, wherein the system comprises a group big data cluster management platform, is used for opening and registering tenants for each department using the platform, and configures the Hive database and HDFS file directory authority for each tenant in advance through Range Admin; the HDFS plug-in also comprises a Hive plug-in and an HDFS plug-in which are embedded into the HiveServer 2; the system is used for periodically pulling the authority policy from the Ranger Admin and storing the authority policy in a local policy repository.

5. The Hive data right control agent layer system according to claim 3 or 4, further comprising a data right management center, wherein the data right management center is provided with a table and field right module, a row filtering and field desensitizing module and a user right management module; wherein, the table and field authority defines data authority from metadata dimension; line filtering and field desensitization define data permissions from data dimensions;

and the Hive data authentication agent unit performs HQL analysis, HQL rewriting, HQL authority verification and HQL table and field blood margin analysis services.