CN116861152A

CN116861152A - Tax data security graph neural network training method based on matrix decomposition

Info

Publication number: CN116861152A
Application number: CN202310795131.2A
Authority: CN
Inventors: 师斌; 刘奥; 张纪强; 赵锐; 潘天泽; 董博; 郑庆华
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2023-06-30
Filing date: 2023-06-30
Publication date: 2023-10-10

Abstract

The invention discloses a tax data security graph neural network training method based on matrix decomposition, comprising the following steps: first, an external server is used to perform secure eigenvalue decomposition on the adjacency matrix of a tax data graph, the resulting eigenvalues are divided into several parts, and each part is combined with the eigenvector matrix to generate several distributable adjacency matrices; second, differential privacy is applied to the feature matrix of the tax data graph; third, the tax data owner distributes the decomposed adjacency matrices and the differentially private feature matrix to the computing parties through a parameter server for model training; finally, the computing parties return their results to the tax data owner, and the target model parameters are obtained through integration and updating by the parameter server. By means of topology secret sharing and eigenvalue decomposition of the adjacency matrix, the method securely decomposes the original tax data, so that the tax data can be analyzed and modeled efficiently with external computing resources, improving analysis efficiency.

Description

A tax data security graph neural network training method based on matrix decomposition

Technical field

The invention belongs to the technical field of graph privacy protection methods, and in particular relates to a tax data security graph neural network training method based on matrix decomposition.

Background

In recent years, with the rapid development of the national economy and the continued prosperity of the market economy, tax data has become increasingly large and complex. Tax data often takes the form of graph-structured data, reflecting both individual tax information and social relationship information. Graph neural networks can therefore model the graph-structured part of tax data effectively and mine the information it contains. Tax data modeling is the foundation of intelligent tax data processing and a key prerequisite for tax big data, but the growing scale of tax data and the large amount of private information it contains hinder its analysis and use. Traditional data protection targets discrete data points and uses technical means to prevent individual records from being identified and exploited. Graph-structured data, however, contains not only node information but also rich and important topological information, which traditional methods cannot fully protect. In contrast to traditional data protection, current privacy protection research aims to make data "usable but invisible", that is, to prevent private information from leaking without affecting the use of the data.

For graph-structured data, privacy protection research focuses on protecting the graph topology so that sensitive information is not leaked, which secures graph-structured data more effectively than traditional methods. In existing tax data modeling, the huge scale of tax data and the limited computing power of tax agencies mean that modeling performed by the agencies themselves is often inefficient, and external computing resources are urgently needed. At the same time, tax data contains a large amount of sensitive information whose exposure would have serious consequences, so it may not be handed directly to external organizations for processing. On the one hand, tax data must be processed securely to avoid leaking private information; on the other hand, the processed data must still support correct modeling. As the volume of taxpayer data grows and its content becomes more complex, how to escape the constraints of local computing power and use external computing power to train graph neural network models on tax data efficiently, while keeping the data secure, has become an urgent problem. Solving it is of great significance for accelerating tax data processing and further realizing tax big data.

At present, no existing research proposes a solution for privacy-preserving graph neural network training on tax data. The main invention patents related to tax data protection are:

Document 1: A blockchain-based tax information processing method and system (202011290032.1)

Document 2: An enterprise batch clustering method and system based on multi-dimensional features (202211142876.0)

Document 1 designs a blockchain-based tax information processing method and system. The tax agency acts as the tax node of the blockchain and manages the blocks, and separate channels are set up for different business organizations, each channel linking the tax node with the corresponding business organization nodes. With the user's authorization, the tax node broadcasts tax certificate information to the business organization nodes in the corresponding channel, so that those nodes obtain the tax certificate information.

Document 2 designs an enterprise batch clustering method and system based on multi-dimensional features. It collects tax data, news data and public opinion data for multiple target enterprises in the tax field, parses the collected data to generate feature data, constructs a graph structure from the feature data, and uses the graph structure as the input of an optimal graph neural network clustering model to obtain the clustering results for the target enterprises.

Of these technical solutions, Document 1 focuses on the storage protection of tax data and uses blockchain technology to ensure data security, but it does not consider how the protected data is applied, and querying and using the data is inefficient. Document 2 models a tax data graph from tax data that has already been collected and analyzes it with a graph neural network. Although it achieves good analysis results, it does not consider the privacy of the tax data at any point in the process, which may introduce security risks. In practice, the limited computing power of tax agencies makes processing of existing tax data inefficient, while the sensitivity of tax data means that the computing power of external organizations cannot be used directly to analyze and process it. Therefore, how to train a graph neural network model on tax data efficiently while keeping the data secure has become an urgent problem.

Summary of the invention

The invention aims to provide a tax data security graph neural network training method based on matrix decomposition. First, an external server is used to perform secure eigenvalue decomposition on the adjacency matrix of the tax data graph, the resulting eigenvalues are divided into several parts, and each part is combined with the eigenvector matrix to generate several distributable adjacency matrices. Second, differential privacy is applied to the feature matrix of the tax data graph. Third, the tax data owner distributes the decomposed adjacency matrices and the differentially private feature matrix to the computing parties through a parameter server for model training. Finally, the computing parties return their results to the tax data owner, and the target model parameters are obtained through integration and updating by the parameter server.

To achieve the above object, the invention adopts the following technical solution:

A tax data security graph neural network training method based on matrix decomposition, including:

First, an external server is used to perform secure eigenvalue decomposition on the adjacency matrix of the tax data graph, the resulting eigenvalues are divided into several parts, and each part is combined with the eigenvector matrix to generate several distributable partial secrets of the adjacency matrix; second, differential privacy is applied to the feature matrix of the tax data graph; third, the tax data owner distributes the partial secrets of the adjacency matrix and the differentially private feature matrix to the computing parties through a parameter server for model training; finally, the computing parties return their results to the tax data owner, and the target model parameters are obtained through integration and updating by the parameter server.

In a further improvement of the invention, the method specifically includes the following steps:

1) Adjacency matrix secret sharing based on eigenvalue decomposition

An external server is used to perform secure eigenvalue decomposition on the adjacency matrix of the tax data graph. The eigenvalues are randomly divided into equal parts according to the number of computing parties, and the result of combining each part with the eigenvector matrix is a publishable partial secret of the adjacency matrix.

2) Feature matrix protection based on differential privacy

The feature matrix of the tax data graph is protected with differential privacy by applying the Laplace mechanism.

3) Model training and integration based on a parameter server

The partial secrets of the adjacency matrix and the differentially private feature matrix are distributed to the computing parties. Each computing party trains a graph convolutional network model on the data assigned to it, and model parameters are sent, collected and integrated through the parameter server to obtain the target model parameters.

In a further improvement of the invention, in step 1), the adjacency matrix secret sharing based on eigenvalue decomposition includes:

Step1: Secure matrix eigenvalue decomposition

For the adjacency matrix A of the tax data graph, a sufficiently accurate numerical eigenvalue decomposition is obtained through repeated QR decomposition iterations:

$A_1 = A, \qquad A_t = Q_t R_t, \qquad A_{t+1} = R_t Q_t$

where t is the iteration round and $Q_t$, $R_t$ are the QR decomposition of $A_t$ in round t. After k iterations, the diagonal eigenvalue matrix is $\Lambda = A_k$, the eigenvector matrix is $X = Q_1 \cdots Q_k$, and the original adjacency matrix is $A = X \Lambda X^{-1}$.
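The iteration above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the patented implementation: the function name `qr_eigendecomposition`, the fixed iteration count and the small symmetric test matrix are all assumptions made for the example. The unshifted iteration shown here converges to a diagonal matrix only when the eigenvalues have distinct magnitudes, so a practical solver would likely add shifts.

```python
import numpy as np

def qr_eigendecomposition(A, num_iters=200):
    """Unshifted QR iteration: A_t = Q_t R_t, then A_{t+1} = R_t Q_t.

    Returns (Lambda, X), where Lambda approximates the diagonal
    eigenvalue matrix and X = Q_1 Q_2 ... accumulates the eigenvectors.
    """
    A_t = np.array(A, dtype=float)
    X = np.eye(A_t.shape[0])
    for _ in range(num_iters):
        Q, R = np.linalg.qr(A_t)   # QR decomposition of the current iterate
        A_t = R @ Q                # similarity transform: Q^T A_t Q
        X = X @ Q                  # accumulate the orthogonal factors
    return A_t, X

# Hypothetical symmetric test matrix (not tax data).
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0]])
Lam, X = qr_eigendecomposition(A)

# X is orthogonal, so X^{-1} = X^T and A is recovered as X Lam X^T.
print(np.allclose(X @ Lam @ X.T, A))
```

Because every step is a similarity transform, the product X Λ X^{-1} reproduces A at any iteration count; the number of iterations only controls how close Λ is to diagonal.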

Step2: Topology secret sharing

The eigenvalues in the diagonal eigenvalue matrix Λ are randomly divided into several groups, each represented as a diagonal matrix. When they are divided into two groups, the steps are as follows:

Generate a random diagonal 0/1 matrix S, each diagonal element of which is randomly set to 0 or 1.

Generate new diagonal matrices $\Lambda_1$ and $\Lambda_2$ as follows:

$\Lambda_1 = S \times_h \Lambda, \qquad \Lambda_2 = (I_n - S) \times_h \Lambda$

where $I_n$ denotes the n-dimensional identity matrix and $\times_h$ denotes the Hadamard product, that is, element-wise multiplication of matrices.

Using the newly generated diagonal matrices $\Lambda_1$ and $\Lambda_2$, generate new matrices $A_1$ and $A_2$ as follows:

$A_1 = X \Lambda_1 X^{-1}, \qquad A_2 = X \Lambda_2 X^{-1}$

Because S and $I_n - S$ select disjoint sets of eigenvalues, $A_1$ and $A_2$ have the following properties for any positive integer m:

$A_1 + A_2 = A, \qquad A_1^m + A_2^m = A^m$

In a GNN model, the graph topology is represented by the adjacency matrix, and powers of the adjacency matrix reflect the message-passing process of the GNN model.
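The two-party split can be checked numerically with the sketch below. It is an illustration under stated assumptions: `split_adjacency` is a hypothetical helper, `numpy.linalg.eigh` stands in for the secure decomposition of Step1, and S is drawn so that exactly half of its diagonal entries are 1, which is one reading of "randomly and equally divided". The example graph is a made-up four-node graph.

```python
import numpy as np

def split_adjacency(A, rng):
    """Split a symmetric adjacency matrix into two shares A1 and A2."""
    eigvals, X = np.linalg.eigh(A)            # A = X diag(eigvals) X^T
    n = len(eigvals)
    mask = np.zeros(n)
    mask[rng.permutation(n)[: n // 2]] = 1.0  # diagonal of S (assumed: half ones)
    S = np.diag(mask)
    Lam = np.diag(eigvals)
    Lam1 = S * Lam                            # Hadamard product S x_h Lambda
    Lam2 = (np.eye(n) - S) * Lam              # (I_n - S) x_h Lambda
    X_inv = X.T                               # X is orthogonal for symmetric A
    return X @ Lam1 @ X_inv, X @ Lam2 @ X_inv

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A1, A2 = split_adjacency(A, rng)

print(np.allclose(A1 + A2, A))                # shares sum to A
print(np.allclose(A1 @ A1 + A2 @ A2, A @ A))  # squares sum to A^2
```

The second check is the one that matters for a two-layer GCN: each party can compute the square of its own share, and the sum of those squares equals the square of the original adjacency matrix.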

In a further improvement of the invention, in Step1 of step 1), a third-party server is used to decompose the adjacency matrix A securely: the data owner generates a sparse random 0/1 matrix P, computes $A' = P A P^{-1}$ and uploads it to the third-party server; the third-party server computes the eigenvalue decomposition of A′ by the iterative procedure above and returns the results X′ and Λ to the data owner, where $A' = X' \Lambda X'^{-1}$; the data owner then computes $X = P^{-1} X'$ to obtain the decomposition of A.
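A minimal sketch of this masking round trip follows. The text only requires P to be a sparse random 0/1 matrix; choosing P as a random permutation matrix is an assumption made here because it is guaranteed to be invertible, with $P^{-1} = P^T$. The function name `masked_eigendecomposition` is hypothetical, and `numpy.linalg.eigh` stands in for the computation performed by the third-party server.

```python
import numpy as np

def masked_eigendecomposition(A, rng):
    """Decompose A via a masked copy A' = P A P^{-1}."""
    n = A.shape[0]
    P = np.eye(n)[rng.permutation(n)]      # random permutation matrix (assumed form of P)
    P_inv = P.T                            # inverse of a permutation matrix
    A_masked = P @ A @ P_inv               # what the data owner uploads
    # Third-party side: eigendecomposition of the masked matrix.
    eigvals, X_masked = np.linalg.eigh(A_masked)
    # Data owner side: undo the mask on the eigenvectors.
    X = P_inv @ X_masked
    return np.diag(eigvals), X

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
Lam, X = masked_eigendecomposition(A, rng)
print(np.allclose(X @ Lam @ np.linalg.inv(X), A))
```

The check works because $A = P^{-1} A' P = (P^{-1} X') \Lambda (P^{-1} X')^{-1}$, so the eigenvalues are unchanged and only the eigenvectors need unmasking.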

In a further improvement of the invention, in Step2 of step 1), for a two-layer GCN, a node embedding is influenced by the neighbors within two hops of the node. The two-hop neighborhood is represented by the square of the adjacency matrix, $A^2$, which captures the connectivity of the graph and the message passing between nodes. Let n be the number of nodes and let the original adjacency matrix be decomposed into k matrices; each resulting matrix then contains $n/k$ eigenvalues and lacks $n - n/k$ eigenvalues. Even on the premise that all eigenvalues have been obtained, the probability of arranging them in the correct order is denoted p.

In a further improvement of the invention, when n = 100 and k = 2, $p \approx 3.3 \times 10^{-65}$.
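The quoted figure can be reproduced with a short calculation. The closed form of p is not legible in this text, so the sketch below assumes $p = 1/(n - n/k)!$, that is, one correct ordering among all orderings of the eigenvalues a party is missing. This assumed form is consistent with the stated value for n = 100 and k = 2, but it remains an assumption.

```python
from math import factorial

def arrangement_probability(n, k):
    """Assumed form: p = 1 / (n - n/k)!, the chance of guessing one
    specific ordering of the eigenvalues a computing party lacks."""
    missing = n - n // k
    return 1.0 / factorial(missing)

p = arrangement_probability(100, 2)
print(f"{p:.2e}")  # prints 3.29e-65
```

With n = 100 and k = 2 each party lacks 50 eigenvalues, and 1/50! is about 3.29 × 10^-65, matching the value given above after rounding.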

In a further improvement of the invention, in step 2), the feature matrix protection based on differential privacy includes:

Step1: Privacy budget and global sensitivity calculation

The Laplace mechanism is applied to protect the feature matrix X of the tax data graph with differential privacy. Given the chosen privacy budget $\epsilon$, the global sensitivity $\Delta_f$ is calculated:

$\Delta_f = \max_{D, D'} \lvert h - h' \rvert$

where D and D′ are a pair of adjacent datasets, and h and h′ are the results of a random query on D and D′ respectively. The Laplace noise to be added is then drawn from a distribution whose scale is determined by $\Delta_f$ and $\epsilon$.

This Laplace mechanism satisfies $\epsilon$-differential privacy, that is:

$\Pr[M(D) = y] \le e^{\epsilon} \Pr[M(D') = y]$

where M is the mechanism applied.

Step2: Noise injection

The Laplace noise generated in the previous step is added to the feature matrix X of the tax data graph to obtain the privacy-preserving feature matrix X′.
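The noise injection step can be sketched as follows. The noise scale formula is not legible in this text, so the sketch assumes the standard Laplace mechanism with scale $b = \Delta_f / \epsilon$. The function name `laplace_mechanism`, the toy feature matrix and the sensitivity value of 1.0 are illustrative assumptions; in practice the sensitivity would be derived from the actual feature ranges.

```python
import numpy as np

def laplace_mechanism(X, sensitivity, epsilon, rng):
    """Add i.i.d. Laplace noise with scale b = sensitivity / epsilon
    (standard Laplace mechanism, assumed here) to every entry of X."""
    scale = sensitivity / epsilon
    noise = rng.laplace(loc=0.0, scale=scale, size=X.shape)
    return X + noise

rng = np.random.default_rng(0)
X = np.zeros((4, 3))                 # hypothetical feature matrix
X_priv = laplace_mechanism(X, sensitivity=1.0, epsilon=10.0, rng=rng)
print(X_priv.shape)
```

A larger privacy budget gives a smaller scale and therefore less noise, which is the usual trade-off between privacy and model accuracy.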

In a further improvement of the invention, in step 3), the model training and integration based on a parameter server includes:

Step1: Data distribution

The adjacency matrix A of the tax data graph is decomposed into {$A_k$}, k = 1, 2, …, and the feature matrix X of the tax data graph is processed with differential privacy to obtain X′. The data owner provides the privacy-protected data to the computing parties; each computing party receives $A_k$ and X′ as the input of its GNN model.

Step2: Model training based on privacy-protected data

A graph convolutional network (GCN) model is selected for training. Computing party k holds a two-layer GCN model locally and trains it on the data assigned to it. The input of the first layer is the node feature matrix X′ and the adjacency matrix $A_k$; after message passing and aggregation, the layer outputs the hidden node feature matrix:

$H_{k,1} = f(A_k X' W_{k,1})$

The input of the second layer is the first-layer output $H_{k,1}$ and the adjacency matrix $A_k$; it outputs the hidden node feature matrix $H_{k,2}$, which is used for node classification or other downstream tasks:

$H_{k,2} = f(A_k H_{k,1} W_{k,2})$

After training, each computing party uploads its model parameters $W_{k,1}$ and $W_{k,2}$ to the parameter server held by the data owner and pulls the updated model parameters from it. After collecting the parameters uploaded by all participants, the parameter server integrates them by model averaging, as used in distributed machine learning, to obtain the new model parameters.
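The forward pass and the averaging step can be sketched as below. This is an illustration only: the activation f is not specified in the text, so ReLU is assumed; the local gradient update is omitted; and the helper names `gcn_forward` and `average_parameters` are hypothetical.

```python
import numpy as np

def relu(Z):
    return np.maximum(Z, 0.0)

def gcn_forward(A_k, X_priv, W1, W2):
    """Two-layer GCN forward pass on one party's share A_k.

    H1 = f(A_k X' W1), H2 = f(A_k H1 W2), with f assumed to be ReLU.
    """
    H1 = relu(A_k @ X_priv @ W1)
    H2 = relu(A_k @ H1 @ W2)
    return H2

def average_parameters(param_sets):
    """Model averaging on the parameter server.

    param_sets is a list of (W1, W2) tuples, one per computing party.
    """
    W1 = np.mean([p[0] for p in param_sets], axis=0)
    W2 = np.mean([p[1] for p in param_sets], axis=0)
    return W1, W2

# Toy shapes: 4 nodes, 3 input features, 5 hidden units, 2 classes.
rng = np.random.default_rng(0)
A_k = rng.normal(size=(4, 4))
X_priv = rng.normal(size=(4, 3))
party_params = [(rng.normal(size=(3, 5)), rng.normal(size=(5, 2)))
                for _ in range(2)]
W1, W2 = average_parameters(party_params)
print(gcn_forward(A_k, X_priv, W1, W2).shape)  # (4, 2)
```

Averaging keeps the server-side logic simple: it never sees the shares $A_k$, only the weight matrices returned by each party.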

The invention has at least the following beneficial technical effects:

(1) The invention applies privacy protection separately to the adjacency matrix and the feature matrix of the tax data graph. Topology secret sharing and eigenvalue decomposition of the adjacency matrix keep the graph topology hidden from the computing parties, and differential privacy protects the node feature information. The security of sensitive information is thus maintained throughout the process.

(2) Through topology secret sharing and eigenvalue decomposition of the adjacency matrix, the invention decomposes the original tax data securely, so that external computing resources can be used to analyze and model the tax data efficiently, improving analysis efficiency.

Description of the drawings

Figure 1 is a flow chart of the overall framework.

Figure 2 is a flow chart of adjacency matrix secret sharing based on eigenvalue decomposition.

Figure 3 is a flow chart of model training and integration based on a parameter server.

Detailed description

Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments, it should be understood that the disclosure may be implemented in various forms and should not be limited to the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood thoroughly and its scope conveyed fully to those skilled in the art. It should be noted that, where no conflict arises, the embodiments of the invention and the features of those embodiments may be combined with one another. The invention is described in detail below with reference to the drawings and embodiments.

As shown in Figure 1, the tax data security graph neural network training method based on matrix decomposition provided by the invention includes the following steps:

An external server is used to perform secure eigenvalue decomposition on the adjacency matrix of the tax data graph. The eigenvalues are randomly divided into equal parts according to the number of computing parties, and the result of combining each part with the eigenvector matrix is a publishable partial secret of the adjacency matrix. The adjacency matrix secret sharing based on eigenvalue decomposition includes:

Step2: Topology secret sharing

Because the two shares are built from disjoint sets of eigenvalues, $A_1$ and $A_2$ have the following properties for any positive integer m:

$A_1 + A_2 = A, \qquad A_1^m + A_2^m = A^m$

In a GNN model, the graph topology is represented by the adjacency matrix, and powers of the adjacency matrix reflect the message-passing process of the GNN model. A third-party server is used to decompose the adjacency matrix A securely: the data owner generates a sparse random 0/1 matrix P, computes $A' = P A P^{-1}$ and uploads it to the third-party server; the third-party server computes the eigenvalue decomposition of A′ by the iterative procedure above and returns the results X′ and Λ to the data owner, where $A' = X' \Lambda X'^{-1}$; the data owner then computes $X = P^{-1} X'$ to obtain the decomposition of A. For a two-layer GCN, a node embedding is influenced by the neighbors within two hops of the node. The two-hop neighborhood is represented by the square of the adjacency matrix, $A^2$, which captures the connectivity of the graph and the message passing between nodes. Owing to the properties of $A_1$ and $A_2$, they can be distributed to different computing parties, each of which computes the power of its own share, and the power of the original matrix is obtained by collecting and combining the results.

For a computing party, on the other hand, the shares $A_1$ and $A_2$ are numerically very different from the original 0/1 adjacency matrix: they contain many fractional values, so the existence of any particular edge cannot be read off from the numbers. From its assigned matrix a computing party can derive only part of the eigenvalues of the original adjacency matrix, which is not enough to recover it. Even if a computing party obtained all the eigenvalues, it would have to arrange them in the correct order to recover the original adjacency matrix, and the probability of doing so is very small. Let n be the number of nodes and let the original adjacency matrix be decomposed into k matrices; each resulting matrix contains $n/k$ eigenvalues and lacks $n - n/k$ eigenvalues. On the premise that all eigenvalues have been obtained, the probability of arranging them correctly is denoted p. Without all the eigenvalues, the probability that a computing party recovers the original adjacency matrix is far smaller than p.

The feature matrix of the tax data graph is protected with differential privacy by applying the Laplace mechanism. The feature matrix protection based on differential privacy includes:

$\Delta_f = \max_{D, D'} \lvert h - h' \rvert$

$\Pr[M(D) = y] \le e^{\epsilon} \Pr[M(D') = y]$

Step2: Noise injection

The partial secrets of the adjacency matrix and the differentially private feature matrix are distributed to the computing parties. Each computing party trains a graph convolutional network model on the data assigned to it, and model parameters are sent, collected and integrated through the parameter server to obtain the target model parameters. As shown in Figure 3, the model training and integration based on a parameter server includes:

Step1: Data distribution

A graph convolutional network (GCN) model is selected for training. Computing party k holds a two-layer GCN model locally and trains it on the data assigned to it. The input of the first layer is the node feature matrix X′ and the adjacency matrix $A_k$; after message passing and aggregation, the layer outputs the hidden node feature matrix:

$H_{k,1} = f(A_k X' W_{k,1})$

$H_{k,2} = f(A_k H_{k,1} W_{k,2})$

Example

A local tax data graph covering 2017 to 2019, taken from the national tax records of a certain region, is selected. It contains 2786 nodes and 5728 edges, with a node feature dimension of 1289 and a label dimension of 6. The invention is described in further detail below with reference to the drawings, the experimental case and the specific implementation. All techniques implemented on the basis of the content of the invention fall within its scope.

As shown in Figure 1, in this specific implementation, the privacy-preserving graph neural network training method for tax data based on matrix decomposition and differential privacy includes the following steps:

Step 1. Adjacency matrix secret sharing based on eigenvalue decomposition

The tax data graph contains a large amount of adjacency information, and secret sharing effectively prevents the computing parties from inferring the adjacency matrix. The secret sharing procedure is shown in Figure 2 and includes the following steps:

S101. Adjacency matrix eigenvalue decomposition

The data owner first generates a random 0/1 matrix P of size 2786×2786 locally, then computes the masked adjacency matrix $A' = P A P^{-1}$ and uploads it to the third-party server. The third-party server performs eigenvalue decomposition on the uploaded masked adjacency matrix A′ and returns the results X′ and Λ. The data owner then processes X′ and Λ locally to obtain the decomposition of the tax data graph adjacency matrix, $X = P^{-1} X'$ and Λ.

S102. Topology secret sharing

This embodiment has two computing parties. Using a randomly generated diagonal 0/1 matrix S of size 2786×2786, Λ is split into $\Lambda_1 = S \times_h \Lambda$ and $\Lambda_2 = (I_{2786} - S) \times_h \Lambda$, which give the new matrices $A_1 = X \Lambda_1 X^{-1}$ and $A_2 = X \Lambda_2 X^{-1}$. $A_1$ and $A_2$ are assigned to the two computing parties respectively.

Step 2. Feature matrix protection based on differential privacy

Differential privacy protects the feature matrix simply and effectively.

Specifically, in this embodiment the privacy budget is set to $\epsilon = 10$; the corresponding Laplace noise is computed and added to the feature matrix X to obtain the privacy-preserving feature matrix X′.

Step 3. Model training and integration based on a parameter server

With the secret-shared adjacency matrix and the differentially private feature matrix, the computing parties cannot infer the information in the original tax data graph during computation, and with the help of the parameter server the model can still be trained correctly.

Specifically, in this embodiment, computing party k trains a two-layer GCN model whose parameters are denoted $W_{k,1}$ and $W_{k,2}$. Computing party k trains the model on its assigned data $A_k$ and X′. The input of the first layer is the node feature matrix X′ and the adjacency matrix $A_k$; after message passing and aggregation, the layer outputs the hidden node feature matrix:

$H_{k,1} = f(A_k X' W_{k,1})$

The input of the second layer is the first-layer output $H_{k,1}$ and the adjacency matrix $A_k$; it outputs the hidden node feature matrix $H_{k,2}$, which can be used for node classification or other downstream tasks.

$H_{k,2} = f(A_k H_{k,1} W_{k,2})$

After training, $W_{k,1}$ and $W_{k,2}$ are uploaded to the parameter server. Once both computing parties have uploaded, the parameter server integrates the model parameters by model averaging and sends the updated parameters $W_1$ and $W_2$ back to the computing parties for the next iteration. After 100 iterations in total, the final model parameters are obtained. The final model reaches an accuracy of 81.6% on the original tax data graph, compared with 84.1% for a model trained directly on the tax data graph, a drop of only 2.5 percentage points, while the use of external computing power greatly speeds up modeling.

Those skilled in the art will readily understand that the above is merely a method embodiment of the present invention and is not intended to limit it. Any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A tax data security graph neural network training method based on matrix decomposition, characterized by including:

first, using an external server to perform secure eigenvalue decomposition on the adjacency matrix of the tax data graph, dividing the resulting eigenvalue decomposition into multiple parts, and operating on these parts with the eigenvector matrix to generate multiple distributable partial secrets of the adjacency matrix; second, applying differential privacy to the feature matrix of the tax data graph; third, the tax data owner distributing the partial secrets of the decomposed adjacency matrix and the differentially private feature matrix to each computing party through the parameter server for model training; finally, the computing parties returning their computation results to the tax data owner, and the target model parameters being obtained through integration and updating by the parameter server.

2. A tax data security graph neural network training method based on matrix decomposition according to claim 1, characterized in that the method specifically includes the following steps:

1) Adjacency matrix secret sharing based on eigenvalue decomposition

For the adjacency matrix of the tax data graph, use an external server to perform secure eigenvalue decomposition; randomly divide the eigenvalues into as many parts as there are computing parties, and operate on each part with the eigenvector matrix to obtain a publishable partial secret of the adjacency matrix;

2) Feature matrix protection based on differential privacy

For the feature matrix of the tax data graph, apply differential privacy using the Laplace mechanism to protect it;

3) Model training and integration based on parameter server

Distribute the partial secrets of the decomposed adjacency matrix and the differentially private feature matrix to each computing party. Each computing party trains a graph convolutional neural network model on the data distributed to it, and the parameter server sends, collects and integrates the model parameters to obtain the target model parameters.

3. A tax data security graph neural network training method based on matrix decomposition according to claim 2, characterized in that in step 1), the adjacency matrix secret sharing based on eigenvalue decomposition includes:

Step 1: Secure matrix eigenvalue decomposition

For the adjacency matrix A of the tax data graph, through multiple iterations of QR decomposition, a sufficiently accurate numerical solution of eigenvalue decomposition is obtained:

where t is the iteration round and $Q_t$ and $R_t$ are the QR decomposition factors of $A_t$ in round t; after k iterations, the eigenvalue diagonal matrix is $\Lambda = A_k$, the eigenvector matrix is $X = Q_1 Q_2 \cdots Q_k$, and the original adjacency matrix is $A = X \Lambda X^{-1}$;
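The iterative procedure can be sketched as below. This is an illustration only: it assumes the standard unshifted QR iteration ($A_t = Q_t R_t$, $A_{t+1} = R_t Q_t$), a symmetric input matrix, and eigenvalues of distinct magnitude so that the iteration converges. The function name `qr_eigendecomposition`, the iteration count, and the example matrix are hypothetical.

```python
import numpy as np

def qr_eigendecomposition(A, iterations=200):
    """Unshifted QR iteration (assumed form): A_t = Q_t R_t, A_{t+1} = R_t Q_t."""
    A_t = np.array(A, dtype=float)
    X = np.eye(A_t.shape[0])
    for _ in range(iterations):
        Q, R = np.linalg.qr(A_t)
        A_t = R @ Q      # similar to the previous A_t, so eigenvalues are preserved
        X = X @ Q        # accumulate X = Q_1 Q_2 ... Q_k
    # Keep only the diagonal of the final iterate as the eigenvalue matrix.
    return np.diag(np.diag(A_t)), X

# Hypothetical small symmetric matrix with eigenvalues of distinct magnitude.
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0]])
Lam, X = qr_eigendecomposition(A)

# For a symmetric input, X is orthogonal, so X^{-1} equals X.T.
reconstruction_error = np.abs(X @ Lam @ X.T - A).max()
```

Real adjacency matrices can have eigenvalues of equal magnitude (for example, a bipartite graph has eigenvalues in ±λ pairs), in which case the unshifted iteration may not converge to a diagonal matrix and a shifted variant would be needed.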

Step 2: Topology secret sharing

For the obtained eigenvalue diagonal matrix Λ, the eigenvalues are randomly divided into multiple groups in the form of multiple diagonal matrices. When divided into two groups, the specific steps are as follows:

Generate a random diagonal 0/1 matrix S whose diagonal elements obey the following rule:

Generate new diagonal matrices Λ ₁ and Λ ₂ as follows:

where $I_n$ denotes the n-dimensional identity matrix and $\times_h$ denotes the Hadamard product, that is, element-wise multiplication of matrices;

Use the newly generated diagonal matrices Λ ₁ and Λ ₂ to generate new matrices A ₁ and A ₂ as follows:

A ₁ and A ₂ have the following properties:

In the GNN model, the graph topology is expressed in the form of an adjacency matrix, and the power of the adjacency matrix can reflect the information transfer process of the GNN model.
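One construction consistent with the two-group split described in this claim is sketched below. It rests on several assumptions that the text does not state: each diagonal entry of S is 1 with probability 1/2, the split is $\Lambda_1 = S \times_h \Lambda$ and $\Lambda_2 = (I_n - S) \times_h \Lambda$, and the shares are $A_i = X \Lambda_i X^{-1}$. The example graph is hypothetical, and `np.linalg.eigh` stands in for the QR-based decomposition.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical undirected graph on 6 nodes (symmetric 0/1 adjacency matrix).
n = 6
upper = np.triu(rng.integers(0, 2, size=(n, n)), k=1)
A = (upper + upper.T).astype(float)

# Eigendecomposition A = X diag(lam) X^{-1}; eigh is a stand-in for the QR iteration.
lam, X = np.linalg.eigh(A)
Lam = np.diag(lam)

# Random diagonal 0/1 matrix S (assumption: each entry is 1 with probability 1/2).
S = np.diag(rng.integers(0, 2, size=n).astype(float))

# Split the eigenvalues with Hadamard (element-wise) products.
Lam1 = S * Lam
Lam2 = (np.eye(n) - S) * Lam

# Partial secrets of the adjacency matrix (X is orthogonal here, so X^{-1} = X.T).
A1 = X @ Lam1 @ X.T
A2 = X @ Lam2 @ X.T
```

Under these assumptions the shares satisfy $A_1 + A_2 = A$, $A_1 A_2 = 0$ and $A_1^2 + A_2^2 = A^2$, so the two-hop information carried by $A^2$ is recoverable from the sum of the shares' squares while neither share alone equals A.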

4. A tax data security graph neural network training method based on matrix decomposition according to claim 3, characterized in that in Step 1 of step 1), the secure decomposition of the adjacency matrix A is performed using a third-party server: the data owner generates a sparse random 0/1 matrix P, computes $A' = P A P^{-1}$ and uploads it to the third-party server; the third-party server computes the eigenvalue decomposition of $A'$ by the above iterative procedure and returns the results $X'$ and $\Lambda$, with $A' = X' \Lambda X'^{-1}$, to the data owner; the data owner computes $X = P^{-1} X'$ and obtains the matrix decomposition result.
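The masking exchange in this claim can be sketched as follows. The sketch assumes that P is a random permutation matrix, which is one kind of invertible sparse 0/1 matrix; the claim does not say how P is drawn. `np.linalg.eigh` stands in for the server's iterative solver, and the example graph is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical symmetric adjacency matrix held by the data owner.
n = 5
upper = np.triu(rng.integers(0, 2, size=(n, n)), k=1)
A = (upper + upper.T).astype(float)

# Data owner: draw a random permutation matrix P (assumed form of the 0/1 mask).
P = np.eye(n)[rng.permutation(n)]
A_masked = P @ A @ P.T                      # P^{-1} = P.T for a permutation matrix

# Third-party server: sees and decomposes only the masked matrix A'.
lam, X_masked = np.linalg.eigh(A_masked)    # stand-in for the iterative QR solver

# Data owner: undo the mask, X = P^{-1} X'.
X = P.T @ X_masked
A_recovered = X @ np.diag(lam) @ X.T
```

Because $A'$ is similar to A, the server learns the eigenvalues; what the mask hides in this sketch is the node ordering.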

5. A tax data security graph neural network training method based on matrix decomposition according to claim 3, characterized in that, in Step 2 of step 1), for a two-layer GCN, a node's embedding is influenced by its neighbors within two hops, and the neighbors within two hops are represented by the square of the adjacency matrix, $A^2$, which effectively indicates the connectivity of the graph and the information transfer between nodes; denoting the number of nodes by n and decomposing the original adjacency matrix into k matrices, each decomposed matrix contains only part of the eigenvalues and lacks the rest, and, given that all eigenvalues have been obtained, the probability of arranging them correctly is p.

6. A tax data security graph neural network training method based on matrix decomposition according to claim 5, characterized in that when n=100 and k=2, $p \approx 3.3 \times 10^{-6}$.

7. A tax data security graph neural network training method based on matrix decomposition according to claim 3, characterized in that in step 2), the feature matrix protection based on differential privacy includes:

Step 1: Privacy budget and global sensitivity calculation

Apply the Laplace mechanism to protect the feature matrix X of the tax data graph with differential privacy; given the chosen privacy budget $\varepsilon$, calculate the global sensitivity $\Delta_f$:

$\Delta_f = \max_{D,D'} \lvert h - h' \rvert$

where D and D′ are a pair of adjacent datasets, and h and h′ are the results of a random query on D and D′, respectively; the Laplace noise distribution to be added is set as follows:

Step 2: Noise injection

Add the Laplace noise generated in the previous step to the feature matrix X of the tax data graph to obtain the privacy-preserving feature matrix X′.
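A minimal sketch of the noise-injection step, assuming the textbook Laplace mechanism in which zero-mean noise with scale $\Delta_f / \varepsilon$ is added independently to every entry. The function name `laplace_mechanism`, the sensitivity value, the privacy budget, and the feature matrix are all hypothetical.

```python
import numpy as np

def laplace_mechanism(X, sensitivity, epsilon, rng):
    """Add i.i.d. Laplace noise with scale sensitivity / epsilon to every entry of X."""
    scale = sensitivity / epsilon
    return X + rng.laplace(loc=0.0, scale=scale, size=X.shape)

rng = np.random.default_rng(0)
X = rng.random((100, 16))     # hypothetical feature matrix with entries in [0, 1)

# Hypothetical settings: global sensitivity 1.0 and privacy budget epsilon = 2.0.
X_private = laplace_mechanism(X, sensitivity=1.0, epsilon=2.0, rng=rng)
```

A smaller $\varepsilon$ gives a larger noise scale, so privacy is stronger but the features are more distorted.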

8. A tax data security graph neural network training method based on matrix decomposition according to claim 7, characterized in that, in Step 2 of step 2), the Laplace mechanism satisfies $\varepsilon$-differential privacy, that is:

$\Pr[M(D) = y] \le e^{\varepsilon} \Pr[M(D') = y]$

where M is the mechanism applied.

9. A tax data security graph neural network training method based on matrix decomposition according to claim 7, characterized in that in step 3), the parameter server-based model training and integration includes:

Step 1: Data distribution

The adjacency matrix A of the tax data graph is decomposed into $\{A_k\}$, k=1,2,..., and the feature matrix X of the tax data graph is processed with differential privacy to obtain $X'$; each computing party obtains $A_k$ and $X'$ as inputs to the GNN model;

Step 2: Model training based on privacy-preserving data

A graph convolutional neural network model is selected for training. Computing party k holds a two-layer GCN model locally and trains it on the data assigned to it; the first layer takes the node feature matrix $X'$ and the adjacency matrix $A_k$ as input and, after information transfer and aggregation, outputs the node hidden feature matrix:

$H_{k,1} = f(A_k X' W_{k,1})$

The second layer takes the first-layer output $H_{k,1}$ and the adjacency matrix $A_k$ as input and outputs the node hidden feature matrix $H_{k,2}$, which is used for node classification or other downstream tasks;

$H_{k,2} = f(A_k H_{k,1} W_{k,2})$

After training, the computing party uploads the model parameters $W_{k,1}$ and $W_{k,2}$ to the parameter server held by the data owner and pulls the updated model parameters from the parameter server; after collecting the model parameters uploaded by every participant, the parameter server integrates them using the model averaging method from distributed machine learning to obtain new model parameters.
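The two layer equations in this claim can be sketched as a forward pass. This is an illustration under assumptions: the activation f is not specified in the text and is taken to be ReLU here, and all matrix sizes and random inputs are hypothetical stand-ins for one party's adjacency share $A_k$ and the noised features $X'$.

```python
import numpy as np

def relu(Z):
    """ReLU activation (an assumed choice for f)."""
    return np.maximum(Z, 0.0)

def gcn_forward(A_k, X_priv, W1, W2, f=relu):
    """Two-layer forward pass: H1 = f(A_k X' W1), H2 = f(A_k H1 W2)."""
    H1 = f(A_k @ X_priv @ W1)
    H2 = f(A_k @ H1 @ W2)
    return H1, H2

rng = np.random.default_rng(0)
n, d, hidden, classes = 6, 4, 8, 3        # hypothetical sizes
A_k = rng.normal(size=(n, n))             # stand-in for one adjacency share
X_priv = rng.normal(size=(n, d))          # stand-in for the noised features X'
W1 = rng.normal(size=(d, hidden))         # W_{k,1}
W2 = rng.normal(size=(hidden, classes))   # W_{k,2}

H1, H2 = gcn_forward(A_k, X_priv, W1, W2)
```

Each row of `H2` is the hidden representation of one node; a classifier head or loss function would be applied to it during the party's local training, which this sketch omits.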