CN116127400B - Sensitive data identification system, method and storage medium based on heterogeneous computation

Info

Publication number: CN116127400B
Application number: CN202310418681.2A
Authority: CN
Inventors: 姚启桂; 张涛; 石聪聪; 张小建; 费稼轩; 黄伟聪; 罗晨; 何阳
Original assignee: State Grid Smart Grid Research Institute Co ltd
Current assignee: State Grid Smart Grid Research Institute Co ltd
Priority date: 2023-04-19
Filing date: 2023-04-19
Publication date: 2023-06-27
Anticipated expiration: 2043-04-19
Also published as: CN116127400A

Abstract

The invention discloses a sensitive data identification system, method and storage medium based on heterogeneous computation. The sensitive data identification system comprises a CPU chip, an AI chip and an FPGA chip connected in sequence. The CPU chip acquires the service data to be marked; the AI chip determines the service cluster to which the service data belongs according to its service features and marks the service data accordingly; the FPGA chip identifies sensitive data in the marked service data by parallel processing. By constructing a heterogeneous computing architecture based on CPU, AI and FPGA chips, the embodiment of the invention marks the service data using the efficient data processing capability of the AI chip and processes the service data in parallel on the FPGA chip, which improves the efficiency of identifying sensitive data in power system service data and overcomes the performance bottleneck of traditional software-based identification.

Description

Sensitive data identification system, method and storage medium based on heterogeneous computation

Technical Field

The invention relates to the technical field of power data monitoring, in particular to a sensitive data identification system, a sensitive data identification method and a storage medium based on heterogeneous computation.

Background

Electric power data is a production factor and a key resource for advancing the digital transformation of energy and building a new type of power system. With the development of digitization, power data is shared and exchanged more and more frequently. The power business interacts with a wide range of external parties: it covers departments such as marketing and finance, and exchanges data with government bodies, banks, insurers, internet companies, operators and other organizations. These interactions are complex, involve a large number of interfaces and a wide range of data, and some of the data is personal private data. To strengthen monitoring for leakage of externally exchanged data and reduce the risk of abnormal interface access when power data is opened to external parties, the sensitive content of external data needs to be identified using the computing power of a CPU.

Although processors offer more and more computing power, raising processor frequency is increasingly difficult, and the room for improving single-core computing power is already very limited now that integrated circuits in processors have reached the nanometer scale. Existing products on the market process power system sensitive data in a single-path manner, so identification efficiency is low.

Disclosure of Invention

In view of the above, the embodiments of the present invention provide a sensitive data identification system, method and storage medium based on heterogeneous computation, so as to solve the technical problem of low sensitive data identification efficiency of a power system.

The technical scheme provided by the invention is as follows:

The first aspect of the embodiment of the invention provides a sensitive data identification system based on heterogeneous computation, which comprises a CPU chip, an AI chip and an FPGA chip connected in sequence; the CPU chip is used for acquiring service data to be marked and sending the service data to be marked to the AI chip; the AI chip is used for determining the service cluster to which the service data to be marked belongs according to the service features of the service data to be marked, and marking the service data to be marked according to that service cluster to obtain marked service data; the FPGA chip is used for identifying sensitive data in the marked service data by parallel processing.

Optionally, the CPU chip includes a network port and a data acquisition and distribution module, where the network port is connected with the data acquisition and distribution module, and the data acquisition and distribution module is connected with the AI chip through a high-speed bus; the network port is used for acquiring network flow data; the data acquisition and distribution module is used for screening the network flow data to obtain service data to be marked, and sending the service data to be marked to the AI chip through a high-speed bus.

Optionally, the CPU chip includes a model policy issuing module, configured to issue a cluster model and an identification policy to the AI chip and the FPGA chip respectively; the AI chip includes: the data characteristic classification module is used for acquiring service clusters corresponding to service categories and clustering centers of the service clusters based on a clustering model, constructing a clustering characteristic value based on expert weights of the service characteristics, constructing a judging threshold based on circumferences formed by the clustering centers, combining the clustering characteristic value, calculating the average Euclidean distance from the service data to be marked to the service clusters according to the service characteristics of the service data to be marked and the service characteristics of the service data in the service clusters, and determining the service clusters to which the service data to be marked belong according to the average Euclidean distance and the judging threshold; the data characteristic labeling module is used for labeling the service data to be labeled according to the service cluster to which the service data to be labeled belongs; and the FPGA chip performs sensitive data identification on the marked business data based on the identification strategy.

Optionally, the AI chip and the FPGA chip are disposed on an FPAI chip, and a main control module is further disposed on the FPAI chip, and is configured to receive an identification result obtained by identifying sensitive data of the marked service data by the FPGA chip, and send the identification result to the CPU chip; the FPAI chip is also provided with a high-speed shared RAM, two ends of the high-speed shared RAM are respectively connected with the AI chip and the FPGA chip, and a cache channel between the AI chip and the FPGA chip is constructed through the high-speed shared RAM.

Optionally, the FPGA chip includes a high-speed sensitive data identification module and an identification result output module; the sensitive data identification module comprises a plurality of parallel high-speed identification units, each corresponding to a service category and constructed from that category's service regular expressions, and the parallel high-speed identification units are used for performing sensitive data identification in parallel on the marked service data belonging to the corresponding service categories; the identification result output module is used for outputting the identification result of the sensitive data identification module.

A second aspect of the embodiment of the present invention provides a sensitive data identification method based on heterogeneous computation, including: acquiring business data to be marked; judging a service cluster to which the service data to be marked belongs according to the service characteristics of the service data to be marked, marking the service data to be marked according to the service cluster to which the service data to be marked belongs, and obtaining marked service data; and identifying the sensitive data of the marked business data in a parallel processing mode.

Optionally, the judging the service cluster to which the service data to be marked belongs according to the service characteristics of the service data to be marked includes: acquiring a service cluster corresponding to a service class and a clustering center of the service cluster based on a clustering model; constructing a clustering feature value based on the expert weight of the service feature, and constructing a judgment threshold based on the circumference formed by the clustering center; combining the cluster characteristic values, and calculating the average Euclidean distance from the service data to be marked to the service cluster according to the service characteristics of the service data to be marked and the service characteristics of the service data in the service cluster; and determining the service cluster to which the service data to be marked belongs according to the average Euclidean distance and the judging threshold value.

Optionally, constructing a cluster feature value based on the expert weights of the service features includes: calculating relative weights based on expert weights of the service features; normalizing the relative weights; and constructing a clustering characteristic value according to the normalized relative weight.

Optionally, in combination with the cluster feature value, calculating an average euclidean distance from the service data to be marked to the service cluster according to the service feature of the service data to be marked and the service feature of the service data in the service cluster, including: calculating a first average Euclidean distance from the service data to be marked to the service cluster according to the service characteristics of the service data to be marked and the service characteristics of the service data in the service cluster; constructing a clustering center vector based on the first average Euclidean distance; and calculating a second average Euclidean distance according to the clustering center vector and the clustering characteristic value, wherein the average Euclidean distance comprises a first average Euclidean distance and a second average Euclidean distance.

Optionally, determining the service cluster to which the service data to be marked belongs according to the average Euclidean distance and the judgment threshold includes: judging whether the first average Euclidean distance and the second average Euclidean distance are both within the judgment threshold; if both are within the judgment threshold, assigning the service data to be marked to the nearest service cluster and updating that service cluster and its cluster center; and if they are not both within the judgment threshold, forming a new service cluster and taking the service data to be marked as the cluster center of the new service cluster.

A third aspect of the embodiment of the present invention provides a computer readable storage medium, where computer instructions are stored, where the computer instructions are configured to cause the computer to perform the sensitive data identification method based on heterogeneous computation according to the second aspect of the embodiment of the present invention.

From the above technical solutions, the embodiment of the present invention has the following advantages:

According to the sensitive data identification system, method and storage medium based on heterogeneous computation, the CPU chip acquires the service data to be marked and sends it to the AI chip; the AI chip determines the service cluster to which the service data belongs according to its service features and marks the service data accordingly; and the FPGA chip identifies sensitive data in the marked service data by parallel processing. The CPU chip, the AI chip and the FPGA chip, connected in sequence, form a heterogeneous computing architecture based on CPU+AI+FPGA. The service data is marked using the efficient data processing capability of the AI chip and processed in parallel by the FPGA chip, which improves the efficiency of identifying sensitive data in power system service data and overcomes the performance bottleneck of traditional software-based identification.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed for the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic diagram of a sensitive data identification system based on heterogeneous computing in an embodiment of the present invention;

FIG. 2 is a flowchart illustrating the operation of a sensitive data identification system based on heterogeneous computing in accordance with an embodiment of the present invention;

FIG. 3 is a flowchart of the AI chip marking service data in an embodiment of the invention;

FIG. 4 is a flow chart of forming a sensitive data identification unit in an embodiment of the invention;

FIG. 5 is a flow chart of a sensitive data identification method based on heterogeneous computing in an embodiment of the invention;

FIG. 6 is a schematic diagram of a computer-readable storage medium according to an embodiment of the present invention.

Detailed Description

In order to make the present invention better understood by those skilled in the art, the following description will clearly and completely describe the technical solutions in the embodiments of the present invention with reference to the accompanying drawings, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The embodiment of the invention provides a sensitive data identification system based on heterogeneous computation which, as shown in FIG. 1 and FIG. 2, comprises a CPU chip, an AI chip and an FPGA chip connected in sequence.

The CPU chip is used for acquiring the service data to be marked and sending it to the AI chip. Specifically, the service data comprises data of different service types such as power marketing, finance, scheduling and power distribution. The service data to be marked is obtained from network traffic data captured by the CPU chip; because its service type cannot be determined at this stage, it must be passed to the AI chip for marking.

The AI chip is used for judging a service cluster to which the service data to be marked belongs according to the service characteristics of the service data to be marked, marking the service data to be marked according to the service cluster to which the service data to be marked belongs, and obtaining the marked service data. Specifically, clustering pre-training is performed on sample data of known service types to obtain service clusters formed by service data of all service types. Extracting service characteristics in service data to be marked, wherein the service characteristics comprise, but are not limited to, characteristics of protocol type, message length, message key field, interaction frequency and the like of the service data. And constructing a feature vector for each service data according to the extracted service features, and judging the service cluster to which the corresponding service data belongs based on the distance between the feature vector and each service cluster, thereby obtaining the service class to which the service data belongs. The service data is intelligently processed through the AI chip, the service data to be marked is marked by utilizing the high-efficiency reasoning characteristic of the AI chip, and the service data to be marked is classified, so that the preprocessing capability of the service data to be marked can be effectively improved.
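The feature-vector step described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the numeric protocol codes in `PROTOCOL_CODES`, the function names, and the choice of plain Euclidean distance to a cluster center are hypothetical and are not specified in the source.

```python
from typing import List, Sequence

# Hypothetical numeric codes for protocol types; the source defines no encoding.
PROTOCOL_CODES = {"iec104": 0.0, "http": 1.0, "mqtt": 2.0}


def feature_vector(protocol: str, length: int, key_field: int,
                   frequency: float) -> List[float]:
    """Build a numeric feature vector from the service features named in the
    text: protocol type, message length, message key field, interaction
    frequency. Unknown protocols map to -1.0 (an assumption)."""
    return [
        PROTOCOL_CODES.get(protocol, -1.0),
        float(length),
        float(key_field),
        float(frequency),
    ]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def nearest_cluster(vec: Sequence[float],
                    centers: Sequence[Sequence[float]]) -> int:
    """Return the index of the cluster center closest to `vec`."""
    return min(range(len(centers)), key=lambda i: euclidean(vec, centers[i]))
```

In practice the features would likely need scaling before distances are compared, since message length can dominate the other components; the sketch omits this for brevity.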

The FPGA chip is used for identifying the sensitive data of the marked business data in a parallel processing mode. Specifically, the FPGA obtains the service category to which the marked service data belongs according to the marking content, and performs sensitive data identification on the service data in a multi-path parallel mode, namely, each path performs sensitive data identification on the service data of one service category, so that the overall identification efficiency of the sensitive data is improved.
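As a software analogue of the multi-path scheme described above, the following Python sketch groups marked messages by service category and scans each category on its own worker. It is a sketch only: the FPGA performs this in hardware, and the category names, the regular expressions in `RULES`, and the function names are hypothetical examples rather than rules taken from the source.

```python
import re
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-category rules; real rules would come from the
# identification strategy issued by the CPU chip.
RULES = {
    "marketing": re.compile(r"\b1\d{10}\b"),   # 11-digit phone-like token
    "finance": re.compile(r"\b\d{16,19}\b"),   # bank-card-like token
}


def scan_category(category, texts):
    """Scan all messages of one category; return (message index, offset, match)."""
    pattern = RULES.get(category)
    if pattern is None:
        return []  # no rule configured for this category
    return [
        (i, m.start(), m.group())
        for i, text in enumerate(texts)
        for m in pattern.finditer(text)
    ]


def identify_parallel(labeled):
    """`labeled` is a list of (category, text) pairs produced by the marking
    step. Each category is handled by its own worker, mirroring one path per
    service category."""
    by_category = defaultdict(list)
    for category, text in labeled:
        by_category[category].append(text)
    if not by_category:
        return {}
    with ThreadPoolExecutor(max_workers=len(by_category)) as pool:
        futures = {
            c: pool.submit(scan_category, c, texts)
            for c, texts in by_category.items()
        }
        return {c: f.result() for c, f in futures.items()}
```

Threads are used here only to show the dispatch structure; they do not reproduce the throughput of dedicated hardware matching units.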

According to the sensitive data identification system based on heterogeneous computation, the CPU chip acquires the service data to be marked and sends it to the AI chip; the AI chip determines the service cluster to which the service data belongs according to its service features and marks the service data accordingly; and the FPGA chip identifies sensitive data in the marked service data by parallel processing. The CPU chip, the AI chip and the FPGA chip, connected in sequence, form a heterogeneous computing architecture based on CPU+AI+FPGA. The service data is marked using the efficient data processing capability of the AI chip and processed in parallel by the FPGA chip, which improves the efficiency of identifying sensitive data in power system service data and overcomes the performance bottleneck of traditional software-based identification.

In one embodiment, the CPU chip comprises a network port and a data acquisition and distribution module, the network port is connected with the data acquisition and distribution module, and the data acquisition and distribution module is connected with the AI chip through a high-speed bus; the network port is used for acquiring network flow data; the data acquisition and distribution module is used for screening the network flow data to obtain service data to be marked, and sending the service data to be marked to the AI chip through the high-speed bus.

The network port may be a 10-gigabit or faster port. Network traffic data is accessed through this port and then preprocessed by the data acquisition and distribution module, which filters out invalid traffic and screens out the service data to be marked. This service data covers the various professional power scenarios, such as marketing, finance and scheduling, and is distributed to the AI chip through the high-speed bus. Specifically, a cache distribution module may also be arranged on the CPU chip to distribute the screened service data to the AI chip through the high-speed bus. Capturing and screening network traffic data with the network port and the data acquisition and distribution module of the CPU chip allows the service data to be marked to be extracted efficiently.

In an embodiment, the AI chip and the FPGA chip are disposed on an FPAI chip, and the FPAI chip is further provided with a main control module, where the main control module is configured to receive the identification result obtained after the FPGA chip performs sensitive data identification on the marked service data, and send the identification result to the CPU chip.

Specifically, the main control module comprises a resource scheduling module, a rule management module, a configuration management module and an upper computer communication module, the identification result is sent to the CPU chip through the upper computer communication module, and the AI chip and the FPGA chip are managed and configured through the resource scheduling module, the rule management module and the configuration management module. The FPAI chip capable of efficiently identifying the sensitive data is constructed through the AI chip, the FPGA chip and the main control module, so that the efficiency of identifying the sensitive data in the service data of the power system is improved, and the performance bottleneck of the traditional software identification is broken through.

In an embodiment, the FPAI chip is further provided with a high-speed shared RAM, two ends of the high-speed shared RAM are respectively connected with the AI chip and the FPGA chip, and a cache channel between the AI chip and the FPGA chip is constructed through the high-speed shared RAM.

Specifically, a cache channel of the AI chip and the FPGA chip is constructed through the high-speed shared RAM, the marked business data is quickly transmitted to the FPGA chip through the high-speed shared RAM for parallel processing, and the data transmission time for identifying the sensitive data is shortened.

In an embodiment, the CPU chip includes a model policy issuing module configured to issue the clustering model and the identification strategy to the AI chip and the FPGA chip, respectively. The identification strategy is the strategy by which the FPGA chip identifies sensitive data; it mainly consists of rules defining which data is sensitive, and the FPGA chip identifies sensitive data according to these rules. The clustering model and the identification strategy are synchronized to the main control module of the FPAI chip. The main control module also obtains the marking result of the AI chip and the identification result of the FPGA chip, updates the parameters or rules in the model and the strategy according to these results, and distributes the updated clustering model and identification strategy to the AI chip and the FPGA chip.

The AI chip includes: the data feature classification module is used for acquiring service clusters corresponding to the service categories and clustering centers of the service clusters based on the clustering model, constructing a clustering feature value based on expert weights of the service features, constructing a judging threshold based on circumferences formed by the clustering centers, combining the clustering feature value, calculating the average Euclidean distance from the service data to be marked to the service clusters according to the service features of the service data to be marked and the service features of the service data in the service clusters, and determining the service clusters to which the service data to be marked belong according to the average Euclidean distance and the judging threshold; the data characteristic labeling module is used for labeling the service data to be labeled according to the service cluster to which the service data to be labeled belongs; and the FPGA chip performs sensitive data identification on the marked business data based on the identification strategy.

Specifically, the clustering model is issued to the AI chip by the CPU chip. The clustering model comprises the service clusters corresponding to the service categories and the cluster centers of those service clusters, and the CPU chip obtains it by performing clustering pre-training on service data samples of known service categories. During clustering pre-training, features of the service data message such as protocol type, interface type, message length, service feature code and interaction frequency are selected from the feature vector of each sample as the clustering convergence indicators, and the samples are clustered with the K-means++ clustering algorithm. The specific steps are as follows:

1. For each service data sample, extract the five features of the service message (protocol type, interface type, message length, service feature code and interaction frequency) to construct the feature vector of the sample, and randomly select one service data sample as the first cluster center.

2. Update the minimum distance between each service data sample and the existing cluster centers.

3. Determine, according to that distance, the probability that each service data sample becomes the next cluster center.

4. Select the next cluster center according to the probability.

5. Repeat steps 2 to 4 until convergence, forming 5 cluster centers.

Then calculate the perimeter (per) of the polygon formed by the cluster centers of the 5 service clusters.
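Steps 1 to 5 and the perimeter calculation can be sketched in Python as follows. This is a minimal sketch under stated assumptions: it covers only the center-selection (seeding) phase, it uses the conventional K-means++ weighting in which the selection probability is proportional to the squared distance (the text says only "according to the distance"), and it treats the perimeter as the length of the closed path through the centers in the given order. The function names are hypothetical.

```python
import math
import random


def kmeans_pp_seeds(samples, k, rng):
    """Pick k initial cluster centers from `samples` (a list of tuples).

    The first center is chosen uniformly at random; each later center is
    drawn with probability proportional to its squared distance to the
    nearest center chosen so far (standard K-means++ weighting, assumed).
    """
    centers = [rng.choice(samples)]
    while len(centers) < k:
        # Step 2: minimum squared distance from each sample to the centers.
        d2 = [min(math.dist(s, c) ** 2 for c in centers) for s in samples]
        if sum(d2) == 0:
            break  # every sample coincides with an existing center
        # Steps 3 and 4: draw the next center according to those weights.
        centers.append(rng.choices(samples, weights=d2, k=1)[0])
    return centers


def polygon_perimeter(centers):
    """Length of the closed path through the centers in the given order.

    The ordering of the centers is an assumption; the source does not say
    how the polygon's vertices are ordered.
    """
    n = len(centers)
    return sum(math.dist(centers[i], centers[(i + 1) % n]) for i in range(n))
```

For example, `kmeans_pp_seeds(samples, 5, random.Random(0))` would return five seed centers, and `polygon_perimeter` applied to the final cluster centers would give the value from which the judgment threshold is later built. The subsequent assignment and center-update iterations of K-means are omitted.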

Clustering pre-training yields the service cluster corresponding to the service data samples of each service category, together with its cluster center. The CPU chip takes these service clusters and cluster centers as the clustering model and issues the model to the AI chip, and the AI chip obtains the service clusters corresponding to the service categories and their cluster centers from the clustering model.

The specific process of constructing the clustering feature value based on the expert weight of the service feature is as follows:

Based on the characteristics of power system service data, service features of the service data message such as protocol type, interface type, message length, service feature code and interaction frequency are selected to form a clustering feature vector, and the service features of each current service data message are ranked. The better a service feature reflects the service data type, the larger its clustering feature value, that is, its weight. That feature therefore carries the largest weight when the final clustering feature value is calculated, so that the clustering feature value reflects the service category of the service data as closely as possible.

Illustratively, assume that a message of marketing service data has 5 service features that can be reported: the 104 protocol, the interface type, the message length, service feature codes such as 0x68, and the interaction frequency. The expert weights of these 5 service features are arranged from high to low [symbols not reproduced in this text]. Expert weights come from empirical assessment, and the relative weights are calculated from the expert weights according to formula (1):

[Formula (1) is not reproduced in this text.]

In formula (1), one parameter is the influence coefficient and another is the convergence factor; n is empirically set to 1 to 5; the nth expert weight and the nth relative weight are both indexed in descending order; and the calculated relative weight is guaranteed to lie between 0 and 1.

The calculation of the relative weights needs to further increase the share of the service feature with the largest expert weight, so that the calculated relative weights better reflect the attribute of the main service feature. Assume that, in the current service, the first expert weight in the descending order is the expert weight of the main service feature. Then, on the basis of formula (1), an additional term is added so that the main service feature carries a higher weight; specifically, its relative weight is calculated by formula (2):

[Formula (2) is not reproduced in this text.]

The relative weights of the other, non-main service features need to be reduced correspondingly by subtracting a term from the original value; that is, the relative weights of the second to fifth service features are calculated by formula (3):

[Formula (3) is not reproduced in this text.]

Finally, the relative weights are normalized, so that the main service feature keeps the higher weight, and the clustering feature value is obtained.
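The overall shape of this weighting step (raise the main feature's weight, lower the others, then normalize) can be illustrated with a short Python sketch. It is not the patent's formulas (1) to (3), which are not legible in this text: the fixed offset `boost`, the equal split of the reduction across the remaining features, and the function name are all hypothetical choices made only to show the idea.

```python
def relative_weights(expert_weights, boost=0.1):
    """Return normalized relative weights in descending order of expert weight.

    Hypothetical scheme: add `boost` to the largest expert weight, subtract
    an equal share of `boost` from each of the others (clamped at zero),
    then divide by the total so the result sums to 1.
    """
    ordered = sorted(expert_weights, reverse=True)
    n = len(ordered)
    if n == 0:
        return []
    if n == 1:
        return [1.0]
    cut = boost / (n - 1)
    adjusted = [ordered[0] + boost]
    adjusted += [max(w - cut, 0.0) for w in ordered[1:]]
    total = sum(adjusted)
    return [w / total for w in adjusted]
```

With five expert weights, the first entry of the result corresponds to the main service feature and is larger than its share before the adjustment, while every entry stays between 0 and 1.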

The AI chip then builds a judgment threshold based on the perimeter of the polygon formed by the cluster centers (the illustrative threshold expression is not reproduced in this text).

And combining the clustering characteristic values, calculating the average Euclidean distance from the service data to be marked to the service cluster according to the service characteristics of the service data to be marked and the service characteristics of the service data in the service cluster, and determining the service cluster to which the service data to be marked belongs according to the average Euclidean distance and the judging threshold value. Illustratively, the service cluster closest to the service data to be marked is judged according to the average Euclidean distance, and the service data to be marked is distributed to the service cluster closest to the service data to be marked. And when calculating the average Euclidean distance, the clustering characteristic value based on the characteristic weight is also considered, so that the calculated average Euclidean distance can reflect the actual distance between the service data and the service cluster.
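The assignment rule can be sketched in Python as follows, combining this paragraph with the rule stated in the disclosure (assign to the nearest cluster when both distances are within the threshold, otherwise start a new cluster). The exact definitions of the two distances are assumptions: here the first is the mean unweighted distance to the cluster's members and the second is a feature-weighted distance to the cluster's centroid. The function names and the way `threshold` is supplied are likewise hypothetical.

```python
import math


def mean_distance(x, members):
    """First distance (assumed form): mean Euclidean distance to all members."""
    return sum(math.dist(x, m) for m in members) / len(members)


def centroid(members):
    """Component-wise mean of the member vectors."""
    n = len(members)
    return tuple(sum(col) / n for col in zip(*members))


def weighted_distance(x, center, weights):
    """Second distance (assumed form): Euclidean distance with each squared
    difference scaled by the clustering feature value of that feature."""
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(x, center, weights)))


def assign(x, clusters, weights, threshold):
    """Assign `x` to a cluster and return its index.

    `clusters` is a list of lists of member vectors and is updated in place.
    If no cluster has both distances within `threshold`, `x` starts a new
    cluster and becomes its first member (and therefore its center).
    """
    best_index, best_d1 = None, None
    for i, members in enumerate(clusters):
        d1 = mean_distance(x, members)
        d2 = weighted_distance(x, centroid(members), weights)
        if d1 <= threshold and d2 <= threshold:
            if best_d1 is None or d1 < best_d1:
                best_index, best_d1 = i, d1
    if best_index is None:
        clusters.append([x])
        return len(clusters) - 1
    clusters[best_index].append(x)  # the centroid is recomputed on the next call
    return best_index
```

In the described system the threshold would be derived from the perimeter of the cluster centers; it is passed in as a plain number here because that derivation is not legible in this text.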

In an embodiment, the CPU chip includes a data display module, where the data display module is configured to receive an identification result obtained after the FPGA chip performs sensitive data identification on the service data, and display the identification result. Specifically, the type and the position of the sensitive data are recorded in the identification result, and the text of the message is also recorded, the identification result is sent to the main control module, and then sent to the CPU chip by the upper computer communication module, and the CPU chip uniformly displays the type and the position of the sensitive data, the text of the message and the like through the data display module.

In one embodiment, the FPGA chip comprises a high-speed sensitive data identification module and an identification result output module; the sensitive data identification module comprises a plurality of parallel high-speed identification units, each corresponding to a service category and constructed from that category's service regular expressions, and the parallel high-speed identification units are used for performing sensitive data identification in parallel on service data belonging to the corresponding service categories; the identification result output module is used for outputting the identification result of the sensitive data identification module.

Specifically, as shown in FIG. 4, the Thompson algorithm is used to convert the service regular expression corresponding to a service category into a nondeterministic finite automaton (NFA) and output the NFA state transition matrix; the subset construction method is then used to build a deterministic finite automaton (DFA) and output the DFA state transition matrix. The resulting state transition matrix is converted, using VHDL/Verilog or the like, into an executable VHD file and burned into the FPGA chip, where it becomes a hardware logic circuit. This forms a plurality of parallel high-speed identification units, which rapidly identify the service data sent by the AI chip; the identification result is output through the identification result output module and sent to the CPU chip for unified display.
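The NFA-to-DFA step can be illustrated in software with a minimal Python sketch of the subset construction. The dictionary-based NFA encoding, the use of `None` for epsilon moves, and the function names are assumptions made for illustration; the resulting `table` plays the role of the DFA state transition matrix, and the later conversion to VHDL/Verilog is not shown.

```python
EPS = None  # label used for epsilon (empty) transitions in this sketch


def epsilon_closure(nfa, states):
    """All NFA states reachable from `states` using only epsilon moves."""
    closure, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in nfa.get(s, {}).get(EPS, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)


def subset_construction(nfa, start, accepting, alphabet):
    """Convert an NFA to a DFA.

    `nfa` maps state -> {symbol: set of next states}. Each DFA state is a
    frozenset of NFA states. Returns (start state, transition table,
    accepting DFA states).
    """
    start_set = epsilon_closure(nfa, {start})
    table, pending = {}, [start_set]
    while pending:
        current = pending.pop()
        if current in table:
            continue
        table[current] = {}
        for sym in alphabet:
            moved = {t for s in current for t in nfa.get(s, {}).get(sym, ())}
            if not moved:
                continue  # no transition on this symbol
            target = epsilon_closure(nfa, moved)
            table[current][sym] = target
            if target not in table:
                pending.append(target)
    dfa_accepting = {s for s in table if s & accepting}
    return start_set, table, dfa_accepting


def dfa_accepts(dfa, text):
    """Run the DFA over `text`; True if it ends in an accepting state."""
    state, table, accepting = dfa
    for ch in text:
        state = table[state].get(ch)
        if state is None:
            return False
    return state in accepting
```

As a usage example, a Thompson-style NFA for the expression `ab*` can be written as `{0: {"a": {1}}, 1: {EPS: {2, 4}}, 2: {"b": {3}}, 3: {EPS: {2, 4}}}` with start state 0 and accepting set `{4}`; passing it to `subset_construction` with the alphabet `"ab"` yields a three-state DFA that accepts `a`, `ab` and `abbb`.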

The sensitive data identification system based on heterogeneous computation provided by the embodiment of the invention is built on a CPU+AI+FPGA heterogeneous computing architecture. The CPU chip handles data distribution and coordination: it captures network traffic data through a network port, preprocesses the traffic, filters out invalid traffic, screens out the service data and distributes it to the AI chip through a high-speed channel. At the same time, the CPU chip issues the clustering model and the matching strategy to the FPAI chip, whose main control module updates the models and strategies of the AI chip and the FPGA chip in a unified manner; the CPU chip also receives the identification result sent by the FPAI chip and displays it as a whole. The AI chip receives the service data distributed by the CPU chip, performs feature clustering and computational analysis on the data, and marks the service data. The FPGA chip performs sensitive data identification on the marked data content in a multi-channel parallel mode. Because the service data is marked by the AI chip and identified by the hardware logic of the FPGA chip, the overall efficiency of sensitive data identification is improved.

The embodiment of the invention also provides a sensitive data identification method based on heterogeneous computation, which is applied to the sensitive data identification system based on heterogeneous computation in the above embodiment and, as shown in fig. 5, comprises the following steps:

Step S100: acquiring service data to be marked;

Step S200: judging a service cluster to which the service data to be marked belongs according to the service characteristics of the service data to be marked, and marking the service data to be marked according to the service cluster to which it belongs, to obtain marked service data;

Step S300: identifying sensitive data in the marked service data in a parallel processing mode. For specific details, refer to the corresponding parts of the above system embodiment; they are not repeated here.

According to the sensitive data identification method based on heterogeneous computing provided by the embodiment of the invention, the service data to be marked is acquired, the service cluster to which it belongs is determined according to its service characteristics, the service data is marked according to that service cluster, and sensitive data in the marked service data is identified in a parallel processing mode, which improves the efficiency of identifying sensitive data in the service data of the power system.
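The flow of steps S100 to S300 can be sketched in software as follows. This is only an illustrative analogue under stated assumptions: the category names, the regular expressions in `PATTERNS`, and the keyword rule in `label` are hypothetical stand-ins (the embodiment labels data by clustering on the AI chip and matches it with FPGA hardware logic), and a thread pool only mirrors the parallel structure, not the hardware throughput.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-category patterns; concrete expressions are not given in the source.
PATTERNS = {
    "billing": {"account_no": re.compile(r"\b\d{10}\b")},
    "customer": {"phone": re.compile(r"\b1\d{10}\b")},
}


def label(record):
    """Stand-in for step S200; a keyword rule replaces the clustering model here."""
    return "billing" if "invoice" in record else "customer"


def identify(labelled):
    """Step S300 for one record: apply only the patterns of its own category."""
    category, text = labelled
    hits = []
    for kind, pattern in PATTERNS[category].items():
        for match in pattern.finditer(text):
            hits.append((kind, match.start(), match.group()))
    return hits


def run_pipeline(records):
    """S100: records arrive; S200: label them; S300: identify in parallel."""
    labelled = [(label(r), r) for r in records]
    with ThreadPoolExecutor() as pool:
        return list(pool.map(identify, labelled))


RESULTS = run_pipeline([
    "invoice for account 1234567890",
    "customer phone 13800001111",
])
```

Each result entry records the type, position and text of a match, which corresponds to the type and position information that the identification result carries in the system embodiment.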

In one embodiment, step S200, determining, according to service characteristics of service data, a service cluster to which the service data belongs, includes:

Step S210: acquiring, based on the clustering model, the service cluster corresponding to each service category and the clustering center of each service cluster. Specifically, the clustering model is issued to the AI chip by the CPU chip and comprises the service clusters corresponding to the service categories and the clustering centers of those service clusters. So that the service category corresponding to each service cluster is known, the CPU chip obtains the clustering model by clustering pre-training on service data samples of known service categories; for the specific process of the clustering pre-training, refer to the system embodiment above. The clustering pre-training yields the service cluster and the clustering center corresponding to the service data samples of each service category; the CPU chip takes these as the clustering model and issues it to the AI chip, and the AI chip obtains the service clusters corresponding to the service categories and their clustering centers from the clustering model.

Step S220: constructing a clustering feature value based on expert weights of the service features, and constructing a judgment threshold based on the circumference formed by the clustering centers. Specifically, constructing the clustering feature value based on the expert weights of the service features includes: calculating relative weights based on the expert weights of the service features; normalizing the relative weights; and constructing the clustering feature value according to the normalized relative weights. For the specific process, refer to the system embodiment above. The judgment threshold constructed based on the circumference formed by the clustering centers is used to judge whether service data belongs to a known service cluster, and thus to determine whether a new service cluster is generated.

Step S230: in combination with the clustering feature value, calculating the average Euclidean distance from the service data to be marked to the service cluster according to the service characteristics of the service data to be marked and the service characteristics of the service data in the service cluster.

Step S240: determining the service cluster to which the service data to be marked belongs according to the average Euclidean distance and the judgment threshold.
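The three sub-steps of step S220 for the clustering feature value (relative weights, normalization, feature value) can be illustrated with a short Python sketch. The exact formulas are given in the system embodiment and are not reproduced here, so everything below is an assumption: the relative weight is taken as each feature's mean expert score divided by the largest mean score, normalization scales the weights to sum to 1, and the feature names are hypothetical.

```python
def cluster_feature_values(expert_scores):
    """Map {feature: [score from each expert]} to {feature: normalized weight}."""
    # Assumed relative weight: mean expert score relative to the largest mean score.
    means = {f: sum(s) / len(s) for f, s in expert_scores.items()}
    top = max(means.values())
    relative = {f: m / top for f, m in means.items()}
    # Normalization: scale the relative weights so that they sum to 1.
    total = sum(relative.values())
    return {f: r / total for f, r in relative.items()}


# Hypothetical features scored by two experts.
WEIGHTS = cluster_feature_values({
    "protocol": [8, 6],
    "port": [4, 4],
    "payload_length": [2, 2],
})
```

With these made-up scores the normalized weights are in the ratio 7 : 4 : 2, so a feature that the experts rate higher contributes more to the clustering feature value.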

Illustratively, given the judgment threshold, the average Euclidean distance from the service data to be marked to each service cluster is calculated, in combination with the clustering feature value, according to the service characteristics of the service data to be marked and the service characteristics of the service data in the service cluster. If the minimum average Euclidean distance is within the judgment threshold, the service data to be marked is distributed to the service cluster nearest to it; otherwise, a new service cluster is generated, the service data to be marked is taken as the new clustering center, and service assignment is performed.
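The assignment rule described above can be sketched in Python as follows. This is a simplified illustration, assuming plain (unweighted) Euclidean distance, a "less than or equal to" comparison for "within the threshold", and clusters stored as lists of feature tuples; the function names and sample values are hypothetical.

```python
import math


def average_distance(x, cluster):
    """Mean Euclidean distance from feature vector x to every point in the cluster."""
    return sum(math.dist(x, y) for y in cluster) / len(cluster)


def assign(x, clusters, threshold):
    """Return the index of the cluster that x joins, creating a new one if needed."""
    distances = [average_distance(x, c) for c in clusters]
    best = min(range(len(clusters)), key=distances.__getitem__)
    if distances[best] <= threshold:
        clusters[best].append(x)   # nearest known cluster is close enough
        return best
    clusters.append([x])           # x becomes the center of a new cluster
    return len(clusters) - 1


CLUSTERS = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 10.0)]]
NEAR = assign((0.0, 1.0), CLUSTERS, threshold=2.0)
FAR = assign((50.0, 50.0), CLUSTERS, threshold=2.0)
```

Here the first point lies at an average distance of 1 from the first cluster and joins it, while the second point is far from every cluster and starts a third one.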

In an embodiment, step S230, in combination with the cluster feature value, calculates an average euclidean distance from the service data to be marked to the service cluster according to the service feature of the service data to be marked and the service feature of the service data in the service cluster, including:

Step S231: calculating a first average Euclidean distance from the service data to be marked to the service cluster according to the service characteristics of the service data to be marked and the service characteristics of the service data in the service cluster;

Step S232: constructing a clustering center vector based on the first average Euclidean distance;

Step S233: calculating a second average Euclidean distance according to the clustering center vector and the clustering feature value, wherein the average Euclidean distance comprises the first average Euclidean distance and the second average Euclidean distance.

Specifically, assume that the feature vector formed by the service features of the service data to be marked is an object X, and that the feature vector formed by the service features of a piece of service data in a service cluster is a point y. The object X and the point y are two discrete points in space, and the distance between them can be calculated using the Euclidean distance. The calculation formula is as follows:

$$dist(X, y) = \sqrt{\sum_{k=1}^{m} (x_k - y_k)^2} \qquad (4)$$

where x_k and y_k denote the k-th components of X and y, and m is the number of service features.

Meanwhile, the distance between the object X and a service cluster can be described by an average Euclidean distance. If the service cluster C contains n points y_1, ..., y_n, the distance between the object X and the service cluster, namely the first average Euclidean distance, is calculated as follows:

$$dist(X, C) = \frac{1}{n} \sum_{j=1}^{n} dist(X, y_j) \qquad (5)$$

Based on formulas (4) and (5), the first average Euclidean distance dist between the service data to be marked and each service cluster is calculated, and a clustering center vector w is constructed from the first average Euclidean distances. Then, the second average Euclidean distance is calculated according to the clustering center vector w and the clustering feature value acquired in advance. Specifically, the calculation formula of the second average Euclidean distance is:

(6)

When calculating the average Euclidean distance, the embodiment of the invention considers not only the first average Euclidean distance, which represents the average distance between the service data to be marked and the service cluster, but also the second average Euclidean distance, which takes into account the clustering feature value based on the feature weights, so that the calculated average Euclidean distance better reflects the actual distance between the service data and the service cluster.

In an embodiment, after the first average euclidean distance and the second average euclidean distance are obtained by calculation, determining the service cluster to which the service data to be marked belongs according to the first average euclidean distance, the second average euclidean distance and the judgment threshold value, including:

judging whether the first average Euclidean distance and the second average Euclidean distance are both within a judging threshold value;

if both are within the judgment threshold, distributing the service data to be marked to the service cluster closest to it, and updating the service cluster and the clustering center;

if they are not both within the judgment threshold, forming a new service cluster and taking the service data to be marked as the clustering center of the new service cluster.

Specifically, as shown in fig. 3, given the judgment threshold, the feature vector corresponding to the service data to be marked is represented by an object X, and the first average Euclidean distance dist and the second average Euclidean distance of the object X are calculated. If the first average Euclidean distance dist and the second average Euclidean distance are both within the judgment threshold, the object X is distributed to the service cluster closest to it, the feature vectors in that service cluster are updated, the clustering center of the service cluster is recalculated as the mean of the points in the service cluster, and the service data to be marked is then marked according to the service cluster to which it belongs. If the first average Euclidean distance dist and the second average Euclidean distance are outside the judgment threshold, the object X does not belong to any known service cluster; a new service cluster is formed based on the deep learning model in the AI chip, service assignment is performed, and the object X becomes the first clustering center of the new cluster.
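The two-distance decision and the recalculation of the clustering center can be sketched as follows. The sketch takes the first and second average Euclidean distances as already computed inputs, since the formula for the second distance is not reproduced in this text; the function names, the "less than or equal to" comparison and the sample values are assumptions for illustration.

```python
def within_threshold(dist1, dist2, threshold):
    """Both average distances must fall within the judgment threshold."""
    return dist1 <= threshold and dist2 <= threshold


def recompute_center(points):
    """Clustering center as the component-wise mean of the member feature vectors."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(len(points[0])))


def place(x, cluster_points, dist1, dist2, threshold):
    """Return (points, center, is_new_cluster) after handling object x."""
    if within_threshold(dist1, dist2, threshold):
        updated = cluster_points + [x]       # x joins the nearest known cluster
        return updated, recompute_center(updated), False
    return [x], tuple(x), True               # x starts a new cluster as its center


JOINED = place((2.0, 2.0), [(0.0, 0.0), (4.0, 4.0)], 1.0, 1.5, threshold=2.0)
CREATED = place((2.0, 2.0), [(0.0, 0.0), (4.0, 4.0)], 1.0, 3.0, threshold=2.0)
```

In the first call both distances are within the threshold, so the object joins the cluster and the center is recomputed as the mean of three points; in the second call the second distance exceeds the threshold and the object becomes the center of a new cluster.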

According to the embodiment of the invention, the service cluster and category to which the service data belongs are determined by clustering the service data to be marked, the service data is marked accordingly, and a machine learning method is introduced into the clustering process, which improves the speed and accuracy of marking.

An embodiment of the present invention further provides a computer readable storage medium, as shown in fig. 6, on which a computer program 510 is stored; when executed by a processor, the computer program implements the steps of the sensitive data identification method based on heterogeneous computation in the above embodiment. The storage medium also stores audio and video stream data, characteristic frame data, interactive request signaling, encrypted data, preset data size and the like. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), a flash memory, a hard disk drive (HDD), or a solid state drive (SSD); the storage medium may also comprise a combination of memories of the kinds described above. Those skilled in the art will appreciate that all or part of the methods of the above embodiments may be implemented by a computer program instructing the relevant hardware, and that the computer program 510 may be stored in a computer readable storage medium and, when executed, may include the processes of the method embodiments described above.

The above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. The sensitive data identification system based on heterogeneous computation is characterized by comprising a CPU chip, an AI chip and an FPGA chip which are connected in sequence;

the CPU chip is used for acquiring service data to be marked and sending the service data to be marked to the AI chip;

the AI chip is used for judging a service cluster to which the service data to be marked belongs according to the service characteristics of the service data to be marked, marking the service data to be marked according to the service cluster to which the service data to be marked belongs, and obtaining marked service data;

the FPGA chip is used for identifying the sensitive data of the marked business data in a parallel processing mode;

the CPU chip comprises a model strategy issuing module which is used for issuing a clustering model and an identification strategy to the AI chip and the FPGA chip respectively;

the AI chip includes:

the data characteristic classification module is used for acquiring service clusters corresponding to service categories and clustering centers of the service clusters based on a clustering model, constructing a clustering characteristic value based on expert weights of the service characteristics, constructing a judging threshold based on circumferences formed by the clustering centers, combining the clustering characteristic value, calculating the average Euclidean distance from the service data to be marked to the service clusters according to the service characteristics of the service data to be marked and the service characteristics of the service data in the service clusters, and determining the service clusters to which the service data to be marked belong according to the average Euclidean distance and the judging threshold;

the data characteristic labeling module is used for labeling the service data to be labeled according to the service cluster to which the service data to be labeled belongs;

and the FPGA chip performs sensitive data identification on the marked business data based on the identification strategy.

2. The heterogeneous computation-based sensitive data identification system of claim 1, wherein the CPU chip comprises a network port and a data acquisition and distribution module, the network port and the data acquisition and distribution module being connected, the data acquisition and distribution module being connected to the AI chip by a high-speed bus;

the network port is used for acquiring network flow data;

the data acquisition and distribution module is used for screening the network flow data to obtain service data to be marked, and sending the service data to be marked to the AI chip through a high-speed bus.

3. The sensitive data identification system based on heterogeneous computation according to claim 1, wherein the AI chip and the FPGA chip are arranged on an FPAI chip, and a main control module is further arranged on the FPAI chip, and the main control module is configured to receive an identification result obtained after the FPGA chip identifies the sensitive data of the marked service data, and send the identification result to the CPU chip;

the FPAI chip is also provided with a high-speed shared RAM, two ends of the high-speed shared RAM are respectively connected with the AI chip and the FPGA chip, and a cache channel between the AI chip and the FPGA chip is constructed through the high-speed shared RAM.

4. The sensitive data identification system based on heterogeneous computation according to claim 1, wherein the FPGA chip comprises a high-speed sensitive data identification module and an identification result output module;

the high-speed sensitive data identification module comprises a plurality of parallel high-speed identification units which correspond to the business categories and are constructed based on business regular expressions, and the parallel high-speed identification units are used for carrying out parallel processing of sensitive data identification on the marked business data belonging to the corresponding business categories;

the identification result output module is used for outputting the identification result of the high-speed sensitive data identification module.

5. A sensitive data identification method based on heterogeneous computing, comprising:

acquiring business data to be marked;

judging a service cluster to which the service data to be marked belongs according to the service characteristics of the service data to be marked, marking the service data to be marked according to the service cluster to which the service data to be marked belongs, and obtaining marked service data;

identifying sensitive data of the marked business data in a parallel processing mode;

the judging the service cluster to which the service data to be marked belongs according to the service characteristics of the service data to be marked comprises the following steps:

acquiring a service cluster corresponding to a service class and a clustering center of the service cluster based on a clustering model;

constructing a clustering feature value based on the expert weight of the service feature, and constructing a judgment threshold based on the circumference formed by the clustering center;

combining the cluster characteristic values, and calculating the average Euclidean distance from the service data to be marked to the service cluster according to the service characteristics of the service data to be marked and the service characteristics of the service data in the service cluster;

and determining the service cluster to which the service data to be marked belongs according to the average Euclidean distance and the judging threshold value.

6. The heterogeneous computation-based sensitive data identification method of claim 5, wherein constructing cluster feature values based on expert weights of the business features comprises:

calculating relative weights based on expert weights of the service features;

normalizing the relative weights;

and constructing a clustering characteristic value according to the normalized relative weight.

7. The heterogeneous computation-based sensitive data identification method according to claim 5, wherein calculating, in combination with the cluster feature value, an average euclidean distance from the service data to be marked to the service cluster according to the service feature of the service data to be marked and the service feature of the service data in the service cluster, includes:

calculating a first average Euclidean distance from the service data to be marked to the service cluster according to the service characteristics of the service data to be marked and the service characteristics of the service data in the service cluster;

constructing a clustering center vector based on the first average Euclidean distance;

and calculating a second average Euclidean distance according to the clustering center vector and the clustering characteristic value, wherein the average Euclidean distance comprises a first average Euclidean distance and a second average Euclidean distance.

8. The sensitive data identification method based on heterogeneous computation according to claim 7, wherein determining, according to the average euclidean distance and the judgment threshold, a service cluster to which the service data to be marked belongs, comprises:

judging whether the first average Euclidean distance and the second average Euclidean distance are both within the judging threshold value;

if both are within the judging threshold value, distributing the service data to be marked to the service cluster closest to the service data to be marked, and updating the service cluster and the clustering center;

and if they are not both within the judging threshold value, forming a new service cluster and taking the service data to be marked as a clustering center of the new service cluster.

9. A computer-readable storage medium storing computer instructions for causing the computer to perform the heterogeneous computation-based sensitive data identification method according to any one of claims 5 to 8.