CN116108491A - Data leakage early warning method, device and system based on semi-supervised federated learning

Info

Publication number: CN116108491A
Application number: CN202310361592.9A
Authority: CN
Inventors: 王滨; 周少鹏; 王旭; 方璐; 朱伟康; 毕志城; 张峰
Original assignee: Hangzhou Hikvision Digital Technology Co Ltd
Current assignee: Hangzhou Hikvision Digital Technology Co Ltd
Priority date: 2023-04-04
Filing date: 2023-04-04
Publication date: 2023-05-12
Anticipated expiration: 2043-04-04
Also published as: CN116108491B

Abstract

Embodiments of the present application provide a data leakage early warning method, device and system based on semi-supervised federated learning. In the embodiments, the sensitive data identification model is trained on the data categories obtained by unsupervised clustering of the collected first target data of Internet of Things terminals, together with the representative data under each data category. Because training does not rely on model parameters obtained by each client training the model itself, the influence of erroneous model parameters trained on devices maliciously planted by an attacker can be avoided, which improves model training precision and, in turn, the accuracy of data leakage prediction. In addition, because different data acquisition and analysis clients collect different first target data, a sensitive data identification model trained on the clustered first target data of every Internet of Things terminal can adapt to leakage detection across the diverse data types of large-scale heterogeneous Internet of Things terminals.

Description

Data leakage early warning method, device and system based on semi-supervised federated learning

Technical Field

The application relates to the technical field of the Internet of Things, and in particular to a data leakage early warning method, device and system based on semi-supervised federated learning.

Background

In Internet of Things applications, terminals are huge in number, dispersed in space and time, and heterogeneous and multi-source. The data they generate is likewise huge in scale and contains a large amount of private data such as user privacy and device information. Once an attacker obtains this sensitive private data, user security is seriously threatened. How to give early warning of the leakage of sensitive Internet of Things data is therefore a technical problem that currently needs to be solved.

Disclosure of Invention

In view of this, embodiments of the present application provide a data leakage early warning method, device and system based on semi-supervised federated learning, so as to give early warning of Internet of Things data leakage.

According to a first aspect of the embodiments of the present application, there is provided a data leakage early warning method based on semi-supervised federated learning, the method being applied to a data acquisition and analysis client and including:

performing unsupervised clustering on collected first target data of an Internet of Things terminal to obtain data categories and representative data under each data category; the first target data includes at least: terminal information of the Internet of Things terminal, operation data of the Internet of Things terminal and service data of the Internet of Things terminal;

encrypting each data category and the representative data under each data category and sending the resulting ciphertext to a centralized server, so that the centralized server trains a sensitive data identification model in a semi-supervised federated learning manner based on the decrypted data categories and the representative data under each data category; the sensitive data identification model is used to identify current data of the Internet of Things terminal, and the data acquisition and analysis client outputs a data leakage early warning when sensitive data is identified in the current data of the Internet of Things terminal.

According to a second aspect of the embodiments of the present application, there is provided a data leakage early warning method based on semi-supervised federated learning, the method being applied to a centralized server and including:

receiving first ciphertexts sent by two or more data acquisition and analysis clients; the first ciphertext sent by any data acquisition and analysis client is obtained by encrypting the data categories, and the representative data under each data category, that the client obtained by performing unsupervised clustering on collected first target data of an Internet of Things terminal; the first target data includes at least: terminal information of the Internet of Things terminal, operation data of the Internet of Things terminal and service data of the Internet of Things terminal;

decrypting the first ciphertexts to obtain each data category and the representative data under each data category;

for each target data category, training a sensitive data identification model in a semi-supervised federated learning manner based on the target data category and the representative data under the target data category, where a target data category is one of the decrypted data categories; the sensitive data identification model is used to identify current data of the Internet of Things terminal, and the data acquisition and analysis client outputs a data leakage early warning when sensitive data is identified in the current data of the Internet of Things terminal.

According to a third aspect of the embodiments of the present application, there is provided a data leakage early warning device based on semi-supervised federated learning, the device being applied to a data acquisition and analysis client and including:

a data clustering module, configured to perform unsupervised clustering on collected first target data of an Internet of Things terminal to obtain data categories and representative data under each data category; the first target data includes at least: terminal information of the Internet of Things terminal, operation data of the Internet of Things terminal and service data of the Internet of Things terminal;

a first model training module, configured to encrypt each data category and the representative data under each data category and send the resulting ciphertext to a centralized server, so that the centralized server trains a sensitive data identification model in a semi-supervised federated learning manner based on the decrypted data categories and the representative data under each data category; the sensitive data identification model is used to identify current data of the Internet of Things terminal, and the data acquisition and analysis client outputs a data leakage early warning when sensitive data is identified in the current data of the Internet of Things terminal.

According to a fourth aspect of the embodiments of the present application, there is provided a data leakage early warning device based on semi-supervised federated learning, the device being applied to a centralized server and including:

a first ciphertext receiving module, configured to receive first ciphertexts sent by two or more data acquisition and analysis clients; the first ciphertext sent by any data acquisition and analysis client is obtained by encrypting the data categories, and the representative data under each data category, that the client obtained by performing unsupervised clustering on collected first target data of an Internet of Things terminal; the first target data includes at least: terminal information of the Internet of Things terminal, operation data of the Internet of Things terminal and service data of the Internet of Things terminal;

a decryption module, configured to decrypt the first ciphertexts to obtain each data category and the representative data under each data category;

a second model training module, configured to train, for each target data category, a sensitive data identification model in a semi-supervised federated learning manner based on the target data category and the representative data under the target data category, where a target data category is one of the decrypted data categories; the sensitive data identification model is used to identify current data of the Internet of Things terminal, and the data acquisition and analysis client outputs a data leakage early warning when sensitive data is identified in the current data of the Internet of Things terminal.

According to a fifth aspect of the embodiments of the present application, there is provided a data leakage early warning system based on semi-supervised federated learning, including:

a data acquisition and analysis client for performing the method as described in the first aspect;

a centralized server for performing the method as described in the second aspect.

According to a sixth aspect of embodiments of the present application, there is provided an electronic device, including: a processor and a memory;

wherein the memory is configured to store machine-executable instructions;

The processor is configured to read and execute the machine executable instructions stored in the memory, so as to implement the method according to the first aspect or the second aspect.

The technical scheme provided by the embodiment of the application can comprise the following beneficial effects:

in the embodiments of the present application, the sensitive data identification model is trained on the data categories obtained by unsupervised clustering of the collected first target data of Internet of Things terminals, together with the representative data under each data category. Because training does not rely on model parameters obtained by each client training the model itself, the influence of erroneous model parameters trained on devices maliciously planted by an attacker can be avoided, which improves model training precision and, in turn, the accuracy of data leakage prediction;

further, with the data categories obtained by unsupervised clustering of the collected first target data, and the representative data under each data category, used as training data, the centralized server locates the data types that carry a privacy leakage risk and the specific features of the data leakage, which improves data leakage identification precision;

still further, because different data acquisition and analysis clients collect different first target data, a sensitive data identification model trained on the clustered first target data of every Internet of Things terminal can adapt to leakage detection across the diverse data types of large-scale heterogeneous Internet of Things terminals.

Drawings

Fig. 1 is a flowchart of a data leakage early warning method based on semi-supervised federated learning according to an embodiment of the present application.

Fig. 2 is a flowchart of another data leakage early warning method based on semi-supervised federated learning according to an embodiment of the present application.

Fig. 3 is a block diagram of a data leakage early warning device based on semi-supervised federated learning according to an embodiment of the present application.

Fig. 4 is a block diagram of another data leakage early warning device based on semi-supervised federated learning according to an embodiment of the present application.

Fig. 5 is a block diagram of a data leakage early warning system based on semi-supervised federated learning according to an embodiment of the present application.

Fig. 6 is a block diagram of an electronic device provided in an embodiment of the present application.

Detailed Description

Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present application as detailed in the accompanying claims.

The terminology used in the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the present application. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.

It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first message may also be referred to as a second message, and similarly, a second message may also be referred to as a first message, without departing from the scope of the present application. The word "if" as used herein may be interpreted as "when", "upon" or "in response to determining", depending on the context.

Next, embodiments of the present specification will be described in detail.

Fig. 1 is a flowchart of a data leakage early warning method based on semi-supervised federated learning according to an embodiment of the present application. The method is applied to a data acquisition and analysis client and includes the following steps:

S110: performing unsupervised clustering on collected first target data of an Internet of Things terminal to obtain data categories and representative data under each data category.

Illustratively, in this embodiment, the Internet of Things terminal may be of many kinds, for example a network video recorder (NVR), a network camera (IP camera, IPC), a digital video recorder (DVR), a sensor, an RFID card reader, and the like; the embodiment of the present application does not specifically limit this.

It should be noted that there may be multiple Internet of Things terminals, in which case federated learning may be performed based on the first target data of each Internet of Things terminal.

In this embodiment, the first target data includes at least: terminal information of the Internet of Things terminal, operation data of the Internet of Things terminal, service data of the Internet of Things terminal, and the like. The terminal information includes at least: terminal name, terminal type, system information of the terminal, etc. The operation data includes at least: process data, service data, port data, etc. The service data includes at least: service interaction data of the Internet of Things platform, services and protocols running on the Internet of Things terminal. The embodiment of the present application does not specifically limit the first target data, the terminal information, the operation data or the service data.

In this embodiment, in step S110, the first target data of the Internet of Things terminal may be collected using conventional methods such as network traffic mirroring or device service process monitoring; the embodiment of the present application does not specifically limit this.

Illustratively, in this embodiment, the data categories may be of many kinds, for example device verification, device access, device streaming, user login, user addition, user logout, and the like; the embodiment of the present application does not specifically limit this.

Clustering divides a data set into categories according to a specific criterion (such as distance), so that data within the same category is as similar as possible and data in different categories differs as much as possible. Specifically, several cluster center points are randomly designated in the data set in advance; the distance from every other data item in the data set to each designated cluster center point is then computed repeatedly, and each data item is assigned according to the computed distances, until the goal that "points in the same category are close enough and points in different categories are far enough" is reached.
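The procedure described above can be sketched as a k-means-style loop. This is only a minimal illustration: the function name `kmeans`, the Euclidean distance and the fixed iteration count are assumptions, and the step that recomputes each center as the mean of its members is the standard k-means update, which the description does not spell out.

```python
import math
import random


def kmeans(points, k, iterations=20, seed=0):
    """Cluster `points` (lists of floats) into k groups.

    Returns (centers, labels), where labels[i] is the index of the
    center that points[i] was assigned to.
    """
    rng = random.Random(seed)
    # Randomly designate k cluster center points from the data set.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment: each point goes to its nearest center.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update (assumed k-means step): move each center to the
        # mean of the points currently assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centers, labels
```

For two well-separated groups of vectors, the returned labels place the members of each group under the same center.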

In this embodiment, each data category obtained by clustering includes at least one item of representative data, where representative data refers to data that can represent the features of the data category. The representative data may be all of the data in the clustered data category or only part of it, for example the data whose distance from the cluster center point of the data category is within a specified distance (e.g., 5). The embodiment of the present application does not specifically limit the representative data or the specified distance.
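One way to pick representative data by distance is sketched below. The function name `select_representatives`, the Euclidean distance and the data layout (points, per-point labels, per-category centers) are assumptions made for illustration.

```python
import math


def select_representatives(points, labels, centers, max_distance=None):
    """Group points by category and keep those near their cluster center.

    With max_distance=None every member of a category is kept, which
    corresponds to using all data in the category as representative data.
    """
    representatives = {label: [] for label in range(len(centers))}
    for point, label in zip(points, labels):
        if max_distance is None or math.dist(point, centers[label]) <= max_distance:
            representatives[label].append(point)
    return representatives
```

With `max_distance=5`, a point 9 units from its center is dropped while points 0 and 1 units away are kept.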

In this embodiment, in step S110, the data categories and the representative data under each data category may be obtained by clustering the first target data with a conventional unsupervised clustering method.

S120: encrypting each data category and the representative data under each data category and sending the resulting ciphertext to the centralized server, so that the centralized server trains a sensitive data identification model in a semi-supervised federated learning manner based on the decrypted data categories and the representative data under each data category; the sensitive data identification model is used to identify current data of the Internet of Things terminal, and the data acquisition and analysis client outputs a data leakage early warning when sensitive data is identified in the current data of the Internet of Things terminal.

Specifically, the centralized server generates a private key and public key pair and sends the public key to the data acquisition and analysis client. After obtaining the data categories and the representative data under each data category, the data acquisition and analysis client encrypts them with the public key and sends the ciphertext to the centralized server. The centralized server decrypts the ciphertext with the private key and trains the sensitive data identification model in a conventional semi-supervised federated learning manner based on the decrypted data categories and the representative data under each data category.

In this embodiment, "semi-supervised" refers to the combination of unsupervised data clustering and supervised model training: the centralized server performs supervised model training on the data categories obtained by unsupervised clustering and the representative data under each data category, and thereby obtains the sensitive data identification model.

In this embodiment, because the centralized server has relatively strong computing performance, it can identify sensitive data types. After decryption, the centralized server obtains the data categories and the representative data under each data category sent by each data acquisition and analysis client, identifies each data category to obtain the sensitive data categories and the representative data under each of them, and merges the representative data belonging to the same sensitive data category into one group.
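The merging step can be sketched as follows. The description does not say how the server decides which categories are sensitive, so this sketch simply takes a hypothetical, pre-determined set `sensitive_categories`; the function name and the payload layout (one dictionary per client mapping a category name to its representative data) are likewise assumptions.

```python
from collections import defaultdict


def merge_sensitive_representatives(client_payloads, sensitive_categories):
    """Merge representative data from all clients, one group per sensitive category.

    client_payloads: list of dicts, one per client, {category: [representative, ...]}.
    sensitive_categories: hypothetical set of category names deemed sensitive.
    """
    merged = defaultdict(list)
    for payload in client_payloads:
        for category, representatives in payload.items():
            if category in sensitive_categories:
                # Representative data of the same sensitive category form one group.
                merged[category].extend(representatives)
    return dict(merged)
```

Each resulting group would then serve as the training set for the model of that category.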

Here, when training the sensitive data identification model, a model is trained for each data category in a conventional semi-supervised federated learning manner based on that data category and the representative data under it.

In this embodiment, after training, the centralized server issues each trained sensitive data identification model to each data acquisition and analysis client, so that each client identifies the current data of the Internet of Things terminal based on the sensitive data identification model and outputs a data leakage early warning when sensitive data is identified in the current data of the Internet of Things terminal.

Here, the data acquisition and analysis client may identify the current data of the Internet of Things terminal based on the sensitive data identification model as follows: the current data of the Internet of Things terminal is input into the sensitive data identification model, which outputs, for each field of the current data, the confidence that the field is sensitive data; early warning information is issued when the confidence that a field is sensitive data is greater than or equal to a specified threshold (for example, 0.4).
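The threshold check can be sketched as below. The function names and the warning text are hypothetical, and the sketch assumes that a warning is raised as soon as at least one field reaches the threshold; the model itself is represented only by its per-field confidence output.

```python
def fields_over_threshold(field_confidences, threshold=0.4):
    """Return the names of fields whose sensitive-data confidence meets the threshold."""
    return [name for name, conf in field_confidences.items() if conf >= threshold]


def check_for_leakage(field_confidences, threshold=0.4):
    """Return a warning message if any field is flagged, otherwise None."""
    flagged = fields_over_threshold(field_confidences, threshold)
    if flagged:
        return "Data leakage early warning: sensitive fields " + ", ".join(flagged)
    return None
```

A field whose confidence equals the threshold exactly is flagged, matching the "greater than or equal to" condition.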

In this embodiment, the early warning may take many forms, for example a voice warning, a sound and light warning, a text warning, and the like; the embodiment of the present application does not specifically limit this.

As an optional implementation of the embodiment of the present application, each data acquisition and analysis client is configured with an initial neural network model whose structure is the same as that of the sensitive data identification model. After the sensitive data identification model is trained, the centralized server sends the model parameters to each data acquisition and analysis client. Each client receives the model parameters, replaces the parameters of its locally configured initial neural network model with the received ones, and then performs data leakage early warning using the initial neural network model with the replaced parameters.
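The parameter replacement can be sketched with a plain dictionary of named parameter lists standing in for a neural network. The class `LocalModel`, its method name and the structure check are hypothetical; a real client would use the parameter-loading facility of whatever framework it runs.

```python
class LocalModel:
    """Stand-in for the client's initial neural network model."""

    def __init__(self, parameters):
        # parameters: {layer name: list of weights}
        self.parameters = {name: list(values) for name, values in parameters.items()}

    def replace_parameters(self, received):
        """Overwrite local parameters with those sent by the centralized server."""
        # The two models must share a structure, so names and sizes must match.
        if received.keys() != self.parameters.keys():
            raise ValueError("parameter names differ: model structures do not match")
        for name, values in received.items():
            if len(values) != len(self.parameters[name]):
                raise ValueError(f"size mismatch for parameter {name!r}")
        self.parameters = {name: list(values) for name, values in received.items()}
```

Checking names and sizes before overwriting reflects the requirement that the local model and the trained model have the same structure.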

So far, the description of the flow shown in fig. 1 is completed.

As can be seen from the flow of Fig. 1, in the embodiment of the present application the sensitive data identification model is trained on the data categories obtained by unsupervised clustering of the collected first target data of the Internet of Things terminal, together with the representative data under each data category. Because training does not rely on model parameters obtained by each client training the model itself, the influence of erroneous model parameters trained on devices maliciously planted by an attacker can be avoided, which improves model training precision and, in turn, the accuracy of data leakage prediction.

As an optional implementation of the embodiment of the present application, in step S110, performing unsupervised clustering on the collected first target data of the Internet of Things terminal to obtain the data categories and the representative data under each data category includes:

first, cleaning the collected first target data of the Internet of Things terminal according to a set data cleaning mode, and vectorizing the cleaned data according to a set standardization processing mode, to obtain standard format data for unsupervised clustering;

second, performing unsupervised clustering on the obtained standard format data to obtain the data categories and the representative data under each data category.

In this embodiment, the set data cleaning mode may be of many kinds, for example deleting first target data that does not meet requirements or trimming the fields of the first target data; the embodiment of the present application does not specifically limit this.

In the present embodiment, the set normalization processing mode refers to data vectorization processing.

In this embodiment, the standard format data may include: the size of the data packet, the order of the data packet structure, the special fields of the data packet, etc.; the embodiment of the present application does not specifically limit this.
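A minimal sketch of cleaning followed by vectorization is given below. Everything concrete in it is an assumption made for illustration: the record keys (`packet_size`, `field_order`, `payload`), the truncation length, the special field names and the feature layout are hypothetical and are not specified by the description.

```python
REQUIRED_FIELDS = ("packet_size", "field_order", "payload")  # hypothetical record keys
MAX_PAYLOAD_LEN = 64  # hypothetical trimming length


def clean(records):
    """Drop records with missing required fields and trim over-long payloads."""
    cleaned = []
    for record in records:
        if any(record.get(field) is None for field in REQUIRED_FIELDS):
            continue  # delete data that does not meet requirements
        trimmed = dict(record)
        trimmed["payload"] = trimmed["payload"][:MAX_PAYLOAD_LEN]
        cleaned.append(trimmed)
    return cleaned


def vectorize(record, special_fields=("password", "token")):
    """Turn one cleaned record into a numeric vector for clustering.

    Features: packet size, number of fields, then for each special field
    its 1-based position in the packet structure (0.0 if absent).
    """
    order = record["field_order"]
    positions = [
        float(order.index(name) + 1) if name in order else 0.0
        for name in special_fields
    ]
    return [float(record["packet_size"]), float(len(order)), *positions]
```

The resulting vectors can be fed directly to a distance-based clustering routine.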

As an optional implementation of the embodiment of the present application, after the sensitive data identification model is obtained, the data leakage early warning method based on semi-supervised federated learning further includes:

first, performing unsupervised clustering on second target data of the Internet of Things terminal collected at a specified interval to obtain data categories and representative data under each data category; the second target data includes at least: terminal information of the Internet of Things terminal, operation data of the Internet of Things terminal and service data of the Internet of Things terminal;

second, encrypting each data category and the representative data under each data category and sending the resulting ciphertext to the centralized server, so that the centralized server updates the trained sensitive data identification model based on the decrypted data categories and the representative data under each data category.

Illustratively, in this embodiment, the specified interval may be any length of time, for example 1 month or 15 days; the embodiment of the present application does not specifically limit this.

In this embodiment, after the sensitive data identification model is obtained, the second target data of the Internet of Things terminal is collected periodically and then clustered without supervision to obtain data categories and the representative data under each data category, so that the centralized server can update the trained sensitive data identification model based on them. The update process is the same as the training process and is not repeated here.

The centralized server sends the updated sensitive data identification model to each data acquisition and analysis client, so that each client identifies the current data of the Internet of Things terminal using the updated model and outputs a data leakage early warning when sensitive data is identified in the current data of the Internet of Things terminal.

It should be noted that, after the sensitive data identification model is obtained, the second target data collected by the data acquisition and analysis client may contain data different from the first target data used to train the model. Therefore, in this embodiment, while the trained sensitive data identification model is being updated, a new sensitive data identification model may also be trained based on the data categories that differ from those of the first target data and the representative data under those data categories.

Fig. 2 is a flowchart of another data leakage early warning method based on semi-supervised federated learning according to an embodiment of the present application. The method is applied to a centralized server and includes the following steps:

S210: receiving first ciphertexts sent by two or more data acquisition and analysis clients; the first ciphertext sent by any data acquisition and analysis client is obtained by encrypting the data categories, and the representative data under each data category, that the client obtained by performing unsupervised clustering on collected first target data of an Internet of Things terminal; the first target data includes at least: terminal information of the Internet of Things terminal, operation data of the Internet of Things terminal and service data of the Internet of Things terminal;

S220: decrypting the first ciphertexts to obtain each data category and the representative data under each data category;

S230: for each target data category, training a sensitive data identification model in a semi-supervised federated learning manner based on the target data category and the representative data under the target data category; the sensitive data identification model is used to identify current data of the Internet of Things terminal, and the data acquisition and analysis client outputs a data leakage early warning when sensitive data is identified in the current data of the Internet of Things terminal.

In this embodiment, a target data category is one of the decrypted data categories. The target data categories may be all of the decrypted data categories or only some of them, for example the data categories whose number of representative data items is greater than or equal to a specified number threshold; the embodiment of the present application does not specifically limit this.

As an optional implementation of the embodiment of the present application, the terminal information of the Internet of Things terminal includes at least: terminal name, terminal type and system information of the terminal;

the operation data of the Internet of Things terminal includes at least: process data, service data and port data;

the service data of the Internet of Things terminal includes at least: service interaction data of the Internet of Things platform, services and protocols running on the Internet of Things terminal.

For details, refer to the description of the corresponding process in the embodiment whose execution body is the data acquisition and analysis client; they are not repeated here.

This completes the description of the flowchart shown in fig. 2.

As an optional implementation of the embodiment of the present application, in S230, training, for each target data category, the sensitive data identification model in a semi-supervised federated learning manner based on the target data category and the representative data under the target data category includes:

for each data category, determining whether the number of representative data items under the data category is greater than or equal to a specified number threshold;

if so, taking the data category as a target data category, and training a sensitive data identification model in a semi-supervised federated learning manner based on the target data category and the representative data under the target data category;

if not, saving the data category and the representative data under the data category; and when the total number of representative data items received for the data category becomes greater than or equal to the specified number threshold, training a sensitive data identification model in a semi-supervised federated learning manner based on the data category and the representative data under the data category.

Illustratively, in the present embodiment, the specified number threshold may be any number, for example 100, which is not particularly limited in the embodiments of the present application.
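The threshold check described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the present application: the class name `CategoryBuffer`, the method `receive`, and the choice to clear the saved data once a data category qualifies are all hypothetical, and the value 100 is simply the example threshold given in the text.

```python
from collections import defaultdict

# Example value taken from the text; any number may be configured.
SPECIFIED_NUMBER_THRESHOLD = 100


class CategoryBuffer:
    """Hypothetical helper: saves representative data per data category
    until the specified number threshold is reached."""

    def __init__(self, threshold=SPECIFIED_NUMBER_THRESHOLD):
        self.threshold = threshold
        self.saved = defaultdict(list)  # data category -> saved representative data

    def receive(self, category, representatives):
        """Return the data to train on if the category qualifies, else None."""
        pending = self.saved[category] + list(representatives)
        if len(pending) >= self.threshold:
            # Threshold reached: the category is treated as a target data category.
            # Clearing the saved data here is an assumption of this sketch.
            self.saved.pop(category, None)
            return pending
        # Below the threshold: save and wait for later uploads.
        self.saved[category] = pending
        return None


# Usage with a small threshold so the behaviour is easy to follow.
buf = CategoryBuffer(threshold=3)
first = buf.receive("port_data", ["p1", "p2"])   # saved, not trained yet
second = buf.receive("port_data", ["p3"])        # total is now 3, ready to train
```

In this sketch the first call returns `None` because only two items have arrived, and the second call returns all three saved items, at which point training in the semi-supervised federal learning manner would be triggered by the caller.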

In this step, for each data category, it is determined whether the number of representative data under the data category is greater than or equal to the specified number threshold; if so, the data category is taken as a target data category, and the sensitive data identification model is trained based on the target data category and the representative data under the target data category in a semi-supervised federal learning manner.

As described above for the update process of the sensitive data identification model, after the model is trained, the data acquisition and analysis client collects second target data of the internet of things terminal at a specified interval in order to update the model, and periodically sends the data categories and the representative data under each data category to the centralized server. Therefore, if the number of representative data under a data category is smaller than the specified number threshold, the data category and its representative data are saved first; when the total number of representative data subsequently received under that data category becomes greater than or equal to the specified number threshold, the corresponding sensitive data identification model is trained based on the data category and its representative data in a semi-supervised federal learning manner.

In another embodiment, if not, the data category and the representative data under the data category are not processed.

Meanwhile, after the sensitive data identification model is obtained, the data acquisition and analysis client may acquire data that differs from the first target data of the internet of things terminal. In this embodiment, while the trained sensitive data identification model is updated, the step of determining, for each data category, whether the number of representative data under the data category is greater than or equal to the specified number threshold is also performed for the data categories that differ from those of the first target data and for the representative data under those data categories.

According to the embodiment of the application, the number of representative data under each data category is checked, and model training is not performed for data categories with too few representative data, which reduces the amount of data processing.

As an optional implementation manner of the embodiment of the present application, after the sensitive data identification model is obtained, the method further includes:

first, receiving a second ciphertext transmitted by a data acquisition and analysis client; the second ciphertext transmitted by the data acquisition and analysis client is obtained by encrypting a data category obtained by performing unsupervised clustering on second target data of the internet of things terminal acquired at a specified interval and representative data under the data category; the second target data includes at least: terminal information of the terminal of the Internet of things, operation data of the terminal of the Internet of things and service data of the terminal of the Internet of things;

second, updating the sensitive data identification model according to each data category obtained after the second ciphertext is decrypted and the representative data under each data category.
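The message handling around this update step can be sketched in Python as follows. The text does not specify the encryption scheme, the payload format, or how the model itself is updated, so everything here is an assumption made for illustration: the JSON payload layout, the function names `pack_second_ciphertext` and `handle_second_ciphertext`, and the stand-in codec functions. The stand-ins use Base64, which is an encoding and NOT encryption; a real deployment would pass in a proper cipher.

```python
import base64
import json


def standin_encrypt(plaintext: bytes) -> bytes:
    # Stand-in only: Base64 is NOT encryption. Replace with a real cipher.
    return base64.b64encode(plaintext)


def standin_decrypt(ciphertext: bytes) -> bytes:
    # Inverse of the stand-in above.
    return base64.b64decode(ciphertext)


def pack_second_ciphertext(categories, encrypt):
    """Client side (hypothetical): serialize {category: [representative data]}
    and encrypt it with the supplied callable."""
    payload = json.dumps(categories, sort_keys=True).encode("utf-8")
    return encrypt(payload)


def handle_second_ciphertext(ciphertext, decrypt, training_set):
    """Server side (hypothetical): decrypt, recover the data categories and
    their representative data, and merge them into the data used for the update."""
    categories = json.loads(decrypt(ciphertext).decode("utf-8"))
    for category, representatives in categories.items():
        training_set.setdefault(category, []).extend(representatives)
    return training_set


# Usage: one periodic upload merged into previously received data.
training_set = {"port_data": ["p1"]}
second_ciphertext = pack_second_ciphertext(
    {"port_data": ["p2"], "process_data": ["q1"]}, standin_encrypt
)
training_set = handle_second_ciphertext(second_ciphertext, standin_decrypt, training_set)
```

The sketch stops at assembling the merged data; retraining or fine-tuning the sensitive data identification model on that data is left to whatever training routine the centralized server uses.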

For details, refer to the description of the above embodiments; they are not repeated here.

Corresponding to the embodiment of the method, the embodiment of the application also provides an embodiment of the device and the terminal applied by the device.

As shown in fig. 3, fig. 3 is a block diagram of a data leakage early warning device based on semi-supervised federal learning according to an embodiment of the present application, where the device is applied to a data acquisition and analysis client, and the device includes:

the data clustering module is used for performing unsupervised clustering on the collected first target data of the terminal of the Internet of things to obtain data types and representative data under the data types; the first target data includes at least: terminal information of the terminal of the Internet of things, operation data of the terminal of the Internet of things and business data of the terminal of the Internet of things;

the first model training module is used for encrypting each data category and the representative data under each data category and sending them, once encrypted, to the centralized server, so that the centralized server trains a sensitive data identification model based on the decrypted data categories and the representative data under each data category in a semi-supervised federal learning manner; the sensitive data identification model is used for identifying the current data of the terminal of the Internet of things, and when sensitive data is identified in the current data of the terminal of the Internet of things, the data acquisition and analysis client outputs a data leakage early warning.
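The early warning step on the data acquisition and analysis client can be sketched in Python as follows. The trained sensitive data identification model is represented by a plain callable named `identify`, which is an assumption of this sketch, as are the `LeakageWarning` record and the function `scan_current_data`; the text does not prescribe an interface for the model or a format for the warning.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class LeakageWarning:
    """Hypothetical warning record output by the client."""
    terminal: str
    record: str
    category: str


def scan_current_data(
    terminal: str,
    current_records: Iterable[str],
    identify: Callable[[str], Optional[str]],
) -> List[LeakageWarning]:
    """Apply the model callable to each current record and collect a warning
    for every record identified as sensitive."""
    warnings = []
    for record in current_records:
        category = identify(record)  # None means "not sensitive"
        if category is not None:
            warnings.append(LeakageWarning(terminal, record, category))
    return warnings


# Usage with a toy stand-in for the trained model (keyword match only).
def toy_model(record: str) -> Optional[str]:
    return "credential" if "password" in record else None


alerts = scan_current_data(
    "camera-01",
    ["GET /status", "POST /login password=abc"],
    toy_model,
)
```

Here only the second record triggers a warning; in practice the callable would wrap the model trained by the centralized server rather than a keyword match.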

As an optional implementation manner of the embodiment of the present application, the data clustering module is specifically configured to:

carrying out data cleaning on the first target data of the acquired internet of things terminal according to a set data cleaning mode, and carrying out data vectorization on the cleaned data according to a set standardized processing mode to obtain standard format data for unsupervised clustering;

and performing unsupervised clustering on the obtained standard format data to obtain data types and representative data under the data types.
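The cleaning, vectorization, and clustering steps above can be sketched in pure Python as follows. The text leaves the "set data cleaning mode", the "set standardized processing mode", and the clustering algorithm open, so the choices below are assumptions made only for illustration: cleaning trims, lowercases, and removes empty or duplicate records; vectorization uses hashed token counts with L2 normalization; clustering is plain k-means with deterministic initialization; and a single representative per data category is taken as the record nearest the cluster centroid. All function names and the vector size `DIM` are hypothetical.

```python
import math
import re
import zlib

DIM = 16  # hypothetical vector size


def clean(records):
    """Assumed cleaning mode: trim, lowercase, drop empties and duplicates."""
    seen, out = set(), []
    for r in records:
        r = r.strip().lower()
        if r and r not in seen:
            seen.add(r)
            out.append(r)
    return out


def vectorize(record):
    """Assumed standardization: hashed token counts, L2-normalized."""
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z0-9]+", record):
        vec[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iters=20):
    """Plain k-means; centroids start at the first k vectors (deterministic)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist2(v, centroids[c])) for v in vectors]
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def cluster_records(records, k):
    """Return {category id: representative record nearest the centroid}."""
    cleaned = clean(records)
    vectors = [vectorize(r) for r in cleaned]
    k = min(k, len(vectors))  # never ask for more clusters than records
    labels, centroids = kmeans(vectors, k)
    reps = {}
    for c in range(k):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if members:
            best = min(members, key=lambda i: dist2(vectors[i], centroids[c]))
            reps[c] = cleaned[best]
    return reps


# Usage: cluster a few raw records into at most two data categories.
representatives = cluster_records(
    ["proc sshd", "proc sshd ", "port 22 tcp", "port 80 tcp"], 2
)
```

The integer keys stand in for data categories; an implementation could equally keep several representative records per category, which the text allows.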

As an optional implementation manner of the embodiment of the present application, the terminal information of the terminal of the internet of things at least includes: terminal name, terminal type and system information of the terminal;

As an optional implementation manner of the embodiment of the present application, the data leakage early warning device based on semi-supervised federal learning further includes:

the data category obtaining module is used for performing unsupervised clustering on second target data of the internet of things terminal acquired at intervals to obtain a data category and representative data under the data category; the second target data includes at least: terminal information of the terminal of the Internet of things, operation data of the terminal of the Internet of things and business data of the terminal of the Internet of things;

the first updating module is used for encrypting each data category and the representative data under each data category and sending them, once encrypted, to the centralized server, so that the centralized server updates the trained sensitive data identification model based on each decrypted data category and the representative data under each data category.

The description of the block diagram of the apparatus shown in fig. 3 is thus completed.

As shown in fig. 4, fig. 4 is a block diagram of a data leakage early warning device based on semi-supervised federal learning according to an embodiment of the present application, where the device is applied to a centralized server, and the device includes:

the first ciphertext receiving module is used for receiving the first ciphertext transmitted by the more than two data acquisition and analysis clients; the first ciphertext transmitted by any data acquisition and analysis client is obtained by encrypting a data type obtained by performing unsupervised clustering on the acquired first target data of the terminal of the Internet of things and representative data under the data type; the first target data includes at least: terminal information of the terminal of the Internet of things, operation data of the terminal of the Internet of things and business data of the terminal of the Internet of things;

the second model training module is used for training, for each target data category, a sensitive data identification model based on the target data category and the representative data under the target data category in a semi-supervised federal learning manner; the sensitive data identification model is used for identifying the current data of the terminal of the Internet of things, and when sensitive data is identified in the current data of the terminal of the Internet of things, the data acquisition and analysis client outputs a data leakage early warning; the target data category is one of the decrypted data categories.

As an optional implementation manner of the embodiment of the present application, the second model training module is specifically configured to:

the second ciphertext receiving module is used for receiving a second ciphertext transmitted by the data acquisition and analysis client; the second ciphertext transmitted by the data acquisition and analysis client is obtained by encrypting a data category obtained by performing unsupervised clustering on second target data of the internet of things terminal acquired at intervals and representative data under the data category; the second target data includes at least: terminal information of the terminal of the Internet of things, operation data of the terminal of the Internet of things and business data of the terminal of the Internet of things;

And the second updating module is used for updating the sensitive data identification model according to each data type obtained after the second ciphertext is decrypted and the representative data under each data type.

Thus, the description of the block diagram of the apparatus shown in fig. 4 is completed.

The implementation process of the functions and roles of each unit in the above device is specifically shown in the implementation process of the corresponding steps in the above method, and will not be described herein again.

For the device embodiments, reference is made to the description of the method embodiments for the relevant points, since they essentially correspond to the method embodiments. The apparatus embodiments described above are merely illustrative, wherein the modules illustrated as separate components may or may not be physically separate, and the components shown as modules may or may not be physical, i.e., may be located in one place, or may be distributed over a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purposes of the present description. Those of ordinary skill in the art will understand and implement the present invention without undue burden.

As shown in fig. 5, fig. 5 is a block diagram of a data leakage early warning system based on semi-supervised federal learning according to an embodiment of the present application, where the system includes:

the data acquisition and analysis client is used for executing the above method whose execution subject is the data acquisition and analysis client;

and the centralized server is used for executing the above method whose execution subject is the centralized server.

This completes the description of the block diagram shown in fig. 5.

Correspondingly, the embodiment of the application also provides a hardware structure diagram of the device shown in fig. 3 or the device shown in fig. 4, and in particular, as shown in fig. 6, the electronic device may be a device for implementing the method. As shown in fig. 6, the hardware structure includes: a processor and a memory.

Wherein the memory is configured to store machine-executable instructions;

the processor is configured to read and execute the machine-executable instructions stored in the memory, so as to implement the above embodiments of the data leakage early warning method based on semi-supervised federal learning.

In one embodiment, the memory may be any electronic, magnetic, optical, or other physical storage device that can contain or store information such as executable instructions and data. For example, the memory may be volatile memory, nonvolatile memory, or a similar storage medium. In particular, the memory may be RAM (Random Access Memory), flash memory, a storage drive (e.g., a hard drive), a solid state disk, any type of storage disk (e.g., an optical disk, a DVD, etc.), a similar storage medium, or a combination thereof.

Thus, the description of the electronic device shown in fig. 6 is completed.

The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

Other embodiments of the present description will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This specification is intended to cover any variations, uses, or adaptations of the specification following, in general, the principles of the specification and including such departures from the present disclosure as come within known or customary practice within the art to which the specification pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the specification being indicated by the following claims.

It should be understood that the present description is not limited to the structures that have been described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the present description is limited only by the appended claims.

The foregoing description of the preferred embodiments is provided for the purpose of illustration only, and is not intended to limit the scope of the disclosure, since any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the disclosure are intended to be included within the scope of the disclosure.

Claims

1. The data leakage early warning method based on semi-supervised federal learning is characterized by being applied to a data acquisition and analysis client, and comprises the following steps:

2. The method of claim 1, wherein performing unsupervised clustering on the collected first target data of the terminal of the internet of things to obtain a data class and representative data under the data class, includes:

3. The method of claim 1, wherein after deriving the sensitive data identification model, the method further comprises:

performing unsupervised clustering on second target data of the internet of things terminal acquired at intervals of a designated interval to obtain a data category and representative data under the data category; the second target data includes at least: terminal information of the terminal of the Internet of things, operation data of the terminal of the Internet of things and service data of the terminal of the Internet of things;

encrypting each data category and the representative data under each data category and transmitting the encrypted representative data to a centralized server, so that the centralized server updates the trained sensitive data identification model based on each decrypted data category and the representative data under each data category.

4. A data leakage early warning method based on semi-supervised federal learning, wherein the method is applied to a centralized server, the method comprising:

receiving a first ciphertext transmitted by more than two data acquisition analysis clients; the first ciphertext transmitted by any data acquisition and analysis client is obtained by encrypting a data type obtained by performing unsupervised clustering on the acquired first target data of the terminal of the Internet of things and representative data under the data type; the first target data includes at least: terminal information of the terminal of the Internet of things, operation data of the terminal of the Internet of things and service data of the terminal of the Internet of things;

5. The method of claim 4, wherein training the sensitive data identification model for each target data category based on the target data category and the representative data under the target data category and using semi-supervised federal learning comprises:

if yes, taking the data category as a target data category, and training a sensitive data identification model based on the target data category and representative data under the target data category by adopting a semi-supervised federal learning mode;

if not, the data category and the representative data under the data category are saved; and when the total number of the received representative data under the data category is greater than or equal to the specified number threshold, training a sensitive data identification model based on the data category and the representative data under the data category in a semi-supervised federal learning mode.

6. The method of claim 4, wherein after deriving the sensitive data identification model, the method further comprises:

receiving a second ciphertext transmitted by the data acquisition and analysis client; the second ciphertext transmitted by the data acquisition and analysis client is obtained by encrypting a data category obtained by performing unsupervised clustering on second target data of the internet of things terminal acquired at intervals and representative data under the data category; the second target data includes at least: terminal information of the terminal of the Internet of things, operation data of the terminal of the Internet of things and service data of the terminal of the Internet of things;

And updating the sensitive data identification model according to each data type obtained after the second ciphertext is decrypted and the representative data under each data type.

7. A data leakage early warning device based on semi-supervised federal learning, characterized in that the device is applied to a data acquisition and analysis client, and the device comprises:

8. A data leakage early warning device based on semi-supervised federal learning, the device being applied to a centralised server, the device comprising:

9. A data leakage early warning system based on semi-supervised federal learning, characterized in that the system comprises:

a data acquisition analysis client for performing the method of any of claims 1-3;

a centralised server for performing the method of any of claims 4-6.

10. An electronic device, characterized in that the electronic device comprises: a processor and a memory;

wherein the memory is configured to store machine-executable instructions;

the processor is configured to read and execute the machine executable instructions stored in the memory to implement the method according to any one of claims 1 to 6.