CN112819156A - Data processing method, device and equipment - Google Patents

Data processing method, device and equipment

Info

Publication number
CN112819156A
CN112819156A (application CN202110102757.1A)
Authority
CN
China
Prior art keywords
data
submodel
detection model
toxin
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110102757.1A
Other languages
Chinese (zh)
Inventor
曹佳炯 (Cao Jiajiong)
丁菁汀 (Ding Jingting)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd
Priority to CN202110102757.1A
Publication of CN112819156A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

Embodiments of this specification disclose a data processing method, apparatus, and device. The method includes: acquiring target data to be detected, provided by a data-providing organization in a federated learning framework; inputting the target data into a first data risk detection model to obtain a first output result, and into a second data risk detection model to obtain a second output result, where the first data risk detection model is obtained by supervised training on first sample data provided by data-providing organizations in the federated learning framework, and the second data risk detection model is obtained by supervised training, in an information-reconstruction manner, on second sample data provided by those organizations together with the organization identifier of the data-providing organization to which the second sample data belongs; and, if the first output result and the second output result do not match, determining that the target data is data containing toxin information.

Description

Data processing method, device and equipment
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a data processing method, apparatus, and device.
Background
Machine learning and deep learning have been widely used in recent years, but because of privacy concerns the data of the various organizations cannot be shared with each other; federated learning was therefore proposed to solve this data-silo problem.
However, malicious attack behavior may exist inside a federated learning framework. For example, one or more data-providing organizations may upload data carrying toxin information (i.e., poisoned data) to interfere with the model gradients computed by the third-party platform. Because the third-party platform in the framework is trusted by every organization, the toxin information supplied by a malicious organization goes entirely undetected, so the model performance of the other organizations degrades sharply and the reliability and effectiveness of the entire federated learning process are damaged. Common defense mechanisms against data (especially sample data) carrying toxin information are difficult to apply under a federated learning framework. A technical solution is therefore needed that offers stronger defense against toxin information and can improve the reliability and effectiveness of federated learning.
Disclosure of Invention
The purpose of the embodiments of this specification is to provide a technical solution that offers stronger defense against toxin information and can improve the reliability and effectiveness of federated learning.
To achieve this, the embodiments of the present specification are implemented as follows:
The data processing method provided by the embodiments of this specification includes the following steps. Target data to be detected, provided by a data-providing organization in the federated learning framework, is acquired. The target data is input into a first data risk detection model to obtain a first output result, and into a second data risk detection model to obtain a second output result; the first data risk detection model is obtained by supervised training on first sample data provided by data-providing organizations in the federated learning framework, and the second data risk detection model is obtained by supervised training, in an information-reconstruction manner, on second sample data provided by those organizations together with the organization identifier of the data-providing organization to which the second sample data belongs. If the first output result and the second output result do not match, the target data is determined to be data containing toxin information.
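The claimed flow can be sketched as follows. The two detection models here are hypothetical stubs (the patent does not prescribe their internals, so simple threshold rules stand in for them); only the mismatch rule is taken from the method above.

```python
# Minimal sketch of the detection flow: two independent risk detection models,
# and a mismatch between their outputs flags toxin information.

def first_model(target_data):
    # Stand-in for the supervised first data risk detection model:
    # flag the data as risky when the mean value exceeds a threshold.
    return "risky" if sum(target_data) / len(target_data) > 0.5 else "safe"

def second_model(target_data):
    # Stand-in for the reconstruction-based second data risk detection model:
    # flag the data as risky when any value exceeds a (different) threshold.
    return "risky" if max(target_data) > 2.0 else "safe"

def contains_toxin_information(target_data):
    """Return True when the two detection models disagree on the target data."""
    return first_model(target_data) != second_model(target_data)
```

In this toy setting, data on which both stubs agree is treated as clean, and a disagreement marks the data as containing toxin information.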
The embodiments of this specification also provide a data processing apparatus, which includes: a data acquisition module, configured to acquire target data to be detected provided by a data-providing organization in the federated learning framework; a toxin information detection module, configured to input the target data into a first data risk detection model to obtain a first output result and into a second data risk detection model to obtain a second output result, where the first data risk detection model is obtained by supervised training on first sample data provided by data-providing organizations in the federated learning framework, and the second data risk detection model is obtained by supervised training, in an information-reconstruction manner, on second sample data provided by those organizations together with the organization identifier of the data-providing organization to which the second sample data belongs; and a toxin information determining module, configured to determine that the target data is data containing toxin information if the first output result and the second output result do not match.
The embodiments of this specification also provide a data processing device, including a processor and a memory arranged to store computer-executable instructions that, when executed, cause the processor to: acquire target data to be detected provided by a data-providing organization in the federated learning framework; input the target data into a first data risk detection model to obtain a first output result and into a second data risk detection model to obtain a second output result, where the first data risk detection model is obtained by supervised training on first sample data provided by data-providing organizations in the federated learning framework, and the second data risk detection model is obtained by supervised training, in an information-reconstruction manner, on second sample data provided by those organizations together with the organization identifier of the data-providing organization to which the second sample data belongs; and, if the first output result and the second output result do not match, determine that the target data is data containing toxin information.
The embodiments of this specification also provide a storage medium for storing computer-executable instructions that, when executed, implement the following process: acquiring target data to be detected provided by a data-providing organization in the federated learning framework; inputting the target data into a first data risk detection model to obtain a first output result, and into a second data risk detection model to obtain a second output result, where the first data risk detection model is obtained by supervised training on first sample data provided by data-providing organizations in the federated learning framework, and the second data risk detection model is obtained by supervised training, in an information-reconstruction manner, on second sample data provided by those organizations together with the organization identifier of the data-providing organization to which the second sample data belongs; and, if the first output result and the second output result do not match, determining that the target data is data containing toxin information.
Drawings
To illustrate the embodiments of this specification or the prior-art solutions more clearly, the drawings needed for describing them are briefly introduced below. Obviously, the drawings described below are only some of the embodiments of this specification, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 illustrates an embodiment of a data processing method of this specification;
FIG. 2 is a block diagram of a data processing system of this specification;
FIG. 3 illustrates another embodiment of a data processing method of this specification;
FIG. 4 is a diagram of another embodiment of a data processing method;
FIG. 5 is a diagram of an embodiment of a data processing apparatus of this specification;
FIG. 6 illustrates an embodiment of a data processing device of this specification.
Detailed Description
The embodiment of the specification provides a data processing method, a data processing device and data processing equipment.
To help those skilled in the art better understand the technical solutions in this specification, the technical solutions in the embodiments are described below clearly and completely with reference to the accompanying drawings. The described embodiments are obviously only a part, not all, of the embodiments of this specification. All other embodiments obtained by a person skilled in the art from these embodiments without inventive effort shall fall within the scope of protection of this specification.
Example one
As shown in fig. 1, the method may be executed by a server or by a terminal device. The terminal device may be a computer such as a notebook or desktop machine, or a mobile device such as a mobile phone or tablet. The server may be a server for some service (for example, a transaction or financial service) or a server that needs to perform federated learning; specifically, it may be a server for a payment service, or for a financial or instant-messaging service. This embodiment takes a server as the executing entity; for the case of a terminal device, refer to the related content below, which is not repeated here. The method may specifically include the following steps:
In step S102, target data to be detected, provided by a data-providing organization in the federated learning framework, is acquired.
Federated learning is a new foundational artificial-intelligence technique. It was originally used to let mobile-terminal users update a model locally, and it is designed to carry out efficient machine learning among multiple parties or computing nodes while guaranteeing information security during big-data exchange, protecting terminal data and personal privacy, and ensuring legal compliance. Federated learning can be described as follows: N data providers (where N is a positive integer greater than or equal to 1) all wish to train a specified machine learning model by combining their own data. One conventional approach is to pool the data from all providers and train the specified model on the combined set. Federated learning, in contrast, is a process in which the data providers jointly train a model without any data-providing organization exposing its data to the others. The federated learning framework may refer to the processing mode of the federated learning mechanism (or its processing procedure), or to the system architecture corresponding to that procedure; it may include the processing mode and procedure of federated learning, related information about the data-providing organizations, and so on, and may be set according to the actual situation.
A data-providing organization may be any organization that provides data support for federated learning. The data may be data needed during federated learning itself (such as sample data for model training or for model evaluation) or data involved in other related processes (such as data for toxin-information detection or data to be classified); it may be set according to the actual situation, which is not limited by the embodiments of this specification. The target data may be any data, for example sample data provided by a data-providing organization for federated learning, or any data the organization provides for toxin-information detection, and may be set according to the actual situation.
In practice, machine learning and deep learning have been used heavily in recent years, and since the performance of machine learning and deep learning models improves as the amount of data grows, every organization (i.e., every data-providing organization) aims to acquire more data to improve its models. Because of privacy concerns the organizations' data cannot be shared directly, so federated learning is used to break the data silos: a trusted third-party platform is employed, each organization transmits its data to that platform, the platform computes the gradient of the corresponding model and returns it to each organization, and each organization updates its model based on the gradient. In this way every organization obtains a gradient computed over more data without directly exchanging data, which improves model performance.
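The gradient exchange described above can be sketched as follows, under the assumed simplifications of a linear model, a mean-squared-error loss, and plain gradient averaging by the third-party platform (the patent does not specify the model or the aggregation rule).

```python
# Sketch of federated gradient exchange: each organization computes a local
# gradient, the third-party platform aggregates, and everyone takes the step.
import numpy as np

def local_gradient(w, X, y):
    # Gradient of the mean squared error for a linear model y_hat = X @ w.
    return 2 * X.T @ (X @ w - y) / len(y)

def platform_aggregate(gradients):
    # The trusted third-party platform averages the per-organization gradients.
    return np.mean(gradients, axis=0)

def federated_step(w, datasets, lr=0.1):
    # One round: every organization contributes a gradient over its own data.
    grads = [local_gradient(w, X, y) for X, y in datasets]
    return w - lr * platform_aggregate(grads)
```

Repeating `federated_step` over the organizations' datasets drives the shared weights toward the fit that pooled data would give, without any organization seeing another's raw data.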
However, malicious attack behavior may exist inside a federated learning framework. For example, one or more data-providing organizations may upload data carrying toxin information to interfere with the gradients computed by the third-party platform; because that platform is trusted by every organization, the toxin information supplied by a malicious organization cannot be sensed at all, so the model performance of the other organizations degrades sharply and the reliability and effectiveness of the whole federated learning process are damaged. Common defense mechanisms against data carrying toxin information (especially sample data) are difficult to apply under federated learning. For example, the usual adversarial-training mechanism (adding adversarial noise directly to normal sample data and then training the model on the noised data) can defend against known types of adversarial samples, but its defensive capability drops markedly for new or unknown types. In a federated learning framework a malicious data-providing organization can generate toxin-carrying data of any type, i.e., the toxin information may change at any time, so the conventional adversarial-training mechanism cannot solve the problem. A technical solution is therefore needed that offers stronger defense against toxin information and can improve the reliability and effectiveness of federated learning. The embodiments of this specification provide such a solution, which may specifically include the following contents:
To overcome the above problems, a federated learning method that resists toxin information is provided. In a federated learning framework, toxin information is usually present in only a few data-providing organizations (a single organization, or a small number of different organizations), i.e., the existence of toxin information is strongly correlated with the organization category of the data-providing organizations. If the target task (such as a classification or segmentation task) can be decoupled from the organization category in the training stage of the model, the target task is thereby also decoupled from the toxin information, achieving the antitoxin purpose; this approach depends only on whether toxin information exists in a data-providing organization, not on the specific type of toxin information. On this basis, as shown in fig. 2, when one or more data-providing organizations need toxin-information detection for certain data (i.e., the target data), each organization can obtain the data to be uploaded and its own organization identifier, generate the target data from that data, and then upload the target data to the third-party platform in the federated learning (such as the aforementioned server) through a preset data-uploading interface, so that the server of the third-party platform obtains the target data to be detected provided by the data-providing organization in the federated learning framework.
Alternatively, when the third-party platform in the federated learning needs to perform toxin-information detection on data (i.e., target data) provided by one or more different data-providing organizations, it can acquire the target data to be detected after those organizations provide it.
The above are two optional processing manners; practical applications may involve other processing scenarios and manners, which may be set according to the actual situation and are not limited by the embodiments of this specification.
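One hypothetical way to check the decoupling of features from organization identity, as introduced above, is to ask whether a discriminator can still recover the organization identifier from the extracted features; if it does no better than chance, the features (and hence the target task) carry no organization-specific signal. The nearest-centroid discriminator and the chance-level margin below are illustrative assumptions, not part of the patent.

```python
# Illustrative decoupling check: predict the organization identifier from the
# features with a nearest-centroid discriminator and compare against chance.
import numpy as np

def org_discriminator_accuracy(features, org_ids):
    # Predict, for each sample, the organization whose mean feature vector
    # (centroid) is closest to that sample's feature vector.
    orgs = sorted(set(org_ids))
    ids = np.array(org_ids)
    centroids = {o: features[ids == o].mean(axis=0) for o in orgs}
    preds = [min(orgs, key=lambda o: np.linalg.norm(f - centroids[o]))
             for f in features]
    return float(np.mean([p == t for p, t in zip(preds, org_ids)]))

def is_decoupled(features, org_ids, chance_margin=0.15):
    # Features count as decoupled from organization identity when the
    # discriminator is at chance level, up to a small assumed margin.
    chance = 1.0 / len(set(org_ids))
    return org_discriminator_accuracy(features, org_ids) <= chance + chance_margin
```

Features that still encode which organization supplied them fail this check, which is exactly the situation in which organization-specific toxin information could leak into the target task.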
In step S104, the target data is input into a first data risk detection model to obtain a first output result, and into a second data risk detection model to obtain a second output result. The first data risk detection model is obtained by supervised training on first sample data provided by data-providing organizations in the federated learning framework; the second data risk detection model is obtained by supervised training, in an information-reconstruction manner, on second sample data provided by those organizations together with the organization identifier of the data-providing organization to which the second sample data belongs.
The first data risk detection model may be obtained by supervised model training, in a common federated learning manner, on the first sample data; for example, it may be built on some feature extraction algorithm and trained on the first sample data, finally yielding a first data risk detection model that extracts data features. The second data risk detection model may be obtained by supervised training, in an information-reconstruction manner, on the second sample data together with the organization identifier of the data-providing organization to which that data belongs; for example, it may be built on a feature extraction algorithm plus an information reconstruction algorithm and trained on the second sample data, finally yielding a second data risk detection model that extracts data features and performs information reconstruction. Here, information reconstruction means restoring the corresponding original information from a small amount of information, such as the extracted features.
In practice, detecting data containing toxin information among a large amount of data requires training, in advance, a second data risk detection model that resists toxin information. To accurately identify such data, a model in the federated learning (which may serve as the first data risk detection model) can also be built and trained, and whether data contains toxin information can be judged from the models' output results. On this basis, according to the actual situation, the related information of the corresponding data-providing organizations can be acquired from the federated learning framework, and the sample data for training the first and second data risk detection models can be acquired from one or more different data-providing organizations. The first sample data and the second sample data may be completely different, may partially overlap, or may be exactly the same; this may be set according to the actual situation and is not limited by the embodiments of this specification.
In addition, considering that in a federated learning framework toxin information usually exists in only one or a few data-providing organizations, i.e., its presence is strongly correlated with information about the data-providing organizations, the organization identifier of the organization to which the second sample data belongs can be acquired together with that data when the second sample data is collected; the organization identifier may be the organization's name, code, or the like. Training labels corresponding to the first and second sample data may also be obtained; for example, if first sample data 1 is an image, its corresponding training label may be user A.
The algorithms used in the second data risk detection model may be preset according to the actual situation. For example, a feature extraction algorithm (such as one based on a support vector machine or on the histogram of oriented gradients, HOG) and an information reconstruction algorithm (such as one based on a convolutional neural network or on a fully convolutional network) may be preset, and the architecture of the second data risk detection model built from them. That architecture can then be trained with the second sample data and the corresponding organization identifiers. Specifically, the second sample data is input into the preset feature extraction algorithm to obtain its data features; the output of the feature extraction algorithm serves as the input of the information reconstruction algorithm, which reconstructs the second sample data from those features; and the model is trained on the resulting outputs and the corresponding training labels. In this way the trained second data risk detection model is obtained.
Likewise, the algorithm used in the first data risk detection model may be preset according to the actual situation; for example, a feature extraction algorithm may be preset and the architecture of the first data risk detection model built from it. That architecture can then be trained with the first sample data: the first sample data is input into the preset feature extraction algorithm to obtain the corresponding data features, and the first data risk detection model is trained on those features and the corresponding training labels.
After the first and second data risk detection models are built and trained in the above manner, the target data can be input into the first data risk detection model to obtain the first output result, and into the second data risk detection model to obtain the second output result.
In step S106, if the first output result and the second output result do not match, the target data is determined to be data containing toxin information.
In implementation, after the first and second output results are obtained in the above manner, they can be compared. If the two results are the same, the target data does not contain toxin information; if they differ, the target data contains toxin information, and at this point the target data can be determined to be data containing toxin information.
The embodiments of this specification thus provide a data processing method: target data to be detected, provided by a data-providing organization in the federated learning framework, is input into a first data risk detection model to obtain a first output result and into a second data risk detection model to obtain a second output result, where the first data risk detection model is obtained by supervised training on first sample data provided by data-providing organizations in the framework and the second data risk detection model is obtained by supervised training, in an information-reconstruction manner, on second sample data together with the organization identifier of the data-providing organization to which it belongs; if the two output results do not match, the target data is determined to be data containing toxin information. Because toxin data usually exists in only a few data-providing organizations in a federated learning framework, i.e., its existence is strongly correlated with the organizations' category information, decoupling the organization identifier of the data-providing organizations during the training stage also decouples the model from the toxin information, thereby achieving the purpose of resisting toxin information.
Example two
As shown in fig. 3, the method may be executed by a server or by a terminal device. The terminal device may be a computer such as a notebook or desktop machine, or a mobile device such as a mobile phone or tablet. The server may be a server for some service (for example, a transaction or financial service) or a server that needs to perform federated learning; specifically, it may be a server for a payment service, or for a financial or instant-messaging service. This embodiment takes a server as the executing entity; for the case of a terminal device, refer to the related content below, which is not repeated here. The method may specifically include the following steps:
In step S302, second sample data, the organization identifier of the data-providing organization to which it belongs, and the training label corresponding to the second sample data are acquired from a data-providing organization in the federated learning framework.
The data-providing organization may be any single organization in the federated learning framework, or several of them, set according to the actual situation. The organization identifier may be the organization's name, code, or the like. The training label may be of several types, determined by the output of the model built in the federated learning. For example, if the model outputs the class of the data, the training label may be the class to which the second sample data belongs, such as text, image, or video; if the model outputs user identity information, the training label may be the user identifier corresponding to the second sample data, such as user A, user B, or user C. The training label may be set according to the actual situation, which is not limited by this specification.
In step S304, a model architecture of a toxin detection model is constructed based on a preset algorithm, and the toxin detection model includes a feature extraction submodel, an information reconstruction submodel, and a mechanism discrimination submodel.
The feature extraction submodel may be constructed based on a preset feature extraction algorithm; the feature extraction algorithm may include multiple types and may be set according to the actual situation. As a neural network deepens, its training accuracy tends to degrade due to vanishing gradients rather than overfitting; the ResNet deep residual network model was proposed to solve this problem. The structure of the ResNet network model greatly accelerates the training of ultra-deep neural networks (neural networks with many network layers), and the accuracy of the trained ResNet network model is considerably improved. The ResNet network model is a network structure model that generalizes well. Its basic idea is to introduce a shortcut connection that can skip one or more layers. The ResNet network model may include a variety of models, such as the ResNet18 network model and the ResNet50 network model, where the 18 in ResNet18 refers to 18 network layers with weights (convolutional layers and fully-connected layers, not counting pooling layers and BN layers); similarly, the 50 in ResNet50 refers to 50 such weighted layers.
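The shortcut connection described above can be illustrated with a minimal sketch: a hypothetical two-layer residual block in plain Python/NumPy (the layer shapes and weights here are illustrative, not part of the patent):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """A minimal residual block: the input x skips over two weighted
    layers via a shortcut connection and is added back to their output,
    so the block only needs to learn a residual F(x) rather than a
    full mapping."""
    out = relu(x @ w1)    # first weighted layer
    out = out @ w2        # second weighted layer (pre-activation)
    return relu(out + x)  # shortcut connection: add the input back

rng = np.random.default_rng(0)
x = rng.standard_normal(4)
w1 = rng.standard_normal((4, 4))
w2 = rng.standard_normal((4, 4))
y = residual_block(x, w1, w2)
```

Note that when both weighted layers output zero, the block degenerates to the identity (after the activation), which is exactly why stacking such blocks does not worsen training accuracy the way plain deep stacks do.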
The information reconstruction submodel may be constructed based on a preset information reconstruction algorithm; the information reconstruction algorithm may include multiple types and may be set according to the actual situation. Taking the data to be processed as image data as an example, the FCN (fully convolutional network) model performs pixel-level classification on images (that is, each pixel point is classified), thereby solving the problem of semantic-level image segmentation. The FCN model can accept input images of any size: a deconvolution layer is used to upsample the feature map of the last convolutional layer, restoring it to the size of the input image, so that a prediction can be generated for each pixel while preserving the spatial information of the original input image; classification is then performed pixel by pixel on the upsampled feature map.
The mechanism discrimination submodel may be constructed based on a preset classification algorithm and the like, where the classification algorithm may include multiple types and may be set according to the actual situation. The ResNet network model may be as described above and may include a plurality of kinds, such as the ResNet18 network model and the ResNet50 network model, which may be set according to the actual situation.
In implementation, it can be determined according to the actual situation that the toxin detection model needs to comprise a feature extraction submodel, an information reconstruction submodel, and a mechanism discrimination submodel. The model architecture of the toxin detection model can be set according to the order of the feature extraction submodel, the information reconstruction submodel, and the mechanism discrimination submodel: the input data of the feature extraction submodel is sample data, the output result of the feature extraction submodel is used as the input data of the information reconstruction submodel, the output result of the information reconstruction submodel is used as the input data of the mechanism discrimination submodel, and the output result of the mechanism discrimination submodel is used as the output result of the toxin detection model. Based on this, the model architecture of the feature extraction submodel (including its basic architecture information, parameters to be determined, and the like) can be constructed based on a preset feature extraction algorithm; specifically, a model architecture of the ResNet18 network model can be constructed and used as the model architecture of the feature extraction submodel. The model architecture of the information reconstruction submodel (including its basic architecture information and parameters to be determined) may be constructed based on a preset information reconstruction algorithm; specifically, a model architecture of the FCN model may be constructed and used as the model architecture of the information reconstruction submodel.
In addition, a model architecture of the mechanism discrimination submodel (including its basic architecture information, parameters to be determined, and the like) can be constructed based on a preset classification algorithm; specifically, a model architecture of the ResNet18 network model can be constructed and used as the model architecture of the mechanism discrimination submodel.
After the model architecture of each submodel is constructed in the above manner, the association relation among the feature extraction submodel, the information reconstruction submodel, and the mechanism discrimination submodel can be established based on the relation between the input data and the output results among the submodels. The model architecture of the toxin detection model can then be established based on the model architectures of the three submodels and the association relation among them.
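The chained architecture described above (sample data → feature extraction → information reconstruction → mechanism discrimination) can be sketched as follows; the submodel classes here are hypothetical stand-ins for the ResNet18- and FCN-based networks, not the patent's actual implementations:

```python
class FeatureExtractionSubmodel:
    """Stand-in for the ResNet18-based feature extractor."""
    def forward(self, sample):
        return {"features": sample}

class InformationReconstructionSubmodel:
    """Stand-in for the FCN-based information reconstructor."""
    def forward(self, features):
        return {"reconstruction": features["features"]}

class MechanismDiscriminationSubmodel:
    """Stand-in for the ResNet18-based institution classifier."""
    def forward(self, reconstruction):
        return {"predicted_institution": 0}

class ToxinDetectionModel:
    """Wires the three submodels in the order fixed by the architecture:
    the output of each submodel is the input of the next, and the output
    of the discrimination submodel is the model's output."""
    def __init__(self):
        self.feature_extractor = FeatureExtractionSubmodel()
        self.reconstructor = InformationReconstructionSubmodel()
        self.discriminator = MechanismDiscriminationSubmodel()

    def forward(self, sample):
        features = self.feature_extractor.forward(sample)
        reconstruction = self.reconstructor.forward(features)
        return self.discriminator.forward(reconstruction)
```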
In step S306, the feature extraction submodel, the information reconstruction submodel, and the mechanism discrimination submodel are trained to obtain a trained toxin detection model by using the second sample data, the mechanism identifier of the data providing mechanism to which the second sample data belongs, the training tag corresponding to the second sample data, the loss function corresponding to the feature extraction submodel, the loss function corresponding to the information reconstruction submodel, and the loss function corresponding to the mechanism discrimination submodel.
In implementation, the input data of the feature extraction submodel is sample data, the output result of the feature extraction submodel is used as the input data of the information reconstruction submodel, the output result of the information reconstruction submodel is used as the input data of the mechanism discrimination submodel, and the output result of the mechanism discrimination submodel is used as the output result of the toxin detection model. Accordingly, a piece of second sample data can be randomly extracted, and its content, the mechanism identification of the data providing mechanism to which it belongs, and its corresponding training label can be obtained. The second sample data can then be input into the model architecture of the feature extraction submodel, and the feature extraction submodel can be trained through its corresponding loss function. The output result of the feature extraction submodel can be input into the information reconstruction submodel, which is trained through its corresponding loss function; the output result of the information reconstruction submodel can in turn be input into the mechanism discrimination submodel, which is trained through its corresponding loss function. Meanwhile, the toxin detection model can be trained by combining the mechanism identification of the data providing mechanism to which the second sample data belongs and the training label corresponding to the second sample data. In the above manner, the feature extraction submodel, the information reconstruction submodel, and the mechanism discrimination submodel can be trained through the second sample data, the mechanism identification, the training label, and the three corresponding loss functions, obtaining the trained toxin detection model.
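The joint training described above amounts to minimizing a combination of the three per-submodel losses. The sketch below assumes a mean-squared reconstruction loss, a classification loss against the training label, and a discrimination loss against the institution label, combined as a weighted sum; the particular loss forms and weights are illustrative assumptions, since the patent does not fix them:

```python
import numpy as np

def mse(a, b):
    """Reconstruction loss between input data and its reconstruction."""
    return float(np.mean((np.asarray(a) - np.asarray(b)) ** 2))

def cross_entropy(probs, label):
    """Negative log-likelihood of the correct class."""
    return float(-np.log(probs[label] + 1e-12))

def toxin_detection_loss(reconstruction, sample,
                         class_probs, training_label,
                         inst_probs, institution_label,
                         w_rec=1.0, w_cls=1.0, w_inst=1.0):
    """Weighted sum of the three submodel losses used to train the
    toxin detection model jointly (weights are illustrative)."""
    loss_rec = mse(reconstruction, sample)                    # information reconstruction submodel
    loss_cls = cross_entropy(class_probs, training_label)     # task supervision via training label
    loss_inst = cross_entropy(inst_probs, institution_label)  # mechanism discrimination submodel
    return w_rec * loss_rec + w_cls * loss_cls + w_inst * loss_inst
```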
In practical applications, the processing of step S306 may be varied; an alternative processing manner is provided below, which may specifically include the processing of step A2 and step A4.
In step a2, the mechanism identifiers of the data providing mechanisms to which the second sample data belongs are encoded based on a preset encoding rule, and a category label corresponding to each mechanism identifier is obtained.
The category label may be an information label that can be identified by a machine, specifically, such as 0, 1, 2, and the like, and is specifically set according to an actual situation, which is not limited in the embodiments of the present specification.
In implementation, considering that the mechanism identifications of different data providing mechanisms may differ, that their construction manners may also differ, and that a mechanism identification may be information that cannot be recognized by a machine, the mechanism identifications of the data providing mechanisms can be re-encoded to obtain a category label corresponding to each mechanism identification. In this way, the construction manners of the mechanism identifications of different data providing mechanisms are unified, and the mechanism identifications are converted into information labels that the machine can recognize. The re-encoding manner may include multiple manners, such as random encoding or encoding based on a preset encoding rule, and may be set according to the actual situation, which is not limited in the embodiments of the present specification.
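A minimal encoding rule of the kind described here simply maps each distinct institution identifier to an integer category label; the identifiers below are hypothetical examples, not from the patent:

```python
def encode_institution_ids(institution_ids):
    """Assign each distinct institution identifier a machine-readable
    integer category label (0, 1, 2, ...) in first-seen order, so that
    heterogeneous identifier formats are unified."""
    labels = {}
    for inst_id in institution_ids:
        if inst_id not in labels:
            labels[inst_id] = len(labels)
    return labels

mapping = encode_institution_ids(["bank-A", "org_B", "bank-A", "C Corp"])
```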
In step a4, the feature extraction submodel, the information reconstruction submodel, and the mechanism discrimination submodel are trained to obtain a trained toxin detection model by using the second sample data, the class label corresponding to the mechanism identifier, the training label corresponding to the second sample data, the loss function corresponding to the feature extraction submodel, the loss function corresponding to the information reconstruction submodel, and the loss function corresponding to the mechanism discrimination submodel.
In step S308, a second data risk detection model is constructed based on the submodels included in the trained toxin detection model, wherein the second data risk detection model at least includes an information reconstruction submodel.
In an implementation, the trained toxin detection model may be directly determined as the second data risk detection model. In practical application, a proper submodel may be selected from submodels included in the trained toxin detection model according to an actual situation to construct a second data risk detection model, for example, the second data risk detection model may be constructed based on a trained feature extraction submodel and a trained information reconstruction submodel, at this time, the second data risk detection model may include a feature extraction submodel, an information reconstruction submodel, and the like, which may be specifically set according to an actual situation, and this is not limited in the embodiments of the present specification.
In step S310, a first data risk detection model is constructed based on the feature extraction submodel.
In step S312, target data to be detected provided by a data providing organization in the federal learning framework is acquired.
The specific processing procedure of step S312 may refer to relevant contents in the first embodiment, and is not described herein again.
In practical applications, the target data may include encrypted ciphertext obtained by encrypting through a homomorphic encryption algorithm or by performing differential privacy processing through a differential privacy algorithm. Differential privacy is intended to protect the acquired data to some extent: even when a user whose data is being acquired does not trust the data acquirer, the user can still upload the corresponding data to the data acquirer. Differential privacy maximizes the accuracy of queries from statistical databases while minimizing the chance of identifying individual records. It protects privacy (i.e., the target data) by perturbing the data; the perturbation mechanism may include multiple mechanisms, such as the Laplace mechanism and the exponential mechanism. Differential privacy may include centralized differential privacy and localized differential privacy (LDP). Based on the above, the specific processing in step S312 may also be implemented by decrypting the ciphertext in the target data with a preset decryption key to obtain the target data to be detected.
The ciphertext contained in the target data may be a ciphertext obtained by encrypting part of data in the target data, or a ciphertext obtained by encrypting all data in the target data, and may be specifically set according to an actual situation. The decryption key may be set according to an encryption key of a ciphertext in the target data, and in practical application, the encryption key and the decryption key may be a public key-private key pair, that is, the encryption key may be a public key, and the corresponding decryption key may be a private key, or the encryption key may be a private key, and the corresponding decryption key may be a public key, and the like, which may be specifically set according to practical situations.
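As a brief illustration of the Laplace perturbation mechanism mentioned above: in the standard construction, noise drawn from a Laplace distribution with scale sensitivity/ε is added to a numeric value before release. This is a generic differential-privacy sketch, not the patent's concrete scheme:

```python
import numpy as np

def laplace_mechanism(value, sensitivity, epsilon, rng=None):
    """Perturb a numeric value with Laplace noise of scale
    sensitivity / epsilon -- the standard calibration that makes the
    released value epsilon-differentially private."""
    if rng is None:
        rng = np.random.default_rng()
    scale = sensitivity / epsilon
    return value + rng.laplace(loc=0.0, scale=scale)
```

A larger ε (weaker privacy) means a smaller noise scale and thus more accurate released values, which is the accuracy/privacy trade-off noted above.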
In step S314, the target data is input into the first data risk detection model to obtain a first output result, and the target data is input into the second data risk detection model to obtain a second output result.
In implementation, taking the second data risk detection model constructed and trained as described above as an example, the target data may be input into the trained feature extraction submodel constructed by the ResNet18 network model to obtain the data features corresponding to the target data, and then the obtained data features may be input into the information reconstruction submodel constructed by the FCN model to obtain the corresponding output result, that is, the second output result. In addition, taking the first data risk detection model constructed and trained as described above as an example, the target data may be input into the trained feature extraction sub-model constructed by the ResNet18 network model, so as to obtain the data feature corresponding to the target data, that is, the first output result.
In step S316, if the first output result and the second output result do not match, the target data is determined to be data containing toxin information.
In implementations, if the first output and the second output do not match, the target data may be determined to be data suspected of containing toxin information, at which point the target data may be recorded.
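One plausible way to realize the "match" test between the two output results is to compare the two result vectors under a distance metric and flag the target data when the distance exceeds a threshold; the metric (normalized Euclidean distance) and threshold below are illustrative assumptions, as the patent does not specify them:

```python
import numpy as np

def outputs_match(first_output, second_output, threshold=0.5):
    """Treat the two detection results as matching when their normalized
    Euclidean distance stays under the threshold; otherwise the target
    data is suspected of containing toxin information."""
    a = np.asarray(first_output, dtype=float)
    b = np.asarray(second_output, dtype=float)
    dist = np.linalg.norm(a - b) / (np.linalg.norm(a) + 1e-12)
    return dist <= threshold
```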
In step S318, when the predetermined detection period is reached, the proportion of the target data including toxin information provided by each data providing institution in the federal learning framework is counted.
The detection duration may be set according to an actual situation, specifically, for example, 24 hours or 1 hour, and may specifically be set according to the actual situation, which is not limited in the embodiments of the present specification.
In practice, one or more pieces of target data containing toxin information may be recorded through the processing of step S316. When the predetermined detection period is reached, the mechanism identification of the data providing mechanism corresponding to each recorded piece of target data can be obtained, and the quantity of target data containing toxin information provided by each data providing mechanism and the total quantity of target data provided by each data providing mechanism in the federal learning framework can be counted. Then, for each data providing mechanism, the ratio of the quantity of target data containing toxin information it provided to the total quantity of target data it provided is calculated, yielding the proportion of target data containing toxin information provided by each data providing mechanism in the federal learning framework. The obtained proportion for each data providing mechanism can be used as the abnormality probability of that data providing mechanism.
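The per-institution statistic described above reduces to a count-and-divide; a minimal sketch (the institution names are hypothetical):

```python
from collections import Counter

def toxin_ratios(all_records, toxin_records):
    """Ratio of flagged (toxin-suspected) target data to total target
    data per data providing institution, used as each institution's
    abnormality probability."""
    totals = Counter(all_records)     # institution id of every target datum
    flagged = Counter(toxin_records)  # institution id of every flagged datum
    return {inst: flagged[inst] / totals[inst] for inst in totals}

ratios = toxin_ratios(
    all_records=["A", "A", "A", "A", "B", "B"],
    toxin_records=["A", "A", "A"],
)
```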
In step S320, the data providing mechanism having the statistical ratio value larger than the preset threshold is determined as the abnormal mechanism.
The preset threshold may be set according to an actual situation, specifically, for example, 80% or 90%.
In step S322, the use of the sample data provided by the abnormal mechanism for supervised training is canceled; or, sample data not exceeding a preset number threshold is acquired from the abnormal mechanism for supervised training.
In implementation, for the abnormal mechanism determined in step S320, certain management measures may be taken; for example, a hierarchical processing mechanism may be set. Specifically, a first probability threshold and a second probability threshold may be set, where the second probability threshold is greater than the first. If the abnormality probability of a certain data providing mechanism is greater than or equal to the first probability threshold and less than or equal to the second probability threshold, the quantity of sample data uploaded by that data providing mechanism may be limited; specifically, the maximum quantity of sample data it is allowed to upload may be half or 60% of the quantity uploaded by other data providing mechanisms, and the sample gradients it can obtain may likewise be limited to half or 60% of those obtainable by other data providing mechanisms. If the abnormality probability of a certain data providing mechanism is greater than the second probability threshold, that data providing mechanism can be refused participation in the federal learning and added to a preset blacklist.
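The hierarchical handling described above can be sketched as a simple policy function with two thresholds; the threshold values and action names are illustrative, not fixed by the patent:

```python
def handling_policy(abnormality, first_threshold=0.5, second_threshold=0.8):
    """Tiered response to an institution's abnormality probability:
    below the first threshold the institution is unrestricted; between
    the thresholds its sample uploads (and obtainable gradients) are
    capped, e.g. to 60% of other institutions'; above the second
    threshold it is excluded from federal learning and blacklisted."""
    if abnormality > second_threshold:
        return "exclude_and_blacklist"
    if abnormality >= first_threshold:
        return "limit_sample_quota"
    return "unrestricted"
```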
The embodiment of the specification provides a data processing method. Target data to be detected, provided by a data providing mechanism in the federal learning framework, is input into a first data risk detection model to obtain a first output result, and into a second data risk detection model to obtain a second output result. The first data risk detection model is obtained by supervised training based on first sample data provided by the data providing mechanism in the federal learning framework, and the second data risk detection model is obtained by supervised training, in an information reconstruction manner, based on second sample data provided by the data providing mechanism in the federal learning framework and the mechanism identification of the data providing mechanism to which the second sample data belongs. If the first output result does not match the second output result, the target data is determined to be data containing toxin information. In the federal learning framework, toxin data often exist in only a few data providing mechanisms; that is, the existence of toxin data is strongly related to the category information of the data providing mechanisms. Therefore, by decoupling the mechanism identification of the data providing mechanisms from the toxin information in the training stage, the toxin information in the federal learning framework can be resisted.
Example three
In this embodiment, a data processing method provided in the embodiment of the present invention will be described in detail with reference to a specific application scenario, where the corresponding application scenario is a federal learning application scenario in any service processing.
As shown in fig. 4, an execution subject of the method may be a server or a terminal device, where the terminal device may be a computer device such as a notebook computer or a desktop computer, and may also be a mobile terminal device such as a mobile phone or a tablet computer. The server may be a server that needs to perform federal learning. The execution main body in this embodiment is described by taking a server as an example, and for the case that the execution main body is a terminal device, the following related contents may be referred to, and are not described herein again. The method may specifically comprise the steps of:
in step S402, second sample data, an institution identifier of a data providing institution to which the second sample data belongs, and a training label corresponding to the second sample data are acquired from a data providing institution in the federal learning framework.
In step S404, a model architecture of a toxin detection model is constructed based on a preset algorithm, and the toxin detection model includes a feature extraction sub-model constructed based on a ResNet18 network model, an information reconstruction sub-model constructed based on a full convolution network FCN model, and an organization discrimination sub-model constructed based on a ResNet18 network model.
In step S406, the mechanism identifiers of the data providing mechanisms to which the second sample data belongs are encoded based on a preset encoding rule, so as to obtain a category label corresponding to each mechanism identifier.
In step S408, the feature extraction submodel, the information reconstruction submodel, and the mechanism discrimination submodel are trained through the second sample data, the class label corresponding to the mechanism identifier, the training label corresponding to the second sample data, the loss function corresponding to the feature extraction submodel, the loss function corresponding to the information reconstruction submodel, and the loss function corresponding to the mechanism discrimination submodel, so as to obtain a trained toxin detection model.
In step S410, a second data risk detection model is constructed based on the feature extraction submodel constructed by the trained ResNet18 network model and the information reconstruction submodel constructed based on the FCN model.
In step S412, a first data risk detection model is constructed based on the feature extraction submodel constructed by the trained ResNet18 network model.
In step S414, target data to be detected provided by a data providing mechanism in the federal learning framework is obtained, where the target data includes encrypted ciphertext obtained by encrypting with a homomorphic encryption algorithm or performing differential privacy with a differential privacy algorithm.
In step S416, the ciphertext in the target data is decrypted by using the preset decryption key, so as to obtain the target data to be detected.
In step S418, the target data is input into the first data risk detection model to obtain a first output result, and the target data is input into the second data risk detection model to obtain a second output result.
In step S420, if the first output result and the second output result do not match, the target data is determined to be data containing toxin information.
In step S422, when the predetermined detection period is reached, the proportion of the target data including toxin information provided by each data providing institution in the federal learning framework is counted.
In step S424, the data providing mechanism having the statistical ratio value larger than the preset threshold is determined as the abnormal mechanism.
In step S426, canceling the supervised training using the sample data provided by the abnormal mechanism; or acquiring sample data which does not exceed a preset number threshold from an abnormal mechanism for supervised training.
For the specific processing procedure of the step S402 to the step S426, reference may be made to the relevant contents in the second embodiment, which is not described herein again.
The embodiment of the specification provides a data processing method. Target data to be detected, provided by a data providing mechanism in the federal learning framework, is input into a first data risk detection model to obtain a first output result, and into a second data risk detection model to obtain a second output result. The first data risk detection model is obtained by supervised training based on first sample data provided by the data providing mechanism in the federal learning framework, and the second data risk detection model is obtained by supervised training, in an information reconstruction manner, based on second sample data provided by the data providing mechanism in the federal learning framework and the mechanism identification of the data providing mechanism to which the second sample data belongs. If the first output result does not match the second output result, the target data is determined to be data containing toxin information. In the federal learning framework, toxin data often exist in only a few data providing mechanisms; that is, the existence of toxin data is strongly related to the category information of the data providing mechanisms. Therefore, by decoupling the mechanism identification of the data providing mechanisms from the toxin information in the training stage, the toxin information in the federal learning framework can be resisted.
Example four
Based on the same idea, the data processing method provided in the embodiment of the present specification further provides a data processing apparatus, as shown in fig. 5.
The data processing apparatus includes: a data acquisition module 501, a toxin information detection module 502, and a toxin information determination module 503, wherein:
the data acquisition module 501 is used for acquiring target data to be detected, which is provided by a data providing mechanism in a federal learning framework;
the toxin information detection module 502 is configured to input the target data into a first data risk detection model to obtain a first output result, and input the target data into a second data risk detection model to obtain a second output result, where the first data risk detection model is obtained by performing supervised training on the basis of first sample data provided by a data providing organization in the federal learning frame, and the second data risk detection model is obtained by performing supervised training on the basis of second sample data provided by the data providing organization in the federal learning frame and an organization identifier of a data providing organization to which the second sample data belongs in an information reconstruction manner;
the toxin information determination module 503 determines that the target data is data containing toxin information if the first output result and the second output result do not match.
In an embodiment of this specification, the apparatus further includes:
the counting module is used for counting the proportion of target data containing toxin information provided by each data providing organization in the federal learning framework when a preset detection time is reached;
and the abnormal mechanism determining module is used for determining the data providing mechanism with the statistical ratio value larger than the preset threshold value as the abnormal mechanism.
In an embodiment of this specification, the apparatus further includes:
the cancellation module cancels the use of the sample data provided by the abnormal mechanism for supervision training; or,
and the sample acquisition module acquires sample data which does not exceed a preset number threshold from the abnormal mechanism for supervised training.
In the embodiment of the present specification, the target data includes encrypted ciphertext obtained by performing encryption processing through a homomorphic encryption algorithm or by performing differential privacy processing through a differential privacy algorithm, and the data obtaining module 501 decrypts the ciphertext in the target data through a preset decryption key to obtain the target data to be detected.
In an embodiment of this specification, the apparatus further includes:
the sample data acquisition module is used for acquiring second sample data from a data providing mechanism in the federal learning framework, a mechanism identification of the data providing mechanism to which the second sample data belongs and a training label corresponding to the second sample data;
the toxin detection module comprises a model architecture construction module, a toxin detection module and a mechanism judgment module, wherein the model architecture construction module is used for constructing a model architecture of a toxin detection model based on a preset algorithm, and the toxin detection model comprises a feature extraction sub-model, an information reconstruction sub-model and a mechanism judgment sub-model;
the training module is used for training the characteristic extraction submodel, the information reconstruction submodel and the mechanism distinguishing submodel to obtain the trained toxin detection model through the second sample data, the mechanism identification of the data providing mechanism to which the second sample data belongs, the training label corresponding to the second sample data, the loss function corresponding to the characteristic extraction submodel, the loss function corresponding to the information reconstruction submodel and the loss function corresponding to the mechanism distinguishing submodel;
and the detection model determining module is used for constructing the second data risk detection model based on the sub-models contained in the trained toxin detection model, wherein the second data risk detection model at least comprises the information reconstruction sub-model.
In the embodiment of the present specification, the feature extraction submodel is constructed based on a ResNet network model.
In the embodiment of the present specification, the information reconstruction submodel is constructed based on a full convolution network FCN model.
In the embodiment of the present specification, the mechanism discrimination sub-model is constructed based on a ResNet network model.
In an embodiment of this specification, the first data risk detection model includes the feature extraction submodel, and the second data risk detection model includes the feature extraction submodel and the information reconstruction submodel.
In an embodiment of this specification, the training module includes:
the encoding unit is used for encoding mechanism identifications of the data providing mechanisms to which the second sample data belong based on a preset encoding rule to obtain category labels corresponding to the mechanism identifications;
and the training unit is used for training the characteristic extraction submodel, the information reconstruction submodel and the mechanism discrimination submodel through the second sample data, the class label corresponding to the mechanism identification, the training label corresponding to the second sample data, the loss function corresponding to the characteristic extraction submodel, the loss function corresponding to the information reconstruction submodel and the loss function corresponding to the mechanism discrimination submodel to obtain the trained toxin detection model.
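The encoding unit's step above can be sketched as follows. This is a minimal illustration assuming a one-hot scheme; the function name, the example identifiers, and the one-hot choice are assumptions, since the specification leaves the "preset encoding rule" unspecified.

```python
def encode_institution_ids(institution_ids):
    """Map each distinct data-providing-mechanism identifier to a
    one-hot category label (an illustrative 'preset encoding rule')."""
    categories = sorted(set(institution_ids))            # stable category order
    index = {inst: i for i, inst in enumerate(categories)}
    labels = []
    for inst in institution_ids:
        one_hot = [0] * len(categories)
        one_hot[index[inst]] = 1
        labels.append(one_hot)
    return labels

# Example: three distinct institutions yield three-dimensional labels.
labels = encode_institution_ids(["org_a", "org_b", "org_a", "org_c"])
```

The resulting category labels can then serve as targets for the mechanism discrimination submodel during training.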
The embodiment of the specification provides a data processing device. Acquired target data to be detected, provided by a data providing mechanism in a federated learning framework, is input into a first data risk detection model to obtain a first output result, and into a second data risk detection model to obtain a second output result. The first data risk detection model is obtained by supervised training based on first sample data provided by the data providing mechanisms in the federated learning framework, and the second data risk detection model is obtained by supervised training in an information reconstruction manner based on second sample data provided by the data providing mechanisms and the mechanism identifications of the data providing mechanisms to which the second sample data belong. If the first output result does not match the second output result, the target data is determined to be data containing toxin information. In a federated learning framework, toxin data often exist in only a few data providing mechanisms; that is, the presence of toxin data is strongly correlated with the category information of the data providing mechanisms. By decoupling the mechanism identifications of the data providing mechanisms in the training stage, the toxin information in the federated learning framework can be countered, thereby achieving the aim of resisting the toxin information.
EXAMPLE FIVE
Based on the same idea as the data processing apparatus described above, an embodiment of the present specification further provides a data processing device, as shown in fig. 6.
The data processing device may be the terminal device or the server provided in the above embodiments.
The data processing device may vary considerably depending on its configuration or performance, and may include one or more processors 601 and a memory 602; one or more applications or data may be stored in the memory 602. The memory 602 may be transient or persistent storage. An application program stored in the memory 602 may include one or more modules (not shown), each of which may include a series of computer-executable instructions for the data processing device. Still further, the processor 601 may be arranged to communicate with the memory 602 and execute, on the data processing device, the series of computer-executable instructions in the memory 602. The data processing device may also include one or more power supplies 603, one or more wired or wireless network interfaces 604, one or more input/output interfaces 605, and one or more keyboards 606.
In particular, in this embodiment, the data processing apparatus includes a memory, and one or more programs, wherein the one or more programs are stored in the memory, and the one or more programs may include one or more modules, and each module may include a series of computer-executable instructions for the data processing apparatus, and the one or more programs configured to be executed by the one or more processors include computer-executable instructions for:
acquiring target data to be detected provided by a data providing mechanism in a federated learning framework;
inputting the target data into a first data risk detection model to obtain a first output result, and inputting the target data into a second data risk detection model to obtain a second output result, wherein the first data risk detection model is obtained by supervised training based on first sample data provided by the data providing mechanisms in the federated learning framework, and the second data risk detection model is obtained by supervised training in an information reconstruction manner based on second sample data provided by the data providing mechanisms in the federated learning framework and the mechanism identifications of the data providing mechanisms to which the second sample data belong;
and if the first output result and the second output result do not match, determining that the target data is data containing toxin information.
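The two-model comparison described above can be sketched minimally. The match criterion (simple equality of the two results) and all names are illustrative assumptions; the specification does not prescribe how the outputs are compared.

```python
def detect_toxin(first_result, second_result):
    """Flag target data as containing toxin information when the two
    risk detection models disagree (the 'mismatch' condition).
    Equality of the two results is only one plausible match criterion."""
    return first_result != second_result

# A clean sample on which both models agree is not flagged;
# a sample on which they diverge is flagged as containing toxin information.
clean_flagged = detect_toxin("low_risk", "low_risk")
poisoned_flagged = detect_toxin("low_risk", "high_risk")
```

In practice the second model's output might be a reconstruction error compared against a threshold before this equality check; that refinement is omitted here.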
In the embodiment of this specification, the method further includes:
when a predetermined detection period is reached, counting, for each data providing mechanism in the federated learning framework, the proportion of the target data it provided that contains toxin information;
and determining the data providing mechanism with the statistical ratio value larger than the preset threshold value as an abnormal mechanism.
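The periodic statistics described above can be sketched as follows; the threshold value, the record shape, and the names are assumptions.

```python
from collections import Counter

def flag_abnormal_institutions(detections, threshold=0.2):
    """detections: (mechanism_id, contains_toxin) pairs observed during
    one detection period. Returns the mechanisms whose proportion of
    toxin-bearing target data exceeds the preset threshold."""
    totals, toxic = Counter(), Counter()
    for inst, has_toxin in detections:
        totals[inst] += 1
        if has_toxin:
            toxic[inst] += 1
    return {inst for inst in totals
            if toxic[inst] / totals[inst] > threshold}

period = [("org_a", True), ("org_a", True), ("org_a", False),
          ("org_b", False), ("org_b", False)]
abnormal = flag_abnormal_institutions(period)  # org_a: 2/3 > 0.2
```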
In the embodiment of this specification, the method further includes:
canceling the use of sample data provided by the abnormal mechanism for supervised training; or,
and acquiring sample data which does not exceed a preset number threshold from the abnormal mechanism for supervised training.
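The two remediation options above can be sketched as one helper; the cap value, names, and list-based sample representation are assumptions.

```python
def select_training_samples(samples, mechanism, abnormal_mechanisms,
                            cancel=False, cap=100):
    """Apply one of the two remediation options to an abnormal mechanism:
    cancel its samples entirely, or keep at most `cap` of them.
    Normal mechanisms contribute all of their samples."""
    if mechanism not in abnormal_mechanisms:
        return samples
    if cancel:
        return []
    return samples[:cap]

kept = select_training_samples(list(range(500)), "org_a", {"org_a"}, cap=100)
```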
In the embodiment of the present specification, the target data includes an encrypted ciphertext, where the ciphertext is obtained either by encryption through a homomorphic encryption algorithm or by differential privacy processing through a differential privacy algorithm,
the acquiring of the target data to be detected provided by the data providing mechanism in the federated learning framework comprises the following steps:
and decrypting the ciphertext in the target data through a preset decryption key to obtain the target data to be detected.
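A minimal sketch of the decryption step, using a toy XOR stream cipher purely as a stand-in: a real deployment would decrypt with the private key of the homomorphic encryption scheme, which this toy does not model, and the key value below is invented for illustration.

```python
def decrypt_with_preset_key(ciphertext: bytes, key: bytes) -> bytes:
    """Toy XOR-stream stand-in for decrypting with a preset decryption
    key. XOR is its own inverse, so the same function encrypts and
    decrypts; a homomorphic scheme's decryption would replace this."""
    return bytes(c ^ key[i % len(key)] for i, c in enumerate(ciphertext))

key = b"preset-key"                                   # hypothetical key
ciphertext = decrypt_with_preset_key(b"target data", key)
plaintext = decrypt_with_preset_key(ciphertext, key)  # recovers the data
```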
In the embodiment of this specification, the method further includes:
acquiring, from a data providing mechanism in the federated learning framework, second sample data, a mechanism identification of the data providing mechanism to which the second sample data belongs, and a training label corresponding to the second sample data;
constructing a model architecture of a toxin detection model based on a preset algorithm, wherein the toxin detection model comprises a feature extraction submodel, an information reconstruction submodel and a mechanism discrimination submodel;
training the feature extraction submodel, the information reconstruction submodel and the mechanism discrimination submodel to obtain the trained toxin detection model through the second sample data, the mechanism identification of the data providing mechanism to which the second sample data belongs, the training label corresponding to the second sample data, the loss function corresponding to the feature extraction submodel, the loss function corresponding to the information reconstruction submodel and the loss function corresponding to the mechanism discrimination submodel;
and constructing the second data risk detection model based on the sub-models contained in the trained toxin detection model, wherein the second data risk detection model at least comprises the information reconstruction sub-model.
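The construction of the second data risk detection model from the trained submodels can be sketched as a simple pipeline. The stage functions below are stand-ins for the trained submodels (e.g., a ResNet feature extractor and an FCN reconstructor); their arithmetic is invented solely so the example runs.

```python
class Sequential:
    """Minimal stand-in for chaining trained submodels into one model."""
    def __init__(self, *stages):
        self.stages = stages

    def __call__(self, x):
        for stage in self.stages:
            x = stage(x)
        return x

def feature_extraction(x):            # stand-in for the ResNet submodel
    return [v * 2.0 for v in x]

def information_reconstruction(feats):  # stand-in for the FCN submodel
    return [v / 2.0 for v in feats]

# The second data risk detection model contains at least the information
# reconstruction submodel, here preceded by the feature extractor.
second_model = Sequential(feature_extraction, information_reconstruction)
reconstructed = second_model([1.0, 2.0, 3.0])
```

A faithful reconstruction of clean input (as here) would correspond to a low reconstruction error at detection time.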
In the embodiment of the present specification, the feature extraction submodel is constructed based on a ResNet network model.
In the embodiment of the present specification, the information reconstruction submodel is constructed based on a full convolution network FCN model.
In the embodiment of the present specification, the mechanism discrimination sub-model is constructed based on a ResNet network model.
In an embodiment of this specification, the first data risk detection model includes the feature extraction submodel, and the second data risk detection model includes the feature extraction submodel and the information reconstruction submodel.
In an embodiment of this specification, the training the feature extraction submodel, the information reconstruction submodel, and the mechanism discrimination submodel through the second sample data, the mechanism identifier of the data providing mechanism to which the second sample data belongs, the training tag corresponding to the second sample data, the loss function corresponding to the feature extraction submodel, the loss function corresponding to the information reconstruction submodel, and the loss function corresponding to the mechanism discrimination submodel to obtain the trained toxin detection model includes:
coding mechanism identifications of the data providing mechanism to which the second sample data belongs based on a preset coding rule to obtain a category label corresponding to each mechanism identification;
and training the feature extraction submodel, the information reconstruction submodel and the mechanism discrimination submodel through the second sample data, the class label corresponding to the mechanism identification, the training label corresponding to the second sample data, the loss function corresponding to the feature extraction submodel, the loss function corresponding to the information reconstruction submodel and the loss function corresponding to the mechanism discrimination submodel to obtain the trained toxin detection model.
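One plausible reading of training with the three loss functions is optimizing their weighted sum; the weights (and the sum itself) are assumptions, not stated in the specification. If the mechanism identity is decoupled adversarially, the discrimination term might instead enter with a negative weight (gradient-reversal style).

```python
def toxin_detection_loss(feat_loss, recon_loss, disc_loss,
                         weights=(1.0, 1.0, 1.0)):
    """Illustrative joint objective: a weighted sum of the losses of the
    feature extraction, information reconstruction, and mechanism
    discrimination submodels. Equal weights are an assumption; the text
    only states that all three loss functions drive training."""
    w_feat, w_recon, w_disc = weights
    return w_feat * feat_loss + w_recon * recon_loss + w_disc * disc_loss

total = toxin_detection_loss(0.5, 0.3, 0.2)
```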
The embodiment of the specification provides a data processing device. Acquired target data to be detected, provided by a data providing mechanism in a federated learning framework, is input into a first data risk detection model to obtain a first output result, and into a second data risk detection model to obtain a second output result. The first data risk detection model is obtained by supervised training based on first sample data provided by the data providing mechanisms in the federated learning framework, and the second data risk detection model is obtained by supervised training in an information reconstruction manner based on second sample data provided by the data providing mechanisms and the mechanism identifications of the data providing mechanisms to which the second sample data belong. If the first output result does not match the second output result, the target data is determined to be data containing toxin information. In a federated learning framework, toxin data often exist in only a few data providing mechanisms; that is, the presence of toxin data is strongly correlated with the category information of the data providing mechanisms. By decoupling the mechanism identifications of the data providing mechanisms in the training stage, the toxin information in the federated learning framework can be countered, thereby achieving the aim of resisting the toxin information.
EXAMPLE SIX
Further, based on the methods shown in fig. 1 to fig. 4, one or more embodiments of the present specification further provide a storage medium for storing computer-executable instruction information. In a specific embodiment, the storage medium may be a USB disk, an optical disc, a hard disk, or the like. The computer-executable instruction information stored on the storage medium, when executed by a processor, can implement the following processes:
acquiring target data to be detected provided by a data providing mechanism in a federated learning framework;
inputting the target data into a first data risk detection model to obtain a first output result, and inputting the target data into a second data risk detection model to obtain a second output result, wherein the first data risk detection model is obtained by supervised training based on first sample data provided by the data providing mechanisms in the federated learning framework, and the second data risk detection model is obtained by supervised training in an information reconstruction manner based on second sample data provided by the data providing mechanisms in the federated learning framework and the mechanism identifications of the data providing mechanisms to which the second sample data belong;
and if the first output result and the second output result do not match, determining that the target data is data containing toxin information.
In the embodiment of this specification, the method further includes:
when a predetermined detection period is reached, counting, for each data providing mechanism in the federated learning framework, the proportion of the target data it provided that contains toxin information;
and determining the data providing mechanism with the statistical ratio value larger than the preset threshold value as an abnormal mechanism.
In the embodiment of this specification, the method further includes:
canceling the use of sample data provided by the abnormal mechanism for supervised training; or,
and acquiring sample data which does not exceed a preset number threshold from the abnormal mechanism for supervised training.
In the embodiment of the present specification, the target data includes an encrypted ciphertext, where the ciphertext is obtained either by encryption through a homomorphic encryption algorithm or by differential privacy processing through a differential privacy algorithm,
the acquiring of the target data to be detected provided by the data providing mechanism in the federated learning framework comprises the following steps:
and decrypting the ciphertext in the target data through a preset decryption key to obtain the target data to be detected.
In the embodiment of this specification, the method further includes:
acquiring, from a data providing mechanism in the federated learning framework, second sample data, a mechanism identification of the data providing mechanism to which the second sample data belongs, and a training label corresponding to the second sample data;
constructing a model architecture of a toxin detection model based on a preset algorithm, wherein the toxin detection model comprises a feature extraction submodel, an information reconstruction submodel and a mechanism discrimination submodel;
training the feature extraction submodel, the information reconstruction submodel and the mechanism discrimination submodel to obtain the trained toxin detection model through the second sample data, the mechanism identification of the data providing mechanism to which the second sample data belongs, the training label corresponding to the second sample data, the loss function corresponding to the feature extraction submodel, the loss function corresponding to the information reconstruction submodel and the loss function corresponding to the mechanism discrimination submodel;
and constructing the second data risk detection model based on the sub-models contained in the trained toxin detection model, wherein the second data risk detection model at least comprises the information reconstruction sub-model.
In the embodiment of the present specification, the feature extraction submodel is constructed based on a ResNet network model.
In the embodiment of the present specification, the information reconstruction submodel is constructed based on a full convolution network FCN model.
In the embodiment of the present specification, the mechanism discrimination sub-model is constructed based on a ResNet network model.
In an embodiment of this specification, the first data risk detection model includes the feature extraction submodel, and the second data risk detection model includes the feature extraction submodel and the information reconstruction submodel.
In an embodiment of this specification, the training the feature extraction submodel, the information reconstruction submodel, and the mechanism discrimination submodel through the second sample data, the mechanism identifier of the data providing mechanism to which the second sample data belongs, the training tag corresponding to the second sample data, the loss function corresponding to the feature extraction submodel, the loss function corresponding to the information reconstruction submodel, and the loss function corresponding to the mechanism discrimination submodel to obtain the trained toxin detection model includes:
coding mechanism identifications of the data providing mechanism to which the second sample data belongs based on a preset coding rule to obtain a category label corresponding to each mechanism identification;
and training the feature extraction submodel, the information reconstruction submodel and the mechanism discrimination submodel through the second sample data, the class label corresponding to the mechanism identification, the training label corresponding to the second sample data, the loss function corresponding to the feature extraction submodel, the loss function corresponding to the information reconstruction submodel and the loss function corresponding to the mechanism discrimination submodel to obtain the trained toxin detection model.
The embodiment of the specification provides a storage medium. Target data to be detected, provided by a data providing mechanism in a federated learning framework, is input into a first data risk detection model to obtain a first output result, and into a second data risk detection model to obtain a second output result. The first data risk detection model is obtained by supervised training based on first sample data provided by the data providing mechanisms in the federated learning framework, and the second data risk detection model is obtained by supervised training in an information reconstruction manner based on second sample data provided by the data providing mechanisms and the mechanism identifications of the data providing mechanisms to which the second sample data belong. If the first output result does not match the second output result, the target data is determined to be data containing toxin information. In a federated learning framework, toxin data often exist in only a few data providing mechanisms; that is, the presence of toxin data is strongly correlated with the category information of the data providing mechanisms. By decoupling the mechanism identifications of the data providing mechanisms in the training stage, the toxin information in the federated learning framework can be countered, thereby achieving the aim of resisting the toxin information.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
In the 1990s, an improvement in a technology could clearly be distinguished as an improvement in hardware (for example, an improvement in a circuit structure such as a diode, a transistor, or a switch) or an improvement in software (an improvement in a method flow). However, as technology has advanced, many of today's improvements to method flows can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain a corresponding hardware circuit structure by programming an improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement in a method flow cannot be realized by a hardware entity module. For example, a Programmable Logic Device (PLD), such as a Field Programmable Gate Array (FPGA), is an integrated circuit whose logic functions are determined by the user's programming of the device. A designer "integrates" a digital system onto a single PLD by programming it, without requiring a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, instead of manually making integrated circuit chips, this programming is now mostly implemented with "logic compiler" software, which is similar to the software compilers used in program development, while the source code to be compiled must be written in a specific programming language called a Hardware Description Language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used.
It will also be apparent to those skilled in the art that hardware circuitry that implements the logical method flows can be readily obtained by merely slightly programming the method flows into an integrated circuit using the hardware description languages described above.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of such controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320; a memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art will also appreciate that, in addition to implementing the controller as pure computer-readable program code, the method steps can be logically programmed so that the controller achieves the same functionality in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may thus be regarded as a hardware component, and the means included therein for performing the various functions may also be regarded as structures within the hardware component. Or even the means for performing the functions may be regarded both as software modules for performing the method and as structures within the hardware component.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being divided into various units by function, and are described separately. Of course, the functionality of the various elements may be implemented in the same one or more software and/or hardware implementations in implementing one or more embodiments of the present description.
As will be appreciated by one skilled in the art, embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, one or more embodiments of the present description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, one or more embodiments of the present description may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present description are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the description. It will be understood that each flow and/or block of the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions executed via the processor of the computer or other programmable data processing apparatus create means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process, such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and information storage may be implemented by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may also include other elements not expressly listed, or elements inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, one or more embodiments of the present description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, one or more embodiments of the present description may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
One or more embodiments of the present description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. One or more embodiments of the specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The above description is only an example of the present specification, and is not intended to limit the present specification. Various modifications and alterations to this description will become apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present specification should be included in the scope of the claims of the present specification.

Claims (17)

1. A method of data processing, the method comprising:
acquiring target data to be detected provided by a data providing mechanism in a federated learning framework;
inputting the target data into a first data risk detection model to obtain a first output result, and inputting the target data into a second data risk detection model to obtain a second output result, wherein the first data risk detection model is obtained by supervised training based on first sample data provided by the data providing mechanisms in the federated learning framework, and the second data risk detection model is obtained by supervised training in an information reconstruction manner based on second sample data provided by the data providing mechanisms in the federated learning framework and the mechanism identifications of the data providing mechanisms to which the second sample data belong;
and if the first output result and the second output result do not match, determining that the target data is data containing toxin information.
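The mismatch test of claim 1 can be sketched as follows. This is a minimal illustration only: the callable model objects, the fixed 0.5 threshold, and the binary labelling of the outputs are assumptions for demonstration, not the patent's concrete implementation.

```python
def detect_toxin(target_data, first_model, second_model, threshold=0.5):
    """Flag target data as containing toxin information when the first
    and second data risk detection models disagree. Binarizing scores
    at a fixed threshold is an illustrative assumption."""
    first_result = first_model(target_data)    # first data risk detection model
    second_result = second_model(target_data)  # second (reconstruction-based) model
    first_label = first_result >= threshold
    second_label = second_result >= threshold
    # A mismatch between the two output results marks suspected toxin data.
    return first_label != second_label
```

The key design point claimed is redundancy: poisoned samples that fool the supervised model are expected to behave anomalously under the reconstruction-based model, so disagreement itself is the detection signal.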
2. The method of claim 1, further comprising:
counting, when a predetermined detection period is reached, the proportion of the target data provided by each data providing institution in the federated learning framework that contains toxin information;
determining a data providing institution whose counted proportion is greater than a preset threshold as an abnormal institution.
3. The method of claim 2, further comprising:
ceasing to use sample data provided by the abnormal institution for supervised training; or
acquiring, from the abnormal institution, sample data not exceeding a preset number threshold for supervised training.
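The per-institution accounting of claims 2 and 3 can be sketched as below. The `(institution_id, is_toxin)` record shape and the 20% ratio threshold are illustrative assumptions; the patent only requires counting toxin proportions per institution over a detection period and comparing against a preset threshold.

```python
from collections import defaultdict

def find_abnormal_institutions(detection_records, ratio_threshold=0.2):
    """detection_records: (institution_id, is_toxin) pairs collected over
    one detection period. Returns the institutions whose proportion of
    toxin-flagged target data exceeds ratio_threshold."""
    totals = defaultdict(int)
    toxic = defaultdict(int)
    for institution, is_toxin in detection_records:
        totals[institution] += 1
        toxic[institution] += int(is_toxin)
    return {institution for institution in totals
            if toxic[institution] / totals[institution] > ratio_threshold}
```

Per claim 3, sample data from the institutions returned here would then be excluded from supervised training, or capped at a preset number.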
4. The method according to claim 1, wherein the target data comprises an encrypted ciphertext obtained through a homomorphic encryption algorithm or through differential privacy processing by a differential privacy algorithm, and
the acquiring of the target data to be detected provided by the data providing institution in the federated learning framework comprises:
decrypting the ciphertext in the target data with a preset decryption key to obtain the target data to be detected.
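Claim 4 allows the provided data to be protected either by homomorphic encryption or by differential privacy processing. As a hedged illustration of the latter option only, the standard Laplace mechanism adds noise with scale sensitivity/epsilon; the sensitivity and epsilon values below are arbitrary assumptions, not parameters from the patent.

```python
import numpy as np

def laplace_mechanism(values, sensitivity=1.0, epsilon=0.5, seed=None):
    """Apply Laplace-mechanism differential privacy to a numeric vector.
    Noise scale = sensitivity / epsilon, per the standard definition."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon, size=values.shape)
    return values + noise
```

Note that differentially private data is perturbed rather than decryptable; the decryption step of claim 4 applies to the homomorphic-encryption branch, whose concrete API is not sketched here.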
5. The method of claim 1, further comprising:
acquiring, from a data providing institution in the federated learning framework, second sample data, an institution identifier of the data providing institution to which the second sample data belongs, and a training label corresponding to the second sample data;
constructing a model architecture of a toxin detection model based on a preset algorithm, wherein the toxin detection model comprises a feature extraction submodel, an information reconstruction submodel and an institution discrimination submodel;
training the feature extraction submodel, the information reconstruction submodel and the institution discrimination submodel, using the second sample data, the institution identifier of the data providing institution to which the second sample data belongs, the training label corresponding to the second sample data, and the loss functions corresponding to the feature extraction submodel, the information reconstruction submodel and the institution discrimination submodel respectively, to obtain the trained toxin detection model;
constructing the second data risk detection model based on the submodels contained in the trained toxin detection model, wherein the second data risk detection model comprises at least the information reconstruction submodel.
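The three-submodel architecture of claim 5 can be sketched with linear stand-ins. The patent names ResNet and FCN backbones; the dimensions, linear maps, and equal loss weighting below are purely illustrative assumptions chosen to keep the joint objective visible in a few lines.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_FEAT, N_INST = 8, 4, 3  # illustrative dimensions (assumptions)

W_feat = rng.normal(size=(D_IN, D_FEAT))    # feature extraction submodel (stand-in for ResNet)
W_rec = rng.normal(size=(D_FEAT, D_IN))     # information reconstruction submodel (stand-in for FCN)
W_inst = rng.normal(size=(D_FEAT, N_INST))  # institution discrimination submodel (stand-in)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def toxin_detection_loss(x, institution_label):
    """Joint objective: reconstruction loss plus institution
    cross-entropy, with equal weights assumed."""
    features = x @ W_feat                        # feature extraction
    reconstruction = features @ W_rec            # information reconstruction
    recon_loss = np.mean((reconstruction - x) ** 2)
    probs = softmax(features @ W_inst)           # institution discrimination
    inst_loss = -np.log(probs[institution_label])
    return recon_loss + inst_loss

sample = rng.normal(size=D_IN)
loss = toxin_detection_loss(sample, institution_label=1)
```

The design intuition is that features jointly trained to reconstruct the input and to identify the providing institution capture institution-specific regularities, so poisoned data from an institution reconstructs or classifies anomalously.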
6. The method of claim 5, wherein the feature extraction submodel is constructed based on a ResNet network model.
7. The method according to claim 5 or 6, wherein the information reconstruction submodel is constructed based on a fully convolutional network (FCN) model.
8. The method of claim 7, wherein the institution discrimination submodel is constructed based on a ResNet network model.
9. The method of claim 8, wherein the first data risk detection model comprises the feature extraction submodel, and the second data risk detection model comprises the feature extraction submodel and the information reconstruction submodel.
10. The method according to claim 5, wherein training the feature extraction submodel, the information reconstruction submodel and the institution discrimination submodel to obtain the trained toxin detection model comprises:
encoding the institution identifiers of the data providing institutions to which the second sample data belongs based on a preset encoding rule to obtain a category label corresponding to each institution identifier;
training the feature extraction submodel, the information reconstruction submodel and the institution discrimination submodel, using the second sample data, the category labels corresponding to the institution identifiers, the training labels corresponding to the second sample data, and the loss functions corresponding to the feature extraction submodel, the information reconstruction submodel and the institution discrimination submodel respectively, to obtain the trained toxin detection model.
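The encoding step of claim 10 maps institution identifiers to category labels. One simple encoding rule, sorted integer indexing, is assumed here for illustration; the patent only requires some preset rule.

```python
def encode_institution_ids(institution_ids):
    """Assign each distinct institution identifier an integer category
    label under a deterministic (sorted-order) encoding rule."""
    categories = {identifier: label
                  for label, identifier in enumerate(sorted(set(institution_ids)))}
    return [categories[i] for i in institution_ids], categories
```

The resulting integer labels serve as classification targets for the institution discrimination submodel during training.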
11. A data processing apparatus, the apparatus comprising:
the data acquisition module is configured to acquire target data to be detected provided by a data providing institution in a federated learning framework;
the toxin information detection module is configured to input the target data into a first data risk detection model to obtain a first output result, and input the target data into a second data risk detection model to obtain a second output result, wherein the first data risk detection model is obtained through supervised training based on first sample data provided by a data providing institution in the federated learning framework, and the second data risk detection model is obtained through supervised training in an information reconstruction manner based on second sample data provided by a data providing institution in the federated learning framework and an institution identifier of the data providing institution to which the second sample data belongs;
the toxin information determining module is configured to determine, if the first output result does not match the second output result, that the target data is data containing toxin information.
12. The apparatus of claim 11, the apparatus further comprising:
the sample data acquisition module is configured to acquire, from a data providing institution in the federated learning framework, second sample data, an institution identifier of the data providing institution to which the second sample data belongs, and a training label corresponding to the second sample data;
the model architecture construction module is configured to construct a model architecture of a toxin detection model based on a preset algorithm, wherein the toxin detection model comprises a feature extraction submodel, an information reconstruction submodel and an institution discrimination submodel;
the training module is configured to train the feature extraction submodel, the information reconstruction submodel and the institution discrimination submodel, using the second sample data, the institution identifier of the data providing institution to which the second sample data belongs, the training label corresponding to the second sample data, and the loss functions corresponding to the feature extraction submodel, the information reconstruction submodel and the institution discrimination submodel respectively, to obtain the trained toxin detection model;
the detection model determining module is configured to construct the second data risk detection model based on the submodels contained in the trained toxin detection model, wherein the second data risk detection model comprises at least the information reconstruction submodel.
13. The apparatus of claim 12, wherein the feature extraction submodel is constructed based on a ResNet network model.
14. The apparatus according to claim 12 or 13, wherein the information reconstruction submodel is constructed based on a fully convolutional network (FCN) model.
15. The apparatus of claim 14, wherein the first data risk detection model comprises the feature extraction submodel, and the second data risk detection model comprises the feature extraction submodel and the information reconstruction submodel.
16. A data processing apparatus, the data processing apparatus comprising:
a processor; and
a memory arranged to store computer executable instructions that, when executed, cause the processor to:
acquire target data to be detected provided by a data providing institution in a federated learning framework;
input the target data into a first data risk detection model to obtain a first output result, and input the target data into a second data risk detection model to obtain a second output result, wherein the first data risk detection model is obtained through supervised training based on first sample data provided by a data providing institution in the federated learning framework, and the second data risk detection model is obtained through supervised training in an information reconstruction manner based on second sample data provided by a data providing institution in the federated learning framework and an institution identifier of the data providing institution to which the second sample data belongs;
determine, if the first output result does not match the second output result, that the target data is data containing toxin information.
17. A storage medium storing computer-executable instructions which, when executed, implement the following operations:
acquiring target data to be detected provided by a data providing institution in a federated learning framework;
inputting the target data into a first data risk detection model to obtain a first output result, and inputting the target data into a second data risk detection model to obtain a second output result, wherein the first data risk detection model is obtained through supervised training based on first sample data provided by a data providing institution in the federated learning framework, and the second data risk detection model is obtained through supervised training in an information reconstruction manner based on second sample data provided by a data providing institution in the federated learning framework and an institution identifier of the data providing institution to which the second sample data belongs;
determining, if the first output result does not match the second output result, that the target data is data containing toxin information.
CN202110102757.1A 2021-01-26 2021-01-26 Data processing method, device and equipment Pending CN112819156A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110102757.1A CN112819156A (en) 2021-01-26 2021-01-26 Data processing method, device and equipment

Publications (1)

Publication Number Publication Date
CN112819156A true CN112819156A (en) 2021-05-18

Family

ID=75859602


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190042937A1 (en) * 2018-02-08 2019-02-07 Intel Corporation Methods and apparatus for federated training of a neural network using trusted edge devices
US20200019852A1 (en) * 2018-07-11 2020-01-16 MakinaRocks Co., Ltd. Anomaly detection
CN110895705A (en) * 2018-09-13 2020-03-20 富士通株式会社 Abnormal sample detection device, training device and training method thereof
CN111126622A (en) * 2019-12-19 2020-05-08 中国银联股份有限公司 Data anomaly detection method and device
CN112132198A (en) * 2020-09-16 2020-12-25 建信金融科技有限责任公司 Data processing method, device and system and server

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113312667A (en) * 2021-06-07 2021-08-27 支付宝(杭州)信息技术有限公司 Risk prevention and control method, device and equipment
CN115563657A (en) * 2022-09-27 2023-01-03 冯淑芳 Data information security processing method and system and cloud platform
CN115563657B (en) * 2022-09-27 2023-12-01 国信金宏(成都)检验检测技术研究院有限责任公司 Data information security processing method, system and cloud platform


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20210518)