CN114065867B - Data classification method and system and electronic equipment - Google Patents

Data classification method and system and electronic equipment

Info

Publication number
CN114065867B
CN114065867B
Authority
CN
China
Prior art keywords
data
classification
classification model
model
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111395446.5A
Other languages
Chinese (zh)
Other versions
CN114065867A (en)
Inventor
张信明
黎兰兰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202111395446.5A priority Critical patent/CN114065867B/en
Publication of CN114065867A publication Critical patent/CN114065867A/en
Application granted granted Critical
Publication of CN114065867B publication Critical patent/CN114065867B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a data classification method, a data classification system and an electronic device. Data to be classified are obtained, feature information of the data to be classified is extracted, the feature information is input into a first classification model to obtain a classification result, and the classification result is determined as the type of the data to be classified. The first classification model is obtained by training on non-sensitive data labeled by a plurality of second classification models, and each second classification model is obtained by training on the sensitive data stored in the device where it is located. In this scheme, classification is performed by the first classification model, which is trained on non-sensitive data; the non-sensitive data are labeled by the second classification models, which are in turn trained on the sensitive data stored in the respective devices. Since the first classification model is never trained directly on sensitive data but only on non-sensitive data, the problem of privacy disclosure is avoided.

Description

Data classification method and system and electronic equipment
Technical Field
The present application relates to the field of data processing, and in particular, to a data classification method, system and electronic device.
Background
With the rapid development of machine learning, the large-scale data collection it requires raises significant privacy concerns, since the collected data may include users' pictures, recordings and other personal content. A data owner may use such data for deep learning, for example to train a classifier.
However, when a classifier is trained on private data, machine learning models implicitly memorize their training data; once the model parameters of the classifier are obtained, they can be analyzed and inference attacks can be mounted to recover the private data, which easily leads to leakage of the private data.
Disclosure of Invention
In view of this, the present application provides a data classification method, system and electronic device, and the specific scheme is as follows:
a method of data classification, comprising:
obtaining data to be classified;
extracting characteristic information of the data to be classified;
inputting the characteristic information of the data to be classified into a first classification model to obtain a classification result, and determining the classification result as the type of the data to be classified;
wherein the first classification model is obtained by training on non-sensitive data labeled by a plurality of second classification models, and each second classification model is obtained by training on sensitive data stored in the device where that second classification model is located.
Further, the method also comprises the following steps:
performing model training through non-sensitive data to obtain a first classification model, wherein the non-sensitive data at least comprises: a first portion of data with known classification labels and a second portion of data labeled by the plurality of second classification models.
Further, the method also comprises the following steps:
marking the second part of data through a plurality of second classification models respectively;
and determining the classification mark of each data in the second part of data based on the first classification result of each second classification model on each data in the second part of data and the mark weight of the second classification model.
Further, the marking the second part of data by the plurality of second classification models respectively includes:
inputting the second part of data into different second classification models respectively;
obtaining a first classification result for at least part of the second part of data output by each of the second classification models;
and determining the marked data proportion of each second classification model based on the first classification result output by each second classification model and at least part of the second part of data for which the first classification result aims.
Further, the method also comprises the following steps: determining a labeling weight for each of the second classification models, wherein:
the determining the labeling weight of each second classification model comprises:
inputting the first part of data into each second classification model to obtain a second classification result of each second classification model on each data in the first part of data;
determining classification accuracy of each second classification model based on the classification label and the second classification result of each data in the first part of data;
and determining the marking weight of each second classification model based on the classification accuracy of each second classification model and the marking data proportion of each second classification model.
Further, the performing model training through the non-sensitive data to obtain a first classification model includes:
determining a first classification result of the plurality of second classification models for each data in the second portion of data;
determining a weight for each data in the second portion of data based on a consistency of a first classification result for each data in the second portion of data by a different second classification model;
and performing model training through the first part of data and the second part of data containing the weight to obtain a first classification model.
Further, each of the second classification models is obtained by training sensitive data stored in a device where the second classification model is located, and the training includes:
determining sensitive data stored by the device;
training a second classification model local to a current device in a semi-supervised manner based on sensitive data stored by the current device.
A data classification system comprising:
an obtaining unit for obtaining data to be classified;
the extraction unit is used for extracting the characteristic information of the data to be classified;
a classification unit for inputting the characteristic information of the data to be classified into a first classification model to obtain a classification result, determining the classification result as the type of the data to be classified,
wherein the first classification model is obtained by training on non-sensitive data labeled by a plurality of second classification models, and each second classification model is obtained by training on sensitive data stored in the device where that second classification model is located.
An electronic device, comprising:
a processor for obtaining data to be classified; extracting feature information of the data to be classified; inputting the feature information of the data to be classified into a first classification model to obtain a classification result, and determining the classification result as the type of the data to be classified; wherein the first classification model is obtained by training on non-sensitive data labeled by a plurality of second classification models, and each second classification model is obtained by training on sensitive data stored in the device where that second classification model is located;
and the memory is used for storing the program of the processor for executing the processing procedure.
A readable storage medium storing at least one set of instructions;
the set of instructions is configured to be called and to perform at least the data classification method according to any one of the above.
According to the technical scheme, the data classification method, the data classification system and the electronic device obtain data to be classified, extract feature information of the data to be classified, input the feature information into a first classification model to obtain a classification result, and determine the classification result as the type of the data to be classified. The first classification model is obtained by training on non-sensitive data labeled by a plurality of second classification models, and each second classification model is obtained by training on the sensitive data stored in the device where it is located. In this scheme, classification is performed by the first classification model, which is trained on non-sensitive data; the non-sensitive data are labeled by the second classification models, which are trained on the sensitive data stored in the respective devices. Since the first classification model is never trained directly on sensitive data but only on non-sensitive data, the problem of privacy disclosure is avoided.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or of the prior art are briefly introduced below. It is obvious that the drawings described below show only some embodiments of the present application, and that those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flow chart of a data classification method disclosed in an embodiment of the present application;
FIG. 2 is a flow chart of a data classification method disclosed in an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a data classification system disclosed in an embodiment of the present application;
fig. 4 is a schematic structural diagram of an electronic device disclosed in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. It is obvious that the described embodiments are only some, rather than all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
The application discloses a data classification method, a flow chart of which is shown in figure 1, comprising the following steps:
s11, obtaining data to be classified;
s12, extracting characteristic information of data to be classified;
and S13, inputting the characteristic information of the data to be classified into a first classification model to obtain a classification result, and determining the classification result as the type of the data to be classified, wherein the first classification model is obtained by training on the basis of non-sensitive data marked by a plurality of second classification models, and each second classification model is obtained by training on sensitive data stored in the device where the second classification model is located.
If a classification model is trained directly on the sensitive data stored on each device, an attacker who obtains the model parameters can mount a membership inference attack or a model stealing attack against the trained model and thereby recover the sensitive data it was trained on, so the sensitive data may be leaked, which is detrimental to information security.
To avoid this problem, the present scheme trains a global classifier without the parties sharing any sensitive data. During model training, the local second classification model of each device is first trained on that device's local sensitive data. The plurality of second classification models are then used to label the unlabeled portion of the non-sensitive data, and the global classifier, i.e. the first classification model, is trained on all of the labeled non-sensitive data, so that data can be classified and labeled with the first classification model. Even if the first classification model trained in this way is attacked and its model parameters are obtained, those parameters derive only from non-sensitive data and are unrelated to the sensitive data. On the basis of achieving data classification, information security is thus guaranteed and leakage of sensitive data is avoided.
The sensitive data is private to each device and is not shared with other devices or with the server; the non-sensitive data is data that is held by all the devices or obtained from a server.
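For orientation, the overall flow can be sketched as follows. This is a minimal illustration rather than the patented implementation: the helper callables (train_local_teacher, aggregate_teacher_votes, train_global_student) and the Python structure are assumptions introduced only to make the division of data explicit.

```python
from typing import Callable, List, Sequence, Tuple

def train_pipeline(
    sensitive_datasets: Sequence[object],          # one private dataset per device
    public_unlabeled: Sequence[object],            # second part of the non-sensitive data
    public_labeled: Sequence[Tuple[object, int]],  # first part: (sample, known label)
    train_local_teacher: Callable[[object], Callable[[object], int]],
    aggregate_teacher_votes: Callable[[List[Callable], object], int],
    train_global_student: Callable[[Sequence[Tuple[object, int]]], Callable],
):
    # Step 1: each device trains its local "second classification model" (teacher)
    # on its own sensitive data; nothing sensitive is uploaded.
    teachers = [train_local_teacher(ds) for ds in sensitive_datasets]

    # Step 2: the teachers jointly label the non-sensitive data held by the server.
    pseudo_labeled = [(x, aggregate_teacher_votes(teachers, x)) for x in public_unlabeled]

    # Step 3: the server trains the global "first classification model" (student)
    # only on non-sensitive data: the known-label part plus the teacher-labeled part.
    return train_global_student(list(public_labeled) + pseudo_labeled)
```

The key property the sketch preserves is that the sensitive datasets are consumed only inside train_local_teacher on their own device, while the server-side student sees nothing but non-sensitive samples and their labels.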
After the training of the first classification model is completed, when data to be classified is obtained, its feature information is extracted and input into the first classification model, and the output of the first classification model is obtained. This output is the classification result of the data to be classified and can be directly determined as the type of the data to be classified.
The data stored locally comprises data of known type, and each device performs model training through the data of known type stored by itself.
Specifically, because the amount of locally stored data with known label types on each device is limited, a model trained on a small amount of data easily overfits, which lowers its generalization ability and robustness. To avoid this problem, in this scheme each device may train its local second classification model in a semi-supervised manner.
When each device performs semi-supervised training on its locally stored sensitive data, the proportions of labeled data in the training sets, ρ_1, ρ_2, ..., ρ_n, i.e. the ratio of labeled data to all training data on each device, can be computed at the same time. This process requires no interaction with other devices, so the sensitive data used for model training are protected from leakage to the greatest extent.
When the training data contain few labeled samples, GAN-based semi-supervised learning is typically adopted for model training, with the generator and the discriminator trained simultaneously.
The generator receives random noise and generates data that is as close to real data as possible, which serves as input to the discriminator. The discriminator acts as a multi-class classifier whose inputs are labeled real data, unlabeled real data and data from the generator; combining all of these inputs lets the discriminator learn from a wider range of examples, so it obtains a more accurate structure than a scheme that learns from labeled data only. Its output is a (K+1)-dimensional vector, in which the first K dimensions are the data classification labels and the (K+1)-th dimension is the image authenticity (real/fake) label. When training ends, the generator is discarded and only the discriminator is used as the multi-class classifier.
Wherein the loss function of the generator is:

Loss_G = -E_{z~p_z(z)} [ log D(G(z)) ]

The loss function of the discriminator is:

Loss_D = L_supervised + L_unsupervised

L_supervised = -E_{(x,y)~p_data} [ log p_model(y | x, y ≤ K) ]

L_unsupervised = -E_{x~p_data} [ log D(x) ] - E_{z~p_z(z)} [ log(1 - D(G(z))) ]

where D(x) denotes the probability that the discriminator judges x to be a real image, i.e. D(x) = 1 - p_model(y = K+1 | x).
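The loss computation can be illustrated with a short PyTorch sketch. The exact formulas in the original publication are rendered as images, so the sketch below follows the standard (K+1)-class semi-supervised GAN formulation that the surrounding text describes; the tensor names and the small smoothing constant are assumptions.

```python
import torch
import torch.nn.functional as F

def discriminator_losses(logits_labeled, labels, logits_unlabeled, logits_fake):
    """Semi-supervised GAN loss for a (K+1)-class discriminator.

    logits_* : raw outputs of shape (batch, K+1); class index K is the "fake" class.
    labels   : ground-truth labels in [0, K) for the labeled real batch.
    """
    K = logits_labeled.shape[1] - 1

    # Supervised part: ordinary cross-entropy over the K real classes.
    l_supervised = F.cross_entropy(logits_labeled[:, :K], labels)

    # Unsupervised part: real samples should get a low "fake" probability,
    # generated samples should get a high "fake" probability.
    p_fake_real = F.softmax(logits_unlabeled, dim=1)[:, K]   # p(y = K+1 | real x)
    p_fake_gen  = F.softmax(logits_fake, dim=1)[:, K]        # p(y = K+1 | G(z))
    eps = 1e-7
    l_unsupervised = -(torch.log(1.0 - p_fake_real + eps).mean()
                       + torch.log(p_fake_gen + eps).mean())

    return l_supervised + l_unsupervised

def generator_loss(logits_fake):
    """Non-saturating generator loss: make D(G(z)) = 1 - p(fake | G(z)) large."""
    K = logits_fake.shape[1] - 1
    p_fake = F.softmax(logits_fake, dim=1)[:, K]
    return -torch.log(1.0 - p_fake + 1e-7).mean()
```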
After each device has completed the training of its second classification model locally, the non-sensitive data on the server are labeled by the devices' local second classification models, so that the non-sensitive data on the server become labeled data that can be used for model training.
The non-sensitive data may come from data sets published on the Internet or from data collected through other channels.
After all the non-sensitive data have been labeled by each device's local second classification model, the server performs model training with the labeled non-sensitive data to obtain the first classification model. Because the first classification model is trained only on non-sensitive data, even an inference attack against it could at most recover non-sensitive data; the sensitive data stored locally on each device cannot be obtained, since the sensitive data are used only to train the device's local classifier and that local classifier is never used as the server's classification model. The information security of the sensitive data is therefore effectively protected.
Further, model training is performed through non-sensitive data to obtain a first classification model, which includes:
determining a first classification result of each data in the second part of data by the plurality of second classification models, determining the weight of each data in the second part of data based on the consistency of the first classification result of each data in the second part of data by different second classification models, and performing model training through the first part of data and the second part of data containing the weight to obtain a first classification model.
When the server trains the global classifier with the labeled first part of data and second part of data, features are extracted and the first layer of the network is implemented with a large convolution kernel, which gives the model better learning ability and reduces the amount of computation to some extent.
The activation function may adopt Square(x) = x²; the classification layer may employ a global average pooling layer.
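A possible realization of such a network is sketched below, assuming image-like input. The channel counts and kernel sizes are illustrative assumptions; only the large first convolution kernel, the Square activation and the global-average-pooling classification head follow the description above.

```python
import torch.nn as nn

class Square(nn.Module):
    """Activation f(x) = x^2, as mentioned for the first classification model."""
    def forward(self, x):
        return x * x

class GlobalClassifier(nn.Module):
    """Illustrative student network: large first convolution kernel,
    Square activations, and a global-average-pooling classification head."""
    def __init__(self, in_channels=1, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=7, stride=2, padding=3),  # large kernel
            Square(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            Square(),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)   # global average pooling layer
        self.fc = nn.Linear(64, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = self.pool(x).flatten(1)
        return self.fc(x)
```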
In addition, since the labels of the second part of data used for training the first classification model are produced jointly by the classifiers of multiple parties, the loss function needs to be adjusted: the consistency of the first classification results for each data item is reflected in the loss function so as to improve classification accuracy.
For the first part of data, whose categories are already determined, the consistency may be set directly to 1, i.e. the weight of each such data item is 1, indicating that its category label is trustworthy.
For the second part of data, the higher the agreement among all the second classification models on a data item's classification results, the larger the item's weight and the greater its influence on training the first classification model. If the second classification models give different classification results for an item, its classification label carries some uncertainty, so a smaller weight is assigned to it to reduce its influence on the training process and result of the first classifier. The weight is finally obtained through a transformation of the information entropy of the classification results.
The loss function may be a per-sample weighted cross-entropy:

Loss = -(1/N) Σ_{i=1}^{N} w_i Σ_{c=1}^{K} y_{i,c} log p_{i,c}

where w_i is the weight of the i-th training sample, y_{i,c} is its (possibly predicted) label for class c, and p_{i,c} is the probability of class c output by the first classification model.
This differs from the conventional loss function, which is trained with data whose class labels are all determined and which gives every training sample the same weight of 1. In this scheme, the training data used for the first classification model include data with predicted categories, namely the second part of data; for these data the training weights differ and depend on how consistently the different second classification models predict each item, which preserves the classification accuracy of the first classification model.
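One way to realize this weighting, offered only as an illustration because the exact entropy transform is not spelled out in the text, is to turn the entropy of the second classification models' votes on each pseudo-labeled sample into a weight in [0, 1] and use it in a per-sample weighted cross-entropy:

```python
import numpy as np

def consistency_weight(teacher_labels, num_classes):
    """Weight in [0, 1] for one pseudo-labeled sample, derived from the entropy of the
    second classification models' votes: unanimous votes give weight 1, a maximally
    split vote gives a weight near 0."""
    votes = np.bincount(teacher_labels, minlength=num_classes).astype(float)
    p = votes / votes.sum()
    nonzero = p[p > 0]
    entropy = -(nonzero * np.log(nonzero)).sum()
    return 1.0 - entropy / np.log(num_classes)

def weighted_cross_entropy(probs, labels, weights):
    """Per-sample weighted cross-entropy; known-label data simply use weight 1."""
    eps = 1e-12
    picked = probs[np.arange(len(labels)), labels]
    return float(-(weights * np.log(picked + eps)).mean())
```

With this choice, a sample on which all second classification models agree is weighted like the first part of data, while a sample whose votes are evenly split contributes almost nothing to the training of the first classification model.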
The embodiment discloses a data classification method: data to be classified are obtained, feature information of the data to be classified is extracted, the feature information is input into a first classification model to obtain a classification result, and the classification result is determined as the type of the data to be classified. The first classification model is obtained by training on non-sensitive data labeled by a plurality of second classification models, and each second classification model is obtained by training on the sensitive data stored in the device where it is located. In this scheme, classification is performed by the first classification model, which is trained on non-sensitive data labeled by the second classification models, and the second classification models are trained on the sensitive data stored in the respective devices. Since the first classification model is never trained directly on sensitive data but only on non-sensitive data, the problem of privacy disclosure is avoided.
The present embodiment discloses a data classification method, a flowchart of which is shown in fig. 2, and includes:
s21, performing model training through non-sensitive data to obtain a first classification model, wherein the non-sensitive data at least comprises the following components: the method comprises the steps that first part of data of known classification labels and second part of data labeled through a plurality of second classification models are obtained through training of sensitive data stored in equipment where the second classification models are located;
s22, obtaining data to be classified;
s23, extracting characteristic information of the data to be classified;
and S24, inputting the characteristic information of the data to be classified into the first classification model to obtain a classification result, and determining the classification result as the type of the data to be classified.
The non-sensitive data on the server side include both the first part of data, whose classification labels are known, i.e. whose categories are already determined, and the second part of data, which is labeled by the plurality of second classification models: the categories of the second part of data are not known on the server side, and their category labels are obtained by having the plurality of second classification models predict them.
The first part of data, whose category labels are already determined on the server side, can be used directly to train the first classification model. However, because the amount of data with determined category labels is limited, category prediction must also be performed on the second part of the server-side non-sensitive data, whose category labels are not yet determined, in order to avoid model overfitting.
When performing category prediction on the second part of data, the categories are predicted by the plurality of second classification models on the different devices. Since the second classification models on these devices differ, the category predicted for a given item in the second part of data may be the same or different across models. Therefore, when performing category prediction on each item in the second part of data, a weight needs to be determined for each second classification model, so that the result output by each second classification model can be weighted accordingly and the category finally predicted by all the second classification models can be determined.
The second part of data is respectively marked through a plurality of second classification models, and the classification mark of each data in the second part of data is determined based on the first classification result of each second classification model on each data in the second part of data and the marking weight of the second classification model.
Further, determining a labeling weight for each second classification model comprises: inputting the first part of data into each second classification model to obtain a second classification result of each second classification model on each data in the first part of data; determining the classification accuracy of each second classification model based on the classification mark of each data in the first part of data and the second classification result; and determining the marking weight of each second classification model based on the classification accuracy of each second classification model and the marking data proportion of each second classification model.
A labeling weight is determined for each second classification model based on that model's classification accuracy.
Since the classification labels of the first part of the non-sensitive data are known, the accuracy of each second classification model can be determined directly on this first part of data.
Each item in the first part of data is input into each second classification model to obtain that model's prediction of the item's category, and the prediction is then compared with the item's actual category to determine whether it is correct.
For example, suppose the first part of data contains 5 items. The 5 items are input into second classification model A; among A's 5 outputs, 3 agree with the actual, i.e. labeled, categories of the corresponding items and 2 differ, so the accuracy of second classification model A is determined to be 60%. The same 5 items are input into second classification model B; among B's outputs, 1 agrees with the actual category of its item and 4 differ, so the accuracy of second classification model B is determined to be 20%.
Each second classification model may be able to label all of the second part of data or only part of it, so the proportion of the second part of data that each second classification model labels is also determined; the labeling weight of each second classification model can then be determined from both its labeling proportion and its accuracy.
Specifically:

ω_1 = ρ_1 · acc_1 / Σ_{j=1}^{n} ρ_j · acc_j

ω_2 = ρ_2 · acc_2 / Σ_{j=1}^{n} ρ_j · acc_j

……

ω_n = ρ_n · acc_n / Σ_{j=1}^{n} ρ_j · acc_j

wherein ω_1, ω_2, ..., ω_n are the labeling weights of the second classification models, ρ_1, ρ_2, ..., ρ_n are their labeled-data proportions, and acc_1, acc_2, ..., acc_n are their labeling accuracies.
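As a concrete illustration, the labeling weights can be computed as below; normalizing the weights so that they sum to 1 is an assumption, since the original formula is given only as an image.

```python
import numpy as np

def labeling_weights(label_ratios, accuracies):
    """Labeling weight of each second classification model from its labeled-data
    proportion rho_i and its labeling accuracy acc_i."""
    rho = np.asarray(label_ratios, dtype=float)
    acc = np.asarray(accuracies, dtype=float)
    raw = rho * acc
    return raw / raw.sum()

# Example with the accuracies from the text (60% and 20%) and assumed ratios:
# labeling_weights([0.5, 0.4], [0.6, 0.2]) -> array([0.78947368, 0.21052632])
```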
After the labeling weight of each second classification model and each model's output for a given item in the second part of data have been determined, the outputs of the second classification models are aggregated with these weights; that is, the final prediction for the item is determined based on the labeling weights.
Namely:

label* = argmax_c Σ_{j=1}^{n} ω_j · I(label_j = c)

wherein label_1, label_2, ..., label_n are the second classification results given by the respective second classification models for the data item, I(·) is the indicator function, and label* is the classification mark assigned to the data item after prediction by the plurality of second classification models.
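The weighted aggregation itself is then a weighted vote, sketched below under the same assumption about how the weights were obtained; the function name and argument layout are illustrative.

```python
import numpy as np

def aggregate_labels(teacher_labels, teacher_weights, num_classes):
    """Weighted vote over the second classification models' predictions for one sample.
    Models that did not label the sample can simply be omitted from both lists."""
    scores = np.zeros(num_classes)
    for label, weight in zip(teacher_labels, teacher_weights):
        scores[label] += weight
    return int(np.argmax(scores))

# e.g. three models predicting classes [2, 2, 0] with weights [0.5, 0.3, 0.2] -> class 2
```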
In addition, the marking of the second part of data by the plurality of second classification models respectively comprises the following steps:
The second part of data is input into the different second classification models; a first classification result output by each second classification model for at least part of the second part of data is obtained; and the labeled-data proportion of each second classification model is determined based on the first classification results it outputs and the portion of the second part of data those results cover.
That is, when each second classification model classifies data in the second part of data, the proportion of the data it classifies relative to the entire second part of data, i.e. its labeled-data proportion, must also be determined, so that the model's weight can be determined from this proportion.
All of the second part of data is predicted in this way to obtain the corresponding classification labels, so that model training can be performed jointly on the predicted second part of data and the first part of data with known classification labels, yielding a first classification model with higher accuracy.
The embodiment discloses a data classification method: data to be classified are obtained, feature information of the data to be classified is extracted, the feature information is input into a first classification model to obtain a classification result, and the classification result is determined as the type of the data to be classified. The first classification model is obtained by training on non-sensitive data labeled by a plurality of second classification models, and each second classification model is obtained by training on the sensitive data stored in the device where it is located. In this scheme, classification is performed by the first classification model, which is trained on non-sensitive data labeled by the second classification models, and the second classification models are trained on the sensitive data stored in the respective devices. Since the first classification model is never trained directly on sensitive data but only on non-sensitive data, the problem of privacy disclosure is avoided.
The present embodiment discloses a data classification system, a schematic structural diagram of which is shown in fig. 3, and the data classification system includes:
an obtaining unit 31, an extracting unit 32 and a classifying unit 33.
Wherein the obtaining unit 31 obtains data to be classified;
the extraction unit 32 extracts feature information of the data to be classified;
the classifying unit 33 inputs the feature information of the data to be classified into the first classification model, obtains the classification result, determines the classification result as the type of the data to be classified,
wherein the first classification model is obtained by training on non-sensitive data labeled by a plurality of second classification models, and each second classification model is obtained by training on sensitive data stored in the device where that second classification model is located.
Further, the data classification system disclosed in this embodiment may further include:
the training unit is used for carrying out model training through non-sensitive data to obtain a first classification model, wherein the non-sensitive data at least comprises: a first portion of data with known classification labels and a second portion of data labeled by the plurality of second classification models.
Further, the data classification system disclosed in this embodiment may further include:
the marking unit is used for marking the second part of data through a plurality of second classification models respectively; and determining the classification mark of each data in the second part of data based on the first classification result of each second classification model on each data in the second part of data and the mark weight of the second classification model.
Further, the labeling unit labels the second part of data through a plurality of second classification models, respectively, including:
the marking unit respectively inputs the second part of data into different second classification models; obtaining a first classification result for at least part of the second part of data output by each second classification model; and determining the marked data proportion of each second classification model based on the first classification result output by each second classification model and at least part of the second part of data for which the first classification result aims.
Further, the data classification system disclosed in this embodiment may further include:
a weight determination unit for determining a labeling weight for each of the second classification models, wherein:
the weight determining unit is used for inputting the first part of data into each second classification model and obtaining a second classification result of each second classification model on each data in the first part of data; determining the classification accuracy of each second classification model based on the classification mark of each data in the first part of data and the second classification result; and determining the marking weight of each second classification model based on the classification accuracy of each second classification model and the marking data proportion of each second classification model.
Further, the training unit is used for determining a first classification result of the plurality of second classification models on each data in the second part of data; determining a weight of each data in the second portion of data based on a consistency of the first classification result for each data in the second portion of data based on a different second classification model; and performing model training through the first part of data and the second part of data containing the weight to obtain a first classification model.
Further, each second classification model is obtained by training sensitive data stored in a device where the second classification model is located, and the training includes:
determining sensitive data stored by the device; and training a second classification model local to the current device in a semi-supervised mode based on sensitive data stored by the current device.
The data classification system disclosed in this embodiment is implemented based on the data classification method disclosed in the above embodiment, and is not described herein again.
The embodiment discloses a data classification system, which obtains data to be classified, extracts feature information of the data to be classified, inputs the feature information into a first classification model to obtain a classification result, and determines the classification result as the type of the data to be classified. The first classification model is obtained by training on non-sensitive data labeled by a plurality of second classification models, and each second classification model is obtained by training on the sensitive data stored in the device where it is located. In this scheme, classification is performed by the first classification model, which is trained on non-sensitive data labeled by the second classification models, and the second classification models are trained on the sensitive data stored in the respective devices. Since the first classification model is never trained directly on sensitive data but only on non-sensitive data, the problem of privacy disclosure is avoided.
The embodiment discloses an electronic device, a schematic structural diagram of which is shown in fig. 4, and the electronic device includes:
a processor 41 and a memory 42.
The processor 41 is configured to obtain data to be classified; extract feature information of the data to be classified; input the feature information of the data to be classified into a first classification model to obtain a classification result, and determine the classification result as the type of the data to be classified; wherein the first classification model is obtained by training on non-sensitive data labeled by a plurality of second classification models, and each second classification model is obtained by training on sensitive data stored in the device where that second classification model is located;
the memory 42 is used to store programs for the processor to perform the above-described processes.
The electronic device disclosed in this embodiment is implemented based on the data classification method disclosed in the above embodiment, and details are not described here.
The embodiment discloses an electronic device, which obtains data to be classified, extracts feature information of the data to be classified, inputs the feature information into a first classification model to obtain a classification result, and determines the classification result as the type of the data to be classified. The first classification model is obtained by training on non-sensitive data labeled by a plurality of second classification models, and each second classification model is obtained by training on the sensitive data stored in the device where it is located. In this scheme, classification is performed by the first classification model, which is trained on non-sensitive data labeled by the second classification models, and the second classification models are trained on the sensitive data stored in the respective devices. Since the first classification model is never trained directly on sensitive data but only on non-sensitive data, the problem of privacy disclosure is avoided.
The embodiment of the present application further provides a readable storage medium, where a computer program is stored, and the computer program is loaded and executed by a processor to implement each step of the data classification method, where a specific implementation process may refer to descriptions of corresponding parts in the foregoing embodiment, and details are not repeated in this embodiment.
The present application also proposes a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the electronic device reads the computer instruction from the computer-readable storage medium, and executes the computer instruction, so that the electronic device executes the methods provided in the various optional implementation manners in the aspect of the data classification method or the aspect of the data classification system, and the specific implementation process may refer to the description of the corresponding embodiment, which is not described again.
In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the components and steps of the various examples have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method of data classification, comprising:
obtaining data to be classified;
extracting characteristic information of the data to be classified;
inputting the characteristic information of the data to be classified into a first classification model to obtain a classification result, and determining the classification result as the type of the data to be classified;
wherein the first classification model is obtained by training on non-sensitive data labeled by a plurality of second classification models, and each second classification model is obtained by training on sensitive data stored in the device where that second classification model is located.
2. The method of claim 1, further comprising:
performing model training through non-sensitive data to obtain a first classification model, wherein the non-sensitive data at least comprises: a first portion of data with known classification labels and a second portion of data labeled by the plurality of second classification models.
3. The method of claim 2, further comprising:
marking the second part of data through a plurality of second classification models respectively;
and determining the classification mark of each data in the second part of data based on the first classification result of each second classification model on each data in the second part of data and the mark weight of the second classification model.
4. The method of claim 3, wherein the labeling the second portion of data with the plurality of second classification models respectively comprises:
inputting the second part of data into different second classification models respectively;
obtaining a first classification result for at least part of the second part of data output by each of the second classification models;
and determining the marked data proportion of each second classification model based on the first classification result output by each second classification model and at least part of the second part of data for which the first classification result aims.
5. The method of claim 3, further comprising: determining a labeling weight for each of the second classification models, wherein:
the determining the labeling weight of each second classification model comprises:
inputting the first part of data into each second classification model to obtain a second classification result of each second classification model on each data in the first part of data;
determining classification accuracy of each second classification model based on the classification label and the second classification result of each data in the first part of data;
and determining the marking weight of each second classification model based on the classification accuracy of each second classification model and the marking data proportion of each second classification model.
6. The method of claim 2, wherein the model training through the non-sensitive data to obtain a first classification model comprises:
determining a first classification result of the plurality of second classification models for each data in the second portion of data;
determining a weight of each data in the second portion of data based on a consistency of a first classification result of each data in the second portion of data by a different second classification model;
and performing model training through the first part of data and the second part of data containing the weight to obtain a first classification model.
7. The method according to claim 1, wherein each of the second classification models is obtained by training sensitive data stored in a device in which the second classification model is located, and the method comprises:
determining sensitive data stored by the device;
training a second classification model local to a current device in a semi-supervised manner based on sensitive data stored by the current device.
8. A data classification system, comprising:
an obtaining unit for obtaining data to be classified;
the extraction unit is used for extracting the characteristic information of the data to be classified;
a classification unit for inputting the characteristic information of the data to be classified into a first classification model to obtain a classification result, determining the classification result as the type of the data to be classified,
wherein the first classification model is obtained by training on non-sensitive data labeled by a plurality of second classification models, and each second classification model is obtained by training on sensitive data stored in the device where that second classification model is located.
9. An electronic device, comprising:
a processor for obtaining data to be classified; extracting feature information of the data to be classified; inputting the feature information of the data to be classified into a first classification model to obtain a classification result, and determining the classification result as the type of the data to be classified; wherein the first classification model is obtained by training on non-sensitive data labeled by a plurality of second classification models, and each second classification model is obtained by training on sensitive data stored in the device where that second classification model is located;
and the memory is used for storing the program of the processor for executing the processing procedure.
10. A readable storage medium storing at least one set of instructions;
the set of instructions is configured to be called and to perform at least the data classification method according to any one of the above.
CN202111395446.5A 2021-11-23 2021-11-23 Data classification method and system and electronic equipment Active CN114065867B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111395446.5A CN114065867B (en) 2021-11-23 2021-11-23 Data classification method and system and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111395446.5A CN114065867B (en) 2021-11-23 2021-11-23 Data classification method and system and electronic equipment

Publications (2)

Publication Number Publication Date
CN114065867A CN114065867A (en) 2022-02-18
CN114065867B true CN114065867B (en) 2023-04-07

Family

ID=80279574

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111395446.5A Active CN114065867B (en) 2021-11-23 2021-11-23 Data classification method and system and electronic equipment

Country Status (1)

Country Link
CN (1) CN114065867B (en)

Also Published As

Publication number Publication date
CN114065867A (en) 2022-02-18

Similar Documents

Publication Publication Date Title
CN112990432B (en) Target recognition model training method and device and electronic equipment
Wu et al. A novel convolutional neural network for image steganalysis with shared normalization
CN108111489B (en) URL attack detection method and device and electronic equipment
CN107577945B (en) URL attack detection method and device and electronic equipment
Di Noia et al. Taamr: Targeted adversarial attack against multimedia recommender systems
CN112348117B (en) Scene recognition method, device, computer equipment and storage medium
CN110633745A (en) Image classification training method and device based on artificial intelligence and storage medium
CN109413023B (en) Training of machine recognition model, machine recognition method and device, and electronic equipment
CN109800682A (en) Driver attributes' recognition methods and Related product
CN116992299B (en) Training method, detecting method and device of blockchain transaction anomaly detection model
CN113762326A (en) Data identification method, device and equipment and readable storage medium
CN114821204A (en) Meta-learning-based embedded semi-supervised learning image classification method and system
CN108268641A (en) Invoice information recognition methods and invoice information identification device, equipment and storage medium
Liu et al. Learning multiple gaussian prototypes for open-set recognition
CN117011616A (en) Image content auditing method and device, storage medium and electronic equipment
WO2020075462A1 (en) Learner estimating device, learner estimation method, risk evaluation device, risk evaluation method, and program
CN110781467A (en) Abnormal business data analysis method, device, equipment and storage medium
CN114140670A (en) Method and device for model ownership verification based on exogenous features
CN114065867B (en) Data classification method and system and electronic equipment
Diwan et al. CNN-Keypoint Based Two-Stage Hybrid Approach for Copy-Move Forgery Detection
CN116204890A (en) Self-adaptive algorithm component library for enhancing safety of artificial intelligence algorithm
CN113887357B (en) Face representation attack detection method, system, device and medium
Deb et al. Use of auxiliary classifier generative adversarial network in touchstroke authentication
CN113762382B (en) Model training and scene recognition method, device, equipment and medium
CN116958846A (en) Video detection method, device, equipment, medium and product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant