WO2023169496A1 - Data processing method and apparatus, electronic device, and storage medium - Google Patents

Data processing method and apparatus, electronic device, and storage medium

Info

Publication number
WO2023169496A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
data processing
processing model
encoder
decoder
Prior art date
Application number
PCT/CN2023/080414
Other languages
English (en)
French (fr)
Inventor
潘征
谢春华
Original Assignee
上海熵熵微电子科技有限公司
Priority date
Filing date
Publication date
Application filed by 上海熵熵微电子科技有限公司
Publication of WO2023169496A1

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/62Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245Protecting personal data, e.g. for financial or medical purposes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems
    • G06F16/256Integrating or interfacing systems involving database management systems in federated or virtual databases
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Definitions

  • the embodiments of the present application relate to the field of data processing technology, for example, to a data processing method, device, electronic device, and storage medium.
  • Common data sharing methods mainly include federated learning and secure multi-party computation.
  • Federated learning requires the data owner and the data user to be online at the same time to complete the computing task together, and each party's computing-power demand is directly proportional to the amount of data it holds.
  • Secure multi-party computation likewise requires all parties to participate in the computation online at the same time, and because of the needs of the underlying protocol, the parties must exchange data at every step of the computation.
  • Data sharing methods in the related art require the data provider and the data user to be online simultaneously to jointly complete the training of a machine learning task; this binds the data provider to the data user and increases the difficulty of data sharing.
  • A data processing method is therefore needed that decouples data providers from data users, reduces the difficulty of data sharing, and improves sharing efficiency.
  • This application provides a data processing method and apparatus, an electronic device, and a storage medium, so as to decouple data providers from data users, reduce the difficulty of data sharing, and increase data utilization.
  • embodiments of the present application provide a data processing method, wherein the method includes: determining a data processing model structure in a data processing library according to the data type of a data source; generating a data processing model based on a training set of the data source and the data processing model structure; and determining shared data corresponding to the original data of the data source according to the data processing model, wherein the shared data and the original data have the same manifold structure and probability distribution characteristics.
  • embodiments of the present application also provide a data processing device, wherein the device includes:
  • a model structure module, configured to determine the data processing model structure in the data processing library according to the data type of the data source;
  • a model training module, configured to generate a data processing model based on the training set of the data source and the data processing model structure;
  • a shared data module, configured to determine shared data corresponding to the original data of the data source according to the data processing model, wherein the shared data and the original data have the same manifold structure and probability distribution characteristics.
  • embodiments of the present application further provide an electronic device, wherein the electronic device includes:
  • one or more processors;
  • a memory, configured to store one or more programs;
  • when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the data processing method as described in any one of the embodiments of this application.
  • embodiments of the present application also provide a computer-readable storage medium on which a computer program is stored.
  • when the computer program is executed by a processor, the data processing method as described in any one of the embodiments of the present application is implemented.
  • Figure 1 is a flow chart of a data sharing method provided by an embodiment of the present application.
  • Figure 2 is a flow chart of a data sharing method provided by another embodiment of the present application.
  • Figure 3 is a schematic diagram of the training of an encoder and a decoder provided by an embodiment of the present application.
  • Figure 4 is a schematic diagram of training of an encoder and a decoder provided by another embodiment of the present application.
  • Figure 5 is a schematic diagram of training of an encoder and a decoder provided by another embodiment of the present application.
  • Figure 6 is an example diagram of a data sharing method provided by an embodiment of the present application.
  • Figure 7 is a schematic structural diagram of a data sharing device provided by an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • Figure 1 is a flow chart of a data sharing method provided by an embodiment of the present application. This embodiment is applicable to data sharing scenarios. The method can be executed by a data sharing device, which can be implemented in hardware and/or software. Referring to Figure 1, the method provided by this embodiment of the application includes the following steps:
  • Step 110 Determine the data processing model structure in the data processing library according to the data type of the data source.
  • the data source can be the data storage location of the data owner, and the data source can include the logical address or physical address of the data.
  • the data processing library can be a model structure library pre-built by the data sharing solution provider.
  • the data processing library can include one or more data processing model structures.
  • the model structure can be used to generate shared data that has the same manifold structure and probability distribution characteristics as the original data. It can be understood that the data processing model structure can be a manifold learning model structure, which can reduce the dimensionality of high-dimensional data.
  • in this embodiment of the application, the data sharing solution provider can pre-build a data processing library; when sharing data, the data owner can query the library, according to the data type of the data in the data source, for the data processing model structure used to process that data type.
  • Step 120 Generate a data processing model based on the training set of the data source and the data processing model structure.
  • the training set may be a data set for training the data processing model structure, and the data in the training set may include at least one of image data, table data, or medical detection data.
  • Step 130 Determine the shared data corresponding to the original data of the data source according to the data processing model, where the shared data and the original data have the same manifold structure and probability distribution characteristics.
  • the original data can be data in the data source and may be information containing private data, so the original data cannot be shared with third parties for use.
  • the shared data can be desensitized data.
  • the shared data can have the same probability distribution characteristics as the original data.
  • Shared data can have the same effect as original data in the field of machine learning.
  • the manifold structure and probability distribution characteristics can be the manifold structure and probability rules of the original data values.
  • the original data can be read from the data source, and the original data can be processed using the data processing model.
  • the data output by the data processing model can be used as shared data. It is understood that the data processing model can process the original data and generate shared data that is different from the original data but contains the manifold structure and probability distribution characteristics of the original data. This shared data can be used by a third party.
  • in this embodiment of the application, the data processing model structure corresponding to the data type of the data source is selected in the data processing library, the structure is trained using the training set from the data source to generate the data processing model, and the model is used to process the original data of the data source to generate shared data that has the same manifold structure and probability distribution characteristics as the original data. Performing the privacy processing through the data processing model reduces the difficulty of data sharing and thereby increases data utilization.
  • Figure 2 is a flow chart of a data sharing method provided by another embodiment of the present application.
  • this embodiment of the application is a refinement of the above embodiment.
  • referring to Figure 2, the method provided by this embodiment of the application includes the following steps:
  • Step 210 Read the data type of the original data in the data source.
  • the original data can be data stored in the data source and can include private data that cannot be shared directly with third parties for use.
  • in this embodiment of the application, the data type of the original data can be read from the data source.
  • for example, the data type of the original data can be extracted from database metadata, as in the sketch below.
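  • As an illustration, a minimal sketch of reading column data types from database metadata is shown below. It assumes SQLAlchemy and a hypothetical SQLite file and table name; the embodiments do not prescribe a particular database API.

```python
# Sketch: extract the data type of the original data from database metadata
# (step 210). The connection URL and table name are hypothetical.
from sqlalchemy import create_engine, inspect

engine = create_engine("sqlite:///data_source.db")   # hypothetical data source
inspector = inspect(engine)

for column in inspector.get_columns("raw_data"):     # hypothetical table name
    print(column["name"], column["type"])            # e.g. "age" INTEGER
```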
  • Step 220 Search the data processing library for a data processing model structure that matches the data type.
  • data types and data processing model structures can be associated and stored in the data processing library.
  • when processing the original data of a data source, the corresponding data processing model structure can be looked up in the data processing library according to the extracted data type.
  • the data processing model structure in the data processing library may have the same identification number as its corresponding data type.
  • Step 230 Collect original data from the data source as a training set.
  • a connection can be established with the data source, and the connection can be used to read a threshold number of original data records as the training set.
  • the threshold number can be determined by the data processing model structure; for example, each data processing model structure can be configured with its own threshold number.
  • the threshold number can be the minimum amount of data required in the training set: the more data the training set contains, the better it reflects the manifold structure and probability distribution of the data, and the more accurate the trained data processing model will be.
  • Step 240 Train the data processing model structure according to the training set to generate the encoder and decoder of the data processing model.
  • the encoder can be a machine learning model that maps high-dimensional original data to low-dimensional data.
  • the decoder can be a machine learning model that maps low-dimensional data back to high-dimensional data.
  • together, the encoder and decoder form a manifold learning model.
  • the processing of the decoder can be the reverse of that of the encoder, and the encoder and decoder can each be a convolutional neural network model, a fully connected network model, or the like.
  • the data processing model may include an encoder and a decoder, and the collected training set may be used to train the encoder and decoder.
  • the trained encoder and decoder may be used as the data processing model.
  • Step 250 Input the original data of the data source into the encoder of the data processing model to generate point cloud data in a low-dimensional space.
  • the point cloud data can be original data processed by the encoder.
  • the point cloud data can be composed of one or more original data processed by the encoder.
  • the dimensions of the data included in the point cloud data are determined by the hyperparameters specified by the user.
  • the trained encoder can perform dimensionality reduction on the original data read from the data source, and the processing results can be used as the point cloud data.
  • each data point in the point cloud corresponds to one original data record of the data source.
  • Step 260 Perform data reduction on the point cloud data.
  • data reduction can be performed on the point cloud data to convert its probability distribution from a point representation to a weight representation, which can improve the convergence stability and efficiency of the subsequent mapping computation; one illustrative approach is sketched below.
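  • A minimal sketch of this point-to-weight conversion follows, assuming k-means clustering is used to build the weighted support; the embodiments do not fix a particular reduction algorithm, so k-means is an illustrative choice.

```python
# Sketch: data reduction (step 260) - convert the latent point cloud into a
# discrete measure (support atoms + probability weights). k-means is an
# assumed, illustrative reduction; n_atoms must not exceed the cloud size.
import numpy as np
from sklearn.cluster import KMeans

def reduce_point_cloud(latent_points: np.ndarray, n_atoms: int = 256):
    """Return (support, weights) approximating the point cloud distribution."""
    km = KMeans(n_clusters=n_atoms, n_init=10).fit(latent_points)
    counts = np.bincount(km.labels_, minlength=n_atoms)
    weights = counts / counts.sum()            # probability mass per atom
    return km.cluster_centers_, weights
```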
  • Step 270 Determine the mapping relationship between the data distribution probability corresponding to the point cloud data and the specified probability distribution.
  • the specified probability distribution may be a preset probability distribution, such as a uniform distribution or a Gaussian distribution.
  • the mapping relationship can be solved from the probability distribution of the point cloud data and the specified probability distribution; it reflects how the specified probability distribution maps onto the probability distribution of the point cloud data. One way to realize such a solver is sketched below.
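  • The sketch below illustrates one possible solver, assuming the POT (Python Optimal Transport) library and the weighted support produced by the reduction sketch above; the embodiments only require some optimal transport mapping solver, not this specific library.

```python
# Sketch: solve a discrete optimal transport plan (step 270) between points
# sampled from the specified distribution and the reduced latent measure.
# POT's exact EMD solver is an assumed, illustrative choice.
import numpy as np
import ot  # Python Optimal Transport, installed via `pip install pot`

def solve_ot_plan(sampled_pts, support, weights):
    n = len(sampled_pts)
    a = np.full(n, 1.0 / n)                 # uniform mass on sampled points
    M = ot.dist(sampled_pts, support)       # squared Euclidean cost matrix
    return ot.emd(a, weights, M)            # n x m optimal transport plan
```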
  • Step 280 Sampling and generating a data point set according to the specified probability distribution.
  • data can be resampled in the data space according to the specified probability distribution so that the collected data meets the requirements of the specified distribution, and the data obtained by resampling can form the data point set.
  • Step 290 Map the data point set to the data distribution probability according to the mapping relationship.
  • the data point set can be processed according to the mapping relationship obtained above, so that the mapped data point set conforms to the data distribution probability of the original data.
  • Step 2100 Remove from the mapped data point set the data whose similarity to the point cloud data exceeds the similarity threshold.
  • the similarity threshold can be a parameter for judging the similarity between the original data and the shared data, and the similarity threshold can be set by the user based on device performance and processing speed.
  • the similarity between the data in the data point set and the data in the point cloud can be determined; the similarity can be a probability value, a vector distance, a Euclidean distance, or the like.
  • when the similarity of a point in the data point set to the point cloud data exceeds the similarity threshold, that point is removed from the data point set, as sketched below.
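  • A sketch of steps 280 to 2100 follows: sample from the specified (here Gaussian) distribution, push the samples toward the latent distribution via a barycentric projection of the transport plan, and drop any mapped point whose distance to the original point cloud falls below a threshold. The barycentric projection and the Euclidean-distance criterion are illustrative assumptions consistent with the text; `solve_ot_plan` is the hypothetical helper from the previous sketch.

```python
# Sketch: sampling, mapping and similarity filtering (steps 280-2100).
import numpy as np
from scipy.spatial.distance import cdist

def sample_map_filter(support, weights, latent_cloud, n_samples, min_dist):
    rng = np.random.default_rng(0)
    dim = support.shape[1]
    samples = rng.standard_normal((n_samples, dim))  # specified Gaussian
    plan = solve_ot_plan(samples, support, weights)  # step 270 mapping
    # Barycentric projection: each sample moves to the weighted mean of the
    # support atoms it ships mass to (each row of `plan` sums to 1/n).
    mapped = (plan @ support) / plan.sum(axis=1, keepdims=True)
    # Remove points that are too similar (too close) to the original cloud.
    nearest = cdist(mapped, latent_cloud).min(axis=1)
    return mapped[nearest > min_dist]
```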
  • Step 2110 Input the data point set remaining after removal into the decoder to generate shared data.
  • the data points remaining after removal can be input into the decoder, which lifts them from the low-dimensional space to the high-dimensional space corresponding to the original data.
  • the lifted data can be used as the shared data.
  • in this embodiment of the application, the data type of the original data in the data source is read; a matching data processing model structure is obtained according to that type; original data is collected from the data source as a training set, which is used to train the encoder and decoder of the data processing model structure; the encoder processes the original data of the data source into point cloud data in a low-dimensional space; data reduction is performed on the point cloud and the mapping relationship between the data distribution probability and the specified probability distribution is determined; a data point set is generated by resampling the data space according to the specified distribution and is mapped to the data distribution probability according to the mapping relationship; data whose similarity to the point cloud exceeds the similarity threshold is removed; and the decoder converts the remaining data point set into shared data.
  • this privacy processing through the data processing model preserves the manifold structure and probability distribution characteristics of the original data, enables the sharing and utilization of the shared data, removes data with excessive similarity to prevent privacy leakage, and reduces the difficulty of sharing private data.
  • training the data processing model structure according to the training set to generate the encoder and decoder of the data processing model includes:
  • using the training set of the image data type, the encoder and the decoder are trained layer by layer in order from low resolution to high resolution; the mean square error is used as a loss function to control the parameter updates of the encoder and the decoder; wherein the encoder and the decoder are graph convolutional neural network models, each including at least a convolution layer, a linear rectification layer, and a batch normalization layer.
  • in this embodiment of the application, when the data type of the data source is image data, the encoder and decoder in the data processing model can be graph convolutional neural networks that include at least a convolution layer, a linear rectification layer, and a batch normalization layer, and a training set composed of image data can be used to train them.
  • the encoder and decoder can be trained multiple times at different resolutions: the convolutional layers corresponding to low resolutions are trained first, followed by the layers corresponding to higher resolutions. After each round of training, the mean square error between the output of the encoder and decoder and the input images of the training set can be computed to judge the training status; when the mean square error is less than a preset error value, the training of the encoder and decoder is determined to be complete.
  • referring to Figure 3, for image data, embodiments of the present application can use a convolutional neural network (CNN) model to implement an autoencoder including an encoder and a decoder, constructed by a stage-by-stage resolution feature extraction method, so as to learn the manifold structure of the image data.
  • the number of layers of the autoencoder can be related to the resolution of the image data. Taking a 128*128 image as an example, the encoder and decoder can be trained layer by layer starting from a resolution of 4*4; each encoder layer module contains a convolutional (Conv) layer, a batch normalization (BN) layer, and a linear rectification function (ReLU) layer.
  • the autoencoder can be trained layer by layer starting from low resolution: for example, first train the 4x4 encoder and decoder, then the 8x8 resolution, and so on up to the maximum resolution. During training, the mean square error is used as the loss function to guide the parameter updates of the encoder and decoder, as in the sketch below.
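  • A compressed sketch of this progressive, layer-by-layer scheme is given below, assuming PyTorch; the channel widths, optimizer settings, and downsampling scheme are illustrative choices not fixed by the embodiments.

```python
# Sketch: progressive autoencoder training for image data (Figure 3).
# Each stage activates one more Conv+BN+ReLU block on both sides and is
# trained with MSE against images resized to the current resolution.
import torch
import torch.nn as nn
import torch.nn.functional as F

def block(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1),
                         nn.BatchNorm2d(c_out), nn.ReLU())

class ProgressiveAE(nn.Module):
    def __init__(self, channels=(3, 32, 64, 128)):
        super().__init__()
        n = len(channels) - 1
        self.enc = nn.ModuleList([block(channels[i], channels[i + 1])
                                  for i in range(n)])
        self.dec = nn.ModuleList([block(channels[i + 1], channels[i])
                                  for i in reversed(range(n))])

    def forward(self, x, depth):
        for i in range(depth):                         # encode + downsample
            x = F.avg_pool2d(self.enc[i](x), 2)
        for i in range(len(self.dec) - depth, len(self.dec)):
            x = self.dec[i](F.interpolate(x, scale_factor=2))
        return x                                       # decode + upsample

def train_progressive(model, images, base_res=4, steps=200):
    # `images`: float tensor (N, 3, H, W) of training images in [0, 1]
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for depth in range(1, len(model.enc) + 1):         # 4x4, 8x8, 16x16, ...
        res = base_res * 2 ** (depth - 1)
        target = F.interpolate(images, size=(res, res))
        for _ in range(steps):
            opt.zero_grad()
            loss = F.mse_loss(model(target, depth), target)  # MSE loss
            loss.backward()
            opt.step()
```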
  • training the data processing model structure according to the training set to generate the encoder and decoder of the data processing model includes:
  • the encoder and the decoder are trained using the training set of the medical detection data type; cross entropy is used as a loss function to control the parameter updates of the encoder and the decoder; wherein the encoder and the decoder are fully connected networks, and the number of hidden layers and the number of hidden-layer dimensions of the fully connected network are determined by the dimensions of the medical detection data.
  • in this embodiment of the application, when the data source holds medical data, the encoder and decoder can be fully connected networks. A fully connected network can include multiple hidden layers, each with multiple dimensions, and the number of hidden layers and their dimensions can be determined by the dimensions of the medical data.
  • the fully connected network can be trained over multiple rounds; after each round, the cross-entropy loss function can be used to measure the training effect, and when the value of the cross-entropy loss meets the end-of-training condition, the training of the encoder and decoder is complete.
  • Figure 4 is a schematic diagram of the training of an encoder and a decoder provided by an embodiment of the present application. Referring to Figure 4, the medical detection data is one-hot data: each dimension of the data represents whether one detection indicator is negative or positive, and there may be correlations between indicators.
  • a fully connected network model can be used to build an autoencoder that performs manifold learning; the autoencoder can include an encoder and a decoder, and the number of hidden layers and hidden-layer dimensions of the fully connected network are selected according to the input dimension.
  • for the one-hot data, cross entropy (CrossEntropy) is used as the loss function to guide the parameter updates of the encoder and decoder, and training stops when the loss function reaches a certain value.
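  • The sketch below illustrates such an autoencoder for one-hot indicator vectors, assuming PyTorch; the hidden and latent widths derived from the input dimension are illustrative, and per-indicator binary cross-entropy stands in for the CrossEntropy loss named above.

```python
# Sketch: fully connected autoencoder for one-hot medical detection data
# (Figure 4), trained with a cross-entropy-style loss.
import torch
import torch.nn as nn

class MedicalAE(nn.Module):
    def __init__(self, n_indicators: int):
        super().__init__()
        h = max(n_indicators // 2, 8)     # hidden width from input dimension
        z = max(n_indicators // 4, 4)     # latent (manifold) dimension
        self.encoder = nn.Sequential(nn.Linear(n_indicators, h), nn.ReLU(),
                                     nn.Linear(h, z))
        self.decoder = nn.Sequential(nn.Linear(z, h), nn.ReLU(),
                                     nn.Linear(h, n_indicators))

    def forward(self, x):
        return self.decoder(self.encoder(x))

def train(model, data, steps=1000):
    # `data`: float tensor of 0/1 indicator vectors, shape (N, n_indicators)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.BCEWithLogitsLoss()      # cross entropy per indicator
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(data), data)
        loss.backward()
        opt.step()
```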
  • training the data processing model structure according to the training set to generate the encoder and decoder of the data processing model includes:
  • for the training set of the tabular data type, the numerical data and categorical data of the training set are extracted separately; Gaussian fitting normalization is performed on the numerical data, and entity embedding encoding is performed on the categorical data;
  • the encoder and the decoder are trained using the category vectors generated by entity embedding encoding and the Gaussian-fit-normalized numerical data; wherein the encoder and the decoder are fully connected networks, and the number of hidden layers and the number of hidden-layer dimensions of the fully connected network are determined by the dimensions of the tabular data.
  • in this embodiment of the application, when the data of the data source is tabular data, the encoder and decoder that process it can use a fully connected network structure, and the numerical and categorical data in the table need to be preprocessed in advance.
  • numerical data can be normalized by Gaussian fitting, for example mean normalization or variance normalization; categorical data can be processed by entity embedding, which vectorizes the categories while preserving, as far as possible, the relationships between them.
  • the preprocessed categorical and numerical data are input into the fully connected network to adjust its parameters, and after each round of training a loss function can be used to decide whether training is complete. Different loss functions can be adopted for the two kinds of data: for example, cross entropy for categorical data and mean square error for numerical data.
  • In an exemplary implementation, referring to Figure 5, because a table contains different data types, it can be divided into numerical data and categorical data, and each type can be preprocessed separately: Gaussian fitting normalization for the numerical data and entity embedding encoding for the categorical data. A fully connected feed-forward network is then used to construct an autoencoder; for each categorical column, entity embedding converts the discrete category label into a continuous numerical vector, while numerical data is normalized by mean, variance, and the like to obtain the preprocessed data. The two groups of data are then combined as the input of the fully connected network, whose number of hidden layers and hidden-layer dimensions are set according to the dimensions of the data.
  • during training of the fully connected network, different loss functions can be adopted to control when training is complete: cross entropy for categorical data and mean square error for numerical data, with training stopping when the value of the loss function meets the threshold.
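  • A sketch of this mixed-type autoencoder follows, assuming PyTorch and, for brevity, a single categorical column; zero-mean/unit-variance scaling stands in for Gaussian fitting normalization, and the layer sizes are illustrative.

```python
# Sketch: tabular autoencoder (Figure 5) - entity embedding for the
# categorical column, normalized numerics, and a combined loss
# (cross entropy for categories, MSE for numerics).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TabularAE(nn.Module):
    def __init__(self, n_numeric, n_categories, emb_dim=8, z_dim=16):
        super().__init__()
        self.embed = nn.Embedding(n_categories, emb_dim)  # entity embedding
        self.encoder = nn.Sequential(nn.Linear(n_numeric + emb_dim, 64),
                                     nn.ReLU(), nn.Linear(64, z_dim))
        self.num_head = nn.Linear(z_dim, n_numeric)       # numeric decoder
        self.cat_head = nn.Linear(z_dim, n_categories)    # category decoder

    def forward(self, x_num, x_cat):
        z = self.encoder(torch.cat([x_num, self.embed(x_cat)], dim=1))
        return self.num_head(z), self.cat_head(z)

def loss_fn(num_out, cat_out, x_num, x_cat):
    return (F.mse_loss(num_out, x_num)              # numeric reconstruction
            + F.cross_entropy(cat_out, x_cat))      # categorical part

# Preprocessing assumption: x_num = (raw - raw.mean(0)) / raw.std(0), and
# x_cat is a LongTensor of category indices.
```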
  • Figure 6 is an example diagram of a data sharing method provided by an embodiment of the present application. The data sharing method can be implemented on the basis of a data generation framework, which can include manifold learning models for different types of data.
  • the manifold learning models in this data generation framework also cooperate with an optimal transport mapping to learn the data pattern (including the manifold structure of the data and the probability density distribution on the manifold) from the original private data; data is then resampled in the data space according to this pattern, yielding generated data that conforms to the pattern yet differs from the original data.
  • the original data always remains with the data owner, and the generated data is used for sharing.
  • referring to Figure 6, the generation of shared data can include a learning phase and a generation phase.
  • the learning phase includes: 1. an autoencoder for the given data type learns the data manifold structure, and the output of manifold learning is point cloud data in the low-dimensional space in which the manifold is unfolded; 2. the data reduction module converts the point cloud data, turning the point representation of the probability into a weight representation; 3. finally, the data enters the optimal transport mapping solver to obtain the mapping from a specified probability distribution (uniform distribution, Gaussian distribution) to the data probability distribution.
  • the generation phase includes: 1. the specified-probability-distribution sampling module generates a data point set that conforms to the specified probability distribution; 2. the data mapping module uses the optimal transport mapping obtained in the learning phase to map the data point set to the original data's probability distribution; 3. the data filtering module removes data in the data point set that is too similar to the original data; 4. finally, the remaining data point set is input into the data decoder learned in the learning phase, yielding the different types of generated data used for sharing. A compact end-to-end sketch follows.
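  • The sketch below stitches the two phases together; every helper name comes from the hypothetical sketches above, and the encoder and decoder are assumed to accept and return NumPy arrays.

```python
# Sketch: stitching the learning and generation phases together (Figure 6).
def generate_shared_data(encoder, decoder, raw, n_samples, min_dist):
    latent = encoder(raw)                              # step 250
    support, weights = reduce_point_cloud(latent)      # step 260
    kept = sample_map_filter(support, weights, latent,
                             n_samples, min_dist)      # steps 270-2100
    return decoder(kept)                               # step 2110
```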
  • Figure 7 is a schematic structural diagram of a data sharing device provided by an embodiment of the present application.
  • the data sharing method provided by this embodiment of the present application can be implemented in software and/or hardware and can generally be integrated into a server. Referring to Figure 7, the device provided by this embodiment of the application can include: a model structure module 301, a model training module 302, and a shared data module 303.
  • the model structure module 301 is configured to determine the data processing model structure in the data processing library according to the data type of the data source.
  • the model training module 302 is configured to generate a data processing model based on the training set of the data source and the data processing model structure.
  • the shared data module 303 is configured to determine shared data corresponding to the original data of the data source according to the data processing model, where the shared data and the original data have the same manifold structure and probability distribution characteristics.
  • in this embodiment of the application, the model structure module selects, in the data processing library, the data processing model structure corresponding to the data type of the data source; the model training module trains that structure with a training set from the data source to generate the data processing model; and the shared data module uses the data processing model to process the original data of the data source into shared data that has the same manifold structure and probability distribution characteristics as the original data.
  • the model structure module 301 in the device includes:
  • the type reading unit is configured to read the data type of the original data in the data source.
  • a structure determining unit configured to search the data processing library for the data processing model structure matching the data type.
  • the model training module 302 in the device includes:
  • a training set generation unit is configured to collect original data from the data source as a training set.
  • a model training unit is configured to train the data processing model structure according to the training set to generate the encoder and decoder of the data processing model.
  • the model training unit is configured to: use the training set of the image data type to train the encoder and the decoder layer by layer in order from low resolution to high resolution; and use the mean square error as a loss function to control the parameter updates of the encoder and the decoder; wherein the encoder and the decoder are graph convolutional neural network models, each including at least a convolution layer, a linear rectification layer, a pooling layer, and a loss function layer.
  • the model training unit is configured to: train the encoder and the decoder using the training set of the medical detection data type; and use cross entropy as a loss function to control the parameter updates of the encoder and the decoder; wherein the encoder and the decoder are fully connected networks, and the number of hidden layers and the number of hidden-layer dimensions of the fully connected network are determined by the dimensions of the medical detection data.
  • the model training unit is configured to: for the training set of the tabular data type, extract the numerical data and categorical data of the training set separately; perform Gaussian fitting normalization on the numerical data and entity embedding encoding on the categorical data; and train the encoder and the decoder using the category vectors generated by entity embedding encoding together with the Gaussian-fit-normalized numerical data; wherein the encoder and the decoder are fully connected networks, and the number of hidden layers and the number of hidden-layer dimensions of the fully connected network are determined by the dimensions of the tabular data.
  • the shared data module 303 includes:
  • a data encoding module is configured to input the original data of the data source into the encoder of the data processing model to generate point cloud data in a low-dimensional space.
  • a data reduction module is configured to perform data reduction on the point cloud data.
  • the optimal transmission mapping solver module is configured to determine the mapping relationship between the data distribution probability corresponding to the point cloud data and the specified probability distribution.
  • the data point sampling module is configured to generate a data point set by sampling according to the specified probability distribution.
  • a data mapping module is configured to map the data point set to the data distribution probability according to the mapping relationship.
  • a data filtering module is configured to remove, from the mapped data point set, data whose similarity to the point cloud data exceeds the similarity threshold.
  • a data decoding module is configured to input the data point set remaining after removal into the decoder to generate the shared data.
  • Figure 8 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • the electronic device includes a processor 40, a memory 41, an input device 42, and an output device 43; the number of processors 40 in the electronic device may be one or more, and one processor 40 is taken as an example in Figure 8.
  • the processor 40, the memory 41, the input device 42, and the output device 43 in the electronic device can be connected through a bus or by other means; in Figure 8, connection through a bus is taken as an example.
  • as a computer-readable storage medium, the memory 41 can be configured to store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the data sharing method in the embodiments of the present application (for example, the model structure module 301, model training module 302, and shared data module 303 in the data sharing device).
  • the processor 40 executes software programs, instructions and modules stored in the memory 41 to execute various functional applications and data processing of the electronic device, that is, to implement the above-mentioned data sharing method.
  • the memory 41 may mainly include a stored program area and a stored data area, where the stored program area may store an operating system and at least one application program required for a function; the stored data area may store data created based on the use of the terminal, etc.
  • the memory 41 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.
  • memory 41 may include memory located remotely relative to processor 40, and these remote memories may be connected to the electronic device through a network. Examples of the above-mentioned networks include but are not limited to the Internet, intranets, local area networks, mobile communication networks and combinations thereof.
  • the input device 42 may be configured to receive input numeric or character information and to generate key signal inputs related to user settings and functional control of the electronic device.
  • the output device 43 may include a display device such as a display screen.
  • Embodiments of the present application also provide a storage medium containing computer-executable instructions, which when executed by a computer processor are used to perform a data sharing method.
  • the method includes: determining a data processing model structure in a data processing library according to the data type of a data source; generating a data processing model based on a training set of the data source and the data processing model structure; and determining shared data corresponding to the original data of the data source according to the data processing model, wherein the shared data and the original data have the same manifold structure and probability distribution characteristics.
  • the embodiments of the present application provide a storage medium containing computer-executable instructions, and the computer-executable instructions are not limited to the method operations described above; they can also perform related operations in the data sharing method provided by any embodiment of this application.
  • the present application can be implemented with the help of software and necessary general-purpose hardware, and of course can also be implemented in hardware. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the related art, can be embodied in the form of a software product.
  • the computer software product can be stored in a computer-readable storage medium, such as a computer floppy disk, read-only memory (ROM), random access memory (RAM), flash memory (FLASH), hard disk, or optical disk, and includes a number of instructions to make a computer device (which can be a personal computer, a server, or a network device, etc.) perform the methods described in the embodiments of this application.
  • Computer-readable storage media may include non-transitory computer-readable storage media.
  • the units and modules included in the above device embodiment are divided only according to functional logic, and the division is not limited to the above as long as the corresponding functions can be realized; in addition, the specific names of the functional units are only for ease of distinguishing them from one another and are not used to limit the protection scope of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

A data processing method and apparatus, an electronic device, and a storage medium. The method includes: determining a data processing model structure in a data processing library according to the data type of a data source (110); generating a data processing model based on a training set of the data source and the data processing model structure (120); and determining, according to the data processing model, shared data corresponding to the original data of the data source (130), wherein the shared data and the original data have the same manifold structure and probability distribution characteristics.

Description

Data processing method and apparatus, electronic device, and storage medium
This application claims priority to Chinese patent application No. 202210236012.9, filed with the Chinese Patent Office on March 11, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
Embodiments of the present application relate to the field of data processing technology, and relate, for example, to a data processing method and apparatus, an electronic device, and a storage medium.
Background
The three pillars of machine learning are data, algorithms, and computing power. With the development of hardware and software technology, algorithms and computing power have improved enormously, and with the arrival of big data, machine learning has become a hot research topic. Although machine learning applications have brought an intelligence revolution to a broad range of industries, machine learning projects are not easy to put into practice, and the root cause is that data acquisition has become the crux of machine learning. Machine learning places two requirements on data: 1. the data can be fully collected; 2. the data can be used centrally. In practical applications these requirements often cannot be met: for small companies the cost of data collection is too high, so collection is insufficient, and commercial data confidentiality requirements are strict, so data cannot be used centrally; as a result, machine learning is difficult to implement in real environments.
The current solution to the above problems is data sharing. Common data sharing methods mainly include federated learning and secure multi-party computation. Federated learning requires the data owner and the data user to be online at the same time to complete the computing task together, and each party's computing-power demand is directly proportional to the amount of data it holds. Secure multi-party computation likewise requires all parties to participate in the computation online at the same time, and because of the needs of the underlying protocol, the parties must exchange data at every step of the computation. Data sharing methods in the related art require the data provider and the data user to be online simultaneously to jointly complete the training of a machine learning task; this binds the provider to the user and increases the difficulty of data sharing. A data processing method that decouples data providers from data users, reduces the difficulty of data sharing, and improves sharing efficiency is therefore urgently needed.
Summary
The present application provides a data processing method and apparatus, an electronic device, and a storage medium, so as to decouple data providers from data users, reduce the difficulty of data sharing, and increase data utilization.
In a first aspect, an embodiment of the present application provides a data processing method, the method including:
determining a data processing model structure in a data processing library according to the data type of a data source;
generating a data processing model based on a training set of the data source and the data processing model structure;
determining, according to the data processing model, shared data corresponding to the original data of the data source, wherein the shared data and the original data have the same manifold structure and probability distribution characteristics.
In a second aspect, an embodiment of the present application further provides a data processing apparatus, the apparatus including:
a model structure module, configured to determine a data processing model structure in a data processing library according to the data type of a data source;
a model training module, configured to generate a data processing model based on a training set of the data source and the data processing model structure;
a shared data module, configured to determine, according to the data processing model, shared data corresponding to the original data of the data source, wherein the shared data and the original data have the same manifold structure and probability distribution characteristics.
In a third aspect, an embodiment of the present application further provides an electronic device, the electronic device including:
one or more processors;
a memory, configured to store one or more programs,
wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the data processing method according to any one of the embodiments of the present application.
In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the data processing method according to any one of the embodiments of the present application.
Brief Description of the Drawings
Figure 1 is a flow chart of a data sharing method provided by an embodiment of the present application;
Figure 2 is a flow chart of a data sharing method provided by another embodiment of the present application;
Figure 3 is a schematic diagram of the training of an encoder and a decoder provided by an embodiment of the present application;
Figure 4 is a schematic diagram of the training of an encoder and a decoder provided by another embodiment of the present application;
Figure 5 is a schematic diagram of the training of an encoder and a decoder provided by another embodiment of the present application;
Figure 6 is an example diagram of a data sharing method provided by an embodiment of the present application;
Figure 7 is a schematic structural diagram of a data sharing device provided by an embodiment of the present application;
Figure 8 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
Detailed Description
Figure 1 is a flow chart of a data sharing method provided by an embodiment of the present application. This embodiment is applicable to data sharing scenarios. The method can be executed by a data sharing device, which can be implemented in hardware and/or software. Referring to Figure 1, the method provided by this embodiment of the application includes the following steps:
Step 110: Determine the data processing model structure in the data processing library according to the data type of the data source.
Here, the data source can be the data storage location of the data owner and can include the logical address or physical address of the data. The data processing library can be a model structure library pre-built by the data sharing solution provider and can include one or more data processing model structures, which can be used to generate shared data having the same manifold structure and probability distribution characteristics as the original data. It can be understood that the data processing model structure can be a manifold learning model structure, which can reduce the dimensionality of high-dimensional data.
In this embodiment of the application, the data sharing solution provider can pre-build a data processing library; when sharing data, the data owner can query the library, according to the data type of the data in the data source, for the data processing model structure used to process that data type.
Step 120: Generate a data processing model based on the training set of the data source and the data processing model structure.
Here, the training set can be a data set used to train the data processing model structure, and the data in the training set can include at least one of image data, tabular data, or medical detection data.
For example, data can be read from the data source as a training set and input into the data processing model structure for training; during training, the parameters of the data processing model structure can be adjusted continuously until the output of the corresponding data processing model meets the end-of-training condition.
Step 130: Determine, according to the data processing model, the shared data corresponding to the original data of the data source, wherein the shared data and the original data have the same manifold structure and probability distribution characteristics.
Here, the original data can be data in the data source and may be information containing private data that cannot be shared with third parties. The shared data can be desensitized data with the same probability distribution characteristics as the original data, so that it has the same effect as the original data in machine learning; the manifold structure and probability distribution characteristics can be the manifold structure and probability laws of the original data values.
In this embodiment of the application, after the data processing model has been trained, the original data can be read from the data source and processed with the data processing model, and the data output by the model can be used as the shared data. It can be understood that the data processing model processes the original data to generate shared data that differs from the original data yet carries its manifold structure and probability distribution characteristics, and this shared data can be used by a third party.
In this embodiment of the application, the data processing model structure corresponding to the data type of the data source is selected in the data processing library, the structure is trained with a training set from the data source to generate the data processing model, and the model is used to process the original data of the data source into shared data with the same manifold structure and probability distribution characteristics as the original data. By carrying out the privacy processing of the original data through a data processing model, this embodiment reduces the difficulty of data sharing and thus increases data utilization.
Figure 2 is a flow chart of a data sharing method provided by another embodiment of the present application. This embodiment is a refinement of the above embodiment. Referring to Figure 2, the method provided by this embodiment of the application includes the following steps:
Step 210: Read the data type of the original data in the data source.
Here, the original data can be data stored in the data source and can include private data that cannot be shared directly with third parties.
In this embodiment of the application, the data type of the original data can be read from the data source; for example, it can be extracted from database metadata.
Step 220: Search the data processing library for the data processing model structure matching the data type.
For example, data types and data processing model structures can be stored in association in the data processing library; when processing the original data of a data source, the corresponding data processing model structure can be looked up in the library according to the extracted data type. As an example, a data processing model structure in the library can have the same identification number as its corresponding data type.
Step 230: Collect original data from the data source as a training set.
For example, a connection can be established with the data source and used to read a threshold number of original data records as the training set. The threshold number can be determined by the data processing model structure; for example, each structure can be configured with its own threshold number, which is the minimum amount of data required in the training set. The more data the training set contains, the better it reflects the manifold structure and probability distribution of the data, and the more accurate the trained data processing model will be.
Step 240: Train the data processing model structure according to the training set to generate the encoder and decoder of the data processing model.
Here, the encoder can be a machine learning model that maps high-dimensional original data to low-dimensional data, and the decoder can be a machine learning model that maps low-dimensional data back to high-dimensional data; together, the encoder and decoder form a manifold learning model. The processing of the decoder can be the reverse of that of the encoder, and both can be convolutional neural network models, fully connected network models, or the like.
In this embodiment of the application, the data processing model can include an encoder and a decoder; the collected training set can be used to train them, and the trained encoder and decoder can serve as the data processing model.
Step 250: Input the original data of the data source into the encoder of the data processing model to generate point cloud data in a low-dimensional space.
Here, the point cloud data can be original data processed by the encoder and can consist of one or more records so processed; the dimensionality of the data in the point cloud is determined by user-specified hyperparameters.
In this embodiment of the application, the trained encoder can perform dimensionality reduction on the original data read from the data source, and the result can be used as the point cloud data, in which each data point corresponds to one original data record of the data source.
Step 260: Perform data reduction on the point cloud data.
In this embodiment of the application, data reduction can be performed on the point cloud data to convert its probability distribution from a point representation to a weight representation, which can improve the convergence stability and efficiency of the subsequent mapping computation.
Step 270: Determine the mapping relationship between the data distribution probability corresponding to the point cloud data and a specified probability distribution.
Here, the specified probability distribution can be a preset probability distribution, such as a uniform distribution or a Gaussian distribution.
For example, the mapping relationship can be solved from the probability distribution of the point cloud data and the specified probability distribution; it reflects how the specified probability distribution maps onto the probability distribution of the point cloud data.
Step 280: Sample according to the specified probability distribution to generate a data point set.
In this embodiment of the application, data can be resampled in the data space according to the specified probability distribution so that the collected data meets the requirements of the specified distribution, and the resampled data can form the data point set.
Step 290: Map the data point set to the data distribution probability according to the mapping relationship.
For example, the data point set can be transformed according to the mapping relationship obtained above, so that the mapped data point set conforms to the data distribution probability of the original data.
Step 2100: Remove from the mapped data point set the data whose similarity to the point cloud data exceeds the similarity threshold.
Here, the similarity threshold can be a parameter for judging how similar the original data and the shared data are, and it can be set by the user according to device performance and processing speed.
In this embodiment of the application, the similarity between the data in the data point set and the data in the point cloud can be determined; the similarity can be a probability value, a vector distance, a Euclidean distance, or the like. When the similarity of a point in the data point set to the point cloud data exceeds the similarity threshold, that point is removed from the data point set.
Step 2110: Input the data point set remaining after removal into the decoder to generate the shared data.
For example, the data points remaining after removal can be input into the decoder, which lifts them from the low-dimensional space to the high-dimensional space corresponding to the original data; the lifted data can be used as the shared data.
In this embodiment of the application, the data type of the original data in the data source is read; a matching data processing model structure is obtained according to that type; original data is collected from the data source as a training set, which is used to train the encoder and decoder of the data processing model structure; the encoder processes the original data of the data source into point cloud data in a low-dimensional space; data reduction is performed on the point cloud and the mapping relationship between the data distribution probability and the specified probability distribution is determined; a data point set is generated by resampling the data space according to the specified distribution and is mapped to the data distribution probability according to the mapping relationship; data whose similarity to the point cloud exceeds the similarity threshold is removed; and the decoder converts the remaining data point set into shared data. By performing privacy processing of the original data through the data processing model, this embodiment preserves the manifold structure and probability distribution characteristics of the original data, enables the sharing and utilization of the shared data, removes overly similar data to prevent privacy leakage, and reduces the difficulty of sharing private data.
For example, on the basis of the above embodiments, training the data processing model structure according to the training set to generate the encoder and decoder of the data processing model includes:
training the encoder and the decoder layer by layer, in order from low resolution to high resolution, using the training set of the image data type; and using the mean square error as a loss function to control the parameter updates of the encoder and the decoder; wherein the encoder and the decoder are graph convolutional neural network models, each including at least a convolution layer, a linear rectification layer, and a batch normalization layer.
In this embodiment of the application, when the data type of the data source is image data, the encoder and decoder in the data processing model can be graph convolutional neural networks including at least a convolution layer, a linear rectification layer, and a batch normalization layer, trained with a training set composed of image data. The training can be carried out multiple times at different resolutions: the convolutional layers of the encoder and decoder corresponding to low resolutions are trained first, followed by the layers corresponding to higher resolutions. After each round of training, the mean square error between the output of the encoder and decoder and the input images of the training set can be computed to judge the training status; when the mean square error is less than a preset error value, the training of the encoder and decoder is determined to be complete. Referring to Figure 3, for image data, embodiments of the present application can use a convolutional neural network (CNN) model to implement an autoencoder including an encoder and a decoder, constructed by a stage-by-stage resolution feature extraction method, so as to learn the manifold structure of the image data. The number of layers of the autoencoder can be related to the resolution of the image data. Taking a 128*128 image as an example, the encoder and decoder can be trained layer by layer starting from a resolution of 4*4; each encoder layer module contains a convolutional (Conv) layer, a batch normalization (BN) layer, and a linear rectification function (ReLU) layer. The autoencoder can be trained layer by layer starting from low resolution, for example first training the 4x4 encoder and decoder, then the 8x8 resolution, and so on up to the maximum resolution. During training, the mean square error is used as the loss function to guide the parameter updates of the encoder and decoder.
For example, on the basis of the above embodiments, training the data processing model structure according to the training set to generate the encoder and decoder of the data processing model includes:
training the encoder and the decoder using the training set of the medical detection data type; and using cross entropy as a loss function to control the parameter updates of the encoder and the decoder; wherein the encoder and the decoder are fully connected networks, and the number of hidden layers and the hidden-layer dimensions of the fully connected network are determined by the dimensions of the medical detection data.
In this embodiment of the application, when the data source holds medical data, the encoder and decoder can be fully connected networks. Such a network can include multiple hidden layers, each with multiple dimensions; when training the fully connected networks of the encoder and decoder, the number of hidden layers and their dimensions can be determined by the dimensions of the medical data. The network can be trained over multiple rounds, and after each round the cross-entropy loss function can be used to measure the training effect; when the value of the cross-entropy loss meets the end-of-training condition, the training of the encoder and decoder is complete.
In an exemplary implementation, Figure 4 is a schematic diagram of the training of an encoder and a decoder provided by an embodiment of the present application. Referring to Figure 4, the medical detection data is one-hot data: each dimension of the data represents whether one detection indicator is negative or positive, and there may be correlations between indicators. A fully connected network model can be used to build an autoencoder that performs manifold learning; the autoencoder can include an encoder and a decoder, and the number of hidden layers and hidden-layer dimensions of the fully connected network are selected according to the input dimension. For one-hot data, cross entropy (CrossEntropy) is used as the loss function to guide the parameter updates of the encoder and decoder, and training stops when the loss function reaches a certain value.
For example, on the basis of the above embodiments, training the data processing model structure according to the training set to generate the encoder and decoder of the data processing model includes:
for the training set of the tabular data type, separately extracting the numerical data and categorical data of the training set; performing Gaussian fitting normalization on the numerical data and entity embedding encoding on the categorical data; and training the encoder and the decoder using the category vectors generated by entity embedding encoding together with the Gaussian-fit-normalized numerical data; wherein the encoder and the decoder are fully connected networks, and the number of hidden layers and the hidden-layer dimensions of the fully connected network are determined by the dimensions of the tabular data.
In this embodiment of the application, when the data of the data source is tabular data, the encoder and decoder that process it can use a fully connected network structure, and the numerical and categorical data in the table need to be preprocessed in advance. Numerical data can be normalized by Gaussian fitting, for example mean normalization or variance normalization of the numerical columns; categorical data can be processed by entity embedding, which vectorizes the categories while preserving, as far as possible, the relationships between them. The preprocessed categorical and numerical data are input into the fully connected network to adjust its parameters, and after each round of training a loss function can be used to decide whether training is complete. It can be understood that different loss functions can be adopted for the two kinds of data, for example cross entropy for categorical data and mean square error for numerical data.
In an exemplary implementation, referring to Figure 5, because a table contains different data types, it can be divided into numerical data and categorical data, and each type can be preprocessed separately: Gaussian fitting normalization for the numerical data and entity embedding encoding for the categorical data. A fully connected feed-forward network is then used to construct an autoencoder; for each categorical column, entity embedding converts the discrete category label into a continuous numerical vector, while numerical data is normalized by mean, variance, and the like to obtain the preprocessed data. The two groups of data are then combined as the input of the fully connected network, whose number of hidden layers and hidden-layer dimensions are set according to the dimensions of the data. During training, different loss functions can be adopted to control completion: cross entropy for categorical data and mean square error for numerical data, with training stopping when the value of the loss function meets the threshold.
In an exemplary implementation, Figure 6 is an example diagram of a data sharing method provided by an embodiment of the present application. The data sharing method can be implemented on the basis of a data generation framework that includes manifold learning models for different types of data. The manifold learning models in this framework cooperate with an optimal transport mapping to learn the data pattern (including the manifold structure of the data and the probability density distribution on the manifold) from the original private data; data is then resampled in the data space according to this pattern, yielding generated data that conforms to the pattern yet differs from the original data. The original data always remains with the data owner, and the generated data is used for sharing. Referring to Figure 6, the generation of shared data can include a learning phase and a generation phase. The learning phase includes: 1. an autoencoder for the given data type learns the data manifold structure, the output of manifold learning being point cloud data in the low-dimensional space in which the manifold is unfolded; 2. the data reduction module converts the point cloud data, turning the point representation of the probability into a weight representation; 3. finally, the data enters the optimal transport mapping solver to obtain the mapping from a specified probability distribution (uniform distribution, Gaussian distribution) to the data probability distribution. The generation phase includes: 1. the specified-probability-distribution sampling module generates a data point set conforming to the specified distribution; 2. the data mapping module uses the optimal transport mapping obtained in the learning phase to map the data point set to the original data's probability distribution; 3. the data filtering module removes data in the data point set that is too similar to the original data; 4. finally, the remaining data point set is input into the data decoder learned in the learning phase, yielding the different types of generated data used for sharing.
Figure 7 is a schematic structural diagram of a data sharing device provided by an embodiment of the present application. The data sharing method provided by this embodiment of the application can be implemented in software and/or hardware and can generally be integrated into a server. Referring to Figure 7, the device provided by this embodiment of the application can include: a model structure module 301, a model training module 302, and a shared data module 303.
The model structure module 301 is configured to determine a data processing model structure in a data processing library according to the data type of a data source.
The model training module 302 is configured to generate a data processing model based on a training set of the data source and the data processing model structure.
The shared data module 303 is configured to determine, according to the data processing model, shared data corresponding to the original data of the data source, wherein the shared data and the original data have the same manifold structure and probability distribution characteristics.
In this embodiment of the application, the model structure module selects, in the data processing library, the data processing model structure corresponding to the data type of the data source; the model training module trains that structure with a training set from the data source to generate the data processing model; and the shared data module uses the data processing model to process the original data of the data source into shared data with the same manifold structure and probability distribution characteristics as the original data. By carrying out the privacy processing of the original data through a data processing model, this embodiment reduces the difficulty of data sharing and thus increases data utilization.
For example, on the basis of the above embodiments, the model structure module 301 of the device includes:
a type reading unit, configured to read the data type of the original data in the data source;
a structure determining unit, configured to search the data processing library for the data processing model structure matching the data type.
For example, on the basis of the above embodiments, the model training module 302 of the device includes:
a training set generation unit, configured to collect original data from the data source as a training set;
a model training unit, configured to train the data processing model structure according to the training set to generate the encoder and decoder of the data processing model.
For example, on the basis of the above embodiments, the model training unit is configured to: train the encoder and the decoder layer by layer, in order from low resolution to high resolution, using the training set of the image data type; and use the mean square error as a loss function to control the parameter updates of the encoder and the decoder; wherein the encoder and the decoder are graph convolutional neural network models, each including at least a convolution layer, a linear rectification layer, a pooling layer, and a loss function layer.
For example, on the basis of the above embodiments, the model training unit is configured to: train the encoder and the decoder using the training set of the medical detection data type; and use cross entropy as a loss function to control the parameter updates of the encoder and the decoder; wherein the encoder and the decoder are fully connected networks, and the number of hidden layers and the hidden-layer dimensions of the fully connected network are determined by the dimensions of the medical detection data.
For example, on the basis of the above embodiments, the model training unit is configured to: for the training set of the tabular data type, separately extract the numerical data and categorical data of the training set; perform Gaussian fitting normalization on the numerical data and entity embedding encoding on the categorical data; and train the encoder and the decoder using the category vectors generated by entity embedding encoding together with the Gaussian-fit-normalized numerical data; wherein the encoder and the decoder are fully connected networks, and the number of hidden layers and the hidden-layer dimensions of the fully connected network are determined by the dimensions of the tabular data.
For example, on the basis of the above embodiments, the shared data module 303 includes:
a data encoding module, configured to input the original data of the data source into the encoder of the data processing model to generate point cloud data in a low-dimensional space;
a data reduction module, configured to perform data reduction on the point cloud data;
an optimal transport mapping solver module, configured to determine the mapping relationship between the data distribution probability corresponding to the point cloud data and a specified probability distribution;
a data point sampling module, configured to sample according to the specified probability distribution to generate a data point set;
a data mapping module, configured to map the data point set to the data distribution probability according to the mapping relationship;
a data filtering module, configured to remove, from the mapped data point set, data whose similarity to the point cloud data exceeds a similarity threshold;
a data decoding module, configured to input the data point set remaining after removal into the decoder to generate the shared data.
Figure 8 is a schematic structural diagram of an electronic device provided by an embodiment of the present application. As shown in Figure 8, the electronic device includes a processor 40, a memory 41, an input device 42, and an output device 43. The number of processors 40 in the electronic device may be one or more, and one processor 40 is taken as an example in Figure 8; the processor 40, memory 41, input device 42, and output device 43 in the electronic device can be connected through a bus or by other means, and connection through a bus is taken as an example in Figure 8.
As a computer-readable storage medium, the memory 41 can be configured to store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the data sharing method in the embodiments of the present application (for example, the model structure module 301, model training module 302, and shared data module 303 in the data sharing device). The processor 40 runs the software programs, instructions, and modules stored in the memory 41 to execute the various functional applications and data processing of the electronic device, that is, to implement the above data sharing method.
The memory 41 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and the application program required by at least one function, and the data storage area may store data created according to the use of the terminal, and the like. In addition, the memory 41 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some examples, the memory 41 may include memory located remotely relative to the processor 40, and such remote memory may be connected to the electronic device through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 42 can be configured to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the electronic device. The output device 43 can include a display device such as a display screen.
An embodiment of the present application further provides a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to perform a data sharing method, the method including:
determining a data processing model structure in a data processing library according to the data type of a data source;
generating a data processing model based on a training set of the data source and the data processing model structure;
determining, according to the data processing model, shared data corresponding to the original data of the data source, wherein the shared data and the original data have the same manifold structure and probability distribution characteristics.
Of course, in the storage medium containing computer-executable instructions provided by this embodiment of the present application, the computer-executable instructions are not limited to the method operations described above and can also perform related operations in the data sharing method provided by any embodiment of the present application.
From the above description of the implementations, those skilled in the art can clearly understand that the present application can be implemented with the help of software and the necessary general-purpose hardware, and of course can also be implemented in hardware. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the related art, can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as a computer floppy disk, read-only memory (ROM), random access memory (RAM), flash memory (FLASH), hard disk, or optical disk, and includes a number of instructions to make a computer device (which can be a personal computer, a server, or a network device, etc.) perform the methods described in the embodiments of the present application. The computer-readable storage medium may include a non-transitory computer-readable storage medium.
It is worth noting that, in the above embodiment of the data sharing device, the units and modules included are divided only according to functional logic, and the division is not limited to the above as long as the corresponding functions can be realized; in addition, the specific names of the functional units are only for ease of distinguishing them from one another and are not used to limit the protection scope of the present application.

Claims (10)

  1. A data processing method, comprising:
    determining a data processing model structure in a data processing library according to the data type of a data source;
    generating a data processing model based on a training set of the data source and the data processing model structure; and
    determining, according to the data processing model, shared data corresponding to original data of the data source, wherein the shared data and the original data have the same manifold structure and probability distribution characteristics.
  2. The method according to claim 1, wherein determining a data processing model structure in a data processing library according to the data type of a data source comprises:
    reading the data type of the original data in the data source; and
    searching the data processing library for the data processing model structure matching the data type.
  3. The method according to claim 1, wherein generating a data processing model based on the training set corresponding to the data source and the data processing model structure comprises:
    collecting the original data from the data source as a training set; and
    training the data processing model structure according to the training set to generate an encoder and a decoder of the data processing model.
  4. The method according to claim 3, wherein training the data processing model structure according to the training set to generate an encoder and a decoder of the data processing model comprises:
    training the encoder and the decoder layer by layer, in order from low resolution to high resolution, using the training set of the image data type; and
    using the mean square error as a loss function to control parameter updates of the encoder and the decoder;
    wherein the encoder and the decoder are each a graph convolutional neural network model, the graph convolutional neural network model comprising a convolution layer, a linear rectification layer, a pooling layer, and a loss function layer.
  5. The method according to claim 3, wherein training the data processing model structure according to the training set to generate an encoder and a decoder of the data processing model comprises:
    training the encoder and the decoder using the training set of the medical detection data type; and
    using cross entropy as a loss function to control parameter updates of the encoder and the decoder;
    wherein the encoder and the decoder are each a fully connected network, and the number of hidden layers and the hidden-layer dimensions of the fully connected network are determined by the dimensions of the medical detection data.
  6. The method according to claim 3, wherein training the data processing model structure according to the training set to generate an encoder and a decoder of the data processing model comprises:
    for the training set of the tabular data type, separately extracting the numerical data and categorical data of the training set;
    performing Gaussian fitting normalization on the numerical data, and performing entity embedding encoding on the categorical data; and
    training the encoder and the decoder using the category vectors generated by entity embedding encoding and the Gaussian-fit-normalized numerical data;
    wherein the encoder and the decoder are each a fully connected network, and the number of hidden layers and the hidden-layer dimensions of the fully connected network are determined by the dimensions of the tabular data.
  7. The method according to claim 1, wherein determining, according to the data processing model, shared data corresponding to the original data of the data source comprises:
    inputting the original data of the data source into an encoder of the data processing model to generate point cloud data in a low-dimensional space;
    performing data reduction on the point cloud data;
    determining a mapping relationship between the data distribution probability corresponding to the point cloud data and a specified probability distribution;
    sampling according to the specified probability distribution to generate a data point set;
    mapping the data point set to the data distribution probability according to the mapping relationship;
    removing, from the mapped data point set, data whose similarity to the point cloud data exceeds a similarity threshold; and
    inputting the data point set remaining after removal into a decoder to generate the shared data.
  8. A data processing apparatus, comprising:
    a model structure module, configured to determine a data processing model structure in a data processing library according to the data type of a data source;
    a model training module, configured to generate a data processing model based on a training set of the data source and the data processing model structure; and
    a shared data module, configured to determine, according to the data processing model, shared data corresponding to the original data of the data source, wherein the shared data and the original data have the same manifold structure and probability distribution characteristics.
  9. An electronic device, comprising:
    one or more processors; and
    a memory, configured to store one or more programs,
    wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the data processing method according to any one of claims 1-7.
  10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the data sharing method according to any one of claims 1-7.
PCT/CN2023/080414 2022-03-11 2023-03-09 Data processing method and apparatus, electronic device, and storage medium WO2023169496A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210236012.9A CN114707174A (zh) 2022-03-11 2022-03-11 Data processing method and apparatus, electronic device, and storage medium
CN202210236012.9 2022-03-11

Publications (1)

Publication Number Publication Date
WO2023169496A1 true WO2023169496A1 (zh) 2023-09-14

Family

ID=82167981

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/080414 WO2023169496A1 (zh) 2022-03-11 2023-03-09 Data processing method and apparatus, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN114707174A (zh)
WO (1) WO2023169496A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114707174A (zh) 2022-03-11 2022-07-05 上海熵熵微电子科技有限公司 Data processing method and apparatus, electronic device, and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109413087A (zh) * 2018-11-16 2019-03-01 京东城市(南京)科技有限公司 Data sharing method and apparatus, digital gateway, and computer-readable storage medium
CN110569663A (zh) * 2019-08-15 2019-12-13 深圳市莱法照明通信科技有限公司 Method, apparatus, system, and storage medium for educational data sharing
CN113033825A (zh) * 2021-04-21 2021-06-25 支付宝(杭州)信息技术有限公司 Privacy-preserving model training method, system, and apparatus
US20210312064A1 (en) * 2020-04-02 2021-10-07 Hazy Limited Device and method for secure private data aggregation
US20220036135A1 (en) * 2019-08-29 2022-02-03 Tencent Technology (Shenzhen) Company Limited Method and apparatus for determining image to be labeled and model training method and apparatus
CN114707174A (zh) * 2022-03-11 2022-07-05 上海熵熵微电子科技有限公司 Data processing method and apparatus, electronic device, and storage medium


Also Published As

Publication number Publication date
CN114707174A (zh) 2022-07-05

Similar Documents

Publication Publication Date Title
WO2022068196A1 Cross-modal data processing method and apparatus, storage medium, and electronic apparatus
CN109885782B Ecological environment spatial big data integration method
CN113704531A Image processing method and apparatus, electronic device, and computer-readable storage medium
WO2023169496A1 Data processing method and apparatus, electronic device, and storage medium
WO2021047373A1 Big-data-based column data processing method, device, and medium
WO2021168989A1 Method and apparatus for constructing a multi-source spatial database of power transmission line corridors
WO2021175021A1 Product push method and apparatus, computer device, and storage medium
Yang et al. Application of multitask joint sparse representation algorithm in Chinese painting image classification
WO2021051562A1 Facial feature point positioning method and apparatus, computing device, and storage medium
CN115359304B Causal invariance learning method and system for single-image feature grouping
CN116910587A Clustered federated method and apparatus based on data distribution differences
CN113852605B Automatic protocol format inference method and system based on relational reasoning
WO2021151359A1 Palm print image recognition method, apparatus, device, and computer-readable storage medium
CN112395834B Mind map generation method, apparatus, device, and storage medium based on picture input
Han et al. Grid graph-based large-scale point clouds registration
CN117829141B Dynamic entity alignment method based on attack patterns
CN116661940B Component identification method and apparatus, computer device, and storage medium
CN113190596B Method and apparatus for hybrid matching of place names and addresses
CN117173731B Model training method, image processing method, and related apparatus
CN117056550B Long-tail image retrieval method, system, device, and storage medium
Ahmad et al. Partially shaded sketch-based image search in real mobile device environments via sketch-oriented compact neural codes
CN117408259B Information extraction method and apparatus, computer device, and storage medium
CN116911268B Table information processing method, apparatus, processing device, and readable storage medium
CN110019902A Feature-matching-based home furnishing image search method and apparatus
Yin et al. A novel image retrieval method for image based localization in large-scale environment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23766073

Country of ref document: EP

Kind code of ref document: A1