CN113037489B - Data processing method, device, equipment and storage medium - Google Patents

Data processing method, device, equipment and storage medium Download PDF

Info

Publication number
CN113037489B
Authority
CN
China
Prior art keywords
deviation
feature
sample data
secret
total
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110569135.XA
Other languages
Chinese (zh)
Other versions
CN113037489A (en)
Inventor
荆博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110569135.XA priority Critical patent/CN113037489B/en
Publication of CN113037489A publication Critical patent/CN113037489A/en
Application granted granted Critical
Publication of CN113037489B publication Critical patent/CN113037489B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 9/00: Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L 9/30: Public key, i.e. encryption algorithm being computationally infeasible to invert or user's encryption keys not requiring secrecy
    • H04L 9/3066: Public key, i.e. encryption algorithm being computationally infeasible to invert or user's encryption keys not requiring secrecy, involving algebraic varieties, e.g. elliptic or hyper-elliptic curves
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/60: Protecting data
    • G06F 21/62: Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F 21/6218: Protecting access to data via a platform, e.g. using keys or access control rules, to a system of files or objects, e.g. local or distributed file system or database
    • G06F 21/6227: Protecting access to data via a platform, e.g. using keys or access control rules, to a system of files or objects, e.g. local or distributed file system or database, where protection concerns the structure of data, e.g. records, types, queries
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00: Network arrangements or protocols for supporting network services or applications
    • H04L 67/01: Protocols
    • H04L 67/10: Protocols in which an application is distributed across nodes in the network
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 9/00: Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L 9/08: Key distribution or management, e.g. generation, sharing or updating, of cryptographic keys or passwords
    • H04L 9/0816: Key establishment, i.e. cryptographic processes or cryptographic protocols whereby a shared secret becomes available to two or more parties, for subsequent use
    • H04L 9/085: Secret sharing or secret splitting, e.g. threshold schemes

Abstract

The disclosure provides a data processing method, apparatus, device, and storage medium, relating to the field of data processing and, in particular, to big data, artificial intelligence, and blockchain technologies, usable for cloud computing and cloud services. The specific implementation, executed by any one of the multi-party nodes, includes the following steps: determining a target statistic according to the deviation secret of the initial statistic of the first sample data of the local node and the obtained deviation secrets of the initial statistics of the second sample data of the other nodes among the multi-party nodes; and normalizing the first sample data according to the target statistic. The target statistic includes at least one of the feature total mean, feature total variance, and feature total standard deviation of all sample data of the multi-party nodes in a predetermined feature dimension. According to this technical solution, every node among the multi-party nodes can normalize its own sample data in a unified way while avoiding the leakage of each node's data privacy that would result from directly exchanging initial statistics.

Description

Data processing method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to big data, artificial intelligence, and blockchain technologies, which can be used for cloud computing and cloud services.
Background
With the development of artificial intelligence technology, machine learning is applied ever more widely in a variety of scenarios. In a machine learning scheme, the sample set is usually normalized before the model is trained, in order to eliminate the influence of differing feature scales and accelerate model convergence.
In distributed machine learning, multiple participants would need to share sample data in order to normalize every party's sample data uniformly. However, this approach poses a hidden risk to the data security of each participant.
Disclosure of Invention
The present disclosure provides a data processing method, apparatus, device, and storage medium, so that the sample data of multi-party nodes can be normalized uniformly while data privacy is preserved.
According to an aspect of the present disclosure, there is provided a data processing method, performed by any one of the multi-party nodes, including:
determining a target statistic value according to the deviation secret of the initial statistic value of the first sample data of the local node and the obtained deviation secret of the initial statistic value of the second sample data of other nodes in the multi-party node;
according to the target statistic value, carrying out standardization processing on the first sample data;
wherein the target statistic comprises at least one of a feature total mean, a feature total variance and a feature total standard deviation of all sample data of the multi-party node under a predetermined feature dimension.
According to another aspect of the present disclosure, there is also provided a data processing apparatus configured at any one of a plurality of nodes, including:
the target statistic value determining module is used for determining a target statistic value according to the deviation secret of the initial statistic value of the first sample data of the local node and the obtained deviation secret of the initial statistic value of the second sample data of other nodes in the multi-party node;
the standardization processing module is used for carrying out standardization processing on the first sample data according to the target statistic value;
wherein the target statistic comprises at least one of a feature total mean, a feature total variance and a feature total standard deviation of all sample data of the multi-party node under a predetermined feature dimension.
According to another aspect of the present disclosure, there is also provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform any one of the data processing methods provided by the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is also provided a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform any one of the data processing methods provided by the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is also provided a computer program product comprising a computer program which, when executed by a processor, implements any one of the data processing methods provided by the embodiments of the present disclosure.
The disclosed technology provides a new approach to the uniform normalization of the multi-party nodes' sample data while avoiding data leakage at each node.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a flowchart of a data processing method provided by an embodiment of the present disclosure;
FIG. 2 is a flow chart of another data processing method provided by the disclosed embodiments;
FIG. 3 is a flow chart of another data processing method provided by the disclosed embodiments;
FIG. 4 is a flow chart of another data processing method provided by the disclosed embodiments;
FIG. 5 is a flow chart of another data processing method provided by the disclosed embodiments;
FIG. 6 is a block diagram of a data processing apparatus provided in an embodiment of the present disclosure;
fig. 7 is a block diagram of an electronic device for implementing a data processing method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The data processing method and apparatus provided by the disclosure are suitable for multi-party nodes (two or more parties) that each hold sample data meeting the sample requirements, and allow all sample data of the multi-party nodes to be normalized uniformly without disclosing data privacy. Each data processing method of the present disclosure may be executed by a data processing apparatus, which is implemented in software and/or hardware and is specifically configured in a node device serving as one of the multi-party nodes. In an alternative embodiment, the node device may be an electronic device that participates in building a blockchain network.
For ease of understanding, each data processing method according to the present disclosure will be first described in detail.
Referring to fig. 1, a data processing method, performed by any one of a plurality of nodes, includes:
s101, determining a target statistic value according to the deviation secret of the initial statistic value of the first sample data of the local node and the obtained deviation secret of the initial statistic value of the second sample data of other nodes in the multi-party node.
And the target statistic value comprises at least one of a characteristic total mean value, a characteristic total variance and a characteristic total standard deviation of all sample data of the multi-party node under a preset characteristic dimension.
The local node is a node for executing the data processing method in the multi-party node.
Illustratively, the multi-party nodes may be determined through offline recruitment initiated by the local node. If the local node is a blockchain node participating in a blockchain network, the multi-party nodes may also be determined through recruitment on the blockchain network.
In an alternative embodiment, the local node issues a node matching request containing the sample requirements to the blockchain network based on a standardization smart contract, and receives the identification information, fed back by a miner node, of the multi-party nodes that meet the sample requirements.
In a specific implementation, the local node calls the node matching function in its self-deployed standardization smart contract to issue a node matching request to the blockchain network; the node matching request includes at least one sample requirement such as a feature dimension, a feature category, or a label category. A miner node in the blockchain network responds to the node matching request by calling the node matching function in its own self-deployed standardization smart contract and selecting, from the collected candidate nodes, at least some nodes that meet the sample requirements to serve as the multi-party nodes, and then feeds the identification information of the multi-party nodes back to the initiator node (namely the local node) of the node matching request.
Illustratively, if the number of candidate nodes meeting the sample requirement is large, the candidate nodes meeting the sample requirement may be further screened or sorted according to at least one of parameters such as node reliability, node activity, node computing capability, and node storage capability, and the screening result or the sorting result is fed back to the initiator node of the node matching request.
It is to be understood that, in order to facilitate other nodes in the multi-party nodes to know the matching condition, after the multi-party nodes are determined, the identification information of the multi-party nodes may also be fed back to other nodes except the initiator node of the node matching request.
It can be understood that having the blockchain network participate in determining the multi-party nodes expands the search range for multi-party nodes while ensuring the security of the search process and the reliability of the search results.
The initial statistic is used to reflect the statistical condition of the sample data held by a node. For example, the initial statistic may include at least one of the number of samples of the sample data, the feature sum value of the sample data in a predetermined feature dimension, and the feature deviation sum value in a predetermined feature dimension; accordingly, the deviation secret of the initial statistic may include at least one of the deviation secret of the sample number, the deviation secret of the feature sum value in the predetermined feature dimension, and the deviation secret of the feature deviation sum value in the predetermined feature dimension, which serve as the basis for determining the target statistic.
For ease of distinction, in the present disclosure the sample data of the local node is collectively referred to as first sample data, and the sample data of the multi-party nodes other than the local node is collectively referred to as second sample data. It should be noted that the first sample data and the second sample data have at least partially identical feature dimensions, and the predetermined feature dimension in the present disclosure is at least one of those shared feature dimensions. For example, for student-score sample data, the shared feature dimensions of the first and second sample data may include student ID, Chinese score, math score, and so on. The normalization in the present disclosure is a unified normalization of the data values of the multi-party nodes in the same feature dimensions. In a preferred embodiment, all feature dimensions of the first sample data and the second sample data are the same.
It should be noted that the sample data may include at least one of sample feature data and tag feature data to adapt to different application scenarios, such as supervised training and unsupervised training in machine learning or deep learning.
For example, the deviation secret of an initial statistic may be a deviation result corresponding to the local secret of that initial statistic, determined based on a secret sharing technique and finite-field elliptic curve operations. The deviation result can be used to help each of the multi-party nodes determine the target statistic while ensuring the data security of each node's sample data and avoiding privacy leakage.
In an alternative embodiment, the local node may determine the deviation secret of its initial statistic in the following way: split the initial statistic of the first sample data into secret fragments according to the number of multi-party nodes, and transmit the secret fragments to the multi-party nodes in one-to-one correspondence; obtain the secret fragments that the other multi-party nodes have respectively split and transmitted; determine the local secret of the initial statistic according to the obtained secret fragments of the multi-party nodes; generate a deviation coefficient for the initial statistic according to the node number of the local node and the node numbers of the multi-party nodes, the deviation coefficient characterizing how the local secret of the initial statistic deviates from the initial statistic itself; and determine the deviation secret of the initial statistic according to the deviation coefficient and the local secret of the initial statistic. The node number of a node is a large integer obtained by converting the node identifier according to a set encoding rule; a large integer, also called a high-precision integer, is an integer whose precision exceeds what a basic data type can store.
Illustratively, splitting the initial statistic of the first sample data into secret fragments according to the number of multi-party nodes and transmitting the secret fragments to the nodes in one-to-one correspondence may proceed as follows: split the initial statistic into secret fragments according to the number of multi-party nodes; transmit each secret fragment to the corresponding node according to the fragment numbers assigned when the fragments were split and the node numbers of the multi-party nodes, with every node recording the same correspondence between node numbers and nodes. A sketch of this splitting step is given below.
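As an aid to understanding, here is a minimal Python sketch of this splitting step under simplifying assumptions: a prime-field modulus stands in for the finite-field elliptic curve operations described in this disclosure, and the function and variable names are illustrative only, not taken from the patent.

```python
import secrets

PRIME = 2**127 - 1  # illustrative prime modulus; the disclosure uses finite-field elliptic curve operations

def split_into_fragments(initial_statistic, node_numbers):
    """Split an initial statistic into one secret fragment per node, using a random
    polynomial whose constant term is the statistic and whose degree is the
    number of multi-party nodes minus one."""
    degree = len(node_numbers) - 1
    coeffs = [initial_statistic] + [secrets.randbelow(PRIME) for _ in range(degree)]

    def poly(x):
        acc = 0
        for c in reversed(coeffs):  # Horner evaluation modulo PRIME
            acc = (acc * x + c) % PRIME
        return acc

    # the fragment sent to a node is the polynomial evaluated at that node's number
    return {node_number: poly(node_number) for node_number in node_numbers}
```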
Illustratively, the local node determines an initial statistic total value according to the sum of its own deviation secret of the initial statistic and the acquired deviation secrets of the initial statistics of the other multi-party nodes, and then determines the target statistic from the initial statistic total value. It should be noted that the sum here is a sum based on the finite-field elliptic curve.
The initial statistic total value serves as the basis for determining the target statistic and includes, but is not limited to, the total number of samples of all sample data of the multi-party nodes, and the feature sum total value and feature deviation sum total value of all sample data of the multi-party nodes in the predetermined feature dimension.
And S102, carrying out standardization processing on the first sample data according to the target statistic value.
It should be noted that, since the target statistic is determined based on the sample data of the multi-party nodes, the target statistic can be used as reference data for performing unified standardized processing on the sample data of the multi-party nodes.
In an optional embodiment, if the target statistics value includes the total feature mean value and the total feature standard deviation, the local node may perform normalization processing on the data value in the corresponding feature dimension according to the total feature standard deviation and the total feature mean value in the predetermined feature dimension in the first sample data.
In another optional embodiment, if the target statistics value includes the total feature mean and the total feature variance, the local node may determine the total feature standard deviation according to the total feature variance, and perform normalization processing on the data value in the corresponding feature dimension according to the total feature standard deviation and the total feature mean in the predetermined feature dimension in the first sample data.
In yet another optional embodiment, if the target statistic only includes the feature total mean, the local node may obtain the feature total variance or feature total standard deviation of the multi-party nodes in the predetermined feature dimension offline, and then normalize the data values in the corresponding feature dimension according to the offline-acquired feature total variance or feature total standard deviation in the predetermined feature dimension and the feature total mean in the corresponding feature dimension.
In yet another optional embodiment, if the target statistic only includes the feature total variance or feature total standard deviation in the predetermined feature dimension, the local node may obtain the feature total mean of the multi-party nodes in the predetermined feature dimension offline, and then normalize the data values in the corresponding feature dimension according to the feature total variance or feature total standard deviation in the predetermined feature dimension and the offline-acquired feature total mean in the corresponding feature dimension.
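In all of these cases the per-feature normalization itself is the usual z-score transform. A minimal sketch follows, assuming the feature total mean and feature total standard deviation have already been obtained; the function and parameter names are illustrative.

```python
def normalize_feature_column(values, feature_total_mean, feature_total_std):
    """Z-score normalize one feature column of the local node's first sample data
    using the jointly determined feature total mean and feature total standard deviation."""
    return [(v - feature_total_mean) / feature_total_std for v in values]
```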
For example, the multi-party nodes may respectively perform distributed joint training of the machine learning model by using the standardized sample data, and the present disclosure does not limit the machine learning model itself and the distributed joint training mode of the machine learning model.
Illustratively, after normalization is completed, the local node may send a normalization completion message to the other multi-party nodes in an offline manner. To widen the reach of the message and ensure that it is transmitted securely and effectively, in another optional embodiment a normalization completion message may be published to the blockchain network based on the standardization smart contract, so that blockchain nodes in the blockchain network can look up or subsequently use the normalization result according to the message. It can be understood that, to let the other multi-party nodes know who holds the uniformly normalized data, the identification information of the multi-party nodes may also be carried in the normalization completion message.
Specifically, the local node issues a standardized completion message to the blockchain network by calling a message issuing function of a self-deployed standardized intelligent contract; after each node in the block chain network receives the standardized completion message, data transaction is carried out with at least one node in the multi-party nodes according to actual requirements.
Certainly, to make it convenient for a third party to search for sample data that meets set sample requirements, the normalization completion message, carrying the identification information of the multi-party nodes, may also be sent to nodes other than the multi-party nodes.
By introducing the deviation secrets of the initial statistics of the multi-party nodes' sample data to determine the target statistic, the present disclosure enables each of the multi-party nodes to normalize its own sample data uniformly based on that target statistic, providing a new approach to the uniform normalization of multi-party sample data while avoiding the leakage of each node's data privacy that direct exchange of initial statistics would cause.
On the basis of the above technical solutions, the present disclosure also provides an alternative embodiment. In the embodiment, the determination mode of the feature overall mean is optimized and improved. In the present embodiment, reference may be made to the description of the foregoing embodiments.
Referring to fig. 2, a data processing method includes:
s201, determining the total number of samples according to the deviation secret of the number of the samples of the first sample data and the obtained deviation secret of the number of the samples of each second sample data.
The local node determines the deviation secret of the sample number of the first sample data according to the sample number of the first sample data; the other multi-party nodes each determine the deviation secret of the sample number of their own second sample data; the multi-party nodes exchange the deviation secrets of their sample numbers; and each multi-party node determines the total number of samples according to all the deviation secrets of sample numbers it has obtained, namely its own deviation secret and the deviation secrets obtained from the other multi-party nodes.
In an optional embodiment, the local node divides the sample number of the first sample data into secret fragments according to the number of the multi-party nodes, and transmits the secret fragments to each node in the multi-party nodes in a one-to-one correspondence manner; obtaining secret fragments which are respectively segmented and transmitted by other nodes in the multi-party nodes; determining local secrets of a sample number according to the obtained secret fragments of the multi-party nodes; generating a deviation coefficient of the sample number according to the node number of the local node and the node number of the multi-party node, wherein the deviation coefficient is used for representing the deviation condition of the local secret of the sample number relative to the sample number; and determining the deviating secret of the sample number according to the deviating coefficient of the sample number and the local secret.
For example, the number of samples of the first sample data is divided into secret fragments according to the number of the multi-party nodes, and the secret fragments are transmitted to each node of the multi-party nodes in a one-to-one correspondence manner, where: dividing the number of samples into secret fragments according to the number of the multi-party nodes; respectively transmitting the secret fragments to each node in the multi-party nodes according to the fragment numbers determined when the secret fragments are segmented and the node numbers of the multi-party nodes; and the corresponding relation between the node number recorded in each node and the node is the same.
Taking a multi-party node set consisting of node A, node B and node C as an example, the generation of the deviation secrets Coef_SecretA, Coef_SecretB and Coef_SecretC corresponding to the respective sample numbers sizeA, sizeB and sizeC is described in detail below.
1) Each node splits its own sample number to obtain the corresponding secret fragments.
The node identifiers of the three nodes are converted into large integers, giving node numbers id_A, id_B and id_C.
Node A constructs the polynomial f_A(x) = a1*x^2 + a2*x + sizeA from its sample number sizeA, where a1 and a2 are random numbers; substituting the node numbers into f_A(x) in turn yields node A's secret fragments sizeA_partA, sizeA_partB and sizeA_partC, namely sizeA_partA = f_A(id_A), sizeA_partB = f_A(id_B), sizeA_partC = f_A(id_C).
Node B constructs the polynomial f_B(x) = b1*x^2 + b2*x + sizeB from its sample number sizeB, where b1 and b2 are random numbers; substituting the node numbers into f_B(x) in turn yields node B's secret fragments sizeB_partA, sizeB_partB and sizeB_partC, namely sizeB_partA = f_B(id_A), sizeB_partB = f_B(id_B), sizeB_partC = f_B(id_C).
Node C constructs the polynomial f_C(x) = c1*x^2 + c2*x + sizeC from its sample number sizeC, where c1 and c2 are random numbers; substituting the node numbers into f_C(x) in turn yields node C's secret fragments sizeC_partA, sizeC_partB and sizeC_partC, namely sizeC_partA = f_C(id_A), sizeC_partB = f_C(id_B), sizeC_partC = f_C(id_C).
The random coefficients of the polynomials constructed by the different nodes may be the same or at least partially different. The highest power of the polynomial is determined by the total number of multi-party nodes.
2) Each node sends its secret fragments to the other nodes in one-to-one correspondence.
Finally, the secret fragments held by each node are as follows:
Node A: sizeA_partA, sizeB_partA and sizeC_partA;
Node B: sizeA_partB, sizeB_partB and sizeC_partB;
Node C: sizeA_partC, sizeB_partC and sizeC_partC.
3) Each node combines the secret fragments it holds, based on the finite-field elliptic curve, to obtain its own local secret. The finite-field elliptic curve can be set by a skilled person according to requirements or empirical values, or determined through extensive experiments.
Finally, the local secret determined by each node is as follows:
Node A: SecretA = sizeA_partA + sizeB_partA + sizeC_partA;
Node B: SecretB = sizeA_partB + sizeB_partB + sizeC_partB;
Node C: SecretC = sizeA_partC + sizeB_partC + sizeC_partC.
4) Each node generates a deviation coefficient according to the own node number, and generates a deviation secret according to the deviation coefficient and the local secret.
Each node can determine its own deviation coefficient by interpolation over the nodes' polynomials, according to its own node number and the node numbers of the other nodes whose secret fragments it obtained; the deviation secret is then generated as the product of the node's deviation coefficient and its local secret. The local secret is data stored by the node and must not be leaked to other nodes; the deviation secret may be shared between nodes.
Specifically, each node j's local secret is multiplied by its deviation coefficient, and the products of all nodes are summed; the result corresponds to the curve f(x) obtained by summing the base polynomials f_1(x), f_2(x), …, f_N(x) of all nodes, a curve that passes through all points (x, sum(y)). In particular, when the number of nodes N = 2, the curve degenerates into a straight line. At x = 0, the value of f(x) is the sum of the sample numbers size_j held by the multi-party nodes. Here x denotes the node number at which the secrets are evaluated, y denotes the deviation secret determined by the product of the local secret and the deviation coefficient, and sum(y) is the sum of the deviation secrets obtained by the multi-party nodes.
Alternatively, the deviation coefficient may be obtained as the value of the Lagrange interpolation basis function (Lagrange basis polynomial). It should be noted that, since the multi-party nodes only need the value of f(x) at x = 0 to obtain the sum of all sample numbers, the concrete form of each base polynomial used in the calculation does not matter; only the value of each base polynomial at x = 0 is needed, as the sketch below illustrates.
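The following is a minimal Python sketch of computing such a Lagrange basis value at x = 0 over a prime field; the prime-field arithmetic and the function name are illustrative assumptions standing in for the finite-field operations described in this disclosure.

```python
def lagrange_coeff_at_zero(own_number, all_numbers, prime):
    """Value at x = 0 of the Lagrange basis polynomial associated with
    `own_number`, computed modulo `prime`."""
    numerator, denominator = 1, 1
    for other in all_numbers:
        if other == own_number:
            continue
        numerator = (numerator * (-other)) % prime
        denominator = (denominator * (own_number - other)) % prime
    # modular inverse of the denominator (Python 3.8+)
    return (numerator * pow(denominator, -1, prime)) % prime
```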
Finally, the deviation secrets determined by the nodes are as follows:
Node A: Coef_SecretA = coefA * SecretA;
Node B: Coef_SecretB = coefB * SecretB;
Node C: Coef_SecretC = coefC * SecretC.
Accordingly, the total number of samples of the multi-party nodes is determined as sizeSum = Coef_SecretA + Coef_SecretB + Coef_SecretC = sizeA + sizeB + sizeC.
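Putting the pieces together, here is a toy end-to-end run of this three-node example in Python, reusing the split_into_fragments and lagrange_coeff_at_zero sketches above; the node numbers and sample counts are made-up illustrative values.

```python
node_numbers = {"A": 11, "B": 23, "C": 37}     # illustrative large-integer node numbers
sample_sizes = {"A": 1000, "B": 2500, "C": 800}

# 1) every node splits its own sample number and sends one fragment to each node
fragments = {src: split_into_fragments(sample_sizes[src], list(node_numbers.values()))
             for src in node_numbers}

# 2) + 3) each node sums the fragments it received, giving its local secret
local_secret = {n: sum(fragments[src][node_numbers[n]] for src in node_numbers) % PRIME
                for n in node_numbers}

# 4) each node multiplies by its Lagrange coefficient at x = 0, giving its deviation secret
deviation_secret = {n: (lagrange_coeff_at_zero(node_numbers[n], list(node_numbers.values()), PRIME)
                        * local_secret[n]) % PRIME
                    for n in node_numbers}

# the deviation secrets can be exchanged freely; their sum reconstructs the total sample count
assert sum(deviation_secret.values()) % PRIME == sum(sample_sizes.values())  # 4300
```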
S202, determining a feature sum value in the predetermined feature dimension according to the deviation secret of the feature sum value of the first sample data in the predetermined feature dimension and the obtained deviation secrets of the feature sum values of each second sample data in the corresponding feature dimension.
The local node accumulates the data values of the first sample data in the predetermined feature dimension to obtain its feature sum value in that dimension; the other multi-party nodes each accumulate the data values of their second sample data in the predetermined feature dimension to obtain their feature sum values; the local node determines the deviation secret of the feature sum value of the first sample data in the predetermined feature dimension; the other multi-party nodes each determine the deviation secret of the feature sum value of their second sample data in the predetermined feature dimension; the multi-party nodes exchange the deviation secrets of their feature sum values in the predetermined feature dimension; and each multi-party node determines the feature sum value over all nodes according to all the deviation secrets of feature sum values it has obtained in the predetermined feature dimension.
For the specific determination process of the feature sum value in the predetermined feature dimension, reference may be made to the detailed description of the total number of samples above; it suffices to substitute the feature sum value in the predetermined feature dimension for the sample number in that process, which is not repeated here.
S203, determining the feature total mean in the predetermined feature dimension according to the total number of samples and the feature sum value in the predetermined feature dimension.
The local node has obtained the total number of samples of all sample data of the multi-party nodes and the feature sum value in the predetermined feature dimension, so it can compute the ratio of the feature sum value in the predetermined feature dimension to the total number of samples and use the result as the feature total mean of all sample data of the multi-party nodes in the predetermined feature dimension.
And S204, normalizing the first sample data according to the feature total mean value under the preset feature dimension.
The local node respectively makes differences between data values under preset characteristic dimensions in the first sample data and the total characteristic mean values under corresponding characteristic dimensions; and determining the ratio of each difference value to the total standard deviation of the features under the corresponding feature dimension so as to realize the standardization processing of the first sample data. The total standard deviation of the features under the preset feature dimension can be obtained by means of offline acquisition or calculation based on a secret sharing technology and elliptic curve operation. Of course, the total standard deviation of the features under the predetermined feature dimension may also be determined by other methods in the prior art, and the details of the disclosure are not repeated.
The embodiment of the disclosure refines the deviation secret of the initial statistic into the deviation secret of the sample number and the deviation secret of the feature sum value in the predetermined feature dimension; correspondingly, the determination of the target statistic is refined into: determining the total number of samples according to the deviation secret of the sample number of the first sample data and the obtained deviation secrets of the sample numbers of each second sample data; determining the feature sum value in the predetermined feature dimension according to the deviation secret of the feature sum value of the first sample data in the predetermined feature dimension and the obtained deviation secrets of the feature sum values of each second sample data in the corresponding feature dimension; and determining the feature total mean according to the total number of samples and the feature sum value in the predetermined feature dimension. This completes the way the feature total mean is determined, provides data support for the uniform normalization of sample data at every multi-party node, and at the same time avoids leaking the data privacy of each node.
It should be noted that S202 may be executed before or after S201, and may also be executed in parallel or in an intersecting manner with S201, and the present disclosure does not limit the specific execution order of the two.
On the basis of the above technical solutions, the present disclosure also provides another alternative embodiment. In the embodiment, the determination mode of the feature overall mean is optimized and improved. In the present embodiment, reference may be made to the description of the foregoing embodiments.
Referring to fig. 3, a data processing method includes:
s301, determining the total number of samples according to the deviation secrets of the number of samples of the first sample data and the deviation secrets of the number of samples of each second sample data.
S302, determining a simulated feature mean of the first sample data in the predetermined feature dimension according to the total number of samples and the feature sum value of the first sample data in the predetermined feature dimension.
The local node may determine the simulated feature mean in either of the following ways: accumulate the data values of the first sample data in the predetermined feature dimension to obtain the feature sum value in that dimension, and take the ratio of this feature sum value to the total number of samples as the simulated feature mean in the predetermined feature dimension; or determine the feature mean in the predetermined feature dimension from the data values of the first sample data in that dimension, and take the product of the ratio of the sample number of the first sample data to the total number of samples and the feature mean in the predetermined feature dimension as the simulated feature mean in that dimension. The two forms are equivalent, as the sketch below shows.
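A minimal Python sketch of the two equivalent forms described above; the function name and tolerance check are illustrative.

```python
import math

def simulated_feature_mean(local_values, total_sample_count):
    """Local node's contribution to the global feature mean, computed two equivalent ways."""
    local_sum = sum(local_values)
    local_count = len(local_values)
    form_1 = local_sum / total_sample_count                                   # feature sum value / total count
    form_2 = (local_count / total_sample_count) * (local_sum / local_count)   # share of samples * local mean
    assert math.isclose(form_1, form_2)
    return form_1
```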
S303, determining a total feature mean value under the preset feature dimension according to the deviation secret of the simulated feature mean value of the first sample data under the preset feature dimension and the obtained deviation secret of the simulated feature mean value of each second sample data under the corresponding feature dimension.
The local node determines the deviation secret of the simulated feature mean of the first sample data in the predetermined feature dimension; the other multi-party nodes each determine the deviation secret of the simulated feature mean of their second sample data in the predetermined feature dimension; the multi-party nodes exchange the deviation secrets of their simulated feature means in the predetermined feature dimension; and each multi-party node determines the feature total mean according to all the deviation secrets of simulated feature means it has obtained in the predetermined feature dimension.
For the specific determination process of the total feature mean value under the predetermined feature dimension, the detailed description of the total number of samples may be referred to, and only the simulated feature mean value under the predetermined feature dimension is substituted for the number of samples in the determination process of the total number of samples, which is not described herein again.
S304, normalizing the first sample data according to the feature total mean value under the preset feature dimension.
The local node respectively makes differences between data values under preset feature dimensions in the first sample data and feature total average values under corresponding dimensions; and determining the ratio of each difference value to the total standard deviation of the features under the corresponding feature dimension so as to realize the standardization processing of the first sample data. The total standard deviation of the features under the preset feature dimension can be obtained by means of offline acquisition or calculation based on a secret sharing technology and elliptic curve operation. Of course, the total standard deviation of the features under the predetermined feature dimension may also be determined by other methods in the prior art, and the details of the disclosure are not repeated.
The embodiment of the present disclosure refines the deviation secret of the initial statistic into the deviation secret of the sample number; correspondingly, the determination of the target statistic is refined into: determining the total number of samples according to the deviation secret of the sample number of the first sample data and the deviation secrets of the sample numbers of each second sample data; determining the simulated feature mean of the first sample data in the predetermined feature dimension according to the total number of samples and the feature sum value of the first sample data in that dimension; and determining the feature total mean in the predetermined feature dimension according to the deviation secret of the simulated feature mean of the first sample data in the predetermined feature dimension and the obtained deviation secrets of the simulated feature means of each second sample data in the corresponding feature dimension. This completes the way the feature total mean is determined, provides data support for the uniform normalization of sample data at every multi-party node, and at the same time avoids leaking the data privacy of each node.
On the basis of the above technical solutions, the present disclosure also provides an alternative embodiment. In this embodiment, the way of determining the feature total standard deviation and/or feature total variance is optimized and improved. For details not covered in this embodiment, reference may be made to the description of the foregoing embodiments.
Referring to fig. 4, a data processing method includes:
S401, determining the total number of samples according to the deviation secrets of the number of samples of the first sample data and the obtained deviation secrets of the number of samples of each second sample data.
S402, determining a feature deviation sum value in the predetermined feature dimension according to the deviation secret of the feature deviation sum value of the first sample data in the predetermined feature dimension and the obtained deviation secrets of the feature deviation sum values of each second sample data in the corresponding feature dimension.
The feature deviation sum value of sample data in the predetermined feature dimension is used to characterize how far each data value of the sample data in that dimension deviates from the feature total mean for that dimension. The feature total mean may be obtained offline or determined using the technical solutions provided in the above embodiments; of course, it may also be determined by other existing methods, which are not repeated in this disclosure.
The local node accumulates the squared differences between each data value of the first sample data in the predetermined feature dimension and the feature total mean in that dimension, obtaining its feature deviation sum value in the predetermined feature dimension; the other multi-party nodes each accumulate the squared differences between the data values of their second sample data and the feature total mean in the predetermined feature dimension, obtaining their feature deviation sum values; the local node determines the deviation secret of the feature deviation sum value of the first sample data in the predetermined feature dimension; the other multi-party nodes each determine the deviation secret of the feature deviation sum value of their second sample data in the predetermined feature dimension; the multi-party nodes exchange the deviation secrets of their feature deviation sum values in the predetermined feature dimension; and each multi-party node determines the feature deviation sum value over all nodes according to all the deviation secrets of feature deviation sum values it has obtained in the predetermined feature dimension.
For the specific determination process of the total deviation value of the features in the predetermined feature dimension, reference may be made to the foregoing detailed description of the total number of samples, and only the total deviation value of the features in the predetermined feature dimension is required to replace the number of samples in the total number of samples determination process, which is not described herein again.
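A minimal Python sketch of the local accumulation described above, assuming the feature total mean for the dimension has already been determined; the function name is illustrative.

```python
def feature_deviation_sum(local_values, feature_total_mean):
    """Local node's feature deviation sum value: the sum of squared differences
    between each local data value and the jointly determined feature total mean."""
    return sum((v - feature_total_mean) ** 2 for v in local_values)
```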
And S403, determining a total standard deviation and/or a total variance of the features in the preset feature dimension according to the total number of the samples and the total deviation value of the features in the preset feature dimension.
The local node has obtained the total number of samples of all sample data of the multi-party nodes and the feature deviation sum value in the predetermined feature dimension. It can therefore compute the ratio of the feature deviation sum value in the predetermined feature dimension to the total number of samples and use the result as the feature total variance of all sample data of the multi-party nodes in that dimension, and take the square root of the feature total variance in the predetermined feature dimension as the feature total standard deviation of all sample data of the multi-party nodes in that dimension.
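A minimal Python sketch of this step, with illustrative names:

```python
import math

def feature_total_variance_and_std(feature_deviation_sum_total, total_sample_count):
    """Feature total variance and feature total standard deviation derived from the
    jointly reconstructed deviation sum and total sample count."""
    variance = feature_deviation_sum_total / total_sample_count
    return variance, math.sqrt(variance)
```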
S404, normalizing the first sample data according to the total standard deviation and/or the total variance of the features of the preset feature dimension.
The local node computes the difference between each data value of the first sample data in the predetermined feature dimension and the feature total mean in the corresponding dimension, and determines the ratio of each difference to the feature total standard deviation in the corresponding dimension, thereby normalizing the first sample data. The feature total mean in the predetermined feature dimension may be obtained offline or determined using the technical solutions provided in the above embodiments; of course, it may also be determined by other existing methods, which are not repeated in this disclosure.
It should be noted that S402 may be executed before or after S401, and may also be executed in parallel or in an intersection with S401, and the present disclosure does not limit the specific execution order of the two.
The embodiment of the disclosure refines the deviation secret of the initial statistic into the deviation secret of the sample number and the deviation secret of the feature deviation sum value in the predetermined feature dimension; correspondingly, the determination of the target statistic is refined into: determining the total number of samples according to the deviation secret of the sample number of the first sample data and the obtained deviation secrets of the sample numbers of each second sample data; determining the feature deviation sum value in the predetermined feature dimension according to the deviation secret of the feature deviation sum value of the first sample data in the predetermined feature dimension and the obtained deviation secrets of the feature deviation sum values of each second sample data in the corresponding feature dimension; and determining the feature total standard deviation and/or feature total variance in the predetermined feature dimension according to the total number of samples and the feature deviation sum value in that dimension. This completes the way the feature total standard deviation and/or feature total variance is determined, provides data support for the uniform normalization of sample data at every multi-party node, and at the same time avoids leaking the data privacy of each node.
On the basis of the above technical solutions, the present disclosure also provides another alternative embodiment. In the embodiment, the determination mode of the characteristic total standard deviation and/or the characteristic total variance is optimized and improved. In the present embodiment, reference may be made to the description of the foregoing embodiments.
Referring to fig. 5, a data processing method includes:
s501, determining the total number of samples according to the deviation secrets of the number of samples of the first sample data and the obtained deviation secrets of the number of samples of each second sample data.
S502, determining a simulated feature variance of the first sample data in the preset feature dimension according to the total number of the samples and the feature deviation sum value of the first sample data in the preset feature dimension.
The local node may determine the simulated feature variance in either of the following ways: accumulate the squared differences between the data values of the first sample data in the predetermined feature dimension and the feature total mean in that dimension to obtain the feature deviation sum value in the predetermined feature dimension, and take the ratio of this feature deviation sum value to the total number of samples as the simulated feature variance; or determine the feature variance in the predetermined feature dimension from the feature deviation sum value of the first sample data in that dimension, and take the product of the ratio of the sample number of the first sample data to the total number of samples and this feature variance as the simulated feature variance in the predetermined feature dimension. The two forms are equivalent, as the sketch below shows.
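A minimal Python sketch of these two equivalent forms, assuming the feature total mean and total sample count are already known; the names are illustrative.

```python
import math

def simulated_feature_variance(local_values, feature_total_mean, total_sample_count):
    """Local node's contribution to the global variance, computed two equivalent ways."""
    deviation_sum = sum((v - feature_total_mean) ** 2 for v in local_values)
    form_1 = deviation_sum / total_sample_count
    local_variance = deviation_sum / len(local_values)            # computed against the global mean
    form_2 = (len(local_values) / total_sample_count) * local_variance
    assert math.isclose(form_1, form_2)
    return form_1
```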
S503, determining the total standard deviation and/or the total variance of the features in the preset feature dimension according to the deviation secret of the simulated feature variance of the first sample data in the preset feature dimension and the deviation secret of the simulated feature variance of each acquired second sample data in the corresponding feature dimension.
The local node determines the deviation secret of the simulated feature variance of the first sample data in the predetermined feature dimension; the other multi-party nodes each determine the deviation secret of the simulated feature variance of their second sample data in the predetermined feature dimension; the multi-party nodes exchange the deviation secrets of their simulated feature variances in the predetermined feature dimension; each multi-party node determines the feature total variance according to all the deviation secrets of simulated feature variances it has obtained in the predetermined feature dimension; and the square root of the feature total variance in the predetermined feature dimension is taken as the feature total standard deviation of all sample data of the multi-party nodes in that dimension.
For the specific determination process of the total variance of the features in the predetermined feature dimension, the detailed description of the total number of samples may be referred to, and only the simulated feature variance in the predetermined feature dimension is substituted for the number of samples in the determination process of the total number of samples, which is not described herein again.
S504, normalizing the first sample data according to the total standard deviation and/or the total variance of the features of the preset feature dimension.
The local node computes the difference between each data value of the first sample data in the predetermined feature dimension and the feature total mean in the corresponding dimension, and determines the ratio of each difference to the feature total standard deviation, thereby normalizing the first sample data. The feature total mean in the predetermined feature dimension may be obtained offline or determined using the technical solutions provided in the above embodiments; of course, it may also be determined by other existing methods, which are not repeated in this disclosure.
The embodiment of the present disclosure refines the deviation secret of the initial statistic into the deviation secret of the sample number; correspondingly, the determination of the target statistic is refined into: determining the total number of samples according to the deviation secret of the sample number of the first sample data and the obtained deviation secrets of the sample numbers of each second sample data; determining the simulated feature variance of the first sample data in the predetermined feature dimension according to the total number of samples and the feature deviation sum value of the first sample data in that dimension; and determining the feature total standard deviation and/or feature total variance in the predetermined feature dimension according to the deviation secret of the simulated feature variance of the first sample data in the predetermined feature dimension and the obtained deviation secrets of the simulated feature variances of each second sample data in the corresponding feature dimension. This completes the way the feature total standard deviation and/or feature total variance is determined, provides data support for the uniform normalization of sample data at every multi-party node, and at the same time avoids leaking the data privacy of each node.
As an implementation of each of the above data processing methods, the present disclosure also provides an optional embodiment of a virtual device that implements the data processing method.
Referring to the block diagram of the data processing apparatus shown in fig. 6, the data processing apparatus 600, configured at any one of the multi-party nodes, includes: a target statistic determination module 601 and a normalization processing module 602; wherein:
a target statistic determination module 601, configured to determine a target statistic according to the deviation secret of the initial statistical value of the first sample data of the local node and the obtained deviation secrets of the initial statistical values of the second sample data of the other nodes in the multi-party nodes;
a normalization processing module 602, configured to perform normalization processing on the first sample data according to the target statistic;
wherein the target statistic comprises at least one of a feature total mean, a feature total variance and a feature total standard deviation of all sample data of the multi-party node under a predetermined feature dimension.
In the embodiment of the present disclosure, the target statistical value is determined by introducing the deviation secrets of the initial statistical values of the sample data of the multi-party nodes, so that each node in the multi-party nodes can perform unified standardization on its own sample data based on the target statistical value. This provides a new idea for the unified standardization of the sample data of multi-party nodes, and at the same time avoids the leakage of each node's data privacy that direct interaction of the initial statistical values would cause.
In an alternative embodiment, the deviation secrets of the initial statistical value comprise a deviation secret of the sample number and a deviation secret of the feature sum value in a predetermined feature dimension;
the target statistic determination module 601 includes:
a total sample number determining unit, configured to determine a total sample number according to the deviation secret of the sample number of the first sample data and the obtained deviation secret of the sample number of each second sample data;
a feature sum value determination unit, configured to determine the feature sum value in the predetermined feature dimension according to the deviation secret of the feature sum value of the first sample data in the predetermined feature dimension and the obtained deviation secrets of the feature sum values of each second sample data in the corresponding feature dimension;
and a feature total mean determining unit, configured to determine the feature total mean in the predetermined feature dimension according to the sample total number and the feature sum value in the predetermined feature dimension. A brief computational sketch of this determination follows below.
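Under the same additive-share assumption as above, the computation performed by these units can be pictured as follows; every name is illustrative rather than part of the claimed apparatus.

```python
def feature_total_mean(own_count_secret, other_count_secrets,
                       own_sum_secret, other_sum_secrets):
    """Recover the sample total number and the feature sum value by summing the
    local deviation secret with the received ones, then divide to obtain the
    feature total mean in the predetermined feature dimension."""
    sample_total = own_count_secret + sum(other_count_secrets)
    feature_sum = own_sum_secret + sum(other_sum_secrets)
    return feature_sum / sample_total
```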
In an alternative embodiment, the deviation secret of the initial statistical value comprises a deviation secret of the sample number;
the target statistic determination module 601 includes:
a total sample number determining unit, configured to determine a total sample number according to the deviation secret of the sample number of the first sample data and the deviation secret of the sample number of each second sample data;
a simulated feature mean determination unit, configured to determine a simulated feature mean of the first sample data in the predetermined feature dimension according to the total number of samples and the feature sum value of the first sample data in the predetermined feature dimension;
and a feature total mean determining unit, configured to determine the feature total mean in the predetermined feature dimension according to the deviation secret of the simulated feature mean of the first sample data in the predetermined feature dimension and the obtained deviation secrets of the simulated feature means of each second sample data in the corresponding feature dimension. A brief sketch of this variant follows below.
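A hedged sketch of the simulated-feature-mean variant, under the same additive-share assumption: each node divides its own feature sum by the jointly recovered sample total, masks the result as a deviation secret, and the exchanged values sum to the feature total mean, because every per-node feature sum is divided by the same denominator. Names are hypothetical.

```python
def simulated_feature_mean(local_feature_sum, sample_total):
    """The local node's feature sum divided by the sample total of all nodes."""
    return local_feature_sum / sample_total

def feature_total_mean_from_secrets(own_secret, other_secrets):
    """Summing the deviation secrets of all simulated feature means recovers
    the feature total mean in the predetermined feature dimension."""
    return own_secret + sum(other_secrets)
```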
In an alternative embodiment, the deviation secrets of the initial statistical value comprise a deviation secret of the sample number and a deviation secret of the feature deviation sum value in the predetermined feature dimension;
the target statistic determination module 601 includes:
a total sample number determining unit, configured to determine a total sample number according to the deviation secret of the sample number of the first sample data and the obtained deviation secret of the sample number of each second sample data;
a feature deviation sum value determination unit, configured to determine a feature deviation sum value in the predetermined feature dimension according to a deviation secret of the feature deviation sum value of the first sample data in the predetermined feature dimension and the acquired deviation secret of the feature deviation sum value of each second sample data in the corresponding feature dimension;
a feature total standard deviation determining unit, configured to determine the feature total standard deviation in the predetermined feature dimension according to the total number of samples and the feature deviation sum value in the predetermined feature dimension; and/or a feature total variance determining unit, configured to determine the feature total variance in the predetermined feature dimension according to the feature deviation sum value in the predetermined feature dimension and the total number of samples. A brief computational sketch of these determinations follows below.
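Assuming, for illustration only, that a node's feature deviation sum value is the sum of squared differences between its data values and the feature total mean in that dimension, and that the variance is taken as the population variance (deviation sum divided by the sample total), the two determining units reduce to the following arithmetic. Names are hypothetical.

```python
import math

def feature_total_variance(total_deviation_sum, sample_total):
    """Across-node feature deviation sum value divided by the sample total."""
    return total_deviation_sum / sample_total

def feature_total_standard_deviation(total_deviation_sum, sample_total):
    """Square root of the feature total variance."""
    return math.sqrt(feature_total_variance(total_deviation_sum, sample_total))
```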
In an alternative embodiment, the deviation secret of the initial statistical value comprises a deviation secret of the sample number;
the target statistic determination module 601 includes:
a total sample number determining unit, configured to determine a total sample number according to the deviation secret of the sample number of the first sample data and the obtained deviation secret of the sample number of each second sample data;
a simulated feature variance determining unit, configured to determine a simulated feature variance of the first sample data in the predetermined feature dimension according to the total number of samples and a feature deviation sum of the first sample data in the predetermined feature dimension;
a feature total standard deviation determining unit, configured to determine the feature total standard deviation in the predetermined feature dimension according to the deviation secret of the simulated feature variance of the first sample data in the predetermined feature dimension and the acquired deviation secrets of the simulated feature variances of each second sample data in the corresponding feature dimension; and/or a feature total variance determining unit, configured to determine the feature total variance in the predetermined feature dimension according to the deviation secret of the simulated feature variance of the first sample data in the predetermined feature dimension and the obtained deviation secrets of the simulated feature variances of each second sample data in the corresponding feature dimension. A small numeric check of the identity this variant relies on follows below.
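The simulated-feature-variance variant rests on a simple identity: dividing each node's feature deviation sum value by the common sample total and adding the results gives the same value as dividing the overall deviation sum by the sample total. The numbers below are hypothetical and serve only to check that identity.

```python
# Per-node feature deviation sum values and a common sample total (hypothetical).
node_deviation_sums = [12.0, 8.0, 5.0]
sample_total = 50
simulated_variances = [d / sample_total for d in node_deviation_sums]
# Sum of simulated feature variances equals the feature total variance.
assert abs(sum(simulated_variances) - sum(node_deviation_sums) / sample_total) < 1e-12
```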
In an optional embodiment, the sample data comprises sample characteristic data and/or tag characteristic data; the sample data includes the first sample data and the second sample data.
In an optional embodiment, the apparatus further comprises:
a node matching request sending module, configured to issue a node matching request including a sample requirement to a blockchain network based on a standardized intelligent contract;
and the identification information receiving module is used for receiving the identification information, fed back by the miner nodes, of the multi-party nodes meeting the sample requirement.
In an optional embodiment, the apparatus further comprises:
and the completion message issuing module is used for issuing a standardized completion message to the blockchain network based on a standardized intelligent contract, so that blockchain nodes in the blockchain network can search for the standardized result according to the standardized completion message.
The data processing device can execute the data processing method provided by any embodiment of the disclosure, and has the corresponding functional modules and beneficial effects of executing the data processing method.
In the technical solution of the present disclosure, the acquisition, storage, application, and the like of the deviation secrets of the initial statistical values all comply with the provisions of relevant laws and regulations and do not violate public order and good morals.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium, and a computer program product.
FIG. 7 illustrates a schematic block diagram of an example electronic device 700 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the device 700 comprises a computing unit 701, which may perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 can also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
Various components in the device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, or the like; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
Computing unit 701 may be a variety of general purpose and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The calculation unit 701 executes the respective methods and processes described above, such as the data processing method. For example, in some embodiments, the data processing method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 708. In some embodiments, part or all of a computer program may be loaded onto and/or installed onto device 700 via ROM 702 and/or communications unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the data processing method described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the data processing method by any other suitable means (e.g. by means of firmware).
Various implementations of the systems and techniques described herein may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in the cloud computing service system and solves the defects of difficult management and weak service scalability existing in the traditional physical host and VPS (Virtual Private Server) services. The server may also be a server of a distributed system, or a server combined with a blockchain.
Artificial intelligence is the discipline that studies making computers simulate certain human thinking processes and intelligent behaviors (such as learning, reasoning, thinking, planning, and the like), and involves technologies at both the hardware level and the software level. Artificial intelligence hardware technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing; artificial intelligence software technologies mainly include computer vision technology, speech recognition technology, natural language processing technology, machine learning/deep learning technology, big data processing technology, knowledge graph technology, and the like.
Cloud computing refers to a technical system that accesses a flexibly scalable shared pool of physical or virtual resources through a network, where the resources may include servers, operating systems, networks, software, applications, storage devices, and the like, and may be deployed and managed in an on-demand, self-service manner. Through cloud computing technology, efficient and powerful data processing capability can be provided for technical applications and model training of artificial intelligence, blockchains, and the like.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel or sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (18)

1. A data processing method, performed by any one of a plurality of parties, comprising:
determining a target statistic value according to the deviation secret of the initial statistic value of the first sample data of the local node and the obtained deviation secret of the initial statistic value of the second sample data of other nodes in the multi-party node;
according to the target statistic value, carrying out standardization processing on the first sample data;
wherein the target statistic comprises at least one of a feature total mean, a feature total variance and a feature total standard deviation of all sample data of the multi-party node under a predetermined feature dimension.
2. The method of claim 1, wherein the deviation secrets of the initial statistical value comprise a deviation secret of the sample number and a deviation secret of the feature sum value in the predetermined feature dimension;
the determining a target statistic value according to the deviation secret of the initial statistic value of the first sample data of the local node and the obtained deviation secret of the initial statistic value of the second sample data of other nodes in the multi-party node includes:
determining the total number of samples according to the deviation secret of the number of samples of the first sample data and the obtained deviation secrets of the numbers of samples of each second sample data; determining a feature sum value under the preset feature dimension according to the deviation secret of the feature sum value of the first sample data under the preset feature dimension and the obtained deviation secrets of the feature sum values of each second sample data under the corresponding feature dimension;
and determining a feature total mean value under the preset feature dimension according to the total number of the samples and the feature sum value under the preset feature dimension.
3. The method of claim 1, wherein the deviation secret of the initial statistical value comprises a deviation secret of the sample number;
the determining a target statistic value according to the deviation secret of the initial statistic value of the first sample data of the local node and the obtained deviation secret of the initial statistic value of the second sample data of other nodes in the multi-party node includes:
determining the total number of samples according to the deviation secrets of the number of samples of the first sample data and the deviation secrets of the number of samples of each second sample data;
determining a simulated feature mean of the first sample data in the predetermined feature dimension according to the total number of samples and the feature sum value of the first sample data in the predetermined feature dimension;
and determining the total feature mean value under the preset feature dimension according to the deviation secret of the simulated feature mean value of the first sample data under the preset feature dimension and the deviation secret of the obtained simulated feature mean value of each second sample data under the corresponding feature dimension.
4. The method of claim 1, wherein the deviation secrets of the initial statistical value comprise a deviation secret of the sample number and a deviation secret of the feature deviation sum value in the predetermined feature dimension;
the determining a target statistic value according to the deviation secret of the initial statistic value of the first sample data of the local node and the obtained deviation secret of the initial statistic value of the second sample data of other nodes in the multi-party node includes:
determining the total number of samples according to the deviation secret of the number of samples of the first sample data and the obtained deviation secrets of the numbers of samples of each second sample data; determining a feature deviation sum value under the preset feature dimension according to the deviation secret of the feature deviation sum value of the first sample data under the preset feature dimension and the acquired deviation secrets of the feature deviation sum values of each second sample data under the corresponding feature dimension;
and determining the total standard deviation and/or total variance of the features in the preset feature dimension according to the total number of the samples and the total deviation sum of the features in the preset feature dimension.
5. The method of claim 1, wherein the deviation secret of the initial statistical value comprises a deviation secret of the sample number;
the determining a target statistic value according to the deviation secret of the initial statistic value of the first sample data of the local node and the obtained deviation secret of the initial statistic value of the second sample data of other nodes in the multi-party node includes:
determining the total number of samples according to the deviation secrets of the number of samples of the first sample data and the obtained deviation secrets of the number of samples of each second sample data;
determining a simulated feature variance of the first sample data in the predetermined feature dimension according to the total number of samples and the feature deviation sum value of the first sample data in the predetermined feature dimension;
and determining the total standard deviation and/or the total variance of the features under the preset feature dimension according to the deviation secret of the simulated feature variance of the first sample data under the preset feature dimension and the deviation secret of the simulated feature variance of each acquired second sample data under the corresponding feature dimension.
6. The method of any of claims 1-5, wherein the sample data comprises sample characteristic data and/or tag characteristic data; the sample data includes the first sample data and the second sample data.
7. The method of any of claims 1-5, further comprising:
issuing a node matching request including a sample requirement to a blockchain network based on a standardized intelligent contract;
and receiving identification information of the multi-party nodes meeting the sample requirement fed back by the miner nodes.
8. The method of any of claims 1-5, further comprising:
and issuing a standardized completion message to the block chain network based on a standardized intelligent contract so that block chain nodes in the block chain network can search a standardized result according to the standardized completion message.
9. A data processing apparatus, disposed at any one of a plurality of nodes, comprising:
the target statistic value determining module is used for determining a target statistic value according to the deviation secret of the initial statistic value of the first sample data of the local node and the obtained deviation secret of the initial statistic value of the second sample data of other nodes in the multi-party node;
the standardization processing module is used for carrying out standardization processing on the first sample data according to the target statistic value;
wherein the target statistic comprises at least one of a feature total mean, a feature total variance and a feature total standard deviation of all sample data of the multi-party node under a predetermined feature dimension.
10. The apparatus of claim 9, wherein the deviation secrets of the initial statistical value comprise a deviation secret of the sample number and a deviation secret of the feature sum value in the predetermined feature dimension;
the target statistic determination module includes:
a total sample number determining unit, configured to determine a total sample number according to the deviation secret of the sample number of the first sample data and the obtained deviation secret of the sample number of each second sample data;
a feature sum value determination unit, configured to determine the feature sum value in the predetermined feature dimension according to the deviation secret of the feature sum value of the first sample data in the predetermined feature dimension and the obtained deviation secrets of the feature sum values of each second sample data in the corresponding feature dimension;
and the characteristic total mean value determining unit is used for determining the characteristic total mean value under the preset characteristic dimension according to the sample total number and the characteristic total value under the preset characteristic dimension.
11. The apparatus of claim 9, wherein the deviation secret of the initial statistical value comprises a deviation secret of the sample number;
the target statistic determination module includes:
a total sample number determining unit, configured to determine a total sample number according to the deviation secret of the sample number of the first sample data and the deviation secret of the sample number of each second sample data;
a simulated feature mean determination unit, configured to determine a simulated feature mean of the first sample data in the predetermined feature dimension according to the total number of samples and the feature sum value of the first sample data in the predetermined feature dimension;
and the feature total mean determining unit is used for determining the feature total mean under the preset feature dimension according to the deviation secret of the simulated feature mean of the first sample data under the preset feature dimension and the deviation secret of the simulated feature mean of each acquired second sample data under the corresponding feature dimension.
12. The apparatus of claim 9, wherein the deviation secrets of the initial statistical value comprise a deviation secret of the sample number and a deviation secret of the feature deviation sum value in the predetermined feature dimension;
the target statistic determination module includes:
a total sample number determining unit, configured to determine a total sample number according to the deviation secret of the sample number of the first sample data and the obtained deviation secret of the sample number of each second sample data;
a feature deviation sum value determination unit, configured to determine a feature deviation sum value in the predetermined feature dimension according to a deviation secret of the feature deviation sum value of the first sample data in the predetermined feature dimension and the acquired deviation secret of the feature deviation sum value of each second sample data in the corresponding feature dimension;
the feature total standard deviation determining unit is used for determining the feature total standard deviation in the predetermined feature dimension according to the sample total number and the feature deviation sum value in the predetermined feature dimension; and/or the feature total variance determining unit is used for determining the feature total variance in the predetermined feature dimension according to the total number of the samples and the feature deviation sum value in the predetermined feature dimension.
13. The apparatus of claim 9, wherein the deviation secret of the initial statistical value comprises a deviation secret of the sample number;
the target statistic determination module includes:
a total sample number determining unit, configured to determine a total sample number according to the deviation secret of the sample number of the first sample data and the obtained deviation secret of the sample number of each second sample data;
a simulated feature variance determining unit, configured to determine a simulated feature variance of the first sample data in the predetermined feature dimension according to the total number of samples and a feature deviation sum of the first sample data in the predetermined feature dimension;
a feature total standard deviation determining unit, configured to determine a feature total standard deviation in the predetermined feature dimension according to a deviation secret of the simulated feature variance of the first sample data in the predetermined feature dimension and the obtained deviation secret of the simulated feature variance of each second sample data in the corresponding feature dimension; and/or the total feature variance determining unit is used for determining the total feature variance under the preset feature dimension according to the deviation secret of the simulated feature variance of the first sample data under the preset feature dimension and the obtained deviation secret of the simulated feature variance of each second sample data under the corresponding feature dimension.
14. The apparatus according to any of claims 9-13, wherein the sample data comprises sample characteristic data and/or tag characteristic data; the sample data includes the first sample data and the second sample data.
15. The apparatus of any of claims 9-13, further comprising:
the node matching request sending module is used for issuing a node matching request comprising a sample requirement to the blockchain network based on a standardized intelligent contract;
and the identification information receiving module is used for receiving the identification information of the multi-party nodes meeting the sample requirements fed back by the miner nodes.
16. The apparatus of any of claims 9-13, further comprising:
and the completion message issuing module is used for issuing a standardized completion message to the blockchain network based on a standardized intelligent contract, so that blockchain nodes in the blockchain network can search for the standardized result according to the standardized completion message.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein the content of the first and second substances,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the data processing method of any one of claims 1-8.
18. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the data processing method according to any one of claims 1 to 8.
CN202110569135.XA 2021-05-25 2021-05-25 Data processing method, device, equipment and storage medium Active CN113037489B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110569135.XA CN113037489B (en) 2021-05-25 2021-05-25 Data processing method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110569135.XA CN113037489B (en) 2021-05-25 2021-05-25 Data processing method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113037489A CN113037489A (en) 2021-06-25
CN113037489B true CN113037489B (en) 2021-08-27

Family

ID=76455634

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110569135.XA Active CN113037489B (en) 2021-05-25 2021-05-25 Data processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113037489B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113537516B (en) * 2021-09-15 2021-12-14 北京百度网讯科技有限公司 Training method, device, equipment and medium for distributed machine learning model
CN114513304A (en) * 2022-04-19 2022-05-17 浙商银行股份有限公司 Decentralized secure multiparty privacy summation calculation method and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111784040A (en) * 2020-06-28 2020-10-16 平安医疗健康管理股份有限公司 Optimization method and device for policy simulation analysis and computer equipment
CN111934890A (en) * 2020-10-13 2020-11-13 百度在线网络技术(北京)有限公司 Key generation method, signature and signature verification method, device, equipment and medium
CN111934889A (en) * 2020-10-13 2020-11-13 百度在线网络技术(北京)有限公司 Key generation method, signature and signature verification method, device, equipment and medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9990367B2 (en) * 2015-07-27 2018-06-05 Sas Institute Inc. Distributed data set encryption and decryption
US11502828B2 (en) * 2017-11-15 2022-11-15 International Business Machines Corporation Authenticating chaincode to chaincode invocations of a blockchain
CN110709863B (en) * 2019-01-11 2024-02-06 创新先进技术有限公司 Logistic regression modeling method, storage medium, and system using secret sharing
CN112464287B (en) * 2020-12-12 2022-07-05 同济大学 Multi-party XGboost safety prediction model training method based on secret sharing and federal learning
CN112799636B (en) * 2021-04-14 2021-08-27 北京百度网讯科技有限公司 Random number generation method, device, equipment and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111784040A (en) * 2020-06-28 2020-10-16 平安医疗健康管理股份有限公司 Optimization method and device for policy simulation analysis and computer equipment
CN111934890A (en) * 2020-10-13 2020-11-13 百度在线网络技术(北京)有限公司 Key generation method, signature and signature verification method, device, equipment and medium
CN111934889A (en) * 2020-10-13 2020-11-13 百度在线网络技术(北京)有限公司 Key generation method, signature and signature verification method, device, equipment and medium

Also Published As

Publication number Publication date
CN113037489A (en) 2021-06-25

Similar Documents

Publication Publication Date Title
CN113037489B (en) Data processing method, device, equipment and storage medium
CN112598138A (en) Data processing method and device, federal learning system and electronic equipment
CN114328132A (en) Method, device, equipment and medium for monitoring state of external data source
CN115883187A (en) Method, device, equipment and medium for identifying abnormal information in network traffic data
CN113904943B (en) Account detection method and device, electronic equipment and storage medium
CN113704058B (en) Service model monitoring method and device and electronic equipment
CN113935069B (en) Data verification method, device and equipment based on block chain and storage medium
CN113032817B (en) Data alignment method, device, equipment and medium based on block chain
CN113360672A (en) Methods, apparatus, devices, media and products for generating a knowledge graph
CN114091909A (en) Collaborative development method, system, device and electronic equipment
CN113239054A (en) Information generation method, related device and computer program product
CN113033826B (en) Model joint training method, device, equipment and medium based on block chain
CN115589391B (en) Instant messaging processing method, device and equipment based on block chain and storage medium
CN114286343B (en) Multi-way outbound system, risk identification method, equipment, medium and product
CN113011494B (en) Feature processing method, device, equipment and storage medium
CN116319716A (en) Information processing method, no-service system, electronic device, and storage medium
CN115906982A (en) Distributed training method, gradient communication method, device and electronic equipment
CN115827693A (en) Data checking method and device, terminal equipment and storage medium
CN117632936A (en) Data management method, device, electronic equipment and storage medium
CN117081939A (en) Traffic data processing method, device, equipment and storage medium
CN117710130A (en) Account checking method and device, electronic equipment and storage medium
CN116566737A (en) Permission configuration method and device based on SaaS platform and related equipment
CN116541224A (en) Performance test method, device, electronic equipment and readable storage medium
CN114398616A (en) Login method and device of embedded system, electronic equipment and readable storage medium
CN117149189A (en) Information processing method, apparatus, electronic device, and computer-readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant