CN116955857A - Data processing method, device, medium and electronic equipment - Google Patents


Info

Publication number
CN116955857A
Authority
CN
China
Prior art keywords
held
secret
party
data
intersection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211436177.7A
Other languages
Chinese (zh)
Inventor
黄晨宇
蒋杰
刘煜宏
陈鹏
张凡
程勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202211436177.7A
Publication of CN116955857A
Legal status: Pending


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/95 - Retrieval from the web
    • G06F16/955 - Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9558 - Details of hyperlinks; Management of linked annotations
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 - Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 - Protecting data
    • G06F21/62 - Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 - Protecting access to data via a platform, e.g. using keys or access control rules, to a system of files or objects, e.g. a local or distributed file system or database
    • G06F21/6245 - Protecting personal data, e.g. for financial or medical purposes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Medical Informatics (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Storage Device Security (AREA)

Abstract

The present application relates to the field of computer technology, and in particular to a data processing method, a data processing apparatus, a computer-readable medium, an electronic device, and a computer program product. The method comprises the following steps: acquiring a feature sequence held by a first party, the feature sequence comprising derived features obtained by concatenating original features of a target entity; acquiring an intersection secret share held by the first party, where the plaintext corresponding to the intersection secret share indicates whether each element in the feature sequence is intersection data of the data held by the first party and the second party; and predicting, from the intersection secret share held by the first party, whether the target entity is an entity whose features are held by both the first party and the second party. The application can improve the security of data.

Description

Data processing method, device, medium and electronic equipment
Technical Field
The present application relates to the field of computer technology, and in particular, to a data processing method, a data processing apparatus, a computer readable medium, an electronic device, and a computer program product.
Background
Record linkage is the task of finding records across different data sources (e.g., data files, books, websites, and databases) that refer to the same entity. The traditional approach to record linkage compares the recorded data of different data sources in plaintext to judge whether the records belong to the same entity; this linkage approach suffers from poor security.
Disclosure of Invention
The application provides a data processing method, a data processing device, a computer readable medium, an electronic device and a computer program product, aiming at improving data security.
Other features and advantages of the application will be apparent from the following detailed description, or may be learned by the practice of the application.
According to an aspect of an embodiment of the present application, there is provided a data processing method, including:
acquiring a feature sequence held by a first party, the feature sequence comprising derived features obtained by concatenating original features of a target entity;
acquiring an intersection secret share held by the first party, where the plaintext corresponding to the intersection secret share indicates whether each element in the feature sequence is intersection data of the data held by the first party and the second party;
predicting, from the intersection secret share held by the first party, whether the target entity is an entity whose features are held by both the first party and the second party.
According to an aspect of an embodiment of the present application, there is provided a data processing apparatus including:
a first acquisition module configured to acquire a feature sequence held by a first party, the feature sequence comprising derived features obtained by concatenating original features of a target entity;
a second acquisition module configured to acquire an intersection secret share held by the first party, where the plaintext corresponding to the intersection secret share indicates whether each element in the feature sequence is intersection data of the data held by the first party and the second party;
a prediction module configured to predict, from the intersection secret share held by the first party, whether the target entity is an entity whose features are held by both the first party and the second party.
according to an aspect of the embodiments of the present application, there is provided a computer-readable medium having stored thereon a computer program which, when executed by a processor, implements a data processing method as in the above technical solutions.
According to an aspect of an embodiment of the present application, there is provided an electronic apparatus including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the data processing method as in the above technical solution via execution of the executable instructions.
According to an aspect of embodiments of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions so that the computer device performs the data processing method as in the above technical solution.
In the technical solution provided by the embodiments of the application, the original features of the target entity are concatenated to form derived features, so that combinations of various entity features can be exploited, the accuracy of judging commonly held entities is improved, and special cases such as missing (default) or erroneously recorded original features are handled effectively. In addition, the embodiments of the application perform common-entity prediction using a feature sequence that includes the derived features, without obtaining the plaintext data of the other party, thereby improving the security of data privacy.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and, together with the description, serve to explain the principles of the application. It is evident that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained from them by a person of ordinary skill in the art without inventive effort.
Fig. 1 schematically shows a block diagram of an exemplary system architecture to which the technical solution of the present application is applied.
FIG. 2 illustrates a data processing method for record linking based on derived features in one embodiment of the application.
FIG. 3 illustrates a data processing method for record linkage based on a predictive model in one embodiment of the application.
FIG. 4 is a schematic diagram of a process for training a predictive model based on plaintext data in one embodiment of the application.
Fig. 5 is a schematic diagram illustrating a process of record linkage between data held by different parties in an application scenario according to an embodiment of the present application.
Fig. 6 is a schematic diagram illustrating a principle of privacy set intersection by using cuckoo hash in the related art of the present application.
FIG. 7 illustrates a data processing method for privacy set intersection based on location inverse mapping in one embodiment of the application.
FIG. 8 illustrates a data processing method for privacy set intersection based on mapping decomposition of location inverse mapping in one embodiment of the application.
FIG. 9 illustrates a schematic diagram of decomposing a location map in one embodiment of the application.
FIG. 10 illustrates a schematic diagram of data processing based on oblivious replication and oblivious permutation in one embodiment of the application.
FIG. 11 illustrates a schematic diagram of an implementation of oblivious replication based on a first sub-mapping decomposition in one embodiment of the application.
FIG. 12 illustrates a schematic diagram of a mapping unit used by an oblivious permutation in one embodiment of the application.
FIG. 13 illustrates a schematic diagram of a mapping unit used by an oblivious replication in one embodiment of the application.
Fig. 14 is a schematic diagram showing the structure of a serial network composed of mapping units in one embodiment of the present application.
Fig. 15 shows a schematic diagram of the structure of a parallel network composed of mapping units in one embodiment of the application.
Fig. 16 schematically shows a block diagram of a data processing apparatus according to an embodiment of the present application.
Fig. 17 schematically shows a block diagram of a computer system suitable for use in implementing embodiments of the application.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the application. One skilled in the relevant art will recognize, however, that the application may be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known methods, devices, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the application.
The block diagrams depicted in the figures are merely functional entities and do not necessarily correspond to physically separate entities. That is, the functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flow diagrams depicted in the figures are exemplary only, and do not necessarily include all of the elements and operations/steps, nor must they be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the order of actual execution may be changed according to actual situations.
The specific embodiments of the present application involve privacy-related data such as user information. When the embodiments of the present application are applied to specific products or technologies, user permission or consent must be obtained, and the collection, use, and processing of the related data must comply with the relevant laws, regulations, and standards of the relevant countries and regions.
Technical terms related to the embodiments of the present application are described below.
Record linkage: the task of finding records across different data sources (e.g., data files, books, websites, and databases) that refer to the same entity.
MPC: multi-party computation, i.e., secure multi-party computation. The participants jointly complete a computing task over their private data without revealing their respective private data to one another.
PSI: private set intersection. The participating parties obtain the intersection of the data they hold without revealing any additional information, where additional information means any information other than the intersection of the two parties' data. Implementations based on an oblivious pseudo-random function (OPRF) or on elliptic curve cryptography (ECC) are often adopted.
Secret sharing: a secret is split among the participants, each of whom obtains a part of the secret called a share; only a set of shares from more than a certain number of participants can recover the true value. There are Boolean shares, arithmetic (numeric) shares, and so on. Boolean shares are shares whose true result is a Boolean value and make it relatively easy to evaluate logic circuits such as AND, OR, and NOT; arithmetic shares are shares whose true result lies in the integer domain and make it easier to evaluate arithmetic circuits. These shares can be converted into one another by MPC techniques.
Secret share: each of the two computing parties holds a secret value, and the XOR / arithmetic sum of the two secret values is the true value; that is, the true value is secret-shared between the participants by XOR / arithmetic addition. When the true value is shared by XOR, the two resulting shares are called Boolean shares; when secret sharing is performed by arithmetic addition, the resulting shares are called arithmetic shares.
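For intuition, the following is a minimal Python sketch of the two sharing modes just described, assuming a 64-bit ring for arithmetic shares (the ring size and variable names are illustrative choices, not part of the described method):

```python
import secrets

RING = 2 ** 64  # arithmetic shares live in Z_{2^64}; Boolean shares are 64-bit strings

def boolean_share(value: int) -> tuple[int, int]:
    """Split `value` into two XOR shares: share0 ^ share1 == value."""
    share0 = secrets.randbits(64)
    share1 = share0 ^ value
    return share0, share1

def arithmetic_share(value: int) -> tuple[int, int]:
    """Split `value` into two additive shares: (share0 + share1) % RING == value."""
    share0 = secrets.randbelow(RING)
    share1 = (value - share0) % RING
    return share0, share1

def reconstruct_boolean(share0: int, share1: int) -> int:
    return share0 ^ share1

def reconstruct_arithmetic(share0: int, share1: int) -> int:
    return (share0 + share1) % RING

if __name__ == "__main__":
    secret_value = 42
    b0, b1 = boolean_share(secret_value)
    a0, a1 = arithmetic_share(secret_value)
    assert reconstruct_boolean(b0, b1) == secret_value
    assert reconstruct_arithmetic(a0, a1) == secret_value
```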
PSI-Circuit: private set intersection Circuit the Circuit privacy set performs intersection, that is, the two parties participate in the input set, and finally, the two parties can only obtain the slicing information about the intersection, that is, whether the initiator data of the PSI-Circuit is in the boolean slicing value in the intersection, so that the intersection data can not be obtained, and the non-intersection data can not be obtained, and an inadvertent programmable pseudo random function (OPPRF) or an inadvertent pseudo random function (OPRF) is often adopted.
Classification model: a classification model whose output contains only two classes. Including logistic regression, linear regression, naive bayes, etc.
First party (Guest): the initiator of the PSI-Circuit; the resulting Boolean shares indicate whether the data of the first party (Guest) is in the intersection.
Second party (Host): the participant (responder) of the PSI-Circuit; the resulting Boolean shares indicate whether the data of the first party (Guest) is in the intersection.
OPRF: oblivious pseudo-random function (Oblivious Pseudo Random Function). The first party (Guest) inputs x; in the end, the first party (Guest) obtains F(x) (the value of the pseudo-random function F at x) and the second party (Host) obtains the pseudo-random function F. The second party (Host) does not learn x, and the first party (Guest) does not learn the pseudo-random function F.
PSM: private set membership determination (Private Set Membership). The first party (Guest) inputs x and the second party (Host) inputs {y_1, y_2, …, y_l}. If x ∈ {y_1, y_2, …, y_l}, the first party (Guest) and the second party (Host) obtain Boolean shares of 1; otherwise they obtain Boolean shares of 0. The first party (Guest) does not learn the second party's (Host's) input, and the second party (Host) does not learn x.
Cuckoo hash (Cuckoo Hash): a method of mapping m elements to n positions using k hash functions, requiring that different elements be mapped to different positions. Briefly, suppose element i has been mapped to some position under its t-th hash (t < k). When a later element j (j > i) is also mapped to that position, the position is in conflict: element j is placed at that position, and element i is remapped to a new position under its (t+1)-th hash; if the new position is also in conflict, the process is repeated. Generally, n = 1.27 m, and the probability that inserting the m elements fails is less than 2^(−40).
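The insert-and-evict behavior described above can be sketched as follows, assuming hash functions derived from SHA-256 and the table-size factor 1.27 and failure bound mentioned above (the eviction limit and function names are illustrative):

```python
import hashlib
import math

def make_hashes(k: int, n: int):
    """Build k hash functions mapping a bytes element to one of n positions."""
    def h(i):
        return lambda x: int.from_bytes(
            hashlib.sha256(bytes([i]) + x).digest(), "big") % n
    return [h(i) for i in range(k)]

def cuckoo_insert(elements, k=3, max_evictions=500):
    """Map m elements to n = 1.27*m positions so that no two share a position."""
    m = len(elements)
    n = max(1, math.ceil(1.27 * m))
    hashes = make_hashes(k, n)
    table = [None] * n                        # table[pos] = (element, hash index used)
    for elem in elements:
        cur, t = elem, 0                      # the element first tries its 0-th hash
        for _ in range(max_evictions):
            pos = hashes[t](cur)
            if table[pos] is None:            # free slot: place the element and stop
                table[pos] = (cur, t)
                cur = None
                break
            # conflict: the new element takes the slot, the old occupant is evicted
            evicted, t_old = table[pos]
            table[pos] = (cur, t)
            cur, t = evicted, (t_old + 1) % k  # evicted element tries its next hash
        if cur is not None:
            raise RuntimeError("insertion failed (low probability for suitable n)")
    return table

if __name__ == "__main__":
    items = [s.encode() for s in ["a", "b", "c", "d", "e"]]
    table = cuckoo_insert(items)
    placed = [e for e in table if e is not None]
    assert len(placed) == len(items)          # every element occupies its own position
```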
Simple hash (Simple Hash): m elements are mapped to n positions by k hash functions; positions may conflict, i.e., different elements may be mapped to the same position.
OSN: the network (Oblivious Switching network) is inadvertently exchanged. The network of exchanges refers to exchanges consisting of 2-exchanges ((a, b) - > (a, b) or (b, a)). The careless swap network means that the first main body Guest has n- > n swap network, the second main body Host has n secret values, and finally, the two parties obtain secret fragments after the n secret values of the second main body Host pass through the swap network, so that the swap network of the first main body Guest is not leaked, and the secret value of the second main body Host is not leaked.
OP: inadvertent replacement (Oblivious Permutation). The unintentional substitution means that the first main body Guest has n- > n substitution, the second main body Host has n secret values, and finally, the two parts obtain secret fragments after the n secret values of the second main body Host are subjected to the substitution, so that the substitution of the first main body Guest is not leaked, and the secret values of the second main body Host are not leaked. Since the permutation can be broken down into several permutation networks, the OP can be implemented by the OSN.
OR: inadvertent duplication (Oblivious Replication). The mapping of the finger n- > n is replicated such that it duplicates the original input data (e.g., (a, b, c, d, e, f) is mapped to (a, b, c, d, c, d)). The unintentional copying means that the first main body Guest has n-n mapping, the second main body Host has n secret values, and finally the two sides obtain the fragments of the n secret values after copying and mapping. That is, the actual values corresponding to the secret shards are (a, b, c, d, c, d) in turn and are no longer (a, b, c, d, e, f) at this time.
In practical application scenarios involving, for example, medical data or transaction data, user data may be scattered across different departments of different institutions and stored and maintained independently. To achieve higher analysis accuracy, these data sometimes need to be analyzed jointly, which requires finding the data belonging to the same batch of users among the parties. As another example, for the same commodity entity, different parties in different stages of a commodity transaction (such as the manufacturer, seller, trading platform, logistics company, and purchaser) each record feature data of the commodity. These data contain multiple entity attribute features, some of which are globally unique identifiers, such as the commodity's anti-counterfeiting identification code; such unique identifiers can be used to deterministically link entity records across different databases. However, this information may be missing, leaving only non-unique identifiers such as the name, production date, and production address, which cannot uniquely link entity records across different databases. In this case, an algorithm or system is needed to find the set of intersection entities, i.e., the set of entities on the link, through statistical means such as modeling.
Fig. 1 schematically shows a block diagram of a system architecture to which the technical solution of the application is applied.
As shown in fig. 1, the first party 110 is the requester that initiates a record linkage request, and the second party 120 is the participant that responds to the record linkage request. The first party 110 holds a first data set 111 consisting of several features of entity objects, and the second party 120 holds a second data set 121 consisting of several features of entities. For example, the first data set 111 and the second data set 121 are tables recording various attribute features of entity objects, which may include, as shown in the figure, an anti-counterfeiting identification code, a name, a production date, a product category, a production address, and so on.
When the first party 110 initiates a record linkage request, the attribute features of each entity record in the first data set 111 and the second data set 121 may be compared, so as to determine, according to the matching relationship of the attribute features, whether two records belong to the same entity, and a record link is established between two records that belong to the same entity.
The data processing method provided in the embodiments of the present application may be executed by the terminal device or server where the first party 110 is located, or by the terminal device or server where the second party 120 is located. The terminal device may include various electronic devices such as a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart wearable device, a smart vehicle-mounted device, or a smart payment terminal. The server may be an independent physical server, a server cluster or distributed system formed by multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, big data, and artificial intelligence platforms.
In addition, the data processing method provided by the embodiments of the present application may be performed by a third party other than the first party 110 and the second party 120; the third party may be, for example, a cloud computing platform that performs encrypted data communication with the first party 110 and the second party 120, respectively.
Taking the two tables shown in fig. 1 as an example, when the data contain the globally unique identifier attribute feature "anti-counterfeiting identification code", it may be determined that the first record in the first data set 111 and the first record in the second data set 121 match successfully, i.e., the two records can be determined to belong to the same entity A because their anti-counterfeiting identification codes are identical.
Some data may be missing; for example, the anti-counterfeiting identification code of entity B in the second data set 121 is in a default (missing) state, in which case matching must be detected through the non-unique identifier attribute features, such as the name, category, and production address. At the same time, some data may be erroneous; for example, the production date of the second record in the second data set 121, "January 1, 2022", is wrong, while the correct production date, "January 2, 2022", is recorded in the first data set 111. On this basis, the second records in the two tables, i.e., the records with name B, can still be matched according to the other attribute features. Because these attribute features are non-unique identifiers, two data records may have partially identical values and yet fail to match: for example, the third records in the two tables both have the name "C", but because the anti-counterfeiting identification code is missing and the production date, product category, production address, and other attribute features all differ, it can be determined that the two records belong to two different entities, and no record link can be established between them.
In the related art of the present application, each attribute feature of an entity can be compared in plaintext and the comparison results scored by modeling or other methods, so as to determine whether a record is ultimately in the intersection, i.e., whether it is on the link between the two parties' databases. Plaintext-based record linkage cannot protect sensitive data and suffers from entity privacy leakage.
To improve the security of entity data, a private set intersection algorithm (PSI) can be used to judge whether deterministic entity attribute features, such as the entity's anti-counterfeiting identification code, are intersection data, and thereby judge whether two records can establish a record link. However, for non-deterministic entity attribute features such as the name, category, and production date, a large amount of repeated data exists among the attribute features, so it is difficult to determine directly whether the records of an entity match, and the accuracy of record linkage is poor.
To improve the accuracy of record linkage, MPC can also be used to perform pairwise privacy comparisons over all of the two parties' data. This is inherently of complexity O(n^2), and adopting a bucketing strategy can only reduce the complexity to O(n^2/k), where k is the number of buckets; MPC-based methods therefore generally suffer from high computational complexity and long computation time.
Addressing the problems in the related art, the embodiments of the present application provide a new data processing method which, when applied to record linkage, does not reveal the entity's attribute features and complies with existing laws and regulations; both deterministic and non-deterministic entity attribute features can be jointly modeled and scored, giving high accuracy; and the computational complexity is O(n) or O(n log n), depending on the implementation of the PSI-Circuit, giving high computational efficiency.
The data processing method provided by the application is described in detail below with reference to the specific embodiments.
Fig. 2 shows a data processing method for performing record linkage based on derived features in one embodiment of the present application, which may be performed by any terminal device or server having a data computation function. The embodiment of the present application is described taking as an example the data processing method being performed by the first party shown in fig. 1. As shown in fig. 2, the data processing method includes the following steps S210 to S230.
S210: acquiring a feature sequence held by the first party, the feature sequence comprising derived features obtained by concatenating original features of a target entity.
The feature sequence held by the first party is a data set formed by an ordered arrangement of several features of the target entity; the elements it contains can be original features of the target entity or derived features of the target entity, where a derived feature can be a combined feature formed by concatenating any two or more original features belonging to the target entity. Taking the attribute features of an entity as an example, the original features can be independent features such as the entity's anti-counterfeiting identification code, name, category, and production date, and the derived features can be combined features formed by concatenating two original features, such as "name + category", "category + production date", and "name + production date". In some alternative embodiments, a derived feature may also be concatenated from three or more original features, such as "name + category + production date".
S220: and acquiring an intersection secret slice held by the first main body, wherein a plaintext corresponding to the intersection secret slice is used for indicating whether each element in the feature sequence is intersection data of data held by the first main body and the second main body.
The data held by the second subject may be another feature sequence corresponding to the feature sequence held by the first subject. For example, the sequence of locations held by the second body is a data set consisting of an ordered arrangement of several features of several entities, where each feature may be an original or derived feature of an entity. The second body may splice two or more original features of one entity to form a derivative feature using the same derivative feature construction method as the first body.
In one embodiment of the present application, the intersection secret piece held by the first main body may be obtained by performing privacy set intersection on the feature sequence held by the first main body and the feature sequence held by the second main body. Through the private collection intersection, the first main body or the second main body can realize intersection judgment of the feature sequences under the condition that the other side can not know the data held by the other side. Under the condition that the intersection secret shard held by the first main body and the intersection secret shard held by the second main body are obtained at the same time, a corresponding plaintext can be obtained by decryption, wherein the plaintext is used for indicating whether each element in the feature sequence held by the first main body is intersection data of the data held by the first main body and the second main body.
S230: and predicting whether the target entity is an entity with the characteristic commonly held by the first entity and the second entity according to the intersection secret shard held by the first entity.
In general, when two different entities commonly hold features for the same entity, the features of the entities held by the two entities should be all or partially the same, for example, for one entity, the attribute features held in two different databases should have at least partially the same content, such as anti-counterfeit identification codes, names, categories, and the like of the entities. Since the plaintext corresponding to the intersection secret piece can be used to indicate whether each element in the feature sequence is intersection data of the data held by the first body and the second body, the first body can predict whether the target entity is an entity of the feature held by the first body and the second body together according to the ciphertext content of the intersection secret piece under the condition that the intersection secret piece is not decrypted by utilizing the data security calculation between the first body and the second body.
In the data processing method provided by the embodiment of the application, the original features of the target entity are spliced to form the derivative features, so that the combination relation of various entity features can be utilized, the accuracy of judging common entities is improved, and special conditions such as default or recording errors of the original features are effectively applied; in addition, the embodiment of the application utilizes the feature sequence comprising the derivative features to conduct common entity prediction without obtaining the plaintext data of another main body, thereby improving the security of data privacy.
In one embodiment of the present application, the first party and the second party may each perform feature transformation on their own entity features to obtain a feature sequence carrying derived features. The method of acquiring the feature sequence held by the first party may comprise: acquiring a plurality of original features of the target entity held by the first party; concatenating at least two of the original features to obtain a derived feature of the target entity; and obtaining a feature sequence that includes the derived feature.
For example, the original features held by the first party may include multiple types of original features such as the entity's anti-counterfeiting identification code, name, category, production date, and production address, and a derived feature of the entity may be obtained by concatenating at least two types of original features; for example, a derived feature may be "name + category" or "category + production date".
In one embodiment of the present application, when concatenating the original features held by the first party, the plurality of original features of the target entity may first be classified into unique identification features that can uniquely identify the target entity and non-unique identification features that cannot uniquely identify it; at least two non-unique identification features are then concatenated to obtain a derived feature of the target entity.
In one embodiment of the present application, the feature sequence held by the first party is a feature set formed by an ordered combination of the unique identification feature of the target entity and the derived features obtained by the transformation. For example, the unique identification feature of an entity includes the "anti-counterfeiting identification code", and its non-unique identification features include the "name", "category", "production date", and so on. After the non-unique identification features are concatenated into derived features, the derived features are combined with the unique identification feature to form the feature sequence, which may be, for example, { anti-counterfeiting identification code, name + category, category + production date, name + production date }.
In one embodiment of the application, all derived features obtained by concatenation may be added to the feature sequence. In other alternative embodiments, to reduce the data size and increase the speed of data operations, a subset of the derived features with strong entity-characterization ability may be selected from all the concatenated derived features and added to the feature sequence; for example, a specified evaluation coefficient may be computed for each derived feature, and the derived features screened according to its value. The evaluation coefficient may be, for example, the information value (IV), the variance inflation factor (VIF), or the p-value.
In one embodiment of the application, to reduce the communication traffic during data processing, each original and derived feature may be hashed when constructing the feature sequence, for example using SHA-256, so that both original and derived features are represented as hash values. For example, when a new derived feature "name + address" is generated from the two features name and address, the newly generated derived feature is the hash value obtained by concatenating the two original features in order and applying a hash function. If the original feature name has value a_1 and the address has value a_2, the new derived feature "name + address" can be expressed as h(a_1 || a_2), where h(·) is a hash function and || is the concatenation operator. Some original features may be missing (default) values, and a derived feature generated from default feature values is also a default value. In particular, the attribute feature that is a globally unique identifier (e.g., the anti-counterfeiting identification code) need not be combined with other features to generate derived features; it is added directly to the feature sequence as an independent feature.
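A minimal sketch of this derived-feature construction, assuming SHA-256 as the hash function h(·), plain string concatenation for ||, and None as the marker for default (missing) values (the field names are illustrative):

```python
import hashlib

DEFAULT = None  # marker for a missing (default) feature value

def h(value: str) -> str:
    """Hash an original feature value, as in the preprocessing step."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

def derive(*original_values):
    """Concatenate original feature values in order and hash the result.
    If any original value is default, the derived feature is also default."""
    if any(v is DEFAULT for v in original_values):
        return DEFAULT
    return hashlib.sha256("".join(original_values).encode("utf-8")).hexdigest()

if __name__ == "__main__":
    record = {"uid": "SN-0001", "name": "A", "category": "toy", "prod_date": DEFAULT}
    feature_sequence = [
        h(record["uid"]),                                 # unique identifier, kept as-is
        derive(record["name"], record["category"]),       # name + category
        derive(record["category"], record["prod_date"]),  # category + prod_date -> default
        derive(record["name"], record["prod_date"]),      # name + prod_date -> default
    ]
    print(feature_sequence)
```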
Fig. 3 illustrates a data processing method for performing record linkage based on a prediction model in an embodiment of the present application, which may be performed by any terminal device or server having a data computation function. The embodiment of the present application is described taking as an example the data processing method being performed by the first party shown in fig. 1. As shown in fig. 3, the data processing method includes the following steps S310 to S350.
S310: acquiring a feature sequence held by the first party, the feature sequence comprising derived features obtained by concatenating original features of a target entity.
S320: acquiring an intersection secret share held by the first party, where the plaintext corresponding to the intersection secret share indicates whether each element in the feature sequence is intersection data of the data held by the first party and the second party.
S330: and acquiring a prediction model obtained according to the training of the plaintext data, wherein the prediction model is used for predicting whether the plaintext data is intersection data or not and whether the plaintext data belongs to the mapping relation between the same entities or not.
S340: and mapping the intersection secret shards held by the first main body according to the prediction model to obtain the entity secret shards held by the first main body, wherein a plaintext corresponding to the entity secret shards is used for indicating whether the target entity is an entity with the common holding characteristics of the first main body and the second main body.
S350: and determining whether the target entity is an entity with the characteristic commonly held by the first entity and the second entity according to the entity secret sharding held by the first entity.
In the data processing method provided by the embodiment of the application, the prediction model is obtained through clear text data training, then the intersection secret shards held by the first main body are mapped according to the prediction model to obtain the corresponding entity secret shards, and further the common entity judgment is carried out according to the entity secret shards, so that the prediction accuracy of the common entity judgment can be further improved.
For details of the steps S310 to S320, reference may be made to the descriptions of the steps S210 to S220 in the above embodiments, and the details are not repeated here.
In step S330, a prediction model trained on plaintext data is acquired; the prediction model maps whether plaintext data is intersection data to whether the data belongs to the same entity. The prediction model may be a pre-trained binary classification model, such as a logistic regression model, a linear regression model, or a naive Bayes model. The prediction model in the embodiments of the application may also be any other neural network model with an arbitrary network structure chosen according to actual needs, such as a convolutional neural network model or a recurrent neural network model.
FIG. 4 is a schematic diagram of the process of training the prediction model on plaintext data in one embodiment of the application. As shown in fig. 4, data sets representing two different data sources are first acquired, for example a training database A and a training database B. In the training phase of the prediction model, all training data used are plaintext data. In some alternative embodiments, the training databases that can be shared in plaintext may be provided by the first party and the second party participating in the record linkage, or any one party may divide the data it holds into two training databases, or the two training databases may be obtained by dividing a public data set. The training data in the training databases include attribute feature samples of entities and corresponding sample labels, where a sample label indicates whether the entity corresponding to an attribute feature sample is an entity whose features are recorded in both training database A and training database B.
The process of training the predictive model based on plaintext data may include the following steps.
S401: and (5) preprocessing data.
To reduce traffic during training and prediction and to do basic information hiding of the data, a hash is performed on each attribute feature, for example using SHA256.
S402: and generating and selecting derivative features.
To different attributes in dataThe features are combined to generate new features, for example, a new derivative feature "name_address" can be generated by name and address, and the newly generated features are obtained by splicing two original features in sequence and through a hash function. For example, the original feature name is a 1 Address a 2 Then the new derivative feature "name_address" is h (a 1 ||a 2 ) Where h (·) is a hash function and i is a stitching function. Some of the feature values may be default and derivative feature values generated by default feature values are default values. In particular, the attribute features of the globally unique identifier need not be combined with other features to generate new features, but rather are directly used as features for the next stage.
And screening part of derivative features by using any feature selection algorithm such as an information value IV, a variance expansion factor VIF or a p value and the like, and forming a feature sequence with the attribute features of the globally unique identifier.
S403: the plaintext data is intersected.
The processed feature sequence sequentially performs intersection operation on each feature, and the obtained intersection operation result comprises: in the intersection, not in the intersection, characterized by default values, and the three results are unithermally encoded. For example, the three states of the intersection result are respectively represented by 1,0 and-1, namely the characteristic x i E (1, 0, -1), feature x after one-hot encoding i Can be expressed as three features (x i,0 ,x i,1 ,x i,2 ) The values may be (0, 1), (0, 1, 0), (1, 0) and correspond to the three states of the intersection result. For example, the feature sequence includes n feature samples, identified as 0 through n-1, respectively. And solving the intersection of each feature in turn according to the identification sequence, and calculating the single-heat coding until the intersection operation result of all the feature samples is obtained.
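A small sketch of this one-hot encoding step; the particular assignment of codes to the states 1, 0, and −1 follows the example above:

```python
# one-hot codes for the three intersection states, as in the example above:
#   1  -> in the intersection      -> (0, 0, 1)
#   0  -> not in the intersection  -> (0, 1, 0)
#  -1  -> default (missing) value  -> (1, 0, 0)
ONE_HOT = {1: (0, 0, 1), 0: (0, 1, 0), -1: (1, 0, 0)}

def encode_intersection_results(states):
    """Turn a sequence of per-feature intersection states into one-hot features."""
    return [ONE_HOT[s] for s in states]

if __name__ == "__main__":
    # intersection results for a feature sequence of n = 4 features
    states = [1, 0, -1, 1]
    print(encode_intersection_results(states))
    # [(0, 0, 1), (0, 1, 0), (1, 0, 0), (0, 0, 1)]
```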
S404: and (5) model training.
And taking the single thermal code representing the intersection operation result of each characteristic and the corresponding sample label as training samples, inputting the training samples into the prediction model, and then carrying out iterative training on the prediction model to continuously update model parameters of the prediction model until a converged prediction model is obtained. The prediction model can be selected from logistic regression, linear regression, naive Bayes and other classification models.
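A possible plaintext training sketch for this step, using scikit-learn's logistic regression (the library choice and the toy data are assumptions; the text only requires a binary classification model). Each training sample is the flattened one-hot encoding of one record's per-feature intersection results, and the label indicates whether the record pair refers to the same entity:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

ONE_HOT = {1: (0, 0, 1), 0: (0, 1, 0), -1: (1, 0, 0)}

def to_row(states):
    """Flatten the one-hot encodings of one record's intersection results."""
    return [bit for s in states for bit in ONE_HOT[s]]

# toy training data: per-feature intersection states and same-entity labels
X = np.array([to_row(s) for s in [[1, 1, -1], [1, 0, 0], [0, 0, -1], [1, 1, 1]]])
y = np.array([1, 0, 0, 1])

model = LogisticRegression().fit(X, y)

# the learned weights per one-hot slot act as the per-state weights used later on shares
print(model.coef_, model.intercept_)
print(model.predict(np.array([to_row([1, 1, 0])])))
```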
The trained prediction model can be deployed on each party participating in the record linkage and run independently by each party.
In S340, the intersection secret share held by the first party is mapped according to the prediction model to obtain the entity secret share held by the first party, where the plaintext corresponding to the entity secret share indicates whether the target entity is an entity whose features are held by both the first party and the second party.
In one embodiment of the present application, the method of mapping the intersection secret share held by the first party according to the prediction model may include: obtaining, from the prediction model, the mapping parameters used to multiply the input data; converting the mapping parameters into state weights representing the different feature states of the input data, the state weights including an intersection weight representing that the input data is intersection data and a non-intersection weight representing that the input data is not intersection data; and performing a weighting operation on the intersection secret share held by the first party according to the state weights to obtain the entity secret share held by the first party.
Taking a linear model y = Σ_i w_i·x_i + b as an example, the input data are x_i, and the model parameters of the prediction model include the mapping parameters w_i used to multiply the input data and the mapping parameter b used for addition. Based on the one-hot encoding, the embodiment of the application can first convert the mapping parameter w_i into state weights representing the different feature states of the input data, for example the intersection weight representing that the j-th feature is intersection data and the non-intersection weight representing that the j-th feature is not intersection data.
According to the state weights, the first party performs a weighting operation on the intersection secret share a_0 it holds and obtains the entity secret share y_0 it holds. Specifically, the state weights are used to weight the secret share corresponding to each feature, giving the weighted share y_{j,0} for the j-th feature; on this basis, the weighted shares of all the features are accumulated to obtain the entity secret share held by the first party: y_0 = Σ_j y_{j,0} + b.
Correspondingly, the second party performs a weighting operation on the intersection secret share a_1 it holds according to the state weights and obtains the entity secret share y_1 it holds. Specifically, the state weights are used to weight the secret share corresponding to each feature, giving the weighted share y_{j,1} for the j-th feature; the weighted shares of all the features are then accumulated to obtain the entity secret share held by the second party: y_1 = Σ_j y_{j,1}.
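A sketch of this local weighting step under explicit assumptions: the per-feature one-hot intersection indicators are already additively shared over a 2^64 ring between the two parties (after the Boolean-to-arithmetic conversion described below), the per-feature state weights (e1, e2, e3) and the bias b are known to both parties, and the first party alone adds the bias; the exact form of the per-feature weighting is not spelled out in the text, so this is one possible realization. Multiplying a share by a public constant and summing are purely local operations, so no communication is needed here:

```python
import secrets

RING = 2 ** 64

def weight_shares(onehot_shares, weights, bias, add_bias):
    """onehot_shares[j] = this party's additive shares of the j-th feature's
    one-hot state (x_{j,0}, x_{j,1}, x_{j,2}); weights[j] = public (e1, e2, e3)."""
    y = bias % RING if add_bias else 0
    for (s0, s1, s2), (e1, e2, e3) in zip(onehot_shares, weights):
        y = (y + e1 * s0 + e2 * s1 + e3 * s2) % RING
    return y

if __name__ == "__main__":
    weights = [(2, -1, 5), (1, 0, 3)]          # public per-feature state weights
    bias = 7
    states = [(0, 0, 1), (0, 1, 0)]            # true one-hot states (never revealed)
    # additively share every one-hot bit between the two parties
    shares0, shares1 = [], []
    for st in states:
        r = [secrets.randbelow(RING) for _ in st]
        shares0.append(tuple(r))
        shares1.append(tuple((b - ri) % RING for b, ri in zip(st, r)))
    y0 = weight_shares(shares0, weights, bias, add_bias=True)   # first party
    y1 = weight_shares(shares1, weights, bias, add_bias=False)  # second party
    plain = sum(e1 * s0 + e2 * s1 + e3 * s2
                for (s0, s1, s2), (e1, e2, e3) in zip(states, weights)) + bias
    assert (y0 + y1) % RING == plain % RING     # shares reconstruct the model output
```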
The above embodiment applies when there are no default values in the feature sequence. In some alternative embodiments, when default values are present in the feature sequence, the secret shares corresponding to the default values also need to be corrected.
In one embodiment of the present application, the method of mapping the intersection secret share held by the first party according to the prediction model may include: obtaining, from the prediction model, the mapping parameters used to multiply the input data; converting the mapping parameters into state weights representing the different feature states of the input data, the state weights including an intersection weight representing that the input data is intersection data, a non-intersection weight representing that the input data is not intersection data, and a default-value weight representing that the input data is a default value; weighting the intersection secret share held by the first party according to the intersection weight and the non-intersection weight to obtain a weighted result; and correcting the weighted result according to the intersection weight and the default-value weight to obtain the entity secret share held by the first party.
Taking a linear model y = Σ_i w_i·x_i + b as an example, the input data are x_i, and the model parameters of the prediction model include the mapping parameters w_i used to multiply the input data and the mapping parameter b used for addition. Based on the one-hot encoding, the embodiment of the application can first convert the mapping parameter w_i into state weights representing the different feature states of the input data, for example the intersection weight representing that the j-th feature is intersection data, the non-intersection weight representing that the j-th feature is not intersection data, and the default-value weight representing that the j-th feature is a default value.
For example, assume the mapping parameter w_i is represented as the vector (e1, e2, e3), and that the one-hot codes corresponding to the three feature states (in the intersection, not in the intersection, and default value) are (0, 0, 1), (0, 1, 0), and (1, 0, 0), respectively. Then the corresponding intersection weight can be expressed as (0, 0, e3), the corresponding non-intersection weight as (0, e2, 0), and the corresponding default-value weight as (e1, 0, 0).
According to the state weights, the first party performs a weighting operation on the intersection secret share a_0 it holds and obtains the entity secret share y_0 it holds. Specifically, the intersection weight and the non-intersection weight are first used to weight the secret share corresponding to each feature, giving the weighted share y_{j,0} for the j-th feature.
For features whose values are default, the first party can directly correct the corresponding arithmetic share: if the j-th feature value is default, its weighted result necessarily carries the in-intersection score, so the first party corrects it by replacing the intersection weight with the default-value weight for that feature. On this basis, the weighted shares of all the features are accumulated to obtain the entity secret share held by the first party: y_0 = Σ_j y_{j,0} + b.
Correspondingly, the second party performs a weighting operation on the intersection secret share a_1 it holds according to the state weights and obtains the entity secret share y_1 it holds. Specifically, the state weights are used to weight the secret share corresponding to each feature, giving the weighted share y_{j,1} for the j-th feature; the weighted shares of all the features are then accumulated to obtain the entity secret share held by the second party: y_1 = Σ_j y_{j,1}.
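The default-value correction can be sketched as a purely local adjustment by the first party, which knows which of its own features are default. The specific arithmetic below is an assumption consistent with the description: a default feature's weighted share is corrected by swapping the in-intersection weight e3 for the default-value weight e1, i.e., adding (e1 − e3) to the first party's share:

```python
RING = 2 ** 64

def correct_defaults(y0, weights, default_flags):
    """First party's local correction: for every feature it knows to be default,
    swap the in-intersection weight e3 for the default-value weight e1."""
    for (e1, _e2, e3), is_default in zip(weights, default_flags):
        if is_default:
            y0 = (y0 + (e1 - e3)) % RING
    return y0

# usage (continuing the previous sketch): the first party corrects its own share y0
# for the features whose values are default in its own data, e.g.
# y0 = correct_defaults(y0, weights, default_flags=[False, True])
```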
In one embodiment of the present application, the intersection secret shares obtained by the first party and the second party through private set intersection are Boolean shares. On this basis, in order to perform arithmetic operations such as addition and multiplication on the intersection secret shares in the prediction model, the intersection secret share held by the first party needs to be converted from a Boolean share into an arithmetic share before the mapping processing according to the prediction model. Correspondingly, the intersection secret share held by the second party is also converted from a Boolean share into an arithmetic share.
The conversion between Boolean shares and arithmetic shares can be implemented between the first party and the second party based on secure multi-party computation (MPC). For example, the ABY framework may be used to switch between the two sharing protocols, i.e., Boolean sharing and arithmetic sharing. The ABY framework is a secure-protocol computation framework comprising an Arithmetic sharing protocol, a Boolean sharing protocol, and a Yao garbled-circuit sharing protocol, and it allows conversion among the three secret sharing protocols.
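The sketch below only illustrates the input/output relationship of the Boolean-to-arithmetic conversion (the same bit, XOR-shared before and additively shared after); it reconstructs the bit in the clear and is therefore not a secure protocol; in practice the conversion is carried out by an MPC protocol such as the ABY framework mentioned above:

```python
import secrets

RING = 2 ** 64

def insecure_b2a(b0: int, b1: int) -> tuple[int, int]:
    """Illustrative only: turn XOR shares of a bit into additive shares of the
    same bit by reconstructing it, which a real B2A protocol never does."""
    bit = b0 ^ b1                       # plaintext bit (leaked here, hidden in MPC)
    a0 = secrets.randbelow(RING)
    a1 = (bit - a0) % RING
    return a0, a1

if __name__ == "__main__":
    bit = 1
    b0 = secrets.randbits(1)
    b1 = b0 ^ bit                       # Boolean shares, e.g. from the PSI-Circuit
    a0, a1 = insecure_b2a(b0, b1)       # arithmetic shares usable in the linear model
    assert (a0 + a1) % RING == bit
```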
In step S350, whether the target entity is an entity whose features are held by both the first party and the second party is determined from the entity secret share held by the first party.
The plaintext corresponding to the entity secret share indicates whether the target entity is an entity whose features are held by both the first party and the second party. In one embodiment of the present application, the plaintext corresponding to the entity secret share may be an evaluation score output by the prediction model. If the evaluation score is above a specified threshold, the target entity can be regarded as an entity whose features are held by both parties: the data held by the first party contains a feature record corresponding to the target entity, the data held by the second party also contains a feature record corresponding to the target entity, and a corresponding record link can be established between the two feature records distributed over the different parties. If the evaluation score is below the specified threshold, the target entity can be regarded as not being an entity whose features are held by both parties: the entity features of the target entity exist in the data held by the first party but not in the data held by the second party, and no record link for the target entity can be established between the first party and the second party.
In one embodiment of the present application, the specified threshold for the evaluation score is zero, so the comparison with the threshold reduces to comparing the score with zero. On this basis, determining from the entity secret share held by the first party whether the target entity is an entity whose features are held by both the first party and the second party may include: extracting the most significant bit from the entity secret share held by the first party and from the entity secret share held by the second party, respectively; performing an XOR operation on the two most significant bits to obtain the sign-bit plaintext corresponding to the entity secret shares; and determining whether the target entity is an entity whose features are held by both the first party and the second party according to the comparison of the sign-bit plaintext with zero.
The plaintext corresponding to the entity secret share is a binary number, and its most significant bit is the sign bit indicating whether the plaintext is positive or negative. In the embodiment of the application, by extracting the most significant bits and performing an XOR operation, the comparison of the sign bit of the secret-shared plaintext with zero can be completed without either party needing to learn the other party's secret share, which guarantees the privacy and security of the data, reduces the amount of computation and communication, and improves data processing efficiency.
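A sketch of this sign check, assuming the entity score y is represented as a 64-bit two's-complement value and is XOR-shared between the two parties (for example after an arithmetic-to-Boolean conversion); under XOR sharing the most significant bit of the plaintext is exactly the XOR of the shares' most significant bits, so each party reveals only one bit of its own share:

```python
import secrets

BITS = 64
MASK = (1 << BITS) - 1

def msb(x: int) -> int:
    return (x >> (BITS - 1)) & 1

def sign_from_boolean_shares(y0: int, y1: int) -> int:
    """Each party extracts the MSB of its own share; XORing the two bits yields
    the sign bit of the plaintext score (1 = negative, 0 = non-negative)."""
    return msb(y0) ^ msb(y1)

if __name__ == "__main__":
    score = -3 & MASK                  # a negative score in two's complement
    y0 = secrets.randbits(BITS)        # Boolean (XOR) share held by the first party
    y1 = y0 ^ score                    # Boolean (XOR) share held by the second party
    negative = sign_from_boolean_shares(y0, y1)
    # score below the zero threshold -> not an entity whose features both parties hold
    assert negative == 1
```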
Fig. 5 is a schematic diagram illustrating a process of record linking between data held by different subjects in an application scenario according to an embodiment of the present application. The first subject and the second subject each perform privacy-preserving computation on the data they hold, execute privacy set intersection or other multiparty secure computation (MPC) with the counterpart when necessary, and finally the corresponding record-link result is obtained on the first subject.
As shown in fig. 5, the process of record linking between the first subject and the second subject in the application scenario includes the following steps.
S501: and (5) preprocessing data.
The first main body performs data preprocessing on the database A held by the first main body, and obtains hash values corresponding to the characteristics of each entity. The second main body performs data preprocessing on the database B held by the second main body to obtain hash values corresponding to the characteristics of each entity. For example, the SHA256 algorithm is used to perform hash operation on each feature in the database to obtain a corresponding hash value, so that the traffic in the training and prediction process can be reduced, basic information hiding can be performed on the data, the consumption of computing resources and communication resources can be reduced, and meanwhile, the data security can be improved.
S502: and (5) generating derivative characteristics.
Different attribute features in the data are combined to generate new features. For example, a new derived feature "name_address" can be generated from a name and an address: the newly generated feature is obtained by concatenating the two original features in order and applying a hash function. If the original feature name is a1 and the address is a2, the new derived feature "name_address" is h(a1 || a2), where h(·) is a hash function and || is the concatenation operator. Some feature values may be missing, and a derived feature generated from a missing feature value is itself treated as missing. In particular, an attribute feature that is a globally unique identifier need not be combined with other features to generate a new feature; it is used directly as a feature for the next stage.
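A minimal sketch of this derived-feature construction, assuming SHA-256 as the hash function h and a literal separator for the concatenation (the helper names and the separator are illustrative, not taken from the patent):

```python
import hashlib
from typing import Optional

MISSING = None  # placeholder for a missing original feature value

def h(value: str) -> str:
    """Hash function h(.) applied to a feature value (SHA-256 is assumed here)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

def derive(a1: Optional[str], a2: Optional[str]) -> Optional[str]:
    """Derived feature h(a1 || a2); a missing input yields a missing derived value."""
    if a1 is MISSING or a2 is MISSING:
        return MISSING
    return h(a1 + "||" + a2)

# Example: the derived feature "name_address" from a name and an address.
print(derive("alice", "1 main st"))
print(derive("alice", MISSING))  # missing address -> missing derived feature
```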
In some embodiments, all derived features may be added to the feature sequence. In other embodiments, a feature selection method such as the information value (IV), the variance inflation factor (VIF) or the p-value may be used to screen part of the derived features, which then form the feature sequence together with the attribute features that are globally unique identifiers.
S503: the circuit privacy set is evaluated.
In order to improve the privacy security of the intermediate results, a circuit privacy set intersection algorithm (PSI-Circuit) may be executed between the first subject and the second subject, finally producing, for each feature, a Boolean share pair (b_0, b_1) of the result indicating whether the feature is in the intersection, where the first subject holds the Boolean share b_0 and the second subject holds the Boolean share b_1.
Because the input of the PSI-Circuit needs to be de-duplicated and the output of the PSI-Circuit is out of order, i.e. inconsistent with the original feature sequence, in order to keep the share of each feature aligned with the original feature sequence, the embodiment of the application may adopt either of the following two schemes to handle repeated elements and align the feature sequence.
(1) When the required privacy protection level is relatively low, the first subject may send the original order mapping of the feature sequence it holds to the second subject, and the second subject adjusts the order of its shares accordingly to keep the two consistent; the computational complexity of this scheme is O(n).
(2) The Boolean shares are restored to be consistent with the original feature sequence, and the repeated elements are restored into the Boolean shares, either by adopting a homomorphic encryption algorithm or by adopting a position inverse mapping algorithm based on oblivious permutation and oblivious duplication.
If a homomorphic encryption algorithm is adopted, the corresponding computational complexity is O(n). A homomorphic encryption algorithm (Homomorphic Encryption, HE) is an encryption algorithm whose ciphertexts support homomorphic operations: after data is homomorphically encrypted, a specific computation can be carried out on the ciphertext, and the plaintext obtained by homomorphically decrypting the ciphertext result is equivalent to performing the same computation directly on the plaintext data, thereby achieving "computable but invisible" data. Common homomorphic encryption algorithms include, for example, the RSA algorithm, the ElGamal algorithm, the Paillier algorithm and the Boneh-Goh-Nissim algorithm.
If a position inverse mapping algorithm based on oblivious permutation and oblivious duplication is adopted, the corresponding computational complexity is O(n log n). Compared with the homomorphic encryption algorithm, this algorithm has lower encryption overhead and less data traffic. For the specific details of this algorithm, reference is made to the description of the embodiments below.
The Boolean share of each piece of data is then converted into a corresponding arithmetic share, giving (a_0, a_1); a reconstructed value of 1 indicates that the feature is in the intersection, and a value of 0 indicates that it is not.
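The Boolean-to-arithmetic conversion can be illustrated in the clear with the identity b0 ⊕ b1 = b0 + b1 − 2·b0·b1. The sketch below is not the secure protocol (which would evaluate the cross term b0·b1 obliviously, e.g. with oblivious transfer or Beaver triples inside a framework such as ABY); it only checks that the resulting arithmetic shares reconstruct the same bit.

```python
import secrets

MOD = 1 << 32

def b2a_in_the_clear(b0: int, b1: int):
    """Turn XOR shares (b0, b1) of a bit into additive shares modulo MOD.

    Uses b0 XOR b1 = b0 + b1 - 2*b0*b1.  The cross term b0*b1 is computed
    directly here; a secure protocol would compute it without either party
    revealing its share to the other."""
    cross = b0 * b1
    a0 = (b0 - cross) % MOD          # first subject's arithmetic share
    a1 = (b1 - cross) % MOD          # second subject's arithmetic share
    return a0, a1

bit = 1
b0 = secrets.randbelow(2)
b1 = bit ^ b0
a0, a1 = b2a_in_the_clear(b0, b1)
assert (a0 + a1) % MOD == bit        # 1 means "in the intersection", 0 means not
```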
S504: and scoring the circuit model.
The arithmetic shares (a_0, a_1) obtained by the intersection operation in step S503 are input into a scoring circuit model implemented by secret sharing, whose model parameters are consistent with the parameters of a pre-trained prediction model (such as a classification model); the output is a record result indicating whether a piece of data can be linked to the database of the other party.
The linear model y = Σ_i w_i·x_i + b is used below to illustrate how privacy-preserving prediction can be implemented for a piece of data; other classification models can be implemented in a similar manner. A plaintext sketch of steps (1) to (5) is given after the list.
(1) The model parameters from the training phase are denoted w_i and b. According to a one-hot encoding of three states, each weight is expanded into three values, representing the weight of the j-th feature value when it is in the intersection, not in the intersection, or missing, respectively. From the arithmetic shares (a_0, a_1), the first subject and the second subject compute the shares y_{j,0} and y_{j,1} of the score of the j-th feature.
(2) For missing values, the first subject may directly correct the share whose result corresponds to the missing state: if the j-th feature value is missing, its score would otherwise be computed as the in-intersection score, so the first subject corrects its local share to the weight of the missing state.
(3) The individual feature scores are added, namely: y_0 = Σ_j y_{j,0} + b and y_1 = Σ_j y_{j,1}.
(4) A comparison function over secret shares is used to compare the score with the specified threshold, yielding shares of the comparison result. For example, the MSB function on shares may be used here, i.e. the binary most significant bit is extracted so that y_0 + y_1 can be compared with the threshold 0, giving the shares z_0 = MSB(y_0) and z_1 = MSB(y_1).
(5) The second subject sends its share z_1 to the first subject, and the first subject combines the shares into the final real result z = z_0 ⊕ z_1, where ⊕ is the exclusive-OR operation. If z indicates that the score is greater than 0, the corresponding entity is determined to be an entity whose features are recorded by both the first subject and the second subject, and the data record is successfully linked; if z indicates that the score is less than 0, the corresponding entity is determined not to be an entity whose features are recorded by both subjects, and the data record link fails.
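A plaintext reference for steps (1) to (5) is sketched below; the weight values, names and bias are illustrative and not taken from the patent. In the actual protocol, each per-feature score is evaluated on arithmetic secret shares and only the sign of the total score is reconstructed.

```python
# Per-feature weights for the three states (in intersection / not in intersection / missing).
W_IN = [2.0, 1.5, 3.0]      # weight when feature j is in the intersection
W_OUT = [-1.0, -0.5, -2.0]  # weight when feature j is not in the intersection
W_MISS = [0.0, 0.0, 0.0]    # weight when feature j is missing
B = -1.0                    # bias b

def score(in_intersection, missing):
    """y = sum_j w_j + b, choosing each w_j by the state of feature j.

    in_intersection[j] is the (reconstructed) intersection indicator;
    missing[j] marks features whose value was missing and whose
    intersection result must therefore be overridden (step (2))."""
    y = B
    for j, hit in enumerate(in_intersection):
        if missing[j]:
            y += W_MISS[j]
        elif hit:
            y += W_IN[j]
        else:
            y += W_OUT[j]
    return y

y = score(in_intersection=[1, 0, 1], missing=[False, False, True])
linked = y > 0   # step (5): compare with the threshold 0
print(y, linked)
```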
From the above application scenario, it can be seen that the data processing method provided by the embodiment of the present application may be implemented as a privacy record linkage method based on circuit privacy set intersection: the attribute features of the two databases are combined by feature engineering to obtain derived features, circuit privacy set intersection then yields, for each feature, the result of whether it is present in the other database, and a scoring model obtained by training determines whether the record belongs to both databases, that is, whether it is a linked record.
The embodiment of the application is implemented with a machine learning approach and comprises a training stage and a prediction stage of the prediction model, the scoring model used for prediction being generated in the training stage. Deriving new features can improve the accuracy of the model after intersection; in particular, different attribute features can be combined to generate new derived features (for example a new derived feature "name_address" can be generated from a name and an address), and features can be selected according to their importance in the training stage to reduce the computational cost. By performing privacy set intersection on the derived features of the two parties, the result of whether each feature exists in the counterpart's data table can be obtained. Using the trained prediction model, the corrected intersection result can then be scored in the prediction stage by a scoring model implemented with privacy protection, to decide whether the data is present in both databases.
According to the embodiment of the application, derived features are obtained by combining different features, secure intersection then determines whether each derived feature is in the intersection, and finally a pre-trained model classifies the result to decide whether to link the record. The method has low computational complexity and high computational efficiency; meanwhile, the derived features make comprehensive use of both globally unique and non-unique features, so the accuracy is high; most importantly, the whole process does not require the two parties to exchange any plaintext information, which guarantees the security of the private data and complies with the relevant privacy protection laws and regulations.
The following describes in detail, with reference to fig. 6 to 15, how the embodiment of the present application implements circuit privacy set intersection with a position inverse mapping algorithm based on oblivious permutation and oblivious duplication.
Fig. 6 is a schematic diagram illustrating a principle of privacy set intersection by using cuckoo hash in the related art of the present application.
As shown in fig. 6, the m data items held by the first subject Guest are mapped by cuckoo hashing to a position sequence containing n positions, and the m data items held by the second subject Host are mapped by simple hashing to a position sequence containing n positions. For each position in the position sequence, the first subject Guest has at most one element, while the second subject Host may have multiple elements. If a position has no corresponding element of the first subject Guest, it is filled with a random value; similarly, if a position has no corresponding element of the second subject Host, it is also filled with a random value. The positions filled with random numbers are indicated in fig. 6.
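A minimal sketch of the two placements, assuming two hash functions, a small table and eviction-based cuckoo insertion (the hash choices, parameters and helper names are illustrative, not the patent's construction):

```python
import hashlib
import secrets

N = 8            # number of slots in the position sequence
HASHES = 2       # number of hash functions (assumed)

def h(i: int, x: str) -> int:
    return int(hashlib.sha256(f"{i}|{x}".encode()).hexdigest(), 16) % N

def cuckoo_insert(items):
    """Guest side: at most one element per slot, inserting with eviction."""
    table = [None] * N
    for x in items:
        cur, tries = x, 0
        while tries < 100:
            for i in range(HASHES):
                slot = h(i, cur)
                if table[slot] is None:
                    table[slot], cur = cur, None
                    break
            if cur is None:
                break
            # evict the occupant of one candidate slot and re-insert it
            slot = h(tries % HASHES, cur)
            table[slot], cur = cur, table[slot]
            tries += 1
        if cur is not None:
            raise RuntimeError("cuckoo insertion failed; a larger table is needed")
    # empty slots are padded with random values
    return [v if v is not None else secrets.token_hex(8) for v in table]

def simple_insert(items):
    """Host side: every element is placed into all of its candidate slots."""
    table = [[] for _ in range(N)]
    for x in items:
        for i in range(HASHES):
            table[h(i, x)].append(x)
    return table

guest_table = cuckoo_insert(["a1", "a2", "a3"])
host_table = simple_insert(["a2", "a3", "b7", "b8"])
```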
For each position in the position sequence, the first subject Guest and the second subject Host perform the privacy set intersection calculation using an oblivious pseudo-random function (OPRF): the first subject Guest evaluates the pseudo-random function F on its element at that position to obtain a random value r_a, and the second subject Host evaluates the pseudo-random function F on its l elements at that position to obtain l random values {r_{b,1}, r_{b,2}, …, r_{b,l}}. If the set of the second subject Host's values at that position contains the first subject Guest's value at that position, i.e. r_a ∈ {r_{b,1}, r_{b,2}, …, r_{b,l}}, then the element corresponding to that position is an element of the data intersection.
For each position in the position sequence, the first subject Guest and the second subject Host then compute the final Boolean shares based on a private set membership (PSM) test. For the positions filled with random numbers, the true value of the corresponding Boolean share is zero. The true values corresponding to the Boolean shares indicate whether the m data items held by the first subject Guest are elements of the data intersection.
In some alternative embodiments, the first subject Guest and the second subject Host may also use an oblivious programmable pseudo-random function (OPPRF, Oblivious Programmable Pseudorandom Functions) for the privacy set intersection calculation. With an OPPRF, the first subject Guest inputs a and the second subject Host inputs {b_1, b_2, …, b_l}; finally the first subject Guest obtains a pseudo-random table T and the second subject Host obtains a random value r. The first subject Guest computes a random value r* from a and T. When a ∈ {b_1, b_2, …, b_l}, r = r*; otherwise r ≠ r*. During the computation, the first subject Guest does not learn r and the second subject Host does not learn r*. When the OPRF is replaced by an OPPRF, the first subject Guest and the second subject Host can use a two-value comparison (i.e. whether r equals r*) instead of the PSM, thereby reducing the computational overhead and communication overhead. Taking element a2 as an example, after the OPPRF calculation the first subject Guest obtains a random value F(k1, a2) and the second subject Host obtains a pseudo-random table k1; taking element a3 as an example, after the OPPRF calculation the first subject Guest obtains a random value F(k6, a3) and the second subject Host obtains a pseudo-random table k6.
The method shown in fig. 6 is based on cuckoo hash to implement PSI-Circuit, and has the following drawbacks:
(1) The cuckoo hash maps the original data of the first subject Guest to new positions, and since the mapped position information cannot be leaked, the final Boolean shares cannot be aligned with the original data, i.e. the order cannot be preserved.
(2) The cuckoo hash maps m input data items to n (> m) positions, so the Boolean shares corresponding to the extra n − m positions are redundant information. However, the position information after cuckoo hash mapping cannot be exposed, so the useless Boolean shares cannot simply be deleted, which makes the Boolean share result redundant.
(3) The PSI-Circuit cannot be run on input data containing repeated elements, because cuckoo hashing is an injective mapping and repeated elements easily cause the cuckoo hash mapping to fail.
The above drawbacks greatly limit the use of the PSI-Circuit. After the first subject Guest secret-shares its data with the second subject Host in the form of Boolean shares, the two parties can only perform secure two-party computations that use all Boolean shares (including the redundant shares introduced by the cuckoo hash) and treat them identically, such as a sum or a variance; even a simple computation such as a weighted sum cannot be supported.
One approach that may be used to achieve data alignment is to combine the PSI-Circuit with a homomorphic encryption algorithm (Homomorphic Encryption, HE), i.e. an encryption algorithm whose ciphertexts support homomorphic operations, as described above; common examples include the RSA, ElGamal, Paillier and Boneh-Goh-Nissim algorithms.
After the first subject Guest and the second subject Host obtain the Boolean shares of the PSI-Circuit, the second subject Host encrypts its Boolean values with the homomorphic encryption algorithm and sends them to the first subject Guest; the first subject Guest directly selects the ciphertexts according to the position information of the cuckoo hash mapping and the position information of the repeated elements, performs the homomorphic computation with its corresponding plaintext Boolean shares, and obtains the final share result. The main disadvantages of this approach are large traffic and low computational performance: the traffic is large because each ciphertext encrypts only one Boolean value, and the low computational performance is mainly due to homomorphic encryption/decryption and homomorphic computation.
In order to overcome the drawbacks of the related art, the embodiment of the application provides a new privacy set intersection algorithm, which not only supports a PSI-Circuit over repeated elements and can rearrange the Boolean shares computed by the PSI-Circuit to be consistent with the element order of the first subject Guest, but also removes the extra Boolean shares introduced by the cuckoo hash in the PSI-Circuit.
Fig. 7 illustrates a data processing method for performing privacy set intersection based on location inverse mapping in one embodiment of the present application, which may be performed by any terminal device or server having a data calculation function. The embodiment of the present application will be described taking as an example a data processing method performed by the first body shown in fig. 1. As shown in fig. 7, the data processing method includes the following steps S710 to S740.
S710: and replacing the repeated elements of the characteristic sequence held by the first main body to obtain a second data sequence without the repeated elements.
The feature sequence held by the first body, that is, the feature sequence in the above embodiment, is a data set formed by orderly arranging a plurality of data elements, where one or more elements may be included. When the repeated elements exist in the feature sequence, the repeated elements in the feature sequence can be removed in a repeated element replacement mode, and the position sequence of all the elements is kept not to be disturbed.
For example, the feature sequence is (a, b, c, d, c, d) comprising two sets of repeating elements c and d. After substitution of the repeated elements, a second data sequence (a, b, c, d, s0, s 1) without repeated elements can be obtained.
In one embodiment of the present application, a method for replacing the repeated elements of the feature sequence may include: obtaining random values that are different from all elements in the feature sequence; and replacing the repeated elements in the feature sequence with the random values to obtain a second data sequence without repeated elements. Taking the feature sequence (a, b, c, d, c, d) as an example, random values s0 and s1 different from every element in it are first obtained, then one of the repeated elements c is replaced with the random value s0 and one of the repeated elements d is replaced with the random value s1, thereby obtaining a second data sequence (a, b, c, d, s0, s1) containing no repeated elements.
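A minimal sketch of this replacement, assuming random hex strings as the distinct filler values; the dup_map bookkeeping is an illustration of the information that the first sub-map later encodes, and all helper names are hypothetical:

```python
import secrets

def replace_duplicates(seq):
    """Replace every repeated occurrence with a fresh value not present in seq.

    Returns the de-duplicated sequence plus a map from the position of each
    replacement back to the position of the first occurrence it stands for."""
    seen = {}
    out, dup_map = [], {}
    for i, x in enumerate(seq):
        if x in seen:
            filler = secrets.token_hex(8)
            while filler in seq:          # keep drawing until the filler is distinct
                filler = secrets.token_hex(8)
            out.append(filler)
            dup_map[i] = seen[x]          # position i must be restored from seen[x]
        else:
            seen[x] = i
            out.append(x)
    return out, dup_map

second_sequence, dup_map = replace_duplicates(["a", "b", "c", "d", "c", "d"])
print(second_sequence)  # e.g. ['a', 'b', 'c', 'd', '<s0>', '<s1>']
print(dup_map)          # {4: 2, 5: 3}
```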
S720: mapping elements in the second data sequence to the position sequence and obtaining a position map for mapping elements in the position sequence to the feature sequence.
A sequence of positions is a set of positions made up of an ordered arrangement of several element positions, where each element position can be used to populate an element. Each element may be padded to an element position in the sequence of positions by performing a mapping process on the elements in the second sequence of data. After the mapping process, the arrangement order of the elements in the second data sequence is disturbed, i.e. the arrangement order of the elements in the position sequence is different from the arrangement order of the elements in the second data sequence. In order to restore the position sequence, the embodiment of the application can record the inverse mapping corresponding to the element in the second data sequence while carrying out mapping processing on the element in the second data sequence, namely, obtain the position mapping for mapping the element in the position sequence to the feature sequence.
In one embodiment of the present application, a method of mapping elements in a second data sequence to a position sequence may include: acquiring a position sequence comprising a plurality of element positions, wherein the number of the positions of the position sequence is larger than that of the second data sequence; hash mapping is carried out on the elements in the second data sequence, and the elements in the second data sequence are filled to element positions according to the mapping result; and filling random values into blank element positions which are not filled with any elements in the position sequence.
For example, if the second data sequence contains m elements and the position sequence contains n element positions, then after cuckoo hash mapping the m elements are filled into the position sequence, and the remaining n − m blank element positions are filled with random values. By filling the blank element positions with random values, the embodiment of the application keeps the order of the position sequence undisturbed and guarantees the validity of the position map. The n − m random values filled into the position sequence can be mapped by the position map to n − m newly added elements of the second data sequence, so that the mapping and inverse-mapping relation between all elements of the position sequence and of the second data sequence is maintained. For example, the embodiment of the application may append the n − m newly added elements after the m elements of the second data sequence, or insert them at any specified positions of the second data sequence.
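A sketch of how the position map can be recorded, assuming a placement dictionary from slots to element indices is already available from the cuckoo hashing; the example mirrors, in 0-based form, the worked example given near the end of this section, and the helper names are illustrative:

```python
import secrets

def build_position_map(placement, m, n):
    """placement: dict mapping a slot (0..n-1) to the index (0..m-1) of the
    element the cuckoo hash put there.  Empty slots are padded with random
    values whose indices m, m+1, ... are appended after the original data,
    so that the position map f covers every slot."""
    f = [None] * n
    fillers = []
    next_index = m
    for slot in range(n):
        if slot in placement:
            f[slot] = placement[slot]
        else:
            f[slot] = next_index          # points at an appended random filler
            fillers.append(secrets.token_hex(8))
            next_index += 1
    return f, fillers

# 0-based version of the example "6 elements placed at slots (4,1,6,7,8,3)":
f, fillers = build_position_map({3: 0, 0: 1, 5: 2, 6: 3, 7: 4, 2: 5}, m=6, n=8)
print(f)   # [1, 6, 5, 0, 7, 2, 3, 4]
```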
S730: and carrying out privacy set intersection on the position sequence and the data held by the second main body to obtain first secret fragments held by the first main body and the second main body respectively, wherein plaintext corresponding to the first secret fragments is used for indicating whether intersection data exists between the position sequence and the data held by the second main body.
The data held by the second body may be another sequence of positions corresponding to the sequence of positions held by the first body. For example, the sequence of positions held by the second body is a set of positions made up of an ordered arrangement of several element positions, where each element position may be populated with one or more elements. The second main body can use the same or different mapping method as the first main body to obtain a corresponding position sequence after mapping the data sequence held by the second main body. For example, the second body may delete the repeated elements directly in the data sequence held by itself, and then perform mapping processing by using a simple hash algorithm to obtain the corresponding position sequence. Or the second main body can replace the repeated elements in the data sequence held by the second main body with random values, and then the corresponding position sequence is obtained by mapping processing by using a cuckoo hash algorithm.
By performing privacy set intersection on the position sequence held by the first subject and the data held by the second subject, a first secret piece held by the first subject and the second subject, respectively, can be obtained. Through the privacy set intersection, the first main body or the second main body can realize intersection judgment of the position sequences under the condition that the other side can not know the data held by the other side. Under the condition that the first secret piece held by the first main body and the first secret piece held by the second main body are obtained at the same time, a corresponding plaintext can be obtained by decryption, and the plaintext is used for indicating whether intersection data exists between a position sequence held by the first main body and data held by the second main body.
S740: and mapping the first secret shards held by the first main body according to the position mapping to obtain intersection secret shards held by the first main body, wherein a plaintext corresponding to the intersection secret shards is used for indicating whether each element in the feature sequence is intersection data.
The first secret shard obtained by the privacy set intersection has the same position order as the position sequence held by the first subject. After the first secret shard held by the first subject is mapped using the position map, the intersection secret shard held by the first subject is obtained; since this intersection secret shard has the same position order as the feature sequence, the plaintext corresponding to it can be used to indicate whether each element in the feature sequence is intersection data of the data held by the first subject and the second subject.
In the data processing method provided by the embodiment of the application, the second data sequence is obtained by replacing the repeated elements of the characteristic sequence, and the position mapping for recovering the position sequence of the second data sequence is obtained while the second data sequence is mapped to the position sequence, so that after the intersection of the privacy set is completed, the intersection secret shards can be mapped by using the position mapping to obtain the intersection secret shards with the position sequence recovered. Because the intersection secret shards and the feature sequence have the same position sequence, the decrypted intersection data and the feature sequence can be kept to have the same position sequence, and the first main body does not need to spend extra computing resources and communication resources to do secret sharing of the data position sequence with the second main body, so that the data processing efficiency can be improved while the data security is ensured, and the resource consumption of the data processing is reduced.
Fig. 8 illustrates a data processing method for performing privacy set intersection based on mapping decomposition on inverse location mapping in an embodiment of the present application, which may be performed by any terminal device or server having a data calculation function. The embodiment of the present application will be described taking as an example a data processing method performed by the first body shown in fig. 1. As shown in fig. 8, the data processing method includes the following steps S810 to S860.
S810: and replacing the repeated elements of the characteristic sequence held by the first main body to obtain a second data sequence without the repeated elements.
S820: mapping elements in the second data sequence to the position sequence and obtaining a position map for mapping elements in the position sequence to the feature sequence.
S830: and carrying out privacy set intersection on the position sequence and the data held by the second main body to obtain first secret fragments held by the first main body and the second main body respectively, wherein plaintext corresponding to the first secret fragments is used for indicating whether intersection data exists between the position sequence and the data held by the second main body.
S840: a first sub-map and a second sub-map corresponding to the position map are obtained, the first sub-map being used for recovering the repeated elements in the data sequence, and the second sub-map being used for recovering the position order in the data sequence.
S850: and recovering the repeated elements in the first secret patch held by the first main body according to the first sub-map to obtain a third secret patch held by the first main body.
S860: and recovering the position sequence of each element in the third secret shard held by the first main body to be the same as the characteristic sequence according to the second sub-mapping to obtain an intersection secret shard held by the first main body, wherein a plaintext corresponding to the intersection secret shard is used for indicating whether each element in the characteristic sequence is intersection data.
In the data processing method provided by the embodiment of the application, the recovery of the repeated elements and the recovery of the position sequence can be respectively realized by decomposing the position mapping, so that the atomization degree of data processing is improved, the data processing efficiency is further improved, and the resource consumption is reduced.
For details of the steps S810 to S830, reference may be made to the descriptions of the steps S710 to S730 in the above embodiments, and the details are not repeated here.
In step S840, the position map is decomposed into two sub-maps, a first sub-map for recovering the repeating elements in the data sequence and a second sub-map for recovering the position order in the data sequence.
FIG. 9 illustrates a schematic diagram of decomposing a location map in one embodiment of the application.
As shown in fig. 9, the repeated element substitution is performed on the feature sequence 901 held by the first body, so that a second data sequence 902 containing no repeated element is obtained, a position sequence 903 is obtained after the mapping process is performed on the second data sequence 902 using the cuckoo hash map, and a position map f for mapping the position sequence 903 to the feature sequence 901 is obtained.
After the privacy set intersection over the position sequence 903, a first secret shard 904 held by the first subject is obtained.
The position map f is decomposed into an equivalent pair consisting of a first sub-map and a second sub-map φ. Using the first sub-map, the repeated elements can be recovered in the first secret shard 904, yielding a third secret shard 905 held by the first subject. The position order of the third secret shard 905 can then be restored using the second sub-map φ, yielding an intersection secret shard 906 held by the first subject. Since the position order of the elements in the intersection secret shard 906 has been restored to be the same as that of the feature sequence, its corresponding plaintext can be used to indicate whether each element in the feature sequence is intersection data of the data held by the first subject and the second subject.
In step S850, the repeated elements are restored in the first secret partition held by the first body according to the first sub-map, and a third secret partition held by the first body is obtained.
In one embodiment of the present application, a method of recovering the repeated elements according to the first sub-map may include: performing oblivious duplication on the first sub-map and the first secret shard held by the second subject to obtain a first intermediate vector held by the first subject and a third secret shard held by the second subject, where oblivious duplication copies and maps elements of the data sequence without either party revealing the data it holds; and applying the first sub-map to the first secret shard held by the first subject and XORing the result with the first intermediate vector held by the first subject to obtain the third secret shard held by the first subject. A duplication map here means copying some elements of the data sequence onto the positions of other elements.
In step S860, the position order of each element in the third secret patch held by the first body is restored to be the same as the feature sequence according to the second sub-map, so as to obtain an intersection secret patch held by the first body, and the plaintext corresponding to the intersection secret patch is used to indicate whether each element in the feature sequence is intersection data.
In one embodiment of the present application, a method of restoring the position order according to the second sub-map may include: performing oblivious permutation on the second sub-map and the third secret shard held by the second subject to obtain a second intermediate vector held by the first subject and an intersection secret shard held by the second subject, where oblivious permutation applies a permutation to the elements of the data sequence without either party revealing the data it holds; and applying the second sub-map to the third secret shard held by the first subject and XORing the result with the second intermediate vector held by the first subject to obtain the intersection secret shard held by the first subject. A permutation map here means exchanging the positions of elements in the data sequence.
In an embodiment of the application, the repeated elements and the position order are recovered in the intersection secret shard held by the first subject using two sub-maps, which correspond to oblivious duplication and oblivious permutation respectively. Fig. 10 illustrates a schematic diagram of data processing based on oblivious duplication and oblivious permutation in one embodiment of the application.
As shown in fig. 10, the first secret patch G1 held by the first principal Guest and the first secret patch H1 held by the second principal Host can be obtained by performing the privacy set intersection of the position sequence held by the first principal and the data held by the second principal.
Oblivious duplication is performed on the first sub-map and the first secret shard H1 held by the second subject, yielding a first intermediate vector I1 held by the first subject and a third secret shard H3 held by the second subject.
The first sub-map is applied to the first secret shard G1 held by the first subject, and the result is XORed with the first intermediate vector I1 held by the first subject to obtain the third secret shard G3 held by the first subject.
Oblivious permutation is then performed on the second sub-map φ and the third secret shard H3 held by the second subject, yielding a second intermediate vector I2 held by the first subject and an intersection secret shard H2 held by the second subject.
The second sub-map φ is applied to the third secret shard G3 held by the first subject, and the result is XORed with the second intermediate vector I2 held by the first subject to obtain the intersection secret shard G2 held by the first subject.
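The share bookkeeping of these steps can be sanity-checked with an "ideal" simulation in which one helper plays both roles. The sketch below only verifies that the XOR shares stay consistent; it is not an oblivious protocol (in the real protocol neither party learns the other's share, and the second subject does not learn the maps). The maps and values are illustrative.

```python
import secrets

def xor_vec(u, v):
    return [a ^ b for a, b in zip(u, v)]

def ideal_oblivious_map(mapping, host_share):
    """Ideal functionality standing in for both oblivious duplication and
    oblivious permutation: the Host share is mapped (out[i] = in[mapping[i]]),
    re-randomised with a fresh Host share, and the Guest receives the
    difference as its intermediate vector."""
    mapped = [host_share[j] for j in mapping]
    new_host_share = [secrets.randbits(1) for _ in mapped]
    guest_intermediate = xor_vec(mapped, new_host_share)
    return guest_intermediate, new_host_share

# XOR shares (G1, H1) of the PSI-Circuit result v, in position-sequence order.
v = [1, 0, 1, 1, 0, 0]
G1 = [secrets.randbits(1) for _ in v]
H1 = xor_vec(v, G1)

first_sub_map = [0, 1, 2, 3, 2, 3]   # duplicates positions 3 and 4 onto 5 and 6 (0-based)
phi = [5, 0, 1, 2, 3, 4]             # an arbitrary permutation standing in for the second sub-map

I1, H3 = ideal_oblivious_map(first_sub_map, H1)    # oblivious duplication step
G3 = xor_vec([G1[j] for j in first_sub_map], I1)   # Guest applies the map and XORs

I2, H2 = ideal_oblivious_map(phi, H3)              # oblivious permutation step
G2 = xor_vec([G3[j] for j in phi], I2)             # Guest applies phi and XORs

duplicated = [v[j] for j in first_sub_map]
expected = [duplicated[j] for j in phi]
assert xor_vec(G2, H2) == expected
```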
In some embodiments of the application, the first sub-map may be further decomposed in order to implement the oblivious duplication. Fig. 11 illustrates a schematic diagram of implementing oblivious duplication based on the decomposition of the first sub-map in one embodiment of the application.
As shown in fig. 11, the method of performing oblivious duplication on the first sub-map and the first secret shard H1 held by the second subject includes the following steps S1110 to S1160.
S1110: acquiring and first sub-mappingThe corresponding third sub-map ψ for placing the element to be restored in the adjacent position and the boolean vector bool for determining whether to copy two elements of the adjacent position as duplicate elements.
For example, let the first secret shard H1 held by the second subject be denoted (a, b, c, d, e, f). The first sub-map is intended to recover two repeated elements c and d in the first secret shard H1, that is, to copy the elements c and d of the first secret shard H1 once, so as to obtain the shares (a, b, c, d, c, d) with the repeated elements restored. In this case the first sub-map can be expressed as the position mapping (1,2,3,4,5,6) -> (1,2,3,4,3,4), i.e. the 5th element takes the value of the 3rd element and the 6th element takes the value of the 4th element.
According to the first sub-map, a third sub-map ψ and a Boolean vector bool that are together equivalent to it can be constructed. The third sub-map ψ places the elements to be restored in adjacent positions; for example, the third sub-map ψ can be used to change the position list (1,2,3,4,3,4) into (1,2,3,3,4,4).
The Boolean vector bool determines whether two elements in adjacent positions are to be duplicated. For example, the Boolean vector bool = (1,1,0,1,0) can be constructed from the position list (1,2,3,3,4,4), where a value of 1 indicates that no duplication is required between the two adjacent positions and a value of 0 indicates that duplication is required between the two adjacent positions.
S1120: the third sub-map ψ and the first secret patch H1 held by the second body are carelessly permuted to obtain a fourth secret patch G4 held by the first body and a fourth secret patch H4 held by the second body.
Taking the first secret shard H1 held by the second subject as an example, after the first secret shard H1 is obliviously permuted by the third sub-map ψ, the fourth secret shard G4 held by the first subject can be expressed as (t_1, t_2, …, t_6), and the fourth secret shard H4 held by the second subject can be expressed as (s_1, s_2, …, s_6). If the fourth secret shard G4 held by the first subject and the fourth secret shard H4 held by the second subject were obtained together, the corresponding real values (a, b, c, e, d, f) could be reconstructed; that is, the third sub-map ψ rearranges the shares of (a, b, c, d, e, f) so that the position list (1,2,3,4,3,4) becomes (1,2,3,3,4,4).
S1130: and (3) performing unintentional copying on the Boolean vector pool and the fourth secret piece H4 held by the second main body to obtain a third intermediate vector I3 held by the first main body and a fifth secret piece H5 held by the second main body.
For example, according to the positional relationship of the shares of (a, b, c, e, d, f), the share of element c needs to be copied onto the position of element e and the share of element d needs to be copied onto the position of element f; the correspondingly constructed Boolean vector is (1,1,0,1,0), where the meaning of each value is: (a, b) no duplication, (b, c) no duplication, (c, e) duplication, (e, d) no duplication, and (d, f) duplication.
Based on the oblivious duplication of the Boolean vector bool and the fourth secret shard H4 held by the second subject, the third intermediate vector I3 held by the first subject is obtained, and the fifth secret shard H5 held by the second subject can be expressed as (r_1, r_2, …, r_6); in effect, (s_1, s_2, …, s_6) is transformed into (s_1, s_2, s_3, s_3, s_5, s_5).
S1140: and mapping the fourth secret piece G4 held by the first main body according to the Boolean vector, and performing exclusive OR operation with the third intermediate vector I3 to obtain a fifth secret piece G5 held by the first main body.
By selectively duplicating the elements of the fourth secret shard G4 held by the first subject according to the Boolean vector bool, a new secret shard can be obtained, i.e. (t_1, t_2, t_3, t_4, t_5, t_6) is converted into (t_1, t_2, t_3, t_3, t_5, t_5). After this new secret shard is XORed with the third intermediate vector I3, the fifth secret shard G5 held by the first subject is obtained, which can be expressed, for example, as (τ_1, τ_2, …, τ_6).
S1150: inverse mapping ψ for the third sub-mapping -1 And the fifth secret shard H5 held by the second main body is subjected to unintentional replacement to obtain a fourth intermediate vector I4 held by the first main body and a third secret shard H3 held by the second main body.
With the inverse map ψ^{-1} of the third sub-map and the fifth secret shard H5 = (r_1, r_2, …, r_6) held by the second subject as input, an oblivious permutation OP is performed; the first subject obtains a fourth intermediate vector I4, and at the same time the second subject obtains a third secret shard H3, which can be expressed, for example, as (v_1, v_2, …, v_6).
S1160: inverse mapping ψ according to the third sub-mapping -1 After the mapping process is performed on the fifth secret slice H5 held by the first body, the fifth secret slice H5 and the fourth intermediate vector I4 held by the first body are subjected to exclusive or operation, so as to obtain a first intermediate vector I1 held by the first body.
The fifth secret shard G5 = (τ_1, τ_2, …, τ_6) held by the first subject is rearranged by the inverse map ψ^{-1} of the third sub-map and then XORed with the fourth intermediate vector I4 held by the first subject, yielding the first intermediate vector I1 = (u_1, u_2, …, u_6).
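In the clear, the decomposition of steps S1110 to S1160 can be checked on the running example: permuting by ψ, copying left-to-right wherever the Boolean vector is 0, and permuting back by ψ^{-1} reproduces the effect of the first sub-map. This is a plaintext check only (the protocol performs each step on shares), written with 0-based indices:

```python
def apply_map(mapping, seq):
    """out[i] = seq[mapping[i]] (0-based positions)."""
    return [seq[j] for j in mapping]

def copy_adjacent(seq, bool_vec):
    """Where bool_vec[i] == 0, copy position i onto position i + 1."""
    out = list(seq)
    for i, keep in enumerate(bool_vec):
        if keep == 0:
            out[i + 1] = out[i]
    return out

data = ["a", "b", "c", "d", "e", "f"]
first_sub_map = [0, 1, 2, 3, 2, 3]   # (1,2,3,4,3,4) in the text, 0-based here
psi = [0, 1, 2, 4, 3, 5]             # swaps the 4th and 5th positions (1-based)
psi_inv = psi                        # this particular psi is an involution
bool_vec = [1, 1, 0, 1, 0]

step1 = apply_map(psi, data)             # ['a','b','c','e','d','f']
step2 = copy_adjacent(step1, bool_vec)   # ['a','b','c','c','d','d']
step3 = apply_map(psi_inv, step2)        # ['a','b','c','d','c','d']
assert step3 == apply_map(first_sub_map, data)
```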
In one embodiment of the present application, the method of inadvertently copying the boolean vector and the fourth secret piece held by the second body in step S1130 may include: sequentially selecting two elements at adjacent positions from a fourth secret partition held by a second main body to obtain N-1 element pairs, wherein N is the number of elements in the fourth secret partition held by the second main body; and selecting a mapping rule according to the Boolean vector, and respectively carrying out mapping processing on N-1 element pairs according to the mapping rule to obtain a third intermediate vector held by the first main body and a fifth secret partition held by the second main body.
For example, let the fourth secret shard held by the second subject be (a, b, c, e, d, f), containing 6 elements. The corresponding 5 element pairs can be expressed as (a, b), (b, c), (c, e), (e, d) and (d, f). The mapping rule chosen according to the Boolean vector is used to either duplicate or not duplicate the two elements of an element pair.
In one embodiment of the present application, the mapping rules that may be selected include:
(1) And mapping the first element in the element pair by using the first mapping parameter and the second mapping parameter respectively. For example, the element pair is (a, b), and the element pair obtained by performing mapping processing based on the mapping rule may be expressed as (a, a), that is, copying of two adjacent elements is completed.
(2) And mapping the first element in the element pair by using the first mapping parameter, and mapping the second element in the element pair by using the second mapping parameter. For example, the element pair is (a, b), and the element pair obtained by performing mapping processing based on the mapping rule is still represented as (a, b), that is, the adjacent two elements do not need to be duplicated.
In another embodiment of the present application, the mapping rules that may be selected include any two of the following rules:
(1) And mapping the first element in the element pair by using the first mapping parameter and the second mapping parameter respectively. For example, the element pair is (a, b), and the element pair obtained by performing the mapping process based on the mapping rule may be expressed as (a, a).
(2) And mapping the second element in the element pair by using the first mapping parameter and the second mapping parameter respectively. For example, the element pair is (a, b), and the element pair obtained by performing the mapping process based on the mapping rule may be represented as (b, b).
(3) And mapping the first element in the element pair by using the first mapping parameter, and mapping the second element in the element pair by using the second mapping parameter. For example, the element pair is (a, b), and the element pair obtained by performing the mapping process based on the mapping rule may be expressed as (a, b).
(4) And mapping the second element in the element pair by using the first mapping parameter, and mapping the first element in the element pair by using the second mapping parameter. For example, the element pair is (a, b), and the element pair obtained by performing the mapping process based on the mapping rule may be represented as (b, a).
In an embodiment of the present application, for step S1130 in the above embodiment, the method of selecting a mapping rule according to a boolean vector and performing mapping processing on N-1 element pairs according to the mapping rule, respectively, may include the following steps S1131 to S1136.
S1131: n-1 mapping units are obtained, and each mapping unit is used for carrying out mapping processing on one element pair.
S1132: the Boolean vector and N-1 element pairs are respectively input into N-1 mapping units.
S1133: and selecting a mapping rule corresponding to the mapping unit according to the value of each element in the Boolean vector.
S1134: and mapping the element pairs input into the mapping unit according to the selected mapping rule to obtain the output parameters held by the first main body and the mapping parameters held by the second main body.
S1135: and collecting output parameters held by the first main body to obtain a third intermediate vector held by the first main body.
S1136: and collecting mapping parameters held by the second main body to obtain a fifth secret shard held by the second main body.
In one embodiment of the present application, for step S1133, the method for selecting the mapping rule corresponding to the mapping unit according to the values of the elements in the boolean vector may include:
s11331: the target element corresponding to the mapping unit is obtained from the boolean vector.
The Boolean vector includes N-1 elements, each of which corresponds to a target element of a mapping unit.
S11332: when the target element value is the first value, selecting a mapping rule corresponding to the mapping unit as follows: and mapping the first element in the element pair by using the first mapping parameter and the second mapping parameter respectively.
For example, when the target element in the boolean vector has a value of 0, it indicates that the element pair input to the corresponding mapping unit needs to be copied, so that the first element in the element pair may be mapped using the first mapping parameter and the second mapping parameter, respectively.
S11333: when the target element takes a value of a second value different from the first value, selecting a mapping rule corresponding to the mapping unit as follows: and mapping the first element in the element pair by using the first mapping parameter, and mapping the second element in the element pair by using the second mapping parameter.
For example, when the target element in the boolean vector has a value of 1, it means that the element pair input to the corresponding mapping unit does not need to be duplicated, so that the first element in the element pair may be mapped using the first mapping parameter and the second element in the element pair may be mapped using the second mapping parameter.
Both the oblivious permutation and the oblivious duplication of the embodiments of the present application can be implemented by combining multiple 1-switch based mapping units.
Fig. 12 illustrates a schematic diagram of a mapping unit used by the oblivious permutation in one embodiment of the application.
As shown in fig. 12, the 1-switch based mapping unit performs an oblivious swap on a single bit σ ∈ {0,1} input by the first subject Guest and a pair (a, b) input by the second subject Host; finally the first subject and the second subject output secret shares of (a, b) or (b, a), i.e. when σ = 0 the shared result is (a, b), and otherwise it is (b, a).
A mapping network for implementing the oblivious permutation OP can be constructed by combining multiple mapping units. Since any permutation of n elements can be decomposed into 2·log_2 n − 1 layers of swaps, each layer containing n/2 swaps, performing n/2 1-switches in each layer and iterating 2·log_2 n − 1 times in total realizes the oblivious permutation OP.
Fig. 13 illustrates a schematic diagram of a mapping unit used by the oblivious duplication in one embodiment of the application.
As shown in fig. 13, the 1-switch based mapping unit performs an oblivious selection on a single bit σ ∈ {0,1} input by the first subject Guest and a pair (a, b) input by the second subject Host; finally the first subject and the second subject output secret shares of (a, a) or (a, b), i.e. when σ = 0 the shared result is (a, a), and when σ = 1 it is (a, b).
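The two 1-switch variants can be described as ideal functionalities that output fresh XOR shares to the two parties, following the σ conventions stated above. This is only an illustration of what the switches compute, not of the oblivious-transfer-based construction itself, and the helper names are hypothetical:

```python
import secrets

def share(x: int):
    """Fresh XOR sharing of a value: (Guest part, Host part)."""
    g = secrets.randbits(32)
    return g, x ^ g

def permute_switch(sigma: int, a: int, b: int):
    """Outputs shares of (a, b) when sigma == 0 and of (b, a) when sigma == 1."""
    out = (a, b) if sigma == 0 else (b, a)
    return share(out[0]), share(out[1])

def copy_switch(sigma: int, a: int, b: int):
    """Outputs shares of (a, a) when sigma == 0 and of (a, b) when sigma == 1."""
    out = (a, a) if sigma == 0 else (a, b)
    return share(out[0]), share(out[1])

(g0, h0), (g1, h1) = copy_switch(0, 5, 9)
assert (g0 ^ h0, g1 ^ h1) == (5, 5)
```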
Similar to the oblivious permutation, the embodiments of the application may construct a mapping network for implementing the oblivious duplication by combining multiple mapping units. For example, a serial network may be formed by connecting them in series, or a parallel network may be formed by connecting them in parallel.
Fig. 14 is a schematic diagram showing the structure of a serial network composed of mapping units in one embodiment of the present application.
As shown in fig. 14, N-1 mapping units are sequentially connected in series, and a serial network for mapping boolean vectors and N-1 element pairs can be obtained.
The first input end of the mapping unit is used for inputting a first element in the element pair or connecting with the output end of the previous mapping unit;
The second input end of the mapping unit is used for inputting a second element in the element pair;
the first output end of the mapping unit is used for outputting output data obtained by performing exclusive OR operation on the first mapping parameter of the mapping unit and the input data of the first input end;
the second output end of the mapping unit is used for outputting output data obtained by performing exclusive-or operation on the second mapping parameter of the mapping unit and the input data of the first input end/the second input end.
Taking the serial network shown in fig. 14 as an example, mapping the element pairs of the input mapping unit according to the selected mapping rule to obtain the output parameters held by the first main body and the mapping parameters held by the second main body, and the implementation method may include:
the first mapping parameter of the mapping unit is XORed with the input data at the first input end of the mapping unit to obtain a first output parameter; when the value of the target element corresponding to the mapping unit in the Boolean vector is the first value, the second mapping parameter of the mapping unit is XORed with the input data at the first input end of the mapping unit to obtain a second output parameter; when the value of the target element corresponding to the mapping unit in the Boolean vector is a second value different from the first value, the second mapping parameter of the mapping unit is XORed with the input data at the second input end of the mapping unit to obtain the second output parameter. The first mapping parameter and the second mapping parameter of the mapping unit are mapping parameters held by the second subject, and the first output parameter is the output parameter held by the first subject.
Fig. 15 shows a schematic diagram of the structure of a parallel network composed of mapping units in one embodiment of the application.
As shown in fig. 15, N-1 mapping units are sequentially connected in parallel, and a parallel network for mapping boolean vectors and N-1 element pairs can be obtained.
The first input end of the mapping unit is used for inputting a first element in the element pair or inputting a second mapping parameter of the previous mapping unit;
the second input end of the mapping unit is used for inputting a second element in the element pair;
the first output end of the mapping unit is used for outputting output data obtained by performing exclusive OR operation on the first mapping parameter of the mapping unit and the input data of the first input end;
the second output end of the mapping unit is used for outputting output data obtained by performing exclusive-or operation on the second mapping parameter of the mapping unit and the input data of the first input end/the second input end.
Taking the parallel network shown in fig. 15 as an example, mapping the element pairs of the input mapping unit according to the selected mapping rule to obtain the output parameters held by the first main body and the mapping parameters held by the second main body, and the specific implementation method may include:
the first mapping parameter of the current mapping unit is XORed with the input data at the first input end of the current mapping unit to obtain a first output parameter. When the value of the target element corresponding to the current mapping unit in the Boolean vector is the first value, the second mapping parameter of the current mapping unit is XORed with the input data at the first input end of the current mapping unit to obtain a second output parameter, which is XORed with the data transmitted by the previous mapping unit and transmitted to the first output end and the second output end of the next mapping unit. When the value of the target element corresponding to the current mapping unit in the Boolean vector is a second value different from the first value, the second mapping parameter of the current mapping unit is XORed with the input data at the first input end of the current mapping unit to obtain a second output parameter, which is XORed with the data transmitted by the previous mapping unit and transmitted to the first output end of the next mapping unit, while a zero value is transmitted to the second output end of the next mapping unit. The first mapping parameter and the second mapping parameter of the current mapping unit are mapping parameters held by the second subject, and the output parameter held by the first subject is obtained by XORing the first output parameter with the data transmitted by the previous mapping unit.
For example, each adjacent pair of mapping units (1-switches) is assigned random values (as shown in fig. 11). Five 1-switches are executed with the input bool = (1,1,0,1,0), and the first subject Guest obtains the corresponding shares; the exclusive-OR operations are then performed according to (1,1,0,1,0), where a dotted line in the figure indicates that the target element of the Boolean vector input into the mapping unit has the value 1, meaning that no duplication is required and the XOR step is omitted; otherwise the XOR operation must be performed. Finally, the first subject Guest obtains its output shares and the second subject Host obtains the values r_i, which correspond to (s_1, s_2, s_3, s_3, s_5, s_5).
The following describes a data processing method provided by the embodiment of the present application in connection with a specific application scenario.
Let the input data length of the Guest and of the Host be m. First, the Guest chooses several non-repeating random numbers to replace its repeated data, and the Host simply deletes its repeated data. Then the Guest performs an m -> n cuckoo hash mapping and the Host performs a simple hash mapping; finally the Guest and the Host each obtain n Boolean shares. From its original input data and the cuckoo hash mapping, the Guest can build an n -> n map f, i.e. the positions of the random numbers used by the cuckoo hash for padding are mapped, in order, to positions appended after the Guest's input data set. For example, let m = 6 and n = 8, and let the cuckoo hash map the 6 non-repeating data items to the 6 positions (4,1,6,7,8,3); then there is no data at positions 2 and 5, two non-repeating random numbers are selected to fill those two positions, f maps 2 to 7 and 5 to 8, and the two random numbers are placed after the original data. The remaining entries of f map the PSI-Circuit output positions to the positions of the original data according to the cuckoo hash, i.e. f(4)=1, f(1)=2, f(6)=3, f(7)=4, f(8)=5, f(3)=6.
The Guest constructs two maps, ψ and φ, from the map f. The map ψ is constructed from the positions of the repeated elements that were replaced by random numbers, and φ = f^-1. For example, let (a, b, c, d, c, d) become (a, b, c, d, s0, s1) after its repeated elements are replaced by random numbers, and let it be mapped to (b, r0, s1, a, r1, c, d, s0) by the cuckoo hash, where r0 and r1 are the random numbers used to fill the empty slots. According to the positions of the repeated elements c and d, ψ maps position 8 to position 6 and position 3 to position 7, i.e., (1,2,3,4,5,6,7,8) -> (1,2,7,4,5,6,7,6). First, the Guest's map ψ and the Host's PSI-Circuit Boolean share are fed into OR: the Guest obtains an intermediate Boolean vector and the Host obtains a new Boolean share. The Guest then applies ψ to its own PSI-Circuit Boolean share and exclusive-ORs the result with the intermediate vector; the exclusive-ORed vector held by the Guest, together with the Host's new share, forms the Boolean sharing of the PSI-Circuit result after the ψ mapping. Next, the Guest's map φ and the Host's new share are fed into OP: the Guest obtains a second intermediate Boolean vector and the Host obtains its final Boolean share. Finally, the Guest applies the permutation φ to its share from the previous step and exclusive-ORs the result with the second intermediate vector; the resulting vector held by the Guest, together with the Host's final share, forms the Boolean sharing of the PSI-Circuit result after the f mapping. The last n - m Boolean shares (corresponding to the n - m positions filled with random numbers in the cuckoo hash) are then cut off, leaving Boolean shares in the same order as the Guest's original input data, where the true Boolean value of each row indicates whether that row's element is in the intersection.
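The local exclusive-OR steps above rely on the fact that applying a public index map to both Boolean shares of a vector, with one side re-randomized through OR/OP, again yields Boolean shares of the mapped vector. The following Python sketch illustrates this share algebra for the replication map ψ of the example; the function oblivious_map is only a trusted-dealer stand-in for the real two-party OR/OP switch network, and all names and the sample intersection bits are illustrative assumptions:

```python
import secrets

def xor_vec(a, b):
    return [x ^ y for x, y in zip(a, b)]

def apply_map(pi, v):
    """pi[i] is the (0-based) source position whose value goes to output position i;
    this covers both the replication psi and the permutation phi = f^-1."""
    return [v[pi[i]] for i in range(len(pi))]

def oblivious_map(pi, x_host):
    """Trusted-dealer stand-in for the OR/OP functionality: Guest learns only the
    random intermediate vector t, Host learns only y_host, and t XOR y_host = pi(x_host)."""
    t = [secrets.randbits(1) for _ in x_host]
    y_host = xor_vec(apply_map(pi, x_host), t)
    return t, y_host

# XOR (Boolean) shares of the PSI-Circuit result x, held by Guest and Host
x = [1, 0, 1, 1, 0, 0, 1, 0]                     # illustrative intersection bits
x_guest = [secrets.randbits(1) for _ in x]
x_host = xor_vec(x, x_guest)

psi = [0, 1, 6, 3, 4, 5, 6, 5]                   # the example map (1,2,7,4,5,6,7,6), 0-based
t, y_host = oblivious_map(psi, x_host)           # OR between Guest's psi and Host's share
y_guest = xor_vec(apply_map(psi, x_guest), t)    # Guest's local mapping + XOR step

assert xor_vec(y_guest, y_host) == apply_map(psi, x)   # valid Boolean shares of psi(x)
```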
It is worth noting that the OR that applies the replication map ψ and the OP that applies the permutation φ may be combined into a single pass that applies the composed map φ∘ψ, thereby reducing the computational overhead and the communication overhead by half.
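A short self-contained sketch (using the same 0-based source-index convention as the previous sketch; the concrete value of φ and that indexing convention are assumptions) shows why a single pass over the composed map φ∘ψ is equivalent to applying ψ and then φ:

```python
def apply_map(pi, v):
    return [v[pi[i]] for i in range(len(pi))]

def compose(phi, psi):
    """(phi o psi)[i] = psi[phi[i]]: applying the composed map once equals
    applying psi first and then phi."""
    return [psi[phi[i]] for i in range(len(phi))]

psi = [0, 1, 6, 3, 4, 5, 6, 5]        # replication map from the example (0-based)
phi = [3, 0, 5, 6, 7, 2, 1, 4]        # phi = f^-1 from the example (0-based), assuming
                                      # the convention output[i] = input[phi[i]]
x = list("abcdefgh")
assert apply_map(compose(phi, psi), x) == apply_map(phi, apply_map(psi, x))
# Hence one OR/OP pass on (phi o psi) replaces the two separate passes, halving the cost.
```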
The embodiment of the application provides a PSI-Circuit capable of processing repeated elements, and solves the problem that the Boolean sharing result of the existing PSI-Circuit is inconsistent with the order of the Guest's input data set. The repeated elements of the input data are replaced with random numbers, and after the Boolean sharing result of the PSI-Circuit is obtained, an OR is designed according to the mapping between the replaced repeated elements and the replacing random numbers, together with the permutation mapping of the original data in the PSI-Circuit, so that the final Boolean shares keep the order of the Guest's original input data.
The embodiment of the application solves the problem of circuit-based private set intersection with repeated elements while keeping the result consistent with the order of the Guest's original input data, so that the Guest's secret-shared data after the PSI-Circuit can be used to execute arbitrary multi-party secure computation, rather than only a very limited set of computations such as summation or variance. As shown in Table 1, compared with the scheme combining PSI-Circuit with homomorphic encryption, the computational overhead of the embodiment of the present application is reduced by about 3.44 times and the communication overhead is reduced by about 20% when tested with m = 10000 on two servers each equipped with an Intel(R) Xeon(R) Platinum 8255C CPU @ 2.50 GHz and 40 GB of memory.
TABLE 1

Scheme                                   Computational overhead    Communication overhead
Embodiments of the application           1.8 seconds               22 MB
PSI-Circuit + homomorphic encryption     8 seconds                 26.3 MB
The PSI-Circuit itself takes about 1.2 s under test, with a communication overhead of about 20 MB. With the overhead of the PSI-Circuit (incurred by both schemes) excluded, the embodiment of the application is therefore clearly superior to the order-preserving method based on the Paillier homomorphic encryption algorithm.
It should be noted that although the steps of the methods of the present application are depicted in the accompanying drawings in a particular order, this does not require or imply that the steps must be performed in that particular order, or that all illustrated steps be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform, etc.
The following describes embodiments of the apparatus of the present application that may be used to perform the data processing methods of the above-described embodiments of the present application. Fig. 16 schematically shows a block diagram of a data processing apparatus according to an embodiment of the present application. As shown in fig. 16, the data processing apparatus 1600 includes:
a first obtaining module 1610, configured to obtain a feature sequence held by a first body, where the feature sequence includes derived features obtained by performing a splicing process on original features of a target entity;
A second obtaining module 1620 configured to obtain an intersection secret piece held by the first subject, where a plaintext corresponding to the intersection secret piece is used to indicate whether each element in the feature sequence is intersection data of the data held by the first subject and the second subject;
a prediction module 1630 configured to predict whether the target entity is an entity of which the first principal and the second principal commonly hold features, according to the intersection secret shard held by the first principal.
Specific details of the data processing apparatus provided in each embodiment of the present application have been described in the corresponding method embodiments, and are not described herein.
Fig. 17 schematically shows a block diagram of a computer system of an electronic device for implementing an embodiment of the application.
It should be noted that, the computer system 1700 of the electronic device shown in fig. 17 is only an example, and should not impose any limitation on the functions and the application scope of the embodiments of the present application.
As shown in Fig. 17, the computer system 1700 includes a Central Processing Unit 1701 (CPU) that can execute various appropriate actions and processes according to a program stored in a Read-Only Memory 1702 (ROM) or a program loaded from a storage portion 1708 into a Random Access Memory 1703 (RAM). Various programs and data necessary for the system operation are also stored in the random access memory 1703. The CPU 1701, the ROM 1702, and the RAM 1703 are connected to each other via a bus 1704. An Input/Output interface 1705 (i.e., an I/O interface) is also connected to the bus 1704.
The following components are connected to the input/output interface 1705: an input portion 1706 including a keyboard, a mouse, and the like; an output portion 1707 including a Cathode Ray Tube (CRT) or Liquid Crystal Display (LCD), a speaker, and the like; a storage portion 1708 including a hard disk or the like; and a communication portion 1709 including a network interface card such as a local area network card or a modem. The communication portion 1709 performs communication processing via a network such as the Internet. A drive 1710 is also connected to the input/output interface 1705 as needed. A removable medium 1711, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is installed on the drive 1710 as needed, so that a computer program read therefrom is installed into the storage portion 1708 as needed.
In particular, the processes described in the various method flowcharts may be implemented as computer software programs according to embodiments of the application. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program can be downloaded and installed from a network via the communication portion 1709, and/or installed from the removable media 1711. The computer programs, when executed by the central processor 1701, perform the various functions defined in the system of the present application.
It should be noted that, the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-Only Memory (ROM), an erasable programmable read-Only Memory (Erasable Programmable Read Only Memory, EPROM), flash Memory, an optical fiber, a portable compact disc read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present application, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functions of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the application. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, and includes several instructions to cause a computing device (may be a personal computer, a server, a touch terminal, or a network device, etc.) to perform the method according to the embodiments of the present application.
Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the application disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains.
It is to be understood that the application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (17)

1. A method of data processing, comprising:
acquiring a feature sequence held by a first main body, wherein the feature sequence comprises derived features obtained by splicing original features of a target entity;
acquiring an intersection secret shard held by the first main body, wherein a plaintext corresponding to the intersection secret shard is used for indicating whether each element in the feature sequence is intersection data of data held by the first main body and the second main body;
predicting whether the target entity is an entity of which the first subject and the second subject commonly hold features according to the intersection secret shard held by the first subject.
2. The data processing method of claim 1, wherein predicting whether the target entity is an entity having a feature in common with the first principal and the second principal based on the intersection secret shard held by the first principal, comprises:
obtaining a prediction model obtained by training according to plaintext data, wherein the prediction model is used for representing a mapping relation between whether the plaintext data is intersection data and whether the plaintext data belongs to the same entity;
mapping the intersection secret shards held by the first main body according to the prediction model to obtain entity secret shards held by the first main body, wherein a plaintext corresponding to the entity secret shards is used for indicating whether the target entity is an entity with common holding characteristics of the first main body and the second main body;
And determining whether the target entity is an entity with the common holding characteristics of the first main body and the second main body according to the entity secret shards held by the first main body.
3. The data processing method of claim 2, wherein determining whether the target entity is an entity having characteristics in common with the first principal and the second principal based on the entity secret shards held by the first principal, comprises:
extracting the most significant bits from the entity secret shards held by the first body and the second body respectively;
performing exclusive OR operation on the two most significant bits to obtain a symbol bit plaintext corresponding to the entity secret piece;
and determining whether the target entity is an entity with the common holding characteristics of the first main body and the second main body according to the comparison result of the sign bit plaintext and the zero value.
4. The data processing method according to claim 2, wherein before mapping the intersection secret shards held by the first principal according to the predictive model, the method further comprises:
and converting the intersection secret shard held by the first main body from a Boolean shard into an arithmetic shard.
5. The data processing method according to claim 2, wherein mapping the intersection secret piece held by the first body according to the prediction model to obtain the entity secret piece held by the first body includes:
obtaining mapping parameters for multiplying input data from the prediction model;
converting the mapping parameters into state weights for representing different characteristic states of the input data, the state weights including intersection weights for representing that the input data is intersection data and non-intersection weights for representing that the input data is not intersection data;
and carrying out weighting operation on the intersection secret shards held by the first main body according to the state weight to obtain the entity secret shards held by the first main body.
6. The data processing method according to claim 5, wherein the state weight further includes a default value weight for indicating that the input data is a default value; and performing a weighted operation on the intersection secret shards held by the first main body according to the state weight to obtain entity secret shards held by the first main body, wherein the weighted operation comprises the following steps:
Weighting the intersection secret shards held by the first main body according to the intersection weight and the non-intersection weight to obtain a weighted result;
and correcting the weighted result according to the intersection weight and the default value weight to obtain the entity secret shard held by the first main body.
7. The data processing method according to claim 1, wherein acquiring the feature sequence held by the first subject includes:
acquiring a plurality of original features of a target entity held by a first subject;
splicing at least two original features to obtain derivative features of the target entity;
a feature sequence comprising the derived feature is obtained.
8. The data processing method according to claim 7, wherein the step of performing a splicing process on at least two of the original features to obtain the derivative features of the target entity includes:
classifying the original features of the target entity to obtain unique identification features capable of uniquely identifying the target entity and non-unique identification features incapable of uniquely identifying the target entity;
and performing splicing processing on at least two non-unique identification features to obtain derivative features of the target entity.
9. The method of claim 8, wherein obtaining a feature sequence comprising the derived feature comprises:
and combining the unique identification feature and the derivative feature into a feature sequence of the target entity.
10. The data processing method according to any one of claims 1 to 9, wherein acquiring the intersection secret shard held by the first main body includes:
performing repeated element replacement on the feature sequence held by the first main body to obtain a second data sequence without the repeated elements;
mapping elements in the second data sequence to a position sequence, and acquiring a position map for mapping elements in the position sequence to the feature sequence;
carrying out privacy set intersection on the position sequence and data held by a second main body to obtain first secret shards held by the first main body and the second main body respectively, wherein plaintext corresponding to the first secret shards is used for indicating whether intersection data exists between the position sequence and the data held by the second main body;
and mapping the first secret shard held by the first main body according to the position mapping to obtain an intersection secret shard held by the first main body, wherein a plaintext corresponding to the intersection secret shard is used for indicating whether each element in the feature sequence is the intersection data.
11. The data processing method according to claim 10, wherein mapping the first secret shard held by the first main body according to the position mapping to obtain the intersection secret shard held by the first main body includes:
obtaining a first sub-map and a second sub-map corresponding to the position mapping, wherein the first sub-map is used for recovering repeated elements in a data sequence, and the second sub-map is used for recovering the position order of the data sequence;
recovering repeated elements in the first secret shard held by the first main body according to the first sub-map to obtain a third secret shard held by the first main body;
and according to the second sub-map, restoring the position order of each element in the third secret shard held by the first main body to be the same as that of the feature sequence, to obtain the intersection secret shard held by the first main body.
12. The data processing method according to claim 11, wherein restoring the position order of each element in the third secret shard held by the first main body to be the same as that of the feature sequence according to the second sub-map, to obtain the intersection secret shard held by the first main body, includes:
performing an oblivious permutation on the second sub-map and the third secret shard held by the second main body to obtain a second intermediate vector held by the first main body and an intersection secret shard held by the second main body, wherein the oblivious permutation performs a permutation mapping on elements in a data sequence on the premise of not revealing the respectively held data;
and mapping the third secret shard held by the first main body according to the second sub-map, and performing an exclusive-OR operation with the second intermediate vector held by the first main body to obtain the intersection secret shard held by the first main body.
13. The data processing method according to claim 11, wherein recovering the repeated elements in the first secret shard held by the first main body according to the first sub-map to obtain the third secret shard held by the first main body includes:
performing oblivious copying on the first sub-map and the first secret shard held by the second main body to obtain a first intermediate vector held by the first main body and a third secret shard held by the second main body, wherein the oblivious copying copies and maps elements in a data sequence on the premise of not revealing the respectively held data;
and mapping the first secret shard held by the first main body according to the first sub-map, and performing an exclusive-OR operation with the first intermediate vector held by the first main body to obtain the third secret shard held by the first main body.
14. A data processing apparatus, comprising:
the first acquisition module is configured to acquire a feature sequence held by a first main body, wherein the feature sequence comprises derivative features obtained by splicing original features of a target entity;
a second obtaining module configured to obtain an intersection secret piece held by the first body, where plaintext corresponding to the intersection secret piece is used to indicate whether each element in the feature sequence is intersection data of data held by the first body and the second body;
a prediction module configured to predict whether the target entity is an entity of which the first principal and the second principal commonly hold features, according to the intersection secret shard held by the first principal.
15. A computer readable medium, characterized in that the computer readable medium has stored thereon a computer program which, when executed by a processor, implements the data processing method of any of claims 1 to 13.
16. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to cause the electronic device to perform the data processing method of any one of claims 1 to 13 via execution of the executable instructions.
17. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the data processing method of any of claims 1 to 13.
CN202211436177.7A 2022-11-16 2022-11-16 Data processing method, device, medium and electronic equipment Pending CN116955857A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211436177.7A CN116955857A (en) 2022-11-16 2022-11-16 Data processing method, device, medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211436177.7A CN116955857A (en) 2022-11-16 2022-11-16 Data processing method, device, medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN116955857A true CN116955857A (en) 2023-10-27

Family

ID=88455357

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211436177.7A Pending CN116955857A (en) 2022-11-16 2022-11-16 Data processing method, device, medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN116955857A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117150050A (en) * 2023-10-31 2023-12-01 卓世科技(海南)有限公司 Knowledge graph construction method and system based on large language model
CN117150050B (en) * 2023-10-31 2024-01-26 卓世科技(海南)有限公司 Knowledge graph construction method and system based on large language model
CN117725621A (en) * 2024-02-08 2024-03-19 腾讯科技(深圳)有限公司 Data processing method, device, equipment and readable storage medium
CN117725621B (en) * 2024-02-08 2024-05-28 腾讯科技(深圳)有限公司 Data processing method, device, equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication