CN117010002B

CN117010002B - Sample identifier alignment method and device, electronic equipment and storage medium

Info

Publication number: CN117010002B
Application number: CN202311275065.2A
Authority: CN
Inventors: 赵恢强
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2023-09-28
Filing date: 2023-09-28
Publication date: 2024-01-05
Anticipated expiration: 2043-09-28
Also published as: CN117010002A

Abstract

The application provides a method and device for aligning sample identifiers, an electronic device, and a storage medium. The method comprises: for a plurality of first data samples, encrypting the first sample identifier of each first data sample to obtain a plurality of first encrypted identifiers; generating an index group corresponding to the plurality of first encrypted identifiers, wherein the index group comprises at least one sub-index group, and the indexes in a sub-index group are used to distinguish identical first encrypted identifiers; sending the plurality of first encrypted identifiers and the index group to a receiver, wherein the first encrypted identifiers and the index group are used by the receiver to select, based on the index group, target sample identifiers from the plurality of first encrypted identifiers and a plurality of second encrypted identifiers; and receiving the target sample identifiers sent by the receiver, and determining the aligned sample identifiers according to the target sample identifiers and the plurality of first encrypted identifiers. The method and device can improve the speed of sample alignment.

Description

Sample identifier alignment method and device, electronic equipment and storage medium

Technical Field

The present disclosure relates to the field of data processing technologies, and in particular, to a method and apparatus for aligning sample identifiers, an electronic device, and a storage medium.

Background

Secure multi-party computation methods can be broadly divided into two categories. One is noise-based, represented by differential privacy (Differential Privacy). The other is cryptography-based: the raw data is encoded or encrypted so that it is difficult to recover the raw data from the encrypted data; the oblivious transfer (OT, Oblivious Transfer) algorithm is a common example.

However, when repeated keys (primary keys) are present in the multi-party data, sample alignment using the above methods becomes slower.

Disclosure of Invention

The embodiment of the application provides an alignment method and device for sample identification, electronic equipment and a storage medium, which can improve the speed of sample alignment.

The technical scheme of the embodiment of the application is realized as follows:

An embodiment of the application provides a method for aligning sample identifiers, which comprises the following steps:

for a plurality of first data samples, encrypting the first sample identifier of each first data sample to obtain a plurality of first encrypted identifiers; when identical first encrypted identifiers exist among the plurality of first encrypted identifiers, generating an index group corresponding to the plurality of first encrypted identifiers, wherein the index group comprises at least one sub-index group, and the indexes in a sub-index group are used to distinguish the identical first encrypted identifiers; sending the plurality of first encrypted identifiers and the index group to a receiver, wherein the first encrypted identifiers and the index group are used by the receiver to select, based on the index group, target sample identifiers from the plurality of first encrypted identifiers and a plurality of second encrypted identifiers, the indexes corresponding to identical target sample identifiers are the same, and the second encrypted identifiers are obtained by encrypting the second sample identifiers of the receiver; and receiving the target sample identifiers sent by the receiver, and determining the aligned sample identifiers according to the target sample identifiers and the plurality of first encrypted identifiers.
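The sender-side steps above can be sketched as follows. This is only an illustrative sketch under stated assumptions: salted SHA-256 stands in for the unspecified encryption step, the running-counter index scheme is one possible way to distinguish identical encrypted identifiers, and the names `encrypt_identifier` and `build_sender_message` are hypothetical.

```python
import hashlib
from collections import defaultdict


def encrypt_identifier(sample_id: str, salt: bytes = b"shared-demo-salt") -> str:
    """Stand-in for the encryption step (assumption: salted SHA-256)."""
    return hashlib.sha256(salt + sample_id.encode("utf-8")).hexdigest()


def build_sender_message(first_sample_ids):
    """Encrypt each identifier; attach an index group only if duplicates exist."""
    first_encrypted = [encrypt_identifier(s) for s in first_sample_ids]
    has_duplicates = len(set(first_encrypted)) < len(first_encrypted)
    index_group = None
    if has_duplicates:
        seen = defaultdict(int)  # how many times each encrypted identifier was seen
        index_group = []
        for enc in first_encrypted:
            index_group.append(seen[enc])  # identical identifiers get 0, 1, 2, ...
            seen[enc] += 1
    return first_encrypted, index_group


first_encrypted, index_group = build_sender_message(["u1", "u2", "u1", "u3", "u1"])
```

In this sketch the pair `(first_encrypted, index_group)` plays the role of what is sent to the receiver; identifiers that occur only once all carry the same index.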

The embodiment of the application provides an alignment method of sample marks, which comprises the following steps: encrypting a second sample identifier of each second data sample aiming at a plurality of second data samples to obtain a plurality of second encrypted identifiers; receiving a plurality of first encryption identifications and index groups sent by a sender; the first encryption identifier is obtained by encrypting a first sample identifier of the sender; when the same first encryption identifier exists in the plurality of first encryption identifiers, the index group comprises at least one sub-index group, and indexes in the sub-index group are used for distinguishing the same first encryption identifier; selecting a target sample identifier from the plurality of first encrypted identifiers and the plurality of second encrypted identifiers based on the index group; the indexes corresponding to the same target sample identifiers are the same; transmitting the target sample identification to the sender; the target sample identifier is configured to determine, by the sender, an aligned sample identifier according to the target sample identifier and the plurality of first encrypted identifiers.

An embodiment of the present application provides a device for aligning sample identifiers, including: a first encryption module, configured to encrypt, for a plurality of first data samples, the first sample identifier of each first data sample to obtain a plurality of first encrypted identifiers; an index generation module, configured to generate an index group corresponding to the plurality of first encrypted identifiers when identical first encrypted identifiers exist among the plurality of first encrypted identifiers, wherein the index group comprises at least one sub-index group, and the indexes in a sub-index group are used to distinguish the identical first encrypted identifiers; a first identifier sending module, configured to send the plurality of first encrypted identifiers and the index group to a receiver, wherein the first encrypted identifiers and the index group are used by the receiver to select, based on the index group, target sample identifiers from the plurality of first encrypted identifiers and a plurality of second encrypted identifiers, the indexes corresponding to identical target sample identifiers are the same, and the second encrypted identifiers are obtained by encrypting the second sample identifiers of the receiver; and a sample identifier determining module, configured to receive the target sample identifiers sent by the receiver and determine the aligned sample identifiers according to the target sample identifiers and the plurality of first encrypted identifiers.

In the above scheme, the index generating module is further configured to divide the plurality of first encrypted identifiers into identifier groups to obtain a first encrypted identifier group and a second encrypted identifier group, wherein the first encrypted identifiers in the first encrypted identifier group are identical to one another, and the first encrypted identifiers in the second encrypted identifier group are different from one another; and to add corresponding indexes to the first encrypted identifiers in the first encrypted identifier group and to the first encrypted identifiers in the second encrypted identifier group, respectively, to obtain the index group corresponding to the plurality of first encrypted identifiers.

In the above scheme, the index generating module is further configured to add a different index to each first encrypted identifier in the first encrypted identifier group, so as to obtain the sub-index group corresponding to the first encrypted identifier group; to add the same index to each first encrypted identifier in the second encrypted identifier group; and to construct the index group corresponding to the plurality of first encrypted identifiers from the sub-index group corresponding to each first encrypted identifier group and the indexes in each second encrypted identifier group.
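A minimal sketch of this grouping and index assignment follows. It assumes that indexes are small integers and that the index shared by the non-repeated identifiers is 0; the function names and the representation of a sub-index group as a list are hypothetical choices made for illustration.

```python
from collections import Counter


def partition_identifiers(encrypted_ids):
    """Split row positions into groups of identical identifiers and unique ones."""
    counts = Counter(encrypted_ids)
    duplicated = {}  # encrypted identifier -> row positions (a "first" group)
    unique = []      # row positions of identifiers that occur once ("second" group)
    for position, enc in enumerate(encrypted_ids):
        if counts[enc] > 1:
            duplicated.setdefault(enc, []).append(position)
        else:
            unique.append(position)
    return duplicated, unique


def assign_indexes(encrypted_ids, shared_index=0):
    """Different indexes inside each duplicate group, one shared index elsewhere."""
    duplicated, unique = partition_identifiers(encrypted_ids)
    index_group = [None] * len(encrypted_ids)
    sub_index_groups = {}
    for enc, positions in duplicated.items():
        sub = list(range(shared_index, shared_index + len(positions)))
        sub_index_groups[enc] = sub
        for pos, idx in zip(positions, sub):
            index_group[pos] = idx
    for pos in unique:
        index_group[pos] = shared_index
    return index_group, sub_index_groups


ids = ["a", "b", "a", "c", "b", "a"]
index_group, sub_groups = assign_indexes(ids)
```

Here `"a"` and `"b"` each form a group of identical identifiers with their own sub-index group, while `"c"` receives the shared index.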

In the above scheme, the index generating module is further configured to sort the first encrypted identifiers in the first encrypted identifier group to obtain a first encrypted identifier sequence; and sequentially adding indexes which are arranged from small to large or from large to small for each first encryption identifier based on the sequence of each first encryption identifier in the first encryption identifier sequence, so as to obtain a sub-index group corresponding to the first encryption identifier group.
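As an illustration of the ordered variant, the sketch below assigns ascending or descending indexes to the members of one group of identical encrypted identifiers. Because the identifiers in such a group are equal, the sketch assumes they are ordered by their original row position; that tie-breaker, the starting value 0, and the function name are assumptions rather than details given in the text.

```python
def ordered_sub_index_group(row_positions, start=0, descending=False):
    """Map each row position in a duplicate group to a sequential index."""
    ordered = sorted(row_positions)  # assumed ordering: by original row position
    n = len(ordered)
    if descending:
        values = range(start + n - 1, start - 1, -1)  # large to small
    else:
        values = range(start, start + n)              # small to large
    return dict(zip(ordered, values))


ascending = ordered_sub_index_group([7, 2, 5])
descending = ordered_sub_index_group([7, 2, 5], descending=True)
```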

In the above scheme, the index generating module is further configured to randomly add different indexes to each of the first encrypted identifiers in the first encrypted identifier group, so as to obtain a sub-index group corresponding to the first encrypted identifier group.
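The random variant could look like the following sketch, which shuffles a range of distinct integers so that every member of a duplicate group still receives a different index. The integer range, the `start` parameter, and the function name are hypothetical.

```python
import random


def random_sub_index_group(group_size, start=0, rng=None):
    """Return group_size distinct indexes in random order."""
    if rng is None:
        rng = random.Random()
    indexes = list(range(start, start + group_size))
    rng.shuffle(indexes)  # order is random, values stay pairwise different
    return indexes


sub_group = random_sub_index_group(4, rng=random.Random(2023))
```

Passing a seeded `random.Random` instance, as in the usage line, makes the assignment reproducible for testing.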

In the above scheme, the index generating module is further configured to add an index to each first encrypted identifier, wherein the first indexes corresponding to identical first encrypted identifiers are different from one another, the second indexes corresponding to the remaining first encrypted identifiers (those other than the identical first encrypted identifiers) are the same, and the second index is identical to at least one of the first indexes; and to construct the index group corresponding to the plurality of first encrypted identifiers from the first indexes and the second indexes.

In the above scheme, the index generating module is further configured to add different indexes to the same first encryption identifier to obtain a corresponding sub-index group; and generating index groups corresponding to the plurality of first encryption identifications according to at least one sub-index group.

In the above scheme, the sample identifier determining module is further configured to perform intersection on the target sample identifier and the plurality of first encrypted identifiers, so as to obtain an aligned sample identifier.
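A minimal sketch of this final sender-side step is shown below. It assumes that the target sample identifiers arrive as encrypted identifier strings; returning the matching row positions in addition to the intersection is an illustrative extra, not something stated in the text.

```python
def determine_aligned_identifiers(target_ids, first_encrypted_ids):
    """Intersect the receiver's target identifiers with the sender's own list."""
    aligned = set(target_ids) & set(first_encrypted_ids)
    # Row positions on the sender side whose identifier is in the intersection
    rows = [i for i, enc in enumerate(first_encrypted_ids) if enc in aligned]
    return aligned, rows


aligned, rows = determine_aligned_identifiers(
    ["h1", "h3"], ["h1", "h2", "h1", "h3", "h1"]
)
```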

An embodiment of the present application provides an alignment device for sample identification, including: the second encryption module is used for encrypting a second sample identifier of each second data sample aiming at a plurality of second data samples to obtain a plurality of second encrypted identifiers; the index receiving module is used for receiving a plurality of first encryption identifications and index groups sent by the sender; the first encryption identifier is obtained by encrypting a first sample identifier of the sender; when the same first encryption identifier exists in the plurality of first encryption identifiers, the index group comprises at least one sub-index group, and indexes in the sub-index group are used for distinguishing the same first encryption identifier; the identification selecting module is used for selecting and acquiring a target sample identification from the plurality of first encryption identifications and the plurality of second encryption identifications based on the index group; the indexes corresponding to the same target sample identifiers are the same; a second identifier sending module, configured to send the target sample identifier to the sender; the target sample identifier is configured to determine, by the sender, an aligned sample identifier according to the target sample identifier and the plurality of first encrypted identifiers.

In the above scheme, the identifier selecting module is further configured to perform intersection on the plurality of first encrypted identifiers and the plurality of second encrypted identifiers to obtain aligned sample identifiers; the index of the sample identifier is the same as the index of the first encryption identifier corresponding to the sample identifier; and filtering the sample identification based on the index group to obtain at least one target sample identification.

In the above scheme, the identifier selecting module is further configured to, when the indexes in a sub-index group are arranged in ascending order starting from an initial number, filter out from the sample identifiers those whose index is greater than the initial number, so as to obtain at least one target sample identifier.
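The receiver-side selection described in the two preceding paragraphs can be sketched as follows, assuming the sender's data arrives as `(encrypted identifier, index)` pairs and that each sub-index group counts upward from an initial number of 0. The function and parameter names are hypothetical.

```python
def select_target_identifiers(first_with_index, second_encrypted_ids, initial_index=0):
    """Intersect first, then drop entries whose index exceeds the initial number."""
    second = set(second_encrypted_ids)
    # Each intersected entry keeps the index of its first encrypted identifier
    intersected = [(enc, idx) for enc, idx in first_with_index if enc in second]
    # Filtering by index leaves one entry per repeated identifier
    return [enc for enc, idx in intersected if idx <= initial_index]


first_with_index = [("h1", 0), ("h2", 0), ("h1", 1), ("h3", 0), ("h1", 2)]
second_encrypted_ids = ["h1", "h3", "h9"]
targets = select_target_identifiers(first_with_index, second_encrypted_ids)
```

In this example the three copies of `"h1"` collapse to a single target identifier, which is how the index group removes repeated data before the result is returned to the sender.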

An embodiment of the present application provides an electronic device, including: a memory for storing computer executable instructions; and the processor is used for realizing the sample identification alignment method provided by the embodiment of the application when executing the computer executable instructions stored in the memory.

The embodiment of the application provides a computer readable storage medium, which stores a computer program or computer executable instructions for implementing the sample identifier alignment method provided by the embodiment of the application when the computer program or the computer executable instructions are executed by a processor.

Embodiments of the present application provide a computer program product comprising a computer program or computer executable instructions that, when executed by a processor, implement a method for aligning sample identifiers provided by embodiments of the present application.

The embodiment of the application has the following beneficial effects:

In the embodiment of the application, the first sample identifier is encrypted to obtain the first encrypted identifier, and the second sample identifier is encrypted to obtain the second encrypted identifier, which ensures that neither party exposes its own original data during sample alignment. A corresponding index group is added for the first encrypted identifiers, and the first encrypted identifiers and the index group are sent to the receiver together, so that the receiver can select target sample identifiers from the plurality of first encrypted identifiers and the plurality of second encrypted identifiers based on the index group. Because the index group comprises at least one sub-index group, and the indexes in a sub-index group are used to distinguish repeated first encrypted identifiers, the receiver can remove the repeated data based on the index group to obtain the target sample identifiers; the aligned sample identifiers can then be determined from the target sample identifiers sent by the receiver and the first encrypted identifiers. When the embodiment of the application is applied to a secure sample alignment (PSI) scenario, the speed of sample alignment can be improved if the alignment keys in the data of the two parties contain repeated data.

Drawings

FIG. 1 is a schematic structural diagram of an alignment system architecture for sample identification provided in an embodiment of the present application;

FIG. 2A is a schematic structural diagram of an alignment device for sample identifiers according to an embodiment of the present application;

FIG. 2B is a schematic structural diagram of a second alignment device for sample identifiers according to an embodiment of the present disclosure;

FIG. 3A is a first flowchart of a method for aligning sample identifiers according to an embodiment of the present application;

FIG. 3B is a second flowchart of a method for aligning sample identifiers according to an embodiment of the present application;

FIG. 3C is a third flowchart of a method for aligning sample identifiers according to an embodiment of the present application;

FIG. 3D is a fourth flowchart of a method for aligning sample identifiers according to an embodiment of the present application;

FIG. 3E is a fifth flowchart of a method for aligning sample identifiers according to an embodiment of the present application;

FIG. 3F is a sixth flowchart of a method for aligning sample identifiers according to an embodiment of the present application;

FIG. 4A is a seventh flowchart of a method for aligning sample identifiers according to an embodiment of the present application;

FIG. 4B is an eighth flowchart of a method for aligning sample identifiers according to an embodiment of the present application;

FIG. 5 is a flowchart of a method for aligning sample identifiers according to an embodiment of the present application;

FIG. 6 is an application environment diagram of a method for aligning sample identifiers according to an embodiment of the present application.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the present application will be described in further detail with reference to the accompanying drawings, and the described embodiments should not be construed as limiting the present application, and all other embodiments obtained by those skilled in the art without making any inventive effort are within the scope of the present application.

In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict.

In the following description, the terms "first", "second", "third" and the like are merely used to distinguish similar objects and do not represent a specific ordering of the objects, it being understood that the "first", "second", "third" may be interchanged with a specific order or sequence, as permitted, to enable embodiments of the application described herein to be practiced otherwise than as illustrated or described herein.

Unless defined otherwise, all technical and scientific terms used in the embodiments of the present application have the same meaning as commonly understood by one of ordinary skill in the art. The terminology used in the embodiments of the application is for the purpose of describing the embodiments of the application only and is not intended to be limiting of the application.

Before the embodiments of the present application are described in further detail, the terms and expressions involved in the embodiments are explained; the following explanations apply to these terms and expressions.

1) Secure multi-party computation (MPC, Secure Multi-party Computation) means that a plurality of parties cooperatively complete a common computing task according to an agreed secure computation protocol without revealing their respective data.

2) Oblivious transfer (OT, Oblivious Transfer) refers to the following: a sender A has n messages, and a receiver B wants to receive k of them; the protocol ensures that A does not learn which k messages B wants, and that B learns nothing about A's messages other than those k messages.

3) Vertical federated learning (Vertical Federated Learning), also known as sample-aligned federated learning (Sample-Aligned Federated Learning), is essentially a combination of features: the training samples of the participants overlap heavily, but the data features of each sample overlap little. The general process of vertical federated learning is as follows: first, encrypted sample alignment is performed on the participants' data to obtain the overlapping sample data; the central node generates a key pair and sends the public key to each participant for encrypting the data to be transmitted; each participant initializes the model parameters related to itself and trains locally on the selected sample data to produce the intermediate feature results related to itself; each participant encrypts its trained intermediate feature results with the public key (generally using homomorphic encryption) and then exchanges them with the other participants; each participant continues training based on the encrypted intermediate results obtained through the exchange and sends the trained model parameters (still encrypted) to the central node; after decryption, the central node returns the respective model parameters to each participant; and each participant updates its own model parameters. Throughout the process, no participant learns the data or features of the other parties, and after training each participant obtains only the model parameters related to itself.

4) Private set intersection (PSI, Private Set Intersection) is a classical problem in the field of secure multi-party computation. It requires the participants to jointly compute the intersection of their sets without disclosing their local sets to each other, and without revealing any information beyond the intersection to any participant. In the vertical federated learning scenario, PSI is also called secure sample alignment or database collision: multiple parties (typically two) cannot obtain any information other than the intersection during sample alignment. That is, the parties first need to compute the intersection of their training sample ID (Identity document, identifier) sets, and then perform the subsequent vertical federated model training based on the computed intersection of training sample IDs.

5) The alignment of sample identifiers is also referred to as encrypted entity alignment. A sample identifier is the field that uniquely identifies a record in a participant's data and can be understood as a key, such as an ID, an identity card number, or a mobile phone number. Sample identifier alignment finds the records common to the two parties' data sets according to the data and identifiers selected by both parties, and stores those records as the alignment result in the same order. For example, the user groups of company A and company B, which train a vertical federated model together, are different; an encryption-based user ID alignment technique ensures that party A and party B can align on their common users without exposing their respective original data.
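To make the notion of an alignment result stored "in the same order" concrete, the sketch below aligns two small record sets on their keys. It operates on plaintext keys purely for illustration, whereas the scheme described here works on encrypted identifiers; the record contents, the sorted ordering, and the function name are hypothetical.

```python
def align_records(records_a, records_b):
    """Keep only the records common to both parties, in one shared key order."""
    common_keys = sorted(set(records_a) & set(records_b))
    aligned_a = [records_a[k] for k in common_keys]
    aligned_b = [records_b[k] for k in common_keys]
    return common_keys, aligned_a, aligned_b


party_a = {"u3": {"age": 41}, "u1": {"age": 30}, "u2": {"age": 25}}
party_b = {"u2": {"label": 1}, "u4": {"label": 0}, "u3": {"label": 0}}
keys, rows_a, rows_b = align_records(party_a, party_b)
```

After alignment, row i of `rows_a` and row i of `rows_b` describe the same user, which is the property that subsequent vertical federated training relies on.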

Secure multi-party computation originates from Yao's Millionaires' problem, proposed in 1982, in which two millionaires, Alice and Bob, wish to know which of them is richer without disclosing their actual wealth. The millionaires' problem is an important problem in cryptography, and its solutions are applied in e-commerce and data mining. Secure multi-party computation methods can be broadly divided into two categories. One is noise-based, represented by differential privacy (Differential Privacy). The other is cryptography-based: the original data is encoded or encrypted so that other users can hardly restore the original data from the encrypted data. This category mainly comprises homomorphic encryption (HE: Homomorphic Encryption), oblivious transfer (OT: Oblivious Transfer), garbled circuits (GC: Garbled Circuit), secret sharing (SS: Secret Sharing), and the like. Secure multi-party computation is the cryptographic basis for many applications, such as electronic voting, threshold signatures, and electronic auctions.

OT is a cryptographic protocol and the cryptographic idea behind garbled circuits. It solves the following problem: suppose A has n data items $m_1, \dots, m_n$, and B wants to know one of them, $m_i$. By means of the OT protocol, B obtains $m_i$ but learns nothing about the other items $m_j$ ($j \neq i$); at the same time, A does not know which item $m_i$ B obtained. OT algorithms take forms such as 1-out-of-n and k-out-of-n. In practical applications, the OT algorithm may be implemented in various ways, for example, based on discrete logarithms or based on RSA (an encryption algorithm) principles.
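The input/output behaviour of oblivious transfer can be modelled with the sketch below. It is not a cryptographic implementation: a single function plays the role of a trusted intermediary purely to show what each party ends up with, and the function name and return convention are hypothetical.

```python
def ideal_k_out_of_n_ot(sender_messages, receiver_choices):
    """Model what each party learns from k-out-of-n OT (no cryptography here)."""
    if len(set(receiver_choices)) != len(receiver_choices):
        raise ValueError("choices must be distinct")
    if any(not 0 <= i < len(sender_messages) for i in receiver_choices):
        raise IndexError("choice out of range")
    # The receiver learns only the chosen messages
    receiver_view = [sender_messages[i] for i in receiver_choices]
    # The sender learns nothing about which messages were chosen
    sender_view = None
    return sender_view, receiver_view


# 1-out-of-n: the receiver picks a single message
sender_view, receiver_view = ideal_k_out_of_n_ot(["m0", "m1", "m2", "m3"], [2])
```

A real protocol replaces the trusted function with message exchanges built on, for example, discrete-logarithm or RSA assumptions, so that the same views are obtained without any trusted party.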

In the related art, secure sample alignment based on the OT algorithm is the most common and best-performing technique, and it performs particularly well when the keys to be aligned by the two parties contain no repeated data. However, when repeated intersection keys exist in the multi-party data, its performance cannot meet requirements, and the speed of sample alignment drops significantly.

In view of at least one of the foregoing problems in the related art, embodiments of the present application provide a method, apparatus, device, computer-readable storage medium, and computer program product for aligning sample identifiers, which can increase the speed of sample alignment when repeated data exists in the multi-party data. Exemplary applications of the electronic device provided by the embodiments of the present application are described below. The electronic device may be implemented as a terminal or as a server. In one implementation, the electronic device may be any terminal with a data processing function, such as a notebook computer, a tablet computer, a desktop computer, a mobile phone, a portable music player, a personal digital assistant, a dedicated messaging device, a portable game device, an intelligent robot, an intelligent home appliance, or an intelligent vehicle-mounted device. In another implementation, the electronic device may be a server, where the server may be an independent physical server, a server cluster or distributed system formed by multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDN, Content Delivery Network), big data, and artificial intelligence platforms. The terminal and the server may be connected directly or indirectly through wired or wireless communication, which is not limited in the embodiments of the present application.
In the following, an exemplary application in which the electronic device is implemented as a server is described.

Referring to FIG. 1, FIG. 1 is a schematic architecture diagram of a sample identifier alignment system 100 according to an embodiment of the present application. To support an application for aligning sample identifiers, a terminal 400-1 is connected to a server 200-1 through a network 300-1, and a terminal 400-2 is connected to a server 200-2 through a network 300-2; the network 300-1 and the network 300-2 may each be a wide area network, a local area network, or a combination of the two.

When secure sample alignment between two parties is performed, users may input a requirement to perform a sample alignment operation (for example, vertical federated learning) through the terminal 400-1 and the terminal 400-2, respectively. In response to the requirement, the terminal 400-1 generates an alignment request for sample identifiers and sends it to the server 200-1 through the network 300-1; likewise, the terminal 400-2 generates an alignment request for sample identifiers and sends it to the server 200-2 through the network 300-2. After receiving the alignment request, the server 200-1, in response to the request, encrypts, for a plurality of first data samples, the first sample identifier of each first data sample to obtain a plurality of first encrypted identifiers; when identical first encrypted identifiers exist among the plurality of first encrypted identifiers, it generates an index group corresponding to the plurality of first encrypted identifiers, wherein the index group comprises at least one sub-index group, and the indexes in a sub-index group are used to distinguish the identical first encrypted identifiers; and it sends the plurality of first encrypted identifiers and the index group to the receiver (i.e., the server 200-2).
After receiving the alignment request, the server 200-2, in response to the request, encrypts, for a plurality of second data samples, the second sample identifier of each second data sample to obtain a plurality of second encrypted identifiers. After receiving the plurality of first encrypted identifiers and the index group sent by the server 200-1, the server 200-2 selects, based on the index group, target sample identifiers from the plurality of first encrypted identifiers and the plurality of second encrypted identifiers, wherein the indexes corresponding to identical target sample identifiers are the same, and returns the target sample identifiers to the server 200-1. The server 200-1 receives the target sample identifiers sent by the receiver and determines the aligned sample identifiers according to the target sample identifiers and the plurality of first encrypted identifiers.

In some embodiments, the method for aligning sample identifiers in the embodiments of the present application may also be performed by the terminals 400-1 and 400-2. That is, users may input a requirement to perform a sample alignment operation (such as vertical federated learning) through the terminals 400-1 and 400-2, respectively. In response to the requirement, the terminal 400-1 encrypts, for a plurality of first data samples, the first sample identifier of each first data sample to obtain a plurality of first encrypted identifiers; when identical first encrypted identifiers exist among the plurality of first encrypted identifiers, it generates an index group corresponding to the plurality of first encrypted identifiers; and it sends the plurality of first encrypted identifiers and the index group to the terminal 400-2. After receiving the plurality of first encrypted identifiers and the index group sent by the terminal 400-1, the terminal 400-2 selects, based on the index group, target sample identifiers from the plurality of first encrypted identifiers and the plurality of second encrypted identifiers, and returns the target sample identifiers to the terminal 400-1. The terminal 400-1 receives the target sample identifiers sent by the terminal 400-2 and determines the aligned sample identifiers according to the target sample identifiers and the plurality of first encrypted identifiers.

Referring to FIG. 2A, FIG. 2A is a schematic structural diagram of a terminal 400-1 that performs the sample identifier alignment method provided in an embodiment of the present application. The terminal 400-1 shown in FIG. 2A includes: at least one processor 410, a memory 450, at least one network interface 420, and a user interface 430. The various components in the terminal 400-1 are coupled together by a bus system 440. It can be understood that the bus system 440 is used to enable connection and communication between these components. In addition to a data bus, the bus system 440 includes a power bus, a control bus, and a status signal bus. For clarity of illustration, however, the various buses are all labeled as the bus system 440 in FIG. 2A.

The processor 410 may be an integrated circuit chip having signal processing capabilities, such as a general purpose processor (for example, a microprocessor or any conventional processor), a digital signal processor (Digital Signal Processor, DSP), or another programmable logic device, discrete gate or transistor logic device, or discrete hardware component.

The user interface 430 includes one or more output devices 431, including one or more speakers and/or one or more visual displays, that enable presentation of the media content. The user interface 430 also includes one or more input devices 432, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.

Memory 450 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard drives, optical drives, and the like. Memory 450 optionally includes one or more storage devices physically remote from processor 410.

Memory 450 includes volatile memory or non-volatile memory, and may also include both volatile and non-volatile memory. The non-volatile memory may be a read only memory (Read Only Memory, ROM), and the volatile memory may be a random access memory (Random Access Memory, RAM). The memory 450 described in the embodiments herein is intended to comprise any suitable type of memory.

In some embodiments, memory 450 is capable of storing data to support various operations, examples of which include programs, modules and data structures, or subsets or supersets thereof, as exemplified below.

an operating system 451 including system programs, e.g., a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and handling hardware-based tasks;

a network communication module 452 for accessing other electronic devices via one or more (wired or wireless) network interfaces 420, the exemplary network interface 420 comprising: Bluetooth, Wireless Fidelity (WiFi), universal serial bus (Universal Serial Bus, USB), etc.;

a presentation module 453 for enabling presentation of information (e.g., a user interface for operating peripheral devices and displaying content and information) via one or more output devices 431 (e.g., a display screen, speakers, etc.) associated with the user interface 430;

an input processing module 454 for detecting one or more user inputs or interactions from one of the one or more input devices 432 and translating the detected inputs or interactions.

In some embodiments, the sample identifier alignment device in the user terminal 400-1 provided in the embodiments of the present application may be implemented in software. Fig. 2A shows the sample identifier alignment device 455A stored in the memory 450, which may be software in the form of a program, a plug-in, or the like, and includes the following software modules: the first encryption module 4551A, the index generation module 4552A, the first identifier transmission module 4553A and the sample identifier determination module 4554A. These modules are logical, and thus may be arbitrarily combined or further split according to the functions implemented. The functions of the respective modules will be described hereinafter.

In other embodiments, as shown in fig. 2B, the sample identifier alignment device 455B in the user terminal 400-2 provided in the embodiments of the present application may be implemented in software, which may be software in the form of a program, a plug-in, or the like, and includes the following software modules: the second encryption module 4551B, the index receiving module 4552B, the identifier selection module 4553B and the second identifier transmitting module 4554B. These modules are logical, and thus may be arbitrarily combined or further split according to the functions implemented. The functions of the respective modules will be described hereinafter.

In other embodiments, the apparatus provided by the embodiments of the present application may be implemented in hardware. By way of example, the apparatus may be a processor in the form of a hardware decoding processor that is programmed to perform the method for aligning sample identifiers provided by the embodiments of the present application. For example, the processor in the form of a hardware decoding processor may employ one or more application specific integrated circuits (Application Specific Integrated Circuit, ASIC), digital signal processors (Digital Signal Processor, DSP), programmable logic devices (Programmable Logic Device, PLD), complex programmable logic devices (Complex Programmable Logic Device, CPLD), field programmable gate arrays (Field-Programmable Gate Array, FPGA), or other electronic components.

The method for aligning the sample identifier provided by the embodiments of the present application may be performed by an electronic device, where the electronic device may be a server or a terminal, that is, the method for aligning the sample identifier of the embodiments of the present application may be performed by the server or the terminal, or may be performed by interaction between the server and the terminal.

Fig. 3A is a schematic flowchart of an alternative method for aligning sample identifiers according to an embodiment of the present application. The steps shown in fig. 3A are described below, taking a server as the execution subject of the method by way of example. As shown in fig. 3A, the method includes the following steps S101 to S104:

In step S101, for a plurality of first data samples, the first sample identifier of each first data sample is encrypted, so as to obtain a plurality of first encrypted identifiers.

In some embodiments, the first data sample is a data sample held by one of the participants in the secure sample alignment (PSI) process. The first data sample may include a unique first sample identifier and at least one data feature. The first sample identifier is a field that uniquely identifies the first data sample. The first encrypted identifier is a mask id generated by encrypting the first sample identifier. Taking party A as an example, the first data samples of party A are shown in table 1, and the encrypted first data samples of party A are shown in table 2 (data features not shown). Here, _id is the first sample identifier, namely the key on which both parties compute the intersection in the PSI process; af1, af2 and af3 are the corresponding data features; and _mask_id is the first encrypted identifier.

Table 1 Party A data samples

Table 2 Encrypted party A data samples

It should be noted that, in the embodiment of the present application, a specific method for encrypting the first sample identifier to obtain the first encrypted identifier is not limited.

In some embodiments, the first sample identifier of each first data sample may be encrypted using an OPRF-PSI algorithm to obtain a plurality of first encrypted identifiers, where OPRF (Oblivious Pseudorandom Function) is an oblivious pseudorandom function. The first sample identifier may also be encrypted using a hash function-based PSI algorithm, an oblivious transfer (OT)-based PSI algorithm, a homomorphic encryption-based PSI algorithm, a differential privacy-based PSI algorithm, or the like.
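As a rough illustration of what a mask id looks like, the following Python sketch derives one with a keyed hash. This is only a simplified stand-in for the hash function-based option mentioned above, not an OPRF-PSI implementation; the function name `mask_ids` and the shared key are assumptions made for the example.

```python
import hashlib
import hmac

def mask_ids(sample_ids, shared_key):
    """Return one mask id (hex string) per sample id, preserving row order.

    Assumption: both parties hold the same secret key. A real OPRF-PSI
    protocol would not require sharing a key in this way.
    """
    return [
        hmac.new(shared_key, str(sid).encode("utf-8"), hashlib.sha256).hexdigest()
        for sid in sample_ids
    ]

# Repeated sample ids yield repeated mask ids, which is the case
# that the index group is designed to handle.
masked = mask_ids(["u1", "u2", "u1"], b"demo-key")
```

Because the mapping is deterministic, two rows with the same `_id` always receive the same `_mask_id`.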

In step S102, when the same first encryption identifier exists in the plurality of first encryption identifiers, an index group corresponding to the plurality of first encryption identifiers is generated, where the index group includes at least one sub-index group, and indexes in the sub-index group are used to distinguish the same first encryption identifier.

In some embodiments, when identical first encrypted identifiers exist, different indexes can be added to the identical first encrypted identifiers to obtain at least one sub-index group. All indexes in one sub-index group correspond to one group of identical first encrypted identifiers, and the indexes within that group differ from one another. The plurality of first encrypted identifiers may contain one or more groups of identical first encrypted identifiers, and the number of sub-index groups equals the number of such groups. The index group corresponding to the plurality of first encrypted identifiers is constructed from the sub-index groups.

In some embodiments, in addition to adding different indexes to identical first encrypted identifiers to obtain a sub-index group, indexes may also be added to the remaining first encrypted identifiers, each of which occurs only once. The indexes added to these individual first encrypted identifiers may be the same as or different from one another, and may be the same as at least one index in a sub-index group.

In some embodiments, referring to fig. 3B, step S102 shown in fig. 3A may be implemented by the following steps S1021A to S1022A, which are described in detail below.

In step S1021A, the plurality of first encrypted identifiers are divided into a first encrypted identifier group and a second encrypted identifier group.

The first encrypted identifiers in the first encrypted identifier group are the same, and the first encrypted identifiers in the second encrypted identifier group are different from each other.

In this embodiment of the present application, the identifier groups are divided according to whether first encrypted identifiers are identical. A plurality of identical first encrypted identifiers are placed into one first encrypted identifier group. The remaining first encrypted identifiers, none of which has an identical counterpart, are placed into the second encrypted identifier group. The number of first encrypted identifier groups is at least one. The number of second encrypted identifier groups may be 0 or 1. When every first encrypted identifier has an identical counterpart, the number of second encrypted identifier groups is 0.

In step S1022A, corresponding indexes are added to the first encrypted identifier in the first encrypted identifier group and the first encrypted identifier in the second encrypted identifier group, respectively, to obtain index groups corresponding to the plurality of first encrypted identifiers.

In this embodiment of the present application, different indexes may be added to the first encrypted identifiers in each first encrypted identifier group to obtain a corresponding sub-index group, and an index is added to each first encrypted identifier in the second encrypted identifier group. The index group corresponding to the plurality of first encrypted identifiers is then constructed from the sub-index groups corresponding to the first encrypted identifier groups and the indexes corresponding to the second encrypted identifier group.

According to the embodiment of the application, the plurality of first encrypted identifiers are divided into identifier groups so that identical first encrypted identifiers and distinct first encrypted identifiers are separated, which improves the efficiency of index addition.
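The group division of step S1021A can be sketched in Python as follows. The function name `split_identifier_groups` and the use of row positions to refer to identifiers are assumptions made for illustration; the source does not prescribe a data structure.

```python
from collections import defaultdict

def split_identifier_groups(masked_ids):
    """Split row positions into groups of identical mask ids and a remainder.

    Returns (first_groups, second_group):
      first_groups maps each repeated mask id to the rows where it occurs;
      second_group lists the rows whose mask id occurs exactly once.
    """
    positions = defaultdict(list)
    for row, mid in enumerate(masked_ids):
        positions[mid].append(row)
    first_groups = {mid: rows for mid, rows in positions.items() if len(rows) > 1}
    second_group = [rows[0] for rows in positions.values() if len(rows) == 1]
    return first_groups, second_group

first_groups, second_group = split_identifier_groups(["www", "xxx", "xxx", "yyy"])
```

With this input there is one first encrypted identifier group (for xxx) and one second encrypted identifier group holding the rows of www and yyy.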

In some embodiments, referring to fig. 3C, step S1022A shown in fig. 3B may be implemented by the following steps S10221A to S10223A, which are specifically described below.

In step S10221A, a different index is added to each first encryption flag in the first encryption flag group, so as to obtain a sub-index group corresponding to the first encryption flag group.

In the embodiment of the present application, the index adding rule of each first encrypted identifier in the first encrypted identifier group is not limited.

In step S10222A, the same index is added to each first encrypted identification in the second encrypted identification group.

In this embodiment of the present application, the same index may be added to every first encrypted identifier in the second encrypted identifier group, that is, to the first encrypted identifiers other than the identical ones. The index corresponding to the second encrypted identifier group may be different from every index in the first encrypted identifier groups. Alternatively, the index corresponding to the second encrypted identifier group may be the same as at least one index in the first encrypted identifier groups, which facilitates subsequent filtering.

In step S10223A, an index group corresponding to the plurality of first encryption identifications is constructed according to the sub-index group corresponding to each first encryption identification group and the index in each second encryption identification group.

In this embodiment of the present application, the plurality of sub-index groups and each index in the second encryption identification group together form an index group corresponding to the plurality of first encryption identifications.

In some embodiments, referring to fig. 3D, step S10221A shown in fig. 3C may be implemented by the following steps S102211A to S102212A, which are described in detail below.

In step S102211A, the first encrypted identifiers in the first encrypted identifier group are sorted to obtain a first encrypted identifier sequence.

In some embodiments, the method of ordering the first encrypted identifiers in the first encrypted identifier group is not limited. For example, the first encrypted identifiers in the first encrypted identifier group may be randomly arranged to obtain a first encrypted identifier sequence.

In step S102212A, following the order of the first encrypted identifiers in the first encrypted identifier sequence, indexes arranged from small to large or from large to small are added to the first encrypted identifiers one by one, so as to obtain the sub-index group corresponding to the first encrypted identifier group.

In some embodiments, the index may be an Arabic numeral. After the first encrypted identifier sequence is obtained, numerals in ascending or descending order can be added as indexes to the first encrypted identifiers in the sequence one by one, so as to obtain the sub-index group corresponding to the first encrypted identifier group. If indexes are added in ascending order, the smallest index in every sub-index group is the same, for example, 1. If indexes are added in descending order, the largest index in every sub-index group is the same, for example, 100.

According to the embodiment of the application, identical first encrypted identifiers are sorted, and indexes arranged from small to large or from large to small are added in that order, which makes the indexes more regular and simplifies subsequent filtering.
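A minimal Python sketch of steps S102211A and S102212A for a single first encrypted identifier group is given below. Ordering the rows by position is an arbitrary choice made for the example (the source allows any ordering, including a random one), and the function name `build_sub_index_group` is hypothetical.

```python
def build_sub_index_group(group_rows, start=1, ascending=True):
    """Map each row of one group of identical mask ids to a distinct index.

    Assumption: rows are ordered by position; any other ordering rule
    would serve equally well.
    """
    ordered = sorted(group_rows)
    step = 1 if ascending else -1
    return {row: start + step * offset for offset, row in enumerate(ordered)}

# Ascending numbering starts every sub-index group at the same smallest index.
ascending_group = build_sub_index_group([5, 2, 9])
# Descending numbering starts every sub-index group at the same largest index.
descending_group = build_sub_index_group([5, 2, 9], start=100, ascending=False)
```

Because every sub-index group shares the same starting value, the receiver can later filter on that single value.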

In other embodiments, step S10221A shown in fig. 3C may be implemented as follows: different indexes are randomly added to the first encrypted identifiers in the first encrypted identifier group to obtain the sub-index group corresponding to the first encrypted identifier group.

In the embodiment of the present application, the specific content of the index is not limited. By way of example, the index may be an Arabic numeral, an English letter, a Greek letter, or the like. The rule for adding indexes may be the same in every sub-index group, for example, English letters starting from A, which facilitates subsequent filtering.

In some embodiments, referring to fig. 3E, step S102 shown in fig. 3A may be implemented by the following steps S1021B to S1022B, which are described in detail below.

In step S1021B, an index is added for each first encryption identification; wherein, the first indexes corresponding to the same first encryption identification are different; the second indexes corresponding to the other first encryption identifications except the same first encryption identification in the plurality of first encryption identifications are the same; the second index is identical to at least one of the plurality of first indexes.

In this embodiment of the present application, different first indexes are added to the identical first encrypted identifiers among the plurality of first encrypted identifiers, and the same second index is added to the other individual first encrypted identifiers. The second index is identical to at least one of the plurality of first indexes, so that subsequent filtering and deduplication can rely on the first index and the second index being equal. Taking party A as an example, the first data samples after encryption and index addition are shown in table 3, where _mask_index is the index. For the identical first encrypted identifiers xxx, the added first indexes are 1 and 2 in sequence; for the individual first encrypted identifiers www and yyy, the same second index 1 is added.

Table 3 Indexed party A data

In step S1022B, an index group corresponding to the plurality of first encryption identifications is constructed according to the first index and the second index.

In this embodiment of the present application, a plurality of first indexes and second indexes together form an index group corresponding to the first encryption identifier.

According to the embodiment of the application, an index is added directly to each first encrypted identifier, so that the grouping process is omitted and the algorithm is simplified.
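The direct indexing of steps S1021B and S1022B can be sketched in a single pass, as below. The function name `add_indexes` is hypothetical and the row order of the input is an assumption; the resulting indexes follow the xxx, www and yyy example described for table 3.

```python
def add_indexes(masked_ids, start=1):
    """Number repeated mask ids start, start+1, ...; a unique id gets start."""
    last_index = {}
    indexes = []
    for mid in masked_ids:
        last_index[mid] = last_index.get(mid, start - 1) + 1
        indexes.append(last_index[mid])
    return indexes

# www and yyy occur once and share the second index 1;
# the two xxx rows receive the first indexes 1 and 2.
mask_index = add_indexes(["www", "xxx", "xxx", "yyy"])
```

No grouping step is needed, because a running counter per mask id yields both the first indexes and the shared second index.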

In some embodiments, referring to fig. 3F, step S102 shown in fig. 3A may be implemented by the following steps S1021C to S1022C, which are described in detail below.

In step S1021C, different indexes are added to the same first encryption flag, and a corresponding sub-index group is obtained.

In some embodiments, different indexes may be added only to identical first encrypted identifiers to obtain the corresponding sub-index groups. The other individual first encrypted identifiers among the plurality of first encrypted identifiers receive no index, and their index may default to a null value.

In step S1022C, an index group corresponding to the plurality of first encryption identifications is generated according to at least one sub-index group.

According to the embodiment of the application, indexes are added only to identical first encrypted identifiers, which reduces the amount of data to be processed and improves the sample alignment speed.
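The variant of steps S1021C and S1022C, in which only repeated identifiers carry an index, might look like the following sketch. Representing the null value with Python's `None` and the function name `add_indexes_to_duplicates` are assumptions made for the example.

```python
from collections import Counter

def add_indexes_to_duplicates(masked_ids, start=1):
    """Index only repeated mask ids; unique ids keep a null (None) index."""
    counts = Counter(masked_ids)
    last_index = {}
    indexes = []
    for mid in masked_ids:
        if counts[mid] == 1:
            indexes.append(None)
        else:
            last_index[mid] = last_index.get(mid, start - 1) + 1
            indexes.append(last_index[mid])
    return indexes

sparse_index = add_indexes_to_duplicates(["www", "xxx", "xxx", "yyy"])
```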

In step S103, a plurality of first encryption identifications and index groups are transmitted to the receiving side.

The first encryption identifier and the index group are used for a receiver to select and acquire a target sample identifier from a plurality of first encryption identifiers and a plurality of second encryption identifiers based on the index group.

The indexes corresponding to the same target sample identification are the same, and the second encryption identification is obtained by encrypting the second sample identification of the receiver.

With continued reference to fig. 3A, in the process of performing secure sample alignment, the party with the smaller amount of data may be taken as the sender, which holds the first data samples, and the party with the larger amount of data may be taken as the receiver, which holds the second data samples. The second data sample may include a unique second sample identifier and at least one data feature. The second sample identifier is a field that uniquely identifies the second data sample.

In some embodiments, after the plurality of first encryption identifications and the index group are sent to the receiver, the receiver may perform intersection on the first encryption identifications and the second encryption identifications, and then filter out repeated data based on indexes in the index group to obtain the target sample identification. The number of target sample identifications is at least one.

In step S104, a target sample identifier sent by the receiving party is received, and the aligned sample identifiers are determined according to the target sample identifier and the plurality of first encrypted identifiers.

In some embodiments, step S104 shown in fig. 3A may be implemented as follows: the target sample identifiers are intersected with the plurality of first encrypted identifiers to obtain the aligned sample identifiers.

In this embodiment of the present application, intersection refers to combining a row of first data samples corresponding to a first encrypted identifier with a row of second data samples corresponding to a target sample identifier to obtain a new row of data samples. It should be noted, however, that secure sample alignment is a private set intersection method, so a participating party cannot learn the specific data features (af1, bf1, etc.) of the other party. That is, the intersection is in essence only an intersection of the first encrypted identifiers and the target sample identifiers. For example, the aligned samples of party A include only party A's own data features, and the data features of party B are empty. Joining the data itself is a subsequent step after the sample identifiers are aligned and is not described in detail here.

In this way, in the secure sample alignment (PSI) algorithm, the sample alignment speed is improved when the alignment keys in the data of both parties contain repeated data.

Referring to fig. 4A, fig. 4A is a schematic flow chart of a method for aligning sample identifiers according to another embodiment of the present application, and will be described with reference to the steps shown in fig. 4A.

In step S201, the second sample identifier of each second data sample is encrypted for a plurality of second data samples, so as to obtain a plurality of second encrypted identifiers.

For example, participant A and participant B perform secure sample alignment. Because the data volume of participant A is smaller, participant A is selected as the sender and participant B as the receiver; the data in the data set of participant A are the first data samples, and the data in the data set of participant B are the second data samples. The original data of party B are shown in table 4 below, and the encrypted data of party B are shown in table 5 below, where _id is the second sample identifier, bf1 and bf2 are the corresponding data features, and _mask_id is the second encrypted identifier.

Table 4 Party B data samples

Table 5 Encrypted party B data samples

In this embodiment of the present application, the method for encrypting the second sample identifier of the second data sample to obtain the plurality of second encrypted identifiers is consistent with the encryption method of the first sample identifier, which is not described herein.

In step S202, a plurality of first encryption identifications and index groups transmitted by a transmitting side are received.

The first encryption identifier is obtained by encrypting a first sample identifier of the sender. When the same first encryption identifier exists in the plurality of first encryption identifiers, the index group comprises at least one sub-index group, and indexes in the sub-index group are used for distinguishing the same first encryption identifier.

In this embodiment of the present application, the process of adding the index by the sender is described in detail above, and will not be described herein.

In step S203, a target sample identifier is selected from the plurality of first encrypted identifiers and the plurality of second encrypted identifiers based on the index group.

Wherein the indexes corresponding to the same target sample identification are the same.

In this embodiment of the present application, the plurality of first encrypted identifiers and the plurality of second encrypted identifiers may first be intersected, and the intersected identifiers are then filtered according to a preset filtering policy based on the indexes in the index group, so as to obtain at least one target sample identifier. The preset filtering policy depends on the way in which the indexes were added.

In some embodiments, referring to fig. 4B, step S203 shown in fig. 4A may be implemented by the following steps S2031 to S2032, which are specifically described below.

In step S2031, the plurality of first encrypted identifiers and the plurality of second encrypted identifiers are intersected to obtain sample identifiers. The index of each sample identifier is the same as the index of the first encrypted identifier to which the sample identifier corresponds.

In the embodiment of the application, each first encrypted identifier is intersected with the second encrypted identifiers that are identical to it, and only the encrypted identifiers shared by both parties are retained. For a group of identical first encrypted identifiers and second encrypted identifiers, the number of sample identifiers after intersection is the product of the number of first encrypted identifiers and the number of second encrypted identifiers. Each index stays with its corresponding first encrypted identifier.
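The product rule above can be seen in the following Python sketch of the intersection in step S2031. The receiver data (including the identifier zzz) and the function name `intersect` are hypothetical and serve only to illustrate the join.

```python
def intersect(sender_rows, receiver_ids):
    """Join sender (mask_id, index) rows with receiver mask ids.

    Returns one (mask_id, index, receiver_row) tuple per matching pair,
    so m identical sender ids and n identical receiver ids give m * n rows.
    """
    rows_by_id = {}
    for row, mid in enumerate(receiver_ids):
        rows_by_id.setdefault(mid, []).append(row)
    return [
        (mid, index, r_row)
        for mid, index in sender_rows
        for r_row in rows_by_id.get(mid, [])
    ]

sender = [("www", 1), ("xxx", 1), ("xxx", 2), ("yyy", 1)]
receiver = ["xxx", "xxx", "yyy", "zzz"]  # hypothetical party B mask ids
joined = intersect(sender, receiver)
```

Here two xxx rows on each side produce four joined rows, which is the repeated data that the index-based filtering later removes.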

Illustratively, the data of party A and party B after intersection are shown in table 6 below. The encrypted identifiers shared by party A and party B are xxx and yyy, so the data samples after intersection include only the encrypted identifiers xxx and yyy. Two identical first encrypted identifiers xxx exist in party A's original data, with indexes 1 and 2, respectively. Identical second encrypted identifiers xxx also exist in party B's original data. During intersection, the first encrypted identifier xxx with index 1 of party A is combined with one of the second encrypted identifiers xxx of party B, then with the other remaining second encrypted identifier xxx of party B, and so on.

Table 6 Party B data samples after intersection

In step S2032, the sample identification is filtered based on the index set to obtain at least one target sample identification.

In the embodiment of the application, based on indexes in the index group, filtering and selecting the sample identification after intersection according to a preset filtering strategy to obtain at least one target sample identification. The preset filtering strategy changes with the adding mode of the index.

In some embodiments, step S2032 shown in fig. 4B may be implemented as follows: when the indexes in each sub-index group are arranged from small to large starting from an initial number, the sample identifiers whose indexes are larger than the initial number are filtered out of the sample identifiers to obtain at least one target sample identifier.

In this embodiment of the present application, suppose that when adding indexes the sender adds indexes in ascending order, starting from an initial number, to the identical first encrypted identifiers among the plurality of first encrypted identifiers, and adds the same initial number as the index of every other individual first encrypted identifier. The preset filtering policy may then be to filter out every index larger than the initial number, so that only target sample identifiers whose index is the initial number are retained and repeated data are removed. All target sample identifiers obtained in this way have the same index.
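Under the ascending-index rule, the filtering policy reduces to a single comparison, as in the sketch below. The tuple layout (mask id, index, receiver row) and the sample rows are assumptions carried over for illustration only.

```python
def filter_by_initial_number(joined_rows, initial=1):
    """Drop every intersected row whose sender index exceeds the initial number."""
    return [row for row in joined_rows if row[1] <= initial]

# Hypothetical intersected rows: (mask_id, sender index, receiver row).
joined_rows = [
    ("xxx", 1, 0), ("xxx", 1, 1),
    ("xxx", 2, 0), ("xxx", 2, 1),
    ("yyy", 1, 2),
]
targets = filter_by_initial_number(joined_rows)
```

Only the rows contributed by the first copy of each repeated sender identifier survive, so all retained rows carry the same index.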

Illustratively, taking table 3 as an example, the initial number is 1, and indexes are added in ascending order starting from 1 to the identical first encrypted identifiers xxx. Party A has two identical first encrypted identifiers xxx, with indexes 1 and 2. Party A also has the individual first encrypted identifiers www and yyy, both of which take the initial number 1 as their index. After intersection with party B, table 6 is obtained, and the records in which _mask_index is greater than 1 are filtered out to obtain table 7 below.

Table 7 Filtered target sample identifiers of party B

In other embodiments, if the indexes added by the sender are randomly added English letters, Greek letters, or the like, the preset filtering policy may be as follows: for each group of identical sample identifiers, one of the sample identifiers is randomly selected as the target sample identifier, and each of the other individual sample identifiers is used directly as a target sample identifier.
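One possible reading of this random selection policy is sketched below: for each mask id, one of the sender indexes is chosen at random and only the rows carrying that index are kept. This interpretation, the tuple layout and the function name `filter_by_random_choice` are assumptions, since the source does not spell out the procedure.

```python
import random

def filter_by_random_choice(joined_rows, rng=None):
    """Keep, for each mask id, the rows of one randomly chosen sender index."""
    if rng is None:
        rng = random.Random()
    indexes_by_id = {}
    for mid, index, _ in joined_rows:
        indexes_by_id.setdefault(mid, set()).add(index)
    chosen = {mid: rng.choice(sorted(idx)) for mid, idx in indexes_by_id.items()}
    return [row for row in joined_rows if row[1] == chosen[row[0]]]

# Hypothetical intersected rows with letter indexes: (mask_id, index, receiver row).
letter_rows = [
    ("xxx", "A", 0), ("xxx", "A", 1),
    ("xxx", "B", 0), ("xxx", "B", 1),
    ("yyy", "A", 2),
]
kept = filter_by_random_choice(letter_rows, random.Random(0))
```

Whichever index is drawn, the number of retained rows is the same as with the numeric rule; an individual identifier such as yyy has a single index and is therefore always kept.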

In other embodiments, when adding indexes the sender adds indexes only to identical first encrypted identifiers, and the indexes of the others default to a null value. The preset filtering policy may then retain by default every sample identifier that carries no index, while the sample identifiers that carry an index can still be filtered by the methods described above.
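A sketch of this policy, combining the null default with the ascending-index rule, is given below. Using `None` for the null value and the tuple layout (mask id, index, receiver row) are assumptions made for the example.

```python
def filter_with_null_default(joined_rows, initial=1):
    """Keep rows without an index; filter indexed rows by the initial number."""
    return [
        row for row in joined_rows
        if row[1] is None or row[1] <= initial
    ]

# Hypothetical intersected rows; yyy was unique on the sender side.
sparse_rows = [
    ("xxx", 1, 0), ("xxx", 1, 1),
    ("xxx", 2, 0), ("xxx", 2, 1),
    ("yyy", None, 2),
]
sparse_targets = filter_with_null_default(sparse_rows)
```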

In step S204, the target sample identification is transmitted to the sender.

The target sample identifier is used for determining the aligned sample identifier by the sender according to the target sample identifier and the plurality of first encryption identifiers.

With continued reference to fig. 4A, the sender may intersect the plurality of target sample identifiers with the plurality of first encrypted identifiers to obtain the aligned sample identifiers.

In the embodiment of the application, the sample identifiers obtained by the sender are the same as those obtained by the receiver. The intersection process is consistent with the foregoing steps and is not described in detail here. For example, the sender's data samples after intersection are shown in table 8 below.

Table 8 Party A data samples after intersection

According to the embodiment of the application, the first sample identifier is encrypted to obtain the first encrypted identifier, and the second sample identifier is encrypted to obtain the second encrypted identifier, so that neither party exposes its own original data during sample alignment. A corresponding index group is added for the first encrypted identifiers, and the first encrypted identifiers and the index group are sent to the receiver together, so that the receiver can select the target sample identifier from the plurality of first encrypted identifiers and the plurality of second encrypted identifiers based on the index group. Because the index group includes at least one sub-index group whose indexes distinguish repeated first encrypted identifiers, the receiver can remove repeated data based on the index group to obtain the target sample identifier, and the aligned sample identifiers can then be determined from the target sample identifier sent by the receiver and the first encrypted identifiers. In this way, in the secure sample alignment (PSI) algorithm, the sample alignment speed is improved when the alignment keys in the data of both parties contain repeated data.

The method for aligning sample identifiers in the embodiment of the present application is described below in connection with the interaction between the sender (party A) and the receiver (party B) in the sample identifier alignment system. It should be noted that this method is substantially the same as the sample identifier alignment method performed by the server in the above embodiments, and that some steps may be performed by either a terminal or a server. The present embodiment therefore only illustrates steps that are the same as those in the above embodiments but have different execution subjects; in practice the steps may be performed by any execution subject, which is not limited in this embodiment of the present application. Party A may be the terminal 400-1 or the server 200-1, and party B may be the terminal 400-2 or the server 200-2.

Fig. 5 is another optional flowchart of a sample identifier alignment method provided in an embodiment of the present application, as shown in fig. 5, the method includes the following steps S301 to S307:

In step S301, party A encrypts the first sample identifier of each first data sample to obtain a plurality of first encrypted identifiers.

In step S302, party A generates an index group corresponding to the plurality of first encrypted identifiers.

In step S303, party A sends the plurality of first encrypted identifiers and the index group to party B.

In step S304, party B encrypts the second sample identifier of each second data sample to obtain a plurality of second encrypted identifiers.

In step S305, party B selects the target sample identifiers from the plurality of first encrypted identifiers and the plurality of second encrypted identifiers based on the index group.

In step S306, party B sends the target sample identifiers to party A.

In step S307, party A determines the aligned sample identifiers according to the target sample identifiers and the plurality of first encrypted identifiers.
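The flow of steps S301 to S307 can be illustrated with a minimal sketch. The sketch rests on several assumptions that are not part of this application: the function names are hypothetical, a truncated SHA-256 digest stands in for the encryption step (a bare hash provides no privacy and is not the PSI masking itself), and the final intersection is shown as a set of shared identifiers.

```python
import hashlib
from collections import defaultdict


def encrypt(sample_id: str) -> str:
    """Placeholder for the PSI masking step; a bare hash is NOT private."""
    return hashlib.sha256(sample_id.encode("utf-8")).hexdigest()[:12]


def add_indexes(mask_ids):
    """S302: number repeated identifiers 1, 2, 3, ...; unique ones get 1."""
    seen = defaultdict(int)
    indexes = []
    for mask_id in mask_ids:
        seen[mask_id] += 1
        indexes.append(seen[mask_id])
    return indexes


def select_targets(a_mask_ids, a_indexes, b_mask_ids):
    """S305: join A's and B's identifiers, then keep rows whose A index is 1."""
    b_positions = defaultdict(list)
    for pos, mask_id in enumerate(b_mask_ids):
        b_positions[mask_id].append(pos)
    joined = [
        (mask_id, index, pos)
        for mask_id, index in zip(a_mask_ids, a_indexes)
        for pos in b_positions.get(mask_id, [])
    ]
    return [mask_id for mask_id, index, _ in joined if index == 1]


def align(targets, a_mask_ids):
    """S307: intersect the returned identifiers with A's own identifiers."""
    return sorted(set(targets) & set(a_mask_ids))


# S301 / S304: each party masks its own sample identifiers (illustrative data).
a_mask_ids = [encrypt(s) for s in ["u1", "u2", "u2", "u3"]]
b_mask_ids = [encrypt(s) for s in ["u2", "u3", "u3", "u4"]]

a_indexes = add_indexes(a_mask_ids)                           # S302
targets = select_targets(a_mask_ids, a_indexes, b_mask_ids)  # S303, S305
aligned = align(targets, a_mask_ids)                          # S306, S307
```

In this sketch party B keeps one joined row per occurrence on its own side, so the repeated identifier on party A's side no longer multiplies the rows returned in step S306.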

In another embodiment, the sample identifier alignment method can be implemented without adding indexes. After party A obtains Table 2 by encryption and party B obtains Table 5 by encryption, each party can compress its own repeated data to obtain Table 9 and Table 10, respectively.

Table 9: Compressed data of party A

Table 10: Compressed data of party B

Here, _mask_id is the mask ID generated by the OPRF-PSI algorithm and corresponds to the intersection key, and _cnt is the number of times the intersection key appears. Parties A and B then send their own _mask_id and _cnt to each other, and finally obtain the intersected data of the two parties as shown in Table 11 below.

Table 11: Data after intersection of parties A and B
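The compression-based variant can be sketched as follows. The contents of Tables 9 to 11 are not reproduced here, so the sample values, the function names, and the layout of the result (one entry per shared _mask_id holding both counts) are assumptions made for illustration only.

```python
from collections import Counter


def compress(mask_ids):
    """Collapse repeated identifiers into _mask_id -> _cnt pairs."""
    return Counter(mask_ids)


def intersect_compressed(a_counts, b_counts):
    """Keep identifiers present on both sides, with each side's count."""
    shared = a_counts.keys() & b_counts.keys()
    return {mask_id: (a_counts[mask_id], b_counts[mask_id]) for mask_id in shared}


# Illustrative masked identifiers, not taken from the tables of this application.
a_counts = compress(["xxx", "xxx", "xxx", "yyy", "zzz"])
b_counts = compress(["xxx", "xxx", "yyy", "www"])
intersection = intersect_compressed(a_counts, b_counts)
```

Each party must first count its repeated identifiers, and the counts must later be expanded again when the repeated intersection keys are aligned.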

In the following, an exemplary application of the embodiments of the present application in a practical application scenario will be described.

The embodiments of the present application are implemented in the context of federated modeling, where the first step in the federated modeling workflow is secure sample alignment, i.e., the PSI algorithm, as shown in Fig. 6.

Referring to Fig. 6, a wormhole dependency is a plug-in or program for controlling the workflow, and can be used to control the beginning and ending of the PSI wormhole dependency, the training wormhole dependency, and the prediction wormhole dependency. Each wormhole dependency serves to ensure the operation of a single workflow. Specifically, the PSI wormhole dependency begin represents the first step of federated modeling (secure sample alignment). Guest is the master side of federated modeling and Host is the slave side, and both sides act as participants in federated modeling. FL_KOT_PSI_Guest is the PSI process node of the master side Guest, and FL_KOT_PSI_Host is the PSI process node of the slave side Host. The PSI wormhole dependency begin is connected to FL_KOT_PSI_Guest and FL_KOT_PSI_Host; after it starts running, only these two nodes are controlled to perform secure sample alignment, and the other nodes stop running. The sample identifier alignment method provided by the embodiments of the present application is implemented in FL_KOT_PSI_Guest and FL_KOT_PSI_Host respectively, which improves the intersection efficiency when the data of the master side Guest and the slave side Host contain repeated data, and yields the final training samples. After the secure sample alignment is completed, the PSI wormhole dependency end ends the secure sample alignment flow. At this point, the training wormhole dependency starts running. The training wormhole dependency is connected to the two nodes GBDT_guest_train and GBDT_host_train. GBDT_guest_train is the model training process node used after the samples of the master side Guest are aligned, and GBDT_host_train is the model training process node used after the samples of the slave side Host are aligned.
When the training wormhole dependency starts running, only the two connected nodes GBDT_guest_train and GBDT_host_train are controlled to perform model training based on the training samples obtained after secure sample alignment at FL_KOT_PSI_Guest and FL_KOT_PSI_Host, and the other nodes stop working. After the training wormhole dependency ends, the prediction wormhole dependency starts, so that the common model obtained by training at GBDT_guest_train and GBDT_host_train is used to output prediction data at the following nodes based on different data: GBDT_gust_pred, GBDT_gust_pred2, GBDT_host_pred, and GBDT_host_pred2. GBDT_gust_pred and GBDT_gust_pred2 output data predicted from different data of the master side, and GBDT_host_pred and GBDT_host_pred2 output data predicted from different data of the slave side.

In the secure sample alignment method based on the OPRF-PSI algorithm, when repeated data are encountered, both parties need to count the mask_id values corresponding to the repeated data, and the counts are then expanded when the repeated intersection keys are aligned, which results in low performance. The embodiments of the present application solve the problem of aligning repeated data efficiently without counting the repeated data, as described below:

The original first sample data of party A is shown in Table 1. After the first sample identifiers of the first sample data pass through the KOT-PSI algorithm, corresponding indexes are added to obtain Table 3 (for convenience of description, each table lists only the key relevant to intersection, namely _id, and intermediate results). Specifically, indexes are added to the identical first encrypted identifiers xxx in sequence starting from 1, and index 1 is added to each of the other, distinct first encrypted identifiers. The original second sample data of party B is shown in Table 4, and Table 5 is obtained after the second sample identifiers of the second sample data pass through the KOT-PSI algorithm. Here, mask_id is the mask ID generated by the KOT-PSI algorithm and corresponds to the intersection key, and mask_index is the index of each intersection key. Party A sends mask_id and mask_index to party B; party B intersects the received mask_id with its own mask_id, carrying along party A's mask_index, to obtain Table 6. After obtaining the intersection result, party B filters out the records whose mask_index is greater than 1 to obtain the target sample identifiers in Table 7. Party B then sends the target sample identifiers to party A, and party A intersects its own mask_id with the received mask_id to obtain the final intersection result in Table 8. The embodiments of the present application thus efficiently solve the problem of low performance when repeated intersection keys exist in secure sample alignment.
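The effect of the filter on the size of the intermediate result can be checked with a short calculation. The sketch below assumes hypothetical data and a hypothetical helper name: for each shared identifier, the unfiltered join holds (count on party A) x (count on party B) rows, while keeping only index 1 leaves one row per occurrence on party B.

```python
from collections import Counter


def join_sizes(a_mask_ids, b_mask_ids):
    """Return (rows before filtering, rows after keeping only index 1)."""
    a_counts = Counter(a_mask_ids)
    b_counts = Counter(b_mask_ids)
    shared = a_counts.keys() & b_counts.keys()
    # Every copy on A's side pairs with every copy on B's side.
    unfiltered = sum(a_counts[m] * b_counts[m] for m in shared)
    # Only A's copy with index 1 survives, paired with each copy on B's side.
    filtered = sum(b_counts[m] for m in shared)
    return unfiltered, filtered


# Illustrative data: xxx appears three times for A and twice for B.
unfiltered, filtered = join_sizes(
    ["xxx", "xxx", "xxx", "yyy", "zzz"],
    ["xxx", "xxx", "yyy", "www"],
)
```

With these values the join shrinks from 7 rows to 3 before anything is sent back to party A, and no per-key counting step is needed on either side.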

The sample identifier alignment method provided by the embodiments of the present application was tested on financial data of a certain application at scales such as 10 million and 1 billion records; when the samples of both parties contain a large number of repeated keys, performance improves by more than 60%.

For the federated learning components of PowerFL, a company-level federated learning solution, secure sample alignment is performed before model training such as federated GBDT, LR, and DNN. The sample identifier alignment method provided by the embodiments of the present application can be integrated into the company's PowerFL Oteam code repository as an implementation for efficiently aligning repeated intersection keys, and applied to PowerFL-SQL, thereby greatly improving service performance.

It will be appreciated that the embodiments of the present application involve related data such as user information. When the embodiments of the present application are applied to specific products or technologies, user permission or consent needs to be obtained, and the collection, use, and processing of the related data need to comply with the relevant laws, regulations, and standards of the relevant countries and regions.

The following continues with an exemplary structure of the sample identifier alignment device 455A provided in the embodiments of the present application, implemented as software modules. In some embodiments, as shown in Fig. 2A, the software modules in the sample identifier alignment device 455A may include:

The first encryption module 4551A is configured to encrypt, for a plurality of first data samples, a first sample identifier of each first data sample, respectively, to obtain a plurality of first encrypted identifiers.

The index generating module 4552A is configured to generate an index group corresponding to the plurality of first encrypted identifiers when identical first encrypted identifiers exist among the plurality of first encrypted identifiers, where the index group includes at least one sub-index group, and the indexes in the sub-index group are used to distinguish the identical first encrypted identifiers.

A first identifier sending module 4553A, configured to send a plurality of first encrypted identifiers and an index group to a receiver; the first encryption identifier and the index group are used for a receiver to select and acquire a target sample identifier from a plurality of first encryption identifiers and a plurality of second encryption identifiers based on the index group; the indexes corresponding to the same target sample identification are the same, and the second encryption identification is obtained by encrypting the second sample identification of the receiver.

The sample identifier determining module 4554A is configured to receive a target sample identifier sent by a receiver, and determine an aligned sample identifier according to the target sample identifier and a plurality of first encrypted identifiers.

In some embodiments, the index generating module 4552A is further configured to divide the plurality of first encrypted identifiers into identifier groups to obtain a first encrypted identifier group and a second encrypted identifier group, where the first encrypted identifiers in the first encrypted identifier group are the same and the first encrypted identifiers in the second encrypted identifier group are different from each other; and to add corresponding indexes for the first encrypted identifiers in the first encrypted identifier group and the first encrypted identifiers in the second encrypted identifier group respectively, to obtain the index group corresponding to the plurality of first encrypted identifiers.

In some embodiments, the index generating module 4552A is further configured to add a different index to each first encryption identifier in the first encryption identifier group, to obtain a sub-index group corresponding to the first encryption identifier group; adding the same index to each first encrypted identifier in the second encrypted identifier group; and constructing and obtaining index groups corresponding to the plurality of first encryption identifications according to the sub-index groups corresponding to each first encryption identification group and the indexes in each second encryption identification group.

In some embodiments, the index generating module 4552A is further configured to sort each first encrypted identifier in the first encrypted identifier group to obtain a first encrypted identifier sequence; and sequentially adding indexes which are arranged from small to large or from large to small to each first encryption identifier based on the sequence of each first encryption identifier in the first encryption identifier sequence, so as to obtain a sub-index group corresponding to the first encryption identifier group.

In some embodiments, the index generating module 4552A is further configured to randomly add a different index to each first encrypted identifier in the first encrypted identifier group, so as to obtain a sub-index group corresponding to the first encrypted identifier group.
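The two ways of filling a sub-index group described above, sequential and random, can be sketched as follows. The function name, the parameters `randomize`, `start`, and `seed`, and the sample values are assumptions made for illustration; the application does not prescribe a particular interface.

```python
import random
from collections import defaultdict


def assign_indexes(mask_ids, randomize=False, start=1, seed=None):
    """Give each copy of a repeated identifier a distinct index.

    Identifiers that occur once all receive the same index, `start`.
    """
    positions = defaultdict(list)
    for pos, mask_id in enumerate(mask_ids):
        positions[mask_id].append(pos)

    rng = random.Random(seed)
    indexes = [start] * len(mask_ids)
    for group in positions.values():
        values = list(range(start, start + len(group)))
        if randomize:
            rng.shuffle(values)  # random variant: same values, arbitrary order
        for pos, value in zip(group, values):
            indexes[pos] = value
    return indexes


mask_ids = ["xxx", "yyy", "xxx", "xxx", "zzz"]
sequential = assign_indexes(mask_ids)
shuffled = assign_indexes(mask_ids, randomize=True, seed=0)
```

In both variants exactly one copy of each repeated identifier carries the initial index, which is the property the receiver relies on when filtering.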

In some embodiments, the index generation module 4552A is further configured to add an index for each first encryption identification; wherein, the first indexes corresponding to the same first encryption identification are different; the second indexes corresponding to the other first encryption identifications except the same first encryption identification in the plurality of first encryption identifications are the same; the second index is identical to at least one of the plurality of first indexes; and constructing an index group corresponding to the plurality of first encryption identifications according to the first index and the second index.

In some embodiments, the index generating module 4552A is further configured to add different indexes to the same first encryption identifier to obtain a corresponding sub-index group; and generating index groups corresponding to the plurality of first encryption identifications according to at least one sub-index group.

In some embodiments, the sample identifier determining module 4554A is further configured to intersect the target sample identifiers with the plurality of first encrypted identifiers to obtain the aligned sample identifiers.

As shown in Fig. 2B, in the sample identifier alignment device 455B provided in another embodiment of the present application, the software modules may include:

The second encryption module 4551B is configured to encrypt, for a plurality of second data samples, a second sample identifier of each second data sample, respectively, to obtain a plurality of second encrypted identifiers.

An index receiving module 4552B, configured to receive a plurality of first encryption identifications and an index group sent by a sender; the first encryption identifier is obtained by encrypting a first sample identifier of the sender; when the same first encryption identifier exists in the plurality of first encryption identifiers, the index group comprises at least one sub-index group, and indexes in the sub-index group are used for distinguishing the same first encryption identifier.

The identifier selecting module 4553B is configured to select, based on the index group, a target sample identifier from the plurality of first encrypted identifiers and the plurality of second encrypted identifiers; wherein the indexes corresponding to the same target sample identification are the same.

A second identifier sending module 4554B, configured to send the target sample identifier to the sender; the target sample identifier is used for determining the aligned sample identifier by the sender according to the target sample identifier and the plurality of first encryption identifiers.

In some embodiments, the identifier selection module 4553B is further configured to intersect the plurality of first encrypted identifiers and the plurality of second encrypted identifiers to obtain aligned sample identifiers, where the index of a sample identifier is the same as the index of the first encrypted identifier corresponding to that sample identifier; and to filter the sample identifiers based on the index group to obtain at least one target sample identifier.

In some embodiments, the identifier selection module 4553B is further configured to, in a case where the indexes in the sub-index group are arranged in order from small to large based on the initial number, filter, from the sample identifiers, the sample identifier corresponding to the index greater than the initial number, and obtain at least one target sample identifier.
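The filtering rule can be written with the initial number as a parameter. This is a minimal sketch: the function name, the row layout of (identifier, index) pairs, and the choice of 0 as the initial number in the example are assumptions made for illustration.

```python
def filter_targets(joined_rows, initial=1):
    """Drop rows whose index is greater than the initial number."""
    return [mask_id for mask_id, index in joined_rows if index <= initial]


# Joined rows as (identifier, index carried over from the sender).
joined_rows = [("xxx", 0), ("xxx", 1), ("xxx", 2), ("yyy", 0)]
targets = filter_targets(joined_rows, initial=0)
```

Because the indexes in a sub-index group ascend from the initial number, the receiver needs only the index column to discard the repeated rows.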

Embodiments of the present application provide a computer program product comprising a computer program or computer-executable instructions stored in a computer-readable storage medium. The processor of the electronic device reads the computer executable instructions from the computer readable storage medium, and the processor executes the computer executable instructions, so that the electronic device executes the sample identification alignment method according to the embodiment of the application.

Embodiments of the present application provide a computer-readable storage medium storing computer-executable instructions or a computer program which, when executed by a processor, cause the processor to perform the sample identifier alignment method provided by the embodiments of the present application, for example, the sample identifier alignment method illustrated in Fig. 3A.

In some embodiments, the computer-readable storage medium may be RAM, ROM, flash memory, magnetic surface memory, an optical disk, or a CD-ROM; it may also be any device that includes one or any combination of the above memories.

In some embodiments, computer-executable instructions may be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, in the form of programs, software modules, scripts, or code, and they may be deployed in any form, including as stand-alone programs or as modules, components, subroutines, or other units suitable for use in a computing environment.

As an example, computer-executable instructions may, but need not, correspond to files in a file system, may be stored as part of a file that holds other programs or data, such as in one or more scripts in a hypertext markup language (Hyper Text Markup Language, HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).

As an example, computer-executable instructions may be deployed to be executed on one electronic device or on multiple electronic devices located at one site or, alternatively, on multiple electronic devices distributed across multiple sites and interconnected by a communication network.

In summary, the embodiments of the present application ensure that neither party exposes its own original data during sample alignment, and improve the speed of sample alignment when the alignment keys in the data of both parties contain repeated data.

The foregoing is merely exemplary embodiments of the present application and is not intended to limit the scope of the present application. Any modifications, equivalent substitutions, improvements, etc. that are within the spirit and scope of the present application are intended to be included within the scope of the present application.

Claims

1. A method of alignment of sample identifiers, the method comprising:

encrypting a first sample identifier of each first data sample aiming at a plurality of first data samples to obtain a plurality of first encrypted identifiers;

when the same first encryption identifier exists in the plurality of first encryption identifiers, generating index groups corresponding to the plurality of first encryption identifiers, wherein the index groups comprise at least one sub-index group, and indexes in the sub-index groups are used for distinguishing the same first encryption identifier;

transmitting the plurality of first encryption identifications and the index group to a receiver; the first encryption identifications and the index group are used for the receiver to intersect the plurality of first encryption identifications and the plurality of second encryption identifications to obtain aligned sample identifiers, and to filter the sample identifiers based on the index group to obtain at least one target sample identifier, and the second encryption identification is obtained by encrypting the second sample identifier of the receiver;

wherein, for a group of identical first encrypted identifiers and second encrypted identifiers, the number of sample identifiers obtained after intersection and alignment is the product of the number of the first encrypted identifiers and the number of the second encrypted identifiers; the index of the sample identifier is the same as the index of the first encryption identifier corresponding to the sample identifier; the indexes corresponding to the same target sample identifiers are the same;

and receiving the target sample identifier sent by the receiver, and determining the aligned sample identifier according to the target sample identifier and the plurality of first encryption identifiers.

2. The method of claim 1, wherein the generating the index group corresponding to the plurality of first encrypted identifications comprises:

dividing the plurality of first encryption identifications into identification groups to obtain a first encryption identification group and a second encryption identification group;

the first encrypted identifiers in the first encrypted identifier group are the same, and the first encrypted identifiers in the second encrypted identifier group are different from each other;

and adding corresponding indexes for the first encryption identifications in the first encryption identification group and the first encryption identifications in the second encryption identification group respectively to obtain a plurality of index groups corresponding to the first encryption identifications.

3. The method of claim 2, wherein the adding the corresponding index to the first encrypted identifier in the first encrypted identifier group and the first encrypted identifier in the second encrypted identifier group to obtain a plurality of index groups corresponding to the first encrypted identifiers includes:

adding different indexes to each first encryption identifier in the first encryption identifier group to obtain a sub-index group corresponding to the first encryption identifier group;

adding the same index to each of the first encrypted identifications in the second encrypted identification group;

and constructing a plurality of index groups corresponding to the first encryption identifications according to the sub index groups corresponding to each first encryption identification group and indexes in each second encryption identification group.

4. The method of claim 3, wherein adding a different index to each of the first encrypted identifiers in the first encrypted identifier group to obtain a sub-index group corresponding to the first encrypted identifier group includes:

ordering all the first encryption identifications in the first encryption identification group to obtain a first encryption identification sequence;

and sequentially adding indexes which are arranged from small to large or from large to small for each first encryption identifier based on the sequence of each first encryption identifier in the first encryption identifier sequence, so as to obtain a sub-index group corresponding to the first encryption identifier group.

5. The method of claim 3, wherein adding a different index to each of the first encrypted identifiers in the first encrypted identifier group to obtain a sub-index group corresponding to the first encrypted identifier group includes:

and randomly adding different indexes for each first encryption identifier in the first encryption identifier group to obtain a sub-index group corresponding to the first encryption identifier group.

6. The method of claim 1, wherein the generating the index group corresponding to the plurality of first encrypted identifications comprises:

adding an index to each first encryption identification; wherein, the first indexes corresponding to the same first encryption identification are different; the second indexes corresponding to the other first encryption identifications except the same first encryption identification in the plurality of first encryption identifications are the same; the second index is identical to at least one of the first indexes;

and constructing and obtaining a plurality of index groups corresponding to the first encryption identifications according to the first index and the second index.

7. The method of claim 1, wherein the generating the index group corresponding to the plurality of first encrypted identifications comprises:

adding different indexes for the same first encryption identification to obtain a corresponding sub-index group;

and generating index groups corresponding to the plurality of first encryption identifications according to at least one sub-index group.

8. The method of claim 1, wherein the determining the aligned sample identity from the target sample identity and the plurality of first encrypted identities comprises:

and intersecting the target sample identifier and the plurality of first encryption identifiers to obtain aligned sample identifiers.

9. A method of alignment of sample identifiers, the method comprising:

encrypting a second sample identifier of each second data sample aiming at a plurality of second data samples to obtain a plurality of second encrypted identifiers;

receiving a plurality of first encryption identifications and index groups sent by a sender; the first encryption identifier is obtained by encrypting a first sample identifier of the sender; when the same first encryption identifier exists in the plurality of first encryption identifiers, the index group comprises at least one sub-index group, and indexes in the sub-index group are used for distinguishing the same first encryption identifier;

intersecting the plurality of first encryption identifications and the plurality of second encryption identifications to obtain aligned sample identifiers, and filtering the sample identifiers based on the index group to obtain at least one target sample identifier; wherein, for a group of identical first encrypted identifiers and second encrypted identifiers, the number of sample identifiers obtained after intersection and alignment is the product of the number of the first encrypted identifiers and the number of the second encrypted identifiers; the index of the sample identifier is the same as the index of the first encryption identifier corresponding to the sample identifier; the indexes corresponding to the same target sample identifiers are the same;

Transmitting the target sample identification to the sender; the target sample identifier is configured to determine, by the sender, an aligned sample identifier according to the target sample identifier and the plurality of first encrypted identifiers.

10. The method of claim 9, wherein, in the case that each index in the sub-index group is arranged in order from small to large based on an initial number, the filtering the sample identifier based on the index group to obtain at least one target sample identifier includes:

and filtering the sample identifier corresponding to the index larger than the initial number from the sample identifiers to obtain at least one target sample identifier.

11. An alignment device for sample identification, the device comprising:

the first encryption module is used for encrypting the first sample identifiers of the first data samples aiming at the plurality of first data samples respectively to obtain a plurality of first encrypted identifiers;

the index generation module is used for generating index groups corresponding to the plurality of first encryption identifications when the same first encryption identifications exist in the plurality of first encryption identifications, wherein the index groups comprise at least one sub-index group, and indexes in the sub-index groups are used for distinguishing the same first encryption identifications;

the first identifier sending module is used for sending the plurality of first encryption identifiers and the index group to a receiver; the first encryption identifiers and the index group are used for the receiver to intersect the plurality of first encryption identifiers and the plurality of second encryption identifiers to obtain aligned sample identifiers, and to filter the sample identifiers based on the index group to obtain at least one target sample identifier, and the second encryption identifier is obtained by encrypting the second sample identifier of the receiver; wherein, for a group of identical first encrypted identifiers and second encrypted identifiers, the number of sample identifiers obtained after intersection and alignment is the product of the number of the first encrypted identifiers and the number of the second encrypted identifiers; the index of the sample identifier is the same as the index of the first encryption identifier corresponding to the sample identifier; the indexes corresponding to the same target sample identifiers are the same;

and the sample identification determining module is used for receiving the target sample identification sent by the receiver and determining the aligned sample identification according to the target sample identification and the plurality of first encryption identifications.

12. An alignment device for sample identification, the device comprising:

the second encryption module is used for encrypting a second sample identifier of each second data sample aiming at a plurality of second data samples to obtain a plurality of second encrypted identifiers;

the index receiving module is used for receiving a plurality of first encryption identifications and index groups sent by the sender; the first encryption identifier is obtained by encrypting a first sample identifier of the sender; when the same first encryption identifier exists in the plurality of first encryption identifiers, the index group comprises at least one sub-index group, and indexes in the sub-index group are used for distinguishing the same first encryption identifier;

the identification selection module is used for intersecting the plurality of first encryption identifications and the plurality of second encryption identifications to obtain aligned sample identifications, and filtering the sample identifications based on the index group to obtain at least one target sample identification; wherein, for a group of identical first encrypted identifiers and second encrypted identifiers, the number of sample identifiers obtained after intersection and alignment is the product of the number of the first encrypted identifiers and the number of the second encrypted identifiers; the index of the sample identifier is the same as the index of the first encryption identifier corresponding to the sample identifier; the indexes corresponding to the same target sample identifiers are the same;

A second identifier sending module, configured to send the target sample identifier to the sender; the target sample identifier is configured to determine, by the sender, an aligned sample identifier according to the target sample identifier and the plurality of first encrypted identifiers.

13. An electronic device, the electronic device comprising:

a memory for storing computer executable instructions;

a processor for implementing the method of aligning sample identifications of any one of claims 1 to 8 or the method of aligning sample identifications of any one of claims 9 to 10 when executing computer-executable instructions or computer programs stored in the memory.

14. A computer readable storage medium storing computer executable instructions or a computer program, which when executed by a processor, implements the method of aligning sample identifications according to any one of claims 1 to 8 or the method of aligning sample identifications according to any one of claims 9 to 10.