CN111611601A - Multi-data-party user analysis model joint training method and device and storage medium - Google Patents


Info

Publication number
CN111611601A
Authority
CN
China
Prior art keywords
data
analysis model
user analysis
training
sample data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010370875.6A
Other languages
Chinese (zh)
Inventor
戴佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
OneConnect Smart Technology Co Ltd
OneConnect Financial Technology Co Ltd Shanghai
Original Assignee
OneConnect Financial Technology Co Ltd Shanghai
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by OneConnect Financial Technology Co Ltd Shanghai filed Critical OneConnect Financial Technology Co Ltd Shanghai
Priority to CN202010370875.6A priority Critical patent/CN111611601A/en
Publication of CN111611601A publication Critical patent/CN111611601A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60: Protecting data
    • G06F21/602: Providing cryptographic facilities or services
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21: Design, administration or maintenance of databases
    • G06F16/215: Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27: Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/22: Matching criteria, e.g. proximity measures
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00: Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/08: Insurance
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00: Network architectures or network communication protocols for network security
    • H04L63/04: Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks
    • H04L63/0428: Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks wherein the data content is protected, e.g. by encrypting or encapsulating the payload
    • H04L63/0442: Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks wherein the data content is protected, e.g. by encrypting or encapsulating the payload wherein the sending and receiving network entities apply asymmetric encryption, i.e. different keys for encryption and decryption

Abstract

The invention relates to artificial intelligence and discloses a multi-data-party user analysis model joint training method, which comprises the following steps: constructing a public key and a private key, and distributing the public key to at least two data side terminal devices; receiving encrypted sample data obtained by the at least two data side terminal devices encrypting their data with the public key; decrypting the encrypted sample data with the private key, and removing repeated data from the decrypted sample data to obtain training sample data; constructing an initial user analysis model, and training the initial user analysis model on the training sample data to obtain a user analysis model; and distributing the user analysis model to the at least two data side terminal devices. The invention also relates to blockchain technology, wherein the public key and the private key are stored in a blockchain.

Description

Multi-data-party user analysis model joint training method and device and storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a method and a device for multi-data-party user analysis model joint training, electronic equipment and a computer readable storage medium.
Background
With the rise of machine learning and big data, most companies perform modeling training on their own existing user data to obtain a user analysis model for user behavior analysis. For example, many fraudulent claims occur in current insurance claim settlement and rescue services. To address the low speed and high cost of manual auditing, the current approach is to use big data for fraud detection: the features of known fraudulent user behaviors are extracted by studying the company's own claim data, a fraud model is trained with machine learning methods, and the fraud model is applied in the user claim detection step. However, this approach has the following problem: the data used for model training is only the insurance company's own data, so it cannot cover users across the whole network, and more general fraud behaviors cannot be captured. As a result, when a fraudulent user is detected by one company, the user can simply switch to another company and continue the fraud.
In summary, in the existing modeling method, due to privacy protection, companies are generally reluctant to exchange user data, so that training data that can be obtained by each company is only data of the company, and a trained user analysis model may not be very accurate.
Disclosure of Invention
The invention provides a multi-data-party user analysis model joint training method, apparatus, electronic device and computer readable storage medium, and mainly aims to enable all parties to jointly train a user analysis model without exchanging data. To achieve this object, the method comprises:
Using an asymmetric encryption algorithm to construct a public key and a private key, and distributing the public key to at least two data side terminal devices;
receiving encrypted sample data obtained by the at least two data side terminal devices performing a data encryption operation with the public key;
decrypting the encrypted sample data by using the private key to obtain sample data;
performing repeated data elimination operation on the sample data to obtain training sample data;
constructing an initial user analysis model, and training the initial user analysis model according to the training sample data to obtain a user analysis model; and
distributing the user analysis model to the at least two data side terminal devices.
Optionally, the method further comprises:
and performing distributed storage on the sample data by using the following Hash function:
slice_id=(w1×(hash_str(point_name)/b1)+w2×(day_time(time)/b2))
wherein slice_id is the fragment number allocated to the data, hash_str(point_name) is a quantization function of the data name of the data added to the storage node, day_time(time) is a quantization function of the time period in which the data was added to the storage node, b1 is the dispersion degree of the data names, b2 is the dispersion degree of the time periods, and w1 and w2 are weight coefficients.
Optionally, the performing a repeated data elimination operation on the sample data includes:
calculating the similarity between the data of different users in the sample data;
eliminating the data of repeated users from the sample data according to the similarity between the data.
Optionally, the calculation formula of the similarity is:
sim(X, Y) = (Σi Xi×Yi) / (√(Σi Xi²) × √(Σi Yi²))
wherein Xi represents the i-th feature data of user X, Yi represents the i-th feature data of user Y, and sim(X, Y) represents the similarity of users X and Y.
Optionally, training the initial user analysis model according to the training sample data to obtain a user analysis model includes:
distributing the initial user analysis model to the at least two data side terminal devices;
receiving model parameters which are transmitted by the at least two data side terminal devices and encrypted by using the public key, wherein the model parameters are obtained by training the initial user analysis model by using respective feature data and feature label data of the at least two data side terminal devices;
decrypting the encrypted model parameters by using the private key to obtain decrypted model parameters, and calculating according to the decrypted model parameters to obtain total model parameters;
and updating the model parameters of the initial user analysis model by using the total model parameters to obtain the user analysis model.
In order to solve the above problem, the present invention further provides a multi-data-party user analysis model joint training apparatus, including:
the key construction and distribution module is used for constructing a public key and a private key by using an asymmetric encryption algorithm and distributing the public key to at least two data side terminal devices;
the encryption receiving module is used for receiving the encrypted sample data obtained by the at least two data side terminal devices performing data encryption with the public key;
the decryption summarizing module is used for carrying out decryption operation on the encrypted sample data by using the private key to obtain the sample data;
the data removing module is used for removing the repeated data from the sample data to obtain training sample data;
the model training and distributing module is used for constructing an initial user analysis model and training the initial user analysis model according to the training sample data to obtain a user analysis model; and distributing the user analysis model to the at least two data side terminal devices.
Optionally, the public key and the private key are stored in a blockchain, and the operation of removing repeated data from the sample data includes:
calculating the similarity between the data of different users in the sample data by using the following calculation formula of the similarity:
sim(X, Y) = (Σi Xi×Yi) / (√(Σi Xi²) × √(Σi Yi²))
wherein Xi represents the i-th feature data of user X, Yi represents the i-th feature data of user Y, and sim(X, Y) represents the similarity of users X and Y;
eliminating the data of repeated users from the sample data according to the similarity between the data.
Training the initial user analysis model according to the training sample data to obtain the user analysis model includes:
distributing the initial user analysis model to the at least two data side terminal devices;
receiving model parameters which are transmitted by the at least two data side terminal devices and encrypted by using the public key, wherein the model parameters are obtained by training the initial user analysis model by using respective feature data and feature label data of the at least two data side terminal devices;
decrypting the encrypted model parameters by using the private key to obtain decrypted model parameters, and calculating according to the decrypted model parameters to obtain total model parameters;
and updating the model parameters of the initial user analysis model by using the total model parameters to obtain the user analysis model.
In order to solve the above problem, the present invention also provides an electronic device, including:
a memory storing at least one instruction; and
a processor executing instructions stored in the memory to implement the multi-data-party user analysis model joint training method described above.
To solve the above problem, the present invention further provides a computer-readable storage medium having at least one instruction stored therein, where the at least one instruction is executed by a processor in an electronic device to implement the multi-data-party user analysis model joint training method described above.
In the embodiment of the invention, an asymmetric encryption algorithm is used to construct a public key and a private key, and the public key is distributed to at least two data side terminal devices, which use it to encrypt their data, so that no data party's raw data is exchanged or leaked. Further, the repeated data elimination operation on the sample data reduces the computation of model training and improves the accuracy of the model. In addition, the initial user analysis model is trained on training sample data obtained from the at least two data side terminal devices, so that the user analysis model is trained on the combined data and its accuracy is higher.
Drawings
FIG. 1 is a schematic flow chart of a method for joint training of multiple data user analysis models according to an embodiment of the present invention;
FIG. 2 is a block diagram of a multi-data-party user analysis model joint training apparatus according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an internal structure of an electronic device for a multi-data-party user analysis model joint training method according to an embodiment of the present invention;
the implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention provides a multi-data-party user analysis model joint training method. Referring to fig. 1, a flow chart of a multi-data-party user analysis model joint training method according to an embodiment of the present invention is shown. The method may be performed by an apparatus, which may be implemented by software and/or hardware.
In this embodiment, the multi-data-party user analysis model joint training method includes:
s1, constructing a public key and a private key by using an asymmetric encryption algorithm, and distributing the public key to at least two data parties.
In detail, in the embodiment of the invention, a third party organization constructs the public key and the private key through a pre-constructed model training system, reserves the private key, and distributes the public key to at least two data side terminal devices. It is emphasized that, to further ensure the privacy and security of the public and private keys, the public and private keys may also be stored in nodes of a blockchain.
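The key construction of step S1 can be sketched with textbook RSA. This toy example with tiny primes (p=61, q=53) is for exposition only; function names and parameters are illustrative, not taken from the patent, and a real deployment would use a vetted cryptographic library with 2048-bit or longer keys.

```python
import math

def build_key_pair(p=61, q=53):
    """Construct a textbook-RSA key pair from two primes (toy sizes).
    The third party retains the private key and distributes the public
    key to each data side terminal device."""
    n = p * q                      # modulus shared by both keys
    phi = (p - 1) * (q - 1)        # Euler's totient of n
    e = 65537                      # common public exponent
    assert math.gcd(e, phi) == 1   # e must be invertible modulo phi
    d = pow(e, -1, phi)            # private exponent (Python 3.8+ inverse)
    return (e, n), (d, n)          # (public key, private key)

public_key, private_key = build_key_pair()
```

The public key is what would be pushed to the data parties; the private key never leaves the third party.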
And S2, receiving the encrypted sample data obtained by the data encryption operation of the at least two data side terminal devices by using the public key.
In the embodiment of the invention, when the at least two data side terminal devices receive the public key of the third party organization, the public key is used for encrypting respective user data, and the encrypted user data is sent to the third party organization, thereby realizing the confidentiality of data.
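The party-side encryption of step S2 can be sketched as follows, under the assumption that each record is encoded as a small integer and RSA-encrypted directly; a production system would use hybrid encryption for bulk data. The toy key values below correspond to the primes p=61, q=53 and are illustrative only.

```python
# Toy textbook-RSA key pair (p=61, q=53); illustrative only.
PUBLIC_KEY = (17, 3233)     # (e, n) held by every data party
PRIVATE_KEY = (2753, 3233)  # (d, n) retained by the third party

def encrypt_record(m, public_key):
    """Encrypt one integer-encoded record with the distributed public key."""
    e, n = public_key
    assert 0 <= m < n, "record must be encoded below the modulus"
    return pow(m, e, n)

def decrypt_record(c, private_key):
    """Third-party decryption with the retained private key."""
    d, n = private_key
    return pow(c, d, n)

# Each data party encrypts its sample rows before sending them on.
party_a_samples = [101, 202, 303]
ciphertexts = [encrypt_record(m, PUBLIC_KEY) for m in party_a_samples]
```

Only the ciphertexts are transmitted, so the raw user data of each party is never exposed in transit.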
And S3, decrypting the encrypted data by using the private key to obtain sample data.
In detail, the decrypting the encrypted data by using the private key to obtain sample data includes:
decrypting the encrypted sample data of the at least two data side terminal devices by using the private key to obtain decrypted data;
summarizing the decrypted data to obtain summarized data;
and performing distributed storage on the summarized data to obtain the sample data.
Because the summarized data is provided by at least two data parties, its volume is large. To reduce the storage and computation pressure on the computer equipment and to record the summarized data permanently, the embodiment of the invention stores it in a distributed manner using a blockchain mechanism.
In detail, the present invention performs distributed storage on the summarized data by using the following Hash function:
slice_id=(w1×(hash_str(point_name)/b1)+w2×(day_time(time)/b2))
wherein slice_id is the fragment number allocated to the data in the database, hash_str(point_name) is a quantization function of the data name of the data added to a storage node of the database, day_time(time) is a quantization function of the time period in which the data was added to the storage node, b1 is the dispersion degree of the data names in the database, b2 is the dispersion degree of the time periods in the database, and w1 and w2 are weight coefficients that can be set manually. In the extreme case where w2 is set to 0, data is distributed entirely according to its data name, and the data of every time period for each data name is stored in the same data slice or one of its backup slices; similarly, if w1 is set to 0, data is distributed according to the time period in which it was added to the storage node, and the data of all data names in each time period is stored in the same data slice or one of its backup slices.
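The sharding rule above can be sketched as follows. The patent does not specify the quantization functions, so hash_str and day_time here are stand-ins, and the weights, dispersion constants, and slice count are illustrative assumptions.

```python
def hash_str(point_name):
    """Stand-in quantizer for the data name (not specified in the patent)."""
    return sum(ord(ch) for ch in point_name) % 997

def day_time(time_str):
    """Stand-in quantizer: hour of day from an 'HH:MM' timestamp."""
    return int(time_str.split(":")[0])

def slice_id(point_name, time_str, w1=1.0, w2=1.0, b1=16.0, b2=4.0,
             n_slices=64):
    """slice_id = w1*(hash_str(point_name)/b1) + w2*(day_time(time)/b2),
    folded onto a fixed number of storage slices."""
    score = w1 * (hash_str(point_name) / b1) + w2 * (day_time(time_str) / b2)
    return int(score) % n_slices
```

Setting w2=0 reproduces the first extreme case (placement depends only on the data name), and w1=0 the second (placement depends only on the time period).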
And S4, performing repeated data elimination operation on the sample data to obtain training sample data.
Because the sample data comes from different data parties, repeated data may exist; for example, both data party A and data party B may hold data of the same user Li XX. To reduce the computation of model training and improve model accuracy, the embodiment of the invention performs a repeated data elimination operation on the sample data.
In detail, the embodiment of the present invention calculates the similarity between the data of different users in the sample data, and eliminates the data of the repeat user in the sample data according to the similarity between the data. Wherein, the calculation formula of the similarity is as follows:
sim(X, Y) = (Σi Xi×Yi) / (√(Σi Xi²) × √(Σi Yi²))
wherein Xi represents the i-th feature data of user X, Yi represents the i-th feature data of user Y, and sim(X, Y) represents the similarity of users X and Y.
The value of sim(X, Y) ranges from -1 to 1, where -1 means that the two feature vectors point in exactly opposite directions, 1 means that they point in exactly the same direction, and 0 usually means that they are independent of each other. In the embodiment of the invention, when the sim(X, Y) value is 1, user X and user Y are treated as duplicate data, i.e. a common user of the at least two data parties; therefore, the data of either user X or user Y is deleted.
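The de-duplication step above can be sketched as follows. The cosine form of the similarity and the duplicate test at sim = 1 follow the description; the small tolerance eps is an implementation assumption to absorb floating-point error.

```python
import math

def cosine_sim(x, y):
    """sim(X, Y) = Σ Xi×Yi / (√Σ Xi² × √Σ Yi²) over two feature vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def deduplicate(users, eps=1e-9):
    """Keep the first occurrence of each user; a later user whose
    similarity to an already-kept user reaches 1 is a duplicate."""
    kept = []
    for name, feats in users:
        if all(cosine_sim(feats, kept_feats) < 1.0 - eps
               for _, kept_feats in kept):
            kept.append((name, feats))
    return kept
```

For instance, a user whose feature vector is a scalar multiple of an earlier user's has similarity 1 and is dropped.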
S5, constructing an initial user analysis model, and training the initial user analysis model according to the training sample data to obtain a user analysis model.
In detail, the initial user analysis model can be constructed using, for example, a linear model, a tree-structured model, or a convolutional neural network model.
In detail, in the embodiment of the invention, the training of the model can be completed by the following steps:
distributing the initial user analysis model to the at least two data side terminal devices;
receiving model parameters which are transmitted by the at least two data side terminal devices and encrypted by using the public key, wherein the model parameters are obtained by training the initial user analysis model by using respective feature data and feature label data of the at least two data side terminal devices;
decrypting the encrypted model parameters by using the private key to obtain decrypted model parameters, and calculating according to the decrypted model parameters to obtain total model parameters;
and updating the model parameters of the initial user analysis model by using the total model parameters to obtain the user analysis model.
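The patent does not spell out how the decrypted per-party parameters are combined into the total model parameters; a common choice, sketched here as an assumption, is (optionally weighted) federated averaging:

```python
def aggregate_parameters(party_params, weights=None):
    """Combine decrypted parameter vectors from each data party into
    total model parameters by (optionally weighted) averaging."""
    n = len(party_params)
    if weights is None:
        weights = [1.0 / n] * n            # equal weight per party
    assert abs(sum(weights) - 1.0) < 1e-9  # weights form a convex combination
    dim = len(party_params[0])
    return [sum(w * p[i] for w, p in zip(weights, party_params))
            for i in range(dim)]

# Two parties return locally trained parameters; the third party averages
# them and pushes the result back as the updated user analysis model.
total = aggregate_parameters([[0.2, 1.0, -0.5], [0.4, 0.0, 0.5]])
```

Weights proportional to each party's sample count could be supplied instead of the default equal weighting.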
And S6, distributing the user analysis model to the at least two data side terminal devices.
After the user analysis model is distributed to the at least two data parties, the at least two data parties can utilize the user analysis model to perform analysis and detection of user behaviors.
In the embodiment of the invention, an asymmetric encryption algorithm is used to construct a public key and a private key, and the public key is distributed to at least two data side terminal devices so that they can encrypt their data with it, preventing any party's raw data from being exchanged or leaked. Further, the repeated data elimination operation on the sample data reduces the computation of model training and improves the accuracy of the model. Furthermore, the initial user analysis model is trained on training sample data obtained from the at least two data side terminal devices, so that the user analysis model is trained on the combined data and its accuracy is higher.
FIG. 2 is a functional block diagram of the multi-data-party user analysis model joint training apparatus according to the present invention.
The multi-data-party user analysis model joint training apparatus 100 of the present invention can be installed in an electronic device. According to the realized functions, the multi-data-party user analysis model joint training device can comprise a key construction and distribution module 101, an encryption receiving module 102, a decryption summarizing module 103, a data eliminating module 104 and a model training and distribution module 105. A module according to the present invention, which may also be referred to as a unit, refers to a series of computer program segments that can be executed by a processor of an electronic device and that can perform a fixed function, and that are stored in a memory of the electronic device.
In the present embodiment, the functions regarding the respective modules/units are as follows:
the key construction and distribution module 101 is configured to construct a public key and a private key using an asymmetric cryptographic algorithm, and distribute the public key to at least two data side terminal devices.
In detail, in the embodiment of the invention, a third party organization constructs the public key and the private key through a pre-constructed model training system, reserves the private key, and distributes the public key to at least two data side terminal devices. It is emphasized that, to further ensure the privacy and security of the public and private keys, the public and private keys may also be stored in nodes of a blockchain.
The encryption receiving module 102 is configured to receive encryption sample data obtained by the at least two data side terminal devices performing data encryption operation by using the public key.
In the embodiment of the present invention, when the at least two data side terminal devices receive the public key of the third party organization, the public key is used to encrypt respective user data, and the encrypted user data is sent to the third party organization.
The decryption summary module 103 is configured to perform a decryption operation on the encrypted sample data by using the private key to obtain the sample data.
In detail, the decrypting the encrypted data by using the private key to obtain sample data includes:
decrypting the encrypted sample data of the at least two data side terminal devices by using the private key to obtain decrypted data;
summarizing the decrypted data to obtain summarized data;
and performing distributed storage on the summarized data to obtain the sample data.
Because the summarized data is provided by at least two data parties, its volume is large. To reduce the storage and computation pressure on the computer equipment and to record the summarized data permanently, the embodiment of the invention stores it in a distributed manner using a blockchain mechanism.
In detail, the present invention performs distributed storage on the summarized data by using the following Hash function:
slice_id=(w1×(hash_str(point_name)/b1)+w2×(day_time(time)/b2))
wherein slice_id is the fragment number allocated to the data in the database, hash_str(point_name) is a quantization function of the data name of the data added to a storage node of the database, day_time(time) is a quantization function of the time period in which the data was added to the storage node, b1 is the dispersion degree of the data names in the database, b2 is the dispersion degree of the time periods in the database, and w1 and w2 are weight coefficients that can be set manually. In the extreme case where w2 is set to 0, data is distributed entirely according to its data name, and the data of every time period for each data name is stored in the same data slice or one of its backup slices; similarly, if w1 is set to 0, data is distributed according to the time period in which it was added to the storage node, and the data of all data names in each time period is stored in the same data slice or one of its backup slices.
The data eliminating module 104 is configured to perform repeated data eliminating operation on the sample data to obtain training sample data.
The sample data is from different data parties, so that repeated data may exist, and in order to reduce the calculation amount of model training and improve the model accuracy, the repeated data removal operation needs to be performed on the sample data.
In detail, the embodiment of the present invention calculates the similarity between the data of different users in the sample data, and eliminates the data of the repeat user in the sample data according to the similarity between the data. Wherein, the calculation formula of the similarity is as follows:
sim(X, Y) = (Σi Xi×Yi) / (√(Σi Xi²) × √(Σi Yi²))
wherein Xi represents the i-th feature data of user X, Yi represents the i-th feature data of user Y, and sim(X, Y) represents the similarity of users X and Y.
The value of sim(X, Y) ranges from -1 to 1, where -1 means that the two feature vectors point in exactly opposite directions, 1 means that they point in exactly the same direction, and 0 usually means that they are independent of each other. In the embodiment of the invention, when the sim(X, Y) value is 1, user X and user Y are treated as duplicate data, i.e. a common user of the at least two data parties; therefore, the data of either user X or user Y is deleted.
The model training and distributing module 105 is configured to construct an initial user analysis model, train the initial user analysis model according to the training sample data, and obtain a user analysis model; and distributing the user analysis model to the at least two data side terminal devices.
In detail, the initial user analysis model can be constructed using, for example, a linear model, a tree-structured model, or a convolutional neural network model.
In detail, in the embodiment of the invention, the training of the model can be completed by the following steps:
distributing the initial user analysis model to the at least two data side terminal devices;
receiving model parameters which are transmitted by the at least two data side terminal devices and encrypted by using the public key, wherein the model parameters are obtained by training the initial user analysis model by using respective feature data and feature label data of the at least two data side terminal devices;
decrypting the encrypted model parameters by using the private key to obtain decrypted model parameters, and calculating according to the decrypted model parameters to obtain total model parameters;
and updating the model parameters of the initial user analysis model by using the total model parameters to obtain the user analysis model.
After the user analysis model is distributed to the at least two data parties, the at least two data parties can utilize the user analysis model to perform analysis and detection of user behaviors.
Fig. 3 is a schematic structural diagram of an electronic device for implementing joint training of multiple data user analysis models according to the present invention.
The electronic device 1 may include a processor 10, a memory 11 and a bus, and may further include a computer program, such as a multi-data-party user analysis model joint training program 12, stored in the memory 11 and executable on the processor 10.
The memory 11 includes at least one type of readable storage medium, which includes flash memory, removable hard disk, multimedia card, card-type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disk, optical disk, etc. The memory 11 may in some embodiments be an internal storage unit of the electronic device 1, such as a hard disk of the electronic device 1. The memory 11 may also be an external storage device of the electronic device 1 in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the electronic device 1. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 may be used not only for storing application software installed in the electronic device 1 and various types of data, such as codes of the multi-data-party user analysis model joint training program 12, but also for temporarily storing data that has been output or will be output.
The processor 10 may in some embodiments be composed of a single packaged integrated circuit, or of a plurality of integrated circuits packaged with the same or different functions, including one or more central processing units (CPUs), microprocessors, digital processing chips, graphics processors, and combinations of various control chips. The processor 10 is the control unit (Control Unit) of the electronic device: it connects the various components of the electronic device using various interfaces and lines, and executes the functions and processes the data of the electronic device 1 by running or executing the programs or modules stored in the memory 11 (e.g., the multi-data-party user analysis model joint training program) and calling the data stored in the memory 11.
The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like, and may be divided into an address bus, a data bus, a control bus, and so on. The bus is arranged to enable communication between the memory 11, the at least one processor 10, and other components.
Fig. 3 shows only an electronic device with certain components; a person skilled in the art will understand that the structure shown in Fig. 3 does not limit the electronic device 1, which may comprise fewer or more components than shown, a combination of certain components, or a different arrangement of components.
For example, although not shown, the electronic device 1 may further include a power supply (such as a battery) for supplying power to each component. Preferably, the power supply is logically connected to the at least one processor 10 through a power management device, which implements functions such as charge management, discharge management, and power consumption management. The power supply may further include one or more DC or AC power sources, recharging devices, power failure detection circuits, power converters or inverters, power status indicators, and the like. The electronic device 1 may also include various sensors, a Bluetooth module, a Wi-Fi module, and so on, which are not described here again.
Further, the electronic device 1 may further include a network interface, and optionally, the network interface may include a wired interface and/or a wireless interface (such as a WI-FI interface, a bluetooth interface, etc.), which are generally used for establishing a communication connection between the electronic device 1 and other electronic devices.
Optionally, the electronic device 1 may further comprise a user interface, which may include a display (Display) and an input unit (such as a keyboard), and optionally a standard wired interface or a wireless interface. In some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is used to display information processed in the electronic device 1 and to present a visualized user interface.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
The multi-data-party user analysis model joint training program 12 stored in the memory 11 of the electronic device 1 is a combination of instructions which, when executed by the processor 10, can implement:
using an asymmetric encryption algorithm to construct a public key and a private key, and distributing the public key to at least two data side terminal devices; it should be emphasized that, in order to further ensure the privacy and security of the public key and the private key, the public key and the private key may also be stored in a node of a block chain;
receiving encrypted sample data obtained by the at least two data side terminal devices by using the public key to perform data encryption operation;
decrypting the encrypted sample data by using the private key to obtain sample data;
performing repeated data elimination operation on the sample data to obtain training sample data;
constructing an initial user analysis model, and training the initial user analysis model according to the training sample data to obtain a user analysis model; and
and distributing the user analysis model to the at least two data side terminal devices.
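The key-construction, encryption, and decryption steps above can be illustrated with textbook RSA. The tiny primes and raw modular exponentiation below are purely illustrative, not secure, and not the patent's specified algorithm; a real deployment would use a vetted library with proper padding.

```python
# Textbook RSA with toy primes -- illustration only, not secure.
p, q = 61, 53
n = p * q                   # public modulus, shared by both keys
phi = (p - 1) * (q - 1)     # Euler's totient of n
e = 17                      # public exponent, coprime with phi
d = pow(e, -1, phi)         # private exponent (Python 3.8+ modular inverse)

def encrypt(m: int, public_key=(e, n)) -> int:
    # Data-party side: encrypt a sample value with the distributed public key.
    exp, mod = public_key
    return pow(m, exp, mod)

def decrypt(c: int, private_key=(d, n)) -> int:
    # Coordinator side: recover the sample value with the private key.
    exp, mod = private_key
    return pow(c, exp, mod)

sample = 65
cipher = encrypt(sample)
assert decrypt(cipher) == sample  # round trip recovers the plaintext
```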
Specifically, the specific implementation method of the processor 10 for the instruction may refer to the description of the relevant steps in the embodiment corresponding to fig. 1, which is not described herein again.
Further, the integrated modules/units of the electronic device 1, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer-readable storage medium. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, or a read-only memory (ROM).
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names and do not denote any particular order.
The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain (Blockchain) is essentially a decentralized database: a series of data blocks associated with one another by cryptographic methods, where each data block contains information on a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
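The "series of data blocks associated by cryptographic methods" can be sketched as a minimal hash chain, where each block's digest covers its own content plus the previous block's digest. This is a toy illustration of the linking principle only, not any particular blockchain platform's format.

```python
import hashlib
import json

def block_hash(block: dict) -> str:
    # Hash covers the block's content and its pointer to the predecessor.
    body = {k: block[k] for k in ("index", "data", "prev_hash")}
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

def make_block(index: int, data: str, prev_hash: str) -> dict:
    block = {"index": index, "data": data, "prev_hash": prev_hash}
    block["hash"] = block_hash(block)
    return block

def chain_is_valid(chain: list) -> bool:
    for i, block in enumerate(chain):
        # Each block's stored hash must match its recomputed contents...
        if block["hash"] != block_hash(block):
            return False
        # ...and each block must point at its predecessor's hash.
        if i > 0 and block["prev_hash"] != chain[i - 1]["hash"]:
            return False
    return True

genesis = make_block(0, "genesis", "0" * 64)
block1 = make_block(1, "public key fingerprint", genesis["hash"])
chain = [genesis, block1]
```

Editing any stored field (say, the data recorded in `genesis`) without recomputing every later hash makes `chain_is_valid` fail, which is the anti-counterfeiting property the paragraph above describes.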
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims (10)

1. A multi-data-party user analysis model joint training method is characterized by comprising the following steps:
using an asymmetric encryption algorithm to construct a public key and a private key, and distributing the public key to at least two data side terminal devices;
receiving encrypted sample data obtained by the at least two data side terminal devices by using the public key to perform data encryption operation;
decrypting the encrypted sample data by using the private key to obtain sample data;
performing repeated data elimination operation on the sample data to obtain training sample data;
constructing an initial user analysis model, and training the initial user analysis model according to the training sample data to obtain a user analysis model; and
and distributing the user analysis model to the at least two data side terminal devices.
2. The multi-data-party user analysis model joint training method of claim 1, the method further comprising:
and performing distributed storage on the sample data by using the following Hash function:
slice_id = (w1×(hash_str(point_name)/b1) + w2×(day_time(time)/b2))
wherein: slice_id is the fragment number allocated to the data, hash_str(point_name) is a quantization function of the data name of the data added to the storage node, day_time(time) is a quantization function of the time period of the data added to the storage node, b1 is the dispersion degree of the data name, b2 is the dispersion degree of the time period, and w1 and w2 are weight coefficients.
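The distributed-storage sharding rule of claim 2 can be sketched as below. The concrete quantization functions (a stable digest for the data name, whole days for the time period), the parameter values, and the final fold onto a fixed number of storage slices are all assumptions, since the claim leaves them open.

```python
import hashlib
from datetime import datetime, timezone

def hash_str(point_name: str) -> int:
    # Assumed quantization of the data name: a stable integer digest
    # (a cryptographic hash, unlike Python's builtin hash(), is
    # deterministic across processes).
    return int(hashlib.sha256(point_name.encode()).hexdigest(), 16) % 10**6

def day_time(ts: datetime) -> int:
    # Assumed quantization of the time period: whole days since the epoch.
    return int(ts.timestamp()) // 86400

def slice_id(point_name: str, ts: datetime,
             w1: float = 1.0, w2: float = 1.0,
             b1: float = 1000.0, b2: float = 30.0,
             n_slices: int = 16) -> int:
    # slice_id = w1*(hash_str(point_name)/b1) + w2*(day_time(time)/b2),
    # folded onto n_slices storage nodes (the modulo fold is an assumption).
    score = w1 * (hash_str(point_name) / b1) + w2 * (day_time(ts) / b2)
    return int(score) % n_slices

sid = slice_id("user_click_stream", datetime(2020, 4, 30, tzinfo=timezone.utc))
```

Records with the same name land on the same slice within a time window, while `b1` and `b2` control how finely the name and time components spread data across nodes.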
3. The multi-data party user analysis model joint training method of claim 1, wherein the performing repeated data culling operations on the sample data comprises:
calculating the similarity between the data of different users in the sample data;
and according to the similarity between the data, eliminating the data of the repeated user in the sample data.
4. The multi-data party user analysis model joint training method of claim 3, wherein the similarity is calculated by the formula:
(similarity formula published as drawing FDA0002474688890000011; not reproduced in this text)
wherein Xi represents the i-th feature data of user X, Yi represents the i-th feature data of user Y, and sim(X, Y) represents the similarity of users X and Y.
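Because the similarity formula of claim 4 survives only as a drawing, the sketch below assumes cosine similarity over the users' per-feature vectors, a common choice consistent with the Xi/Yi description; the 0.95 duplicate threshold is likewise an assumption.

```python
import math
from typing import Dict, List

def sim(x: List[float], y: List[float]) -> float:
    """Cosine similarity over the i-th feature data of users X and Y
    (assumed form; the published formula is an image)."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    nx = math.sqrt(sum(xi * xi for xi in x))
    ny = math.sqrt(sum(yi * yi for yi in y))
    return dot / (nx * ny) if nx and ny else 0.0

def drop_duplicate_users(users: Dict[str, List[float]],
                         threshold: float = 0.95) -> Dict[str, List[float]]:
    # Keep a user only if no already-kept user is near-identical,
    # eliminating repeated users from the pooled sample data.
    kept: Dict[str, List[float]] = {}
    for uid, feats in users.items():
        if all(sim(feats, other) < threshold for other in kept.values()):
            kept[uid] = feats
    return kept

sample = {"u1": [1.0, 2.0, 3.0],
          "u2": [2.0, 4.0, 6.0],   # scaled copy of u1 -> treated as duplicate
          "u3": [3.0, -1.0, 0.0]}
training_sample = drop_duplicate_users(sample)
```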
5. The method for jointly training multiple data party user analysis models according to any one of claims 1 to 4, wherein the public key and the private key are stored in a blockchain, and the training of the initial user analysis model according to the training sample data to obtain a user analysis model comprises:
distributing the initial user analysis model to the at least two data side terminal devices;
receiving model parameters which are transmitted by the at least two data side terminal devices and encrypted by using the public key, wherein the model parameters are obtained by training the initial user analysis model by using respective feature data and feature label data of the at least two data side terminal devices;
decrypting the encrypted model parameters by using the private key to obtain decrypted model parameters, and calculating according to the decrypted model parameters to obtain total model parameters;
and updating the model parameters of the initial user analysis model by using the total model parameters to obtain the user analysis model.
6. A multi-data-party user analysis model joint training apparatus, the apparatus comprising:
the key construction and distribution module is used for constructing a public key and a private key by using an asymmetric encryption algorithm and distributing the public key to at least two data side terminal devices;
the encryption receiving module is used for receiving the encrypted sample data obtained by the data encryption operation of the at least two data side terminal devices by using the public key;
the decryption summarizing module is used for carrying out decryption operation on the encrypted sample data by using the private key to obtain the sample data;
the data removing module is used for removing the repeated data from the sample data to obtain training sample data;
and the model training and distributing module is used for constructing an initial user analysis model, training the initial user analysis model according to the training sample data to obtain a user analysis model, and distributing the user analysis model to the at least two data side terminal devices.
7. The multi-data-party user analysis model joint training apparatus according to claim 6, wherein the performing repeated data culling operations on the sample data comprises:
calculating the similarity between the data of different users in the sample data by using the following calculation formula of the similarity:
(similarity formula published as drawing FDA0002474688890000021; not reproduced in this text)
wherein Xi represents the i-th feature data of user X, Yi represents the i-th feature data of user Y, and sim(X, Y) represents the similarity of users X and Y;
and according to the similarity between the data, eliminating the data of the repeated user in the sample data.
8. The apparatus for multi-data party user analysis model joint training according to claim 6 or 7, wherein the public key and the private key are stored in a block chain, and the training of the initial user analysis model according to the training sample data to obtain a user analysis model comprises:
distributing the initial user analysis model to the at least two data side terminal devices;
receiving model parameters which are transmitted by the at least two data side terminal devices and encrypted by using the public key, wherein the model parameters are obtained by training the initial user analysis model by using respective feature data and feature label data of the at least two data side terminal devices;
decrypting the encrypted model parameters by using the private key to obtain decrypted model parameters, and calculating according to the decrypted model parameters to obtain total model parameters;
and updating the model parameters of the initial user analysis model by using the total model parameters to obtain the user analysis model.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method for joint training of multiple data party user analysis models as claimed in any one of claims 1 to 5.
10. A computer-readable storage medium, storing a computer program, wherein the computer program, when executed by a processor, implements the multi-data-party user analysis model joint training method according to any one of claims 1 to 5.
CN202010370875.6A 2020-04-30 2020-04-30 Multi-data-party user analysis model joint training method and device and storage medium Pending CN111611601A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010370875.6A CN111611601A (en) 2020-04-30 2020-04-30 Multi-data-party user analysis model joint training method and device and storage medium


Publications (1)

Publication Number Publication Date
CN111611601A true CN111611601A (en) 2020-09-01

Family

ID=72199560



Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113326521A (en) * 2021-06-11 2021-08-31 杭州煋辰数智科技有限公司 Data source joint modeling method based on safe multi-party calculation



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination