CN110378749B - Client similarity evaluation method and device, terminal equipment and storage medium - Google Patents

Client similarity evaluation method and device, terminal equipment and storage medium

Info

Publication number
CN110378749B
Authority
CN
China
Prior art keywords
sample data
client
similarity
server
test
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910681352.0A
Other languages
Chinese (zh)
Other versions
CN110378749A (en
Inventor
魏锡光
李�权
曹祥
刘洋
陈天健
杨强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
WeBank Co Ltd
Original Assignee
WeBank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by WeBank Co Ltd
Priority to CN201910681352.0A
Publication of CN110378749A
Application granted
Publication of CN110378749B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00: Commerce
    • G06Q30/02: Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201: Market modelling; Market analysis; Collecting market data
    • G06Q30/0202: Market predictions or forecasting for commercial activities

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Strategic Management (AREA)
  • Theoretical Computer Science (AREA)
  • Finance (AREA)
  • Development Economics (AREA)
  • Accounting & Taxation (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Artificial Intelligence (AREA)
  • Game Theory and Decision Science (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)

Abstract

The invention discloses a method, a device, a terminal device and a storage medium for evaluating client similarity. The method comprises the following steps: the server obtains pre-stored sample data, or obtains sample data from each client, to serve as first sample data; the server combines the first sample data with second sample data generated by the server to form a test sample set, and tests the test sample set; and the server evaluates the similarity of the user data of each client based on the test results obtained by testing the test sample set. The invention makes it possible to evaluate the similarity of the clients' user data without the server accessing the real user data of the federated learning clients, which improves the federated learning system's understanding of its users, keeps user data secure, and helps the system provide targeted, high-quality services to users.

Description

Client similarity evaluation method and device, terminal equipment and storage medium
Technical Field
The present invention relates to the technical field of Fintech (financial technology), and in particular, to a method, an apparatus, a terminal device, and a storage medium for evaluating client similarity.
Background
With the rapid development of financial technology, particularly internet finance, more and more technologies are being applied in the financial field. Among them, federated learning is receiving increasing attention because it safeguards user privacy and data security.
Federated learning refers to a method of building machine learning models jointly across different participants (also known as parties, data owners, or clients). In federated learning, participants do not need to expose their own data to other participants or to the coordinator (also called the server, parameter server, or aggregation server), so federated learning protects user privacy and data security well and can solve the problem of data silos.
However, in existing federated learning, and especially in horizontal federated learning (which applies when the samples held by different institutions overlap little but their feature dimensions overlap heavily; the portions of data whose features are the same but whose users differ are extracted for training), the security mechanism protecting user data prevents the federated learning server from accessing the clients' raw user data. This greatly limits the server's understanding of client users and makes it difficult for the server to provide targeted, high-quality services to them.
Disclosure of Invention
The main purpose of the invention is to provide a method, a device, a terminal device and a storage medium for evaluating client similarity, with the aim of evaluating the similarity of clients without accessing the user data of federated learning clients, thereby improving the federated learning system's understanding of its users and helping the system provide targeted, high-quality services to users.
To achieve the above object, the present invention provides a method for evaluating client similarity. The method is applied to a federated learning system that includes a server and a plurality of clients, and comprises the following steps:
the server side obtains pre-stored sample data or sample data of each client side to serve as first sample data;
combining the first sample data and the second sample data generated by the server to form a test sample set, and testing the test sample set;
and evaluating the similarity of the user data of each client based on the test result obtained by the test of the test sample set.
Optionally, the step of obtaining, by the server, pre-stored sample data or sample data of each client as the first sample data includes:
the server detects a pre-stored sample data set;
acquiring sample data from the sample data set by random sampling as the first sample data; or,
the server acquires sample data randomly input by each client as the first sample data.
Optionally, after the step of obtaining the pre-stored sample data by the server or obtaining the sample data of each client as the first sample data, the method further includes:
the server generates second sample data based on the first sample data.
Optionally, the step of generating, by the server, second sample data based on the first sample data includes:
the server randomly adds noise to the acquired first sample data and/or randomly perturbs the first sample data to generate the second sample data.
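The generation of second sample data can be sketched as follows. This is a minimal illustration only: it assumes each sample is a list of numeric features, and the function name `make_second_samples`, the Gaussian noise model, the `noise_scale` parameter and the optional feature shuffle are all assumptions made for the example; the text above does not specify a noise distribution or a perturbation scheme.

```python
import random

def make_second_samples(first_samples, noise_scale=0.1, shuffle_features=False, seed=None):
    """Derive second sample data from first sample data (hypothetical helper).

    Each output sample is a copy of an input sample with Gaussian noise added
    to every feature. Optionally the feature order is randomly permuted as a
    simple form of perturbation. The input samples are left unchanged.
    """
    rng = random.Random(seed)
    second_samples = []
    for sample in first_samples:
        noisy = [x + rng.gauss(0.0, noise_scale) for x in sample]
        if shuffle_features:
            rng.shuffle(noisy)
        second_samples.append(noisy)
    return second_samples

# Two illustrative first samples with three numeric features each.
first = [[0.2, 1.5, 3.0], [1.0, 0.0, 2.5]]
second = make_second_samples(first, noise_scale=0.05, seed=42)
```

Passing a fixed `seed` makes the generated samples reproducible, which may be convenient when the same test sample set has to be rebuilt for a later evaluation.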
Optionally, the step of combining the first sample data and the second sample data generated by the server into a test sample set includes:
Extracting first target sample data and second target sample data from the first sample data and the second sample data according to a preset proportion;
and combining the first target sample data and the second target sample data to obtain a test sample set of the first sample data and the second sample data.
Optionally, the step of testing the test sample set includes:
the server side invokes a machine learning model of each client side;
training test is performed on the first target sample data and the second target sample data in the test sample set based on each machine learning model.
Optionally, the step of evaluating the similarity of the user data of each client based on the test result obtained by testing the test sample set includes:
the server records each test result of training test of each machine learning model;
sequentially extracting any two test results, and calculating the similarity of the user data based on a similarity evaluation function; or,
and performing unsupervised clustering on the obtained test results to evaluate the similarity of the user data.
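The clustering alternative can be illustrated with a small sketch. It assumes that each client's test result is a vector of numeric model outputs, and it uses a simple distance-threshold grouping (single linkage via union-find) purely as a stand-in; the text does not name a clustering algorithm, and `cluster_results`, `threshold` and the sample values are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length result vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cluster_results(results, threshold):
    """Group clients whose result vectors lie within `threshold` of each other.

    `results` maps a client id to a list of floats. Clients are linked when
    their distance is at most `threshold`; linked clients end up in the same
    cluster (single-linkage grouping). Returns a sorted list of sorted id lists.
    """
    ids = list(results)
    parent = {cid: cid for cid in ids}

    def find(cid):
        # Follow parent links to the cluster root, shortening the path on the way.
        while parent[cid] != cid:
            parent[cid] = parent[parent[cid]]
            cid = parent[cid]
        return cid

    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if euclidean(results[a], results[b]) <= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for cid in ids:
        groups.setdefault(find(cid), []).append(cid)
    return sorted(sorted(group) for group in groups.values())

# Illustrative test results for three clients.
results = {
    "client_1": [0.90, 0.10, 0.80],
    "client_2": [0.88, 0.12, 0.79],
    "client_3": [0.20, 0.70, 0.30],
}
clusters = cluster_results(results, threshold=0.1)
```

Clients that fall into the same cluster would be treated as having similar user data; the threshold controls how strict that judgment is and would need to be tuned for real model outputs.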
In addition, the invention also provides a device for evaluating client similarity. The device is applied to a federated learning system that includes a server and a plurality of clients, and comprises:
the acquisition module is used for acquiring pre-stored sample data or sample data of each client side as first sample data by the server side;
the testing module is used for combining the first sample data and the second sample data generated by the server into a testing sample set and testing the testing sample set;
and the evaluation module is used for evaluating the similarity of the user data of each client based on the test result obtained by the test of the test sample set.
In addition, the invention also provides a terminal device, which comprises: the system comprises a memory, a processor and a client similarity evaluation program stored on the memory and capable of running on the processor, wherein the client similarity evaluation program realizes the steps of the client similarity evaluation method when being executed by the processor.
In addition, the invention also provides a storage medium, which is applied to a computer, wherein the storage medium stores a client similarity evaluation program, and the client similarity evaluation program realizes the steps of the client similarity evaluation method when being executed by a processor.
In the present invention, the server obtains pre-stored sample data, or obtains sample data from each client, to serve as first sample data; combines the first sample data with second sample data generated by the server to form a test sample set, and tests the test sample set; and evaluates the similarity of the user data of each client based on the test results obtained by testing the test sample set. In the federated learning system, the server collects pre-stored user data that is unrelated to the clients connected to it, or collects temporarily and randomly input user data, as the first sample data for evaluating client similarity. The first sample data is combined with second sample data generated by the server to form a test sample set. After the server tests the test sample set by invoking the models and obtains the test results, it evaluates the similarity of the user data of the clients connected to it from those results, using any existing data similarity evaluation function. The similarity of the clients' user data is thus evaluated without accessing the user data of the federated learning clients, which improves the federated learning system's understanding of its users, keeps user data secure, and helps the system provide targeted, high-quality services to users.
Drawings
FIG. 1 is a schematic diagram of a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a first embodiment of a method for evaluating client similarity according to the present invention;
FIG. 3 is a flowchart illustrating a second embodiment of a method for evaluating client similarity according to the present invention;
FIG. 4 is a schematic diagram of an application scenario in an embodiment of a method for evaluating client similarity according to the present invention;
FIG. 5 is a block diagram of a client similarity evaluation system according to the present invention.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
As shown in fig. 1, fig. 1 is a schematic structural diagram of a hardware running environment according to an embodiment of the present invention.
It should be noted that fig. 1 may be a schematic structural diagram of the hardware operating environment of a terminal device. The terminal device of the embodiment of the invention may be a PC, a portable computer, or another terminal device.
As shown in fig. 1, the terminal device may include: a processor 1001 (such as a CPU), a network interface 1004, a user interface 1003, a memory 1005, and a communication bus 1002. The communication bus 1002 is used to enable communication between these components. The user interface 1003 may include a display and an input unit such as a keyboard; optionally, the user interface 1003 may further include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wi-Fi interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory, such as a disk memory. The memory 1005 may also optionally be a storage device separate from the processor 1001.
It will be appreciated by those skilled in the art that the structure shown in fig. 1 does not limit the terminal device, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
As shown in fig. 1, the memory 1005, as a type of computer storage medium, may include an operating system, a network communication module, a user interface module, and a client similarity evaluation program. The operating system is a program that manages and controls the hardware and software resources of the terminal device and supports the operation of the client similarity evaluation program and other software or programs.
The terminal device shown in fig. 1 and other terminals together form a federated learning system that includes at least a server and a plurality of clients. In the terminal device shown in fig. 1, the user interface 1003 is mainly used for data communication with each terminal; the network interface 1004 is mainly used for connecting to a background server and communicating data with it; and the processor 1001 may be configured to call the client similarity evaluation program stored in the memory 1005 and perform the following operations:
the server side obtains pre-stored sample data or sample data of each client side to serve as first sample data;
Combining the first sample data and the second sample data generated by the server to form a test sample set, and testing the test sample set;
and evaluating the similarity of the user data of each client based on the test result obtained by the test of the test sample set.
Further, the processor 1001 may be further configured to invoke the client similarity evaluation program stored in the memory 1005, and perform the following steps:
the server detects a pre-stored sample data set;
acquiring sample data from the sample data set by random sampling as the first sample data; or,
the server acquires sample data randomly input by each client as the first sample data.
Further, the processor 1001 may be further configured to invoke an evaluation program of client similarity stored in the memory 1005, and after executing the server to obtain pre-stored sample data or obtain sample data of each client as first sample data, execute the following steps:
the server generates second sample data based on the first sample data.
Further, the processor 1001 may be further configured to invoke the client similarity evaluation program stored in the memory 1005, and perform the following steps:
the server randomly adds noise to the acquired first sample data and/or randomly perturbs the first sample data to generate the second sample data.
Further, the processor 1001 may be further configured to invoke the client similarity evaluation program stored in the memory 1005, and perform the following steps:
extracting first target sample data and second target sample data from the first sample data and the second sample data according to a preset proportion;
and combining the first target sample data and the second target sample data to obtain a test sample set of the first sample data and the second sample data.
Further, the processor 1001 may be further configured to invoke the client similarity evaluation program stored in the memory 1005, and perform the following steps:
the server side invokes a machine learning model of each client side;
training test is performed on the first target sample data and the second target sample data in the test sample set based on each machine learning model.
Further, the processor 1001 may be further configured to invoke the client similarity evaluation program stored in the memory 1005, and perform the following steps:
the server records each test result of the training test performed by each machine learning model;
sequentially extracting any two test results, and calculating the similarity of the user data based on a similarity evaluation function; or,
and performing unsupervised clustering on the obtained test results to evaluate the similarity of the user data.
Based on the above structure, various embodiments of the method for evaluating client similarity of the present invention are presented.
Referring to fig. 2, fig. 2 is a flowchart illustrating a first embodiment of a method for evaluating client similarity according to the present invention.
The embodiments of the present invention provide a method for evaluating client similarity. It should be noted that although a logical order is shown in the flowchart, in some cases the steps shown or described may be performed in a different order.
The method for evaluating client similarity according to the embodiment of the present invention is applied to the above terminal device, which may be a PC, a portable computer, or another terminal device; no limitation is imposed here. Further, the method is applied to a federated learning system. Fig. 4 shows an application scenario of the method, in which the federated learning system includes at least one server and a plurality of clients.
The method for evaluating the similarity of the client side in the embodiment comprises the following steps:
in step S100, the server obtains pre-stored sample data or obtains sample data of each client as first sample data.
The server in the federated learning system obtains user data that is unrelated to the users of the clients connected to it in the current federated learning system, and uses that data as first sample data for testing and evaluating the similarity of the user data of the current clients.
In this embodiment, the method for evaluating client similarity is applied to a federated learning system and is particularly suited to horizontal federated learning. Horizontal federated learning applies when the data features of the clients (users) overlap heavily while their users overlap little: the portions of client data whose features are the same but whose users differ are extracted for federated machine learning. For example, in a federation formed by banks in two different regions, the two banks' user groups come from their respective regions, so the intersection of their users is very small; but because their businesses are very similar, most of the features recorded in their user data are the same. Horizontal federated learning can then be used to build a federated learning model that predicts the behavior of both banks' customers and thereby serves both banks.
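The two-bank situation can be made concrete with a short sketch that measures how much the users and the features of two parties overlap. The user ids, feature names and the use of the Jaccard index as the overlap measure are illustrative assumptions, not part of the described method.

```python
def jaccard(a, b):
    """Jaccard index of two collections: |intersection| / |union|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical holdings of two regional banks.
bank_a = {
    "users": {"u1", "u2", "u3"},
    "features": {"age", "income", "balance", "loan_count"},
}
bank_b = {
    "users": {"u3", "u4", "u5"},
    "features": {"age", "income", "balance", "card_count"},
}

user_overlap = jaccard(bank_a["users"], bank_b["users"])           # 1 shared of 5 users
feature_overlap = jaccard(bank_a["features"], bank_b["features"])  # 3 shared of 5 features

# Horizontal federated learning would train on the shared feature columns.
shared_features = sorted(bank_a["features"] & bank_b["features"])
```

A low user overlap combined with a high feature overlap, as in this example, is the pattern in which horizontal federated learning is described as being used.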
Specifically, for example, in the federated learning system represented by the scenario shown in fig. 4, when the server receives an instruction to evaluate the similarity of the user data of the six client users (client 1 through client 6) in the current federated learning system, the server starts to acquire 6 pieces of first sample data that are unrelated to the data features of the current 6 client users, for use in evaluating the similarity of their user data. For example, if the 6 client users belong to the banking domain, the server may acquire 6 pieces of user data from another domain, such as e-commerce, as the first sample data.
Further, step S100 includes:
in step S101, the server detects a pre-stored sample data set.
The server detects a sample data set that has been stored on it in advance for evaluating the similarity of the user data of the clients in the federated learning system.
In this embodiment, the sample data set pre-stored by the server may include user data unrelated to the user data of the clients connected to the server in the current federated learning system. For example, if the clients connected to the server belong to the banking domain, the pre-stored sample data set may include user data from another domain, such as e-commerce.
Step S102, obtaining sample data from the sample data set by random sampling as the first sample data of each client.
The server extracts, by random sampling, as many pieces of sample data from the detected sample data set as there are clients connected to the current server, and uses them as first sample data for evaluating the similarity of the user data of each client.
Specifically, for example, in the federated learning system represented by the scenario shown in fig. 4, after the server detects the pre-stored sample data set, it draws 6 pieces of sample data (as many as the current 6 clients) using an existing random sampling method. The samples are drawn from user data in the sample data set that belongs to a domain other than the banking domain of the current client users, such as e-commerce, or from a certain amount of user data temporarily and randomly input by the research, development and maintenance personnel of the current federated learning system for the purpose of evaluating client similarity. The 6 pieces of sample data thus extracted are used as first sample data for evaluating the similarity between the user data of the 6 clients connected to the current server.
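Steps S101 and S102 can be sketched as a draw without replacement from the pre-stored pool. The helper name `draw_first_samples`, the record fields and the pool contents are hypothetical and serve only to illustrate drawing one sample per connected client.

```python
import random

def draw_first_samples(sample_pool, num_clients, seed=None):
    """Randomly draw one pre-stored sample per connected client (hypothetical helper)."""
    if num_clients > len(sample_pool):
        raise ValueError("sample pool is smaller than the number of clients")
    rng = random.Random(seed)
    # sample() draws without replacement, so no record is picked twice.
    return rng.sample(sample_pool, num_clients)

# Hypothetical pre-stored pool: e-commerce records unrelated to the banking clients.
sample_pool = [
    {"record_id": i, "basket_value": 10.0 * i, "visits": i % 4}
    for i in range(10)
]
first_samples = draw_first_samples(sample_pool, num_clients=6, seed=7)
```

With six connected clients, the call returns six distinct records from the pool, matching the count used in the fig. 4 example.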
Step S103, the server acquires sample data randomly input by each client as the first sample data.
The server obtains a certain amount of user sample data that the research, development and maintenance personnel of the current federated learning system have temporarily and randomly input for evaluating the similarity of the clients' user data, and uses it as the first sample data.
Step S200, combining the first sample data with the second sample data generated by the server to form a test sample set, and testing the test sample set.
The server generates as many pieces of second sample data as there are clients connected to it, mixes the acquired first sample data with the generated second sample data to form a test sample set, and tests the first sample data and the second sample data in the test sample set.
Specifically, for example, in the federated learning system represented by the scenario shown in fig. 4, suppose the server has received an instruction to evaluate the similarity of the user data of the six client users (client 1 through client 6) and has obtained, from its pre-stored sample data set, 6 pieces of first sample data unrelated to the data features of those 6 client users. The server then generates 6 pieces of second sample data, matching the number of clients connected to it, and mixes the 6 pieces of first sample data with the 6 pieces of second sample data to form the test sample set needed for testing before the similarity of the clients' user data is evaluated. As soon as the server detects that the test sample set has been assembled, it begins testing the first sample data and the second sample data in the set.
Further, in step S200, the step of combining the first sample data with the second sample data generated by the server into a test sample set includes:
step S201, extracting first target sample data and second target sample data from the first sample data and the second sample data according to a preset ratio.
Step S202, combining the extracted first target sample data and the second target sample data to obtain a test sample set of the first sample data and the second sample data.
After the server, acting on a received instruction to evaluate the similarity of the clients' user data, has obtained from its pre-stored sample data set as many pieces of sample data as there are clients connected to it, it extracts first target sample data and second target sample data, according to a preset ratio, from the obtained first sample data and the second sample data it has generated. For example, at a 1:1 ratio, all 6 pieces of obtained first sample data are extracted as the first target sample data and all 6 pieces of generated second sample data are extracted as the second target sample data. The server then combines the extracted first target sample data with the second target sample data to obtain a mixed data set of first and second sample data, and marks this data set as the test sample set on which it will invoke the models to test the first sample data and the second sample data.
In this embodiment, the preset ratio is a ratio, set in advance by the server according to the needs of the client similarity evaluation, between the number of pieces of first sample data and the number of pieces of second sample data to be extracted. To obtain a more accurate evaluation result, other ratios may also be used as the preset ratio when extracting the first target sample data and the second target sample data; that is, the specific value of the preset ratio is not limited.
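Steps S201 and S202 can be sketched as below. The way the ratio is applied here (take as many whole `a:b` units as both inputs allow) is one possible reading chosen for illustration; the helper name `build_test_set`, the origin tags and the final shuffle are assumptions rather than details given in the text.

```python
import random

def build_test_set(first_samples, second_samples, ratio=(1, 1), seed=None):
    """Mix first and second sample data at a preset ratio (hypothetical helper).

    `ratio=(a, b)` means `a` first samples for every `b` second samples.
    As many whole a:b units as both inputs allow are extracted, each sample
    is tagged with its origin, and the combined list is shuffled.
    """
    a, b = ratio
    if a <= 0 or b <= 0:
        raise ValueError("ratio components must be positive")
    units = min(len(first_samples) // a, len(second_samples) // b)
    rng = random.Random(seed)
    first_targets = rng.sample(first_samples, a * units)
    second_targets = rng.sample(second_samples, b * units)
    test_set = [("first", s) for s in first_targets]
    test_set += [("second", s) for s in second_targets]
    rng.shuffle(test_set)
    return test_set

# Six first samples and six second samples, as in the fig. 4 example.
first_samples = [[float(i)] for i in range(6)]
second_samples = [[float(i) + 0.5] for i in range(6)]
test_set = build_test_set(first_samples, second_samples, ratio=(1, 1), seed=0)
```

At a 1:1 ratio with six samples of each kind, all twelve samples end up in the test sample set; a 2:1 ratio over the same inputs would keep six first samples and three second samples.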
Further, in step S200, the step of testing the test sample set includes:
in step S203, the server invokes the machine learning model of each client.
Step S204, performing a training test on the first target sample data and the second target sample data in the test sample set based on each machine learning model.
The server collects and retrieves the local machine learning models of all clients connected to it in the current federated learning system, and performs a training test on the first target sample data and the second target sample data in the test sample set based on those models.
Specifically, for example, in the federated learning system represented by the scenario shown in fig. 4, the server collects the 6 local machine learning models of the 6 clients connected to it. It invokes the 6 models one at a time, each time randomly selecting 1 piece of first target sample data and 1 piece of second target sample data from the test sample set for a local training test, until all 6 local machine learning models have completed the local training test on the first target sample data and the second target sample data in the test sample set.
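Steps S203 and S204 can be sketched in simplified form as follows. The sketch treats each client's local model as a plain callable that maps a feature list to a number, and it feeds every model the same samples in the same order so that the outputs are directly comparable; that is a simplifying assumption, since the example above describes a random selection of samples for each model. The names `collect_test_results` and `make_linear_model`, and the toy linear models, are hypothetical.

```python
def make_linear_model(weights, bias=0.0):
    """Build a toy stand-in for a client's local model (hypothetical)."""
    def predict(features):
        return sum(w * x for w, x in zip(weights, features)) + bias
    return predict

def collect_test_results(client_models, test_samples):
    """Run every client's model on the test samples and record the outputs.

    `client_models` maps a client id to a callable model. Only the models
    are used here; no client user data is read at any point.
    """
    return {
        client_id: [model(sample) for sample in test_samples]
        for client_id, model in client_models.items()
    }

client_models = {
    "client_1": make_linear_model([1.0, 0.0]),
    "client_2": make_linear_model([0.0, 1.0]),
}
test_samples = [[1.0, 2.0], [3.0, 4.0]]
results = collect_test_results(client_models, test_samples)
```

Each entry in `results` plays the role of one recorded test result: a per-client vector of model outputs that later steps can compare.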
And step S300, evaluating the similarity of the user data of each client based on the test result obtained by testing the test sample set.
After generating as many pieces of second sample data as there are connected clients, combining the first sample data and the second sample data into a test sample set, and testing the first sample data and the second sample data in that set, the server evaluates the similarity of the user data of each client based on the test results obtained.
Specifically, for example, in the federated learning system represented by the scenario shown in fig. 4, suppose the server has received an instruction to evaluate the similarity of the client users' user data and has obtained, from its pre-stored sample data set, 6 pieces of first sample data unrelated to the data features of the current 6 client users. The server generates 6 pieces of second sample data, matching the 6 clients connected to it, extracts 6 pieces of first target sample data and 6 pieces of second target sample data at a 1:1 ratio, and combines them to form the test sample set. As soon as the server detects that the test sample set has been assembled, it tests the first target sample data and the second target sample data in the set and records the test result that each local machine learning model produces for them. It then evaluates the similarity of the user data based on those test results, using any similarity evaluation function or clustering method.
Further, step S300 includes:
Step S301: the server records each test result of the training test performed with each machine learning model.
The server in the current federated learning system records the process in which the local machine learning training models of all clients perform the training test on the first sample data and the second sample data in the test sample set, thereby recording the test results that each local machine learning training model produces for the first and second sample data.
Specifically, for example, in the federated learning system of the scenario shown in Fig. 4, while the server invokes the 6 local machine learning training models of the 6 clients one at a time and, for each, randomly selects 1 first target sample data and 1 second target sample data from the test sample set for the local training test, it records the result of each local machine learning training model's training test on each first target sample data and second target sample data. Once all 6 models have completed the local training test on the first and second target sample data in the test sample set, the server thus holds 6 test results, one per local machine learning training model.
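The recording step can be illustrated with a small sketch in which each client's local model is represented by a plain Python callable. The linear stand-in models, the client identifiers, and the function `record_test_results` are hypothetical; the patent does not specify the model type or the form of a test result.

```python
def record_test_results(client_models, test_samples):
    """Run every client's local model on every test sample.

    Returns one result vector per client (hypothetical representation of
    the "test result" recorded by the server).
    """
    results = {}
    for client_id, model in client_models.items():
        results[client_id] = [model(sample) for sample in test_samples]
    return results


def make_linear_model(weights):
    # Hypothetical stand-in for a client's locally trained model.
    def model(sample):
        return sum(w * x for w, x in zip(weights, sample))
    return model


client_models = {
    "client_1": make_linear_model([1.0, 0.0]),
    "client_2": make_linear_model([0.9, 0.1]),
    "client_3": make_linear_model([-1.0, 2.0]),
}
test_samples = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
results = record_test_results(client_models, test_samples)
print(results["client_1"])
```

Because every model sees the same test samples, the recorded vectors are directly comparable across clients, which is what the later similarity evaluation relies on.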
Step S302, any two test results are extracted successively, and the similarity of the user data is calculated based on a similarity evaluation function.
From all the recorded test results of the local machine learning training models' training tests on the first and second sample data, the server successively extracts any two test results and applies any existing data similarity evaluation function to them, obtaining a calculation result that evaluates the similarity of the user data of the two clients corresponding to those test results.
Specifically, for example, from the 6 recorded test results obtained when the 6 local machine learning training models performed the training test on the first and second target sample data in the test sample set, the server successively extracts two test results at a time in pairwise combination (the first and second test results, then the first and third, then the first and fourth, and so on until every test result has been paired with every other test result). It applies any existing data similarity evaluation function to each pair, which for 6 test results yields 15 calculation results, and evaluates the similarity among the current 6 clients pair by pair from those 15 results: the two clients corresponding to the pair of test results with the largest calculation result have the highest client similarity.
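As a minimal sketch of the pairwise evaluation, the snippet below scores every pair of recorded result vectors and picks the highest-scoring pair. Cosine similarity is used only as one possible choice of similarity evaluation function; the patent leaves the function open. The result vectors and client names are made up for illustration. For 6 clients the number of pairs is 6 × 5 / 2 = 15.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pairwise_similarity(results):
    """Score every unordered pair of clients by the similarity of their results."""
    return {
        (a, b): cosine_similarity(results[a], results[b])
        for a, b in combinations(sorted(results), 2)
    }


# Hypothetical test results for 6 clients.
results = {
    "client_1": [1.0, 0.0, 0.0],
    "client_2": [2.0, 0.0, 0.0],
    "client_3": [0.0, 1.0, 0.0],
    "client_4": [0.0, 1.0, 1.0],
    "client_5": [1.0, 1.0, 0.0],
    "client_6": [0.0, 0.0, -1.0],
}
scores = pairwise_similarity(results)
most_similar = max(scores, key=scores.get)
print(len(scores), most_similar)
```

Here `client_1` and `client_2` produce results pointing in the same direction, so their pair receives the largest score and would be judged the most similar.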
Step S303, performing unsupervised clustering on the obtained test results to evaluate the similarity of the user data.
Unsupervised clustering is performed on all recorded test results of the local machine learning training models' training tests on the first and second sample data, thereby obtaining the similarity of the user data of the clients corresponding to any two test results.
Specifically, for example, when so many clients are connected to the server of the federated learning system that computing the data similarity evaluation function over every pairwise combination of test results would consume a large amount of resources or time, the server instead performs unsupervised clustering directly on the large number of recorded test results obtained when the local machine learning training models performed the training test on the first and second target sample data in the test sample set, thereby evaluating the similarity among the user data of the many clients connected to the server.
In this method, the server in the federated learning system obtains user data unrelated to the users of all clients connected to it as the first sample data for testing and evaluating the similarity of the clients' user data; the server generates second sample data equal in number to the clients currently connected to it, mixes the obtained first sample data with the generated second sample data to form a test sample set, tests the first and second sample data in the test sample set, and evaluates the similarity of the user data of each client based on the test results.
In this way, the federated learning system evaluates the similarity of the clients' user data without accessing the real user data of the federated learning clients, which improves the system's understanding of its users, keeps the user data secure, and helps the system provide targeted, high-quality service to users.
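One way to picture the clustering alternative is a small k-means routine over the recorded result vectors. The patent does not name a clustering algorithm, so k-means, the farthest-point initialisation, the value of `k`, and the sample vectors below are all assumptions chosen for illustration.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def mean_vector(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def kmeans(vectors, k, iterations=20):
    """Plain k-means; returns one cluster label per input vector."""
    # Deterministic farthest-point initialisation (an assumption for this sketch).
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(
            vectors,
            key=lambda v: min(squared_distance(v, c) for c in centroids),
        )
        centroids.append(list(farthest))
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = mean_vector(members)
    return labels


# Hypothetical test results: two visibly different groups of clients.
results = {
    "client_1": [0.0, 0.1],
    "client_2": [0.1, 0.0],
    "client_3": [0.05, 0.05],
    "client_4": [5.0, 5.1],
    "client_5": [5.1, 5.0],
    "client_6": [4.9, 5.0],
}
labels = kmeans(list(results.values()), k=2)
clusters = dict(zip(results, labels))
print(clusters)
```

Clients that land in the same cluster are treated as having similar user data, and the cost per iteration grows with the number of clients times `k` rather than with the number of client pairs.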
Further, a second embodiment of the method for evaluating client similarity of the present invention is presented.
Referring to Fig. 3, Fig. 3 is a flowchart of a second embodiment of the client similarity evaluation method of the present invention. In this embodiment, after step S100 of obtaining pre-stored sample data or obtaining sample data of each client as the first sample data, the client similarity evaluation method of the present invention further includes:
Step S400: the server generates second sample data based on the first sample data.
After obtaining user data unrelated to the users of the clients connected to it as the first sample data for testing and evaluating the similarity of the clients' user data, the server generates, from the obtained first sample data, second sample data equal in number to the first sample data.
Specifically, for example, in the federated learning system of the scenario shown in Fig. 4, after the server, upon receiving an instruction to evaluate the similarity of the user data of the client users in the current federated learning system, obtains 6 user data unrelated to the data features of the current 6 client users as the first sample data, it generates 6 second sample data, one from each of the 6 first sample data.
Further, step S400 includes:
Step S401: the server randomly adds noise to the obtained first sample data and/or randomly perturbs the first sample data to generate the second sample data.
After obtaining the first sample data from the pre-stored sample data set, the server randomly adds data noise to each first sample data in turn, and/or randomly perturbs each first sample data in turn, thereby generating second sample data equal in number to the first sample data.
Specifically, for example, in the federated learning system of the scenario shown in Fig. 4, after the server extracts from the pre-stored sample data set a certain number of user data unrelated to the current federated learning system (such as other user data from outside the e-commerce field to which the current client users belong) as the first sample data for evaluating the similarity of the user data of the clients connected to the server, the server randomly adds data noise to each extracted sample data in turn to perform redundancy processing, or randomly perturbs each extracted sample data in turn to perform scrambling processing, or does both, adding data noise and perturbing each sample data in turn, thereby generating the second sample data.
In this embodiment, the way the server generates the second sample data is not limited to adding data noise to, or perturbing, the extracted sample data; the server may also generate the second sample data by cutting or shuffling the sample data, or by any combination of such operations.
In the present invention, after the server of the federated learning system obtains user data unrelated to the users of the clients connected to it as the first sample data for testing and evaluating the similarity of the clients' user data, it applies cutting, noise addition, shuffling and/or perturbation to each first sample data to generate second sample data equal in number to the first sample data. The first sample data collected by the server and the generated second sample data are then combined into a test sample set, which the server uses to invoke and test each client's local machine learning model and thereby evaluate the similarity of the clients' user data. Client similarity in the federated learning system can thus be evaluated from the first sample data and randomly generated sample data, which improves the system's understanding of its client users and lets it serve them more precisely, without touching the users' real raw data, so that user data remains secure.
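The generation of second sample data from first sample data might look roughly like the sketch below, which treats each sample as a list of numeric features. Gaussian noise, the `scale` parameter, and the helper names are assumptions for illustration only; the patent does not fix the noise distribution, the perturbation scheme, or the data format.

```python
import random


def add_noise(sample, rng, scale=0.1):
    """Add zero-mean Gaussian noise to every feature (assumed noise model)."""
    return [x + rng.gauss(0.0, scale) for x in sample]


def shuffle_features(sample, rng):
    """Return a copy of the sample with its feature order shuffled."""
    shuffled = list(sample)
    rng.shuffle(shuffled)
    return shuffled


def generate_second_samples(first_samples, rng=None, scale=0.1, shuffle=False):
    """Derive one second sample from each first sample."""
    if rng is None:
        rng = random.Random()
    second = []
    for sample in first_samples:
        derived = add_noise(sample, rng, scale)
        if shuffle:
            # Optional combination of noise and shuffling.
            derived = shuffle_features(derived, rng)
        second.append(derived)
    return second


first = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
second = generate_second_samples(first, rng=random.Random(42))
again = generate_second_samples(first, rng=random.Random(42))
print(len(second))
```

Because one second sample is derived from each first sample, the two sets always have the same size, matching the equal-quantity requirement described above. Passing a seeded `random.Random` makes the generation reproducible.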
In addition, referring to Fig. 5, an embodiment of the present invention further provides a client similarity evaluation device applied to a federated learning system, where the federated learning system includes a server and a plurality of clients, and the client similarity evaluation device includes:
an acquisition module, used by the server to obtain pre-stored sample data or sample data of each client as first sample data;
a test module, used to combine the first sample data and the second sample data generated by the server into a test sample set and to test the test sample set;
and an evaluation module, used to evaluate the similarity of the user data of each client based on the test results obtained by testing the test sample set.
Preferably, the acquisition module comprises:
a first acquisition unit, used by the server to detect a pre-stored sample data set;
the first acquisition unit being further used to obtain sample data from the sample data set by random sampling as the first sample data of each client; or
a second acquisition unit, used by the server to obtain sample data randomly input by each client as the first sample data.
Preferably, the client similarity evaluation device further includes:
a generation module, used by the server to generate second sample data based on the first sample data.
Preferably, the generation module includes:
a generation unit, used by the server to randomly add noise to the obtained first sample data and/or randomly perturb the first sample data so as to generate the second sample data.
Preferably, the test module comprises:
a data extraction unit, used to extract first target sample data and second target sample data from the first sample data and the second sample data according to a preset ratio;
and a data combination unit, used to combine the extracted first target sample data and second target sample data to obtain a test sample set of the first sample data and the second sample data.
Preferably, the test module further includes:
a calling unit, used by the server to invoke the machine learning model of each client;
and a test unit, used to perform the training test on the first target sample data and the second target sample data in the test sample set based on each machine learning model.
Preferably, the evaluation module includes:
a test result acquisition unit, used by the server to record each test result of the training test of each machine learning model;
a first evaluation unit, used to successively extract any two test results and calculate the similarity of the user data based on a similarity evaluation function; or
a second evaluation unit, used to perform unsupervised clustering on the obtained test results so as to evaluate the similarity of the user data.
When running, each module of the client similarity evaluation device provided in this embodiment implements the steps of the client similarity evaluation method described above, which are not repeated here.
In addition, the embodiment of the invention also provides a storage medium which is applied to a computer, namely the storage medium is a computer readable storage medium, the storage medium stores a client similarity evaluation program, and the client similarity evaluation program realizes the steps of the client similarity evaluation method when being executed by a processor.
For the method implemented when the client similarity evaluation program is executed on the processor, reference may be made to the embodiments of the client similarity evaluation method of the present invention, which are not repeated here.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method according to the embodiments of the present invention.
The foregoing describes only preferred embodiments of the present invention and does not limit its scope; any equivalent structure or equivalent process derived from the disclosure herein, whether applied directly or indirectly in other related technical fields, is likewise covered by the scope of protection of the present invention.

Claims (10)

1. A method for evaluating the similarity of clients, characterized in that the method is applied to a federated learning system, the federated learning system comprising a server and a plurality of clients, and the method comprising the following steps:
the server obtains pre-stored sample data, or sample data randomly input by each client, as first sample data; the sample data is user data unrelated to the clients connected to the server, and the sample data is not real user data of any client;
combining the first sample data with second sample data generated by the server, the second sample data being equal in number to the clients connected to the server, into a test sample set, and invoking the local machine learning training model of each client to perform a local training test on the test sample set;
and evaluating the similarity of each client by applying a similarity evaluation function to the test results obtained by testing the test sample set.
2. The method for evaluating the similarity of clients according to claim 1, wherein the step of the server obtaining pre-stored sample data or obtaining sample data randomly input by each client as the first sample data comprises:
the server detects a pre-stored sample data set;
obtaining sample data from the sample data set by random sampling as the first sample data; or
the server obtains sample data randomly input by each client as the first sample data.
3. The method for evaluating the similarity of clients according to claim 1, wherein after the step of the server obtaining pre-stored sample data or obtaining sample data randomly inputted by each client as first sample data, the method further comprises:
the server generates second sample data based on the first sample data.
4. The method for evaluating the similarity of clients according to claim 3, wherein the step of generating second sample data by the server based on the first sample data includes:
the server randomly adds noise to the obtained first sample data and/or randomly perturbs the first sample data to generate the second sample data.
5. The method for evaluating the similarity of clients according to claim 1, wherein the step of combining the first sample data and the second sample data, generated by the server in the same number as the clients connected to the server, into a test sample set includes:
extracting first target sample data and second target sample data from the first sample data and the second sample data according to a preset proportion;
and combining the first target sample data and the second target sample data to obtain a test sample set of the first sample data and the second sample data.
6. The method for evaluating client similarity according to claim 5, wherein said step of performing a local training test on said test sample set comprises:
the server side invokes a machine learning model of each client side;
training test is performed on the first target sample data and the second target sample data in the test sample set based on each machine learning model.
7. The method for evaluating the similarity of clients according to claim 6, wherein the step of evaluating the similarity of the clients based on the test result obtained by the test performed by the test sample set comprises:
the server records each test result of the training test of each machine learning model;
successively extracting any two test results, and calculating the similarity of the clients based on a similarity evaluation function; or
performing unsupervised clustering on the obtained test results to evaluate the similarity of the clients.
8. A device for evaluating the similarity of clients, characterized in that the device is applied to a federated learning system, the federated learning system comprising a server and a plurality of clients, and the device comprising:
an acquisition module, used by the server to obtain pre-stored sample data, or sample data randomly input by each client, as first sample data; the sample data is user data unrelated to the clients connected to the server, and the sample data is not real user data of any client;
a test module, used to combine the first sample data with second sample data generated by the server, the second sample data being equal in number to the clients connected to the server, into a test sample set, and to invoke the local machine learning training model of each client to perform a local training test on the test sample set;
and an evaluation module, used to evaluate the similarity of each client by applying a similarity evaluation function to the test results obtained by testing the test sample set.
9. A terminal device, characterized in that the terminal device comprises: memory, a processor and a client similarity evaluation program stored on the memory and executable on the processor, which when executed by the processor, implements the steps of the client similarity evaluation method according to any one of claims 1 to 7.
10. A storage medium, characterized in that it is applied to a computer, on which a client similarity evaluation program is stored, which when executed by a processor, implements the steps of the client similarity evaluation method according to any one of claims 1 to 7.
CN201910681352.0A 2019-07-25 2019-07-25 Client similarity evaluation method and device, terminal equipment and storage medium Active CN110378749B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910681352.0A CN110378749B (en) 2019-07-25 2019-07-25 Client similarity evaluation method and device, terminal equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110378749A CN110378749A (en) 2019-10-25
CN110378749B CN110378749B (en) 2023-09-26

Family

ID=68256257

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910681352.0A Active CN110378749B (en) 2019-07-25 2019-07-25 Client similarity evaluation method and device, terminal equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110378749B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111046425B (en) * 2019-12-12 2021-07-13 支付宝(杭州)信息技术有限公司 Method and device for risk identification by combining multiple parties
CN111340614B (en) * 2020-02-28 2021-05-18 深圳前海微众银行股份有限公司 Sample sampling method and device based on federal learning and readable storage medium
CN111582508A (en) * 2020-04-09 2020-08-25 上海淇毓信息科技有限公司 Strategy making method and device based on federated learning framework and electronic equipment
CN113554476B (en) * 2020-04-23 2024-04-19 京东科技控股股份有限公司 Training method and system of credit prediction model, electronic equipment and storage medium
CN111625587B (en) * 2020-05-28 2022-02-15 泰康保险集团股份有限公司 Data sharing apparatus
CN111598186B (en) * 2020-06-05 2021-07-16 腾讯科技(深圳)有限公司 Decision model training method, prediction method and device based on longitudinal federal learning
US20220114475A1 (en) * 2020-10-09 2022-04-14 Rui Zhu Methods and systems for decentralized federated learning
CN115730640A (en) * 2021-08-31 2023-03-03 华为技术有限公司 Data processing method, device and system
CN117556268A (en) * 2022-07-31 2024-02-13 华为技术有限公司 Data quality measurement method and device
CN117933427B (en) * 2024-03-19 2024-05-28 南京邮电大学 Differential privacy federal learning method for double sampling optimization of smart grid

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105678557A (en) * 2016-01-04 2016-06-15 百度在线网络技术(北京)有限公司 Method and device for generating model, method and device for evaluating service quality
CN107545275A (en) * 2017-07-27 2018-01-05 华南理工大学 The unbalanced data Ensemble classifier method that resampling is merged with cost sensitive learning
CN107644036A (en) * 2016-07-21 2018-01-30 阿里巴巴集团控股有限公司 A kind of method, apparatus and system of data object push
CN109635462A (en) * 2018-12-17 2019-04-16 深圳前海微众银行股份有限公司 Model parameter training method, device, equipment and medium based on federation's study
CN109871702A (en) * 2019-02-18 2019-06-11 深圳前海微众银行股份有限公司 Federal model training method, system, equipment and computer readable storage medium
CN109948674A (en) * 2019-03-05 2019-06-28 清华大学 Method for measuring similarity and system based on depth meta learning
CN110008696A (en) * 2019-03-29 2019-07-12 武汉大学 A kind of user data Rebuilding Attack method towards the study of depth federation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
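A large-sample hybrid credit evaluation model based on similar-sample merging; 张润驰; 杜亚斌; 薛立国; 徐源浩; 吴心弘; Journal of Management Sciences in China (管理科学学报) (07); 82-95 *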

Also Published As

Publication number Publication date
CN110378749A (en) 2019-10-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant