CN116028832A

CN116028832A - Sample clustering processing method and device, storage medium and electronic equipment

Info

Publication number: CN116028832A
Application number: CN202310064936.XA
Authority: CN
Inventors: 彭胜波; 周宏�; 陈林; 侯雄辉
Original assignee: China Tobacco Corp Guizhou Provincial Co
Current assignee: China Tobacco Corp Guizhou Provincial Co
Priority date: 2023-02-06
Filing date: 2023-02-06
Publication date: 2023-04-28

Abstract

The embodiment of the disclosure provides a sample clustering processing method, which comprises the following steps: acquiring local sample distance information of each sample based on the characteristic information of each local sample; processing the local sample distance information based on a preset protocol to obtain the full-feature dimension distance information of each sample; wherein the preset protocol includes at least one of: SPDZ, ABY, ABY3 or NPDZ; based on the full-feature dimension distance information, acquiring clustering information of each sample; and clustering each local sample based on the clustering information of each sample. By using the preset protocol, the local server can complete calculation of the full-feature dimension distance information of each sample based on the local sample distance information without calculating by means of a central server, so that the problem that a trusted central server is difficult to find in the related technology is solved, and clustering processing and unsupervised learning of samples in a plurality of local servers in a federal learning system are realized.

Description

Sample clustering processing method and device, storage medium and electronic equipment

Technical Field

The disclosure relates to the technical field of computers, in particular to the technical field of federal learning and the like, and particularly relates to a sample clustering processing method, a device, a storage medium and electronic equipment.

Background

Federal learning (Federated Learning) is an emerging artificial intelligence basic technology, and is first proposed by Google in 2016, originally used for solving the problem of local model updating of android mobile phone terminal users, and the design goal is to develop high-efficiency machine learning among multiple participants or multiple computing nodes on the premise of guaranteeing information security during large data exchange, protecting terminal data and personal data privacy and guaranteeing legal compliance.

Federal learning platforms are generally composed of data holder nodes and a central server node. The amount or number of features of each data holder's local data may not be sufficient to support a successful model training and thus requires support by other data holders, while the federated learning center server works similarly to the distributed machine learning center server. Taking classification tasks as an example, a central server collects gradients of all data holders, and returns new gradients after aggregation operation is performed in the server. In the collaborative modeling process of one federation learning, the training of the data holder on the local data only occurs locally so as to protect the data privacy, the gradient generated by iteration is used as interaction information after desensitization, the interaction information is uploaded to a server trusted by a third party instead of the local data, and the server waits for returning the aggregated parameters to update the model.

At present, research on the Union learning algorithm is mainly focused on the field of supervised learning, but less research is performed on the field of unsupervised learning. The existing unsupervised federal K-Means algorithm requires the help of a central server. In a practical environment, however, it is difficult to find a trusted central server.

Disclosure of Invention

The disclosure provides a sample clustering processing method, a sample clustering processing device, a storage medium and electronic equipment, and solves the problem that related technologies need to use a trusted center server to perform sample clustering.

According to an aspect of the present disclosure, there is provided a sample clustering method, including:

acquiring local sample distance information of each sample based on the characteristic information of each local sample;

processing the local sample distance information based on a preset protocol to obtain the full-feature dimension distance information of each sample; wherein the preset protocol includes at least one of: SPDZ, ABY, ABY3 or NPDZ;

acquiring clustering information of each sample based on the full-feature dimension distance information;

and carrying out clustering processing on the local samples based on the clustering information of the samples.

According to another aspect of the present disclosure, there is provided a sample cluster processing apparatus, including:

the first acquisition module is used for acquiring local sample distance information of each sample based on the characteristic information of each local sample;

the second acquisition module is used for processing the local sample distance information based on a preset protocol to acquire the full-feature dimension distance information of each sample; wherein the preset protocol includes at least one of: SPDZ, ABY, ABY3 or NPDZ;

the third acquisition module is used for acquiring the clustering information of each sample based on the full-feature dimension distance information;

and the clustering module is used for carrying out clustering processing on the local samples based on the clustering information of the samples.

According to yet another aspect of the present disclosure, there is provided a computer-readable storage medium having stored therein computer instructions that are executed by a computer to implement any one of the methods as in the embodiments of the present disclosure.

According to yet another aspect of the present disclosure, there is provided an electronic device comprising a processor and a memory, wherein the memory stores instructions executable by the processor to implement any one of the methods as in the embodiments of the present disclosure.

According to the sample clustering processing method provided by the embodiment of the disclosure, by using the preset protocol (SPDZ, ABY, ABY or NPDZ), the local server can complete calculation of the full-feature dimension distance information of each sample based on the local sample distance information without calculating by means of a central server, so that the problem that a trusted central server is difficult to find in the related art is solved, and clustering processing and unsupervised learning of samples in a plurality of local servers in a federal learning system are realized.

Drawings

FIG. 1 is a sample clustering method according to an embodiment of the present disclosure;

FIG. 2 is a sample cluster processing device according to an embodiment of the disclosure;

FIG. 3 is another sample cluster processing device according to an embodiment of the disclosure;

FIG. 4 is yet another sample cluster processing device according to an embodiment of the disclosure;

FIG. 5 is yet another sample cluster processing device according to an embodiment of the disclosure;

fig. 6 is yet another sample cluster processing device according to an embodiment of the disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

It will be apparent that the described embodiments are some, but not all, of the embodiments of the present disclosure. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments in this disclosure without inventive faculty, are intended to be within the scope of this disclosure.

It should be noted that, the terminal device in the embodiments of the present disclosure may include, but is not limited to, smart devices such as a mobile phone, a personal digital assistant (Personal Digital Assistant, PDA), a wireless handheld device, and a Tablet Computer (Tablet Computer); the display device may include, but is not limited to, a personal computer, a television, or the like having a display function.

In addition, the term "and/or" herein is merely an association relationship describing an association object, and means that three relationships may exist, for example, a and/or B may mean: a exists alone, A and B exist together, and B exists alone. In addition, the character "/" herein generally indicates that the front and rear associated objects are an "or" relationship.

Based on the problems mentioned in the related art, the present disclosure provides a sample clustering scheme, which can be applied to federal learning without supervised learning.

Fig. 1 is a sample clustering method according to an embodiment of the disclosure, which may be specifically applied to a local server in a federal learning system, as shown in fig. 1, and may specifically include the following steps:

s102, acquiring local sample distance information of each sample based on the characteristic information of each local sample;

s104, processing the local sample distance information based on a preset protocol to obtain the full-feature dimension distance information of each sample; wherein the preset protocol includes at least one of: SPDZ, ABY, ABY3 or NPDZ;

s106, acquiring clustering information of each sample based on the full-feature dimension distance information;

s108, carrying out clustering processing on the local samples based on the clustering information of the samples.

The architecture of the federal learning system in this embodiment includes a plurality of local servers that participate in federal learning. In the application scenario of the present embodiment, each local server is required to have completed the sample alignment work before performing the sample clustering processing of the present embodiment. After the samples are aligned, the samples participating in federal learning in all local servers participating in federal learning have intersections. Specifically, each local server participating in federal learning includes samples of the same sample ID, but the characteristics of samples of the same sample ID in different local servers are not the same. For example, for a certain sample ID, feature A, B, C is included in the first local server; features D, E, F are included in the second local server; a feature G, H, I is included in the third local server, and so on.

SPDZ is a multiparty secure computing protocol existing prior to federal learning, and is formed by splicing the first letters of surnames of four inventors, and the basic method of SPDZ is secret sharing. On deployment, each participant of the federal learning system can directly participate as an SPDZ node, and two, three or more SPDZ safe calculation nodes can be adopted for each party to provide data for the nodes for further calculation in a secret sharing mode. Based on the above distribution characteristics of the samples of the federal learning system, on each individual local server, the partial characteristics of the samples are stored, and when unsupervised learning is realized in distributed federal learning, the local servers can complete calculation of the full-feature dimension distance information of each sample based on the local sample distance information by using an SPDZ protocol, and calculation is not needed by means of a central server, so that the problem that a trusted central server is difficult to find in the related technology is solved, and clustering processing and unsupervised learning of the samples in a plurality of local servers in the federal learning system are realized. In addition to the SPDZ protocol, three secure multiparty computing protocols ABY, ABY3, NPDZ may also be used in the embodiments of the present application to complete the full feature dimension distance information computation for each sample based on the local sample distance information.

In an alternative implementation, based on the characteristic information of each local sample, obtaining the local sample distance information of each sample may be implemented by:

the local server can identify each when calculating the local sample distance informationSpatial distance between samples in the feature dimension. For example, on the ith local server, a Euclidean distance calculation formula using the following formula (1) can be used to calculate sample X in the N-dimensional feature space ₁ And sample X ₂ Distance of (2):

wherein x is _1i Representing sample x ₁ Is the ith feature, x _2i Representing sample x ₂ The corresponding N may be the feature dimension of any local server at this time.

The local sample distance information of each local server may include distance information between any two samples local to the local server. In particular, the distance information between any two samples may be d calculated by the above formula, or some variation of d calculated based on the above formula, such as a square or other mathematical variation.

In an alternative implementation, the local sample distance information is processed based on the SPDZ protocol, and the obtaining of the full-feature dimension distance information of each sample may be implemented by the following ways:

first, each local server can locally calculate the euclidean distance between all samples of the local using the above formula (1), and can be expressed as

Where n is the number of samples, k _i For the local server P _i Number of features of the sample.

Secondly, processing the local sample distance information according to an addition operator of an SPDZ protocol to obtain full-feature dimension distance information d of each sample _n*m Where m is the sum of the number of features of the sample on all local servers.

It should be noted that, besides the SPDZ protocol, three secure multiparty computing protocols, ABY3, NPDZ, may also be used in the embodiments of the present application to complete the calculation of the full-feature dimension distance information of each sample based on the local sample distance information.

In an optional implementation, based on the full-feature dimension distance information, obtaining cluster information of each sample includes: processing the full-feature dimension distance information of each sample based on a MapReduce model to obtain the local density of each sample; and processing the local density of the samples based on a MapReduce model to obtain the following distance of each sample.

The MapReduce model is a programming model for parallel operation of large-scale datasets. The concepts Map and Reduce, and their main ideas, are all borrowed from functional programming languages, as well as from vector programming languages. The method greatly facilitates programmers to run own programs on the distributed system under the condition of not carrying out distributed parallel programming. Current software implementations specify a Map function to Map a set of key-value pairs to a new set of key-value pairs, and a concurrent Reduce function to ensure that each of all mapped key-value pairs share the same key-set. The MapReduce parallel processing framework is used for parallelizing the algorithm calculation process, so that the method is applicable to large-scale data sets, and the clustering processing efficiency of federal learning is further improved.

In an optional implementation, the processing the full feature dimension distance information of each sample based on the MapReduce model to obtain the local density of each sample includes: acquiring distance information of full feature dimensions of each sample in a Map task; setting a globally unique distance threshold value according to the distance information in a Reduce task; acquiring distance information of full feature dimensions of each sample in a Map task; and calculating the local density of each sample based on the distance information of the full feature dimension of each sample and the distance threshold value in a Reduce task.

The density peak based clustering algorithm is collectively referred to as a fast search and discovery density peak based clustering algorithm (clustering by fast search and find of density peaks, DPC). This embodiment may be implemented based on the DPC algorithm.

The setting of the distance threshold may be performed by: through the distance threshold d _c The distance from the sample in all the neighbor samples of each sample can be controlled to be less than or equal to the distance threshold d _c The number of neighbor samples of (a) accounts for 1-2% of the number of all neighbor samples. In this way, each sample can be more accurately controlled to be clustered. In the present embodiment, the distance threshold d of each sample _c Can be realized in an iterative manner. For example, for any sample, d is set first _c 0, at this time due to d _c 0 is less than d _c Is 0. Resetting d _c For maximum distance d from neighbor samples _max The method comprises the steps of carrying out a first treatment on the surface of the At this time due to d _c Is d _max Less than or equal to d _max The number of samples is 100% of the ratio. Then set d _c Is 0 and d _max Then detect that the distance in the neighbor samples is less than the d _c Whether the amount of (2) is 1-2%. If the number is more than 2%, continuing to set d _c Is 0 and current d _c Average of values; and (5) continuing detection. If less than 1%, continuing to set d _c For the current d _c And d _max And continuing to detect; and so on until a reasonable d is obtained _c The distance from the sample in all the neighbor samples of the sample can be less than or equal to the distance threshold d _c The number of neighbor samples of (a) is 1-2% of the number of all neighbor samples. The distance threshold d for obtaining each sample provided in this embodiment is as described above _c In practical applications, the distance threshold d can be set in other ways _c As long as the installation requirements can be satisfied, the present invention is not limited thereto.

In particular, to increase the diversity of local density values of samples, the reduction algorithm is subject to large statistical errors on small data sets. In the embodiment, the local density method of the solving node in the DPC algorithm is improved by introducing a power exponent kernel function and a normalization method, and the improved local density calculation method is shown in a formula (2). Because the local density of each sample calculated by the formula (2) is independent and cannot be reasonably compared with the local densities of other samples, in this embodiment, the local density obtained by the formula (2) is normalized by adopting the formula (3), so that the local densities of all samples are comparable.

The local density of each sample can be obtained according to the following formulas (2) and (3).

Where d (i, j) represents distance information of full feature dimension between sample i and sample j, ρ' _i Representing the local density of each sample calculated based on equation (2); ρ _min Representing the local density ρ 'of each sample calculated based on equation (2)' _i Minimum value ρ of _max Representing the local density ρ 'of each sample calculated based on equation (2)' _i Maximum value ρ of _i The normalized local density of sample i is shown.

In addition, the parallel calculation of the local density is carried out through the MapReduce model, so that the local density calculation efficiency of federal learning is further improved.

In an optional implementation, the processing the local density of the samples based on the MapReduce model to obtain the following distance of each sample includes: obtaining the local density of each sample in a Map task; obtaining the maximum value of the local density in a Reduce task, and adding the local density into a density list; sorting the local densities in the density list to obtain a sorted local density list; and determining the following neighbors of the samples and the following distances of the samples relative to the following neighbors according to the sequenced local densities in a Map task.

For any one sample, after determining the corresponding truncated distance, it may be determined that, of all neighbor samples for that sample, the distance is less than 1-2% of the neighbor samples for the distance threshold. Then, further, searching a sample with a local density larger than that of the sample in 1-2% of neighbor samples of the sample as an ideal target sample which can be followed by the sample. The distance between the sample and the target sample at this time may be the following distance of the sample.

In addition, parallel calculation according to the distance is performed through a MapReduce model, so that the calculation efficiency of the following distance of federal learning is further improved.

In an optional implementation, the clustering processing of the local samples based on the clustering information of the samples includes: calculating a local density threshold according to the local density; calculating a following distance threshold according to the following distance; determining a clustering center of each sample according to the local density threshold and the following distance threshold in a Map task; and completing the clustering of each sample according to the clustering center in the Reduce task.

The selection of the clustering center of the DPC algorithm needs to be manually specified according to a decision diagram, and the method cannot be well applied to complex data sets, so that a strategy for automatically selecting the clustering center according to the data is needed. Based on this, in the present disclosure, the local density threshold value and the following distance threshold value may be calculated based on the local density and the following distance of each sample, and for example, the local density threshold value and the following distance threshold value may be calculated specifically using the following formulas (4) and (5).

Wherein ρ is _c Is the local density threshold, delta _c To follow the distance threshold, λ, β are coefficients that can be adjusted, n is the number of samples.

Local density ρ due to cluster center _i And following distance delta _i Generally larger, so the set of cluster center points C can be defined as: c= { ρ _i |ρ _i ≥ρ _c ∧δ _i ≥δ _c I=0, 1,2, … n, where i is the i-th sample and n is the number of samples.

For example, based on this principle, in a specific implementation, at least one sample identifier with a local density greater than a local density threshold and a following distance greater than a following distance threshold may be obtained from the identifiers of each sample, and each sample identifier is used as a cluster center point; then, the identification of each other sample is allocated to the cluster center point which is larger than the local density of the sample and is closest to the cluster center point, so as to obtain a cluster set of the samples. By the method, the central points can be clustered reasonably and accurately; and then clustering is carried out based on the clustering center points, so that the accuracy and the clustering precision of sample clustering can be effectively improved, and the stability of sample clustering can be effectively ensured.

In addition, clustering processing according to samples is carried out through a MapReduce model, so that the efficiency of clustering federal learning samples can be further improved.

Fig. 2 is a sample cluster processing apparatus according to an embodiment of the present disclosure, as shown in fig. 2, the apparatus including: a first obtaining module 202, configured to obtain local sample distance information of each sample based on feature information of each local sample; a second obtaining module 204, configured to process the local sample distance information based on a preset protocol, and obtain full-feature dimension distance information of each sample; wherein the preset protocol includes at least one of: SPDZ, ABY, ABY3 or NPDZ; a third obtaining module 206, configured to obtain cluster information of each sample based on the full-feature dimension distance information; and the clustering module 208 is configured to perform local clustering on each sample based on the clustering information of each sample.

Fig. 3 is another sample cluster processing apparatus according to an embodiment of the disclosure, as shown in fig. 3, in an alternative implementation, the third obtaining module 206 includes: a first obtaining submodule 2062, configured to process the full-feature dimension distance information of each sample based on the MapReduce model, and obtain a local density of each sample; a second obtaining sub-module 2064, configured to process the local density of the samples based on the MapReduce model, and obtain the following distance of each sample.

Fig. 4 is a further sample cluster processing apparatus according to an embodiment of the disclosure, as shown in fig. 4, in an alternative implementation, the first obtaining submodule 2062 includes: a first obtaining subunit 20622, configured to obtain, in a Map task, distance information of a full feature dimension of each sample; a setting subunit 20624 configured to set a globally unique distance threshold in the Reduce task according to the distance information; a calculating subunit 20626 configured to calculate, in the Reduce task, a local density of each of the samples based on the distance information of the full feature dimension of each of the samples and the distance threshold. And carrying out parallel computation of local density through a MapReduce model, and further improving the local density computation efficiency of federal learning.

Fig. 5 is a further sample cluster processing apparatus according to an embodiment of the disclosure, as shown in fig. 5, in an alternative implementation, the second obtaining submodule 2064 includes: a second obtaining subunit 20642, configured to obtain, in a Map task, a local density of each of the samples; an adding subunit 20644, configured to obtain a maximum value of the local density in the Reduce task, and add the local density to a density list; a sorting subunit 20646, configured to sort the local densities in the density list, so as to obtain a sorted local density list; a determining subunit 20648, configured to determine, in a Map task, a following neighbor of each sample and a following distance of each sample relative to the following neighbor according to the sequenced local densities. Parallel calculation according to the distance is performed through a MapReduce model, and the calculation efficiency of the following distance of the federal learning is further improved.

Fig. 6 is a further sample clustering apparatus according to an embodiment of the disclosure, as shown in fig. 6, in an alternative implementation, the clustering module 208 includes: a first calculation submodule 2082 for calculating a local density threshold from the local density; a second calculation submodule 2084 for calculating a following distance threshold according to the following distance; a determining submodule 2086, configured to determine, in a Map task, a cluster center of each sample according to the local density threshold and the following distance threshold; and a clustering submodule 2088, configured to complete clustering of each sample according to the clustering center in the Reduce task. Clustering processing according to samples is carried out through a MapReduce model, so that the efficiency of federal learning sample clustering can be further improved.

The embodiment of the invention also provides an electronic device, which comprises a processor and a memory; the number of processors in the electronic device may be one or more, and the memory may be a computer-readable storage medium, which may be used to store a computer-executable program. The processor executes the software programs and instructions stored in the memory to perform the various functional applications of the electronic device and data processing, i.e., to implement the methods of any of the embodiments described above.

Embodiments of the present application also provide a storage medium containing computer-executable instructions that, when executed by a computer processor, implement the method of any of the embodiments described above.

Optionally, the processor implements a sample clustering processing method by executing an instruction, the method comprising:

s1, acquiring local sample distance information of each sample based on characteristic information of each local sample;

s2, processing the local sample distance information by a preset protocol to obtain the full-feature dimension distance information of each sample; wherein the preset protocol includes at least one of: SPDZ, ABY, ABY3 or NPDZ;

s3, acquiring clustering information of each sample based on the full-feature dimension distance information;

s4, based on the clustering information of the samples, carrying out local clustering processing on the samples.

The foregoing is merely exemplary embodiments of the present application and is not intended to limit the scope of the present application.

In general, the various embodiments of the application may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. For example, some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the application is not limited thereto.

Embodiments of the present application may be implemented by a data processor of a mobile device executing computer program instructions, e.g. in a processor entity, either in hardware, or in a combination of software and hardware. The computer program instructions may be assembly instructions, instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages.

The block diagrams of any logic flow in the figures of this application may represent program steps, or may represent interconnected logic circuits, modules, and functions, or may represent a combination of program steps and logic circuits, modules, and functions. The computer program may be stored on a memory. The memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as, but not limited to, read Only Memory (ROM), random Access Memory (RAM), optical storage devices and systems (digital versatile disk DVD or CD optical disk), etc. The computer readable medium may include a non-transitory storage medium. The data processor may be of any type suitable to the local technical environment, such as, but not limited to, general purpose computers, special purpose computers, microprocessors, digital Signal Processors (DSPs), application Specific Integrated Circuits (ASICs), programmable logic devices (FGPAs), and processors based on a multi-core processor architecture.

By way of exemplary and non-limiting example, a detailed description of exemplary embodiments of the present application has been provided above. Various modifications and adaptations to the above embodiments may become apparent to those skilled in the art without departing from the scope of the invention, which is defined in the accompanying drawings and claims. Accordingly, the proper scope of the invention is to be determined according to the claims.

Claims

1. A sample clustering method, comprising:

2. The method of claim 1, wherein the obtaining cluster information for each of the samples based on the full feature dimension distance information comprises:

processing the full-feature dimension distance information of each sample based on a MapReduce model to obtain the local density of each sample;

and processing the local density of the samples based on a MapReduce model to obtain the following distance of each sample.

3. The method of claim 2, wherein the processing the full feature dimension distance information for each of the samples based on the MapReduce model to obtain the local density for each of the samples comprises:

acquiring distance information of full feature dimensions of each sample in a Map task;

and setting a distance threshold value according to the distance information in a Reduce task, and calculating the local density of each sample based on the distance information of the full feature dimension of each sample and the distance threshold value.

4. The method of claim 2, wherein after calculating the local density of each of the samples based on the distance information of the full feature dimension of each of the samples and the cutoff threshold of each of the samples, further comprising:

and carrying out normalization processing on the local density of each sample.

5. The method of any of claims 2-4, wherein the processing the local densities of the samples based on a MapReduce model to obtain a following distance for each of the samples comprises:

obtaining the local density of each sample in a Map task;

obtaining the maximum value of the local density in a Reduce task, and adding the local density into a density list;

sorting the local densities in the density list;

and determining the following neighbors of the samples and the following distances of the samples relative to the following neighbors according to the sequenced local densities in a Map task.

6. The method of claim 5, wherein clustering each local sample based on the clustering information of each sample comprises:

calculating a local density threshold according to the local density;

calculating a following distance threshold according to the following distance;

determining a clustering center of each sample according to the local density threshold and the following distance threshold in a Map task;

and completing the clustering of each sample according to the clustering center in the Reduce task.

7. A sample cluster processing apparatus, comprising:

8. The apparatus of claim 7, wherein the third acquisition module comprises:

the first acquisition submodule is used for processing the full-feature dimension distance information of each sample based on the MapReduce model to acquire the local density of each sample;

and the second acquisition submodule is used for processing the local density of the samples based on the MapReduce model and acquiring the following distance of each sample.

9. A computer readable storage medium having stored therein computer instructions executable by a computer to implement the method of any of claims 1-6.

10. An electronic device comprising a processor and a memory, wherein the memory stores instructions executable by the processor to implement the method of any of claims 1-6.