CN114840873A

CN114840873A - Symbolic regression method based on federated genetic programming

Info

Publication number: CN114840873A
Application number: CN202210366425.9A
Authority: CN
Inventors: 钟竞辉; 董俊兰; 陈伟能
Original assignee: South China University of Technology SCUT
Current assignee: South China University of Technology SCUT
Priority date: 2022-04-08
Filing date: 2022-04-08
Publication date: 2022-08-02

Abstract

The invention discloses a symbolic regression method based on federated genetic programming. The method comprises the following steps: creating multiple threads, determining the number of clients accessing a server, and ensuring that the clients connect to the server successfully; randomly initializing a population; having the clients compute the population fitness in parallel and checking whether the fitness value meets the termination condition, exiting if so and otherwise proceeding to fitness aggregation on the server; aggregating fitness according to a mean shift aggregation mechanism to obtain an aggregated fitness value F; selecting individuals according to the aggregated fitness value F, taking the selected individuals as parents, and breeding the next generation of program individuals through genetic operators; randomly changing genes to new values with a certain probability; and crossing each gene with the mutation vector to generate a new population, then returning to the termination check. The symbolic regression method makes full use of the available data and achieves better results than traditional genetic programming algorithms.

Description

Symbolic regression method based on federated genetic programming

Technical Field

The invention relates to the technical field of intelligent computing and high-performance computing, and in particular to a symbolic regression method based on federated genetic programming.

Background

With the spread of intelligent technology, edge devices such as smartphones, smart computers, and smart appliances have become indispensable in daily life. Data are stored across these devices in a distributed manner; gathering them on a central server would create serious security risks during transmission and incur heavy communication overhead. Cyberspace security now has a great influence on individuals and on entire countries, so a key question in current research is how to design a machine learning framework that uses the data on edge devices while protecting data privacy and security.

Deep learning models have poor interpretability and demanding hardware requirements, so more researchers have turned to interpretable machine learning, and symbolic regression has become a hot topic. Genetic programming (GP) is currently the dominant method for solving symbolic regression problems. Genetic programming builds on the genetic algorithm: it optimizes chromosomes that represent programs with a nonlinear tree structure and evaluates the programs they encode. At present, genetic programming is widely applied in fields such as pattern recognition, image analysis, and symbolic regression.

However, existing genetic programming algorithms have the following disadvantages. On the one hand, none of the current techniques considers data privacy and data security from the perspective of the data. On the other hand, current genetic programming searches are driven purely by the prediction error observed on training samples, which does not provide sufficient guidance toward the desired model when the samples do not adequately cover the input space (see "Large-scale symbolic regression method and system based on adaptive parallel genetic algorithm").

Disclosure of Invention

From the viewpoint of protecting data privacy and security, the invention addresses a technical problem that distributed genetic programming has not considered. The invention provides a symbolic regression method based on federated genetic programming that can train a global model without centralizing the data. Each client processes its local data in parallel and never sends raw data to the server. The method thus protects the privacy and security of the data and also reduces data collection time. In addition, a mean shift aggregation mechanism is proposed to aggregate local fitness values. To account for the relative importance of the samples, the mechanism incorporates weights into the fitness function with the aim of improving symbolic regression on real data.

The purpose of the invention is realized by at least one of the following technical solutions.

A symbolic regression method based on federated genetic programming, comprising the following steps:

S1: initialization: creating multiple threads, determining the number of clients accessing a server, and ensuring that the clients connect to the server successfully; randomly initializing a population of size NP;

S2: client fitness calculation: the clients calculate the population fitness in parallel and check whether the fitness value meets the termination condition; if so, the method exits, otherwise step S3 is executed;

S3: server fitness aggregation: aggregating fitness according to a mean shift aggregation mechanism to obtain the aggregated population fitness F;

S4: gene selection: selecting individuals according to the aggregated population fitness F, taking the selected individuals as parents, and breeding the next generation of program individuals through genetic operators;

S5: gene mutation: genes are randomly changed to new values with a certain probability;

S6: gene crossover: each gene is crossed with the mutation vector to generate a new population, and the process returns to step S2.

Further, in step S1, a symbolic regression system is constructed that comprises a plurality of clients and a central server. The server sends the population to the clients; each client calculates fitness on its own data and returns the fitness values to the server. Neither side transmits raw data, which solves the problem that data cannot be shared in a privacy-sensitive environment.
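The exchange between server and clients can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `Client` class, the `gather_fitness` helper, and the toy `mean_abs_error` fitness are hypothetical names, and in-process threads stand in for networked clients. The only point is that fitness values, never local data, travel back to the server.

```python
from concurrent.futures import ThreadPoolExecutor


class Client:
    """Hypothetical client that keeps its data private."""

    def __init__(self, local_data):
        self._local_data = local_data  # stays on the client

    def evaluate(self, population, fitness_fn):
        # Only the list of fitness values is returned to the caller.
        return [fitness_fn(ch, self._local_data) for ch in population]


def gather_fitness(clients, population, fitness_fn):
    """Server side: send the population out and collect f[k][i] in parallel."""
    with ThreadPoolExecutor(max_workers=len(clients)) as pool:
        futures = [pool.submit(c.evaluate, population, fitness_fn) for c in clients]
        return [fut.result() for fut in futures]


def mean_abs_error(chromosome, targets):
    """Toy fitness: the 'chromosome' is a constant prediction (assumption)."""
    return sum(abs(chromosome - y) for y in targets) / len(targets)


clients = [Client([1.0, 3.0]), Client([2.0])]
population = [2.0, 0.0]
f = gather_fitness(clients, population, mean_abs_error)  # f[k][i]
```

Here `f[k][i]` holds the fitness of chromosome i as computed by client k, which is the input the aggregation step on the server needs.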

Further, in step S1, the server and the plurality of clients are started; the server monitors in real time whether a client is requesting access or a connected client needs to disconnect, and responds immediately when a new client requests access; after all clients have connected successfully, the population is initialized on the server; the server confirms the IP and port of each connected client and then sends the initial population to all clients;

population initialization on the server means generating NP random chromosomes to form the initial population, represented as follows:

$$X = \{X_i \mid X_i = [x_{i,1}, x_{i,2}, \ldots, x_{i,L}],\ i = 1, 2, \ldots, NP\} \qquad (1)$$

where $X_i$ is a vector representing the i-th chromosome, i is the index of the chromosome in the population, $x_{i,j}$ is the j-th element of the i-th chromosome $X_i$, L is the length of the chromosome, and NP is the population size; each chromosome comprises a main program and a plurality of sub-functions, and the main program and the sub-functions are each composed of head and tail gene expressions;

on the client, the IP address and port number of the server are confirmed before startup; once connected to the server, the client waits for the server to send a population for fitness calculation.
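Equation (1) can be illustrated with a short Python sketch. It is a simplification under stated assumptions: the symbol alphabet `SYMBOLS` and the function name `init_population` are hypothetical, and each chromosome is a flat list of L genes, without the main-program and sub-function structure of the actual encoding.

```python
import random

# Hypothetical gene alphabet: a few operators and terminals.
SYMBOLS = ["+", "-", "*", "/", "x", "1"]


def init_population(np_size, length, rng):
    """Return NP random chromosomes, each a list of `length` genes."""
    return [[rng.choice(SYMBOLS) for _ in range(length)] for _ in range(np_size)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
population = init_population(np_size=30, length=8, rng=rng)
```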

Further, in step S2, after the client receives the population, each chromosome in the population is decoded into an expression whose length equals that of the chromosome; the data sets of all clients are represented as follows:

$$D = \{D_1, \ldots, D_k, \ldots, D_K\} \qquad (2)$$

where $D_k$ denotes the data of the k-th client connected to the server, k = 1, ..., K, and K is the number of clients connected to the server; the overall population fitness f is obtained by decoding and evaluating the chromosomes and is expressed as follows:

where NP denotes the population size, $f_k(X_i)$ denotes the fitness value of the i-th chromosome in the population calculated at the k-th client, and i = 1, ..., NP.

Further, in step S3, a mean shift aggregation mechanism is adopted in which, for each chromosome, the fitness values from the clients are aggregated according to the importance of each client; the mean shift aggregation algorithm is as follows:

S3.1: initialize the aggregated population fitness F to 0 and select a random center point x;

S3.2: input the kernel bandwidth h, the aggregation termination distance $s_d$, the overall population fitness f, and the client weights $W = [w_1, w_2, \ldots, w_K]$;

S3.3: calculate the distance from every point of the overall population fitness f to the center point x, and collect all points within the kernel bandwidth h into a set M;

S3.4: calculate the vector from the center point x to each point in the set M and sum these vectors to obtain $M_h(x)$;

S3.5: move the center point x along $M_h(x)$, so that $x' = x + \|M_h(x)\|$;

S3.6: repeat steps S3.3 to S3.5 until $\|M_h(x)\| < s_d$, then execute step S3.7;

S3.7: output the aggregated population fitness F;

the kernel bandwidth h in the mean shift aggregation algorithm is an important parameter of the Gaussian kernel function, and different values yield different aggregation effects; the client weight W is calculated as each client's share of the total data volume across all clients.

Further, $M_h(x)$ is calculated as follows:

where $x_i$ denotes the i-th chromosome in the population, $w_k$ denotes the weight of the k-th client, and

denotes a Gaussian kernel function.
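Steps S3.1 to S3.7 can be sketched in Python as below. Because the formula for $M_h(x)$ is not reproduced in this text, the sketch assumes the standard weighted mean shift with a Gaussian kernel, applied in one dimension to the K per-client fitness values of a single chromosome, and it moves the center by the signed shift; these are assumptions, not the patent's exact formula. The names `client_weights` and `mean_shift_aggregate` are hypothetical, and the defaults h = 3 and s_d = 0.5 are the values used in the examples.

```python
import math
import random


def client_weights(sample_counts):
    """Weight of each client = its share of the total data volume."""
    total = sum(sample_counts)
    return [n / total for n in sample_counts]


def mean_shift_aggregate(values, weights, h=3.0, s_d=0.5, max_iter=100, rng=None):
    """Aggregate one chromosome's per-client fitness values into one number."""
    rng = rng or random.Random(0)
    x = rng.choice(values)  # S3.1: random starting centre
    for _ in range(max_iter):
        # S3.3: keep the points that lie within bandwidth h of the centre
        near = [(v, w) for v, w in zip(values, weights) if abs(v - x) <= h]
        # S3.4: weighted, Gaussian-smoothed shift (assumed form of M_h)
        kern = [w * math.exp(-0.5 * ((v - x) / h) ** 2) for v, w in near]
        den = sum(kern)
        if den == 0:
            break
        shift = sum(k * (v - x) for k, (v, _) in zip(kern, near)) / den
        x += shift  # S3.5: move the centre
        if abs(shift) < s_d:  # S3.6: stop once the shift is small
            break
    return x  # S3.7: aggregated fitness for this chromosome


W = client_weights([60, 30, 10])
f = [[0.9, 4.0], [1.1, 4.2], [1.0, 3.8]]  # f[k][i]: client k, chromosome i
F = [mean_shift_aggregate([row[i] for row in f], W) for i in range(2)]
```

Running the aggregation once per chromosome yields one value $f_c(X_i)$ per individual, so `F` has the same length as the population.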

Further, in step S4, based on the aggregated population fitness $F = \{f_c(X_1), \ldots, f_c(X_i), \ldots, f_c(X_{NP})\}$ obtained in step S3, offspring are selected to replace parent chromosomes and form a new population, as follows:

where $f(U_i)$ denotes the fitness of the parent chromosome $U_i$, the parent chromosomes being those from the previous round of training, and $f_c(X_i)$ denotes the aggregated fitness value of the i-th chromosome $X_i$.
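The selection rule itself is not reproduced in this text. A plausible reading, sketched below in Python, is the greedy one-to-one replacement used in differential evolution: slot by slot, the offspring $X_i$ replaces the parent $U_i$ only if its aggregated fitness is lower (lower RMSE being better). Both that reading and the function name `select_survivors` are assumptions.

```python
def select_survivors(parents, parent_fit, offspring, offspring_fit):
    """Keep, slot by slot, whichever of parent and offspring has lower fitness."""
    population, fitness = [], []
    for u, f_u, x, f_x in zip(parents, parent_fit, offspring, offspring_fit):
        if f_x < f_u:  # offspring is strictly better: it replaces the parent
            population.append(x)
            fitness.append(f_x)
        else:          # otherwise the parent survives
            population.append(u)
            fitness.append(f_u)
    return population, fitness


parents = ["U1", "U2", "U3"]
offspring = ["X1", "X2", "X3"]
new_pop, new_fit = select_survivors(parents, [0.5, 0.2, 0.9], offspring, [0.4, 0.3, 0.9])
```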

Further, in step S5, based on the traditional differential evolution (DE) mutation scheme "DE/current-to-best/1", the genes in the chromosome are randomly changed to new values with a certain probability, as follows:

$$Y_i = X_i + \beta (X_{best} - X_i) + \beta (X_{r1} - X_{r2}) \qquad (5)$$

where $Y_i$ denotes the mutation vector of the i-th chromosome $X_i$ in the population, $X_{best}$ is the best individual in the population, $X_{r1}$, $X_{r2}$ and $X_i$ are three different individuals, with $X_{r1}$ and $X_{r2}$ randomly selected from the population; β is a scaling factor with value rand(0, 1).
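Equation (5) can be illustrated with the following Python sketch. It treats chromosomes as real-valued vectors, which is a simplifying assumption (the chromosomes here encode symbols), and it takes the individual with the lowest fitness as $X_{best}$. The function name `mutate_current_to_best` is hypothetical.

```python
import random


def mutate_current_to_best(population, fitness, i, rng):
    """Build Y_i = X_i + beta*(X_best - X_i) + beta*(X_r1 - X_r2)."""
    best = min(range(len(population)), key=lambda j: fitness[j])  # lowest RMSE
    # r1 and r2 are distinct from each other and from i
    r1, r2 = rng.sample([j for j in range(len(population)) if j != i], 2)
    beta = rng.random()  # scaling factor drawn from [0, 1)
    x_i, x_best = population[i], population[best]
    x_r1, x_r2 = population[r1], population[r2]
    return [
        a + beta * (b - a) + beta * (c - d)
        for a, b, c, d in zip(x_i, x_best, x_r1, x_r2)
    ]


rng = random.Random(1)
population = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.5], [4.0, 3.0]]
fitness = [3.0, 0.2, 1.5, 2.4]
y0 = mutate_current_to_best(population, fitness, 0, rng)
```

The same β is used for both difference terms, as in equation (5).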

Further, in step S6, each element of the i-th chromosome $X_i$ in the population is crossed with the corresponding element of the mutation vector $Y_i$ to create a new trial vector $Z_i$, so that the population can search for better solutions in the solution space, as follows:

where $z_{i,j}$, $y_{i,j}$ and $x_{i,j}$ denote the j-th elements of the trial vector $Z_i$, the mutation vector $Y_i$, and the chromosome $X_i$ respectively; CR denotes the crossover probability and takes the value rand(0, 1); l is a random integer between 1 and L, where L is the chromosome length;

after the crossover operation is completed, a new population is generated; the new population is sent to the clients and the process returns to step S2.
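The crossover formula is not reproduced in this text. Given the symbols that are defined (CR and a random index l), a reasonable assumption is the standard binomial crossover of differential evolution, sketched below in Python: element j of the trial vector is taken from $Y_i$ with probability CR, or unconditionally when j equals l, and from $X_i$ otherwise. The function name `crossover` is hypothetical, and the index is 0-based here rather than 1 to L.

```python
import random


def crossover(x, y, cr, rng):
    """Binomial crossover: take y[j] with probability cr, else keep x[j]."""
    forced = rng.randrange(len(x))  # index l: always inherits from y
    return [
        y_j if (rng.random() < cr or j == forced) else x_j
        for j, (x_j, y_j) in enumerate(zip(x, y))
    ]


rng = random.Random(0)
x = [0, 0, 0, 0, 0]
y = [1, 1, 1, 1, 1]
z = crossover(x, y, cr=0.5, rng=rng)
```

The forced index guarantees that the trial vector differs from $X_i$ in at least one position whenever $Y_i$ does.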

Further, in step S2, the fitness is calculated as the root mean square error (RMSE) used in symbolic regression; an RMSE threshold is given, and when the population fitness f falls below the threshold, the termination condition is reached and the symbolic regression is complete.
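The fitness measure and the termination test can be sketched in Python as follows. The helper names `rmse` and `reached_termination` are hypothetical, the candidate expression is passed as a plain Python callable rather than a decoded chromosome, and comparing the best individual against the threshold is an assumption about how the condition is applied. The default threshold of 1e-4 is the value used in the examples.

```python
import math


def rmse(predict, samples):
    """Root mean square error of a candidate expression on (x, y) samples."""
    sq_err = [(predict(x) - y) ** 2 for x, y in samples]
    return math.sqrt(sum(sq_err) / len(sq_err))


def reached_termination(population_fitness, threshold=1e-4):
    """True once the best (lowest) RMSE in the population is below the threshold."""
    return min(population_fitness) < threshold


samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]  # drawn from y = 1 + 2x
exact = rmse(lambda x: 1 + 2 * x, samples)      # perfect fit
biased = rmse(lambda x: 2 * x, samples)         # off by 1 everywhere
```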

Compared with the prior art, the invention has the advantages that:

(1) Compared with existing distributed GP techniques, the invention protects data privacy and security by training a global model through federated learning. Clients are also free to join or leave the system at any time, which better matches real application scenarios.

(2) By adopting a mean-shift-based aggregation method, the invention further improves the search performance of the genetic programming algorithm; it also assigns different weights to reflect the differing importance of data samples, and thereby effectively addresses symbolic regression in real environments.

(3) The symbolic regression method makes full use of the available data and achieves better results than traditional genetic programming algorithms.

Drawings

FIG. 1 is an algorithm block diagram of the symbolic regression method based on federated genetic programming according to an embodiment of the present invention;

FIG. 2 is a schematic illustration of chromosome coding in an embodiment of the present invention;

FIG. 3 is a schematic diagram of the symbolic regression problem to be solved in an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in detail below with reference to the accompanying drawings.

Example 1:

The main purpose of this work is to solve symbolic regression problems in which the data are scattered across different local machines and may not be transmitted to a central server. In addition, the data distribution of each client does not cover the whole sample space, and the amount of data differs from client to client. When clients train models independently, each client may obtain different function expressions that are far from the target function. As shown in FIG. 3, the invention provides a joint training mode that uses the data of multiple clients together in training, so that the desired target function is finally obtained.

A symbolic regression method based on federated genetic programming, as shown in FIG. 1, comprising the following steps:

A symbolic regression system is constructed that comprises a plurality of clients and a central server. The server sends the population to the clients; each client calculates fitness on its own data and returns the fitness values to the server. The data exchanged between the clients and the server are not raw data, which solves the problem that data cannot be shared in a privacy-sensitive environment.

The server and the plurality of clients are started; the server monitors in real time whether a client is requesting access or a connected client needs to disconnect, and responds immediately when a new client requests access; after all clients have connected successfully, the population is initialized on the server; the server confirms the IP and port of each connected client and then sends the initial population to all clients;

$$X = \{X_i \mid X_i = [x_{i,1}, x_{i,2}, \ldots, x_{i,L}],\ i = 1, 2, \ldots, NP\} \qquad (1)$$

where $X_i$ is a vector representing the i-th chromosome, i is the index of the chromosome in the population, $x_{i,j}$ is the j-th element of the i-th chromosome $X_i$, L is the length of the chromosome, and NP is the population size; each chromosome comprises a main program and a plurality of sub-functions, and the main program and the sub-functions are each composed of head and tail gene expressions, as shown in FIG. 2;

after the client receives the population, each chromosome in the population is decoded into an expression whose length equals that of the chromosome; the data sets of all clients are represented as follows:

$$D = \{D_1, \ldots, D_k, \ldots, D_K\} \qquad (2)$$

The fitness is calculated as the root mean square error (RMSE) used in symbolic regression; an RMSE threshold is given, and when the population fitness f falls below the threshold, the termination condition is reached and the symbolic regression is complete.

A mean shift aggregation mechanism is adopted in which, for each chromosome, the fitness values from the clients are aggregated according to the importance of each client; the mean shift aggregation algorithm is as follows:

S3.5: move the center point x along $M_h(x)$, so that $x' = x + \|M_h(x)\|$;

S3.6: repeat steps S3.3 to S3.5 until $\|M_h(x)\| < s_d$, then execute step S3.7;

S3.7: output the aggregated population fitness F;

the kernel bandwidth h in the mean shift aggregation algorithm is an important parameter of the Gaussian kernel function, and different values yield different aggregation effects; the client weight W is calculated as each client's share of the total data volume across all clients.

$M_h(x)$ is calculated as follows:

denotes a Gaussian kernel function.

S4: gene selection:

based on the aggregated population fitness $F = \{f_c(X_1), \ldots, f_c(X_i), \ldots, f_c(X_{NP})\}$ obtained in step S3, offspring are selected to replace parent chromosomes and form a new population, as follows:

S5: gene mutation:

based on the traditional differential evolution (DE) mutation scheme "DE/current-to-best/1", the genes in the chromosome are randomly changed to new values with a certain probability, as follows:

$$Y_i = X_i + \beta (X_{best} - X_i) + \beta (X_{r1} - X_{r2}) \qquad (5)$$

S6: gene crossover:

each element of the i-th chromosome $X_i$ in the population is crossed with the corresponding element of the mutation vector $Y_i$ to create a new trial vector $Z_i$, so that the population can search for better solutions in the solution space, as follows:

In this example, to verify the performance of the algorithm framework of the invention, verification was first performed on 5 custom synthetic benchmark datasets. The parameters of the algorithm are set as follows: population size NP = 30, maximum number of iterations R = 20000, $s_d = 0.5$, kernel bandwidth h = 3, and termination condition RMSE < $10^{-4}$.

Example 2:

In this example, to further verify the effectiveness of the invention, verification was performed on 5 noisy datasets. The parameters of the algorithm are set as follows: population size NP = 50, maximum number of iterations R = 20000, $s_d = 0.5$, kernel bandwidth h = 3, and termination condition RMSE < $10^{-4}$.

Example 3:

In this example, the invention is finally verified on 2 real-world datasets. The parameters of the algorithm are set as follows: population size NP = 50, maximum number of iterations R = 20000, $s_d = 0.5$, kernel bandwidth h = 3, and termination condition RMSE < $10^{-4}$.

The final results of the three examples show that the method is clearly superior to existing genetic programming algorithms in RMSE and convergence speed on datasets from different environments. The method protects data while improving the search capability of the genetic programming algorithm.

Claims

1. A symbolic regression method based on federated genetic programming, characterized by comprising the following steps:

S4: gene selection: selecting individuals according to the aggregated population fitness F, taking the selected individuals as parents, and breeding the next generation of program individuals;

2. The symbolic regression method based on federated genetic programming according to claim 1, wherein in step S1, a symbolic regression system is constructed that comprises a plurality of clients and a central server; the server sends the population to the clients, each client calculates fitness on its own data and returns the fitness to the server, and neither side transmits raw data, which solves the problem that data cannot be shared in a privacy-sensitive environment.

3. The symbolic regression method based on federated genetic programming according to claim 2, wherein in step S1, the server and the plurality of clients are started; the server monitors in real time whether a client is requesting access or a connected client needs to disconnect, and responds immediately when a new client requests access; after all clients have connected successfully, the population is initialized on the server; the server confirms the IP and port of each connected client and then sends the initial population to all clients;

$$X = \{X_i \mid X_i = [x_{i,1}, x_{i,2}, \ldots, x_{i,L}],\ i = 1, 2, \ldots, NP\} \qquad (1)$$

on the client, the IP address and port number of the server are confirmed before startup; once connected to the server, the client waits for the server to send a population for fitness calculation.

4. The symbolic regression method based on federated genetic programming according to claim 1, wherein in step S2, after the client receives the population, each chromosome in the population is decoded into an expression whose length equals that of the chromosome; the data sets of all clients are represented as follows:

$$D = \{D_1, \ldots, D_k, \ldots, D_K\} \qquad (2)$$

5. The symbolic regression method according to claim 1, wherein in step S3, a mean shift aggregation mechanism is adopted in which, for each chromosome, the fitness values from the clients are aggregated according to the importance of each client, and the mean shift aggregation algorithm is as follows:

S3.2: input the kernel bandwidth h, the aggregation termination distance $s_d$, the overall population fitness f, and the client weights $W = [w_1, w_2, \ldots, w_K]$;

S3.5: move the center point x along $M_h(x)$, so that $x' = x + \|M_h(x)\|$;

S3.7: output the aggregated population fitness F;

6. The symbolic regression method based on federated genetic programming according to claim 5, wherein $M_h(x)$ is calculated as follows:

where $x_i$ denotes the i-th chromosome in the population, $w_k$ denotes the weight of the k-th client, and

denotes a Gaussian kernel function.

7. The symbolic regression method based on federated genetic programming according to claim 1, wherein in step S4, based on the aggregated population fitness $F = \{f_c(X_1), \ldots, f_c(X_i), \ldots, f_c(X_{NP})\}$ obtained in step S3, offspring are selected to replace parent chromosomes and form a new population, as follows:

8. The symbolic regression method based on federated genetic programming according to claim 1, wherein in step S5, based on the traditional differential evolution (DE) mutation scheme "DE/current-to-best/1", the genes in the chromosome are randomly changed to new values with a certain probability, as follows:

$$Y_i = X_i + \beta (X_{best} - X_i) + \beta (X_{r1} - X_{r2}) \qquad (5)$$

9. The symbolic regression method based on federated genetic programming according to claim 1, wherein in step S6, each element of the i-th chromosome $X_i$ in the population is crossed with the corresponding element of the mutation vector $Y_i$ to create a new trial vector $Z_i$, so that the population can search for better solutions in the solution space, as follows:

10. The symbolic regression method based on federated genetic programming according to any one of claims 1 to 9, wherein in step S2, the fitness is calculated as the root mean square error (RMSE) used in symbolic regression; an RMSE threshold is given, and when the population fitness f falls below the threshold, the termination condition is reached and the symbolic regression is complete.