CN114840873A - Symbolic regression method based on federated genetic programming - Google Patents

Info

Publication number
CN114840873A
Authority
CN
China
Prior art keywords
population
fitness
server
chromosome
client
Prior art date
Legal status
Pending
Application number
CN202210366425.9A
Other languages
Chinese (zh)
Inventor
钟竞辉
董俊兰
陈伟能
Current Assignee
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Application filed by South China University of Technology SCUT
Priority to CN202210366425.9A
Publication of CN114840873A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/12 Computing arrangements based on biological models using genetic models
    • G06N3/126 Evolutionary algorithms, e.g. genetic algorithms or genetic programming
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Bioethics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Mathematical Physics (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Physiology (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a symbolic regression method based on federated genetic programming. The method comprises the following steps: creating multiple threads, determining the number of clients accessing a server, and ensuring that the clients connect to the server successfully; randomly initializing a population; having the clients compute the population fitness in parallel and checking whether the fitness value meets the termination condition, exiting if so and otherwise performing fitness aggregation on the server; aggregating the fitness with a mean shift aggregation mechanism to obtain the aggregated fitness value F; selecting individuals according to F and using them as parents to breed the next generation of program individuals through genetic operators; randomly changing genes to new values with a certain probability; and crossing each gene with the mutation vector to generate a new population, then returning to the termination check. The method makes full use of the data and performs better than traditional genetic programming algorithms.

Description

Symbolic regression method based on federated genetic programming
Technical Field
The invention relates to the technical fields of intelligent computing and high-performance computing, and in particular to a symbolic regression method based on federated genetic programming.
Background
As intelligent technology spreads, edge devices such as smart phones, smart computers and smart appliances have become an indispensable part of daily life. Data are stored in a dispersed manner across these devices. Collecting the data centrally on a server creates serious security risks during transmission and incurs a large communication overhead. Cyberspace security now has a great influence on individuals and even on whole countries, so a key research question is how to design a machine learning framework that uses edge-device data while protecting data privacy and security.
Because deep learning models have poor interpretability and high hardware requirements, more researchers have turned to interpretable machine learning in recent years, and symbolic regression has become a hot topic. Genetic Programming (GP) is currently the dominant approach to symbolic regression problems. Genetic programming extends the genetic algorithm: it evolves chromosomes that encode programs with a nonlinear tree structure, decoding them as it optimizes them. Genetic programming is now widely applied in pattern recognition, image analysis, symbolic regression and other fields.
However, existing genetic programming algorithms have the following disadvantages. First, none of the current techniques considers data privacy and data security from the perspective of the data. Second, current genetic programming searches are driven purely by the prediction errors observed on training samples, which do not give sufficient guidance toward the desired model when the samples do not cover the input space adequately (see "large-scale symbolic regression method and system based on adaptive parallel genetic algorithm").
Disclosure of Invention
The invention addresses data privacy and security, a technical problem that distributed genetic programming has not considered. It provides a symbolic regression method based on federated genetic programming that can train a global model without centralizing the data. Each client processes its local data in parallel without sending the raw data to the server. The method both protects the privacy and security of the data and reduces data acquisition time. In addition, a mean shift aggregation mechanism is proposed to aggregate the local fitness values. Taking the relative importance of the samples into account, the mechanism explores whether incorporating weights into the fitness function can improve symbolic regression on real data.
The purpose of the invention is realized by at least one of the following technical solutions.
A symbolic regression method based on federated genetic programming, comprising the following steps:
S1: Initialization: create multiple threads, determine the number of clients accessing the server, and ensure that the clients connect to the server successfully; randomly initialize a population of size NP;
S2: Client fitness calculation: the clients compute the population fitness in parallel and check whether the fitness value meets the termination condition; if so, exit, otherwise execute step S3;
S3: Server fitness aggregation: aggregate the fitness with the mean shift aggregation mechanism to obtain the aggregated population fitness F;
S4: Gene selection: select individuals according to the aggregated population fitness F and use them as parents to breed the next generation of program individuals through genetic operators;
S5: Gene mutation: genes are randomly changed to new values with a certain probability;
S6: Gene crossover: each gene is crossed with the mutation vector to generate a new population, and the process returns to step S2.
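Steps S1 to S6 amount to a server-side loop. The following Python sketch illustrates only that control flow and is not the patented implementation. The names `federated_gp_loop`, `client_fitness_fns`, `aggregate` and `evolve` are hypothetical, clients are simulated as in-process callables rather than networked machines, and the termination test (some chromosome is below the threshold on every client) is one assumed reading of the termination condition.

```python
def federated_gp_loop(population, client_fitness_fns, aggregate, evolve,
                      threshold, max_rounds):
    """Run the S2-S6 loop and return (population, rounds_used).

    Every callable is supplied by the caller; nothing here is prescribed
    by the source text beyond the order of the steps.
    """
    for round_no in range(max_rounds):
        # S2: each client scores the whole population on its own data.
        # Only fitness values, never raw data, reach the server.
        per_client = [[fn(ch) for ch in population] for fn in client_fitness_fns]
        # Regroup so that columns[i] holds all client scores of chromosome i.
        columns = [[row[i] for row in per_client] for i in range(len(population))]
        # Assumed termination test: one chromosome is good enough everywhere.
        if any(max(col) < threshold for col in columns):
            return population, round_no
        # S3: the server aggregates the client scores per chromosome.
        aggregated = [aggregate(col) for col in columns]
        # S4-S6: selection, mutation and crossover produce the next population.
        population = evolve(population, aggregated)
    return population, max_rounds
```

Passing the aggregation and evolution steps in as callables keeps the loop independent of how chromosomes are encoded.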
Further, in step S1, a symbolic regression system is constructed. The system comprises multiple clients and a central server. The server sends the population to the clients; each client computes the fitness on its own data and returns the fitness to the server. Neither side transmits raw data, which solves the problem that data cannot be shared in a privacy-sensitive environment.
Further, in step S1, the server and the clients are started. The server monitors in real time whether a client is requesting access or a connected client needs to disconnect, and responds immediately when a new client requests access. After all clients have connected successfully, the population is initialized on the server; the server confirms the IP address and port of each connected client and then sends the initial population to all clients;
population initialization on the server means generating NP random chromosomes to form the initial population, expressed as follows:
$X = \{X_i \mid X_i = [x_{i,1}, x_{i,2}, \ldots, x_{i,L}],\ i = 1, 2, \ldots, NP\}$ (1)
where $X_i$ is a vector representing the i-th chromosome, i is the index of the chromosome in the population, $x_{i,j}$ is the j-th gene of the i-th chromosome $X_i$, L is the chromosome length, and NP is the population size; each chromosome comprises a main program and several sub-functions, both of which are composed of head and tail gene expressions;
on the client, the IP address and port number of the server are confirmed before startup; once connected, the client waits for the server to send a population for fitness calculation.
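Equation (1) describes the initial population as NP random chromosomes of length L. The sketch below shows one simple way to generate such a population in Python. The function name `init_population` and the flat `symbols` list are assumptions made for illustration; the head and tail structure of the main program and sub-functions is not modelled here.

```python
import random


def init_population(np_size, length, symbols, rng=None):
    """Return np_size random chromosomes, each a list of `length` genes.

    `symbols` is a hypothetical flat gene alphabet; the source text does
    not specify how genes are drawn, so uniform sampling is an assumption.
    """
    rng = rng or random.Random()
    return [[rng.choice(symbols) for _ in range(length)]
            for _ in range(np_size)]


# Example: 30 chromosomes of 8 genes over a toy alphabet.
population = init_population(30, 8, ["+", "-", "*", "/", "x", "1"],
                             rng=random.Random(0))
```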
Further, in step S2, after a client receives the population, each chromosome in the population is decoded into an expression with the same length as the chromosome. Assume the data sets of all clients are represented as follows:
$D = \{D_1, \ldots, D_k, \ldots, D_K\}$ (2)
where $D_k$ denotes the data of the k-th client connected to the server, k = 1, ..., K, and K is the number of clients connected to the server. The overall population fitness f, obtained by decoding and evaluating the chromosomes, is expressed as follows:
[Equation (3) appears as an image in the original publication and is not reproduced in this text.]
where NP denotes the population size and $f_k(X_i)$ denotes the fitness value of the i-th chromosome in the population computed at the k-th client, i = 1, ..., NP.
Further, in step S3, a mean shift aggregation mechanism is adopted: for each chromosome, the fitness values from the clients are aggregated according to the importance of each client. The algorithm is as follows:
S3.1: initialize the aggregated population fitness F to 0 and obtain a random center point x;
S3.2: input the kernel bandwidth h, the aggregation termination distance $s_d$, the overall population fitness f, and the client weights $W = [w_1, w_2, \ldots, w_K]$;
S3.3: compute the distances from all points of the overall population fitness f to the center point x, and find all points within the kernel bandwidth h; this set is called M;
S3.4: compute the vector from the center point x to each point in M and sum these vectors to obtain $M_h(x)$;
S3.5: move the center point x along $M_h(x)$, so that it becomes $x' = x + \|M_h(x)\|$;
S3.6: repeat steps S3.3 to S3.5 until $\|M_h(x)\| < s_d$, then execute step S3.7;
S3.7: output the aggregated population fitness F;
the kernel bandwidth h is an important parameter of the Gaussian kernel function, and different values give different aggregation results; the client weight W is the fraction of the total data volume across all clients that each client holds.
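The client weights are described as each client's share of the total data volume. A minimal Python sketch of that calculation follows; the function name `client_weights` and the use of sample counts as the measure of data volume are assumptions.

```python
def client_weights(sample_counts):
    """Return each client's fraction of the total number of samples."""
    total = sum(sample_counts)
    if total <= 0:
        raise ValueError("at least one client must hold data")
    return [n / total for n in sample_counts]


# Three clients holding 100, 300 and 600 samples.
weights = client_weights([100, 300, 600])
```

The resulting weights sum to 1, so a client with more data has proportionally more influence on the aggregated fitness.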
Further, $M_h(x)$ is calculated as follows:
[The formula for $M_h(x)$ appears as an image in the original publication and is not reproduced in this text.]
where $x_i$ denotes the i-th chromosome in the population, $w_k$ denotes the weight of the k-th client, and the kernel term (also an image in the original publication) is a Gaussian kernel function.
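Steps S3.1 to S3.7 can be illustrated for a single chromosome whose fitness values from the K clients are scalars. The Python sketch below is an assumption-laden illustration, not the formula from the original publication (which survives only as an image). It uses the Gaussian kernel $\exp(-d^2 / (2h^2))$, normalizes the weighted shift by the total kernel weight as in standard mean shift, applies the signed shift rather than its norm, and takes the starting point as an argument instead of drawing it at random. The name `mean_shift_aggregate` and the `max_iter` guard are hypothetical.

```python
import math


def mean_shift_aggregate(values, weights, h, s_d, start, max_iter=1000):
    """Aggregate one chromosome's per-client fitness values by mean shift."""
    x = start
    for _ in range(max_iter):
        num = 0.0
        den = 0.0
        for v, w in zip(values, weights):
            d = v - x
            if abs(d) <= h:  # S3.3: keep only points within the bandwidth
                g = w * math.exp(-(d * d) / (2 * h * h))  # weighted kernel
                num += g * d  # S3.4: weighted sum of offsets
                den += g
        if den == 0.0:
            break  # no point within the bandwidth; keep the current center
        shift = num / den
        x += shift  # S3.5: move the center
        if abs(shift) < s_d:  # S3.6: stop once the move is small enough
            break
    return x  # S3.7: aggregated fitness for this chromosome
```

Points farther than h from the current center are ignored, so an outlying client has no effect on the aggregated value unless the center drifts toward it.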
Further, in step S4, based on the aggregated population fitness $F = \{f_c(X_1), \ldots, f_c(X_i), \ldots, f_c(X_{NP})\}$ obtained in step S3, offspring are selected to replace parent chromosomes and form a new population, as follows:
[Equation (4) appears as an image in the original publication and is not reproduced in this text.]
where $f(U_i)$ denotes the fitness of the parent chromosome $U_i$ (the parent chromosomes are those trained in the previous round) and $f_c(X_i)$ denotes the aggregated fitness value of the i-th chromosome $X_i$.
Further, in step S5, based on the conventional differential evolution (DE) mutation scheme "DE/current-to-best/1", genes in the chromosome are randomly changed to new values with a certain probability, as follows:
$Y_i = X_i + \beta (X_{best} - X_i) + \beta (X_{r1} - X_{r2})$ (5)
where $Y_i$ denotes the mutation vector of the i-th chromosome $X_i$ in the population; $X_{best}$ is the best individual in the population; $X_{r1}$, $X_{r2}$ and $X_i$ are three different individuals, with $X_{r1}$ and $X_{r2}$ selected at random from the population; and $\beta$ is a scaling factor with value rand(0, 1).
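Equation (5) can be sketched in Python for chromosomes stored as lists of numbers. Treating genes as real values is an assumption made for illustration (the source does not say how symbolic genes are mapped to numbers), and the function name `mutate_current_to_best` is hypothetical. One value of $\beta$ is drawn per call and used for both difference terms, as equation (5) suggests.

```python
import random


def mutate_current_to_best(population, best, i, rng):
    """Return the mutation vector Y_i for chromosome i (DE/current-to-best/1).

    Requires at least three chromosomes so that r1, r2 and i are distinct.
    """
    others = [j for j in range(len(population)) if j != i]
    r1, r2 = rng.sample(others, 2)  # two distinct indices, both != i
    beta = rng.random()  # scaling factor drawn from [0, 1)
    x_i, x_r1, x_r2 = population[i], population[r1], population[r2]
    return [x + beta * (b - x) + beta * (a - c)
            for x, b, a, c in zip(x_i, best, x_r1, x_r2)]
```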
Further, in step S6, each element of the i-th chromosome $X_i$ in the population is crossed with the corresponding element of the mutation vector $Y_i$ to create a new trial vector $Z_i$, so that the population can search for better solutions in the solution space, as follows:
[Equation (6) appears as an image in the original publication and is not reproduced in this text.]
where $z_{i,j}$, $y_{i,j}$ and $x_{i,j}$ denote the j-th elements of the trial vector $Z_i$, the mutation vector $Y_i$ and the chromosome $X_i$, respectively; CR is the crossover probability with value rand(0, 1); and l is a random integer between 1 and L, where L is the chromosome length;
after the crossover operation, a new population is generated; the new population is sent to the clients and the process returns to step S2.
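Equation (6) is not available in the extracted text. Given the definitions of CR and the random index l, the Python sketch below assumes the binomial crossover that is standard in differential evolution: element j of the trial vector comes from the mutation vector when a uniform draw falls below CR or when j equals l, and from the original chromosome otherwise. The function name `crossover` is hypothetical, and l is 0-based here.

```python
import random


def crossover(x, y, cr, rng):
    """Build the trial vector Z_i from chromosome x and mutation vector y."""
    l = rng.randrange(len(x))  # forced position: at least one gene comes from y
    return [y_j if (rng.random() < cr or j == l) else x_j
            for j, (x_j, y_j) in enumerate(zip(x, y))]
```

Forcing position l guarantees that the trial vector differs from the original chromosome in at least one place whenever the two inputs differ there.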
Further, in step S2, the fitness is calculated as the root mean square error (RMSE) of the symbolic regression. An RMSE threshold is set, and when the population fitness f falls below this threshold, the termination condition is met and the symbolic regression is complete.
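The RMSE fitness that each client computes on its local data can be sketched as follows. The names `rmse` and `predict` and the representation of the local data as `(input, target)` pairs are assumptions; in the method described here, `predict` would be the expression decoded from a chromosome.

```python
import math


def rmse(predict, samples):
    """Root mean square error of `predict` over (input, target) pairs."""
    squared = [(predict(x) - y) ** 2 for x, y in samples]
    return math.sqrt(sum(squared) / len(samples))


# A candidate expression evaluated on one client's local samples.
local_data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
fitness = rmse(lambda x: 2.0 * x, local_data)
```

A fitness below the chosen threshold (the examples in this document use $10^{-4}$) meets the termination condition.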
Compared with the prior art, the invention has the advantages that:
(1) Compared with existing distributed GP techniques, the invention protects data privacy and security by training a global model through federated learning. In addition, clients are free to join or leave the system at any time, which better matches real-world application scenarios.
(2) The invention further improves the search performance of the genetic programming algorithm with a method based on mean shift aggregation, and assigns different weights to reflect the differing importance of data samples, thereby effectively solving symbolic regression problems in real environments.
(3) The symbolic regression method makes full use of the data and performs better than traditional genetic programming algorithms.
Drawings
FIG. 1 is an algorithm block diagram of the symbolic regression method based on federated genetic programming according to an embodiment of the present invention;
FIG. 2 is a schematic illustration of chromosome coding in an embodiment of the present invention;
FIG. 3 is a schematic diagram of symbolic regression to be solved in an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Example 1:
The main purpose of this work is to solve symbolic regression problems in which the data are scattered across different local machines and may not be transmitted to a central server. Moreover, the data distribution of each client does not cover the whole sample space, and the amount of data differs between clients. When clients train models independently, each client may obtain a different function expression that is far from the target function. As shown in FIG. 3, the invention provides a joint training mode that combines the data of multiple clients in training, so as to obtain the desired target function.
A symbolic regression method based on federated genetic programming, as shown in FIG. 1, comprising the following steps:
S1: Initialization: create multiple threads, determine the number of clients accessing the server, and ensure that the clients connect to the server successfully; randomly initialize a population of size NP;
A symbolic regression system is constructed. The system comprises multiple clients and a central server. The server sends the population to the clients; each client computes the fitness on its own data and returns the fitness to the server. The data exchanged between the clients and the server are not raw data, which solves the problem that data cannot be shared in a privacy-sensitive environment.
The server and the clients are started. The server monitors in real time whether a client is requesting access or a connected client needs to disconnect, and responds immediately when a new client requests access. After all clients have connected successfully, the population is initialized on the server; the server confirms the IP address and port of each connected client and then sends the initial population to all clients;
population initialization on the server means generating NP random chromosomes to form the initial population, expressed as follows:
$X = \{X_i \mid X_i = [x_{i,1}, x_{i,2}, \ldots, x_{i,L}],\ i = 1, 2, \ldots, NP\}$ (1)
where $X_i$ is a vector representing the i-th chromosome, i is the index of the chromosome in the population, $x_{i,j}$ is the j-th gene of the i-th chromosome $X_i$, L is the chromosome length, and NP is the population size; each chromosome comprises a main program and several sub-functions, both of which are composed of head and tail gene expressions, as shown in FIG. 2;
on the client, the IP address and port number of the server are confirmed before startup; once connected, the client waits for the server to send a population for fitness calculation.
S2: Client fitness calculation: the clients compute the population fitness in parallel and check whether the fitness value meets the termination condition; if so, exit, otherwise execute step S3;
after a client receives the population, each chromosome in the population is decoded into an expression with the same length as the chromosome. Assume the data sets of all clients are represented as follows:
$D = \{D_1, \ldots, D_k, \ldots, D_K\}$ (2)
where $D_k$ denotes the data of the k-th client connected to the server, k = 1, ..., K, and K is the number of clients connected to the server. The overall population fitness f, obtained by decoding and evaluating the chromosomes, is expressed as follows:
[Equation (3) appears as an image in the original publication and is not reproduced in this text.]
where NP denotes the population size and $f_k(X_i)$ denotes the fitness value of the i-th chromosome in the population computed at the k-th client, i = 1, ..., NP.
The fitness is calculated as the root mean square error (RMSE) of the symbolic regression. An RMSE threshold is set, and when the population fitness f falls below this threshold, the termination condition is met and the symbolic regression is complete.
S3: Server fitness aggregation: aggregate the fitness with the mean shift aggregation mechanism to obtain the aggregated population fitness F;
a mean shift aggregation mechanism is adopted: for each chromosome, the fitness values from the clients are aggregated according to the importance of each client. The algorithm is as follows:
S3.1: initialize the aggregated population fitness F to 0 and obtain a random center point x;
S3.2: input the kernel bandwidth h, the aggregation termination distance $s_d$, the overall population fitness f, and the client weights $W = [w_1, w_2, \ldots, w_K]$;
S3.3: compute the distances from all points of the overall population fitness f to the center point x, and find all points within the kernel bandwidth h; this set is called M;
S3.4: compute the vector from the center point x to each point in M and sum these vectors to obtain $M_h(x)$;
S3.5: move the center point x along $M_h(x)$, so that it becomes $x' = x + \|M_h(x)\|$;
S3.6: repeat steps S3.3 to S3.5 until $\|M_h(x)\| < s_d$, then execute step S3.7;
S3.7: output the aggregated population fitness F;
the kernel bandwidth h is an important parameter of the Gaussian kernel function, and different values give different aggregation results; the client weight W is the fraction of the total data volume across all clients that each client holds.
$M_h(x)$ is calculated as follows:
[The formula for $M_h(x)$ appears as an image in the original publication and is not reproduced in this text.]
where $x_i$ denotes the i-th chromosome in the population, $w_k$ denotes the weight of the k-th client, and the kernel term (also an image in the original publication) is a Gaussian kernel function.
S4: Gene selection:
based on the aggregated population fitness $F = \{f_c(X_1), \ldots, f_c(X_i), \ldots, f_c(X_{NP})\}$ obtained in step S3, offspring are selected to replace parent chromosomes and form a new population, as follows:
[Equation (4) appears as an image in the original publication and is not reproduced in this text.]
where $f(U_i)$ denotes the fitness of the parent chromosome $U_i$ (the parent chromosomes are those trained in the previous round) and $f_c(X_i)$ denotes the aggregated fitness value of the i-th chromosome $X_i$.
S5: Gene mutation:
based on the conventional differential evolution (DE) mutation scheme "DE/current-to-best/1", genes in the chromosome are randomly changed to new values with a certain probability, as follows:
$Y_i = X_i + \beta (X_{best} - X_i) + \beta (X_{r1} - X_{r2})$ (5)
where $Y_i$ denotes the mutation vector of the i-th chromosome $X_i$ in the population; $X_{best}$ is the best individual in the population; $X_{r1}$, $X_{r2}$ and $X_i$ are three different individuals, with $X_{r1}$ and $X_{r2}$ selected at random from the population; and $\beta$ is a scaling factor with value rand(0, 1).
S6: Gene crossover:
each element of the i-th chromosome $X_i$ in the population is crossed with the corresponding element of the mutation vector $Y_i$ to create a new trial vector $Z_i$, so that the population can search for better solutions in the solution space, as follows:
[Equation (6) appears as an image in the original publication and is not reproduced in this text.]
where $z_{i,j}$, $y_{i,j}$ and $x_{i,j}$ denote the j-th elements of the trial vector $Z_i$, the mutation vector $Y_i$ and the chromosome $X_i$, respectively; CR is the crossover probability with value rand(0, 1); and l is a random integer between 1 and L, where L is the chromosome length;
after the crossover operation, a new population is generated; the new population is sent to the clients and the process returns to step S2.
In this example, to verify the performance of the algorithm framework of the present invention, verification was first performed on 5 synthetic benchmark data sets. The parameters of the algorithm were set as follows: population size NP = 30, maximum number of iterations R = 20000, $s_d$ = 0.5, kernel bandwidth h = 3, and fitness termination threshold RMSE < $10^{-4}$.
Example 2:
In this example, to further verify the effectiveness of the present invention, verification was performed on 5 noisy data sets. The parameters of the algorithm were set as follows: population size NP = 50, maximum number of iterations R = 20000, $s_d$ = 0.5, kernel bandwidth h = 3, and fitness termination threshold RMSE < $10^{-4}$.
Example 3:
In this example, finally, verification was performed on 2 real-scene data sets. The parameters of the algorithm were set as follows: population size NP = 50, maximum number of iterations R = 20000, $s_d$ = 0.5, kernel bandwidth h = 3, and fitness termination threshold RMSE < $10^{-4}$.
The results of the three examples show that the method clearly outperforms existing genetic programming algorithms in both RMSE and convergence rate on data sets from different environments. The method protects the data while improving the search capability of the genetic programming algorithm.

Claims (10)

1. A symbolic regression method based on federated genetic programming, characterized by comprising the following steps:
S1: Initialization: create multiple threads, determine the number of clients accessing the server, and ensure that the clients connect to the server successfully; randomly initialize a population of size NP;
S2: Client fitness calculation: the clients compute the population fitness in parallel and check whether the fitness value meets the termination condition; if so, exit, otherwise execute step S3;
S3: Server fitness aggregation: aggregate the fitness with the mean shift aggregation mechanism to obtain the aggregated population fitness F;
S4: Gene selection: select individuals according to the aggregated population fitness F and use them as parents to breed the next generation of program individuals;
S5: Gene mutation: genes are randomly changed to new values with a certain probability;
S6: Gene crossover: each gene is crossed with the mutation vector to generate a new population, and the process returns to step S2.
2. The symbolic regression method based on federated genetic programming according to claim 1, wherein in step S1 a symbolic regression system is constructed; the system comprises multiple clients and a central server; the server sends the population to the clients, and each client computes the fitness on its own data and returns the fitness to the server; neither side transmits raw data, which solves the problem that data cannot be shared in a privacy-sensitive environment.
3. The symbolic regression method based on federated genetic programming according to claim 2, wherein in step S1 the server and the clients are started; the server monitors in real time whether a client is requesting access or a connected client needs to disconnect, and responds immediately when a new client requests access; after all clients have connected successfully, the population is initialized on the server; the server confirms the IP address and port of each connected client and then sends the initial population to all clients;
population initialization on the server means generating NP random chromosomes to form the initial population, expressed as follows:
$X = \{X_i \mid X_i = [x_{i,1}, x_{i,2}, \ldots, x_{i,L}],\ i = 1, 2, \ldots, NP\}$ (1)
where $X_i$ is a vector representing the i-th chromosome, i is the index of the chromosome in the population, $x_{i,j}$ is the j-th gene of the i-th chromosome $X_i$, L is the chromosome length, and NP is the population size; each chromosome comprises a main program and several sub-functions, both of which are composed of head and tail gene expressions;
on the client, the IP address and port number of the server are confirmed before startup; once connected, the client waits for the server to send a population for fitness calculation.
4. The symbolic regression method based on federated genetic programming according to claim 1, wherein in step S2, after a client receives the population, each chromosome in the population is decoded into an expression with the same length as the chromosome; assume the data sets of all clients are represented as follows:
$D = \{D_1, \ldots, D_k, \ldots, D_K\}$ (2)
where $D_k$ denotes the data of the k-th client connected to the server, k = 1, ..., K, and K is the number of clients connected to the server; the overall population fitness f, obtained by decoding and evaluating the chromosomes, is expressed as follows:
[Equation (3) appears as an image in the original publication and is not reproduced in this text.]
where NP denotes the population size and $f_k(X_i)$ denotes the fitness value of the i-th chromosome in the population computed at the k-th client, i = 1, ..., NP.
5. The symbolic regression method according to claim 1, wherein in step S3, a mean shift aggregation mechanism is adopted, in which the plurality of fitness values of each chromosome are aggregated according to the importance of each client, and the mean shift aggregation algorithm is specifically as follows:
S3.1: initializing the aggregated population fitness F to 0, and acquiring a random central point x;
S3.2: inputting the kernel bandwidth h, the aggregation termination distance sd, the overall population fitness f, and the client weights $W = [w_1, w_2, \ldots, w_K]$;
S3.3: calculating the distances from all points of the overall population fitness f to the random central point x, and then finding all points within the range of the kernel bandwidth h, this set being called the set M;
S3.4: calculating the vector from the random central point x to each point in the set M, and adding all the vectors to obtain $M_h(x)$;
S3.5: moving the random central point x along $M_h(x)$, so that it becomes $x' = x + \|M_h(x)\|$;
S3.6: repeating steps S3.3 to S3.5 until $\|M_h(x)\| < sd$ is satisfied, and then executing step S3.7;
S3.7: outputting the aggregated population fitness F;
the kernel bandwidth h in the mean shift aggregation algorithm is an important parameter of the Gaussian kernel function, and different values yield different aggregation effects; the weight W of each client is calculated as the percentage of that client's data volume in the total data volume of all clients.
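Steps S3.1 to S3.7 can be sketched for a single chromosome, whose K per-client fitness values are treated as one-dimensional points. This is a rough illustration under several assumptions, not the claimed formula: the shift uses the standard weighted Gaussian-kernel mean shift form (the exact expression for $M_h(x)$ is given only as a figure), the center moves by the signed shift as in standard mean shift rather than by its norm, the random central point is drawn from the points themselves so that the set M is never empty, and a `max_iter` cap is added for safety. All function names are hypothetical.

```python
import numpy as np

def client_weights(data_sizes):
    """Weight of each client = its share of the total data volume."""
    sizes = np.asarray(data_sizes, dtype=float)
    return sizes / sizes.sum()

def mean_shift_vector(x, points, weights, h):
    """Weighted Gaussian mean shift at x over the points within bandwidth h (set M)."""
    in_range = np.abs(points - x) <= h
    if not in_range.any():
        return 0.0
    diffs = points[in_range] - x
    kernel = weights[in_range] * np.exp(-0.5 * (diffs / h) ** 2)
    return float(np.sum(kernel * diffs) / np.sum(kernel))

def aggregate_fitness(points, weights, h, sd, max_iter=100, seed=None):
    """Aggregate the K client fitness values of one chromosome into one value."""
    rng = np.random.default_rng(seed)
    x = float(rng.choice(points))          # random central point (S3.1)
    for _ in range(max_iter):              # S3.3 to S3.6
        shift = mean_shift_vector(x, points, weights, h)
        x += shift                         # assumption: signed move, not the norm
        if abs(shift) < sd:
            break
    return x                               # S3.7

def aggregate_population(f, weights, h, sd, seed=None):
    """Apply the aggregation column by column to f of shape (K, NP)."""
    return np.array([aggregate_fitness(f[:, i], weights, h, sd, seed=seed)
                     for i in range(f.shape[1])])
```

For example, `client_weights([100, 300])` gives weights of 0.25 and 0.75, so the client with three times as much data pulls the aggregated value three times as strongly.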
6. The symbolic regression method based on federal genetic programming according to claim 5, wherein $M_h(x)$ is calculated as follows:
Figure FDA0003587306320000031
wherein $x_i$ denotes the i-th chromosome in the population, $w_k$ denotes the weight of the k-th client, and
Figure FDA0003587306320000032
represents a Gaussian kernel function.
7. The symbolic regression method based on federal genetic programming according to claim 1, wherein in step S4, based on the aggregated population fitness $F = \{f_c(X_1), \ldots, f_c(X_i), \ldots, f_c(X_{NP})\}$ obtained in step S3, offspring are selected to replace the parent chromosomes and form a new population, as follows:
Figure FDA0003587306320000033
wherein, f (U) i ) Representing the parent chromosome U i Fitness of (a), parent chromosome represents the last round of trained chromosomes, f c (X i ) Denotes the ith chromosome X i Fitness value of the polymerization.
8. The symbolic regression method based on federal genetic programming as claimed in claim 1, wherein, in step S5, based on the traditional DE mutation scheme "DE/current-to-best/1", the genes in the chromosome are randomly changed into new values with a certain probability as follows:
$$Y_i = X_i + \beta (X_{best} - X_i) + \beta (X_{r1} - X_{r2}) \qquad (5)$$
wherein $Y_i$ denotes the mutant vector of the i-th chromosome $X_i$ in the population, $X_{best}$ is the best individual in the population, $X_{r1}$, $X_{r2}$ and $X_i$ are three different individuals, $X_{r1}$ and $X_{r2}$ being randomly selected from the population; $\beta$ is a scaling factor and takes the value rand(0, 1).
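Formula (5) can be written out directly if chromosomes are treated as real-valued vectors, which is a simplifying assumption made here for illustration (the claim speaks of genes being changed to new values with a certain probability, and a symbolic encoding would need an extra mapping step). The sketch also assumes that lower fitness is better when picking $X_{best}$ and that one value of $\beta$ is drawn per chromosome. The function name `mutate` is hypothetical.

```python
import numpy as np

def mutate(population, fitness, rng):
    """DE/current-to-best/1: Y_i = X_i + b(X_best - X_i) + b(X_r1 - X_r2)."""
    np_size = len(population)
    best = population[np.argmin(fitness)]      # assumption: fitness is minimised
    mutants = np.empty_like(population, dtype=float)
    for i in range(np_size):
        # r1 and r2 are distinct from each other and from i.
        candidates = [j for j in range(np_size) if j != i]
        r1, r2 = rng.choice(candidates, size=2, replace=False)
        beta = rng.random()                    # scaling factor rand(0, 1)
        mutants[i] = (population[i]
                      + beta * (best - population[i])
                      + beta * (population[r1] - population[r2]))
    return mutants

rng = np.random.default_rng(0)
population = rng.random((5, 4))
fitness = rng.random(5)
mutants = mutate(population, fitness, rng)
```

The population must contain at least three individuals so that two distinct partners can be drawn for every chromosome.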
9. The symbolic regression method based on federal genetic programming according to claim 1, wherein in step S6, each element of the i-th chromosome $X_i$ in the population is crossed with the corresponding element of the mutant vector $Y_i$ to create a new trial vector, so that the population can search for a better solution in the solution space; that is, for each gene of the i-th chromosome $X_i$ of the population, a new trial vector $Z_i$ is created using the mutant vector $Y_i$, specifically as follows:
Figure FDA0003587306320000041
wherein $z_{i,j}$, $y_{i,j}$ and $x_{i,j}$ denote the j-th elements of the trial vector $Z_i$, the mutant vector $Y_i$ and the chromosome $X_i$, respectively; CR is the crossover probability and takes the value rand(0, 1); l is a random integer between 1 and L, where L is the length of the chromosome;
after the crossover operation is finished, a new population is generated; the new population is sent to the clients and the method returns to step S2.
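The crossover step can be sketched with the standard binomial crossover of differential evolution, which uses exactly the quantities defined above: a crossover probability CR and a randomly chosen position l that is always taken from the mutant vector. Because the crossover formula itself is only available as a figure, the rule below (take $y_{i,j}$ when a fresh random number is below CR or when j equals l, otherwise keep $x_{i,j}$) is an assumption, as is drawing one CR per chromosome. The function name `crossover` is hypothetical.

```python
import numpy as np

def crossover(population, mutants, rng):
    """Binomial crossover producing one trial vector Z_i per chromosome X_i."""
    np_size, length = population.shape
    cr = rng.random((np_size, 1))                   # CR = rand(0, 1), one per chromosome
    take = rng.random((np_size, length)) < cr       # positions copied from Y_i
    forced = rng.integers(0, length, size=np_size)  # position l (0-based here)
    take[np.arange(np_size), forced] = True         # at least one gene from Y_i
    return np.where(take, mutants, population)

rng = np.random.default_rng(0)
population = np.zeros((5, 4))
mutants = np.ones((5, 4))
trials = crossover(population, mutants, rng)
```

Forcing position l guarantees that every trial vector differs from its parent in at least one gene whenever the mutant differs there.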
10. The symbolic regression method based on federal genetic programming according to any one of claims 1 to 9, wherein in step S2, the fitness calculation uses a Root Mean Square Error (RMSE) in symbolic regression, a set value of the Root Mean Square Error (RMSE) is given, and when the population fitness f is smaller than the set value, that is, when a termination condition is reached, the symbolic regression is completed.
CN202210366425.9A 2022-04-08 2022-04-08 Symbolic regression method based on federal genetic programming Pending CN114840873A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210366425.9A CN114840873A (en) 2022-04-08 2022-04-08 Symbolic regression method based on federal genetic programming

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210366425.9A CN114840873A (en) 2022-04-08 2022-04-08 Symbolic regression method based on federal genetic programming

Publications (1)

Publication Number Publication Date
CN114840873A true CN114840873A (en) 2022-08-02

Family

ID=82563596

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210366425.9A Pending CN114840873A (en) 2022-04-08 2022-04-08 Symbolic regression method based on federal genetic programming

Country Status (1)

Country Link
CN (1) CN114840873A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117217150A (en) * 2023-09-13 2023-12-12 华南理工大学 DTCO formula modeling method based on genetic algorithm symbolic regression
CN117217150B (en) * 2023-09-13 2024-05-17 华南理工大学 DTCO formula modeling method based on genetic algorithm symbolic regression

Similar Documents

Publication Publication Date Title
Ma et al. A highly accurate prediction algorithm for unknown web service QoS values
CN111797321B (en) Personalized knowledge recommendation method and system for different scenes
CN110084365A (en) A kind of service provider system and method based on deep learning
US20220130496A1 (en) Method of training prediction model for determining molecular binding force
CN111159638A (en) Power distribution network load missing data recovery method based on approximate low-rank matrix completion
US20220237935A1 (en) Method for training a font generation model, method for establishing a font library, and device
CN114585006B (en) Edge computing task unloading and resource allocation method based on deep learning
CN114840873A (en) Symbolic regression method based on federal genetic programming
CN106803092B (en) Method and device for determining standard problem data
CN117290721A (en) Digital twin modeling method, device, equipment and medium
CN116708009A (en) Network intrusion detection method based on federal learning
Tao et al. DNA double helix based hybrid GA for the gasoline blending recipe optimization problem
Wang et al. Learning regularity for evolutionary multiobjective search: A generative model-based approach
Shan et al. NASM: nonlinearly attentive similarity model for recommendation system via locally attentive embedding
CN106953820A (en) Signal blind checking method based on the plural continuous neural networks of double Sigmoid
CN117033997A (en) Data segmentation method, device, electronic equipment and medium
CN116957128A (en) Service index prediction method, device, equipment and storage medium
CN110175287B (en) Flink-based matrix decomposition implicit feedback recommendation method and system
CN108830422A (en) Optimization method, the apparatus and system of intelligent driving
Qian et al. SVM Multi-Classification Optimization Research based on Multi-Chromosome Genetic Algorithm
CN114625886A (en) Entity query method and system based on knowledge graph small sample relation learning model
CN114186168A (en) Correlation analysis method and device for intelligent city network resources
Qi et al. Qubit neural tree network with applications in nonlinear system modeling
CN109359182A (en) A kind of answer method and device
Morell et al. A multi-objective approach for communication reduction in federated learning under devices heterogeneity constraints

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination