WO2025061851A1 - Circuitry, robot, navigation device, method and network - Google Patents


Info

Publication number
WO2025061851A1
Authority
WO
WIPO (PCT)
Prior art keywords
order
candidate object
circuitry
preserving
objects
Prior art date
Application number
PCT/EP2024/076276
Other languages
French (fr)
Inventor
Lukas MAUCH
Yihang CHEN
Original Assignee
Sony Group Corporation
Sony Europe B.V.
Priority date
Filing date
Publication date
Application filed by Sony Group Corporation, Sony Europe B.V. filed Critical Sony Group Corporation
Publication of WO2025061851A1 publication Critical patent/WO2025061851A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/004 Artificial life, i.e. computing arrangements simulating life
    • G06N3/008 Artificial life based on physical entities controlled by simulated intelligence so as to replicate intelligent life forms, e.g. based on robots replicating pets or humans in their appearance or behaviour
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G06N3/0475 Generative networks
    • G06N3/08 Learning methods
    • G06N3/0985 Hyperparameter optimisation; Meta-learning; Learning-to-learn
    • G06N3/091 Active learning
    • G06N3/092 Reinforcement learning
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/01 Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound

Definitions

  • the present disclosure generally pertains to the field of devices that implement or use machine learning, in particular to a circuitry, robot, navigation device, method and network realizing blackbox optimization processes.
  • TECHNICAL BACKGROUND: With machine learning (ML), machines develop processes without needing to be explicitly told what to do by human-developed algorithms.
  • Machine-learning approaches have been applied to large language models, computer vision, speech recognition, email filtering, agriculture and medicine.
  • generative artificial neural networks have been able to surpass results of many previous approaches.
  • Blackbox optimization problems are a particular class of optimization problems that appear often in real life applications, e.g., for optimizing navigation or optimizing chemical reactions.
  • an objective function is treated as a “black box”, i.e., the optimization algorithm has no access to the explicit form or to any analytical expression of the function and can only interact with the function by evaluating it at specific points in the input space.
  • an objective function is a mathematical relationship or a computational representation that quantifies the performance or fitness of potential elements within the environment that we seek to optimize. If the relationship cannot be formulated as a mathematical formula, we refer to it as a blackbox optimization problem. For example, it may be a goal of machine learning to determine the optimal path a robot shall take in a warehouse to be as efficient as possible.
  • blackbox optimization may be used.
  • the present disclosure provides a method for generating a candidate object using an order-preserving generative flow network whose input comprises a set of objects, wherein the method includes determining the order of the candidate object within the set of objects, and sampling the candidate object based on its determined order.
  • the present disclosure provides a method for training a generative flow network based on an order-preserving training criterion.
  • the present disclosure provides an order-preserving generative flow network whose input comprises a set of objects, configured to determine the order of the candidate object within the set of objects, and sample the candidate object based on its determined order.
  • the present disclosure provides an order-preserving generative flow network trained according to an order-preserving training criterion.
  • a candidate object may for example refer to a compositional object.
  • a compositional object may include one or more graphs, categorical data, integer data, one or more strings, one or more images, etc.
  • Candidate objects may, for example, refer to objects that are with high probability optimal in a given environment.
  • the ordering or order may refer to a ranking or rank. That is, the ordering in one dimension may be the ranking.
  • the ranking may refer to the ordering and the rank to the order.
  • the order may be induced based on an objective function.
  • the candidate object may be sampled proportional to the respective order of the candidate object.
  • the candidate object may be sampled with a probability that is exponential in the respective order of the candidate object.
  • the sampling of the candidate object may be based on a predefined initial state.
  • the sampling of the candidate object may be based on a predefined set of possible actions.
  • the set of objects may comprise a set of molecules.
  • the candidate object is a candidate molecule. That is, the candidate object is one molecule of the set of molecules, for example the optimal molecule.
  • the optimal molecule may refer to the molecule with the optimal conformation or structure, which may refer to a molecule with predefined desired properties, such as improved binding properties (e.g., binding affinity/selectivity to other target molecules or target structures) or the like.
  • the set of objects may comprise a set of neural network architectures for solving a given task, and the candidate object may be a candidate neural network architecture for solving the given task.
  • the candidate object may be a neural network architecture of the set of neural network architectures, for example the optimal neural network architecture.
  • the order preserving training criterion may be defined based on pairwise candidate comparisons.
  • the order preserving training criterion may include learning a proxy reward that is compatible with the pairwise candidate comparisons.
  • the order preserving training criterion may include determining an order-preserving loss function.
  • the order-preserving loss function may be based on a cross-entropy between the circuitry output and the pairwise candidate comparisons.
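  • As an illustration, such a cross-entropy over pairwise comparisons can be sketched as follows; the sigmoid parametrization of the preference probability and all function names are assumptions for this sketch, not the patent's exact formulation:

```python
import math

def op_pairwise_loss(log_r_x, log_r_xp, f_x, f_xp):
    """Cross-entropy between the proxy-reward preference for a pair
    (x, x') and the hard pairwise comparison induced by the true
    objective f (sketch only)."""
    # Preference probability for x over x' from the log-reward difference.
    p = 1.0 / (1.0 + math.exp(-(log_r_x - log_r_xp)))
    # Pairwise label from the true objective: 1 if f(x) > f(x').
    y = 1.0 if f_x > f_xp else 0.0
    eps = 1e-12  # numerical safety for log(0)
    return -(y * math.log(p + eps) + (1.0 - y) * math.log(1.0 - p + eps))
```

Minimizing this loss over many pairs pushes the proxy reward to rank objects in the same order as the objective.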
  • Controlling the machine may include controlling a navigation path of the machine or it may include identifying a neural network architecture for solving a given task.
  • Some embodiments may pertain to an order-preserving generative flow network whose input comprises a set of objects, configured to determine the order of the candidate object within the set of objects, and sample the candidate object based on its determined order.
  • the order-preserving generative flow network may exhibit any feature described above regarding an order-preserving generative flow network and/or any feature described above regarding the circuitry and may further exhibit any suitable feature described below with respect to any one of Figs.1 to 12.
  • Some embodiments may pertain to an order-preserving generative flow network trained according to an order-preserving training criterion, which may exhibit any features described above regarding the training criterion.
  • the order-preserving generative flow network trained according to an order-preserving training criterion may exhibit any feature described above regarding an order-preserving generative flow network and/or any feature described above regarding the circuitry and may further exhibit any suitable feature described below with respect to any one of Figs.1 to 12. It is noted that the aspects described above for the circuitry may be combined in any suitable way, and that the circuitry described above may further exhibit any suitable feature described below with respect to any one of Figs.1 to 12. It is noted that the aspects described above for the robot, navigational system and/or vehicle may be combined in any suitable way, and that the robot, navigational system and/or vehicle described above may further exhibit any suitable feature described below with respect to any one of Figs.1 to 12.
  • the methods may exhibit any feature described with respect to the circuitry and/or any suitable feature described below with respect to any one of Figs.1 to 12.
  • the methods may be performed by the circuitry and/or by an electronic device and/or by the robot, navigational system and/or vehicle as described above. It is noted that the robot, navigational system and/or vehicle may exhibit any feature described with respect to the circuitry and/or any suitable feature described below with respect to any one of Figs.1 to 12.
  • the methods as described herein are also implemented in some embodiments as a computer program causing a computer and/or a processor to perform the method, when being carried out on the computer and/or processor.
  • a non-transitory computer- readable recording medium stores therein a computer program product, which, when executed by a processor, such as the processor described above, causes the methods described herein to be performed.
  • Blackbox optimization problems are a very important class of optimization problems, because they often appear in real-world applications, e.g., in optimizing navigation or pathfinding, such as the optimal path of a warehouse robot or the optimal travel route to a destination or between multiple destinations, or in optimizing the architecture of a deep neural network for a specific task, or in simulation-based optimization of physical systems or chemical reactions.
  • Simulation-based refers to the underlying objective functions that are desired to be optimized being accessible only through running simulations at specific input points.
  • Blackbox problems are typically hard to solve, because: 1) No assumptions about special properties of the objective function, such as smoothness or convexity, can be made. 2) The objective functions are often expensive to evaluate, either requiring a lot of time or involving costly physical experiments. It has been recognized that in such scenarios, it is beneficial to use optimization methods that require as few function evaluations as possible. This is very challenging in practice.
  • blackbox problems are solved with stochastic optimization algorithms that do not rely on exact gradients or derivatives, but rather make use of random sampling to search for the optimal solutions. Examples are random search (see for example Fig.1) or more advanced algorithms like: 1) Bayesian optimization algorithms, 2) Evolutionary algorithms or 3) Generative Flow Network (GFlowNet) based optimization algorithms that have been proposed recently.
  • Fig.1 shows an example of a one-dimensional blackbox optimization problem, i.e., optimizing an unknown scalar objective function that depends on a single scalar variable x, being solved with a stochastic optimization algorithm, namely random sampling, i.e., via random search, to search for the optimal solution.
  • the horizontal axis illustrates values x.
  • the vertical axis on the left indicates the corresponding function values f(x) for values of x.
  • the second vertical axis p(x) illustrates the probability density function that is used to propose new candidate values x.
  • Fig.1 includes graph 40 indicating the values f(x).
  • the graph 40 has a small peak at 42, a jump in values at 43, a local minimum at 45 and a second peak at 44, which is also the overall maximum.
  • the dashed graph 46 indicates the corresponding values p(x).
  • since the graph 46 is a horizontal line, it is indicated that the probability is uniform, that is, the probability is the same for all values x.
  • the probability p(x) for values x is independent of f(x).
  • Fig. 1 thus shows a uniform sampling of values x for random search, wherein p(x) indicates the probability of the values x.
  • Fig.1 illustrates a uniform sampling of compositional objects x used for random search.
  • the random samples all have the same probability p(x).
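  • Random search as in Fig. 1 can be sketched in a few lines; the interval bounds and function names below are illustrative assumptions:

```python
import random

def random_search(f, low, high, n_evals, seed=0):
    """Uniform random search: propose x ~ U(low, high) independently of
    f(x) and keep the best blackbox evaluation seen so far."""
    rng = random.Random(seed)
    best_x, best_f = None, float("-inf")
    for _ in range(n_evals):
        x = rng.uniform(low, high)  # p(x) is uniform, independent of f(x)
        fx = f(x)                   # one (possibly expensive) evaluation
        if fx > best_f:
            best_x, best_f = x, fx
    return best_x, best_f
```

Because the proposal density ignores f, many evaluations land in low-reward regions, which is exactly the inefficiency the generative models below address.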
  • a better solution to the blackbox optimization problem makes use of search algorithms that are based on generative machine learning model as described below in more detail and illustrated in Figs.2, 3 and 4.
  • Generative machine learning model: The machine-learning model described in the embodiments below is based on a generative model. It is trained to generate samples x proportional to a probability density function (PDF), namely a log-linear function of the rank of the samples with respect to a given objective function f(x).
  • Such a generative model may exhibit particular efficacy in stochastic optimization scenarios by enabling the generation of candidate samples with highly desirable statistics for the objective function.
  • samples that have a high rank with respect to a given objective function may be sampled exponentially more often than samples with a lower rank.
  • the generative machine learning model may be identified by interacting with a trained model, i.e., by using it to generate candidate samples and by calculating rank-based (order-based) statistics of these samples, that are induced by a given objective function.
  • the machine learning model may, for example, be implemented as a special GFlowNet (regarding GFlowNets see: Bengio, Yoshua, et al. "Gflownet foundations.” arXiv preprint arXiv:2111.09266 (2021)).
  • Generative Flow Networks are a novel class of generative machine learning models that can sample compositional objects x from a probability distribution that is proportional to a given reward, i.e., from p(x) ∝ R(x). That is, they can generate a diverse set of candidates with a probability proportional to a given reward R(x).
  • Generative flow networks are a method for learning a stochastic policy for generating compositional objects x, such as graphs or strings, from a given unnormalized density by sequences of actions, where many possible action sequences may lead to the same object (see for example Malkin, Nikolay, et al.
  • GFlowNets are a powerful tool for stochastic optimization, e.g., for building novel blackbox (combinatorial) solvers. In fact, they have been successfully applied to problems like molecule design (Bengio et al., 2021a; Jain et al., 2022), robust scheduling (Zhang et al., 2023a) and graph combinatorial problems (Zhang et al., 2023b), delivering state-of-the-art performance.
  • GFlowNets are also useful for Bayesian structure learning and can be regarded as a unifying framework for many generative models (Zhang et al., 2022), such as for: 1) hierarchical VAEs (Ranganath et al., 2016), 2) normalizing flows (Dinh et al., 2014) and 3) diffusion models (Song et al., 2021).
  • In the following, definitions from Section 3 of (Bengio et al., 2021b) are given.
  • the vertices may be referred to as states and the edges as actions.
  • T may be the set of complete trajectories that can be constructed by executing actions from the action space A.
  • a trajectory flow may be a nonnegative function F : T → R≥0.
  • the edge flow may be defined as F(s→s′) = Σ_{τ ∈ T : s→s′ ∈ τ} F(τ).
  • the forward transition probability P_F and backward transition probability P_B may be defined as P_F(s′|s) = F(s→s′)/F(s) and P_B(s|s′) = F(s→s′)/F(s′).
  • GFlowNets sample candidate objects x, starting in the initial state s0 and by sequentially drawing actions from P_F, that are applied to the current state, causing a state transition. This sequential sampling yields trajectories that finish in the terminal states, which correspond one to one to the (compositional) candidate objects.
  • a nontrivial nonnegative reward function R : X → R≥0 may be given on the set of terminal states.
  • the parametrization of the OP GFlowNet may be based on the following: Table 1: Possible GFlowNet objectives.
  • Fig.2 shows an example of a generative machine learning model, for example, the GFlowNet described above, that is trained to generate compositional objects x proportional to a probability density function (dashed line).
  • Fig.2 shows GFlowNet-based search algorithms sampling proportional to p(x) ∝ R(x) (solid line).
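  • On a small discrete set, sampling proportional to a reward, as a (non-order-preserving) GFlowNet is trained to do, can be sketched by exact enumeration; the toy rewards below are assumptions:

```python
import random

def sample_proportional(rewards, n, seed=0):
    """Draw objects with probability p(x) = R(x) / sum of all rewards,
    i.e. proportional to a given nonnegative reward R(x)."""
    rng = random.Random(seed)
    objects = list(rewards)
    weights = [rewards[o] for o in objects]
    return [rng.choices(objects, weights=weights)[0] for _ in range(n)]
```

A real GFlowNet reaches the same distribution by sequential action sampling rather than enumeration, which is what makes it applicable to exponentially large compositional spaces.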
  • GFlowNets for combinatorial stochastic optimization of blackbox functions have two problems though: 1) GFlowNets require an explicit formulation of the scalar reward R(x) that measures the global quality of an object x. Explicit refers to the ability to compute R(x) for any x ∈ X. However, this is sometimes not possible.
  • In multi-objective optimization (MOO), the concept of Pareto dominance allows defining a partial order over X by comparing two objects x, x′.
  • However, not having the actual solution of the MOO problem, it is not possible to define a scalar function R(x) that induces a global order that is compatible with these pairwise comparisons.
  • Hence, GFlowNets are not directly applicable to MOO problems and rely on additional ideas like linear scalarization (Jain et al., 2023).
  • 2) GFlowNets may typically operate on an exponentially scaled reward, i.e., on R(x)^β. In the following, we call these methods GFlowNet-β.
  • In order to prioritize the identification of high-reward options for x, conventional practice involves training GFlowNet-β with rewards raised to a high power, such as R(x)^β with a large β. However, this requires a manual tuning of the parameter β.
  • GFlowNets can only be used if a scalar reward R(x) can be calculated explicitly.
  • Order Preserving GFlow Networks: The embodiments described below in more detail provide a novel training criterion for GFlowNets. GFlowNets trained with this novel training criterion are termed Order Preserving (OP) GFlowNets herein.
  • the choice of the order-preserving reward may balance exploration in the early training stages and exploitation in the later training stages, by gradually sparsifying the reward function.
  • the GFlowNet of some embodiments can be applied to a synthetic environment, for example, HyperGrid (Bengio et al., 2021a), and to real-world applications, for example, NATS-Bench (Dong et al., 2021) and molecular design (Shen et al., 2023). Further, order-preserving GFlowNets can be directly applied to problems where an explicit formulation of R(x) is not possible, such as to MOO problems.
  • Minimizing L_OP with respect to the proxy reward R̂ may amount to enforcing R̂ to induce the same ranking (order) on X as the objective function f.
  • the order preserving (OP) loss for the full set of terminal states X may be L_OP(θ; X) = E_{(x, x′)} [ L_OP(x, x′; θ) ]  Eq. (2), where f(·) may be the true target reward, and the expectation over (x, x′) may be taken uniformly over terminal state pairs in X.
  • the values R̂(x) are illustrated slightly offset, due to the logarithmic nature of the values R̂(x), compared to the values for f(x).
  • a generative model based on log-linear sampling.
  • Such a generative model as illustrated in Fig.3 exhibits particular efficacy in stochastic optimization scenarios, e.g., compared to the example of random search (Fig.1), or other generative models (Fig.2), which is illustrated in Fig.4.
  • Fig.4 illustrates a log-linear order-preserving generative model.
  • Fig.4 illustrates a support set X of multiple compositional objects.
  • a unique property of the OP GFlowNet of Fig.5 is that it samples objects proportional to the ranking of the object that is induced by the objective function f(x).
  • Let X be the set of possible compositional objects.
  • Set X includes possible compositional objects x1, x2 and x3.
  • From the set X of compositional objects, whose number |X| corresponds to 3 in the example illustrated in Fig.5, objects x1 to x3 are sampled proportional to the ranking of the objects that is induced by the objective function f(x). That is, the objective function f(x) induces a ranking on the values x, which is denoted with k.
  • the ranking may be an integer number, i.e., sorting the elements in X with respect to f(x) yields the ranks k.
  • the GFlowNet learns to generate samples x with a probability that is exponential in k.
  • Reward axis 33 illustrates how high the objective function f(x) for each compositional object is. Thus, f(x2) is largest, f(x3) is lowest and f(x1) is in between.
  • the induced ranking is determined.
  • compositional object x2 has an induced ranking k of 2, which corresponds to the highest objective function value f(x2)
  • compositional object x1 has an induced ranking k of 1, which corresponds to the medium objective function value f(x1)
  • compositional object x3 has an induced ranking k of 0, which corresponds to the lowest objective function value f(x3).
  • the ranking is an integer number, i.e., sorting the elements in X with respect to f(x) yields the ranks k.
  • the order-preserving GFlowNet learns to generate samples x with a probability that is exponential in k.
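  • The rank-based sampling of Fig.5 can be sketched on a toy set; exact enumeration stands in for the learned GFlowNet here, and the base e of the exponential is an assumption:

```python
import math
import random

def sample_by_rank(f_values, n, base=math.e, seed=0):
    """Sample objects with probability proportional to base**k, where k
    is the rank induced by the objective f (k = 0 for the lowest f)."""
    rng = random.Random(seed)
    objects = sorted(f_values, key=f_values.get)        # ascending in f(x)
    weights = [base ** k for k in range(len(objects))]  # exponential in rank k
    return [rng.choices(objects, weights=weights)[0] for _ in range(n)]
```

Note that only the order of the f-values matters: rescaling f monotonically leaves the sampling distribution unchanged, which is the order-preserving property.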
  • the order-preserving GFlowNet is trained in an active learning setup, for example, the active learning setup of Fig.7.
  • Fig.6 illustrates a state-action graph of an order-preserving GFlowNet and the computation of training loss.
  • four trajectories through state-action graph 10 are visualized that lead to three different terminal states 12. All trajectories flow from the initial state s_0 to intermediate states 11 to a terminal state 12.
  • the first trajectory flows from initial state s_0 to intermediate state s_1 to the intermediate state s_3 to the terminal state x_1.
  • the second trajectory flows from initial state s_0 to intermediate state s_4 to intermediate state s_5 to terminal state x_2.
  • the third trajectory flows from initial state s_0 to intermediate state s_4 to intermediate state s_5 to intermediate state s_3 to the terminal state x_1.
  • the fourth trajectory flows from initial state s_0 to intermediate state s_6 to intermediate state s_7 to terminal state x_3.
  • Each trajectory is associated with an objective function value f(x) that is related to the terminal state 12 in which the trajectory ends.
  • the highest reward ⁇ ( ⁇ _3 ) corresponds to terminal state x_3, which is the end of the fourth trajectory.
  • a medium ⁇ ( ⁇ _1 ) corresponds to terminal state x_1, which is the end of the first and third trajectories.
  • the lowest ⁇ ( ⁇ _2 ) corresponds to the terminal state x_2, which corresponds to the end of the second trajectory.
  • the trajectories are grouped in pairs.
  • the terminal state pairs are (x_1, x_2), (x_1, x_3), (x_2, x_3). Furthermore, the flow is balanced on each trajectory, and the trajectory balance (TB) is based on the condition Z ∏_t P_F(s_{t+1}|s_t) = R(x) ∏_t P_B(s_t|s_{t+1}), where Z, P_F, P_B are parametric functions, which essentially define the OP GFlowNet model, as introduced by Malkin, Nikolay, et al.
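  • The trajectory balance condition can be turned into a squared residual in log space, as in Malkin et al.; in this sketch, the per-step log-probabilities are taken as plain lists rather than network outputs:

```python
import math

def tb_loss(log_Z, log_pf_steps, log_pb_steps, reward):
    """Squared trajectory-balance residual: log Z plus the summed forward
    log-probabilities should equal log R(x) plus the backward ones."""
    lhs = log_Z + sum(log_pf_steps)
    rhs = math.log(reward) + sum(log_pb_steps)
    return (lhs - rhs) ** 2
```

When the loss is zero for every trajectory, the flow is balanced and terminal states are sampled proportional to the reward.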
  • Replay buffer 7 outputs offline batch 8 in an offline selection.
  • Offline batch 8 as well as online batch 6 are used as input to generate hybrid batch 9, which is used for updating sampler 3.
  • Sampler 3 is updated with hybrid batch 9 and a new round begins. If the next round does not include exploration, sampler 3 outputs evaluation batch 4. Therefore, every round without exploration leads to evaluation batch 4. If the next round includes exploration the sampler outputs online batch 6, which is used as input for generating hybrid batch 9 for updating sampler 3. Online batch 6 is also pushed to the replay buffer 7 to generate offline batch 8. Every round with exploration leads to online batch 6.
  • Sampler 3 corresponds to the OP GFlowNet. Sampler 3 is randomly initialized and used to sample initial dataset 5 of training trajectories.
  • a trajectory is a state action sequence that leads to a terminal state as described in Fig.6.
  • a ranking loss is used to compare trajectories based on the observed reward R(x) as explained in more detail in Fig.6.
  • the parameters of the OP GFlowNet are updated to minimize the ranking loss.
  • the current observed trajectories are added to replay buffer 7, such that they can be used in future training runs.
  • the process is started from the beginning by sampling from the updated OP GFlowNet.
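  • The active-learning rounds described above can be sketched as a loop; the sampler, objective and update function are placeholders for the OP GFlowNet, the blackbox objective and the loss minimization, respectively:

```python
import random

def active_learning_loop(sample_fn, f, update_fn, rounds=3, batch=4, seed=0):
    """Skeleton of the Fig. 7 loop: sample an online batch, score it with
    the blackbox objective, mix it with an offline batch drawn from the
    replay buffer, update the sampler, and push the new data to the buffer."""
    rng = random.Random(seed)
    replay_buffer = []
    for _ in range(rounds):
        online = [sample_fn(rng) for _ in range(batch)]         # exploration
        scored = [(x, f(x)) for x in online]                    # blackbox evals
        offline = rng.sample(replay_buffer, min(batch, len(replay_buffer)))
        update_fn(scored + offline)                             # hybrid batch
        replay_buffer.extend(scored)                            # push to buffer
    return replay_buffer
```

In the patent's setup, `update_fn` would minimize the ranking loss of Fig. 6 over the hybrid batch.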
  • Theoretical Analysis: It may be assumed that p_θ(x) is upper bounded by 1.
  • the log p_θ(x_i) may be piecewise linear with respect to the subscript i.
  • log p_θ(x_i) may be an arithmetic progression. Therefore, this may be one property of the sample statistics of the GFlowNet according to some embodiments.
  • the probability of observing x may be a log linear function of the rank of x (order of x) with respect to R(x), where R(x) may be the objective function or reward. This may be used for a breakdown analysis of products.
  • OP GFlowNets may be used with all parametrizations and training objectives that are given in Table 1, or with modifications thereof.
  • Evaluation: Maximal Reward and Modes
  • GFlowNets have been evaluated in their ability to learn to match the target distribution. For example, Madan et al. (2022); Nica et al. (2022) evaluate GFlowNets by the Spearman correlation between log p(x) and R(x) on held-out data.
  • a GFlowNet may learn to sample the terminal states proportional to the proxy reward R̂ instead of R(x), and R̂ may be unknown a priori. Therefore, metrics measuring closeness to the target distribution of R(x) may not be meaningful in some of our order preserving embodiments.
  • evaluating f(x) requires training a network to completion to get the test accuracy.
  • the machine learning model may be a tool in an optimization toolbox.
  • the samples x_1, …, x_n are ordered according to the function f(x).
  • These new function values can be used to extend the dataset, such that the optimization algorithm can make more refined proposals in the future.
  • Such an optimization algorithm can be implemented as part of a toolbox that is accessible through an API and either runs locally or as a cloud service. In both cases, the inner mechanisms of how new candidates are generated are hidden from the user.
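  • A minimal ask/tell interface for such a toolbox might look as follows; the class name, the best-so-far-plus-noise proposal rule and the search interval are illustrative assumptions, not the disclosed algorithm:

```python
import random

class BlackboxOptimizer:
    """Sketch of an optimizer behind an API: the user only asks for
    candidates and tells the optimizer their objective values."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)
        self._history = []  # observed (x, f(x)) pairs

    def ask(self):
        """Propose a candidate; proposals refine as the history grows."""
        if not self._history:
            return self._rng.uniform(-5.0, 5.0)
        best_x, _ = max(self._history, key=lambda p: p[1])
        return best_x + self._rng.gauss(0.0, 0.5)

    def tell(self, x, fx):
        """Report an evaluation, extending the dataset of observations."""
        self._history.append((x, fx))

    def best(self):
        return max(self._history, key=lambda p: p[1])[0]
```

The user loop is simply `x = opt.ask(); opt.tell(x, f(x))`; the candidate-generation mechanism stays hidden behind the interface.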
  • Path planning in autonomous systems: Some embodiments may relate to a path planning algorithm in autonomous systems. Often the quality of a path in an environment is measured using a reward function. The path planning algorithm may try to maximize this reward by taking actions to navigate through the environment. The order-based generative model may be used to sample paths that maximize such a reward with high probability.
  • An example is path planning of an agent in an environment, e.g., a robot that navigates around obstacles and needs to find an efficient route. In terms of OP GFlowNets, the routes through the environment are the trajectories.
  • the states are the positions in the environment that have been visited by the robot, i.e., 2D or 3D coordinates.
  • the actions are changes of position (i.e. step direction and length).
  • the objects x are the terminal positions of the robot within the environment. Depending on the terminal position, the robot will receive a reward that measures the quality of the trajectory. It could consist of the distance to the target position that is to be reached.
  • an OP GFlowNet is implemented for path planning, i.e., the probability of ending up at a specific position should be the given log-linear function of its ranking with respect to the target distance.
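  • For the path-planning example, rank-based sampling of terminal positions can be sketched as follows; exact enumeration stands in for the learned OP GFlowNet, and the candidate positions are assumptions:

```python
import math
import random

def sample_terminal_positions(target, candidates, n, seed=0):
    """Rank candidate terminal positions by negative distance to the
    target and sample them with probability exponential in that rank."""
    rng = random.Random(seed)
    # Ascending reward: farthest position first, i.e. rank k = 0.
    ranked = sorted(candidates, key=lambda p: -math.dist(p, target))
    weights = [math.e ** k for k in range(len(ranked))]
    return [rng.choices(ranked, weights=weights)[0] for _ in range(n)]
```

Positions close to the target are thus sampled exponentially more often than distant ones, mirroring the behavior described for the robot of Fig. 8.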
  • the applications of an OP GFlowNet are numerous. They can be used to solve any combinatorial optimization problem.
  • Fig.8 illustrates a robot navigating the optimal path based on the OP GFlowNet.
  • the environment 2 includes multiple paths 81 that a robot 80 may take and multiple buildings 85 that the robot may pass on the way.
  • the underlying objective function of the paths 81 is not known.
  • the underlying objective function of the paths 81 and buildings 85 is not known.
  • the robot calculates the optimal path 86 according to an embodiment of the OP GFlow Net as described above, for example, in Figs.3 to 7.
  • the optimal path 86 is a compositional object generated, based on the OP GFlowNet, by the circuitry included in the robot 80.
  • Fig.9 illustrates an example of a robot implementing a circuitry according to an embodiment of the present technique.
  • the robot 80 of Fig.9 corresponds to robot 80 of Fig.8.
  • Robot 80 is a robot with a humanoid upper body and a moving mechanism using wheels.
  • a flat spherical head 92 is provided on the body portion 91.
  • Two cameras 90 are provided on the front surface of the head 92 in a shape imitating the human eye.
  • Manipulators 93-1 and 93-2 which are manipulators with multiple degrees of freedom, are provided at the upper ends of the body portion 91.
  • Hand portions 94-1 and 94-2 are provided at the tips of the manipulator portions 93-1 and 93-2, respectively.
  • the robot 80 can grasp an object by the hand portions 94-1 and 94-2.
  • a dolly-shaped moving body portion 95 as a moving mechanism of the robot 80 is provided.
  • Robot 80 can move by rotating the wheels provided on the left and right sides of the moving body portion 95 and changing the direction of the wheels.
  • robot 80 is a so-called mobile manipulator capable of freely lifting and transporting an object while the object is being grasped by either one or both of the hand portions 94-1, 94-2.
  • the robot 80 may be configured as a single-arm robot (with only one of the manipulators 93-1, 93-2).
  • alternatively, a leg portion may be provided as the moving mechanism.
  • in this case, the body portion 91 is provided on the leg portion.
  • Robot 80 includes a circuitry for generating a compositional object according to an embodiment of the OP GFlow Net as described above, for example, in Figs.3 to 7.
  • robot 80 may determine the optimal path 86 as described in Fig.8.
  • Robot 80 may for example determine an optimal path in a warehouse.
  • Implementation: Fig.10 schematically illustrates an embodiment of an electronic device comprising circuitry for generating a compositional object.
  • the electronic device 100 may be, for example, a robot, e.g., a path finding robot, a navigation device, e.g., a navigation device in a vehicle, a terminal computer, a smartphone or another mobile device, etc.
  • the electronic device 100 includes a CPU 101 as processor. Additionally, or alternatively, other computation hardware, such as GPU, TPU, DSP etc. may be used.
  • the electronic device 100 further includes camera(s) 106, microphone(s) 107 and loudspeaker(s) 108 that are connected to the processor 101.
  • the processor 101 may for example implement the generation of a compositional object, such as, the generation of a path (e.g., optimal path) that is most efficient.
  • the processor 101 may for example, be configured to implement the machine learning model for generating the compositional object, such as the OP GFlowNet as explained in Figs.3 to 8.
  • the microphone 107 may be configured to receive any kind of audio signal.
  • the camera 106 may be one or more cameras, such as an RGB camera, an IR camera, a ToF camera, for example, an iToF or dToF camera, an event-based camera or the like.
  • the electronic device 100 further includes a user interface 109 that is connected to the processor 101. This user interface 109 acts as a man-machine interface and enables a dialogue between a user and the electronic device 100. For example, a user may make configurations to the system using this user interface 109.
  • the electronic device 100 further includes a Bluetooth interface 104, and a WLAN interface 105. These units 104, 105 act as I/O interfaces for data communication with external devices. An ethernet interface may also be possible. For example, additional loudspeakers, microphones, and cameras, e.g., a ToF camera, RGB camera or an event-based camera with WLAN or Bluetooth connection may be coupled to the processor 101 via these interfaces 104 and 105.
  • the electronic device 100 further includes a data storage 102 and a data memory 103 (here a RAM).
  • the data memory 103 is arranged to temporarily store or cache data or computer instructions for processing by the processor 101.
  • the data storage 102 is arranged as a long-term storage, which may be accessed via the processor 101.
  • the connection between the processor 101 and the camera 106 may include a camera serial interface (CSI).
  • the CSI is an interface between a camera 106 and a host processor 101.
  • via the CSI, control signals and data may be sent from the processor 101 to the camera 106 as well as from the camera 106 to the processor 101.
  • the electronic device 100 includes an artificial intelligence (AI) processor 110.
  • the AI processor 110 may include a graphics processing unit (GPU) and/or a tensor processing unit (TPU).
  • the AI processor 110 may be configured to execute an AI model (e.g., an artificial neural network), for example, the machine learning model for generating the compositional object, such as the OP GFlowNet as explained in Figs.3 to 8.
  • Neural-Network Architecture Search
  • Some embodiments may relate to a neural-network architecture search.
  • the quality of a neural network model can also be measured using a reward function.
  • the order-based generative model may try to maximize this reward. That is, the order-based generative model may be used to sample neural-network architectures that maximize such a reward with high probability.
  • Examples
  • In the following, the neural architecture search environment NATS-Bench (Dong et al., 2021) is described, which includes three datasets: CIFAR-10, CIFAR-100 and ImageNet-16-120.
  • the topology search space in NATS-Bench, i.e., the densely connected DAG of 4 nodes with an operation set of 5 representative candidates, may be chosen.
  • the representative operations may be 1) zeroize, 2) skip connection, 3) 1-by-1 convolution, 4) 3-by-3 convolution, and 5) 3-by-3 average pooling layer.
  • Each architecture can be uniquely determined by a sequence of length 6, where each element indicates the operation on the edge from node i to node j. Therefore, the neural architecture search can be regarded as an order-agnostic sequence generation problem, where the reward of each sequence is determined by the accuracy of the corresponding architecture.
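The length-6 encoding described above can be sketched as follows; the operation names and the helper function are illustrative and do not reproduce the actual NATS-Bench API:

```python
from itertools import combinations

# The 5 representative operations of the NATS-Bench topology search space.
OPS = ["zeroize", "skip_connect", "conv_1x1", "conv_3x3", "avg_pool_3x3"]

# The densely connected DAG of 4 nodes has one edge per node pair (i, j)
# with i < j, i.e., 6 edges -- hence sequences of length 6.
EDGES = list(combinations(range(4), 2))

def sequence_to_arch(seq):
    """Map a length-6 sequence of operation indices to an edge->operation map."""
    assert len(seq) == len(EDGES)
    return {edge: OPS[op] for edge, op in zip(EDGES, seq)}

arch = sequence_to_arch([3, 1, 0, 3, 2, 4])
```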
  • the GFlowNet may be specified by a state space S, an action space A and a terminal state set X.
  • Each state may be a sequence of operations of length 6, with possibly empty positions.
  • the initial state may be the empty sequence.
  • terminal states may be full sequences, i.e., sequences without empty positions.
  • Each forward action may fill an empty position in a non-terminal state with some operation from the operation set, and each backward action may empty some non-empty position.
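The state and action structure described in the preceding bullets can be sketched as follows (a minimal illustration in which `None` marks an empty position; the helper names are assumptions for this sketch):

```python
EMPTY = None          # marker for an unfilled position
N_POS, N_OPS = 6, 5   # sequence length and operation-set size

def forward_actions(state):
    """All (position, operation) pairs that fill an empty position."""
    return [(i, op) for i, s in enumerate(state) if s is EMPTY
            for op in range(N_OPS)]

def apply_forward(state, action):
    """Fill one empty position, producing the successor state."""
    i, op = action
    assert state[i] is EMPTY
    return state[:i] + (op,) + state[i + 1:]

def backward_actions(state):
    """All positions whose operation can be emptied again."""
    return [i for i, s in enumerate(state) if s is not EMPTY]

s0 = (EMPTY,) * N_POS              # initial state: the empty sequence
s1 = apply_forward(s0, (2, 4))     # fill position 2 with operation 4
is_terminal = all(s is not EMPTY for s in s1)
```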
  • Reward Design
  • For a terminal state x, the reward Rt(x) may be the test accuracy of x's corresponding architecture with the weights at the t-th epoch during its standard training pipeline.
  • T&T denotes the simulated train and test time.
  • NATS-Bench provides APIs on Rt(x) and its T&T time for t ≤ 200. Following Dong et al. (2021), during training, the test accuracy at epoch 12 may be used as the reward; when evaluating the candidates, the test accuracy at epoch 200 may be used as the reward.
  • the training reward R12 is a proxy for the validation reward R200 with lower T&T time, and rank-preserving methods only preserve the rank of R12, ignoring possibly unnecessary information: the exact value of R12.
  • Experimental Details of Example 1
  • Firstly, the focus lies on training the GFlowNet in a multi-trial sampling procedure. It may be beneficial to use a randomly generated initial dataset and to set its size to 64.
  • in each active training round, 10 new trajectories may be generated using the current training policy, and the GFlowNet may be updated on all the collected trajectories.
  • backward trajectory augmentation may be used: 20 terminal states are sampled from the replay buffer, and 20 trajectories per terminal state are generated using the current backward policy to update the GFlowNet.
  • in each round, the architecture with the highest training reward and the accumulated T&T time needed to compute all the training rewards up to that round are recorded.
  • the training may be terminated when the accumulated T&T time reaches some threshold, which may be 50000, 100000 and 200000 seconds for CIFAR-10, CIFAR-100 and ImageNet-16-120, respectively.
  • random search (RANDOM) may be adopted as a baseline, and the results were compared against previous multi-trial sampling methods: 1) evolutionary strategies, e.g., REA (Real et al., 2019); 2) reinforcement learning (RL)-based methods, e.g., REINFORCE (Williams, 1992); and 3) HPO methods, e.g., BOHB (Falkner et al., 2018).
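The multi-trial procedure described above can be sketched as follows. The callables `sampler`, `update`, `tt_time_of` and `reward_of` are hypothetical stand-ins for the GFlowNet policy, its training step and the NATS-Bench accuracy/time APIs; they are assumptions of this sketch, not the actual implementation:

```python
def multi_trial_search(sampler, update, tt_time_of, reward_of,
                       budget_sec, init_pool, round_batch=10):
    """Generate trajectories with the current policy, update the GFlowNet on
    all collected data, and stop once the accumulated simulated
    train-and-test (T&T) time exceeds the budget."""
    data, elapsed = list(init_pool), 0.0
    for x in data:                       # T&T time of the initial dataset
        elapsed += tt_time_of(x)
    best = max(data, key=reward_of) if data else None
    while elapsed < budget_sec:
        new = [sampler() for _ in range(round_batch)]
        data.extend(new)
        for x in new:
            elapsed += tt_time_of(x)
        update(data)                     # one training step on all data
        best = max(data, key=reward_of)  # record the best architecture
    return best, elapsed
```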
  • Fig.11 illustrates the summarized results of the experiment of an example Neural Network Architecture Search.
  • Fig.11 shows a multi-trial training of a GFlowNet sampler, wherein GFlowNet methods are TB, TB-RP, TB-RP-KL, TB-RP-KL-AUG, and previous multi-trial algorithms are REA, BOHB, REINFORCE.
  • the accuracy (at epoch 12 and 200) of the recorded sequence of architectures, with respect to the recorded sequence of accumulated T&T time is plotted.
  • the experimental results over 200 random seeds are averaged, and the mean is plotted.
  • the results on trajectory balance (TB) and its rank preserving variants (TB-RP) are reported.
  • Fig.11 illustrates the best test accuracy at epochs 12 and 200 of the random baseline (RANDOM). It is noted that the first 64 samples of the TB-type methods are generated by a random policy. The performance jump in the TB-type methods' curves indicates the start of the training.
  • Experimental Details of Example 2
  • Once a trained GFlowNet sampler is generated, the learned order-preserving reward may be used as a proxy to further boost the sampling efficiency. The following experimental settings may be adopted.
  • the (unboosted) GFlowNet samplers may be obtained by training on a fixed dataset, i.e., the checkpoint sampler after the first round of the previous multi-trial training.
  • the sampler’s performance may be measured by sequentially generating samples, and the highest validation accuracy obtained so far may be recorded.
  • Each algorithm's sample efficiency gain rgain is plotted, indicating that the baseline (unboosted sampler) takes rgain times as many samples to reach a target accuracy as that algorithm.
  • the procedure is repeated 100 times, and the mean is plotted in Fig.12.
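The sample-efficiency gain rgain described above can be computed, in simplified form, as follows (function names are illustrative):

```python
def samples_to_reach(accs, target):
    """Number of sequential samples until the running best accuracy
    reaches `target`; None if it never does."""
    best = float("-inf")
    for n, a in enumerate(accs, start=1):
        best = max(best, a)
        if best >= target:
            return n
    return None

def r_gain(baseline_accs, boosted_accs, target):
    """The baseline needs r_gain times as many samples as the boosted
    sampler to reach the target accuracy."""
    return samples_to_reach(baseline_accs, target) / samples_to_reach(boosted_accs, target)
```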
  • Fig.12 illustrates the summarized results of an experiment of an example Neural Network Architecture Search.
  • Fig.12 illustrates a boosting of a GFlowNet sampler.
  • the results of the experiment of Example 2 described above are summarized in Fig.12.
  • rboost = 1 denotes the unboosted sampler.
  • the highest test accuracy observed so far in the 100 samples and the performance gain w.r.t. each target accuracy are plotted. It is found that setting rboost to 8 reaches up to a 300% gain.
  • An electronic device including a user interface (e.g., electronic device 100 of Fig.10) may be used to search for an optimized neural-network architecture.
  • the electronic device may comprise circuitry configured to perform the neural network architecture search.
  • an electronic device (e.g., electronic device 100 of Fig.10 or robot 80 of Figs.8 and 9) may comprise the circuitry described in the following.
  • the circuitry configured to run the order-preserving generative model as described above may be a node of a network connected to one or more other nodes; one of the other nodes may, for example, be a server or the like (e.g., a cloud service).
  • the server may include circuitry configured to run the order-based generative model with any one of the features described above (e.g., the neural network architecture search).
  • the result of the order-preserving generative model run may be transmitted to the electronic device from the server.
  • (1) A circuitry configured to generate a candidate object using an order-preserving generative flow network whose input comprises a set of objects, wherein the circuitry is configured to determine the order of the candidate object within the set of objects, and sample the candidate object based on its determined order.
  • (2) The circuitry of (1), wherein the order is induced based on an objective function.
  • (10) A circuitry configured to perform a neural network architecture search based on a generative flow network.
  • (11) The circuitry of (10), wherein the generative flow network is an order-preserving flow network according to any one of (1) to (7).
  • (12) The circuitry of (10) or (11), wherein performing the neural network architecture search is based on user input.
  • (13) The circuitry of any one of (10) to (12), wherein the set of objects comprises a set of neural network architectures for solving a given task and the candidate object is a candidate neural network architecture for solving the task.
  • a pathfinding robot comprising circuitry configured to find a path using a generative flow network.
  • a navigation device comprising circuitry configured to find a route based on a generative flow network.
  • the navigation device of (16), wherein the generative flow network is an order- preserving flow network according to any one of (1) to (8).
  • a vehicle comprising the navigation device of (16) or (17).
  • a circuitry configured to train a generative flow network based on an order-preserving training criterion.
  • (25) A method for generating a candidate object based on an order-preserving generative flow network.
  • (26) A method for generating a candidate object using an order-preserving generative flow network whose input comprises a set of objects, wherein the method includes determining the order of the candidate object within a set of objects, sampling the candidate object based on its determined order.
  • (28) The method of any one of (25) to (27), wherein the candidate object is sampled proportional to the respective order of the candidate object.
  • the order-preserving training criterion includes training to sample proportional to a proxy reward that is compatible with a partial order of a set of objects that reflects the difference in the quality of the objects within a given environment.
  • the order-preserving training criterion is defined based on pairwise candidate comparisons.
  • the order-preserving training criterion includes learning a proxy reward that is compatible with the pairwise candidate comparisons.
  • the order-preserving training criterion includes determining an order-preserving loss function.
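As a loose illustration of an order-preserving loss defined on pairwise candidate comparisons, the following sketch uses a Bradley-Terry-style cross-entropy between the model's predicted preference and an observed comparison. This formulation is an assumption of the sketch; the exact loss used in the embodiments may differ:

```python
import math

def op_loss(log_r_x, log_r_y, y_better):
    """Cross-entropy between the model's pairwise preference and the
    observed comparison. The model is assumed to prefer y over x with
    probability sigmoid(log_r_y - log_r_x), where log_r_* are learned
    log proxy rewards."""
    p_y = 1.0 / (1.0 + math.exp(-(log_r_y - log_r_x)))
    p = p_y if y_better else 1.0 - p_y
    return -math.log(max(p, 1e-12))   # small floor for numerical safety
```

Minimizing this loss over many comparisons pushes the learned proxy reward to respect the observed partial order without fitting exact reward values.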

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Robotics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The circuitry is configured to generate a candidate object using an order-preserving generative flow network whose input includes a set of objects. The circuitry is configured to determine the order of the candidate object within the set of objects and sample the candidate object based on its determined order.

Description

CIRCUITRY, ROBOT, NAVIGATION DEVICE, METHOD AND NETWORK TECHNICAL FIELD The present disclosure generally pertains to the field of devices that implement or use machine learning, in particular to a circuitry, robot, navigation device, method and network realizing blackbox optimization processes. TECHNICAL BACKGROUND With machine learning (ML), machines develop processes without needing to be explicitly told what to do by any human-developed algorithms. Machine-learning approaches have been applied to large language models, computer vision, speech recognition, email filtering, agriculture and medicine. Recently, generative artificial neural networks have been able to surpass the results of many previous approaches. Blackbox optimization problems are a particular class of optimization problems that appear often in real-life applications, e.g., for optimizing navigation or optimizing chemical reactions. The particularity of these problems is that the objective function is treated as a “black box”, i.e., the optimization algorithm has no access to the explicit form or to any analytical expression of the function and it can only interact with the function by evaluating it at specific points in the input space. In the context of optimization, an objective function is a mathematical relationship or a computational representation that quantifies the performance or fitness of potential elements within the environment that we seek to optimize. If the relationship cannot be formulated as a mathematical formula, we refer to it as a blackbox optimization problem. For example, it may be a goal of machine learning to determine the optimal path a robot shall take in a warehouse to be as efficient as possible. However, if there are numerous possible paths to take, the underlying relationship of all possible paths, i.e., the objective function describing the paths, may be unknown, i.e., a blackbox. 
Thus, as the aim is to determine the optimal path based on an environment that cannot be described in a function, i.e., objective function, blackbox optimization may be used. Although there exist techniques for solving blackbox optimization problems in machine learning, it is generally desirable to improve on the existing techniques. SUMMARY According to a first aspect the present disclosure provides a circuitry configured to generate a candidate object using an order-preserving generative flow network whose input comprises a set of objects, wherein the circuitry is configured to determine the order of the candidate object within the set of objects, and sample the candidate object based on its determined order. According to a second aspect the present disclosure provides a circuitry configured to perform a neural network architecture search using a generative flow network. According to a third aspect the present disclosure provides a pathfinding robot comprising circuitry configured to find a path using a generative flow network. According to a fourth aspect the present disclosure provides a navigation device comprising circuitry configured to find a route using a generative flow network. According to a fifth aspect the present disclosure provides a method for generating a candidate object using an order-preserving generative flow network whose input comprises a set of objects, wherein the method includes determining the order of the candidate object within the set of objects, and sampling the candidate object based on its determined order. According to a sixth aspect the present disclosure provides a method for training a generative flow network based on an order-preserving training criterion. 
According to a seventh aspect the present disclosure provides an order-preserving generative flow network whose input comprises a set of objects, configured to determine the order of the candidate object within the set of objects, and sample the candidate object based on its determined order. According to an eighth aspect the present disclosure provides an order-preserving generative flow network trained according to an order-preserving training criterion. Further aspects are set forth in the dependent claims, the drawings and the following description. BRIEF DESCRIPTION OF THE DRAWINGS Embodiments are explained by way of example with respect to the accompanying drawings, in which: Fig.1 shows an example of a blackbox optimization problem, i.e., optimizing an unknown objective function, being solved with a stochastic optimization algorithm, namely random sampling, i.e., via random search, to search for the optimal solution; Fig.2 shows an example of a generative machine learning model, for example, the GFlowNet described above, that is trained to generate compositional objects x proportional to a probability density function; Fig.3 illustrates log-linear based sampling. The horizontal axis indicates values x. 
The vertical axis on the left indicates a corresponding function f(x) for compositional objects x; Fig.4 illustrates a log-linear order-preserving generative model; Fig.5 illustrates an order-preserving GFlowNet training; Fig.6 illustrates a state-action graph of an order-preserving GFlowNet and the computation of the training loss; Fig.7 illustrates an active learning setup for training an order-preserving GFlowNet; Fig.8 illustrates a robot navigating the optimal path based on the OP GFlowNet; Fig.9 illustrates an example of a robot implementing a circuitry according to an embodiment of the present technique; Fig.10 schematically illustrates an embodiment of an electronic device comprising circuitry for generating a compositional object; Fig.11 illustrates the summarized results of the experiment of an example Neural Network Architecture Search; and Fig.12 illustrates the summarized results of an experiment of an example Neural Network Architecture Search. DETAILED DESCRIPTION OF EMBODIMENTS Before a detailed description of the embodiments under reference of Figs.3 to 9 is given, general explanations are made. Some embodiments pertain to a circuitry configured to generate a candidate object using an order-preserving generative flow network whose input comprises a set of objects, wherein the circuitry is configured to determine the order of the candidate object within the set of objects and sample the candidate object based on its determined order. Circuitry may for example include a processor, a memory (RAM, ROM or the like), a storage, input means (mouse, keyboard, camera, etc.), output means (display (e.g. liquid crystal, (organic) light emitting diode, etc.), loudspeakers, etc.), a (wireless) interface, etc., as it is generally known for electronic devices (computers, smartphones, etc.). 
Moreover, it may include sensors for sensing still image or video image data (image sensor, camera sensor, video sensor, etc.), for sensing a fingerprint, for sensing environmental parameters (e.g. radar, humidity, light, temperature), etc. A candidate object may for example refer to a compositional object. A compositional object may include one or more graphs, categorical data, integer data, one or more strings, one or more images, etc. Candidate objects may for example refer to objects that are, with high probability, optimal in a given environment. A given environment may be anything, for example, a number of paths for pathfinding (the candidate object may be the optimal path), neural network architectures (the candidate object may be the optimal neural network architecture), molecule structures (the candidate object may be the optimal molecule structure), or any other example (see also the explanations below). A generative flow network may relate to any machine-learning technique relying on training data for generating candidate objects at a frequency proportional to their associated reward. This is in contrast to other methods, for example, Markov chain Monte Carlo (MCMC) methods, which do not rely on training data and therefore typically require a lot of samples, i.e., more samples than the generative flow network requires, to capture all modes of the reward function. Concerning 1-dimensional problems (e.g., scalar problems), the ordering or order may refer to a ranking or rank. That is, the ordering in one dimension may be the ranking. In the following, if scalar problems, i.e., one-dimensional problems, are discussed, the ranking may refer to the ordering and the rank to the order. The order may be induced based on an objective function. The candidate object may be sampled proportional to the respective order of the candidate object. The candidate object may be sampled with a probability that is exponential to the respective order of the candidate object. 
The sampling of the candidate object may be based on a predefined initial state. The sampling of the candidate object may be based on a predefined set of possible actions. The sampling of the candidate object may be based on a terminate action, wherein the generation of a candidate object may be finished once the terminate action was sampled. The optimal candidate object, e.g., the optimal path, optimal neural network architecture, optimal molecular structure may be based on predefined conditions. For example, the set of objects may comprise a set of paths from a starting location to an end location, which a pathfinder starting from the starting location can navigate to reach the end location. Then, the candidate object is a candidate path to reach the end location starting from the starting location. That is, the candidate object is one path of the set of paths, for example, the optimal path. The optimal path may, for example, refer to the fastest path or the shortest path from starting location to end location etc. For example, the set of objects may comprise a set of molecules. Then, the candidate object is a candidate molecule. That is, the candidate object is one molecule of the set of molecules, for example the optimal molecule. The optimal molecule may refer to the molecule with the optimal conformation or structure, which may refer to a molecule with predefined desired properties, such as improved binding properties (e.g., binding affinity/selectivity to other target molecules or target structures) or the like. For example, the set of objects may comprise a set of neural network architectures for solving a given task, and the candidate object may be a candidate neural network architecture for solving the given task. The candidate object may be a neural network architecture of the set of neural network architectures, for example the optimal neural network architecture. 
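The sequential sampling with a terminate action described above can be sketched as follows. `possible_actions`, `transition` and `policy` are hypothetical callables standing in for the environment and the learned policy; they are assumptions of this sketch:

```python
import random

TERMINATE = "terminate"

def generate(initial_state, possible_actions, transition, policy, rng):
    """Sequentially build a candidate object: starting from a predefined
    initial state, sample actions from the policy; generation finishes
    once the terminate action is sampled."""
    state = initial_state
    while True:
        actions = possible_actions(state) + [TERMINATE]
        weights = policy(state, actions)   # unnormalized action weights
        action = rng.choices(actions, weights=weights, k=1)[0]
        if action == TERMINATE:
            return state
        state = transition(state, action)

# Toy usage: append "a" until length 3; then only terminating remains.
rng = random.Random(0)
result = generate(
    "",
    lambda s: ["a"] if len(s) < 3 else [],
    lambda s, a: s + a,
    lambda s, acts: [1] * (len(acts) - 1) + [0] if len(acts) > 1 else [1],
    rng,
)
```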
Optimal neural network architecture may refer to optimal for the given, e.g., predefined, task. Some embodiments may pertain to a circuitry configured to perform a neural network architecture search based on a generative flow network. The generative flow network may be an order-preserving flow network. Performing the neural network architecture search may be based on user input. In this case the objective function that is aimed to be optimized may be the accuracy of a deep neural network (DNN), trained and evaluated on a specific dataset. Some embodiments may pertain to a robot for pathfinding comprising circuitry configured to find a path based on a generative flow network. The generative flow network may be an order-preserving flow network. Some embodiments may pertain to a navigation device comprising circuitry configured to find a route based on a generative flow network. The generative flow network may be an order-preserving flow network. Some embodiments may pertain to a vehicle including the navigation device. Some embodiments may pertain to a circuitry configured to train a generative flow network based on an order-preserving training criterion. The order-preserving training criterion may include training to sample proportional to a proxy reward that is compatible with a partial order of a set of objects that reflects the difference in the quality of the objects within a given environment. The set of objects may be candidate objects. The set of objects may include one or more objects. The set of objects may be compositional objects. The order-preserving training criterion may be defined based on pairwise candidate comparisons. The order-preserving training criterion may include learning a proxy reward that is compatible with the pairwise candidate comparisons. The order-preserving training criterion may include determining an order-preserving loss function. 
The order-preserving loss function may be based on a cross-entropy between the circuitry output and the pairwise candidate comparisons. Some embodiments may pertain to a method for generating a candidate object based on an order- preserving generative flow network. Some embodiments may pertain to a method for generating a candidate object using an order- preserving generative flow network whose input comprises a set of objects, wherein the method includes determining the order of the candidate object within the set of objects, and sampling the candidate object based on its determined order. Some embodiments may pertain to a method for training a generative flow network based on an order-preserving training criterion. Some embodiments may pertain to using an order-preserving generative flow network for controlling a machine. The order-preserving generative flow network for use of controlling a machine may exhibit any feature described above regarding an order-preserving generative flow network and/or any feature described above regarding the circuitry and may further exhibit any suitable feature described below with respect to any one of Figs.1 to 12. Controlling the machine may include controlling a navigation path of the machine or it may include identifying a neural network architecture for solving a given task. Some embodiments may pertain to an order-preserving generative flow network whose input comprises a set of objects, configured to determine the order of the candidate object within the set of objects, and sample the candidate object based on its determined order. The order- preserving generative flow network may exhibit any feature described above regarding an order- preserving generative flow network and/or any feature described above regarding the circuitry and may further exhibit any suitable feature described below with respect to any one of Figs.1 to 12. 
Some embodiments may pertain to an order-preserving generative flow network trained according to an order-preserving training criterion, which may exhibit any features described above regarding the training criterion. The order-preserving generative flow network trained according to an order-preserving training criterion may exhibit any feature described above regarding an order-preserving generative flow network and/or any feature described above regarding the circuitry and may further exhibit any suitable feature described below with respect to any one of Figs.1 to 12. It is noted that the aspects described above for the circuitry may be combined in any suitable way, and that the circuitry described above may further exhibit any suitable feature described below with respect to any one of Figs.1 to 12. It is noted that the aspects described above for the robot, navigational system and/or vehicle may be combined in any suitable way, and that the robot, navigational system and/or vehicle described above may further exhibit any suitable feature described below with respect to any one of Figs.1 to 12. It is noted that the methods may exhibit any feature described with respect to the circuitry and/or any suitable feature described below with respect to any one of Figs.1 to 12. The methods may be performed by the circuitry and/or by an electronic device and/or by the robot, navigational system and/or vehicle as described above. It is noted that the robot, navigational system and/or vehicle may exhibit any feature described with respect to the circuitry and/or any suitable feature described below with respect to any one of Figs.1 to 12. The methods as described herein are also implemented in some embodiments as a computer program causing a computer and/or a processor to perform the method, when being carried out on the computer and/or processor. 
In some embodiments, also a non-transitory computer-readable recording medium is provided that stores therein a computer program product, which, when executed by a processor, such as the processor described above, causes the methods described herein to be performed. Blackbox optimization problems are a very important class of optimization problems, because they often appear in real-world applications, e.g., in optimizing navigation or pathfinding, such as the optimal path of a warehouse robot or the optimal travel route for travelling to a destination or between multiple destinations, or optimizing the architecture of a deep neural network for a specific task or simulation-based optimization of physical systems or chemical reactions. Simulation-based refers to the underlying objective functions that are desired to be optimized only being accessible through running simulations at specific input points. Blackbox problems are typically hard to solve, because: 1) No assumptions about special properties of the objective function, such as about smoothness or convexity, can be made. 2) The objective functions are often expensive to evaluate, either requiring a lot of time or involving costly physical experiments. It has been recognized that in such scenarios, it is beneficial to use optimization methods that require as few function evaluations as possible. This is very challenging in practice. Typically, blackbox problems are solved with stochastic optimization algorithms that do not rely on exact gradients or derivatives, but rather make use of random sampling to search for the optimal solutions. Examples are random search (see for example Fig.1) or more advanced algorithms like: 1) Bayesian optimization algorithms, 2) Evolutionary algorithms or 3) Generative Flow Network (GFlowNet) based optimization algorithms that have been proposed recently. 
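Random search, the simplest of these stochastic algorithms, can be sketched as follows; the objective function here is an illustrative stand-in (loosely mimicking the two-peak shape of graph 40 in Fig.1), not a function from the disclosure:

```python
import math
import random

def blackbox(x):
    # Illustrative stand-in objective: a small peak near x = 0.2 and the
    # global maximum near x = 0.8 (assumption for this sketch).
    return 0.5 * math.exp(-40.0 * (x - 0.2) ** 2) + math.exp(-40.0 * (x - 0.8) ** 2)

def random_search(f, n_samples, rng):
    """Propose candidates uniformly, i.e., p(x) is constant and
    independent of f(x), and keep the best candidate seen so far."""
    best_x, best_f = None, float("-inf")
    for _ in range(n_samples):
        x = rng.uniform(0.0, 1.0)
        fx = f(x)
        if fx > best_f:
            best_x, best_f = x, fx
    return best_x, best_f
```

Because every candidate is equally likely, random search spends most evaluations far from the optimum, which is why the sampling strategies described below improve on it.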
All these algorithms either assume some regularities in the objective function (e.g., smoothness) or fit a model to the objective function, in order to efficiently find the optimal solution. Fig.1 shows an example of a one-dimensional blackbox optimization problem, i.e., optimizing an unknown scalar objective function that depends on a single scalar variable x, being solved with a stochastic optimization algorithm, namely random sampling, i.e., via random search, to search for the optimal solution. The horizontal axis illustrates values x. The vertical axis on the left indicates a corresponding function f(x) for values of x. The second vertical axis p(x) illustrates the probability density function that is used to propose new candidate values x. Fig.1 includes graph 40 indicating the values f(x). The graph 40 has a small peak at 42, a jump in values at 43, a local minimum at 45 and a second peak at 44, which is also the overall maximum. The dashed graph 46 indicates the corresponding values p(x). As the graph 46 is a horizontal line, it is indicated that the probability is uniform, that is, the probability is the same for all values x. In other words, the probability p(x) for values x is independent of f(x). Thus, Fig.1 shows a uniform sampling of values x for random search, wherein all values x have the same probability p(x). Thus, Fig.1 illustrates a uniform sampling of compositional objects x used for random search. In this example, the random samples all have the same probability p(x). However, a better solution to the blackbox optimization problem makes use of search algorithms that are based on a generative machine learning model as described below in more detail and illustrated in Figs.2, 3 and 4. Generative machine learning model The machine-learning model described in the embodiments below is based on a generative model. 
It is trained to generate samples x proportional to a probability density function (PDF), namely a log-linear function of the rank of the samples with respect to a given objective function f(x). Such a generative model may exhibit particular efficacy in stochastic optimization scenarios by enabling the generation of candidate samples with highly desirable statistics for the objective function. More specifically, samples that have a high rank with respect to a given objective function may be sampled exponentially more often than samples with a lower rank. The generative machine learning model may be identified by interacting with a trained model, i.e., by using it to generate candidate samples and by calculating rank-based (order-based) statistics of these samples, that are induced by a given objective function. The machine learning model may, for example, be implemented as a special GFlowNet (regarding GFlowNets see: Bengio, Yoshua, et al. "GFlowNet foundations." arXiv preprint arXiv:2111.09266 (2021)). As introduced by (Bengio et al., 2021b), Generative Flow Networks (GFlowNets) are a novel class of generative machine learning models that can sample compositional objects from a probability distribution that is proportional to a given reward, i.e., from p(x) ∝ R(x). That is, they can generate a diverse set of candidates with a probability proportional to a given reward R(x). Generative flow networks (GFlowNets) are a method for learning a stochastic policy for generating compositional objects x, such as graphs or strings, from a given unnormalized density by sequences of actions, where many possible action sequences may lead to the same object (see for example Malkin, Nikolay, et al. "Trajectory balance: Improved credit assignment in GFlowNets." Advances in Neural Information Processing Systems 35 (2022): 5955-5967), which is explained in more detail below. GFlowNets are a powerful tool for stochastic optimization, e.g., for building novel blackbox (combinatorial) solvers.
In fact, they have been successfully applied to problems like molecule design (Bengio et al., 2021a; Jain et al., 2022), robust scheduling (Zhang et al., 2023a) and graph combinatorial problems (Zhang et al., 2023b), delivering state-of-the-art performance. As discussed by (Deleu et al., 2022; Nishikawa-Toomey et al., 2022), GFlowNets are also useful for Bayesian structure learning and can be regarded as a unifying framework for many generative models (Zhang et al., 2022), such as for: 1) hierarchical VAEs (Ranganath et al., 2016), 2) normalizing flows (Dinh et al., 2014) and 3) diffusion models (Song et al., 2021). In the following, some definitions, following Section 3 of (Bengio et al., 2021b), are given. For a directed acyclic graph G = (S, A) with state space S and action space A, the vertices may be referred to as states and the edges as actions. s0 ∈ S may be the initial state, the only state with no incoming edges; and the set of terminal states X may be the states with no outgoing edges. We note that Bengio et al. (2021a) allows terminal states with outgoing edges. Such a difference can be dealt with by augmenting every such state by a new terminal state that is reached via a terminal action. A complete trajectory may be a sequence of transitions τ = (s0→s1→ ...→sn) going from the initial state s0 to a terminal state sn = x with (st→st+1) ∈ A for all 0 ≤ t ≤ n − 1. T may be the set of complete trajectories that we can construct by executing actions from the action space A. A trajectory flow may be a nonnegative function F : T → ℝ≥0. For any state s, the state flow may be defined as F(s) = ∑τ:s∈τ F(τ). That is, the trajectory flows F(τ) of all trajectories τ that include state s are summed. For any edge s→s′, the edge flow may be defined
as F(s→s′) = ∑τ:(s→s′)∈τ F(τ). The forward transition probability PF and the backward transition probability PB may be defined as PF(s′|s) ≔ F(s→s′)/F(s) and PB(s|s′) ≔ F(s→s′)/F(s′) for the consecutive states s, s′. GFlowNets sample candidate objects x, starting in the initial state s0 and by sequentially drawing actions from PF, that are applied to the current state, causing a state transition. This sequential sampling yields trajectories that finish in the terminal states, which correspond one to one to the (compositional) candidate objects. A normalizing factor (also often called partition function) Z is defined as Z = ∑τ∈T
F(τ). A nontrivial nonnegative reward function R : X → ℝ≥0 may be given on the set of terminal states. GFlowNets (Bengio et al., 2021a) may aim to approximate a Markovian flow F on the graph G such that F(x) = R(x) ∀x ∈ X. Eq. (1) In general, an objective optimizing for (Eq. (1)) cannot be minimized directly, because F(x) is a sum over all trajectories leading to x. Previously, several objectives – flow matching, detailed balance, trajectory balance, and subtrajectory balance – have been proposed. We list their names, parametrizations and constraints in Table 1. By Proposition 10 in Bengio et al. (2021b) for flow matching, Proposition 6 of Bengio et al. (2021b) for detailed balance, and Proposition 1 of Malkin et al. (2022) for trajectory balance, if the training policy has full support and the respective constraints are reached, the GFlowNet sampler does sample from the target distribution. In some embodiments, the parametrization of the OP GFlowNet may be based on the following:
Flow matching (FM): parametrization F(s→s′); constraint ∑(s′′→s)∈A F(s′′→s) = ∑(s→s′)∈A F(s→s′) for all interior states s. Detailed balance (DB): parametrization F(s), PF, PB; constraint F(s) PF(s′|s) = F(s′) PB(s|s′) for all edges s→s′. Trajectory balance (TB): parametrization Z, PF, PB; constraint Z ∏t=1..n PF(st|st−1) = R(x) ∏t=1..n PB(st−1|st) for all complete trajectories. Subtrajectory balance (SubTB): parametrization F(s), PF, PB; constraint F(sm1) ∏t=m1+1..m2 PF(st|st−1) = F(sm2) ∏t=m1+1..m2 PB(st−1|st) for all subtrajectories.
Table 1: Possible GFlowNet objectives. The complete trajectory τ may be defined by τ = (s0→s1→…→sn = x), and its subtrajectory by sm1 → … → sm2, 0 ≤ m1 < m2 ≤ n. For the ease of future discussion, we define some terminology here. Let TB ≔ {τ^(b) = (s0^(b) → ⋯ → sn^(b) = x^(b))}, b = 1, …, B, be a batch of trajectories of size B, and XB = {x^(b)}, b = 1, …, B, be the terminal states of the trajectory batch TB. Fig. 2 shows an example of a generative machine learning model, for example, the GFlowNet described above, that is trained to generate compositional objects x proportional to a probability density function (dashed line). Fig. 2 shows GFlowNet-based search algorithms sampling proportional to p(x) ∝ f(x) (solid line). In the context of GFlowNets this means to choose R(x) = f(x) (or, e.g., any exponentially scaled version of f(x)). Hence, the probability of compositional object x being produced by the generative model follows the function f(x). Sampling with p(x) ∝ f(x), i.e., proportional to the value of the objective function, yields many candidates in regions where f(x) is large, i.e., samples that maximize f(x) with high probability. GFlowNets for combinatorial stochastic optimization of blackbox functions have two problems though: 1) GFlowNets require an explicit formulation of the scalar reward R(x) that measures the global quality of an object x. Explicit refers to the ability to compute R(x) for any x ∈ X. However, this is sometimes not possible. Consider for example multi-objective optimization (MOO) problems with a vector-valued objective function. For such problems, the concept of Pareto dominance allows to define a partial order over X by comparing two objects. However, it is not possible, not having the actual solution of the MOO problem, to define a scalar function R(x) that induces a global order that is compatible with these pairwise comparisons. Hence, GFlowNets are not directly applicable to MOO problems and rely on additional ideas like linear scalarization (Jain et al., 2023). 2)
In order to identify high reward regions of f(x), GFlowNets may typically operate on an exponentially scaled reward, i.e., on R(x) = f(x)^β with a reward exponent β ≥ 1.
In the following, we call these methods GFlowNet-β. In other words, in order to prioritize the identification of high-reward options for x, conventional practice involves training them with rewards raised to a high power, such as f(x)^β with a large β. However, this requires a manual tuning of the parameter β. Further, GFlowNets can only be used if a scalar reward R(x) can be calculated explicitly. This is for example not easily possible for multi-objective optimization problems, where only partial orders can be defined. In practice, the optimal β
heavily depends on the geometric properties of f(x). Choosing a small β will hinder the exploitation of the maximal reward modes, since even a perfectly fitted GFlowNet will still have a high probability to sample outside the maximal modes. On the contrary, choosing a large β can hinder the exploration, since the GFlowNet will encounter a highly sparse reward function in the early training stages, which causes the training to collapse. Order Preserving GFlowNets The embodiments described below in more detail provide a novel training criterion for GFlowNets. GFlowNets trained with this novel training criterion are termed Order Preserving (OP) GFlowNets herein. Instead of being trained such that they sample proportional to a given reward (e.g., probability density function), only candidate comparisons may be defined that reflect the difference of quality of the objects within a given environment. The OP-GFlowNet then learns a proxy reward R^(x) that is compatible with these candidate comparisons. Hence, these models may learn to sample proportional to a proxy reward R^(x) that is compatible with a predefined partial order, thus eliminating: 1) the need for an explicit formulation of R(x) and 2) the choice of the exponential scaling parameter β. Moreover, the order-preserving reward may be "piecewise log-linear", and may efficiently assign high credit to the important substructures. The choice of the order-preserving reward may balance the exploration in the early training stages and the exploitation in the later training stages, by gradually sparsifying the reward function. The GFlowNet of some embodiments can be applied to a synthetic environment, for example, HyperGrid (Bengio et al., 2021a), and to real-world applications, for example, NATS-Bench (Dong et al., 2021), and molecular designs (Shen et al., 2023). Further, order-preserving GFlowNets can be directly applied to problems where an explicit formulation of R(x) is not possible, such as to MOO problems.
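For such MOO settings, the pairwise candidate comparisons can, for example, be derived from Pareto dominance. A minimal sketch, with illustrative vector-valued objectives (the numbers are assumptions, not part of the embodiments), of producing a 0 / 0.5 / 1 comparison label without ever defining a scalar f(x):

```python
def pareto_compare(fx, fy):
    """Partial order induced by Pareto dominance (maximization).
    Returns 1.0 if fx dominates fy, 0.0 if fy dominates fx,
    and 0.5 if the two candidates are incomparable or equal."""
    ge = all(a >= b for a, b in zip(fx, fy))
    le = all(a <= b for a, b in zip(fx, fy))
    if ge and not le:
        return 1.0
    if le and not ge:
        return 0.0
    return 0.5  # incomparable/equal: no scalar reward is needed to rank them

# Two illustrative candidates with two objectives each.
label = pareto_compare((0.9, 0.2), (0.4, 0.1))  # first Pareto-dominates
tie = pareto_compare((0.9, 0.2), (0.4, 0.5))    # incomparable pair
```

Such labels can serve as the candidate comparisons that an order-preserving model is trained on, since only the relative quality of a pair is required.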
Concerning real-world problems like Neural Architecture Search (NAS) and molecular design, OP-GFlowNets can discover maximal reward modes more efficiently than standard GFlowNets, while also yielding a faster mixing. Also, the probability distribution associated with R^(x) decays exponentially with the rank of a candidate object x, which is very desirable for stochastic optimization. The scenario where the GFlowNet is used to sample argmax(x∈X) f(x) is considered. A typical way to achieve this is by training a GFlowNet with reward exponent
β, i.e., with R(x) = f(x)^β.
However, the choice of β is dependent on the target distribution. Instead, the use of an order-preserving reward R^(x) to improve the sample efficiency of the GFlowNet is proposed. Thus, some embodiments use a rank-preserving reward R^(x) for sampling. Definition In some embodiments, instead of giving out R^(x) explicitly, R^(x) is assumed to keep the pairwise ranking information of f(x), for some pairs of terminal states (x, x′). This may be interpreted as a 0-1 classification problem. For the terminal states pair (x, x′), the ground truth label, which may be 1 if f(x′) > f(x), 0 if f(x′) < f(x), and 0.5 if f(x′) = f(x), is constructed. Assuming R^(x; θ) is the terminal flow of a GFlowNet parametrized by θ, in the prediction model, the label of (x, x′) may be set to 1 with probability
R^(x′; θ) / (R^(x; θ) + R^(x′; θ)) and to 0 with probability R^(x; θ) / (R^(x; θ) + R^(x′; θ)),
respectively. The cross-entropy loss may be used in the training. To be more specific, the order-preserving loss ℒOP(x, x′; f, R^) for the state pair x, x′, the scalar objective function f and R^ may be defined as ℒOP(x, x′; f, R^) ≔ CE( 1(f(x′) > f(x)) + 0.5·1(f(x′) = f(x)), R^(x′)/(R^(x) + R^(x′)) ), where CE(t, y) = −(t log y + (1 − t) log(1 − y)) is the cross-entropy loss between the target t and the model output y. Minimizing ℒOP with respect to R^ may amount to enforcing R^ to induce the same ranking (order) on X as the objective function f. The order-preserving (OP) loss for the full set of terminal states X may be ℒOP(X; R^(·; θ)) ≔ E(x,x′∈X) ℒOP(x, x′; f, R^) Eq. (2) where f(·) may be the true target reward, and the expectation over (x, x′) may be taken uniformly over terminal state pairs in X. In the parametrization of the trajectory balance (TB) objective, the terminal flow R^(x; θ) may be computed from the auxiliary Z, PF and PB along a specific training trajectory τ, i.e.,
R^(x; θ) = Z ∏t=1..n PF(st|st−1; θ) / PB(st−1|st; θ). Eq. (3)
The TB-OP objective may be obtained by plugging equation (3) into equation (2), i.e.,
ℒTB-OP(T; θ) ≔ E(τ,τ′∈T) ℒOP(x, x′; f, R^(·; θ)),
where T is the set of trajectories that can be constructed, reaching all terminal states in X. Note that OP GFlowNets can also be trained with other training criteria, such as with detailed balance (DB) or sub-trajectory balance (SubTB), with only minor modifications. The above is an example of an order-preserving criterion, according to some of the embodiments, at the example of a GFlowNet parametrized by Z (the partition function / total flow), PF (forward transition probability) and PB (backward transition probability) using the trajectory balance (TB) criterion. This may yield a GFlowNet with special sample statistics. Based on these sample statistics, an easy breakdown analysis may be performed (see next section) to see if the model is implemented in a product. In other words, the machine learning model according to some embodiments may be implemented as a modified version of the established GFlowNet, where an exact target reward for terminal states is not given, but rather their ranking is fixed, which is called order-preserving (OP) GFlowNet. That is, the OP GFlowNet may be an implementation of the proposed machine learning model for candidate generation, i.e., it may sample proportional to a PDF that is a log-linear function of the ranking induced by f(x). Fig. 3 illustrates log-linear based sampling. The horizontal axis indicates values x. The vertical axis on the left indicates the corresponding function value f(x) for compositional objects x. The vertical axis on the right indicates the probability density function, that is the log-linear function of the rank of samples with respect to a given objective function, log p(x) for values of x. Fig. 3 includes a graph 51 indicating the values of f(x). The dashed projections 54 projecting values to the second vertical axis log p(x) indicate the corresponding values log p(x) of the probability density function. The overall shape of graph 51 corresponds to the shape of graph 50a of Fig. 2.
That is, also graph 51 has a first small peak, followed by a sudden jump in values, followed by a local minimum, followed by another peak, the overall maximum. Fig. 3, however, illustrates that the values f(x), which indicate an unknown objective function, are proportional to the values log p(x) indicating the log-linear function of the rank of samples with respect to a given objective function, which is further explained in Fig. 4. At 51a, 51b and 51c, for values xa, xb and xc, the values for log p(x) are illustrated slightly offset, due to the logarithmic nature of the values log p(x), compared to the values for f(x). This is an example of a generative model based on log-linear sampling. Such a generative model as illustrated in Fig. 3 exhibits particular efficacy in stochastic optimization scenarios, e.g., compared to the example of random search (Fig. 1) or other generative models (Fig. 2), which is illustrated in Fig. 4. Fig. 4 illustrates a log-linear order-preserving generative model. Fig. 4 illustrates a support set X of multiple compositional objects. The support set X is sorted by the function f(x), which is illustrated in sub-figure 60. The horizontal axis of sub-figure 60 illustrates the values x of the objects of support set X. The vertical axis of sub-figure 60 illustrates the function f(x). Support set X includes more objects than are visualized, which is illustrated by the horizontal axis including values x0 and x1 and continuing at xi and xi+1. That is, between the values x1 and xi the figure continues in a similar fashion, and the same is true for the values after xi+1. The order of the objects of the support set X, as illustrated in sub-figure 60, based on rank, is preserved in the log-linear order-preserving model as illustrated in sub-figure 70. Sub-figure 70 shows a horizontal axis indicating the values x of the objects of support set X. The vertical axis of sub-figure 70 indicates the probability density function p(x).
The generative machine learning model, enforced by the training criterion given above, is trained to generate samples according to a certain probability distribution p(x). More specifically, as shown in Fig. 4, let f: X → ℝ be a function that induces a ranking of the values x ∈ X. Further, let xi and xi+1 be two adjacent samples after sorting with respect to f(x), i.e., f(xi) ≤ f(xi+1) and ∄x ∈ X: f(xi) ≤ f(x) ≤ f(xi+1). The model learns to sample from the PDF p(x), which has the following property:
log p(xi+1) − log p(xi) = β, with β > 0,
i.e., the PDF may be log-linear with respect to k, which is the ranking of x that is induced by f. Depending on the implementation, β can be either chosen by hand or learned. Such a ranking-based generative model has a very desirable property for stochastic optimization. If trained properly, the model generates a sample x with f(x) ≥ f(x′) exponentially more often than the sample x′. The factor β controls how fast p(x) grows with the ranking that is induced on x by f. Fig. 4 therefore illustrates that samples that have a high rank with respect to a given objective function are sampled exponentially more often than samples with a lower rank. In other words, high-rank x are sampled with high probability and low-rank x are sampled with low probability. Fig. 5 schematically illustrates this training of the order-preserving GFlowNet. The Order Preserving (OP) GFlowNet generates compositional objects x, i.e., objects that can be constructed by an iterative process. They may define states st that are associated with partially constructed objects. And they may define actions that cause a state transition and modify the objects. Examples of compositional objects may be 1) Graphs, 2) Categorical data, 3) Integer data, 4) Strings, 5) Images, etc. The initial state may be defined by the user. For example, for integer data, the initial state s0 = 0 may be defined. The set of possible actions may be defined by the user. For example, for integer data the action set A = {+1, −1, terminate} may be defined. The generation of an object may be finished once the terminate action was sampled, i.e., the terminal state is arrived at. A unique property of the OP GFlowNet of Fig. 5 is that it samples objects proportional to the ranking of the objects, that is induced by the objective function f(x). Let X be the set of possible compositional objects and |X| the number of compositional objects in X. Set X includes possible compositional objects x1, x2 and x3. From the set X of compositional objects, whose number corresponds to |X|, e.g., |X| = 3 in the example illustrated in Fig. 5, objects x1 to x3 are sampled proportional to the ranking of the objects, that is induced by the objective function f(x). That is, the objective function f(x) induces a ranking on the values x, that is denoted with k. The ranking may be an integer number, i.e., sorting the elements in X with respect to f yields xk. The GFlowNet learns to generate samples x with a probability that is exponential in k. Reward axis 33 illustrates how high the objective function f(x) for each compositional object is. Thus, f(x2) is largest, f(x3) is lowest and f(x1) is in between. Corresponding to the value of the objective function, the induced ranking is determined. Therefore, the compositional object x2 has an induced ranking k of 2, which corresponds to the highest objective function value f(x2), compositional object x1 has an induced ranking k of 1, which corresponds to the medium objective function value f(x1), and compositional object x3 has an induced ranking k of 0, which corresponds to the lowest objective function value f(x3). The ranking is an integer number, i.e., sorting the elements in X with respect to f(x) yields xk. The order-preserving GFlowNet learns to generate samples x with a probability that is exponential in k. The probability sampling is based on:
p(xk) = e^(βk) / Z^,
where p(xk) is the probability of sampling the ranked compositional object xk, k = 0, 1, …, |X|−1 is the induced ranking, |X| is the number of compositional objects in X, β is a learned positive factor that approaches ∞ with enough model capacity and training time, and Z^ is the learned partition function. The term "order-preserving" relates to the fact that the reward (or terminal flow) R^(x) that the OP GFlowNet learns to assign to a terminal state with value sn = x induces the same order (ranking) as the original objective function f(x) that we want to maximize. Note that a standard GFlowNet explicitly fixes the terminal flow (reward) R(x) = f^β(x), with β ≥ 1. Hence, the standard GFlowNet has to learn all properties of the objective function. Note that this also yields a model that is order-preserving, since the order that is induced by the learned terminal flow R(x) is the same as the order that is induced by the objective function f(x). However, for OP GFlowNets, the order-preserving property is the only property that is retained from f(x), and all other information is lost. In other words, the relationship between the compositional objects, their induced ranking k and the corresponding sampling probability is learned by the order-preserving GFlowNet. The order-preserving GFlowNet is trained in an active learning setup, for example, the active learning setup of Fig. 7. Fig. 6 illustrates a state-action graph of an order-preserving GFlowNet and the computation of the training loss. In Fig. 6, four trajectories through state-action graph 10 are visualized that lead to three different terminal states 12. All trajectories flow from the initial state s_0 via intermediate states 11 to a terminal state 12. The first trajectory flows from initial state s_0 to intermediate state s_1 to the intermediate state s_3 to the terminal state x_1. The second trajectory flows from initial state s_0 to intermediate state s_4 to intermediate state s_5 to terminal state x_2.
The third trajectory flows from initial state s_0 to intermediate state s_4 to intermediate state s_5 to intermediate state s_3 to the terminal state x_1. The fourth trajectory flows from initial state s_0 to intermediate state s_6 to intermediate state s_7 to terminal state x_3. An objective function value f(x) that is related to the terminal state 12 a trajectory ends in corresponds to each trajectory. The highest reward f(x_3) corresponds to terminal state x_3, which is the end of the fourth trajectory. A medium reward f(x_1) corresponds to terminal state x_1, which is the end of the first and third trajectories. The lowest reward f(x_2) corresponds to the terminal state x_2, which corresponds to the end of the second trajectory. The trajectories are grouped into pairs. For each pair, the cross-entropy (CE) loss is calculated according to the following:
ℒOP(x, x′; f, R^) = CE( 1(f(x) > f(x′)), R^(x)/(R^(x) + R^(x′)) ),
where ℒOP is the order-preserving loss, CE is the cross-entropy, with CE(t, y) = −(t log y + (1 − t) log(1 − y)), x and x′ are a terminal state pair, f(x) is the objective function we want to maximize, R^(x) is the terminal flow the OP GFlowNet assigns to a trajectory that ends in terminal state x, and 1(f(x) > f(x′)) is an indicator function, which is 1 if the condition in the argument is fulfilled and 0 otherwise. In the example of Fig. 6, the terminal state pairs are (x_1, x_2), (x_1, x_3), (x_2, x_3). Furthermore, the flow is balanced on each trajectory, and the trajectory balance (TB) is based on the following: Z ∏t=1..n PF(st|st−1) = R^(x) ∏t=1..n PB(st−1|st) Eq. (4) where Z, PF, PB are parametric functions, which essentially define the OP GFlowNet model, as introduced by Malkin, Nikolay, et al. "Trajectory balance: Improved credit assignment in GFlowNets." Advances in Neural Information Processing Systems 35 (2022): 5955-5967, R^(x) is the reward function, t is a state index on the trajectory τ = (s0→s1→ ...→sn), st is a state at t, and st−1 is a state at t−1. PseudoCode The algorithm of the OP GFlowNet may be summarized in pseudo-code as follows:
[Pseudo-code of the OP GFlowNet training algorithm: sample a batch of trajectories, compute the order-preserving loss over the terminal state pairs, and update the parameters Z, PF, PB by gradient descent.]
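As one possible reading of the order-preserving loss that drives this training, a minimal numerical sketch is given below. The log-space parametrization of the terminal flow and the example values are assumptions for illustration only, not the claimed implementation:

```python
import math

def op_loss(fx, fx_prime, log_r, log_r_prime):
    """Order-preserving loss for a terminal-state pair (x, x'):
    cross-entropy between the target t = 1(f(x) > f(x')) (0.5 on ties)
    and the model output y = R^(x) / (R^(x) + R^(x'))."""
    t = 1.0 if fx > fx_prime else (0.0 if fx < fx_prime else 0.5)
    # With log-rewards, y = R^(x)/(R^(x)+R^(x')) equals
    # sigmoid(log_r - log_r_prime), which is numerically stable.
    y = 1.0 / (1.0 + math.exp(-(log_r - log_r_prime)))
    return -(t * math.log(y) + (1.0 - t) * math.log(1.0 - y))

# A learned terminal flow that orders (x, x') like f gives a low loss;
# an inverted order gives a high loss, pushing R^ to preserve the ranking.
good = op_loss(2.0, 1.0, log_r=1.0, log_r_prime=-1.0)   # correct order
bad = op_loss(2.0, 1.0, log_r=-1.0, log_r_prime=1.0)    # inverted order
```

Minimizing this pairwise loss over many sampled terminal-state pairs is what enforces the order-preserving property of the learned reward.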
Fig. 7 illustrates an active learning setup for training an order-preserving GFlowNet. The sampler 3 receives input from environment 2. Sampler 3 outputs initial dataset 5 in a random initialization. Initial dataset 5 is pushed to replay buffer 7. Replay buffer 7 outputs offline batch 8 in an offline selection. Offline batch 8 as well as online batch 6 are used as input to generate hybrid batch 9, which is used for updating sampler 3. Sampler 3 is updated with hybrid batch 9 and a new round begins. If the next round does not include exploration, sampler 3 outputs evaluation batch 4. Therefore, every round without exploration leads to evaluation batch 4. If the next round includes exploration, the sampler outputs online batch 6, which is used as input for generating hybrid batch 9 for updating sampler 3. Online batch 6 is also pushed to the replay buffer 7 to generate offline batch 8. Every round with exploration leads to online batch 6. Sampler 3 corresponds to the OP GFlowNet. Sampler 3 is randomly initialized and used to sample initial dataset 5 of training trajectories. A trajectory is a state-action sequence that leads to a terminal state, as described in Fig. 6. A ranking loss is used to compare trajectories based on the observed reward R(x), as explained in more detail in Fig. 6. The parameters of the OP GFlowNet are updated to minimize the ranking loss. The currently observed trajectories are added to replay buffer 7, such that they can be used in future training runs. The process is started from the beginning by sampling from the updated OP GFlowNet. Theoretical Analysis It may be assumed that R^(x) is upper bounded by 1. The log R^(xi) may be piecewise linear with respect to the subscript i. In the special case where the R(xi) are mutually different, log R^(xi) may be an arithmetic progression. Therefore, this may be one property of the sample statistics of the GFlowNet according to some embodiments.
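The log-linear sample statistics described in this analysis can be checked with a short numerical sketch; the support-set size and the value of β are illustrative assumptions:

```python
import math

def rank_log_linear_pdf(num_objects, beta):
    """p(x_k) = exp(beta * k) / Z^ for ranks k = 0, ..., |X|-1, so that
    log p(x_k) is an arithmetic progression with common difference beta."""
    weights = [math.exp(beta * k) for k in range(num_objects)]
    z = sum(weights)  # plays the role of the learned partition function Z^
    return [w / z for w in weights]

# Illustrative support set of 5 ranked objects with beta = 1: the
# top-ranked object is sampled exponentially more often than the lowest.
p = rank_log_linear_pdf(num_objects=5, beta=1.0)
gaps = [math.log(p[k + 1]) - math.log(p[k]) for k in range(4)]
```

The constant log-probability gap between adjacent ranks is exactly the arithmetic-progression property of log R^(xi) stated above, and observing such statistics in generated samples is what enables the breakdown analysis.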
Thus, the probability of observing x may be a log-linear function of the rank of x (order of x) with respect to R(x), where R(x) may be the objective function or reward. This may be used for a breakdown analysis of products. Example (Mutually different reward). For X = {x0, x1, …, xn}, assume that the reward R(x) may be known, and R(xi) < R(xj), 0 ≤ i < j ≤ n. The rank-preserving reward R^(x), taking values in [1/γ, 1], may be defined by minimizing the order-preserving loss for neighboring pairs ℒOP^near:
ℒOP^near(R^) ≔ ∑i=0..n−1 ℒOP(xi, xi+1; R, R^).
We have
R^(xi) = γ^((i−n)/n) for i = 0, 1, …, n, i.e., the log-rewards log R^(xi) form an arithmetic progression from −log γ to 0.
We remark that in practice, γ may not be a fixed value. Therefore, minimizing ℒOP^near with variable γ drives it to infinity. Combined with Proposition 1, we claim that the rank-preserving loss may gradually sparsify the learned reward R^(x) on mutually different true rewards R(x) during training. We also consider the more general setting, where there exist i ≠ j such that R(xi) = R(xj). We give an informal proposition in Proposition 2. Similar considerations may also be done for not mutually different rewards. OP GFlowNets may be used with all parametrizations and training objectives that are given in Table 1, or with modifications thereof. Evaluation Maximal Reward Modes In previous works, GFlowNets have been evaluated in their ability to learn to match the target distribution. For example, Madan et al. (2022); Nica et al. (2022) evaluate GFlowNets by the Spearman correlation between log p(x) and R(x) on held-out data. However, according to some of the present embodiments, a GFlowNet may learn to sample the terminal states proportional to the proxy reward R^ instead of R(x), and R^ may be unknown a priori. Therefore, the metrics measuring closeness to the target distribution of R(x) may not be meaningful in some of our order-preserving embodiments. Here, we focus on evaluating the GFlowNet's ability to discover maximal reward modes. R^ as a Proxy When evaluating f(x) is costly, R^ may be used as a proxy. If we want to sample k terminal states with maximal rewards, we can first sample K ≫ k terminal states, and pick the states with the top-k R^. Then, we need only evaluate f(x) on k instead of K terminal states. We may define the ratio of boosting to be rboost = K/k. For GFlowNet objectives that parametrize F(s), we can directly let R^(x) = F(x), x ∈ X. For the TB objective, we need to use Equation (3) to approximate R^. Since the cost of evaluating R^ is also non-negligible, we may only adopt this strategy when obtaining f(x) directly is significantly more difficult. For example, in the neural architecture search environment, evaluating f(x) requires training a network to completion to get the test accuracy. The machine learning model may be a tool in an optimization toolbox. The user can provide some training set X =
{x1, x2, …, xN}.
The training set comprises a set of samples x1, …, xN. The samples x1, …, xN are ordered according to the function f(x). Based on this training set X, the tool may propose a set of new candidates Xnew = {x1new, x2new, …, xMnew}, for which the user can evaluate the objective function, i.e., {f(x1new), f(x2new), …, f(xMnew)} may need to be computed. These new function values can be used to extend X, such that the optimization algorithm can make more refined proposals in the future. Such an optimization algorithm can be implemented as part of a toolbox that is accessible through an API and either runs locally or as a cloud service. In both cases, the inner mechanisms of how new candidates are generated are hidden from the user. However, as discussed in the next section, one can still test if the proposed machine learning model is used as part of the optimization algorithm only through interaction and observation. Path planning in autonomous systems Some embodiments may relate to a path planning algorithm in autonomous systems. Often the quality of a path in an environment is measured using a reward function. The path planning algorithm may try to maximize this reward by taking actions to navigate through the environment. The order-based generative model may be used to sample paths that maximize such a reward with high probability. An example is path planning of an agent in an environment, e.g., a robot that navigates around obstacles and needs to find an efficient route. In terms of OP GFlowNets, the routes through the environment are the trajectories. The states are the positions in the environment that have been visited by the robot, i.e., 2D or 3D coordinates. The actions are changes of position (i.e., step direction and length). The objects x are the terminal positions of the robot within the environment. Depending on the terminal position, the robot will receive a reward that measures the quality of the trajectory.
It could consist of the distance to the target position that we want to reach. By observing how often the robot ends at which end position, one can judge if an OP GFlowNet is implemented for path planning, i.e., the probability of ending up at a specific position should be the given function of its ranking with respect to the target distance. The applications of an OP GFlowNet are numerous. They can be used to solve any combinatorial optimization problem. Fig. 8 illustrates a robot navigating the optimal path based on the OP GFlowNet. The environment 2 includes multiple paths 81 that a robot 80 may take and multiple buildings 85 that the robot may pass on the way. The underlying objective function of the paths 81 and buildings 85 is not known. To navigate to a destination (not shown) in an optimal manner, for example, one that is most efficient and passes as few buildings as possible, the robot calculates the optimal path 86 according to an embodiment of the OP GFlowNet as described above, for example, in Figs. 3 to 7. In this sense, the optimal path 86 is a compositional object generated, based on the OP GFlowNet, by the circuitry included in the robot 80. Fig. 9 illustrates an example of a robot implementing a circuitry according to an embodiment of the present technique. The robot 80 of Fig. 9 corresponds to robot 80 of Fig. 8. Robot 80 is a robot with a humanoid upper body and a moving mechanism using wheels. A flat spherical head 92 is provided on the body portion 91. Two cameras 90 are provided on the front surface of the head 92 in a shape imitating the human eye. Manipulators 93-1 and 93-2, which are manipulators with multiple degrees of freedom, are provided at the upper ends of the body portion 91. Hand portions 94-1 and 94-2 are provided at the tips of the manipulator portions 93-1 and 93-2, respectively. The robot 80 can grasp an object by the hand portions 94-1 and 94-2.
At the lower end of the body portion 91, a dolly-shaped moving body portion 95 is provided as a moving mechanism of the robot 80. Robot 80 can move by rotating the wheels provided on the left and right sides of the moving body portion 95 and by changing the direction of the wheels. As described above, robot 80 is a so-called mobile manipulator capable of freely lifting and transporting an object while the object is being grasped by either one or both of the hand portions 94-1, 94-2. Instead of the dual-arm robot as shown in Fig.9, the robot 80 may be configured as a single-arm robot (with only one of the manipulators 93-1, 93-2). Further, instead of the carriage-shaped moving body portion 95, a leg portion may be provided as the moving mechanism. In this case, the body portion 91 is provided on the leg portion. Robot 80 includes circuitry for generating a compositional object according to an embodiment of the OP GFlowNet as described above, for example, in Figs.3 to 7. For example, robot 80 may determine the optimal path 86 as described in Fig.8. Robot 80 may, for example, determine an optimal path in a warehouse.

Implementation

Fig.10 schematically illustrates an embodiment of an electronic device comprising circuitry for generating a compositional object. The electronic device 100 may be, for example, a robot, e.g., a path finding robot, a navigation device, e.g., a navigation device in a vehicle, a terminal computer, a smartphone or another mobile device, etc. The electronic device 100 includes a CPU 101 as processor. Additionally or alternatively, other computation hardware, such as a GPU, TPU, DSP etc., may be used. The electronic device 100 further includes camera(s) 106, microphone(s) 107 and loudspeaker(s) 108 that are connected to the processor 101. The processor 101 may, for example, implement the generation of a compositional object, such as the generation of a path (e.g., an optimal path) that is most efficient. 
The processor 101 may, for example, be configured to implement the machine learning model for generating the compositional object, such as the OP GFlowNet as explained in Figs.3 to 8. The microphone 107 may be configured to receive any kind of audio signal. The camera 106 may be one or more cameras, such as an RGB camera, an IR camera, a ToF camera (for example, an iToF or dToF camera), an event-based camera or the like. The electronic device 100 further includes a user interface 109 that is connected to the processor 101. This user interface 109 acts as a man-machine interface and enables a dialogue between a user and the electronic device 100. For example, a user may make configurations to the system using this user interface 109. The electronic device 100 further includes a Bluetooth interface 104 and a WLAN interface 105. These units 104, 105 act as I/O interfaces for data communication with external devices. An Ethernet interface may also be provided. For example, additional loudspeakers, microphones, and cameras, e.g., a ToF camera, an RGB camera or an event-based camera with WLAN or Bluetooth connection, may be coupled to the processor 101 via these interfaces 104 and 105. The electronic device 100 further includes a data storage 102 and a data memory 103 (here a RAM). The data memory 103 is arranged to temporarily store or cache data or computer instructions for processing by the processor 101. The data storage 102 is arranged as a long-term storage, which may be accessed via the processor 101. The connection between the processor 101 and the camera 106 may include a camera serial interface (CSI). The CSI is an interface between the camera 106 and the host processor 101. Thus, control signals and data may be sent from the processor 101 to the camera 106 as well as from the camera 106 to the processor 101. Furthermore, the electronic device 100 includes an artificial intelligence (AI) processor 110. 
The AI processor 110 may include a graphics processing unit (GPU) and/or a tensor processing unit (TPU). The AI processor 110 may be configured to execute an AI model (e.g., an artificial neural network), for example, the machine learning model for generating the compositional object, such as the OP GFlowNet as explained in Figs.3 to 8.

Neural-Network Architecture Search

Some embodiments may relate to a neural-network architecture search. The quality of a neural network model can also be measured using a reward function. The order-based generative model may try to maximize this reward. That is, the order-based generative model may be used to sample neural-network architectures that maximize such a reward with high probability.

Examples

In the following, the neural architecture search environment NATS-Bench (Dong et al., 2021) is described, which includes three datasets: CIFAR-10, CIFAR-100 and ImageNet-16-120. For example, the topology search space in NATS-Bench may be chosen, i.e., the densely connected DAG of 4 nodes and the operation set of 5 representative candidates. The representative operations may be 1) zeroize, 2) skip connection, 3) 1-by-1 convolution, 4) 3-by-3 convolution, and 5) 3-by-3 average pooling layer. Each architecture can be uniquely determined by a sequence
(a_{1→2}, a_{1→3}, a_{2→3}, a_{1→4}, a_{2→4}, a_{3→4})
of length 6, where each element a_{i→j} indicates the operation from node i to node j. Therefore, the neural architecture search can be regarded as an order-agnostic sequence generation problem, where the reward of each sequence is determined by the accuracy of the corresponding architecture.

AutoRegressive MDP Design

The GFlowNet (S, A, X) may be used to tackle the problem. Each state may be a sequence of operations of length 6, with possibly empty positions. The initial state may be the empty sequence, and terminal states may be full sequences. Each forward action may fill an empty position in a non-terminal state with some operation, and each backward action may empty some non-empty position.

Reward Design

For x ∈ X, the reward R_t(x)
may be defined as
the test accuracy of x's corresponding architecture with the weights at the t-th epoch during its standard training pipeline. To measure the cost of computing the reward R_t(x), the simulated train and test (T&T) time may be introduced, which is defined as the time needed to train the architecture to epoch t and then evaluate its test accuracy. NATS-Bench provides APIs for R_t(x) and its T&T time for t ≤ 200. Following Dong et al. (2021), when training the GFlowNet, the test accuracy at epoch 12 may be used as the reward; when evaluating the candidates, the test accuracy at epoch 200 may be used as the reward. It is noted that the training reward R_12 is a proxy for the validation reward R_200 with lower T&T time, and rank-preserving methods only preserve the rank of R_12, ignoring the possibly unnecessary information: the exact value of R_12.

Experimental Details of Example 1

Firstly, the focus lies on training the GFlowNet in a multi-trial sampling procedure. It may be beneficial to use a randomly generated initial dataset and to set its size to 64. In each active training round, 10 new trajectories may be generated using the current training policy, and the GFlowNet may be updated on all the collected trajectories. Optionally, backward trajectory augmentation is used: 20 terminal states are sampled from the replay buffer, and 20 trajectories per terminal state are generated using the current backward policy to update the GFlowNet. To monitor the training process, in each training round, the architecture which has the highest training reward and the accumulated T&T time needed to compute all the training rewards up to that round are recorded. The training may be terminated when the accumulated T&T time reaches some threshold, which may be 50000, 100000 and 200000 seconds for CIFAR-10, CIFAR-100 and ImageNet-16-120, respectively. 
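The autoregressive MDP described above can be sketched as follows, with a uniform random policy standing in for a trained GFlowNet; the operation names follow the NATS-Bench topology space listed earlier:

```python
import random

OPS = ["zeroize", "skip_connect", "conv_1x1", "conv_3x3", "avg_pool_3x3"]
SEQ_LEN = 6  # one operation per edge of the densely connected 4-node DAG
EMPTY = None

def forward_actions(state):
    """Forward actions fill one empty position with one of the 5 operations."""
    return [(i, op) for i, slot in enumerate(state) if slot is EMPTY for op in OPS]

def apply_action(state, action):
    i, op = action
    new_state = list(state)
    new_state[i] = op
    return tuple(new_state)

def is_terminal(state):
    """Terminal states are full sequences (no empty positions)."""
    return all(slot is not EMPTY for slot in state)

# Sample one trajectory with a uniform (untrained) forward policy.
random.seed(0)
state = (EMPTY,) * SEQ_LEN  # initial state: the empty sequence
while not is_terminal(state):
    state = apply_action(state, random.choice(forward_actions(state)))
```

A backward action would correspondingly set one non-empty position back to `EMPTY`; training would bias the forward policy toward sequences whose architectures achieve high R_12.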
Random search (RANDOM) may be adopted as a baseline, and the results were compared against previous multi-trial sampling methods: 1) evolutionary strategies, e.g., REA (Real et al., 2019); 2) reinforcement learning (RL)-based methods, e.g., REINFORCE (Williams, 1992); 3) HPO methods, e.g., BOHB (Falkner et al., 2018). Fig.11 illustrates the summarized results of the experiment of an example neural network architecture search. The results of the experiment of Example 1 described above are summarized in Fig.11, which shows a multi-trial training of a GFlowNet sampler, wherein the GFlowNet methods are TB, TB-RP, TB-RP-KL and TB-RP-KL-AUG, and the previous multi-trial algorithms are REA, BOHB and REINFORCE. The accuracy (at epoch 12 and 200) of the recorded sequence of architectures is plotted with respect to the recorded sequence of accumulated T&T time. The experimental results are averaged over 200 random seeds, and the mean is plotted. The results on trajectory balance (TB) and its rank-preserving variants (TB-RP) are reported. Fig.11 shows that order-preserving GFlowNets consistently improve over the previous baselines in both training and validation reward, especially in the early training stages. Besides, backward KL regularization (-KL) and backward trajectory augmentation (-AUG) also contribute positively to the sampling efficiency. Fig.11 also illustrates the best test accuracy at epoch 12 and 200 of the random baseline (Random). It is noted that the first 64 samples of the TB-type methods are generated by a random policy. The performance jump in the TB-type methods' curves indicates the start of the training.

Experimental Details of Example 2

Once a trained GFlowNet sampler is available, the learned order-preserving reward may be used as a proxy to further boost the sampling efficiency. The following experimental settings may be adopted. 
The (unboosted) GFlowNet samplers may be obtained by training on a fixed dataset, i.e., the checkpoint of the sampler after the first round of the previous multi-trial training. The sampler's performance may be measured by sequentially generating samples, while the highest validation accuracy obtained so far is recorded. Each algorithm's sample efficiency gain r_gain is plotted, which indicates that the baseline (unboosted) sampler takes r_gain times as many samples as that algorithm to reach a target accuracy. The procedure is repeated 100 times, and the mean is plotted in Fig.12. Fig.12 illustrates the summarized results of an experiment of an example neural network architecture search, namely a boosting of a GFlowNet sampler. The results of the experiment of Example 2 described above are summarized in Fig.12. To boost the sampler, the best terminal state (in terms of the learned proxy reward) out of r_boost = 1, 2, 8, 32, 128 candidate states is selected, where r_boost = 1 denotes the unboosted sampler. The highest test accuracy observed so far in the 100 samples and the performance gain w.r.t. each target accuracy are plotted. It is found that setting r_boost ≈ 8 yields up to a 300% gain. An electronic device including a user interface (e.g., electronic device 100 of Fig.10) may be used to search for an optimized neural-network architecture. Thus, the electronic device may comprise circuitry configured to perform the neural network architecture search.

Service

Furthermore, an electronic device (e.g., electronic device 100 of Fig.10 or robot 80 of Figs.8 and 9), with or without the circuitry configured to run the order-preserving generative model as described above, may be a node connected to a network with one or more other nodes; one of the other nodes may, for example, be a server or the like (e.g., a cloud service). 
The server may include circuitry configured to run the order-based generative model with any one of the features described above (e.g., the neural network architecture search). Thus, the result of the order-preserving generative model run may be transmitted from the server to the electronic device.

References

Emmanuel Bengio, Moksh Jain, Maksym Korablyov, Doina Precup, and Yoshua Bengio. Flow network based generative models for non-iterative diverse candidate generation. Neural Information Processing Systems (NeurIPS), 2021a.
Yoshua Bengio, Salem Lahlou, Tristan Deleu, Edward Hu, Mo Tiwari, and Emmanuel Bengio. GFlowNet foundations. arXiv preprint arXiv:2111.09266, 2021b.
Tristan Deleu, António Góis, Chris Emezue, Mansi Rankawat, Simon Lacoste-Julien, Stefan Bauer, and Yoshua Bengio. Bayesian structure learning with generative flow networks. In Uncertainty in Artificial Intelligence, pp. 518–528. PMLR, 2022.
Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.
Xuanyi Dong, Lu Liu, Katarzyna Musial, and Bogdan Gabrys. NATS-Bench: Benchmarking NAS algorithms for architecture topology and size. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(7):3634–3646, 2021.
Stefan Falkner, Aaron Klein, and Frank Hutter. BOHB: Robust and efficient hyperparameter optimization at scale. In International Conference on Machine Learning, pp. 1437–1446. PMLR, 2018.
Moksh Jain, Emmanuel Bengio, Alex Hernandez-Garcia, Jarrid Rector-Brooks, Bonaventure F. P. Dossou, Chanakya Ekbote, Jie Fu, Tianyu Zhang, Micheal Kilgour, Dinghuai Zhang, Lena Simine, Payel Das, and Yoshua Bengio. Biological sequence design with GFlowNets. International Conference on Machine Learning (ICML), 2022.
Moksh Jain, Sharath Chandra Raparthy, Alex Hernández-Garcia, Jarrid Rector-Brooks, Yoshua Bengio, Santiago Miret, and Emmanuel Bengio. Multi-objective GFlowNets. 
In International Conference on Machine Learning, pp. 14631–14653. PMLR, 2023.
Kanika Madan, Jarrid Rector-Brooks, Maksym Korablyov, Emmanuel Bengio, Moksh Jain, Andrei Nica, Tom Bosc, Yoshua Bengio, and Nikolay Malkin. Learning GFlowNets from partial episodes for improved convergence and stability. arXiv preprint arXiv:2209.12782, 2022.
Nikolay Malkin, Moksh Jain, Emmanuel Bengio, Chen Sun, and Yoshua Bengio. Trajectory balance: Improved credit assignment in GFlowNets. arXiv preprint arXiv:2201.13259, 2022.
Andrei Cristian Nica, Moksh Jain, Emmanuel Bengio, Cheng-Hao Liu, Maksym Korablyov, Michael M Bronstein, and Yoshua Bengio. Evaluating generalization in GFlowNets for molecule design. In ICLR 2022 Machine Learning for Drug Discovery, 2022.
Mizu Nishikawa-Toomey, Tristan Deleu, Jithendaraa Subramanian, Yoshua Bengio, and Laurent Charlin. Bayesian learning of causal structure and mechanisms with GFlowNets and variational Bayes. arXiv preprint arXiv:2211.02763, 2022.
Rajesh Ranganath, Dustin Tran, and David Blei. Hierarchical variational models. In International Conference on Machine Learning, pp. 324–333. PMLR, 2016.
Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. Regularized evolution for image classifier architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 4780–4789, 2019.
Max W Shen, Emmanuel Bengio, Ehsan Hajiramezanali, Andreas Loukas, Kyunghyun Cho, and Tommaso Biancalani. Towards understanding and improving GFlowNet training. arXiv preprint arXiv:2305.07170, 2023.
Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=PxTIG12RRHS.
Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Reinforcement Learning, pp. 5–32, 1992. 
David W Zhang, Corrado Rainone, Markus Peschl, and Roberto Bondesan. Robust scheduling with GFlowNets. arXiv preprint arXiv:2302.05446, 2023a.
Dinghuai Zhang, Ricky TQ Chen, Nikolay Malkin, and Yoshua Bengio. Unifying generative models with GFlowNets. arXiv preprint arXiv:2209.02606, 2022.
Dinghuai Zhang, Hanjun Dai, Nikolay Malkin, Aaron Courville, Yoshua Bengio, and Ling Pan. Let the flows tell: Solving graph combinatorial optimization problems with GFlowNets. arXiv preprint arXiv:2305.17010, 2023b.

Please note that some embodiments may pertain to a method for controlling an electronic device, such as robot 80 of Figs.8 and 9. The method or any method discussed herein can also be implemented as a computer program causing a computer and/or a processor, such as processor 101 of Fig.10 discussed above, to perform the method, when being carried out on the computer and/or processor. In some embodiments, also a non-transitory computer-readable recording medium is provided that stores therein a computer program product, which, when executed by a processor, such as the processor described above, causes the method described to be performed. All units and entities described in this specification and claimed in the appended claims can, if not stated otherwise, be implemented as integrated circuit logic, for example on a chip, and functionality provided by such units and entities can, if not stated otherwise, be implemented by software. Insofar as the embodiments of the disclosure described above are implemented, at least in part, using software-controlled data processing apparatus, it will be appreciated that a computer program providing such software control and a transmission, storage or other medium by which such a computer program is provided are envisaged as aspects of the present disclosure. Note that the present technology can also be configured as described below. 
(1) A circuitry configured to generate a candidate object using an order-preserving generative flow network whose input comprises a set of objects, wherein the circuitry is configured to determine the order of the candidate object within a set of objects, sample the candidate object based on its determined order. (2) The circuitry of (1), wherein the order is induced based on an objective function. (3) The circuitry of any one of (1) to (2), wherein the candidate object is sampled proportional to the respective order of the candidate object. (4) The circuitry of any one of (1) to (3), wherein the candidate object is sampled with a probability that is exponential to the respective order of the candidate object. (5) The circuitry of any one of (1) to (4), wherein sampling the candidate object is based on a predefined initial state. (6) The circuitry of any one of (1) to (5), wherein sampling the candidate object is based on a predefined set of possible actions. (7) The circuitry of any one of (1) to (6), wherein sampling the candidate object is based on a terminate action, wherein the generation of a candidate object is finished once the terminate action was sampled. (8) The circuitry of any one of (1) to (7), wherein the set of objects comprises a set of paths from a starting location to an end location, which a pathfinder starting from the starting location can navigate to reach the end location, and wherein the candidate object is a candidate path to reach the end location starting from the starting location. (9) The circuitry of any one of (1) to (8), wherein the set of objects comprises a set of molecules and the candidate object is a candidate molecule. (10) A circuitry configured to perform a neural network architecture search based on a generative flow network. (11) The circuitry of (10), wherein the generative flow network is an order-preserving flow network according to any one of (1) to (7). 
(12) The circuitry of (10) or (11), wherein performing the neural network architecture search is based on user input. (13) The circuitry of any one of (10) to (12), wherein the set of objects comprises a set of neural network architectures for solving a given task and the candidate object is a candidate neural network architecture for solving the task. (14) A pathfinding robot comprising circuitry configured to find a path using a generative flow network. (15) The pathfinding robot of (14), wherein the generative flow network is an order-preserving flow network according to any one of (1) to (8). (16) A navigation device comprising circuitry configured to find a route based on a generative flow network. (17) The navigation device of (16), wherein the generative flow network is an order- preserving flow network according to any one of (1) to (8). (18) A vehicle comprising the navigation device of (16) or (17). (19) A circuitry configured to train a generative flow network based on an order-preserving training criterion. (20) The circuitry of (19), wherein the order-preserving training criterion includes training to sample proportional to a proxy reward that is compatible with a partial order of a set of objects that reflects the difference of the quality of the objects within a given environment. (21) The circuitry of any one of (19) to (20), wherein the order-preserving training criterion is defined based on pairwise candidate comparisons. (22) The circuitry of any one of (19) to (21), wherein the order-preserving training criterion includes learning a proxy reward that is compatible with the pairwise candidate comparisons. (23) The circuitry of any one of (19) to (22), wherein the order-preserving training criterion includes determining an order-preserving loss function. (24) The circuitry of (23), wherein the order-preserving loss function is based on a cross- entropy between the circuitry output and the pairwise candidate comparisons. 
(25) A method for generating a candidate object based on an order-preserving generative flow network. (26) A method for generating a candidate object using an order-preserving generative flow network whose input comprises a set of objects, wherein the method includes determining the order of the candidate object within a set of objects, sampling the candidate object based on its determined order. (27) The method of any one of (25) or (26), wherein the order is induced based on an objective function. (28) The method of any one of (25) to (27), wherein the candidate object is sampled proportional to the respective order of the candidate object. (29) The method of any one of (25) to (28), wherein the candidate object is sampled with a probability that is exponential to the respective order of the candidate object. (30) The method of any one of (25) to (29), wherein sampling the candidate object is based on a predefined initial state. (31) The method of any one of (25) to (30), wherein sampling the candidate object is based on a predefined set of possible actions. (32) The method of any one of (25) to (31), wherein sampling the candidate object is based on a terminate action, wherein the generation of a candidate object is finished once the terminate action was sampled. (33) A method for training a generative flow network based on an order-preserving training criterion. (34) The method of (33), wherein the order-preserving training criterion includes training to sample proportional to a proxy reward that is compatible with a partial order of a set of objects that reflects the difference of the quality of the objects within a given environment. (35) The method of any one of (33) to (34), wherein the order-preserving training criterion is defined based on pairwise candidate comparisons. (36) The method of any one of (33) to (35), wherein the order-preserving training criterion includes learning a proxy reward that is compatible with the pairwise candidate comparisons. 
(37) The method of any one of (33) to (36), wherein the order-preserving training criterion includes determining an order-preserving loss function. (38) The method of any one of (33) to (37), wherein the order-preserving loss function is based on a cross-entropy between the method output and the pairwise candidate comparisons. (39) A computer program comprising program code causing a computer to perform the method according to any one of (25) to (38), when being carried out on a computer. (40) A non-transitory computer-readable recording medium that stores therein a computer program product, which, when executed by a processor, causes the method according to any one of (25) to (38) to be performed. (41) Using an order-preserving generative flow network as defined in any one of (1) to (24) for controlling a machine and/or involving optimal scheduling of actions. (42) The use of the order-preserving generative flow network of (41), wherein controlling the machine includes controlling a navigation path of the machine. (43) The use of the order-preserving generative flow network of (41) or (42), wherein controlling the machine includes identifying a neural network architecture for solving a given task. (44) An order-preserving generative flow network whose input comprises a set of objects, configured to determine the order of the candidate object within the set of objects, and sample the candidate object based on its determined order. (45) An order-preserving generative flow network as defined in any one of (1) to (9), (11) to (13), (15), (17) or (18). (46) An order-preserving generative flow network trained according to an order-preserving training criterion as defined in any one of (19) to (24).
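Items (21) to (24) above can be illustrated with a minimal sketch of an order-preserving training criterion based on pairwise candidate comparisons and cross-entropy; the Bradley–Terry-style form and all names here are assumptions for illustration, not the claimed implementation:

```python
import math

def order_preserving_loss(log_rewards, comparisons):
    """Cross-entropy between predicted pairwise preferences and observed
    pairwise candidate comparisons.

    log_rewards: learned proxy log-rewards, one per candidate object.
    comparisons: list of (i, j) pairs meaning candidate i is preferred over j.
    """
    loss = 0.0
    for i, j in comparisons:
        # Predicted probability that candidate i is preferred over candidate j.
        p = 1.0 / (1.0 + math.exp(log_rewards[j] - log_rewards[i]))
        loss += -math.log(p)  # cross-entropy with the observed label "i over j"
    return loss / len(comparisons)

# A proxy reward whose order matches the comparisons gives a low loss;
# a proxy reward with the reversed order gives a high loss.
good = order_preserving_loss([2.0, 1.0, 0.0], [(0, 1), (1, 2), (0, 2)])
bad = order_preserving_loss([0.0, 1.0, 2.0], [(0, 1), (1, 2), (0, 2)])
```

Minimizing such a loss drives the learned proxy reward to be compatible with the partial order of the candidates, without constraining the exact reward values.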

Claims

CLAIMS 1. A circuitry configured to generate a candidate object using an order-preserving generative flow network whose input comprises a set of objects, wherein the circuitry is configured to determine the order of the candidate object within the set of objects, and sample the candidate object based on its determined order.
2. The circuitry of claim 1, wherein the order is induced based on an objective function.
3. The circuitry of claim 1, wherein the candidate object is sampled proportional to the respective order of the candidate object.
4. The circuitry of claim 1, wherein the candidate object is sampled with a probability that is exponential to the respective order of the candidate object.
5. The circuitry of claim 1, wherein sampling the candidate object is based on a predefined initial state and/or on a predefined set of possible actions.
6. The circuitry of claim 1, wherein sampling the candidate object is based on a terminate action, wherein the generation of a candidate object is finished once the terminate action was sampled.
7. The circuitry of claim 1, wherein the set of objects comprises a set of paths from a starting location to an end location, which a pathfinder starting from the starting location can navigate to reach the end location, and wherein the candidate object is a candidate path to reach the end location starting from the starting location.
8. The circuitry of claim 1, wherein the set of objects comprises a set of molecules and the candidate object is a candidate molecule.
9. A circuitry configured to perform a neural network architecture search using a generative flow network.
10. The circuitry of claim 9, wherein the generative flow network is an order-preserving flow network according to claim 1.
11. The circuitry of claim 9, wherein performing the neural network architecture search is based on user input.
12. The circuitry of claim 10, wherein the set of objects comprises a set of neural network architectures for solving a given task and the candidate object is a candidate neural network architecture for solving the task.
13. A pathfinding robot comprising circuitry configured to find a path using a generative flow network.
14. The pathfinding robot of claim 13, wherein the generative flow network is an order- preserving flow network according to claim 1.
15. A navigation device comprising circuitry configured to find a route using a generative flow network.
16. The navigation device of claim 15, wherein the generative flow network is an order- preserving flow network according to claim 1.
17. A method for generating a candidate object using an order-preserving generative flow network whose input comprises a set of objects, wherein the method includes determining the order of the candidate object within the set of objects, and sampling the candidate object based on its determined order.
18. A method for training a generative flow network using an order-preserving training criterion.
19. An order-preserving generative flow network whose input comprises a set of objects, configured to determine the order of the candidate object within the set of objects, and sample the candidate object based on its determined order.
20. An order-preserving generative flow network trained according to an order-preserving training criterion.
PCT/EP2024/076276 2023-09-20 2024-09-19 Circuitry, robot, navigation device, method and network WO2025061851A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
EP23198673 2023-09-20
EP23198673.8 2023-09-20
EP24187434.6 2024-07-09
EP24187434 2024-07-09

Publications (1)

Publication Number Publication Date
WO2025061851A1 true WO2025061851A1 (en) 2025-03-27

Family

ID=92816712

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2024/076276 WO2025061851A1 (en) 2023-09-20 2024-09-19 Circuitry, robot, navigation device, method and network

Country Status (1)

Country Link
WO (1) WO2025061851A1 (en)

Non-Patent Citations (27)

* Cited by examiner, † Cited by third party
Title
"Computer Vision - ECCV 2022 : 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part XX", vol. 13680, 23 October 2022, SPRINGER NATURE SWITZERLAND, ISBN: 978-3-031-20044-1, ISSN: 0302-9743, article XUE CHAO ET AL: "A Max-Flow Based Approach for Neural Architecture Search", pages: 685 - 701, XP093232554, DOI: 10.1007/978-3-031-20044-1_39 *
DAVID W ZHANG ET AL: "Robust Scheduling with GFlowNets", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 14 February 2023 (2023-02-14), XP091437599 *
YIHENG ZHU ET AL: "Sample-efficient Multi-objective Molecular Optimization with GFlowNets", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 8 February 2023 (2023-02-08), XP091432591 *
YINCHUAN LI ET AL: "GFlowNets with Human Feedback", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 11 May 2023 (2023-05-11), XP091506074 *
YOSHUA BENGIO ET AL: "GFlowNet Foundations", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 17 November 2021 (2021-11-17), XP091102349 *
YOSHUA BENGIOSALEM LAHLOUTRISTAN DELEUEDWARD HUMO TIWARIEMMANUEL BENGIO: "GFlowNet foundations", ARXIV PREPRINT 2111.09266, 2021

Similar Documents

Publication Publication Date Title
Placed et al. A survey on active simultaneous localization and mapping: State of the art and new frontiers
Liu et al. Progressive neural architecture search
EP3966747B1 (en) Electronic device and method for controlling the electronic device thereof
Hernández-Lobato et al. Probabilistic backpropagation for scalable learning of bayesian neural networks
Xu et al. Multi-objective graph heuristic search for terrestrial robot design
Zheng et al. Ddpnas: Efficient neural architecture search via dynamic distribution pruning
EP4137997B1 (en) Methods and system for goal-conditioned exploration for object goal navigation
Murphy et al. Risky planning on probabilistic costmaps for path planning in outdoor environments
Clay et al. Towards real-time crowd simulation under uncertainty using an agent-based model and an unscented kalman filter
Liu et al. Stochastic loss function
CN120112901A (en) Optimizing algorithms for hardware devices
Chen Differentiable quantum architecture search in asynchronous quantum reinforcement learning
Gowda et al. 19 Principles and Applications of Bayesian Optimization in AI
Lo et al. Goal-space planning with subgoal models
Dong et al. Maximum entropy reinforcement learning with diffusion policy
Benoit et al. Navigating Intelligence: A Survey of Google OR-Tools and Machine Learning for Global Path Planning in Autonomous Vehicles
Yin et al. Probabilistic sequential multi-objective optimization of convolutional neural networks
Zhang et al. Neural network-assisted simulation optimization with covariates
Wingate et al. Compositional policy priors
Shariatzadeh et al. A survey on multi-objective neural architecture search
WO2025061851A1 (en) Circuitry, robot, navigation device, method and network
Schönle et al. Optimizing Markov chain Monte Carlo convergence with normalizing flows and Gibbs sampling
Cao et al. Efficient multi-objective reinforcement learning via multiple-gradient descent with iteratively discovered weight-vector sets
Ren et al. Deep reinforcement learning using least‐squares truncated temporal‐difference
Liu et al. CF-DAML: Distributed automated machine learning based on collaborative filtering

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 24773046

Country of ref document: EP

Kind code of ref document: A1