WO2022147584A2 - Deep-reinforcement learning (RL), weight-resonant system and method for fixed-horizon search of optimality - Google Patents

Deep-reinforcement learning (RL), weight-resonant system and method for fixed-horizon search of optimality

Info

Publication number
WO2022147584A2
Authority
WO
WIPO (PCT)
Prior art keywords
neural network
distributions
batch
configuration
probability distributions
Prior art date
Application number
PCT/US2022/026747
Other languages
French (fr)
Other versions
WO2022147584A3 (en)
Inventor
Masood Seyed Mortazavi
Ning Yan
Original Assignee
Futurewei Technologies, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Futurewei Technologies, Inc. filed Critical Futurewei Technologies, Inc.
Priority to PCT/US2022/026747 priority Critical patent/WO2022147584A2/en
Publication of WO2022147584A2 publication Critical patent/WO2022147584A2/en
Publication of WO2022147584A3 publication Critical patent/WO2022147584A3/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/004 Artificial life, i.e. computing arrangements simulating life
    • G06N3/006 Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/20 Protein or domain folding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis

Definitions

  • the present disclosure relates generally to artificial intelligence (Al), and, in particular embodiments, to methods and apparatus for reinforcement learning (RL) for fixed-horizon search of configurations with high-dimensionality.
  • AI artificial intelligence
  • RL reinforcement learning
  • Many system designs can involve a set of adjustable configurations.
  • a configuration of a system there may be many configuration dimensions, with each dimension representing a different aspect of the configuration of the system and corresponding to a different configuration parameter for that aspect of the configuration.
  • For each configuration dimension there may be a variety of available (discrete or continuous) options for that dimension.
  • Some system designs have very large configuration spaces because the number of configuration dimensions is large, or the number of available options within each dimension is large, or both. For example, for a system whose configuration has 20 dimensions with each dimension having 8 options, the number of different configurations for the system can reach 20^8.
  • different configuration dimensions for a system can have different numbers of options. For instance, for a discrete system of fixed N dimensions, if each dimension has its own number of (fixed) options Mi, the number of possible configurations for the system could be M1 × M2 × ... × MN (i.e., ∏i Mi).
  • a neural network running on at least one processor receives a constant input for a configuration design requiring N dimensions.
  • the neural network outputs N probability distributions.
  • the at least one processor generates a batch of sample configurations for the configuration design based on the N probability distributions. Each sample configuration of the batch of sample configurations corresponds to a different full configuration of a system.
  • the at least one processor outputs the batch of sample configurations to an evaluator external to the neural network.
  • the at least one processor updates parameters of the neural network based on a loss function.
  • the at least one processor may repeat the receiving the constant input, the outputting the N probability distributions, the generating the batch of sample configurations, the outputting the batch of sample configurations, and the updating the parameters for a plurality of iterations, wherein the constant input remains the same throughout the plurality of iterations.
  • the constant input may include a single constant value, N different constant one-hot vectors, or N vectors each having all 1s.
  • the receiving the constant input and the outputting the N probability distributions may be stateless.
  • the updating the parameters of the neural network may include updating weights of the neural network.
  • the N probability distributions may be joint probability distributions such that all of the N probability distributions are conditional on one another.
  • the loss function may be calculated based on rewards, the rewards generated based on performance metrics output by the evaluator.
  • the loss function may be updated by series-mode value estimation layers, and wherein the series-mode value estimation layers are updated based on the batch of sample configurations.
  • the loss function may be updated by branched-mode value estimation layers, and wherein the branched- mode value estimation layers are updated based on a task encoding tensor of the neural network.
  • the N probability distributions include N discrete probability density function (PDF) distributions, N continuous PDF distributions, or a mixture of M discrete PDF distributions and (N-M) continuous PDF distributions.
  • PDF probability density function
  • embodiment techniques improve efficiency of memory utilization and performance of computer operations in finding the optimal configuration, particularly in a search space with a large number of dimensions or a large number of options within each dimension (or both).
  • FIGs. 1A and 1B illustrate the general RL theory and algorithms, according to some embodiments
  • FIG. 2 illustrates one possible application scheme of RL for searching for the optimal configuration in system design, according to some embodiments
  • FIG. 3 illustrates an improved application scheme of RL for searching for the optimal configuration in system design, according to some embodiments;
  • FIG. 4A illustrates an example architecture of weight-resonance for fixed-length optimality search, according to some embodiments;
  • FIG. 4B illustrates more details of the example architecture of weight-resonance for fixed-length optimality search, according to some embodiments;
  • FIG. 4C illustrates more details of the example architecture of weight-resonance for fixed-length optimality search, according to some embodiments;
  • FIG. 4D illustrates more details of the example architecture of weight-resonance for fixed-length optimality search, according to some embodiments;
  • FIG. 4E illustrates an example architecture of weight-resonance for fixed-length optimality search in the case of N discrete configuration dimensions, according to some embodiments;
  • FIG. 4F illustrates an example architecture of weight-resonance for fixed-length optimality search in the case of N discrete configuration dimensions, according to some embodiments;
  • FIG. 5 illustrates a flow chart of a method of weight-resonance for fixed-length-dimension optimal configuration search, according to some embodiments.
  • FIG. 6 is a block diagram of a computing system that may be used for implementing the devices and methods disclosed herein, according to some embodiments.
  • a system is designed to determine what is the best possible placement for the block macros defined in a netlist for a central processing unit (CPU).
  • the netlist there could be a fixed number of block macros (e.g., dimensions, with each dimension corresponding to a different configuration parameter for configuring the location of a different block macro) whose locations (e.g., options for each dimension) need to be fixed before a global placer places all the other (standard) cells.
  • a system is designed to determine what are the best sizes for the cache hierarchy in a chip. In the cache hierarchy, there may be multiple cache levels (e.g., L0, L1, L2, L3, etc.).
  • Caches at different levels often do not have the same size, and there could be many size options (e.g., options for each dimension) for each cache.
  • a system is designed to determine what is the optimal amino-acid sequence of a given length that would produce the best anti-body protein to fight a virus.
  • there could be 20 amino acid choices (e.g., options for each dimension), and
  • proteins are generally between 50 to 2000 amino acids long (e.g., dimensions, with each dimension corresponding to a different configuration parameter for configuring a different amino acid’s choice).
  • evaluating each of the possible protein configurations to find an optimal protein configuration is computationally intensive.
  • the technical problem solved by this disclosure is to find the optimum configuration (e.g., the optimum design) for a given system.
  • the system model can be too complex to define, describe, or model accurately (e.g., how does the cache hierarchy impact the CPU design efficiency for a given benchmark set).
  • the combinatorics of the possible configurations can define a huge space of choices (e.g., the placement problem for the block macros defined in a netlist described above, or the possible proteins of length 100 that can be synthesized from the 20 available amino acids).
  • reinforcement learning is a machine learning training method based on rewarding desired behaviors and/or punishing undesired ones.
  • a reinforcement learning agent is able to perceive and interpret its environment, take actions, and learn through trial and error.
  • applying RL to the search of an optimal solution itself may be technically challenging.
  • the method of reinforcement learning may include taking an action (e.g., producing the configuration), collecting a reward (i.e., evaluating the configuration), and using policy reinforcement algorithms to find the optimal action.
  • the policy reinforcement algorithm used may be any of the policy reinforcement algorithms known in the art, such as REINFORCE, actor critic (AC), advantage actor critic (A2C), asynchronous advantage actor critic (A3C), trust region policy optimization (TRPO), and proximal policy optimization (PPO).
  • FIGs. 1A and 1B illustrate the general RL theory and algorithms, according to some embodiments.
  • the working material of the RL may include episodes of the state-action-state-reward sequence. The episodes start from an initial state S0 and end in a terminal state ST (e.g., one episode has all dimensions configured, such as going through the path once from S0 to ST in FIG. 1A).
  • the general RL theory and algorithms may be devised for episodes of arbitrary length.
  • FIG. 1B further shows that, at state Si-1 (1 ≤ i ≤ T), an action Ai is taken and the state transitions to state Si.
  • the policy-gradient family of the RL algorithms relies on action estimation.
  • the policy-gradient family of the RL algorithms tries to model π(ai | si-1; θ), which is the probability that the action chosen in state Si-1 will be ai.
  • θ represents the set of modeling parameters (e.g., the weights in the neural network).
  • the modeling parameters may be trained with the objective that the greatest value is accrued in each episode (i.e., the discounted sum of rewards is maximized), using one of the many members of the PG family of RL algorithms (e.g., REINFORCE, AC, A2C, A3C, TRPO, and PPO); a standard formulation is sketched below.
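  • As a concrete illustration (a generic REINFORCE-style formulation, not quoted verbatim from this disclosure), the training objective and its policy gradient may be written as:

```latex
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{i=1}^{T} \gamma^{\,i-1} R_i \right],
\qquad
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{i=1}^{T} G_i \, \nabla_\theta \log \pi(a_i \mid s_{i-1}; \theta) \right]
```

  • here γ is the discount factor and Gi is the discounted sum of rewards collected from step i onward; other members of the PG family modify this gradient (e.g., with critics or clipped probability ratios) but keep the same dependence on log π(ai | si-1; θ).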
  • FIG. 2 illustrates one possible application scheme of RL for searching for the optimal configuration in system design, according to some embodiments.
  • Some system design configuration exploration may produce the following scheme.
  • actions may be configuration selection in each of the configuration dimensions.
  • the action may be configuring C1 to select an option value of dimension 1 of the configuration;
  • the action may be configuring C2 to select an option value for dimension 2 of the configuration,
  • the action may be configuring Ci to select an option value of dimension i of the configuration (1 ≤ i ≤ T); and so on.
  • the number of configuration dimensions is fixed (e.g., the number of dimensions in the example shown in FIG. 2 is a fixed number, T in FIG. 2).
  • RL in fixed horizon optimality search (e.g., fixed-length episodes) needs to handle a fixed horizon (e.g., the number of dimensions, T in FIG. 2) for all episodes (i.e., T is always fixed) because the number of configuration dimensions is fixed.
  • a given dimension may offer discrete or continuous options (e.g., actions)
  • some application schemes of RL, such as the scheme described above with respect to FIG. 2, can experience technical difficulties and complexities.
  • the reward (which is used as a measure of design goodness) can only be evaluated once all configuration dimensions are configured (with one of the available options in each dimension). That is, only a fully configured system can be evaluated by a system evaluation mechanism (e.g., power/performance/area (PPA) evaluation in chip design, for a chip design that has multiple possible configuration dimensions).
  • PPA power/performance/area
  • FIG. 3 illustrates an improved application scheme of RL for searching for the optimal configuration in system design, according to some embodiments.
  • the episode (e.g., a configuration of all dimensions)
  • a fixed horizon (e.g., the number of dimensions, or the number of configuration parameters)
  • embodiments in this disclosure reformulate the application scheme of RL as a single-step compound action.
  • different permutations of configuration options for different dimensions may be evaluated to determine which produces the best reward.
  • the reformulated application scheme improves the problem solution and allows for more effective and efficient action probability estimation. As shown in FIG. 3, the scheme improvement reduces the RL system to a single-step, stateless episode.
  • the single-step compound action 304 in FIG. 3 is the combination of configuring all dimensions (e.g., the combination of “Configure C1 for dimension 1, Configure C2 for dimension 2, Configure C3 for dimension 3, Configure C4 for dimension 4, ..., Configure CT-1 for dimension T-1, Configure CT for dimension T”).
  • the improved application scheme is “stateless” in that there are no intermediate states (e.g., states S1, S2, S3, ..., ST-1 as shown in FIG. 1A and FIG. 2).
  • the single-step compound action 304 is performed. Then, the state transitions to the fully configured state 306, and reward RT is collected.
  • other application schemes of RL (e.g., the scheme shown in FIG. 2) need to consider constraints between dimensions.
  • embodiment techniques ignore such constraints by modeling joint probabilities instead of the more complex conditional probabilities.
  • the embodiment techniques are still distinct from the solutions of multi-variate multi-armed bandits in many ways.
  • the embodiment techniques deal with multiple dimensions.
  • if a configuration dimension must be represented with a continuous domain, that dimension alone would present an infinite-armed bandit.
  • any configuration dimensions (of the multiple configuration dimensions) that only offer discrete options are each, individually, a multi-armed bandit.
  • FIG. 4A illustrates an example architecture 400 of weight-resonance for fixed-length optimality search, according to some embodiments.
  • the example architecture 400 includes a neural network 404 (e.g., an action policy neural network).
  • the neural network 404 includes layers of nodes with learnable parameter set θ (e.g., weights among the nodes).
  • the input 402 to the neural network 404 may be a constant for the blank state. That means, the input 402 may remain the same and unchanged throughout the search for the optimal configuration.
  • the input 402 may be a single constant value (e.g., a non-zero number).
  • the constant input 402 may be the tensor of 1s or one-hot 1s (e.g., a group of bits among which the allowable combinations of values are only those with a single high (1) bit and all the others low (0)).
  • the input 402 may be a single constant value (e.g., a non-zero number), and the neural network 404 may produce an output of the same size as the output produced by the tensor of 1s or one-hot 1s.
  • the output 406 of the neural network 404 is a compound-action probability density function (PDF), or parameters of the PDF.
  • PDF probability density function
  • the probability distribution gives the probability of each outcome of a random event (e.g., a possible option being configured for a dimension).
  • the compound-action PDF is a function used to define the probabilities of different possible occurrences (e.g., the probabilities of different options configured for each of the different dimensions).
  • the compound-action PDF in the output 406 may be represented as π(a1, a2, ..., aT-1, aT; θ), where ai is a possible option in configuration dimension i.
  • a batch of sample configurations 408 (e.g., candidate configurations) is then generated.
  • each sample configuration of the batch of sample configurations is a sample of a full configuration of all T dimensions, with each dimension of the total T dimensions configured with a corresponding dimension option.
  • a batch of sample configurations is generated based on the compound-action PDF such that the distribution of the configured options for dimensions in the batch of sample configurations matches the probability distribution of the compound-action PDF, π(a1, a2, ..., aT-1, aT; θ); a sampling sketch follows below.
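  • the sampling step can be sketched as follows (a minimal PyTorch illustration under the assumption of independent per-dimension categorical distributions; the option counts and batch size are hypothetical, and the policy-produced probabilities are replaced by random placeholders):

```python
import torch

# Hypothetical example: T = 4 configuration dimensions with these option counts.
options_per_dim = [8, 8, 16, 4]
batch_size = 64

# Per-dimension categorical probabilities, as would be produced by the policy
# network (random placeholders of the right shape are used here).
dim_probs = [torch.softmax(torch.randn(m), dim=-1) for m in options_per_dim]

# Draw a batch of full configurations: one option index per dimension per sample.
batch = torch.stack(
    [torch.distributions.Categorical(probs=p).sample((batch_size,)) for p in dim_probs],
    dim=1,
)  # shape (batch_size, T); each row is one fully configured sample
```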
  • the generated batch of sample configurations 408 may be sent across a non-differentiability wall between systems to an evaluator 410.
  • the evaluator 410 may evaluate each of the fully configured systems (using the batch of sample configurations).
  • the evaluator 410 may be any external simulation/ evaluation system outside of and separate from the neural network 404.
  • the evaluator 410 can evaluate the fully- configured system for finding the optimal configuration of the system design. Practical uses of the embodiment technique may require the evaluator 410 to be fast enough, which depends on the time available for the search of the optimal configuration and the complexity of configuration space.
  • the evaluator 410 may output multi-dimensional performance metrics 412 for each fully configured system using a sample configuration of the batch of sample configurations (e.g., power/performance/area (PPA) in designing a chip system). Then, a scalar reward 414 is generated based on the multi-dimensional performance metrics. For example, the multi-dimensional performance metrics 412 may be normalized and weighted to generate a single number as the scalar reward 414 (e.g., weight of power x normalized power + weight of performance x normalized performance + weight of area x normalized area).
  • PPA power/performance/area
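  • for example, a PPA-style scalar reward 414 could be computed as a weighted sum of normalized metrics (a minimal sketch; the metric names, normalizing references, and weights are hypothetical, and objectives to be minimized can simply be given negative weights):

```python
def scalar_reward(metrics, weights, reference):
    """Collapse multi-dimensional performance metrics into one scalar reward.

    All three arguments are dicts keyed by objective name, e.g. "power",
    "performance", "area". Dividing by a reference design's metrics is one
    simple normalization choice among many.
    """
    return sum(weights[k] * (metrics[k] / reference[k]) for k in metrics)

reward = scalar_reward(
    metrics={"power": 1.2, "performance": 3.4, "area": 0.9},    # hypothetical values
    weights={"power": -0.3, "performance": 0.5, "area": -0.2},  # minimize power and area
    reference={"power": 1.0, "performance": 4.0, "area": 1.0},
)
```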
  • the scalar reward 414 is used to compute the statistical loss function 416.
  • the statistical loss function 416 may be computed based on the scalar reward 414 using any member of the policy gradient family of RL algorithms known in the art, including but not limited to REINFORCE, AC, A2C, A3C, TRPO, and PPO.
  • the loss function is a function that maps values of one or more variables onto a real number intuitively representing some cost associated with those values.
  • the loss function is a function for evaluating how well a neural network (e.g., the neural network 404) models the dataset. If the modeling is off, the loss function outputs a larger number than it would for a better-modeled neural network.
  • the loss function may depend on (and may be a differentiable function of) the neural network 404’s parameters (“weights”) so that those weights can be adjusted through back-propagation.
  • the scalar reward 414 plays the role of a standing (non-differentiable) parameter simply because the dependence of the reward 414 on the configuration is unknown to the agent.
  • the reward 414 is rendered by the environment (e.g., the evaluator 410 (which may be a simulator) evaluating the fully configured system).
  • Various objective functions for maximization or minimization may be defined by each of the policy-gradient family of the algorithms.
  • the objective function (loss) may be algebraically evaluated using the algebra supported by some auto-differentiation software library (e.g., TensorFlow, PyTorch, etc.). Then, through the back-propagation 418 algorithm supported by the aforementioned auto-differentiation library, the parameter set θ of the neural network 404 can be updated to increase the reward; a minimal sketch of this update follows below.
  • some auto-differentiation software library (e.g., TensorFlow, PyTorch, etc.)
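  • a minimal PyTorch sketch of this update step, assuming a REINFORCE-style loss with a mean-reward baseline (other policy-gradient losses differ in detail); here log_probs are the summed log-probabilities of each sampled configuration under the current policy and rewards are the scalar rewards 414 returned for the batch:

```python
import torch

def policy_update(log_probs, rewards, optimizer):
    # log_probs: (batch,) tensor carrying gradients back to the policy network.
    # rewards:   (batch,) tensor of scalar rewards from the external evaluator.
    advantages = rewards - rewards.mean()          # simple variance-reduction baseline
    loss = -(log_probs * advantages.detach()).mean()
    optimizer.zero_grad()
    loss.backward()                                # back-propagation through the policy only
    optimizer.step()
    return loss.item()
```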
  • the process described above may repeat for multiple iterations to find the optimal configuration.
  • the input 402 to the neural network 404 remains constant and unchanged throughout all the iterations.
  • the number of iterations may be a pre-configured number.
  • the process may stop when the scalar reward 414 is equal to or above a pre-determined threshold level.
  • the sample configuration resulting in the highest value of the scalar reward 414 may be selected as the optimal configuration.
  • FIG. 4B illustrates more details of the example architecture 400 of weight-resonance for fixed-length optimality search, according to some embodiments.
  • the input 402 to the neural network 404 remains constant throughout the search for the optimal configuration. Even so, the neural network 404 can still learn.
  • the loss (e.g., computed from the statistical loss function 416) that is back-propagated only depends on the reward of the generated configuration and the PDF/distributional/sampling output 408 of the neural network 404. In general, the loss is defined so that it may decrease if the overall discounted reward increases.
  • the various policy-gradient algorithms determine the degree of this update.
  • any policy-gradient algorithm (e.g., REINFORCE, AC, A2C, A3C, TRPO, or PPO) may be used.
  • PPO is generally considered better than A3C
  • A3C is generally considered better than A2C
  • A2C is generally considered better than AC
  • AC is generally considered better than REINFORCE.
  • each of these algorithms comes with hyper-parameters that determine the importance of various components in the objective function (or loss) that each algorithm formulates, and the embodiments are not limited to the algorithms listed above.
  • the neural network 404 can learn because the neural network parameters (e.g., θ) can be updated through stochastic gradient descent in order to minimize the loss function using a back-propagation algorithm. So, despite a constant input, the weights of the neural network 404 can keep updating (e.g., resonating) towards producing a joint distribution for the optimal configuration, using any of the policy gradient algorithms.
  • FIG. 4C illustrates more details of the example architecture 400 of weight-resonance for fixed-length optimality search, according to some embodiments.
  • embodiments in this disclosure approximate the joint probability in the output 406 as a product of conditional distributions, and these conditional distributions may be parametrically approximated by marginal distributions, because parameter dependence is captured through one or more task abstraction layers included in the neural network 404.
  • the one or more task abstraction layers can use self-attention neural layers, recurrent neural network (RNN) layers, or multi-layer perceptron (MLP) neural layers in order to build interdependency in the internal encoding of the neural network 404 that is learned for the parameter generator layers of the neural network 404.
  • the neural network 404 may include the one or more task abstraction layers in order to jointly model the task across all available action components. In the limit of optimality, with the joint distribution approximated as a product of conditional distributions, the output 406 of the neural network 404 could achieve the result of the conditional probability.
  • FIG. 4D illustrates more details of the example architecture 400 of weight-resonance for fixed-length optimality search, according to some embodiments.
  • the neural network 404 may include one or more task abstraction layers 422, the task encoding tensor 424, and the action distribution head 426.
  • the one or more task abstraction layers 422 may be used to ensure any distributional (joint) dependency is captured for the particular configuration optimality search task for which the output 406 of the neural network 404 is generated, and keeping the input 402 of the neural network 404 at a constant value helps the neural network 404 resonate to this state.
  • the task encoding tensor 424 may be used to generate the action probability PDFs or the parameters that model these action PDFs (embodiments in this disclosure can assume at this stage that marginal probability parameters may be used, because the marginal probability parameters’ dependency can be captured in the earlier stages of the neural network 404 through the one or more task abstraction layers 422 (e.g., neck) and the one or more action distribution layers 426 (e.g., head)).
  • the one or more task abstraction layers 422 may include any or all of self-attention neural layers, RNN (e.g., gated recurrent unit (GRU), long short-term memory (LSTM)) neural layers, or MLP neural layers.
  • the number of layers and the compositions of layers in the one or more task abstraction layers 422 may depend on task complexity and may depend on tradeoffs related to training complexity of more complex networks (e.g., self-attention layers with a larger number of heads or deeper networks).
  • the number of layers and the compositions of layers in the one or more task abstraction layers 422 may also be related to the numerical rounding errors and back-propagation signal loss in very deep networks.
  • the action distribution layers 426 may include any or all of self-attention neural layers, RNN (e.g., GRU, LSTM) neural layers, or MLP neural layers; one possible composition is sketched below.
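  • one way to realize this neck/head layering is sketched below (a simplified PyTorch composition using only MLP layers and assumed sizes; self-attention or RNN layers could be substituted in the task abstraction tier):

```python
import torch
import torch.nn as nn

class WeightResonantPolicy(nn.Module):
    """Constant input -> task abstraction layers -> per-dimension action heads."""

    def __init__(self, options_per_dim, input_dim=16, hidden_dim=128):
        super().__init__()
        # Task abstraction tier ("neck"); plain MLP layers for simplicity.
        self.task_abstraction = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # Action distribution tier ("head"): one categorical head per dimension.
        self.heads = nn.ModuleList([nn.Linear(hidden_dim, m) for m in options_per_dim])

    def forward(self, constant_input):
        task_encoding = self.task_abstraction(constant_input)   # task encoding tensor
        return [torch.softmax(head(task_encoding), dim=-1) for head in self.heads]

policy = WeightResonantPolicy(options_per_dim=[8, 8, 16, 4])
constant_input = torch.ones(1, 16)        # the constant "blank state" input
dim_probs = policy(constant_input)        # one probability vector per dimension
```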
  • the policy-gradient algorithms may be predicated on estimating the advantages through a value estimator.
  • a value estimation tier may be attached to the one or more task abstraction layers 422 (e.g., the branched-mode one or more value estimation layers 432) or attached to the sample configurations 408 more directly (e.g., the series-mode one or more value estimation layers 434); both attachment points are sketched below.
  • the branched-mode one or more value estimation layers 432 or the series-mode one or more value estimation layers 434 may include any or all of self-attention neural layers, RNN (e.g., GRU, LSTM) neural layers, or MLP neural layers.
  • the value estimation tier (e.g., value estimation layers 432 or value estimation layers 434) may also be its own separate, parallel neural network, with a separate constant input for estimating the optimal configuration’s value through a similar weight-resonance scheme described in this disclosure.
  • the two networks may be joined at the loss computation node and may both be updated (resonating towards optimality) through the same back-propagation mechanisms mentioned earlier.
  • the value estimation tier may have its own independent loss function, which helps the value estimation tier to estimate the reward for any specific configuration. So, in the series mode, there may be two loss functions. One loss function (e.g., the statistical loss function 416) is for the policy estimator, and the other loss function (not shown) is for the value estimation tier.
  • the two loss functions may be combined into one loss function (e.g., the statistical loss function 416) for back-propagation.
  • the differences in how the loss function(s) are used in the two modes are because the value estimation tier, in the series mode, depends on the configuration, not the policy that generates the configuration or the task abstraction.
  • the value estimation tier may produce, over multiple iterations, a general estimate of the most likely optimal reward.
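  • a minimal sketch of the two attachment points described above (layer sizes and names are hypothetical; either head can be trained with its own value loss as discussed):

```python
import torch.nn as nn

hidden_dim, num_dims = 128, 4

# Branched mode: the value head reads the task encoding tensor.
branched_value_head = nn.Linear(hidden_dim, 1)

# Series mode: the value network reads the sampled configuration itself
# (here the option indices, cast to floats, for simplicity).
series_value_net = nn.Sequential(
    nn.Linear(num_dims, 64), nn.ReLU(),
    nn.Linear(64, 1),
)
```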
  • FIG. 4E illustrates an example architecture 450 of weight-resonance for fixed-length optimality search in the case of N discrete configuration dimensions (N may be the T described with respect to FIG. 3), according to some embodiments.
  • embodiment techniques described with respect to FIGs. 4A-4D may be applied to the example architecture 450 in FIG. 4E.
  • the example architecture 450 supports the case where all configuration dimensions are categorical and shows the number of parameters that need to be generated by the policy estimation head 425. For the example architecture 450, the number of discrete options in each dimension (from m1 to mN) needs to be known.
  • the configuration dimensions may include any combination of discrete and/or continuous distributions.
  • the continuous distributions (not shown in FIG. 4E) may be modeled as parametric distributions such as Gaussian, Beta, etc., with the distribution chosen appropriately for that dimension’s continuous search. For example, if the configuration dimension has bounded continuous options, embodiments may use a Beta distribution to model it. If the dimension has unbounded options, which may not be common, embodiments may use a Gaussian distribution to model it; a brief sketch follows below.
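  • for instance, a bounded continuous dimension could be given a Beta head and an unbounded one a Gaussian head (a sketch; the parameter values below stand in for network outputs and are kept positive with a softplus):

```python
import torch
import torch.nn.functional as F
from torch.distributions import Beta, Normal

# Hypothetical raw head outputs for one bounded continuous dimension.
raw_alpha, raw_beta = torch.tensor(0.7), torch.tensor(1.3)
alpha = F.softplus(raw_alpha) + 1e-3       # Beta parameters must be positive
beta = F.softplus(raw_beta) + 1e-3

bounded_sample = Beta(alpha, beta).sample()             # in (0, 1); rescale to the option range
unbounded_sample = Normal(loc=0.0, scale=1.0).sample()  # for an unbounded dimension
```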
  • the input 402 to the neural network is a 1-hot constant input.
  • the input 402 to the neural network may be all 1s.
  • the input 402 to the neural network may be a constant value (e.g., a non-zero number).
  • the neural network may include the task encoder 423 (e.g., the one or more task abstraction layers 422) and the policy estimation head 425 (e.g., the one or more action distribution layers 426).
  • the output 406 of the neural network is the set of attribute (per-dimension) discrete distributions.
  • a batch of sample configurations 408 may be generated. Then, the batch of sample configurations 408 are sent to an evaluator for evaluations.
  • the evaluator (e.g., evaluator 410) may be a multi-objective reward generator 411.
  • the multi-objective rewards generated from the generator 411 may be transformed (e.g., through a weighted sum of all objective metrics) to generate a single (real-numbered) reward 415 (e.g., scalar reward 414), which is used by the policy enforcement 419 (e.g., the statistical loss function 416 and the back-propagation 418) to update the parameters (e.g., weights) of the neural network.
  • the weighting may be based on a linear combination of the objectives to generate the reward: for L objectives O1, O2, ..., OL, wj denotes the weight corresponding to objective Oj and oj denotes the performance metric value corresponding to objective Oj, so the reward may be computed as w1·o1 + w2·o2 + ... + wL·oL.
  • an estimation tier (e.g., the branched-mode one or more value estimation layers 432) is attached to the task encoder 423.
  • FIG. 4F illustrates an example architecture 452 of weight-resonance for fixed-length optimality search in the case of N discrete configuration dimensions, according to some embodiments.
  • the example architecture 452 in FIG. 4F is similar to the example architecture 450 in FIG. 4E except that an estimation tier (e.g., the series-mode one or more value estimation layers 434) is attached to the sample configurations 408.
  • FIG. 5 illustrates a flow chart of a method 500 of weight-resonance for fixed-length-dimension optimal configuration search, according to some embodiments.
  • the method 500 may be carried out or performed by one or more processing units of a computing device. Examples of the processing units include, but are not limited to, graphics processing units (GPUs), tensor processing units (TPUs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), artificial intelligence (AI) accelerators, or combinations thereof.
  • the method 500 may also be carried out or performed by routines, subroutines, or modules of software executed by the one or more processing units.
  • the method 500 may further be carried out or performed by a combination of hardware and software.
  • Coding of the software for carrying out or performing the method 500 is well within the scope of a person of ordinary skill in the art having regard to the present disclosure.
  • the method 500 may include additional or fewer operations than those shown and described and may be carried out or performed in a different order.
  • Computer-readable code or instructions of the software executable by the one or more processing units may be stored on a non-transitory computer-readable medium, such as, for example, the memory of a computing device.
  • the method 500 starts at operation 502, where a neural network running on at least one processor receives a constant input for a configuration design requiring N dimensions.
  • the neural network outputs N probability distributions for N dimensions, each of the N probability distributions for a different dimension of the N dimensions.
  • the at least one processor generates a batch of sample configurations for the configuration design based on the N probability distributions.
  • Each sample configuration of the batch of sample configurations corresponds to a different full configuration of a system (e.g., each dimension of the N dimensions for configuring the system is configured with a configuration option).
  • the at least one processor outputs the batch of sample configurations to an evaluator external to the neural network.
  • the at least one processor updates parameters of the neural network based on a loss function.
  • the at least one processor may repeat the receiving the constant input, the outputting the N probability distributions, the generating the batch of sample configurations, the outputting the batch of sample configurations, and the updating the parameters for a plurality of iterations, wherein the constant input remains the same throughout the plurality of iterations.
  • the constant input may include a single constant value, N different constant one-hot vectors, or N vectors each having all 1s.
  • the receiving the constant input and the outputting the N probability distributions may be stateless.
  • the updating the parameters of the neural network may include updating weights of the neural network.
  • the N probability distributions may be joint probability distributions such that all of the N probability distributions are conditional on one another.
  • the loss function may be calculated based on rewards, the rewards generated based on performance metrics output by the evaluator.
  • the loss function may be updated by series-mode value estimation layers, and wherein the series-mode value estimation layers are updated based on the batch of sample configurations.
  • the loss function may be updated by branched-mode value estimation layers, and wherein the branched- mode value estimation layers are updated based on a task encoding tensor of the neural network.
  • the N probability distributions include N discrete probability density function (PDF) distributions, N continuous PDF distributions, or a mixture of M discrete PDF distributions and (N-M) continuous PDF distributions.
  • PDF probability density function
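  • putting the operations of method 500 together, a deliberately simplified end-to-end sketch might look like the following (the policy is any network mapping the constant input to per-dimension categorical probabilities, such as the earlier WeightResonantPolicy sketch, and evaluate is a hypothetical external callable standing in for the evaluator):

```python
import torch

def search_optimal_configuration(policy, evaluate, num_iterations=1000,
                                 batch_size=64, lr=1e-3):
    """Weight-resonant search: constant input, repeated sample/evaluate/update."""
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    constant_input = torch.ones(1, 16)              # never changes across iterations
    best_reward, best_config = float("-inf"), None

    for _ in range(num_iterations):
        dim_probs = policy(constant_input)          # N probability distributions
        dists = [torch.distributions.Categorical(probs=p.squeeze(0)) for p in dim_probs]
        samples = [d.sample((batch_size,)) for d in dists]
        configs = torch.stack(samples, dim=1)       # batch of full configurations

        rewards = evaluate(configs)                 # external evaluator -> (batch,) tensor
        log_probs = torch.stack(
            [d.log_prob(s) for d, s in zip(dists, samples)], dim=1
        ).sum(dim=1)

        advantages = (rewards - rewards.mean()).detach()   # simple baseline
        loss = -(log_probs * advantages).mean()            # REINFORCE-style loss
        optimizer.zero_grad()
        loss.backward()                                    # update the policy's weights
        optimizer.step()

        best_in_batch = torch.argmax(rewards)
        if rewards[best_in_batch] > best_reward:
            best_reward = rewards[best_in_batch].item()
            best_config = configs[best_in_batch].clone()

    return best_config, best_reward
```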
  • Embodiments of this disclosure provide techniques to train a policy-gradient algorithm that is stateless and relies on weight resonance (with a constant input) to perform optimality search in multi-dimensional design configuration spaces.
  • Design configurations may be any aspect of a design of a system that can be expressed in terms of a discrete/categorical or continuous variable.
  • a category is any set of choices that can be indexed by a fixed index set (the probability space of the category).
  • embodiments of this disclosure focus the search for the optimal configuration on a segment of the frontier that is closest to surfaces parallel to the desired operating surface.
  • the operating surfaces are parallel (L-1)-dimensional surfaces in the L-dimensional multi-objective metric space, and are effectively generated subject to the function that transforms (combines, e.g., through weighted sums) the multi-objective metrics O1 to OL to a single reward value.
  • Such functions define parallel surfaces in the L-dimensional space of objective metrics as the reward value (e.g., the weighted sum) varies.
  • Embodiments in this disclosure solve the technical problem of optimal configuration search, where the dimensions of a configuration are fixed, and where the configuration options selectable within each dimension are also fixed, with either a discrete or a continuous parametric probability distribution over the possible options within each dimension.
  • the search space offers innumerable possibilities.
  • the options from the continuous domain may be handled by using parametric distributions (e.g., Gaussian for unbounded domains or Beta for bounded domains). So, embodiment techniques are still applicable to searching for the optimal configurations within the continuous spaces separately from the discrete spaces, or jointly along with the discrete spaces (mixtures of spaces, e.g., options for some dimensions in a configuration).
  • choices within each continuous dimension can be modeled through a parametric probability distribution (with a finite number of parameters).
  • the Gaussian and the Beta distributions may be used among such parametric probability distributions which can be defined with a finite set of parameters but can produce an innumerable set of options.
  • embodiments of this disclosure provide deep-RL, weight-resonant systems and methods, which make it possible to solve the problem of high-combinatorial optimality search as a stateless, single-step, compound-action RL problem.
  • the optimal configurations could be where the PDF has its multi-dimensional maxima. These are the maximum likelihood optimal location(s).
  • the neural networks that produce selection probabilities and value estimation do not receive a varying input that depends on a state that changes with each selection of an option. Instead, the neural networks which model the selection probabilities (“action probabilities” or “policies”) and the embodiment optimal value estimations have constant inputs.
  • the neural networks receive policy-gradient feedback (using any of the many policy-gradient algorithms known in the art) based on a reward estimation for the fully-configured system, which is produced in one compound-action step.
  • the policy network for these policy-gradient algorithms may be the same as embodiment multi-dimensional option selection probabilities (“action probabilities” or “policies”) network (e.g., neural network 404).
  • the value network of these algorithms may be the same as the embodiment multi-dimensional value estimation network. So, both the action policy network (e.g., neural network 404) and the value estimation network (e.g., one or more value estimation layers 432 or one or more value estimation layers 434) may receive constant inputs.
  • the embodiment neural network system may also include a (potentially common) task abstraction tier, prior to final layers that produce action probability distribution and/or value estimation.
  • the action policy neural network (e.g., neural network 404) may produce candidate configurations through probability estimation and sampling techniques.
  • the value network may either evaluate the multi-dimensional sample configuration or produce its estimate of a target value as the optimal value (relying on task abstraction tiers).
  • the system tries to correlate the value (which may be the “critic” in AC, A2C, and A3C) and may pre-estimate the reward signal the evaluator/simulator produces upon evaluation.
  • the reward signals may be produced by the evaluators that evaluate the produced candidate configurations.
  • a reward generator may evaluate the configuration within some physically realistic (the simulated system or the real system) environment and produce a reward signal (e.g., scalar reward 414).
  • the embodiment techniques may then use the reward signal within one of the many policy-gradient algorithms (e.g., REINFORCE, AC, A2C, A3C, TRPO, PPO, etc., which also depend on the configuration distribution, from the action policy network, and the value estimation) in order to create a back-propagation signal that updates all the weights within the action policy network (e.g., neural network 404).
  • the weights of the embodiment neural networks can still be updated and resonate with the reward signal that produces a back-propagation feedback signal for updating the weights of the embodiment neural networks.
  • the embodiment neural networks can begin to learn task abstraction (closer to the constant input) and converge to an optimal solution solving the technical problem of searching for the optimal configuration efficiently and effectively, by maximizing the reward using any one of the many policy-gradient algorithms known in the art.
  • Embodiments in this disclosure use weight-resonant neural networks.
  • the embodiment weight-resonant neural networks may receive a constant input on the input side and receive the reward signals from the output side, which allows the embodiment weight-resonant neural networks to resonate the networks’ parameters towards eventually producing the optimal configuration.
  • the embodiment techniques can work with multi-dimensional objectives (e.g., objectives O1, O2, O3, ..., OL as shown in FIGs. 4E and 4F). These objectives may be combined with corresponding weights in order to produce a single scalar objective (e.g., through a linear combination of objectives O1, O2, O3, ..., OL).
  • the embodiment techniques in effect search a particular zone of the Pareto frontier, and because of their data efficiency, the embodiment techniques can search in far larger combinatorial spaces than alternative approaches that model the whole frontier.
  • the embodiment techniques utilize the single-step compound-action RL.
  • the embodiment techniques do not need to model the state because there are only two states: the initial state (e.g., the blank state 302) and terminal state (e.g., the fully configured state 306).
  • the embodiment techniques rely on updating the policy distribution and the value estimation neural networks (or other statistical estimators) through weight resonance, gradually producing a policy distribution that is highly likely to produce optimal configurations near a zone parallel to the operating plane and optimizing reward.
  • embodiment techniques, by using the single-step compound action (e.g., action 304), do not need to model probability on broken domains. Thus, embodiment techniques do not need to discretize the continuous space. Rather, embodiment techniques rely on the reward feedback to learn unallowed moves that, in a multi-step formulation, would lead to a requirement for state modeling.
  • FIG. 6 is a block diagram of a computing system 600 that may be used for implementing the devices and methods disclosed herein, according to some embodiments. Specific devices may utilize all of the components shown or only a subset of the components, and levels of integration may vary from device to device. Furthermore, a device may contain multiple instances of a component, such as multiple processing units, processors, memories, transmitters, receivers, etc.
  • the computing system 600 includes a processing unit 602.
  • the processing unit includes a central processing unit (CPU) 614, memory 608, and may further include a mass storage device 604, a video adapter 610, and an I/O interface 612 connected to a bus 620.
  • CPU central processing unit
  • the bus 620 may be one or more of any type of several bus architectures including a memory bus or memory controller, a peripheral bus, or a video bus.
  • the CPU 614 may comprise any type of electronic data processor.
  • the memory 608 may comprise any type of non-transitory system memory such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous DRAM (SDRAM), read-only memory (ROM), or a combination thereof.
  • SRAM static random access memory
  • DRAM dynamic random access memory
  • SDRAM synchronous DRAM
  • ROM read-only memory
  • the memory 608 may include ROM for use at boot-up, and DRAM for program and data storage for use while executing programs.
  • the mass storage 604 may comprise any type of non-transitory storage device configured to store data, programs, and other information and to make the data, programs, and other information accessible via the bus 620.
  • the mass storage 604 may comprise, for example, one or more of a solid state drive, hard disk drive, a magnetic disk drive, or an optical disk drive.
  • the video adapter 610 and the I/O interface 612 provide interfaces to couple external input and output devices to the processing unit 602.
  • input and output devices include a display 618 coupled to the video adapter 610 and a mouse, keyboard, or printer 616 coupled to the I/O interface 612.
  • Other devices may be coupled to the processing unit 602, and additional or fewer interface cards may be utilized.
  • a serial interface such as Universal Serial Bus (USB) (not shown) may be used to provide an interface for an external device.
  • USB Universal Serial Bus
  • the processing unit 602 also includes one or more network interfaces 606, which may comprise wired links, such as an Ethernet cable, or wireless links to access nodes or different networks.
  • the network interfaces 606 allow the processing unit 602 to communicate with remote units via the networks.
  • the network interfaces 606 may provide wireless communication via one or more transmitters/transmit antennas and one or more receivers/receive antennas.
  • the processing unit 602 is coupled to a local-area network 622 or a wide-area network for data processing and communications with remote devices, such as other processing units, the Internet, or remote storage facilities.
  • one or more steps of the embodiment methods provided herein may be performed by corresponding units or modules.
  • methods and techniques described in this disclosure may be performed by an RL agent running on the at least one processor of one or more computing systems (e.g., computing system 600), and the RL agent may include the neural network(s) described in this disclosure.
  • the evaluator may be external or internal to the RL agent.
  • the respective units or modules may be hardware, software, or a combination thereof.
  • one or more of the units or modules may be an integrated circuit, such as field programmable gate arrays (FPGAs) or application-specific integrated circuits (ASICs).
  • FPGAs field programmable gate arrays
  • ASICs application-specific integrated circuits

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Molecular Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Public Health (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioethics (AREA)
  • Chemical & Material Sciences (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Complex Calculations (AREA)

Abstract

According to embodiments, a neural network running on at least one processor receives a constant input for a configuration design requiring N dimensions. The neural network outputs N probability distributions. The at least one processor generates a batch of sample configurations for the configuration design based on the N probability distributions. Each sample configuration of the batch of sample configurations corresponds to a different full configuration of a system. The at least one processor outputs the batch of sample configurations to an evaluator external to the neural network. The at least one processor updates parameters of the neural network based on a loss function.

Description

Deep-Reinforcement Learning (RL), Weight-Resonant System and Method for Fixed-Horizon Search of Optimality
TECHNICAL FIELD
[0001] The present disclosure relates generally to artificial intelligence (AI), and, in particular embodiments, to methods and apparatus for reinforcement learning (RL) for fixed-horizon search of configurations with high-dimensionality.
BACKGROUND
[0002] Many system designs can involve a set of adjustable configurations. In a configuration of a system, there may be many configuration dimensions, with each dimension representing a different aspect of the configuration of the system and corresponding to a different configuration parameter for that aspect of the configuration. For each configuration dimension, there may be a variety of available (discrete or continuous) options for that dimension. Some system designs have very large configuration spaces because the number of configuration dimensions is large, or the number of available options within each dimension is large, or both. For example, for a system whose configuration has 20 dimensions with each dimension having 8 options, the number of different configurations for the system can reach 20^8. Also, different configuration dimensions for a system can have different numbers of options. For instance, for a discrete system of fixed N dimensions, if each dimension has its own number of (fixed) options Mi, the number of possible configurations for the system could be M1 × M2 × ... × MN (i.e., ∏i Mi).
[0003] Searching within these configuration spaces (e.g., ∏i Mi possible configurations) to find an optimal configuration that meets a combination of objectives can be technically challenging. For example, for a system with 20^8 possible configurations (e.g., 20 dimensions with each dimension having 8 options), even if evaluating one configuration for a fully configured system over the combination of objectives would take just one second of computing time, evaluating all the possible configurations one-by-one could take more than 800 years of computing time to find the optimal configuration within all the possible configurations, which is inefficient and may not even be feasible. Furthermore, in many practical applications of system design, it can take far longer than one second to evaluate a fully configured system with one configuration, either directly or through simulation. Thus, techniques to improve the efficiency and performance of computer operations in finding the optimal configuration, particularly in a search space with a large number of dimensions or a large number of options within each dimension, are desired. A quick back-of-the-envelope check of these figures is shown below.
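As a rough check of the figures used in this example (the 20 dimensions, 8 options per dimension, and one-second evaluation time are the illustrative assumptions stated above):

```python
# Back-of-the-envelope check of the search-space example above.
dimensions = 20
options_per_dimension = 8
total_configurations = dimensions ** options_per_dimension   # 20^8, as stated in the text

seconds_per_evaluation = 1
total_seconds = total_configurations * seconds_per_evaluation
SECONDS_PER_YEAR = 60 * 60 * 24 * 365

print(f"{total_configurations:,} configurations")                                 # 25,600,000,000
print(f"~{total_seconds / SECONDS_PER_YEAR:.0f} years of one-by-one evaluation")  # ~812 years
```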
SUMMARY
[0004] According to embodiments, a neural network running on at least one processor receives a constant input for a configuration design requiring N dimensions. The neural network outputs N probability distributions. The at least one processor generates a batch of sample configurations for the configuration design based on the N probability distributions. Each sample configuration of the batch of sample configurations corresponds to a different full configuration of a system. The at least one processor outputs the batch of sample configurations to an evaluator external to the neural network. The at least one processor updates parameters of the neural network based on a loss function.
[0005] In some embodiments, the at least one processor may repeat the receiving the constant input, the outputting the N probability distributions, the generating the batch of sample configurations, the outputting the batch of sample configurations, and the updating the parameters for a plurality of iterations, wherein the constant input remains the same throughout the plurality of iterations. In some embodiments, the constant input may include a single constant value, N different constant one-hot vectors, or N vectors each having all 1s. In some embodiments, the receiving the constant input and the outputting the N probability distributions may be stateless. In some embodiments, the updating the parameters of the neural network may include updating weights of the neural network. In some embodiments, the N probability distributions may be joint probability distributions such that all of the N probability distributions are conditional on one another. In some embodiments, the loss function may be calculated based on rewards, the rewards generated based on performance metrics output by the evaluator. In some embodiments, the loss function may be updated by series-mode value estimation layers, and wherein the series-mode value estimation layers are updated based on the batch of sample configurations. In some embodiments, the loss function may be updated by branched-mode value estimation layers, and wherein the branched-mode value estimation layers are updated based on a task encoding tensor of the neural network. In some embodiments, the N probability distributions include N discrete probability density function (PDF) distributions, N continuous PDF distributions, or a mixture of M discrete PDF distributions and (N-M) continuous PDF distributions.
[0006] In so doing, embodiment techniques improve efficiency of memory utilization and performance of computer operations in finding the optimal configuration, particularly in a search space with a large number of dimensions or a large number of options within each dimension (or both).
BRIEF DESCRIPTION OF THE DRAWINGS
[0007] For a more complete understanding of the present invention, and the advantages thereof, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:
[0008] FIGs. 1A and 1B illustrate the general RL theory and algorithms, according to some embodiments;
[0009] FIG. 2 illustrates one possible application scheme of RL for searching for the optimal configuration in system design, according to some embodiments;
[0010] FIG. 3 illustrates an improved application scheme of RL for searching for the optimal configuration in system design, according to some embodiments;

[0011] FIG. 4A illustrates an example architecture of weight-resonance for fixed-length optimality search, according to some embodiments;
[0012] FIG. 4B illustrates more details of the example architecture of weight-resonance for fixed-length optimality search, according to some embodiments;

[0013] FIG. 4C illustrates more details of the example architecture of weight-resonance for fixed-length optimality search, according to some embodiments;

[0014] FIG. 4D illustrates more details of the example architecture of weight-resonance for fixed-length optimality search, according to some embodiments;
[0015] FIG. 4E illustrates an example architecture of weight-resonance for fixed-length optimality search in the case of N discrete configuration dimensions, according to some embodiments;

[0016] FIG. 4F illustrates an example architecture of weight-resonance for fixed-length optimality search in the case of N discrete configuration dimensions, according to some embodiments;
[0017] FIG. 5 illustrates a flow chart of a method of weight-resonance for fixed-length-dimension optimal configuration search, according to some embodiments; and
[0018] FIG. 6 is a block diagram of a computing system that may be used for implementing the devices and methods disclosed herein, according to some embodiments.
DETAILED DESCRIPTION OF ILLUSTRATIVE EMBODIMENTS
[0019] The structure and use of disclosed embodiments are discussed in detail below.
It should be appreciated, however, that the present disclosure provides many applicable concepts that can be embodied in a wide variety of specific contexts. The specific embodiments discussed are merely illustrative of specific structure and use of embodiments, and do not limit the scope of the disclosure.

[0020] Searching for an optimal configuration that meets a combination of objectives within a large configuration space is technically challenging. If there is a way to evaluate a particular configuration according to some objective metrics, a reward-driven deep reinforcement learning (D-RL) system may be devised to explore the configuration space using one of the many common policy reinforcement algorithms. Embodiments of this disclosure provide a set of D-RL system architectures that make such exploration of the configuration space effective in fixed-horizon configuration spaces and thus improve efficiency of memory utilization and performance of the computer operations.
[0021] Many system designs have a fixed set of configuration dimensions. In one non-limiting example, a system is designed to determine the best possible placement for the block macros defined in a netlist for a central processing unit (CPU). In the netlist, there could be a fixed number of block macros (e.g., dimensions, with each dimension corresponding to a different configuration parameter for configuring the location of a different block macro) whose locations (e.g., options for each dimension) need to be fixed before a global placer places all the other (standard) cells. In another non-limiting example, a system is designed to determine the best sizes for the cache hierarchy in a chip. In the cache hierarchy, there may be multiple cache levels (e.g., L0, L1, L2, L3, etc.). Caches at different levels (e.g., dimensions, with each dimension corresponding to a different configuration parameter for configuring the size of a different cache level) often do not have the same size, and there could be many size options (e.g., options for each dimension) for each cache. In yet another non-limiting example, a system is designed to determine the optimal amino-acid sequence of a given length that would produce the best antibody protein to fight a virus. In general, there could be 20 amino acid choices (e.g., options for each dimension), and proteins are generally between 50 to 2000 amino acids long (e.g., dimensions, with each dimension corresponding to a different configuration parameter for configuring a different amino acid's choice). It may be possible to evaluate the protein (e.g., amino-acid sequence) through topographic biochemical tools/analysis of virus/receptor (in this case, antibody) binding. However, evaluating each of the possible protein configurations to find an optimal protein configuration is computationally intensive.
[0022] The technical problem solved by this disclosure is to find the optimum configuration (e.g., the optimum design) for a given system. The system model can be too complex to define, describe, or model accurately (e.g., how the cache hierarchy impacts the CPU design efficiency for a given benchmark set). Further, the combinatorics of the possible configurations can define a huge space of choices (e.g., the placement problem for the block macros defined in a netlist described above, or the possible proteins of length 100 that can be synthesized from the 20 available amino acids).
[0023] To search for an optimal solution, reinforcement learning (RL) may be utilized. Reinforcement learning is a machine learning training method based on rewarding desired behaviors and/or punishing undesired ones. In general, a reinforcement learning agent is able to perceive and interpret its environment, take actions, and learn through trial and error. However, applying RL to the search of an optimal solution itself may be technically challenging.
[0024] The method of reinforcement learning may include taking an action (e.g., producing the configuration), collecting a reward (i.e., evaluating the configuration), and using policy reinforcement algorithms to find the optimal action. The policy reinforcement algorithm used may be any of the policy reinforcement algorithms known in the art, such as REINFORCE, actor critic (AC), advantage actor critic (A2C), asynchronous advantage actor critic (A3C), trust region policy optimization (TRPO), and proximal policy optimization (PPO).
[0025] Further, artificial neural networks (ANNs) may be used to produce the optimal action (i.e., an optimal configuration). Traditionally, an ANN requires input-output data pairs and uses back-propagation from some differentiable loss function (dependent on input-output datasets) to update the neural network in training sessions.

[0026] FIGs. 1A and 1B illustrate the general RL theory and algorithms, according to some embodiments. As FIG. 1A shows, the working material of the RL may include episodes of the state-action-state-reward sequence. The episodes start from an initial state S0 and end in a terminal state ST (e.g., one episode has all dimensions configured, such as going through the path once from S0 to ST in FIG. 1A). The general RL theory and algorithms may be devised for episodes of arbitrary length.
[0027] As FIG. 1B further shows, at state Si-1 (1 ≤ i ≤ T), an action Ai is taken, and the state transitions to state Si. The policy-gradient family of RL algorithms relies on action estimation. The policy-gradient family of RL algorithms tries to model π(ai | Si-1), which is the probability that the action chosen in state Si-1 will be ai, as π(ai | Si-1; θ). Here, θ represents the set of modeling parameters (e.g., the weights in the neural network). The modeling parameters may be trained with the objective that the greatest value is accrued in each episode (i.e., the discounted sum of rewards is maximized), using one of the many members of the PG family of RL algorithms (e.g., REINFORCE, AC, A2C, A3C, TRPO, and PPO).
[0028] FIG. 2 illustrates one possible application scheme of RL for searching for the optimal configuration in system design, according to some embodiments. Some system design configuration exploration may produce the following scheme. As shown in FIG. 2, actions may be configuration selection in each of the configuration dimensions. For example, at state S0, the action may be configuring C1 to select an option value for dimension 1 of the configuration; at state S1, the action may be configuring C2 to select an option value for dimension 2 of the configuration; at state Si-1, the action may be configuring Ci to select an option value for dimension i of the configuration (1 ≤ i ≤ T); and so on. For a system design, the number of configuration dimensions is fixed (e.g., the number of dimensions in the example shown in FIG. 2 is a fixed number, T in FIG. 2). In each dimension, various options may be available.

[0029] RL in fixed-horizon optimality search (e.g., fixed-length episodes) needs to handle a fixed horizon (e.g., the number of dimensions, T in FIG. 2) for all episodes (i.e., T is always fixed) because the number of configuration dimensions is fixed. In the configurations, a given dimension may offer discrete or continuous options (e.g., actions).
[0030] However, some application schemes of RL, such as the scheme described above with respect to FIG. 2, can experience technical difficulties and complexities. The reward (which is used as a measure of design goodness) can only be evaluated once all configuration dimensions are configured (with one of the available options in each dimension). That is, only a fully configured system can be evaluated by a system evaluation mechanism (e.g., power/performance/area (PPA) in chip design, for a chip design that has multiple possible configuration dimensions).
[0031] FIG. 3 illustrates an improved application scheme of RL for searching for the optimal configuration in system design, according to some embodiments. Because the episode (e.g., a configuration of all dimensions) is of a fixed horizon (e.g., the number of the dimensions, or the number of the configuration parameters, is fixed), and because the reward is only collected at the end, embodiments in this disclosure reformulate the application scheme of RL as a single-step compound action. For this embodiment, different permutations of configuration options for different dimensions may be evaluated to determine which produces the best reward. The reformulated application scheme improves the problem solution and allows for more effective and efficient action probability estimation. As shown in FIG. 3, the scheme improvement renders the RL system a single-step, stateless episode. The single-step compound action 304 in FIG. 3 is the combination of configuring all dimensions (e.g., the combination of "Configure C1 for dimension 1, Configure C2 for dimension 2, Configure C3 for dimension 3, Configure C4 for dimension 4, ..., Configure CT-1 for dimension T-1, Configure CT for dimension T"). With the single-step compound action 304, the improved application scheme is "stateless" in that there are no intermediate states (e.g., states S1, S2, S3, ..., ST-1 as shown in FIG. 1A and FIG. 2). There are only two states in the improved scheme shown in FIG. 3: a blank state 302 with no dimension configured and a fully configured state 306 with all dimensions configured. At the blank state 302, the single-step compound action 304 is performed. Then, the state transitions to the fully configured state 306, and reward RT is collected. Other application schemes of RL (e.g., the scheme shown in FIG. 2) need to consider constraints between dimensions. In contrast, with the improved application scheme of RL in FIG. 3, if an early Ci constrains Cj (i < j), embodiment techniques ignore such constraints by modeling joint probabilities instead of the more complex conditional probabilities. The more complex conditional probabilities often cannot be parametrically defined due to complex state modifications caused by each configuration choice (in the multi-action, multi-step mode), which leads to complex domains that render such parametrization impossible and force arbitrary treatments of topologically complex probability domains (e.g., a 2-D placement domain with many holes, including some overlapping ones). Such chopped-up domains, in the multi-step, stateful approach (particularly when continuous), are also arbitrarily discretized to provide a means for probability assignment, leading to further loss of fidelity.
[0032] In many conventional RL application schemes, there is a clear input-output data set. With the embodiment single-step compound action (e.g., action 304), there are no inputs (e.g., inputs are not necessary) when it comes to optimality search. The goal of the problem solution here is to find the optimal configurations. Conventional RL libraries assume multi-step episodes (e.g., FIG. 1A) and do not have any optimizations for single-step compound-action scenarios. Embodiments of this disclosure focus on the scenario of single-step, compound-action RL. The embodiment techniques can also solve the same type of problems that are solved by various multi-variate multi-armed bandit techniques. The embodiment techniques are still distinct from the solutions of multi-variate multi-armed bandits in many ways. For example, the embodiment techniques deal with multiple dimensions. Furthermore, if a configuration dimension must be represented with a continuous domain, that dimension alone would present an infinite-armed bandit. In addition, any configuration dimension (of the multiple configuration dimensions) that offers only discrete options is, individually, a multi-armed bandit.
[0033] FIG. 4A illustrates an example architecture 400 of weight-resonance for fixed-length optimality search, according to some embodiments. The example architecture 400 includes a neural network 404 (e.g., an action policy neural network). The neural network 404 includes layers of nodes with a learnable parameter set θ (e.g., weights among the nodes). The input 402 to the neural network 404 may be a constant for the blank state. That means the input 402 may remain the same and unchanged throughout the search for the optimal configuration. In some embodiments, the input 402 may be a single constant value (e.g., a non-zero number). In some embodiments, the constant input 402 may be a tensor of 1s or one-hot vectors (e.g., a group of bits among which the allowable combinations of values are only those with a single high (1) bit and all the others low (0)). In some other embodiments, the input 402 may be a single constant value (e.g., a non-zero number), and the neural network 404 may produce an output of the same size as the output produced by the tensor of 1s or one-hot vectors.
[0034] The output 406 of the neural network 404 is a compound-action probability density function (PDF), or parameters of the PDF. The probability distribution gives the probability of each outcome of a random event (e.g., a possible option is configured for a dimension). The compound-action PDF is a function used to define the probabilities of different possible occurrences (e.g., the probabilities of different options configured for each of the different dimensions). The compound-action PDF in the output 406 may be represented as π(a1, a2, ..., aT-1, aT; θ), where ai is a possible option in configuration dimension i. Based on the compound-action PDF, a batch of sample configurations 408 (e.g., candidate configurations) is generated. Each sample configuration of the batch of sample configurations is a sample of a full configuration of all T dimensions, with each dimension of the total T dimensions configured with a corresponding dimension option. In some embodiments, a batch of sample configurations is generated based on the compound-action PDF such that the distribution of the configured options for dimensions in the batch of sample configurations matches the probability distribution of the compound-action PDF, π(a1, a2, ..., aT-1, aT; θ).
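As a non-limiting illustration of the sampling step described in this paragraph, the following sketch assumes that every configuration dimension is categorical and that per-dimension logits are already available from the neural network 404; the dimension count, option counts, and batch size are hypothetical.

```python
# Minimal sketch: drawing a batch of sample configurations from the
# compound-action distribution pi(a1, ..., aT; theta), assuming every
# dimension is categorical. Sizes here are illustrative only.
import torch

T = 5                         # number of configuration dimensions (fixed horizon)
options = [8, 8, 4, 16, 8]    # hypothetical option counts m1..mT
batch_size = 64

# Hypothetical per-dimension logits as they might come out of the
# action-distribution head (in practice these depend on theta).
logits = [torch.randn(m) for m in options]

# One categorical distribution per dimension; sampling them realizes the
# (approximated) joint distribution described above.
dists = [torch.distributions.Categorical(logits=l) for l in logits]
batch = torch.stack([d.sample((batch_size,)) for d in dists], dim=1)  # (64, 5)

# log pi(config) = sum of per-dimension log-probabilities, used later by the loss.
log_probs = torch.stack(
    [d.log_prob(batch[:, i]) for i, d in enumerate(dists)], dim=1
).sum(dim=1)                                                          # (64,)
```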
[0035] The generated batch of sample configurations 408 may be sent across a non-differentiability wall between systems to an evaluator 410. The evaluator 410 may evaluate each of the fully configured systems (using the batch of sample configurations). The evaluator 410 may be any external simulation/evaluation system outside of and separate from the neural network 404. The evaluator 410 can evaluate the fully-configured system for finding the optimal configuration of the system design. Practical uses of the embodiment technique may require the evaluator 410 to be fast enough, which depends on the time available for the search of the optimal configuration and the complexity of the configuration space. The evaluator 410 may output multi-dimensional performance metrics 412 for each fully configured system using a sample configuration of the batch of sample configurations (e.g., power/performance/area (PPA) in designing a chip system). Then, a scalar reward 414 is generated based on the multi-dimensional performance metrics. For example, the multi-dimensional performance metrics 412 may be normalized and weighted to generate a single number as the scalar reward 414 (e.g., weight of power x normalized power + weight of performance x normalized performance + weight of area x normalized area).
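For illustration only, the following sketch shows one way the multi-dimensional performance metrics 412 might be normalized and weighted into the scalar reward 414; the metric names, reference values, and weights are assumptions, not values prescribed by this disclosure.

```python
# Illustrative sketch of collapsing multi-dimensional performance metrics
# (e.g., power/performance/area) into a single scalar reward via a weighted
# sum of normalized metrics. All numbers below are placeholder assumptions.
def scalar_reward(metrics, weights, reference):
    """metrics, weights, reference: dicts keyed by objective name."""
    reward = 0.0
    for name, value in metrics.items():
        normalized = value / reference[name]        # one simple normalization choice
        reward += weights[name] * normalized
    return reward

metrics = {"power": 1.8, "performance": 3.1, "area": 12.5}   # from the evaluator
weights = {"power": -0.3, "performance": 0.5, "area": -0.2}  # negative = minimize
reference = {"power": 2.0, "performance": 3.0, "area": 10.0}
print(scalar_reward(metrics, weights, reference))
```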
[0036] The scalar reward 414 is used to compute the statistical loss function 416. The statistical loss function 416 may be computed based on the scalar reward 414 using any member of the policy-gradient family of RL algorithms known in the art, including but not limited to REINFORCE, AC, A2C, A3C, TRPO, and PPO. The loss function is a function that maps values of one or more variables onto a real number intuitively representing some cost associated with those values. The loss function evaluates how well a neural network (e.g., the neural network 404) models the dataset. If the modeling is off, the loss function outputs a larger number than it would for a better-modeled neural network. For back-propagation, the loss function may depend on (and may be a differentiable function of) the neural network 404's parameters ("weights") so that those weights can be adjusted through back-propagation. In these differentiable functions, the scalar reward 414 plays the role of a standing (non-differentiable) parameter simply because the dependence of the reward 414 on the configuration is unknown to the agent. In other words, the reward 414 is rendered by the environment (e.g., the evaluator 410 (which may be a simulator) evaluating the fully configured system). Various objective functions (for maximization or minimization) may be defined by each of the policy-gradient family of algorithms. The objective function (loss) may be algebraically evaluated using the algebra supported by an auto-differentiation software library (e.g., TensorFlow, PyTorch, etc.). Then, through the back-propagation 418 algorithm supported by the aforementioned auto-differentiation library, the parameter set θ of the neural network 404 can be updated to increase the reward.
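As a non-limiting illustration, the following sketch expresses a REINFORCE-style loss of the kind described in this paragraph; the mean baseline is an added assumption, and other members of the policy-gradient family (e.g., A2C or PPO) would formulate the objective differently.

```python
# Sketch of a REINFORCE-style loss, assuming `log_probs` and `rewards` are
# tensors of shape (batch_size,) as in the earlier sampling sketch.
import torch

def reinforce_loss(log_probs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    advantages = rewards - rewards.mean()            # crude mean baseline (assumption)
    # The reward enters as a non-differentiable constant; only log pi(...)
    # carries gradient back into the network parameters theta.
    return -(advantages.detach() * log_probs).mean()

# Typical use with an optimizer (back-propagation 418):
# loss = reinforce_loss(log_probs, rewards)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```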
[0037] The process described above (e.g., from input 402 to back-propagation 418) may repeat for multiple iterations to find the optimal configuration. The input 402 to the neural network 404 remains constant and unchanged throughout all the iterations. In some embodiments, the number of iterations may be a pre-configured number. In some other embodiments, the process may stop when the scalar reward 414 is equal to or above a pre-determined threshold level. The sample configuration resulting in the highest value of the scalar reward 414 may be selected as the optimal configuration.
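The following self-contained toy (for illustration only, not the disclosed system) ties the above loop together: a constant input, a small policy network, batch sampling, a synthetic reward, and a REINFORCE-style update. The synthetic reward (matching a hidden target configuration) and all sizes are assumptions chosen purely to make the sketch runnable.

```python
# Toy "weight-resonance" loop: the input never changes; only the weights update.
import torch
import torch.nn as nn

T, m, batch_size = 6, 5, 128                 # 6 dimensions, 5 options each (assumed)
target = torch.randint(0, m, (T,))           # hidden "optimal" configuration (toy)

policy = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, T * m))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
constant_input = torch.ones(1, 1)            # the input stays constant across iterations

for iteration in range(300):
    logits = policy(constant_input).view(T, m)
    dists = torch.distributions.Categorical(logits=logits)
    batch = dists.sample((batch_size,))      # (batch_size, T) sample configurations
    log_probs = dists.log_prob(batch).sum(dim=1)

    rewards = (batch == target).float().sum(dim=1)   # toy evaluator: matches to target
    advantages = rewards - rewards.mean()
    loss = -(advantages * log_probs).mean()          # REINFORCE-style loss

    optimizer.zero_grad()
    loss.backward()                                  # back-propagation with constant input
    optimizer.step()

    if rewards.max() == T:                           # optional early-stop criterion
        print("found target configuration at iteration", iteration)
        break
```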
[0038] FIG. 4B illustrates more details of the example architecture 400 of weight-resonance for fixed-length optimality search, according to some embodiments. As explained above, the input 402 to the neural network 404 remains constant throughout the search for the optimal configuration. Even so, the neural network 404 can still learn. The loss (e.g., computed from the statistical loss function 416) that is back-propagated depends only on the reward of the generated configuration and the PDF/distributional/sampling output 408 of the neural network 404. In general, the loss is defined so that it may decrease if the overall discounted reward increases. The various policy-gradient algorithms determine the degree of this update. Any policy-gradient algorithm (e.g., REINFORCE, AC, A2C, A3C, TRPO, and PPO) known in the art may be used to compute the loss, although some algorithms may perform better than others. For example, PPO is generally considered better than A3C, A3C is generally considered better than A2C, A2C is generally considered better than AC, and AC is generally considered better than REINFORCE. Also, each of these algorithms comes with hyper-parameters that determine the importance of various components in the objective function (or loss) that each algorithm formulates, and the embodiments are not limited to the algorithms listed above. The neural network 404 can learn because the neural network parameters (e.g., θ) can be updated through stochastic gradient descent in order to minimize the loss function using a back-propagation algorithm. So, despite a constant input, the weights of the neural network 404 can keep updating (e.g., resonating) towards producing a joint distribution for the optimal configuration, using any of the policy-gradient algorithms.
[0039] FIG. 4C illustrates more details of the example architecture 400 of weight-resonance for fixed-length optimality search, according to some embodiments. Embodiments in this disclosure approximate the joint probability in the output 406 as the product of conditional distributions, and these conditional distributions may be parametrically approximated as marginal distributions because parameter dependence is captured through one or more task abstraction layers included in the neural network 404. The one or more task abstraction layers can use self-attention neural layers, recurrent neural network (RNN) layers, or multi-layer perceptron (MLP) layers in order to build interdependency into the internal encoding of the neural network 404 that is learned for the parameter generator layers of the neural network 404. The neural network 404 may include the one or more task abstraction layers in order to jointly model the task across all available action components. In the limit of optimality, with the joint distribution approximated as a product of conditional distributions, the output 406 of the neural network 404 could achieve the result of the conditional probability.
[0040] FIG. 4D illustrates more details of the example architecture 400 of weight-resonance for fixed-length optimality search, according to some embodiments. The neural network 404 may include one or more task abstraction layers 422, the task encoding tensor 424, and the action distribution head 426. The one or more task abstraction layers 422 may be used to ensure any distributional (joint) dependency is captured for the particular configuration optimality search task for which the output 406 of the neural network 404 is generated, and keeping the input 402 of the neural network 404 at a constant value helps the neural network 404 to resonate to this state.
[0041] The task encoding tensor 424 may be used to generate the action probability PDFs or the parameters that model these action PDFs (embodiments in this disclosure can assume at this stage that marginal probability parameters may be used, because the marginal probability parameters' dependency can be captured in the earlier stages of the neural network 404 through the one or more task abstraction layers 422 (e.g., neck) and the one or more action distribution layers 426 (e.g., head)). The one or more task abstraction layers 422 may include any or all of self-attention neural layers, RNN (e.g., gated recurrent unit (GRU), long short-term memory (LSTM)) neural layers, or MLP neural layers. The number of layers and the compositions of layers in the one or more task abstraction layers 422 may depend on task complexity and may depend on tradeoffs related to training complexity of more complex networks (e.g., self-attention layers with a larger number of heads or deeper networks). The number of layers and the compositions of layers in the one or more task abstraction layers 422 may also be related to numerical rounding errors and back-propagation signal loss in very deep networks. Furthermore, the action distribution layers 426 may include any or all of self-attention neural layers, RNN (e.g., GRU, LSTM) neural layers, or MLP neural layers.
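As a non-limiting sketch of the neck/head split described above, the following assumes plain MLP task-abstraction layers and one categorical (logit-producing) head per configuration dimension; the layer sizes, option counts, and class name are illustrative assumptions rather than the disclosed architecture.

```python
# Sketch of one possible neck/head split: MLP task-abstraction layers 422
# producing a task encoding tensor 424, and an action-distribution head 426
# emitting logits for each categorical dimension.
import torch
import torch.nn as nn

class WeightResonantPolicy(nn.Module):
    def __init__(self, input_dim, encoding_dim, options_per_dim):
        super().__init__()
        self.task_abstraction = nn.Sequential(        # "neck"
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, encoding_dim), nn.ReLU(),
        )
        self.distribution_heads = nn.ModuleList(      # "head", one per dimension
            [nn.Linear(encoding_dim, m) for m in options_per_dim]
        )

    def forward(self, constant_input):
        encoding = self.task_abstraction(constant_input)   # task encoding tensor
        return [head(encoding) for head in self.distribution_heads]

policy = WeightResonantPolicy(input_dim=4, encoding_dim=64,
                              options_per_dim=[8, 8, 4, 16])
constant_input = torch.ones(1, 4)                     # stays fixed across iterations
per_dimension_logits = policy(constant_input)         # list of (1, m_i) logit tensors
```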
[0042] The policy-gradient algorithms (e.g., REINFORCE, AC, A2C, A3C, TRPO, and PPO) may be predicated on estimating advantages through a value estimator. For advantage estimation, a value estimation tier may be attached to the one or more task abstraction layers 422 (e.g., the branched-mode one or more value estimation layers 432) or attached to the sample configurations 408 more directly (the series-mode one or more value estimation layers 434). The branched-mode one or more value estimation layers 432 or the series-mode one or more value estimation layers 434 may include any or all of self-attention neural layers, RNN (e.g., GRU, LSTM) neural layers, or MLP neural layers. In some embodiments, the value estimation tier (e.g., value estimation layers 432 or value estimation layers 434) may also be its own separate, parallel neural network, with a separate constant input for estimating the optimal configuration's value through a similar weight-resonance scheme described in this disclosure. The two networks may be joined at the loss computation node and may both be updated (resonating towards optimality) through the same back-propagation mechanisms mentioned earlier.
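For illustration only, the following sketch shows where the two value-estimation variants could attach: a branched-mode head reading the task encoding tensor and a series-mode head reading a sampled configuration directly; all sizes and layer choices are assumptions.

```python
# Sketch of the two value-estimation attachments described above.
import torch
import torch.nn as nn

encoding_dim, num_dims = 64, 4

branched_value_head = nn.Sequential(      # reads the task encoding tensor 424 (layers 432)
    nn.Linear(encoding_dim, 32), nn.ReLU(), nn.Linear(32, 1)
)
series_value_head = nn.Sequential(        # reads a sample configuration 408 (layers 434)
    nn.Linear(num_dims, 32), nn.ReLU(), nn.Linear(32, 1)
)

task_encoding = torch.randn(1, encoding_dim)          # stand-in for the neck output
sample_config = torch.tensor([[3., 1., 0., 7.]])      # one sampled configuration

v_branched = branched_value_head(task_encoding)       # estimate of the likely optimal reward
v_series = series_value_head(sample_config)           # estimate for this specific configuration
# In series mode the value head would get its own loss (e.g., MSE against the reward);
# in branched mode its loss would be folded into the policy loss before back-propagation.
```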
[0043] If the value estimation tier is attached to the series-mode one or more value estimation layers 434, the value estimation tier may have its own independent loss function, which helps the value estimation tier to estimate the reward for any specific configuration. So, in the series mode, there may be two loss functions. One loss function (e.g., the statistical loss function 416) is for the policy estimator, and the other loss function (not shown) is for the value estimation tier.
[0044] If the value estimation tier is attached to the branched mode one or more value estimation layers 432, the two loss functions may be combined into one loss function (e.g., the statistical loss function 416) for back-propagation. The differences in how the loss function(s) are used in the two modes are because the value estimation tier, in the series mode, depends on the configuration, not the policy that generates the configuration or the task abstraction. On the other hand, in the branched mode, the value estimation tier may produce, over multiple iterations, a general estimate of the most likely optimal reward.
[0045] FIG. 4E illustrates an example architecture 450 of weight-resonance for fixed-length optimality search in the case of N discrete configuration dimensions (N may be T described with respect to FIG. 3), according to some embodiments. Embodiment techniques described with respect to FIGs. 4A-4D may be applied to the example architecture 450 in FIG. 4E. The example architecture 450 supports the case where all configuration dimensions are categorical and shows the number of parameters that need to be generated by the policy estimation head 425. For the example architecture 450, the number of discrete options in each dimension (from m1 to mN) needs to be known.
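As a small numerical illustration of this point, the sketch below counts the distribution parameters the policy estimation head 425 would need to emit for hypothetical option counts m1 to mN, and contrasts that with the size of the configuration space itself.

```python
# With N categorical dimensions having m1..mN options, the policy estimation
# head emits one logit per option, i.e. sum(m_i) parameters in total.
options_per_dim = [8, 8, 4, 16, 20]          # m1 .. mN (assumed example values)

num_logits = sum(options_per_dim)            # distribution parameters from the head
num_configurations = 1
for m in options_per_dim:
    num_configurations *= m                  # size of the full search space

print(num_logits, num_configurations)        # 56 logits vs. 81,920 configurations
```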
[0046] In some embodiments, the configuration dimensions (in an N-dimensional configuration) may include any combination of discrete and/or continuous distributions. The continuous distributions (not shown in FIG. 4E) may be modeled as parametric distributions such as Gaussian, Beta, etc., with the distribution chosen appropriately for that dimension's continuous search. For example, if the configuration dimension has bounded continuous options, embodiments may use a Beta distribution to model it. If the dimension has unbounded options, which may not be common, embodiments may use a Gaussian distribution to model it.
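For illustration only, the following sketch models one bounded and one unbounded continuous dimension with Beta and Gaussian (Normal) distributions, as suggested above; the parameter values are hand-written assumptions, whereas in the architecture they would be produced by the policy head.

```python
# Sketch of continuous configuration dimensions modeled with parametric
# distributions: Beta for a bounded option range, Normal for an unbounded one.
import torch
from torch.distributions import Beta, Normal

# Bounded dimension, sampled in [0, 1] and rescaled to an assumed range [low, high]:
bounded = Beta(torch.tensor(2.0), torch.tensor(5.0))
sample_in_unit_interval = bounded.sample()
low, high = 0.5, 4.0
bounded_option = low + (high - low) * sample_in_unit_interval

# Unbounded dimension:
unbounded = Normal(loc=torch.tensor(0.0), scale=torch.tensor(1.0))
unbounded_option = unbounded.sample()

# Joint log-probability contribution of the two continuous dimensions:
log_prob = bounded.log_prob(sample_in_unit_interval) + unbounded.log_prob(unbounded_option)
```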
[0047] As shown in FIG. 4E, the input 402 to the neural network is a 1-hot constant input. In other embodiments, the input 402 to the neural network may be all 1s. In yet other embodiments, the input 402 to the neural network may be a constant value (e.g., a non-zero number). The neural network may include the task encoder 423 (e.g., the one or more task abstraction layers 422) and the policy estimation head 425 (e.g., the one or more action distribution layers 426).
[0048] The output 406 of the neural network is the attribute discrete distributions (action probability). Based on the attribute discrete distributions, a batch of sample configurations 408 may be generated. Then, the batch of sample configurations 408 is sent to an evaluator for evaluation. The evaluator (e.g., evaluator 410) may be a multi-objective reward generator 411. The multi-objective rewards generated from the generator 411 may be transformed (e.g., through a weighted sum of all objective metrics) to generate a single (real-numbered) reward 415 (e.g., scalar reward 414), which is used by the policy enforcement 419 (e.g., the statistical loss function 416 and the back-propagation 418) to update the parameters (e.g., weights) of the neural network. The weighting may be based on a linear combination of the objectives, for example, reward = w1·O1 + w2·O2 + ... + wL·OL, to generate the reward. For L objectives O1, O2, ..., OL, wj denotes the weight corresponding to objective Oj, and Oj denotes the performance metric corresponding to objective Oj. Also as shown in FIG. 4E, an estimation tier (e.g., the branched-mode one or more value estimation layers 432) is attached to the task encoder 423.
[0049] FIG. 4F illustrates an example architecture 452 of weight-resonance for fixed-length optimality search in the case of N discrete configuration dimensions, according to some embodiments. The example architecture 452 in FIG. 4F is similar to the example architecture 450 in FIG. 4E except that the estimation tier (e.g., the series-mode one or more value estimation layers 434) is attached to the sample configurations 408.
[0050] FIG. 5 illustrates a flow chart of a method 500 of weight-resonance for fixed-length-dimension optimal configuration search, according to some embodiments. The method 500 may be carried out or performed by one or more processing units of a computing device. Examples of the processing units include, but are not limited to, graphics processing units (GPUs), tensor processing units (TPUs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), artificial intelligence (AI) accelerators, or combinations thereof. The method 500 may also be carried out or performed by routines, subroutines, or modules of software executed by the one or more processing units. The method 500 may further be carried out or performed by a combination of hardware and software. Coding of the software for carrying out or performing the method 500 is well within the scope of a person of ordinary skill in the art having regard to the present disclosure. The method 500 may include additional or fewer operations than those shown and described and may be carried out or performed in a different order. Computer-readable code or instructions of the software executable by the one or more processing units may be stored on a non-transitory computer-readable medium, such as, for example, the memory of a computing device.
[0051] The method 500 starts at operation 502, where a neural network running on at least one processor receives a constant input for a configuration design requiring N dimensions. At operation 504, the neural network outputs N probability distributions for N dimensions, each of the N probability distributions for a different dimension of the N dimensions. At operation 506, the at least one processor generates a batch of sample configurations for the configuration design based on the N probability distributions.
Each sample configuration of the batch of sample configurations corresponds to a different full configuration of a system (e.g., each dimension of the N dimensions for configuring the system is configured with a configuration option). At operation 508, the at least one processor outputs the batch of sample configurations to an evaluator external to the neural network. At operation 510, the at least one processor updates parameters of the neural network based on a loss function.
[0052] In some embodiments, the at least one processor may repeat the receiving the constant input, the outputting the N probability distributions, the generating the batch of sample configurations, the outputting the batch of sample configurations, and the updating the parameters for a plurality of iterations, wherein the constant input remains the same throughout the plurality of iterations. In some embodiments, the constant input may include a single constant value, N different constant one-hot vectors, or N vectors each having all 1s. In some embodiments, the receiving the constant input and the outputting the N probability distributions may be stateless. In some embodiments, the updating the parameters of the neural network may include updating weights of the neural network. In some embodiments, the N probability distributions may be joint probability distributions such that all of the N probability distributions are conditional on one another. In some embodiments, the loss function may be calculated based on rewards, the rewards generated based on performance metrics output by the evaluator. In some embodiments, the loss function may be updated by series-mode value estimation layers, and wherein the series-mode value estimation layers are updated based on the batch of sample configurations. In some embodiments, the loss function may be updated by branched-mode value estimation layers, and wherein the branched-mode value estimation layers are updated based on a task encoding tensor of the neural network. In some embodiments, the N probability distributions include N discrete probability density function (PDF) distributions, N continuous PDF distributions, or a mixture of M discrete PDF distributions and (N-M) continuous PDF distributions.
[0053] Embodiments of this disclosure provide techniques to train a policy-gradient algorithm that is stateless and relies on weight resonance (with a constant input) to perform optimality search in multi-dimensional design configuration spaces. Design configurations may be any aspect of a design of a system that can be expressed in terms of a discrete/categorical or continuous variable. A category is any set of choices that can be indexed by a fixed index set (the probability space of the category).
[0054] Policy-gradient algorithms using the embodiment techniques are improved by a single-step compound-action policy, and the architectural frameworks are provided to implement the weight-resonance models that produce the policy distributions and value estimations used in the policy-gradient algorithms. In contrast to other known approaches (e.g., modeling the full Pareto frontier), embodiments of this disclosure focus the search for the optimal configuration on a segment of the frontier that is closest to surfaces parallel to the desired operating surface. The operating surfaces are parallel (L-1)-dimensional surfaces in the L-dimensional multi-objective metric space, and are effectively generated subject to the function that transforms (combines, e.g., through weighted sums) the multi-objective metrics O1 to OL to a single reward value. Such functions define parallel surfaces in the L-dimensional space of objective metrics as the reward value (e.g., the weighted sum) varies.
[0055] Embodiments in this disclosure solve the technical problem of optimal configuration search, where the dimensions of a configuration are fixed, and where the configuration options selectable within each dimension are also fixed, with either a discrete or a continuous parametric probability distribution for every possible option within each dimension. When the cardinality of these dimensions (and of the options within each dimension) grows, or when high-dimensional joint probability spaces arise, searching for the optimal configuration becomes a particularly hard technical problem. For example, if there are N configuration dimensions and M discrete options in each configuration dimension, such a configuration space has M^N possible configurations to select from. More generally, the number of possible configurations can be M1 × M2 × ... × MN for N configuration dimensions with discrete options (Mi options for configuration dimension i).
[0056] Furthermore, if the options within a given dimension come from a continuous domain rather than from a discrete domain, the search space offers innumerable possibilities. Using the embodiment techniques, the options from the continuous domain may be handled by using parametric distributions (e.g., Gaussian distributions for unbounded domains and/or Beta distributions for bounded domains). So, embodiment techniques are still applicable to searching for the optimal configurations within the continuous spaces separately from the discrete spaces, or jointly along with the discrete spaces (mixtures of spaces, e.g., options for some dimensions in a configuration). According to embodiments, choices within each continuous dimension can be modeled through a parametric probability distribution (with a finite number of parameters). For example, the Gaussian and the Beta distributions may be used among such parametric probability distributions, which can be defined with a finite set of parameters but can produce an innumerable set of options.
[0057] To solve the technical problem of high-combinatorial optimality search, embodiments of this disclosure provide deep-RL, weight-resonant systems and methods, which allow the problem of high-combinatorial optimality search to be solved as a stateless, single-step, compound-action RL problem. The optimal configurations could be where the PDF has its multi-dimensional maxima. These are the maximum-likelihood optimal location(s). In some embodiments, the neural networks that produce the selection probabilities π(a1, a2, ..., aT; θ) and the value estimation (estimating the value of the optimal configuration) do not receive a varying input that depends on a state that changes with each selection of an option. Instead, the neural networks which model the selection probabilities ("action probabilities" or "policies"), and the embodiment optimal value estimations, have constant inputs. Through D-RL search, the neural networks receive policy-gradient feedback (using any of the many policy-gradient algorithms known in the art) based on a reward estimation for the fully-configured system, which is produced in one compound-action step. The policy network for these policy-gradient algorithms may be the same as the embodiment multi-dimensional option selection probabilities ("action probabilities" or "policies") network (e.g., neural network 404). The value network of these algorithms may be the same as the embodiment multi-dimensional value estimation network. So, both the action policy network (e.g., neural network 404) and the value estimation network (e.g., one or more value estimation layers 432 or one or more value estimation layers 434) may receive constant inputs.
[0058] The embodiment neural network system (or any other parametric prediction mechanism with the ability to be adjusted through gradient descent, back-propagation, etc.) may also include a (potentially common) task abstraction tier, prior to final layers that produce action probability distribution and/or value estimation. The action policy neural network (e.g., neural network 404) may produce candidate configurations through probability estimation and sampling techniques.
[0059] The value network may either evaluate the multi-dimensional sample configuration or produce its estimate of a target value as the optimal value (relying on task abstraction tiers). Here, the system tries to correlate the value (which may be the "critic" in AC, A2C, and A3C) and may pre-estimate the reward signal the evaluator/simulator produces upon evaluation. The reward signals may be produced by the evaluators that evaluate the produced candidate configurations. For each candidate configuration produced through sampling of the multi-dimensional distributions produced by the action policy network (e.g., neural network 404), a reward generator may evaluate the configuration within some physically realistic (the simulated system or the real system) environment and produce a reward signal (e.g., scalar reward 414). The embodiment techniques may then use the reward signal within one of the many policy-gradient algorithms (e.g., REINFORCE, AC, A2C, A3C, TRPO, PPO, etc., which also depend on the configuration distribution from the action policy network, and the value estimation) in order to create a back-propagation signal that updates all the weights within the action policy network (e.g., neural network 404).
[0060] In short, while the input to the action policy network is fixed to be a constant, the weights of the embodiment neural networks (e.g., the neural network 404 and the value estimation neural network) can still be updated and resonate with the reward signal, which produces a back-propagation feedback signal for updating the weights of the embodiment neural networks. In so doing, the embodiment neural networks can begin to learn task abstraction (closer to the constant input) and converge to an optimal solution, solving the technical problem of searching for the optimal configuration efficiently and effectively by maximizing the reward using any one of the many policy-gradient algorithms known in the art.

[0061] Embodiments in this disclosure use weight-resonant neural networks
(e.g., neural network 404 and value estimation neural network) that search for the optimal configuration in configuration spaces. The embodiment weight-resonant neural networks (e.g., neural network 404 and value estimation neural network) may receive a constant input on the input side and receive the reward signals from the output side, which allows the embodiment weight-resonant neural networks to resonate the networks' parameters towards eventually producing the optimal configuration. The embodiment techniques can work with multi-dimensional objectives (e.g., objectives O1, O2, O3, ... OL as shown in FIGs. 4E and 4F). These objectives may be combined with corresponding weights in order to produce a single scalar objective (e.g., through a linear combination of objectives O1, O2, O3, ... OL as shown in FIGs. 4E and 4F). So, the operating planes are essentially modeled through a weighted sum of the objectives. By using a weighted sum (or other methods) to combine the objective values, the embodiment techniques in effect search a particular zone of the Pareto frontier, and because of their data efficiency, the embodiment techniques can search far larger combinatorial spaces than alternative approaches that model the whole frontier.
[0062] In sum, in contrast to known alternative techniques using multi-step RL (for more complex but lower-dimensional problems) or multi-armed bandits (for simpler problems), the embodiment techniques utilize single-step compound-action RL. The embodiment techniques do not need to model the state because there are only two states: the initial state (e.g., the blank state 302) and the terminal state (e.g., the fully configured state 306). The embodiment techniques rely on updating the policy distribution and the value estimation neural networks (or other statistical estimators) through weight resonance, gradually producing a policy distribution that is highly likely to produce optimal configurations near a zone parallel to the operating plane and optimizing reward. Furthermore, in contrast to known techniques that unnaturally discretize a continuous space (which reduces fidelity and vastly increases memory requirements during training), embodiment techniques, by using the single-step compound action (e.g., action 304), do not need to model probability on broken domains. Thus, embodiment techniques do not need to discretize the continuous space. Rather, embodiment techniques rely on the reward feedback to learn unallowed moves that, in a multi-step approach, would lead to a requirement for state modeling.
[0063] FIG. 6 is a block diagram of a computing system 600 that may be used for implementing the devices and methods disclosed herein, according to some embodiments. Specific devices may utilize all of the components shown or only a subset of the components, and levels of integration may vary from device to device. Furthermore, a device may contain multiple instances of a component, such as multiple processing units, processors, memories, transmitters, receivers, etc. The computing system 600 includes a processing unit 602. The processing unit includes a central processing unit (CPU) 614, memory 608, and may further include a mass storage device 604, a video adapter 610, and an I/O interface 612 connected to a bus 620.
[0064] The bus 620 may be one or more of any type of several bus architectures including a memory bus or memory controller, a peripheral bus, or a video bus. The CPU 614 may comprise any type of electronic data processor. The memory 608 may comprise any type of non-transitory system memory such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous DRAM (SDRAM), read-only memory (ROM), or a combination thereof. In an embodiment, the memory 608 may include ROM for use at boot-up, and DRAM for program and data storage for use while executing programs.
[0065] The mass storage 604 may comprise any type of non-transitory storage device configured to store data, programs, and other information and to make the data, programs, and other information accessible via the bus 620. The mass storage 604 may comprise, for example, one or more of a solid state drive, a hard disk drive, a magnetic disk drive, or an optical disk drive.

[0066] The video adapter 610 and the I/O interface 612 provide interfaces to couple external input and output devices to the processing unit 602. As illustrated, examples of input and output devices include a display 618 coupled to the video adapter 610 and a mouse, keyboard, or printer 616 coupled to the I/O interface 612. Other devices may be coupled to the processing unit 602, and additional or fewer interface cards may be utilized. For example, a serial interface such as Universal Serial Bus (USB) (not shown) may be used to provide an interface for an external device.
[0067] The processing unit 602 also includes one or more network interfaces 606, which may comprise wired links, such as an Ethernet cable, or wireless links to access nodes or different networks. The network interfaces 606 allow the processing unit 602 to communicate with remote units via the networks. For example, the network interfaces 606 may provide wireless communication via one or more transmitters/transmit antennas and one or more receivers/ receive antennas. In an embodiment, the processing unit 602 is coupled to a local-area network 622 or a wide-area network for data processing and communications with remote devices, such as other processing units, the Internet, or remote storage facilities.
[0068] It should be appreciated that one or more steps of the embodiment methods provided herein may be performed by corresponding units or modules. For example, methods and techniques described in this disclosure may be performed by an RL agent running on the at least one processor of one or more computing systems (e.g., computing system 600), and the RL agent may include the neural network(s) described in this disclosure. The evaluator may be external or internal to the RL agent. The respective units or modules may be hardware, software, or a combination thereof. For instance, one or more of the units or modules may be an integrated circuit, such as field-programmable gate arrays (FPGAs) or application-specific integrated circuits (ASICs).
[0069] Although the description has been described in detail, it should be understood that various changes, substitutions and alterations can be made without departing from the spirit and scope of this disclosure as defined by the appended claims. Moreover, the scope of the disclosure is not intended to be limited to the particular embodiments described herein, as one of ordinary skill in the art will readily appreciate from this disclosure that processes, machines, manufacture, compositions of matter, means, methods, or steps, presently existing or later to be developed, may perform substantially the same function or achieve substantially the same result as the corresponding embodiments described herein. Furthermore, any aspects of different embodiments in this disclosure may be combined. Accordingly, the appended claims are intended to include within their scope such processes, machines, manufacture, compositions of matter, means, methods, or steps.

Claims

WHAT IS CLAIMED IS:
1. A method comprising: receiving, by a neural network running on at least one processor, a constant input for a configuration design requiring N dimensions; outputting, by the neural network, N probability distributions; generating, by the at least one processor, a batch of sample configurations for the configuration design based on the N probability distributions, wherein each sample configuration of the batch of sample configurations corresponds to a different full configuration of a system; outputting, by the at least one processor, the batch of sample configurations to an evaluator external to the neural network; and updating, by the at least one processor, parameters of the neural network based on a loss function.
2. The method of claim 1, further comprising: repeating, by the at least one processor, the receiving the constant input, the outputting the N probability distributions, the generating the batch of sample configurations, the outputting the batch of sample configurations, and the updating the parameters for a plurality of iterations, wherein the constant input remains the same throughout the plurality of iterations.
3. The method of any one of claims 1-2, wherein the constant input includes a single constant value, N different constant one-hot vectors, or N vectors each having all 1s.
4. The method of any one of claims 1-3, wherein the receiving the constant input and the outputting the N probability distributions are stateless.
5. The method of any one of claims 1-4, the updating the parameters of the neural network including: updating weights of the neural network.
6. The method of any one of claims 1-5, wherein the N probability distributions are joint probability distributions such that all of the N probability distributions are conditional on one another.
7. The method of any one of claims 1-6, wherein the loss function is calculated based on rewards, the rewards generated based on performance metrics output by the evaluator.
8. The method of any one of claims 1-7, wherein the loss function is updated by series-mode value estimation layers, and wherein the series-mode value estimation layers are updated based on the batch of sample configurations.
9. The method of any one of claims 1-8, wherein the loss function is updated by branched-mode value estimation layers, and wherein the branched-mode value estimation layers are updated based on a task encoding tensor of the neural network.
10. The method of any one of claims 1-9, wherein the N probability distributions include N discrete probability density function (PDF) distributions, N continuous PDF distributions, or a mixture of M discrete PDF distributions and (N-M) continuous PDF distributions.
11. An apparatus comprising: at least one processor; and a non-transitory computer readable storage medium storing programming, the programming including instructions that, when executed by the at least one processor, cause the apparatus to: receive, by a neural network running on the at least one processor, a constant input for a configuration design requiring N dimensions; output, by the neural network, N probability distributions; generate a batch of sample configurations for the configuration design based on the N probability distributions, wherein each sample configuration of the batch of sample configurations corresponds to a different full configuration of a system; output the batch of sample configurations to an evaluator external to the neural network; and update parameters of the neural network based on a loss function.
12. The apparatus of claim 11, the programming further including instructions that, when executed by the at least one processor, cause the apparatus to: repeat receiving the constant input, outputting the N probability distributions, generating the batch of sample configurations, outputting the batch of sample configurations, and updating the parameters for a plurality of iterations, wherein the constant input remains the same throughout the plurality of iterations.
13. The apparatus of any one of claims 11-12, wherein the constant input includes a single constant value, N different constant one-hot vectors, or N vectors each having all 1s.
14. The apparatus of any one of claims 11-13, wherein receiving the constant input and outputting the N probability distributions are stateless.
15. The apparatus of any one of claims 11-14, the instructions to update the parameters of the neural network including instructions to: update weights of the neural network.
16. The apparatus of any one of claims 11-15, wherein the N probability distributions are joint probability distributions such that all of the N probability distributions are conditional on one another.
17. The apparatus of any one of claims 11-16, wherein the loss function is calculated based on rewards, the rewards generated based on performance metrics output by the evaluator.
18. The apparatus of any one of claims 11-17, wherein the loss function is updated by series-mode value estimation layers, and wherein the series-mode value estimation layers are updated based on the batch of sample configurations.
19. The apparatus of any one of claims 11-18, wherein the loss function is updated by branched-mode value estimation layers, and wherein the branched-mode value estimation layers are updated based on a task encoding tensor of the neural network.
20. The apparatus of any one of claims 11-19, wherein the N probability distributions include N discrete probability density function (PDF) distributions, N continuous PDF distributions, or a mixture of M discrete PDF distributions and (N-M) continuous PDF distributions.
21. A non-transitory computer-readable medium having instructions stored thereon that, when executed by an apparatus, cause the apparatus to perform operations, the operations comprising: receiving, by a neural network running on the apparatus, a constant input for a configuration design requiring N dimensions; outputting, by the neural network, N probability distributions; generating a batch of sample configurations for the configuration design based on the N probability distributions, wherein each sample configuration of the batch of sample configurations corresponds to a different full configuration of a system; outputting the batch of sample configurations to an evaluator external to the neural network; and updating parameters of the neural network based on a loss function.
22. The non-transitory computer-readable medium of claim 21, the operations further comprising: repeating the receiving the constant input, the outputting the N probability distributions, the generating the batch of sample configurations, the outputting the batch of sample configurations, and the updating the parameters for a plurality of iterations, wherein the constant input remains the same throughout the plurality of iterations.
23. The non-transitory computer-readable medium of any one of claims 21-22, wherein the constant input includes a single constant value, N different constant one-hot vectors, or N vectors each having all 1s.
24. The non-transitory computer-readable medium of any one of claims 21-23, wherein the receiving the constant input and the outputting the N probability distributions are stateless.
25. The non-transitory computer-readable medium of any one of claims 21-24, the updating the parameters of the neural network including: updating weights of the neural network.
26. The non-transitory computer-readable medium of any one of claims 21-25, wherein the N probability distributions are joint probability distributions such that all of the N probability distributions are conditional on one another.
27. The non-transitory computer-readable medium of any one of claims 21-26, wherein the loss function is calculated based on rewards, the rewards generated based on performance metrics output by the evaluator.
28. The non-transitory computer-readable medium of any one of claims 21-27, wherein the loss function is updated by series-mode value estimation layers, and wherein the series-mode value estimation layers are updated based on the batch of sample configurations.
29. The non-transitory computer-readable medium of any one of claims 21-28, wherein the loss function is updated by branched-mode value estimation layers, and wherein the branched-mode value estimation layers are updated based on a task encoding tensor of the neural network.
30. The non-transitory computer-readable medium of any one of claims 21-29, wherein the N probability distributions include N discrete probability density function (PDF) distributions, N continuous PDF distributions, or a mixture of M discrete PDF distributions and (N-M) continuous PDF distributions.
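Claims 21-30 recite a single fixed loop: a stateless neural network maps a constant input to N probability distributions, a batch of full sample configurations is drawn from those distributions, an evaluator external to the network scores the batch, and rewards derived from its performance metrics drive a loss-based update of the network parameters, with the constant input unchanged across iterations. The sketch below is a minimal, non-authoritative PyTorch illustration of that loop under simplifying assumptions: the ConstantInputPolicy class, the toy evaluate_batch evaluator, the all-discrete (categorical) distributions, and the REINFORCE-style loss with a mean-reward baseline are hypothetical choices made for illustration and are not taken from the specification; in particular, the claimed series-mode and branched-mode value estimation layers are omitted.

```python
# Illustrative sketch only; ConstantInputPolicy and evaluate_batch are
# hypothetical names, and the reward/loss shaping is a placeholder rather
# than the patented implementation.
import torch
import torch.nn as nn

N = 8       # number of configuration dimensions
K = 16      # choices per (discrete) dimension
BATCH = 64  # sample configurations drawn per iteration

class ConstantInputPolicy(nn.Module):
    """Stateless network: a fixed constant input maps to N categorical distributions."""
    def __init__(self, n_dims: int, n_choices: int, hidden: int = 128):
        super().__init__()
        self.n_dims, self.n_choices = n_dims, n_choices
        self.body = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(),
                                  nn.Linear(hidden, n_dims * n_choices))

    def forward(self, const_in: torch.Tensor) -> torch.distributions.Categorical:
        logits = self.body(const_in).view(self.n_dims, self.n_choices)
        return torch.distributions.Categorical(logits=logits)

def evaluate_batch(configs: torch.Tensor) -> torch.Tensor:
    # Stand-in for the external evaluator: one performance metric per full
    # configuration. This toy metric favors configurations whose entries sum
    # to a target value.
    return -((configs.float().sum(dim=1) - 0.5 * N * K) ** 2)

policy = ConstantInputPolicy(N, K)
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
const_in = torch.ones(1)  # single constant value; held fixed every iteration

for step in range(200):
    dist = policy(const_in)            # N probability distributions
    configs = dist.sample((BATCH,))    # batch of full sample configurations, shape (BATCH, N)
    rewards = evaluate_batch(configs)  # external evaluator -> performance metrics
    advantages = rewards - rewards.mean()           # simple mean-reward baseline
    log_probs = dist.log_prob(configs).sum(dim=1)   # joint log-probability per sample
    loss = -(advantages.detach() * log_probs).mean()  # REINFORCE-style loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Continuous or mixed discrete/continuous dimensions (claim 30) could be sketched the same way by replacing some or all of the categorical heads with continuous distribution heads (e.g., Normal) and summing the corresponding log-probabilities.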

Priority Applications (1)

Application Number: PCT/US2022/026747 (published as WO2022147584A2, en)
Priority Date: 2022-04-28
Filing Date: 2022-04-28
Title: Deep- reinforcement learning (rl), weight-resonant system and method for fixed-horizon search of optimality

Applications Claiming Priority (1)

Application Number: PCT/US2022/026747 (published as WO2022147584A2, en)
Priority Date: 2022-04-28
Filing Date: 2022-04-28
Title: Deep- reinforcement learning (rl), weight-resonant system and method for fixed-horizon search of optimality

Publications (2)

WO2022147584A2 (en), published 2022-07-07
WO2022147584A3 (en), published 2023-02-09

Family

ID=81750831

Family Applications (1)

Application Number: PCT/US2022/026747 (published as WO2022147584A2, en)
Priority Date: 2022-04-28
Filing Date: 2022-04-28
Title: Deep- reinforcement learning (rl), weight-resonant system and method for fixed-horizon search of optimality

Country Status (1)

Country: WO, Link: WO2022147584A2 (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party

GB201913616D0 (en) *, priority date 2019-09-20, publication date 2019-11-06, assignee Univ Oslo Hf, "Histological image analysis"

Also Published As

WO2022147584A3 (en), published 2023-02-09

Similar Documents

Publication Publication Date Title
US11853893B2 (en) Execution of a genetic algorithm having variable epoch size with selective execution of a training algorithm
Eggensperger et al. Efficient benchmarking of hyperparameter optimizers via surrogates
US10853554B2 (en) Systems and methods for determining a configuration for a microarchitecture
WO2022147583A2 (en) System and method for optimal placement of interacting objects on continuous (or discretized or mixed) domains
Chouikhi et al. Single-and multi-objective particle swarm optimization of reservoir structure in echo state network
CN114358317B (en) Data classification method based on machine learning framework and related equipment
WO2008156595A1 (en) Hybrid method for simulation optimization
Fasfous et al. Anaconga: Analytical hw-cnn co-design using nested genetic algorithms
JP7137074B2 (en) Optimization calculation method, optimization calculation device, and optimization calculation program
Guillen et al. Minimising the delta test for variable selection in regression problems
KR20220032861A (en) Neural architecture search method and attaratus considering performance in hardware
Zhang et al. Universal value iteration networks: When spatially-invariant is not universal
WO2022147584A2 (en) Deep- reinforcement learning (rl), weight-resonant system and method for fixed-horizon search of optimality
CN114372539B (en) Machine learning framework-based classification method and related equipment
Mortazavi et al. Theta-Resonance: A Single-Step Reinforcement Learning Method for Design Space Exploration
WO2023205820A2 (en) Anomalous design handling in single step design space exploration
US20220405599A1 (en) Automated design of architectures of artificial neural networks
Walke et al. Learning finite linear temporal logic specifications with a specialized neural operator
US20240028938A1 (en) Systems and methods for improving efficiency of calibration of quantum devices
US20230289563A1 (en) Multi-node neural network constructed from pre-trained small networks
US20230214629A1 (en) Transformer-based autoregressive language model selection
Rhodes Evolving order and chaos: Comparing particle swarm optimization and genetic algorithms for global coordination of cellular automata
Falanti POPNASv2: efficient neural architecture search through time-accuracy optimization
Leoshcheko et al. Mechanisms of Fine Tuning of Neuroevolutionary Synthesis of Artificial Neural Networks
Sánchez Iruela et al. A parallel solution with GPU technology to predict energy consumption in spatially distributed buildings using evolutionary optimization and artificial neural networks