CN117529733A - Hierarchical supervision training of neural networks


Info

Publication number
CN117529733A
Authority
CN
China
Prior art keywords
neural network
stage
classification
clusters
stages
Prior art date
Legal status
Pending
Application number
CN202280043463.9A
Other languages
Chinese (zh)
Inventor
S. M. Borse
H. Cai
Y. Zhang
F. M. Porikli
Current Assignee
Qualcomm Inc
Original Assignee
Qualcomm Inc
Priority claimed from US17/808,949 external-priority patent/US20230004812A1/en
Application filed by Qualcomm Inc filed Critical Qualcomm Inc
Publication of CN117529733A

Landscapes

  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Certain aspects of the present disclosure provide techniques for training neural networks using hierarchical supervision. An example method generally includes: a neural network having a plurality of stages is trained using a training dataset and an initial number of classification clusters into which data in the training dataset can be classified. A cluster validation set performance metric is generated for each stage based on a reduced number of classification clusters relative to the initial number of classification clusters and a validation dataset. The number of classification clusters to be implemented at each stage is selected based on the cluster validation set performance metrics and an angle selected relative to the cluster validation set performance metric for the final stage of the neural network. The neural network is retrained based on the training dataset and the selected number of classification clusters for each stage of the neural network, and the trained neural network is deployed.

Description

Hierarchical supervision training of neural networks
Cross Reference to Related Applications
The present application claims priority to U.S. patent application Ser. No. 17/808,949, entitled "Hierarchical Supervised Training for Neural Networks," filed on June 24, 2022, which claims the benefit of and priority to U.S. provisional patent application Ser. No. 63/214,940, entitled "Hierarchical Supervised Training for Neural Networks," filed on June 25, 2021 and assigned to the assignee of the present application, the contents of both of which are incorporated herein by reference in their entirety.
Introduction
Aspects of the present disclosure relate to machine learning.
Some applications of machine learning may involve using neural networks to classify input data. These neural networks may be used, for example, in various scenarios where semantic information about the data to be classified may be used in the classification process, such as in semantic segmentation of the data (e.g., for data compression), augmented reality or virtual reality, in controlling autonomous vehicles, in operations based on domain-specific data (e.g., medical imaging), and so forth. In general, semantic segmentation attempts to classify (or assign labels to) each of a plurality of subcomponents of the data input into a neural network for classification. For example, a neural network used to classify different segments of an image may assign each pixel of the image one of a plurality of labels, whereby different regions of the image may be associated with different data categories.
In some examples, deep neural networks may be trained and deployed to perform various classification tasks using semantic segmentation. Deep neural networks generally include an input layer, one or more middle layers, and an output layer, which together attempt to perform various tasks, such as classifying an input into one of a plurality of categories, tracking objects across spatial regions, translating, predicting, and the like. However, for various reasons, supervised learning techniques used to train these deep neural networks may not accurately classify data.
Accordingly, improved techniques for training deep neural networks are needed.
Brief summary of the invention
Certain aspects provide a method for training a neural network. The method generally includes: a neural network having a plurality of stages is trained using a training dataset and an initial number of classification clusters into which data in the training dataset can be classified. A cluster validation set performance metric is generated for each of the plurality of stages of the neural network based on a reduced number of classification clusters relative to the initial number of classification clusters and a validation dataset separate from the training dataset. The number of classification clusters to be implemented at each of the plurality of stages of the neural network is selected based on the cluster validation set performance metrics and an angle selected relative to the cluster validation set performance metric for the final stage of the neural network. The neural network is retrained based on the training dataset and the selected number of classification clusters for each of the plurality of stages of the neural network, and the trained neural network is deployed.
Other aspects provide a method for classifying data using a trained neural network. The method generally includes receiving input for classification. The input is classified using a neural network having a plurality of stages. In general, each of the plurality of stages classifies the input using a different number of classification clusters. One or more actions are taken based on the classification of the input.
Other aspects provide: a processing system configured to perform the foregoing methods and those described herein; a non-transitory computer-readable medium comprising instructions that, when executed by one or more processors of a processing system, cause the processing system to perform the aforementioned methods, as well as those described herein; a computer program product embodied on a computer-readable storage medium comprising code for performing the foregoing methods, as well as those methods further described herein; and a processing system comprising means for performing the foregoing methods, as well as those methods further described herein.
The following description and the annexed drawings set forth in detail certain illustrative features of the one or more embodiments.
Brief Description of Drawings
The drawings depict certain aspects of the one or more embodiments and are not, therefore, to be considered limiting of the scope of the disclosure.
FIG. 1 depicts an example architecture of a neural network for use in generating an inference from a received input.
Fig. 2 illustrates example operations that may be performed by a computing device to train a neural network using hierarchical supervision, in accordance with aspects of the present disclosure.
Fig. 3 illustrates example operations that may be performed by a computing device to classify data using a neural network trained using hierarchical supervision, in accordance with aspects of the present disclosure.
Fig. 4 illustrates an example plot of cluster validation set performance as a function of the number of classification clusters used at each of a plurality of stages in a neural network, in accordance with aspects of the present disclosure.
Fig. 5 illustrates an example architecture of a neural network using hierarchical supervised training in accordance with aspects of the present disclosure.
Fig. 6 illustrates an example architecture of a neural network trained using hierarchical supervision including a segmentation transformer associated with each stage of the neural network, in accordance with aspects of the present disclosure.
Fig. 7 illustrates an example implementation of a processing system in which a neural network may be trained using hierarchical supervision, in accordance with aspects of the present disclosure.
Fig. 8 illustrates an example implementation of a processing system in which data may be classified using a neural network of hierarchical supervised training, in accordance with aspects of the present disclosure.
To facilitate understanding, identical reference numerals have been used, where possible, to designate identical elements that are common to the figures. It is contemplated that elements and features of one embodiment may be beneficially incorporated in other embodiments without further recitation.
Detailed Description
Aspects of the present disclosure provide apparatus, methods, processing systems, and computer readable media for training a neural network using hierarchical supervision and a different number of classification clusters at each stage of the neural network.
Neural networks used in various data classification tasks typically include several stages or layers that may perform discrete classification tasks to classify data input to the neural networks. These neural networks may include encoder-decoder architectures in which an encoder encodes an input into a latent space representation (or other compressed representation) of the input, a decoder generates a reconstruction of the input, and classification tasks are performed based on the latent space representation of the input. These neural networks may also include multi-stage neural networks, where each stage of the neural network is configured to perform a task with respect to the input.
Example neural network architecture
Fig. 1 illustrates an architecture of a neural network for use in generating an inference from a received input. In general, the neural network 100 may include any number N of stages through which the input, or data derived from the input by earlier stages, is processed to generate an inference as the output of the neural network 100. As illustrated, the neural network 100 includes a plurality of stages 110, 120, 130, and 140, designated as stage 1, stage 2, stage N-1, and stage N, respectively. To generate an inference about the input, e.g., to classify the input or a portion thereof as one of a plurality of categories, the input may be fed into stage 1 110. The output of stage 1 110 (e.g., a feature map) may be used as an input to stage 2 120. More generally, for any stage following the initial stage of the neural network 100 (e.g., stages 120, 130, and 140 as illustrated in Fig. 1), the inputs of that stage generally include the outputs of the previous stage. The output of stage 140 (e.g., the Nth and final stage of the neural network 100) may be the inference generated for the input. Although not depicted, in various embodiments, "skip" connections (also referred to as residual or shortcut connections) may also be used with the neural network 100 to skip a particular stage, or to accumulate the output of a stage with its input, to name a few examples.
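By way of illustration only, the following minimal PyTorch sketch shows a multi-stage network of this kind, in which each stage consumes the output of the previous stage and the per-stage feature maps are retained (e.g., for the auxiliary supervision discussed later). The stage widths, layer choices, and input sizes are illustrative assumptions and are not taken from this disclosure.

```python
import torch
import torch.nn as nn

class StagedNetwork(nn.Module):
    """Minimal N-stage network: each stage consumes the previous stage's feature map."""
    def __init__(self, in_channels=3, stage_channels=(32, 64, 128, 256)):
        super().__init__()
        self.stages = nn.ModuleList()
        prev = in_channels
        for ch in stage_channels:
            self.stages.append(nn.Sequential(
                nn.Conv2d(prev, ch, kernel_size=3, padding=1),
                nn.BatchNorm2d(ch),
                nn.ReLU(inplace=True),
            ))
            prev = ch

    def forward(self, x):
        features = []  # intermediate stage outputs, e.g., for auxiliary supervision
        for stage in self.stages:
            x = stage(x)
            features.append(x)
        return x, features  # final-stage output plus per-stage feature maps

# Example: a 4-stage network processing a small batch of RGB images.
net = StagedNetwork()
out, feats = net(torch.randn(2, 3, 64, 64))
```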
Neural networks such as the neural network 100 may be affected by various complications, resulting in reduced accuracy of their outputs. As a neural network becomes deeper (e.g., as the neural network includes more intermediate stages between the input and output stages), the neural network may become increasingly affected by the gradient vanishing problem. The gradient vanishing problem generally refers to the situation that occurs when the gradient of the loss function is near zero while optimizing the loss function at each stage of the neural network. Thus, in a neural network affected by the gradient vanishing problem, the weights and biases of each stage of the neural network may not be updated effectively, and the resulting neural network may not be able to make accurate inferences about the input data. In another example, intermediate stages of these neural networks may not be able to identify reasonable patterns in the inputs that would allow the neural network to generate an accurate output for a given input.
To address this gradient vanishing problem and the inability of the intermediate stages of the neural network to identify reasonable patterns in the input, and thus to improve the accuracy of the neural network, direct supervision of the intermediate stages in the neural network has been proposed. In direct supervision of the training of each intermediate stage (e.g., stage 2 through stage N-1 as illustrated in Fig. 1) in the neural network 100, the intermediate stages may be trained using an auxiliary loss function that adds a loss term and attempts to alleviate the gradient vanishing problem that a deep neural network may experience. Each intermediate stage may also be trained based on ground truth data, such as a ground truth map representing the desired classification for different portions of an input image. However, since intermediate stages of the neural network may have a limited ability to accurately classify data (e.g., they are less expressive than the final stage of the neural network), these intermediate stages may also be unable to identify consistent patterns from the input data and the ground truth map, thereby adversely affecting the accuracy of the inferences generated by the neural network. Further, training the intermediate stages of the neural network in this way may ignore the difference in expressive capacity between the intermediate stages and the final stage.
Example methods for training neural networks using hierarchical supervision
To improve the accuracy of deep neural networks, aspects of the present disclosure describe techniques that may train a neural network using hierarchical supervision. In training a neural network using hierarchical supervision, intermediate stages of the neural network may be trained using a reduced number of classification clusters relative to the number of classification clusters into which data can be classified at the final stage of the neural network. In general, a classification cluster may represent a category into which data may be classified. As discussed in more detail herein, the classification clusters may be used to classify data on a finer-grained basis at later stages of the neural network, and on a more general basis at earlier stages of the neural network. By doing so, aspects of the present disclosure may simplify training of the intermediate stages of a neural network such that the intermediate stages may be trained using fewer computing resources (e.g., processing power, processing time, memory, etc.) than would be used when training the neural network using direct supervision of the intermediate stages, in which each stage of the neural network is trained using the full set of classification clusters into which data can be classified at the final stage of the neural network. Further, aspects of the present disclosure may provide a neural network that is able to generate inferences for an input more accurately than a neural network whose intermediate stages are trained using direct supervision.
Fig. 2 illustrates example operations 200 that may be performed to train a neural network using hierarchical supervision in accordance with certain aspects of the present disclosure. Operation 200 may be performed by, for example, a physical or virtual computing device or a cluster of physical and/or virtual computing devices on which a neural network may be trained.
As illustrated, operation 200 begins at block 210 with training a neural network. The neural network generally includes a plurality of stages. The neural network may be trained using a training data set and an initial number of classification clusters into which data in the training data set may be classified. Generally, training the neural network may include training a new neural network from the training data set, further training a partially trained model, or fine-tuning the trained model (e.g., by performing retraining, incremental training, training in a federal learning scheme, etc.).
In general, neural networks may be trained using supervised learning techniques, where each element in a training dataset is labeled with information identifying the category to which the element belongs. The training data set may be generated as part of a larger data set from which the training data set and the validation data set may be generated. In general, the training data set may be significantly larger than the validation data set. For example, the training data set may account for ninety percent of the overall data set, while the validation data set may account for the remaining ten percent of the overall data set.
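As a minimal sketch of such a split, the following example holds out ten percent of a toy labeled dataset as the validation set; the use of scikit-learn and the dataset shapes are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labeled dataset: 1000 samples with integer class labels.
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 16))
labels = rng.integers(0, 19, size=1000)

# Hold out ten percent of the data as the validation set used for the
# cluster validation set performance metrics; the rest is the training set.
train_x, val_x, train_y, val_y = train_test_split(
    data, labels, test_size=0.10, random_state=0)
```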
At block 220, cluster validation set performance metrics are generated for each of the plurality of stages. The cluster validation set performance metric may be based on a reduced number of classification clusters relative to the initial number of classification clusters, and a validation dataset separate from the training dataset. In general, reducing the number of classification clusters may result in classification clusters that cover broader classes of data. By doing so, the early stages of the neural network (which may have a less robust ability to classify data at a fine level of granularity) may be trained to classify data into broader classes. This may improve the performance of the neural networks used to classify the data, such as by improving the accuracy of predictions made using the neural networks, and reduce the computational resources used in training the neural networks.
In some aspects, the reduced number of classification clusters may be defined a priori. The set of classification clusters into which data in the training dataset may be classified may include a number of specific classification categories, which may be grouped under a broader genus. For example, assume that a collection of classification clusters includes "train", "car", "bus" and "bicycle" classifications. Based on human knowledge, the "train", "car", "bus" and "bicycle" classifications may be combined into a single cluster representing, for example, wheeled transportation devices as a broader group, thereby defining the reduced set of classification clusters a priori.
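The following sketch illustrates such an a priori reduction as a simple lookup-table remapping of fine class indices to coarse cluster indices. The class names beyond the four in the example above ("person", "road") and the index assignments are hypothetical.

```python
import numpy as np

# Fine-grained classes used at the final stage (indices are illustrative).
fine_classes = {"train": 0, "car": 1, "bus": 2, "bicycle": 3, "person": 4, "road": 5}

# A-priori (human-defined) merge: all wheeled transportation devices collapse
# into one coarse cluster for an earlier stage; other classes keep their own cluster.
fine_to_coarse = {0: 0, 1: 0, 2: 0, 3: 0,   # train/car/bus/bicycle -> wheeled transport
                  4: 1,                      # person
                  5: 2}                      # road

lookup = np.array([fine_to_coarse[i] for i in range(len(fine_classes))])

# Remap a ground-truth label map (H x W of fine class indices) to coarse clusters.
fine_label_map = np.random.randint(0, len(fine_classes), size=(4, 4))
coarse_label_map = lookup[fine_label_map]
```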
In certain aspects, the reduced number of classification clusters may be generated using an agglomerative clustering technique. As discussed above, at block 210, the neural network may be trained using direct supervision over the training dataset. Two confusion matrices may be generated using the trained neural network: a first confusion matrix C_in calculated for the training dataset, and a second confusion matrix C_out calculated for the validation dataset. In general, a confusion matrix identifies a number of true positive predictions, a number of false positive predictions, and a number of false negative predictions for each class in the set of classification clusters into which the data may be classified. The adjacency matrix A_out over the classification clusters can then be calculated from these confusion matrices according to equation (1).
For each stage i of the N-stage neural network, an adjacency matrix A_out,i may be generated in this manner. At each stage, clusters in the computed adjacency matrix may be merged using agglomerative clustering such that multiple neighboring clusters are merged into a single cluster. The single cluster generally represents a broader class of data than the class associated with any one of the multiple neighboring clusters that are merged into it.
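A rough sketch of this procedure is shown below. Because the exact form of equation (1) is not reproduced in this text, the symmetrized sum of the two confusion matrices is used as a stand-in affinity; that choice, and the use of scikit-learn's agglomerative clustering, are assumptions made for illustration only.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.cluster import AgglomerativeClustering  # scikit-learn >= 1.2 for `metric=`

def merged_clusters(y_true_train, y_pred_train, y_true_val, y_pred_val,
                    num_classes, num_clusters):
    """Merge the original classes into `num_clusters` coarse clusters for one stage."""
    labels = list(range(num_classes))
    c_in = confusion_matrix(y_true_train, y_pred_train, labels=labels)
    c_out = confusion_matrix(y_true_val, y_pred_val, labels=labels)

    # Stand-in for equation (1): symmetrize the confusion matrices so that
    # frequently-confused classes receive a high affinity.
    affinity = c_in + c_in.T + c_out + c_out.T
    distance = affinity.max() - affinity   # convert affinity to a distance
    np.fill_diagonal(distance, 0)

    clustering = AgglomerativeClustering(
        n_clusters=num_clusters, metric="precomputed", linkage="average")
    return clustering.fit_predict(distance)  # class index -> coarse cluster index

# Example with random predictions over 10 classes merged into 4 coarse clusters.
rng = np.random.default_rng(0)
yt, yp = rng.integers(0, 10, 500), rng.integers(0, 10, 500)
vt, vp = rng.integers(0, 10, 100), rng.integers(0, 10, 100)
print(merged_clusters(yt, yp, vt, vp, num_classes=10, num_clusters=4))
```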
In yet another example, spectral clustering may be used to reduce the set of classification clusters into which data may be classified to a smaller set of classification clusters. In general, spectral clustering allows classification clusters to be grouped into a single larger group based on a graph representation and the edges connecting nodes in the graph, where each classification cluster is represented by a node in the graph. For spectral clustering of the classification clusters, the adjacency matrix A_out,i for a given stage i can be calculated according to equation (1) above. One or more orthogonal eigenvectors may be identified in the adjacency matrix and clustered. Different clusters in the collection of classification clusters may then be consolidated into a single, broader cluster based on determining that their corresponding rows in the eigenvector representation are assigned to the same cluster.
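A minimal sketch of grouping the original classification clusters with spectral clustering over a precomputed affinity matrix is shown below; the random affinity matrix stands in for the adjacency matrix A_out,i, and the cluster counts are arbitrary.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# A symmetric, non-negative affinity matrix over the original classification
# clusters (random here; in practice it would be the adjacency matrix A_out,i).
rng = np.random.default_rng(0)
a = rng.random((10, 10))
affinity = (a + a.T) / 2.0

# Group the 10 original clusters into 4 broader clusters using spectral clustering.
spectral = SpectralClustering(n_clusters=4, affinity="precomputed",
                              assign_labels="kmeans", random_state=0)
coarse_assignment = spectral.fit_predict(affinity)
print(coarse_assignment)  # coarse cluster index for each original cluster
```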
In yet another example, each stage of the neural network may include a segmentation transformer module (also referred to as an Object Context Representation (OCR) module). In general, the segmentation transformer module (or OCR module) characterizes data based on the relationship between that data and the data of surrounding areas in the image, the characterization being based on the assumption that data points surrounded by data points of a given classification may be similarly classified. In such examples, the segmentation transformer may extract a one-dimensional embedding for each classification cluster. The per-class embeddings may be extracted by performing inference on the validation dataset, and k-means clustering may be applied to these embeddings to generate the reduced number of classification clusters.
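The sketch below illustrates the k-means step only: the per-class embeddings are random placeholders standing in for embeddings extracted by a segmentation transformer during inference on the validation set, and the dimensions and cluster counts are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical per-class embeddings: one D-dimensional vector per original
# classification cluster, as would be extracted by the OCR module.
rng = np.random.default_rng(0)
num_classes, embed_dim = 19, 256
class_embeddings = rng.normal(size=(num_classes, embed_dim))

# k-means over the per-class embeddings yields the reduced set of clusters for
# an earlier stage; classes whose embeddings are close share a coarse cluster.
reduced_k = 7
kmeans = KMeans(n_clusters=reduced_k, n_init=10, random_state=0)
fine_to_coarse = kmeans.fit_predict(class_embeddings)  # shape: (num_classes,)
```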
At block 230, a number of classification clusters to be implemented at each of the plurality of stages of the neural network is selected. The number of classification clusters may be selected based on the calculated cluster validation set performance metrics and an angle selected relative to the cluster validation set performance metric for the final stage of the neural network, as discussed in more detail below and illustrated in Fig. 4.
In some aspects, to select the number of classification clusters to be implemented at each of the plurality of stages of the neural network, the generated cluster validation set performance metrics for each of the plurality of stages may be plotted to show the relationship between inference performance and the number of clusters implemented at each of the plurality of stages. The cluster validation set performance metric generated for the final stage of the neural network at the initial number of classification clusters may be selected as the origin. An angle θ, measured from a vertical axis through the origin, may then be selected for a line drawn from the origin to identify the number of classification clusters to be implemented at each intermediate stage of the neural network. In certain aspects, the angle θ may range between 0° and 90°. A selected angle θ = 0° generally indicates that the neural network can be trained using direct supervision, as each stage in the neural network would be trained using the same (or a similar) number of classification clusters. A selected angle θ = 90° generally indicates that each stage in training should converge to the same or a similar performance level (e.g., by using the number of classification clusters that results in the inference accuracy of any given stage being within a threshold amount of the inference accuracy at the origin). A selected angle θ between 0° and 90° may result in a progressive increase in the number of classification clusters used in each successive stage of the neural network. In certain aspects, a certain angle between 0° and 90° may result in the highest inference performance (e.g., classification accuracy) of the neural network.
In general, the selected angle θ can be used to identify the performance level of each stage of the neural network and the corresponding number of classification clusters to be implemented at each stage of the neural network. The angle θ may be identified on the plot of cluster validation set performance metrics generated for each stage of the neural network using various techniques. In one example, the angle θ may be selected based on the maximum increase in performance between different stages in the neural network. In another example, a hyperparameter search may be performed to identify the angle θ that results in the highest performance (e.g., accuracy) of the neural network. In certain aspects, the angle may be selected such that successive stages in the neural network use an increased number of classification clusters relative to the previous stage of the neural network (e.g., the number of classification clusters increases monotonically with the number of layers).
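A simplified sketch of this selection procedure is shown below. Because the angle is only meaningful relative to how the two axes of the plot are scaled, both axes are normalized to [0, 1] here; that normalization, and the toy performance curves, are illustrative assumptions rather than the exact procedure defined in this disclosure.

```python
import numpy as np

def select_clusters_per_stage(cluster_counts, miou_per_stage, theta_deg):
    """Pick a cluster count for each stage from a line drawn at angle theta.

    cluster_counts: ascending 1-D array of candidate numbers of clusters.
    miou_per_stage: dict {stage_index: mIoU array, one value per cluster count};
        the largest stage index is treated as the final stage.
    theta_deg: angle of the line, measured from the vertical axis at the origin.
    """
    final = max(miou_per_stage)
    all_miou = np.concatenate(list(miou_per_stage.values()))

    # Normalize both plot axes to [0, 1]; the angle only makes sense relative
    # to some plot scaling, and this particular normalization is an assumption.
    ks = (cluster_counts - cluster_counts.min()) / np.ptp(cluster_counts)
    norm = lambda m: (m - all_miou.min()) / np.ptp(all_miou)

    k0, m0 = ks[-1], norm(miou_per_stage[final][-1])   # origin: final stage, all clusters
    slope = np.tan(np.deg2rad(theta_deg))

    selected = {final: int(cluster_counts[-1])}
    for stage, mious in miou_per_stage.items():
        if stage == final:
            continue
        # Line through the origin: as mIoU rises above m0, the cluster count drops.
        line_k = k0 - slope * (norm(mious) - m0)
        # Take the candidate count closest to where the stage's curve meets the line.
        idx = int(np.argmin(np.abs(ks - line_k)))
        selected[stage] = int(cluster_counts[idx])
    return selected

# Toy example: three stages, candidate cluster counts 2..19, theta = 80 degrees.
counts = np.arange(2, 20)
perf = {1: np.linspace(0.80, 0.35, len(counts)),
        2: np.linspace(0.85, 0.55, len(counts)),
        3: np.linspace(0.90, 0.70, len(counts))}
print(select_clusters_per_stage(counts, perf, theta_deg=80.0))  # e.g., {3: 19, 1: 2, 2: 6}
```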
At block 240, the neural network is retrained based on the training dataset and the selected number of classification clusters for each of the plurality of stages of the neural network.
In certain aspects, the neural network may be retrained using single-stage training or multi-stage training. In single-stage training, there may be no a priori knowledge of the capabilities of the neural network. To compensate, the selected number of classification clusters used at each stage may be defined a priori. For example, for a number of stages N in the neural network, the number of classification clusters at the i-th stage (where 1 ≤ i ≤ N) may be defined according to a predetermined function of i and N, such as a fraction of the total number of classification clusters that increases with the stage index i.
In multi-stage training, as discussed above, the number of classification clusters per stage may be selected using the plot and the angle θ selected relative to the defined origin. At each stage, the classification clusters may be merged using the various clustering techniques discussed above such that the number of classification clusters is smaller than the total number of classification clusters and equal to the number of clusters at the point where that stage's performance curve on the plot intersects the line drawn from the origin at the selected angle θ.
In general, retraining the neural network can be performed by minimizing a loss function for each stage in the neural network. Where stage N represents the output stage of the neural network and any given stage i has K_i classification clusters (the K_i clusters representing a subset of the K classification clusters of the neural network), the loss associated with output stage N may be represented by the equation:

$\mathcal{L}_N = \sum_{n=1}^{K_N} L_n$

where L_n represents the binary loss term associated with classification cluster n. The overall loss term over the trained neural network can be represented by the following equation:

$\mathcal{L} = \sum_{i=1}^{N} \gamma_i \mathcal{L}_i$

where γ_i is a weight hyperparameter associated with stage i of the neural network.
At block 250, the trained neural network is deployed. The neural network may be deployed to an endpoint device, such as a mobile phone, desktop or laptop computer, a vehicle User Equipment (UE), etc., on which inference may be performed locally. In certain aspects, the neural network may be deployed to a networked computing system (e.g., a server or a cluster of servers). The networked computing system may be configured to receive a request from a remote computing device to perform an inference on a given input, generate an inference for the input using the neural network, and output the inference to the remote computing device for use in performing one or more actions in an application executing on the remote computing device.
In training a neural network using hierarchical supervision, the backbone network may be trained by applying auxiliary supervision through segmentation heads attached to intermediate (or transitional) layers of the neural network. For a set S of ground-truth semantic labels, a smaller set of semantic labels S_i may be generated at each stage of the neural network such that |S_i| < |S|, where i represents an intermediate stage in the neural network. The resulting loss function may be represented by the following equation:

$\mathcal{L} = \mathcal{L}_{seg}^{N} + \sum_{i=1}^{N-1} \gamma_i \mathcal{L}_{seg}^{i}$
where $\mathcal{L}_{seg}^{i}$ is the segmentation loss of the i-th intermediate stage, γ_i is the weight of the i-th intermediate stage, and $\mathcal{L}_{seg}^{N}$ represents the segmentation loss of the final stage of the neural network. Unlike approaches that train each stage of the neural network using the same set of categories, aspects of the present disclosure train the neural network by supervising each intermediate layer with an optimal task complexity in terms of its set of semantic categories.
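A minimal PyTorch sketch of such a weighted multi-stage loss is shown below, using per-stage cross-entropy as the stage loss and a lookup table to remap fine ground-truth labels to each stage's coarser clusters; the cluster counts, mappings, and weights γ are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def hierarchical_supervision_loss(stage_logits, fine_labels, fine_to_coarse_maps, gammas):
    """Weighted sum of per-stage segmentation losses.

    stage_logits: list of per-stage logits, each of shape (B, K_i, H, W).
    fine_labels: ground-truth label map of fine classes, shape (B, H, W).
    fine_to_coarse_maps: list of LongTensors mapping fine class -> stage-i cluster.
    gammas: list of per-stage loss weights (the final stage typically has weight 1).
    """
    total = fine_labels.new_zeros((), dtype=torch.float32)
    for logits, mapping, gamma in zip(stage_logits, fine_to_coarse_maps, gammas):
        stage_labels = mapping[fine_labels]          # remap to this stage's clusters
        total = total + gamma * F.cross_entropy(logits, stage_labels)
    return total

# Toy example: 3 stages with 4, 8, and 19 clusters respectively.
B, H, W, K = 2, 16, 16, 19
fine = torch.randint(0, K, (B, H, W))
maps = [torch.randint(0, k, (K,)) for k in (4, 8)] + [torch.arange(K)]
logits = [torch.randn(B, k, H, W, requires_grad=True) for k in (4, 8, K)]
loss = hierarchical_supervision_loss(logits, fine, maps, gammas=[0.4, 0.4, 1.0])
loss.backward()
```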
As discussed, during training, each intermediate stage of the neural network may be trained using a reduced number of classifications relative to the complete classification set. By doing so, the learning task can be customized for each stage in the neural network such that training is neither overly complex nor overly simple, both of which can result in suboptimal inference performance (e.g., accuracy) of the neural network. In some aspects, some intermediate stages may be trained to perform classification tasks over very broad classes, while other (later) intermediate stages may be trained to perform classification tasks over narrower classes. For example, in an object detection system, an intermediate layer of the neural network may be trained to classify objects into static object or moving object classes, while later intermediate layers of the neural network may be trained to classify data more finely. For example, for static objects, an intermediate layer may be trained to classify those objects as biological or non-biological, a further intermediate layer may be trained to classify biological objects as one of a plurality of species, and so on.
In general, to allow the final segmentation layer to use the hierarchy of features to generate inferences, various fusion techniques can be used to provide the semantic feature sets to the final layer for use in segmentation. For example, for each intermediate layer, the segmentation features of that layer may be input to an Object Context Representation (OCR) block that enhances the features through relational context attention. These enhanced intermediate features are then fused and provided to the final segmentation layer. To reduce the computational cost associated with the added task complexity, the number of channels defined for an intermediate OCR block may be set to a smaller number than the number of channels of the next stage (e.g., 1/2 of the number of channels of the next stage).
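The sketch below illustrates this fusion and channel-halving arrangement only; a 1x1 convolutional block is used as a simple stand-in for the OCR block (which is not reproduced here), and the channel counts and class count are assumptions.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Fuse enhanced intermediate features before the final segmentation layer.

    Each intermediate stage's features pass through a lightweight context block
    (a stand-in for the OCR block) whose channel count is half that of the next
    stage; the results are concatenated with the final-stage features and fed
    to the final segmentation layer.
    """
    def __init__(self, stage_channels=(64, 128, 256), num_classes=19):
        super().__init__()
        self.context_blocks = nn.ModuleList()
        for i, ch in enumerate(stage_channels[:-1]):
            out_ch = stage_channels[i + 1] // 2   # half the next stage's channel count
            self.context_blocks.append(nn.Sequential(
                nn.Conv2d(ch, out_ch, kernel_size=1),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True)))
        fused = stage_channels[-1] + sum(stage_channels[i + 1] // 2
                                         for i in range(len(stage_channels) - 1))
        self.classifier = nn.Conv2d(fused, num_classes, kernel_size=1)

    def forward(self, stage_features):
        # stage_features: list of per-stage maps, all resized to a common H x W.
        enhanced = [blk(f) for blk, f in zip(self.context_blocks, stage_features[:-1])]
        fused = torch.cat(enhanced + [stage_features[-1]], dim=1)
        return self.classifier(fused)

# Toy example with three stages whose features share a 32x32 spatial size.
feats = [torch.randn(1, c, 32, 32) for c in (64, 128, 256)]
logits = FusionHead()(feats)   # shape: (1, 19, 32, 32)
```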
Example methods for classifying data with neural networks using hierarchical supervised training
Fig. 3 illustrates example operations 300 that may be performed by a computing device to classify data using a neural network trained using hierarchical supervision, in accordance with certain aspects of the present disclosure. The operations 300 may be performed by, for example, a physical or virtual computing device, or a cluster of physical and/or virtual computing devices, on which a trained neural network may be deployed and used to classify an input and take one or more actions based on the classification of the input.
As illustrated, the operations 300 begin at block 310 with receiving an input for classification. The input may include, for example, images captured by one or more cameras or other imaging devices communicatively coupled with the computing device on which the neural network is deployed and executed. For example, the input may include domain-specific imaging data, such as images captured by a medical imaging device (e.g., an X-ray machine, a computed tomography machine, a magnetic resonance imaging machine, etc.). In another example, the input may include information to be used in real-time decisions, such as camera or other imaging data from one or more imaging devices used by an autonomously or semi-autonomously operating vehicle User Equipment (UE).
At block 320, the input is classified using a neural network having a plurality of stages. Each of the plurality of stages typically classifies the input using a different number of classification clusters. For example, each stage preceding the final stage of the neural network may be trained to generate inferences using a reduced number of classification clusters relative to the number of classification clusters used by the final stage. In certain aspects, these stages may use a monotonically increasing number of classification clusters as a function of stage number, such that a first stage of the neural network classifies an input into x classification clusters, a second stage of the neural network classifies an input into y classification clusters, a third stage of the neural network classifies an input into z classification clusters, and so on, where x < y < z. The number of classification clusters used at each stage of the neural network may be defined a priori according to an equation defining the number of classification clusters as a function of the stage number. Alternatively, the number may be selected based on the cluster validation set performance metric for the final stage of the neural network, the number of classification clusters used at the final stage, and the angle selected for a line drawn from the point on the plot corresponding to the cluster validation set performance metric for the final stage of the neural network.
At block 330, one or more actions are taken based on the entered classification. In general, the one or more actions may be associated with a particular application for which data is being classified. In medical applications, where a neural network is used to classify domain-specific images, the one or more actions may include identifying portions of the image corresponding to areas of the human body where disease is present. In an autonomous or semi-autonomous vehicle application, the one or more actions may include identifying a direction of travel and applying steering input to cause the vehicle to travel in the identified direction, accelerate or decelerate the vehicle, or otherwise control the vehicle to avoid obstacles or to avoid damage to personnel or property in the vicinity of the vehicle.
Example cluster validation set performance metric plot for selecting the number of classification clusters used in a stage of a neural network
Fig. 4 illustrates an example plot 400 of cluster validation set performance as a function of the number of classification clusters used at each of a plurality of stages in a neural network.
In particular, the plot 400 includes a first-stage inference performance line 402, a second-stage inference performance line 404, and a third-stage inference performance line 406 for the different stages of a three-stage neural network. In the plot 400, inference accuracy is represented on the vertical axis by a mean intersection-over-union (mIoU) measurement for each number of classification clusters, from a defined minimum value to a defined maximum value of the number of classification clusters. In general, inference accuracy increases as the number of classification clusters decreases (at the cost of the usefulness of any given inference, as a broad class may be less useful than a finer-grained class). The mIoU value for each stage and number of classification clusters generally represents the accuracy of the classifications made by the neural network, based on the ratio of the number of true positives to the sum of the numbers of true positives, false positives, and false negatives identified by the neural network.
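For reference, a minimal sketch of computing mIoU from a predicted label map and a ground-truth label map is shown below; the label-map sizes and class count are arbitrary.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union: average over classes of TP / (TP + FP + FN)."""
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (target == c))
        fp = np.sum((pred == c) & (target != c))
        fn = np.sum((pred != c) & (target == c))
        denom = tp + fp + fn
        if denom > 0:                      # skip classes absent from both maps
            ious.append(tp / denom)
    return float(np.mean(ious)) if ious else 0.0

# Toy example on random 64x64 label maps with 5 classes.
rng = np.random.default_rng(0)
pred = rng.integers(0, 5, size=(64, 64))
target = rng.integers(0, 5, size=(64, 64))
print(mean_iou(pred, target, num_classes=5))
```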
To identify the number of classification clusters to be used in the intermediate stages when retraining the neural network (e.g., the stages of the neural network other than the input stage and the final stage, the final stage being stage 3 in this example), the inference performance of the final stage of the neural network at the maximum number of classification clusters into which the data is classified may be selected as the origin 410. An angle θ, measured from the vertical axis toward the horizontal axis, may be selected for drawing a line 420 from the origin 410 in the plot 400. As discussed, when θ = 0°, the neural network can be trained using direct supervision and the same number of classification clusters at each stage of the neural network. Meanwhile, when θ = 90°, the inference performance of each stage may converge to a value within a threshold amount of the performance of the final stage at the origin 410.
Various techniques may be used to select the angle θ used when drawing the line 420 from the origin 410. In certain aspects, a "greedy" technique may be used to attempt to identify the angle that results in the greatest overall gain in inferential performance between one or more intermediate stages of the neural network and the final stage of the neural network.
After the angle θ is identified and the line 420 is drawn on the plot 400, the number of classification clusters to be used at each intermediate stage of the neural network may be identified. In general, the number of classification clusters to be used at any given intermediate stage of the neural network may be the number of classification clusters at the point where that stage's inference performance line intersects the line 420. Therefore, as illustrated in Fig. 4, the second stage of the neural network may be retrained to classify data into the number of classification clusters at point K2 430, and the first stage of the neural network may be retrained to classify data into the number of classification clusters at point K1 440. In this way, hierarchical supervision of the neural network may be achieved by using a smaller number of classification clusters at earlier stages of the neural network and increasing the number of classification clusters used at later stages of the neural network, until the maximum number of classification clusters is used at the final stage of the neural network.
Example architecture of neural networks using hierarchical supervised training
Fig. 5 illustrates an example architecture of a neural network 500 using hierarchical supervised training in accordance with aspects of the present disclosure. The neural network 500 includes an input stage 510, a first intermediate stage 520, a second intermediate stage 530, and an output stage 540. Within each stage, as conceptually illustrated, the input from the previous stage may be further compressed into another representation, and the data generated by the previous stage may be the input to the current stage of the neural network.
The input stage 510 represents a stage of the neural network 500 that is configured to receive the input data to be classified by the neural network 500. The input stage 510 generally passes the received input to the first intermediate stage 520, which generates a first stage output 522 using a first number of classification clusters that is less than the number of classification clusters into which data may be classified at the output stage 540 of the neural network 500. In the example illustrated herein, the input received at the input stage 510 may be an image captured by an imaging device in an autonomous vehicle, and the first stage output 522 may include classifying different pixels in the input image (representing different portions of the environment in which the autonomous vehicle operates) into one of a plurality of object classifications (e.g., roads, buildings, other vehicles, etc.).
The output of the first intermediate stage 520 may be input to a second intermediate stage 530. Similar to the first intermediate stage 520, the second intermediate stage 530 may be configured to classify data input from the first intermediate stage 520 using a second number of classification clusters. The second number of classification clusters may be greater than the first number of classification clusters and may be less than the number of classification clusters into which the data may be classified in the output stage 540 of the neural network 500. For example, intermediate stage 520 may classify data using the number of classification clusters associated with point K1 440, while intermediate stage 530 may classify data using the number of classification clusters associated with point K2 430. In the example illustrated herein, the second stage output 532 also includes classifying different pixels in the received image into one of a plurality of categories. Different representations of these pixels, such as different color values, generally represent different classifications into which the data is classified. In this example, the second intermediate stage 530 may be configured to identify differences between different types of vehicles relative to the output 522 in which all vehicles in the image are similarly classified. Instead of classifying all vehicles in the image as broad categories of vehicles, the second intermediate stage 530 may classify vehicles as a first category of four-wheel vehicles and a second category of two-wheel vehicles.
The output of the second intermediate stage 530 may be provided as an input into the output stage 540, which is configured to generate a final classification of the data in the image and output the final classification 542 for use in identifying an action to perform based on the final classification. As discussed, the output stage 540 is generally trained to classify data into a number of classification clusters that is greater than the number of classification clusters implemented at the first intermediate stage 520 and the second intermediate stage 530. In this example, further fine-grained detail has been identified at the output stage 540, such that different portions of the image are demarcated between road surface and non-road surface.
Each of the first intermediate stage 520, the second intermediate stage 530, and the output stage 540 may be trained using supervised learning techniques. As discussed, the supervised learning technique may be hierarchical such that early stages of the neural network 500 are trained to classify data into fewer classification clusters than later stages of the neural network. By so doing, aspects of the present disclosure may improve the accuracy of the neural network 500 while taking into account the computational power available to perform reasoning at any given stage of the neural network 500.
Fig. 6 illustrates an example architecture of a neural network 600 using hierarchical supervised training, where the neural network includes a segmentation transformer associated with each stage of the neural network, in accordance with aspects of the present disclosure.
As illustrated, the neural network 600 includes an input stage 610, a plurality of intermediate stages 620 and 630, and an output stage 640. Each intermediate stage 620 and 630 is associated with a corresponding segmentation transformer (or OCR module) 622 and 632, respectively, and the output stage 640 may be associated with an output segmentation transformer 642. As discussed, these segmentation transformers allow one-dimensional embedding to be extracted for each class.
As illustrated, each segmentation transformer 622, 632, and 642 may be configured to classify data into a selected number of classification clusters as a function of the stage of the neural network in which the segmentation transformer is deployed. For a neural network with a number of stages N = 3, the segmentation transformer 642 (associated with the final stage 640 of the neural network 600) may be trained to classify data into the full number of classification clusters. The intermediate stage 630, which is the second stage of the neural network 600, may be trained to classify data into a smaller fraction of the total number of classification clusters, and the intermediate stage 620, which is the first stage of the neural network 600, may be trained to classify data into a still smaller fraction of the total number of classification clusters.
To provide additional information when training the neural network 600, the outputs of the segmentation transformers associated with the stages of the neural network 600 other than the final stage 640 (e.g., the outputs of the segmentation transformers 622 and 632 as illustrated in Fig. 6) may be concatenated at a concatenation block 650. That is, for an N-stage neural network, the outputs of the segmentation transformers associated with stages 1 through N-1 of the neural network may be concatenated. The output of the concatenation block 650 may be input to the final stage 640 of the neural network (e.g., stage 3 of the neural network, where N = 3) to train the final stage. Concatenating the outputs of the segmentation transformers may introduce additional processing overhead during training and when generating inferences through the neural network 600, but doing so may allow additional information to be used in training and in generating inferences using the neural network 600 and may improve the accuracy of the inferences generated by the neural network 600.
The performance of a neural network trained using the techniques discussed herein generally results in increased inference performance relative to training a multi-stage neural network using direct supervision. For example, inference accuracy as measured by mean intersection-over-union (mIoU) is generally higher for neural networks trained using the hierarchical supervision techniques discussed herein than for neural networks trained using direct supervision techniques. In certain aspects, various techniques, such as incorporating a segmentation transformer into each stage of a neural network, can lead to increased inference accuracy at the cost of increased computation (e.g., as measured in billions of multiply-accumulate operations (MACs)). When controlled for a constant computational budget (e.g., a similar number of MACs), the hierarchical supervision techniques discussed herein may still result in increased inference accuracy for the same or similar computational cost.
Example processing system for training machine learning models using hierarchical supervision
Fig. 7 depicts an example processing system 700 for training a machine learning model (such as described herein with respect to fig. 2) using hierarchical supervision.
The processing system 700 includes a Central Processing Unit (CPU) 702, which in some examples may be a multi-core CPU. The instructions executed at the CPU 702 may be loaded, for example, from a program memory associated with the CPU 702 or may be loaded from the memory 724 or a memory partition.
The processing system 700 also includes additional processing components tailored for specific functions, such as a Graphics Processing Unit (GPU) 704, a Digital Signal Processor (DSP) 706, a Neural Processing Unit (NPU) 708, and a wireless connectivity component 712.
An NPU, such as 708, is typically a dedicated circuit configured to implement all the necessary control and arithmetic logic for performing machine learning algorithms, such as algorithms for processing Artificial Neural Networks (ANNs), Deep Neural Networks (DNNs), Random Forests (RF), and the like. The NPU is sometimes alternatively referred to as a Neural Signal Processor (NSP), Tensor Processing Unit (TPU), Neural Network Processor (NNP), Intelligent Processing Unit (IPU), Visual Processing Unit (VPU), or Graph Processing Unit.
The NPU (such as 708) is configured to accelerate performance of common machine learning tasks such as image classification, machine translation, object detection, and various other predictive models. In some examples, multiple NPUs may be instantiated on a single chip, such as a system on a chip (SoC), while in other examples, multiple NPUs may be part of a dedicated neural network accelerator.
The NPU may be optimized for training or inference, or in some cases configured to balance performance between the two. For NPUs that are capable of both training and inferring, these two tasks can still generally be performed independently.
NPUs designed to accelerate training are generally configured to accelerate the optimization of new models, which is a highly computationally intensive operation involving inputting an existing dataset (typically labeled or tagged), iterating over the dataset, and then adjusting model parameters (such as weights and biases) in order to improve model performance. In general, optimizing based on mispredictions involves propagating back through the layers of the model and determining gradients to reduce the prediction error.
NPUs designed to accelerate inference are generally configured to operate on a complete model. Such NPUs may thus be configured to: new pieces of data are input and processed quickly through the already trained model to generate model outputs (e.g., inferences).
In one implementation, the NPU 708 is part of one or more of the CPU 702, GPU 704, and/or DSP 706.
The processing system 700 may also include one or more input and/or output devices 722, such as a screen, touch-sensitive surfaces (including touch-sensitive displays), physical buttons, speakers, microphones, and so forth.
In some examples, one or more processors of processing system 700 may be based on an ARM or RISC-V instruction set.
The processing system 700 also includes a memory 724, which memory 724 represents one or more static and/or dynamic memories, such as a dynamic random access memory, a flash-based static memory, or the like. In this example, memory 724 includes computer-executable components that are executable by one or more of the aforementioned processors of the processing system 700.
In particular, in this example, memory 724 includes a neural network training component 724A, a cluster validation set performance metric generator component 724B, a classification cluster selection component 724C, a neural network retraining component 724D, and a neural network deployment component 724E. The depicted components, as well as other non-depicted components, may be configured to perform various aspects of the methods described herein.
In general, the processing system 700 and/or components thereof may be configured to perform the methods described herein. Notably, aspects of the processing system 700 may be distributed.
Fig. 8 depicts an example processing system 800 for classifying data with a multi-stage neural network trained using supervised learning techniques, such as discussed herein, for example, with reference to fig. 3.
The processing system 800 includes a Central Processing Unit (CPU) 802, which in some examples may be a multi-core CPU. The instructions executed at the CPU 802 may be loaded, for example, from a program memory associated with the CPU 802 or may be loaded from the memory 824 or a memory partition.
The processing system 800 also includes additional processing components tailored for specific functions, such as a Graphics Processing Unit (GPU) 804, a Digital Signal Processor (DSP) 806, a Neural Processing Unit (NPU) 808, a multimedia processing unit 810, and a wireless connectivity component 812.
An NPU, such as 808, is typically a dedicated circuit configured to implement all the necessary control and arithmetic logic for performing machine learning algorithms, such as algorithms for processing Artificial Neural Networks (ANNs), Deep Neural Networks (DNNs), Random Forests (RF), etc. The NPU is sometimes alternatively referred to as a Neural Signal Processor (NSP), Tensor Processing Unit (TPU), Neural Network Processor (NNP), Intelligent Processing Unit (IPU), Visual Processing Unit (VPU), or Graph Processing Unit.
The NPU (such as 808) may be configured similarly to the NPU 708 described above with respect to fig. 7. In one implementation, the NPU 808 is part of one or more of the CPU 802, GPU 804, and/or DSP 806.
In some examples, wireless connectivity component 812 may include subcomponents such as for third generation (3G) connectivity, fourth generation (4G) connectivity (e.g., 4G LTE), fifth generation connectivity (e.g., 5G or NR), wi-Fi connectivity, bluetooth connectivity, and other wireless data transmission standards. The wireless connectivity component 812 is further connected to one or more antennas 814.
The processing system 800 can also include one or more sensor processing units 816 associated with any manner of sensor, one or more Image Signal Processors (ISPs) 818 associated with any manner of image sensor, and/or a navigation processor 820, which navigation processor 820 can include satellite-based positioning system components (e.g., GPS or GLONASS) as well as inertial positioning system components.
The processing system 800 can also include one or more input and/or output devices 822, such as a screen, touch-sensitive surface (including a touch-sensitive display), physical buttons, speakers, microphones, and so forth.
In some examples, one or more processors of processing system 800 may be based on an ARM or RISC-V instruction set.
The processing system 800 also includes a memory 824, which memory 824 represents one or more static and/or dynamic memories, such as dynamic random access memory, flash-based static memory, or the like. In this example, memory 824 includes computer-executable components that are executable by one or more of the foregoing processors of processing system 800.
In particular, in this example, memory 824 includes an input receiving component 824A, an input classification component 824B, and an action taking component 824C. The depicted components, as well as other non-depicted components, may be configured to perform various aspects of the methods described herein.
In general, the processing system 800 and/or components thereof may be configured to perform the methods described herein.
It is noted that in other embodiments, aspects of the processing system 800 may be omitted, such as where the processing system 800 is a server computer or the like. For example, the multimedia processing unit 810, the wireless connectivity component 812, the sensor processing units 816, the ISPs 818, and/or the navigation processor 820 may be omitted in other embodiments. In addition, aspects of the processing system 800 may be distributed, such as between a system that trains a model and a system that uses the model to generate inferences.
Example clauses
Implementation details of various aspects of the present disclosure are described in the following numbered clauses.
Clause 1: a method, comprising: training a neural network having a plurality of stages using a training dataset into which data in the training dataset can be classified and an initial number of classification clusters; generating a cluster validation set performance metric for each of the plurality of stages of the neural network based on a reduced number of classification clusters relative to an initial number of classification clusters and a validation data set separate from the training data set; selecting a number of classification clusters to be implemented at each of the plurality of stages of the neural network based on the cluster verification set performance metrics and the selected angle relative to the cluster verification set performance metrics for a final stage of the neural network; retraining the neural network based on the training dataset and the selected number of classification clusters for each of the plurality of phases; and deploying the trained neural network.
Clause 2: the method according to clause 1, further comprising: for each of the plurality of phases: calculating an confusion matrix for the training data set and an confusion matrix for the validation data set, wherein discrete elements in one dimension of the confusion matrix represent one of a plurality of classification clusters; calculating an adjacency matrix based on the confusion matrix calculated for the training dataset and the confusion matrix calculated for the validation dataset; and generating the reduced number of classification clusters using aggregated clusters for neighboring clusters in the computed adjacency matrix such that a plurality of neighboring clusters are reduced to a single cluster representing a wider classification of data than each of the plurality of neighboring clusters.
Clause 3: the method of any of clauses 1 or 2, wherein generating the cluster verification set performance metric comprises: for the initial number of cluster sizes up to and including the classification clusters, a performance metric is calculated for each stage of the plurality of stages.
Clause 4: the method according to clause 3, wherein the performance metrics comprise: a mean intersection-over-union (mIoU) metric calculated as a function of a number of clusters for each of the plurality of stages of the neural network.
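For reference, the mIoU metric of clause 4 can be read directly off a confusion matrix; the snippet below is a standard formulation rather than anything reproduced from the application.

```python
import numpy as np

def mean_iou(cm):
    """cm: square confusion matrix (rows = ground truth, columns = predictions)."""
    cm = cm.astype(np.float64)
    tp = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - tp   # TP + FP + FN per cluster
    valid = union > 0                              # ignore clusters absent from both
    return float((tp[valid] / union[valid]).mean())
```

Evaluating this per stage and per reduced cluster count produces the mIoU-versus-number-of-clusters curves that clauses 5 and 6 operate on.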
Clause 5: the method according to any one of clauses 1 to 4, wherein: the selected angle includes a zero degree angle; and training the neural network based on the training dataset and the selected number of classification clusters for each stage includes training the plurality of stages of the neural network using direct supervision.
Clause 6: the method according to any one of clauses 1 to 4, wherein: the selected angle includes a ninety degree angle; and training the neural network based on the training data set and the selected number of classification clusters for each stage includes training the plurality of stages of the neural network such that performance of each stage of the neural network converges to a performance level within a threshold.
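Clauses 5 and 6 describe only the two extreme angle settings, so the selection rule below is a hedged guess at one possible implementation: at zero degrees every stage keeps the initial cluster count (direct supervision), and at ninety degrees each earlier stage adopts the smallest cluster count whose validation mIoU stays within a tolerance of the final stage's mIoU. The function name, the `tol` parameter, and the treatment of intermediate angles are assumptions, not the application's definition.

```python
def select_clusters_per_stage(miou_curves, initial_clusters, angle_deg, tol=0.01):
    """miou_curves: dict stage_index -> {num_clusters: validation mIoU}.
    The final stage is taken to be the highest stage index and always keeps
    the initial (finest) cluster count. Intermediate angles are not modeled."""
    final_stage = max(miou_curves)
    target = miou_curves[final_stage][initial_clusters] - tol
    chosen = {final_stage: initial_clusters}
    for stage, curve in miou_curves.items():
        if stage == final_stage:
            continue
        if angle_deg == 0:
            chosen[stage] = initial_clusters        # direct supervision
        else:
            good = [k for k, m in curve.items() if m >= target]
            chosen[stage] = min(good) if good else initial_clusters
    return chosen
```

For example, calling the function with angle_deg=90 picks the coarsest cluster count per earlier stage that keeps that stage's validation performance within `tol` of the final stage's performance.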
Clause 7: the method of any of clauses 1-6, wherein retraining the neural network based on the training dataset and the selected number of classification clusters for each stage comprises minimizing a total loss function, wherein: the total loss function includes a sum of loss functions of each respective stage of the plurality of stages weighted by values associated with each respective stage of the plurality of stages, and the loss function of a respective stage of the plurality of stages is based on the number of classification clusters selected for the respective stage.
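A minimal PyTorch sketch of the weighted total loss in clause 7 is shown below, assuming a semantic-segmentation setting in which each stage emits logits over its own selected cluster set and the fine ground-truth labels are coarsened per stage through a lookup table. The function and argument names are illustrative, not taken from the application.

```python
import torch
import torch.nn.functional as F

def total_loss(stage_logits, fine_labels, label_maps, stage_weights):
    """stage_logits: list of tensors, one per stage, shaped (N, C_s, H, W).
    fine_labels:  (N, H, W) ground truth over the initial classification clusters.
    label_maps:   list of LongTensors; label_maps[s][c] is the cluster id that
                  fine label c falls into at stage s, reflecting that stage's
                  selected number of clusters.
    stage_weights: list of scalars weighting each stage's loss term."""
    loss = torch.zeros((), dtype=torch.float32, device=fine_labels.device)
    for logits, mapping, weight in zip(stage_logits, label_maps, stage_weights):
        coarse_labels = mapping[fine_labels]          # relabel to this stage's clusters
        loss = loss + weight * F.cross_entropy(logits, coarse_labels)
    return loss
```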
Clause 8: the method of any of clauses 1-7, wherein retraining the neural network based on the training dataset and the selected number of classification clusters for each stage comprises: aggregating the outputs of each of the plurality of stages except for a final stage of the neural network, and training the final stage of the neural network based on inputting the aggregated outputs of the plurality of stages of the neural network other than the final stage to a segmentation transformer module associated with the final stage of the neural network.
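For clause 8, one possible realization of the aggregation is sketched below: the non-final stages' feature maps are upsampled to a common resolution, concatenated, and passed to the final stage's head. The concatenation-based aggregation and the 1x1-convolution placeholder standing in for the segmentation transformer module are assumptions; the application's actual module is not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FinalStageHeadPlaceholder(nn.Module):
    """Stand-in for the final stage's segmentation transformer module."""
    def __init__(self, in_channels, num_final_clusters):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, num_final_clusters, kernel_size=1)

    def forward(self, x):
        return self.proj(x)

def aggregate_non_final_stages(stage_features, head):
    """stage_features: feature maps from every stage except the final one.
    Each map is upsampled to the first entry's spatial size and concatenated
    along channels before being fed to the final stage's head."""
    target_size = stage_features[0].shape[-2:]
    upsampled = [
        F.interpolate(f, size=target_size, mode="bilinear", align_corners=False)
        for f in stage_features
    ]
    return head(torch.cat(upsampled, dim=1))
```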
Clause 9: a method, comprising: receiving input for classification; classifying the input using a neural network having a plurality of stages, wherein each stage of the plurality of stages classifies the input using a different number of classification clusters; and taking one or more actions based on the classification of the input.
Clause 10: the method of clause 9, wherein classifying the input comprises classifying the input at a stage of the plurality of stages based on an inference generated by a previous stage of the plurality of stages.
Clause 11: the method according to any of clauses 9 or 10, wherein: the neural network includes a segmentation transformer at each stage of the neural network, the outputs of each stage of the neural network except for a final stage of the neural network are aggregated, and the aggregated outputs are input into the segmentation transformer associated with the final stage of the neural network to generate the classification of the input.
Clause 12: the method of any of clauses 9-11, wherein each stage of the plurality of stages classifies the input using a greater number of classification clusters than a preceding stage of the plurality of stages.
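To make clauses 9 through 12 concrete, the sketch below runs a multi-stage model whose stages emit logits over progressively finer cluster sets, takes the per-pixel argmax of the final (finest) stage as the classification, and hands it to a caller-supplied action. The assumption that the model returns one logits tensor per stage is an illustration of the interface, not the application's definition of it.

```python
import torch

@torch.no_grad()
def classify_and_act(model, x, act):
    """model(x) is assumed to return a list of per-stage logits, each shaped
    (N, C_s, H, W), with C_s increasing from the first stage to the last.
    `act` is whatever downstream action consumes the final segmentation map
    (e.g., a compression, AR/VR, or vehicle-control pipeline)."""
    stage_logits = model(x)                           # coarse -> fine predictions
    stage_labels = [logits.argmax(dim=1) for logits in stage_logits]
    final_labels = stage_labels[-1]                   # finest-grained classification
    act(final_labels)
    return stage_labels
```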
Clause 13: an apparatus, comprising: a memory having executable instructions stored thereon; and a processor configured to execute the executable instructions to cause the apparatus to perform the method according to any one of clauses 1 to 12.
Clause 14: an apparatus, comprising: means for performing the method according to any of clauses 1 to 12.
Clause 15: a non-transitory computer-readable medium having instructions stored thereon that, when executed by a processor, cause the processor to perform the method according to any of clauses 1 to 12.
Clause 16: a computer program product embodied on a computer-readable storage medium, comprising code for performing the method according to any of clauses 1 to 12.
Additional considerations
The previous description is provided to enable any person skilled in the art to practice the various embodiments described herein. The examples discussed herein are not intended to limit the scope, applicability, or embodiment as set forth in the claims. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments. For example, changes may be made in the function and arrangement of elements discussed without departing from the scope of the disclosure. Various examples may omit, substitute, or add various procedures or components as appropriate. For example, the described methods may be performed in a different order than described, and various steps may be added, omitted, or combined. Moreover, features described with reference to some examples may be combined in some other examples. For example, an apparatus may be implemented or a method practiced using any number of the aspects set forth herein. In addition, the scope of the present disclosure is intended to cover such an apparatus or method practiced using other structure, functionality, or both, that is complementary to, or different from, the various aspects of the present disclosure set forth herein. It should be understood that any aspect of the disclosure disclosed herein may be embodied by one or more elements of the claims.
As used herein, the phrase "exemplary" means "serving as an example, instance, or illustration." Any aspect described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other aspects.
As used herein, a phrase referring to "at least one of" a list of items refers to any combination of those items, including single members. As an example, "at least one of: a, b, or c" is intended to cover a, b, c, a-b, a-c, b-c, and a-b-c, as well as any combination with multiples of the same element (e.g., a-a-a, a-a-b, a-a-c, a-b-b, a-c-c, b-b-b, b-b-c, c-c, and c-c-c, or any other ordering of a, b, and c).
As used herein, the term "determining" encompasses a wide variety of actions. For example, "determining" may include calculating, computing, processing, deriving, investigating, looking up (e.g., looking up in a table, a database, or another data structure), ascertaining, and the like. Also, "determining" may include receiving (e.g., receiving information), accessing (e.g., accessing data in memory), and the like. Also, "determining" may include resolving, selecting, choosing, establishing, and the like.
The methods disclosed herein comprise one or more steps or actions for achieving the methods. The method steps and/or actions may be interchanged with one another without departing from the scope of the claims. In other words, unless a specific order of steps or actions is specified, the order and/or use of specific steps and/or actions may be modified without departing from the scope of the claims. Furthermore, the various operations of the methods described above may be performed by any suitable means capable of performing the corresponding functions. The means may include various hardware and/or software components and/or modules, including, but not limited to, a circuit, an Application-Specific Integrated Circuit (ASIC), or a processor. Generally, where there are operations illustrated in the figures, those operations may have corresponding counterpart means-plus-function components with similar numbering.
The following claims are not intended to be limited to the embodiments shown herein, but are to be accorded the full scope consistent with the language of the claims. Within a claim, reference to an element in the singular is not intended to mean "one and only one" unless specifically so stated, but rather "one or more." The term "some" refers to one or more unless specifically stated otherwise. No claim element is to be construed under the provisions of 35 U.S.C. § 112(f) unless the element is expressly recited using the phrase "means for" or, in the case of a method claim, the element is recited using the phrase "step for." All structural and functional equivalents to the elements of the various aspects described throughout this disclosure that are known or later come to be known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be encompassed by the claims. Furthermore, nothing disclosed herein is intended to be dedicated to the public regardless of whether such disclosure is explicitly recited in the claims.

Claims (24)

1. A computer-implemented method for generating inferences using a machine learning model, comprising:
receiving input for classification;
classifying the input using a neural network having a plurality of stages, wherein each stage of the plurality of stages classifies the input using a different number of classification clusters; and
taking one or more actions based on the classification of the input.
2. The method of claim 1, wherein classifying the input comprises classifying the input at a stage of the plurality of stages based on an inference generated by a previous stage of the plurality of stages.
3. The method of claim 1, wherein:
the neural network includes a segmentation transformer at each stage of the neural network,
the outputs of each stage of the neural network, except for the final stage of the neural network, are aggregated, and
the aggregated output is input into a segmentation transformer associated with the final stage of the neural network to generate the classification of the input.
4. The method of claim 1, wherein each of the plurality of stages classifies the input using a greater number of classification clusters than a previous stage of the plurality of stages.
5. A computer-implemented method for training a machine learning model, comprising:
training a neural network having a plurality of stages using a training dataset and an initial number of classification clusters into which data in the training dataset can be classified;
generating a cluster validation set performance metric for each of the plurality of stages of the neural network based on a reduced number of classification clusters relative to the initial number of classification clusters and a validation data set separate from the training data set;
selecting a number of classification clusters to be implemented at each of the plurality of stages of the neural network based on the cluster validation set performance metrics and a selected angle relative to the cluster validation set performance metric for a final stage of the neural network;
retraining the neural network based on the training dataset and the selected number of classification clusters for each of the plurality of stages; and
deploying the trained neural network.
6. The method of claim 5, further comprising, for each of the plurality of stages: calculating a confusion matrix for the training dataset and a confusion matrix for the validation dataset, wherein each discrete element in one dimension of the confusion matrix represents one of a plurality of classification clusters;
calculating an adjacency matrix based on the confusion matrix calculated for the training dataset and the confusion matrix calculated for the validation dataset; and
generating the reduced number of classification clusters by aggregating adjacent clusters in the computed adjacency matrix such that a plurality of adjacent clusters are reduced to a single cluster representing a broader classification of data than each of the plurality of adjacent clusters.
7. The method of claim 5, wherein generating the cluster validation set performance metric comprises: calculating a performance metric for each stage of the plurality of stages for cluster sizes up to and including the initial number of classification clusters.
8. The method of claim 7, wherein the performance metrics comprise: a mean intersection-over-union (mIoU) metric calculated as a function of a number of clusters for each of the plurality of stages of the neural network.
9. The method of claim 5, wherein:
the selected angle includes a zero degree angle; and
training the neural network based on the training dataset and the selected number of classification clusters for each stage includes training the plurality of stages of the neural network using direct supervision.
10. The method of claim 5, wherein:
the selected angle includes a ninety degree angle; and
training the neural network based on the training dataset and the selected number of classification clusters for each stage includes training the plurality of stages of the neural network such that performance of each stage of the neural network converges to a performance level within a threshold.
11. The method of claim 5, wherein retraining the neural network based on the training dataset and the selected number of classification clusters for each stage comprises minimizing a total loss function, wherein: the total loss function includes a sum of loss functions of each respective stage of the plurality of stages weighted by values associated with each respective stage of the plurality of stages, and
the loss function of a respective stage of the plurality of stages is based on a number of classification clusters selected for the respective stage.
12. The method of claim 5, wherein retraining the neural network based on the training data set and the selected number of classification clusters for each stage comprises:
aggregating the outputs of each of the plurality of stages except for the final stage of the neural network, and
training the final stage of the neural network based on inputting the aggregated outputs of the plurality of stages of the neural network other than the final stage to a segmentation transformer module associated with the final stage of the neural network.
13. A processing system, comprising:
a memory having stored thereon computer executable instructions; and
a processor configured to execute the computer-executable instructions to cause the processing system to:
receive input for classification;
classify the input using a neural network having a plurality of stages, wherein each stage of the plurality of stages classifies the input using a different number of classification clusters; and
take one or more actions based on the classification of the input.
14. The processing system of claim 13, wherein to classify the input, the processor is configured to cause the processing system to classify the input at a stage of the plurality of stages based on an inference generated by a previous stage of the plurality of stages.
15. The processing system of claim 13, wherein:
the neural network includes a segmentation transformer at each stage of the neural network,
the outputs of each stage of the neural network, except for the final stage of the neural network, are aggregated, and
the aggregated output is input into a segmentation transformer associated with the final stage of the neural network to generate the classification of the input.
16. The processing system of claim 13, wherein each stage of the plurality of stages classifies the input using a greater number of classification clusters than a preceding stage of the plurality of stages.
17. A processing system, comprising:
a memory having stored thereon computer executable instructions; and
a processor configured to execute the computer-executable instructions to cause the processing system to:
train a neural network having a plurality of stages using a training dataset and an initial number of classification clusters into which data in the training dataset can be classified;
generate a cluster validation set performance metric for each of the plurality of stages of the neural network based on a reduced number of classification clusters relative to the initial number of classification clusters and a validation dataset separate from the training dataset;
select a number of classification clusters to be implemented at each of the plurality of stages of the neural network based on the cluster validation set performance metrics and a selected angle relative to the cluster validation set performance metric for a final stage of the neural network;
retrain the neural network based on the training dataset and the selected number of classification clusters for each of the plurality of stages; and
deploy the trained neural network.
18. The processing system of claim 17, wherein the processor is further configured to cause the processing system to:
calculate a confusion matrix for the training dataset and a confusion matrix for the validation dataset, wherein each discrete element in one dimension of the confusion matrix represents one of a plurality of classification clusters;
calculate an adjacency matrix based on the confusion matrix calculated for the training dataset and the confusion matrix calculated for the validation dataset; and
generate the reduced number of classification clusters by aggregating adjacent clusters in the computed adjacency matrix such that a plurality of adjacent clusters are reduced to a single cluster representing a broader classification of data than each of the plurality of adjacent clusters.
19. The processing system of claim 17, wherein to generate the cluster validation set performance metrics, the processor is configured to cause the processing system to calculate a performance metric for each stage of the plurality of stages for cluster sizes up to and including the initial number of classification clusters.
20. The processing system of claim 19, wherein the performance metrics comprise: a mean intersection-over-union (mIoU) metric calculated as a function of a number of clusters for each of the plurality of stages of the neural network.
21. The processing system of claim 17, wherein:
the selected angle includes a zero degree angle; and
to train the neural network based on the training dataset and the selected number of classification clusters for each stage, the processor is configured to cause the processing system to train the plurality of stages of the neural network using direct supervision.
22. The processing system of claim 17, wherein:
the selected angle includes a ninety degree angle; and
to train the neural network based on the training dataset and the selected number of classification clusters for each stage, the processor is configured to cause the processing system to train the plurality of stages of the neural network such that performance of each stage of the neural network converges to a performance level within a threshold.
23. The processing system of claim 17, wherein to retrain the neural network based on the training dataset and the selected number of classification clusters for each stage, the processor is configured to cause the processing system to minimize a total loss function, wherein:
the total loss function includes a sum of loss functions of each respective stage of the plurality of stages weighted by values associated with each respective stage of the plurality of stages, and
the loss function of a respective stage of the plurality of stages is based on a number of classification clusters selected for the respective stage.
24. The processing system of claim 17, wherein to retrain the neural network based on the training dataset and the selected number of classification clusters for each stage, the processor is configured to cause the processing system to:
aggregate the outputs of each of the plurality of stages except for the final stage of the neural network, and
train the final stage of the neural network based on inputting the aggregated outputs of the plurality of stages of the neural network other than the final stage to a segmentation transformer module associated with the final stage of the neural network.

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US63/214,940 2021-06-25
US17/808,949 2022-06-24
US17/808,949 US20230004812A1 (en) 2021-06-25 2022-06-24 Hierarchical supervised training for neural networks
PCT/US2022/073173 WO2022272311A1 (en) 2021-06-25 2022-06-25 Hierarchical supervised training for neural networks

Publications (1)

Publication Number Publication Date
CN117529733A (en) 2024-02-06

Family

ID=89749912

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202280043463.9A Pending CN117529733A (en) 2021-06-25 2022-06-25 Hierarchical supervision training of neural networks

Country Status (1)

Country Link
CN (1) CN117529733A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination