WO2023087953A1 - Method, device and electronic device for searching a neural network integrated model - Google Patents

Method, device and electronic device for searching a neural network integrated model

Info

Publication number
WO2023087953A1
Authority
WO
WIPO (PCT)
Prior art keywords
neural network
network architecture
distribution
search
architecture
Prior art date
Application number
PCT/CN2022/123139
Other languages
English (en)
French (fr)
Inventor
茹彬鑫
万星辰
埃斯佩兰卡•佩德罗
卡路奇•法比奥•玛利亚
李震国
Original Assignee
华为技术有限公司
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2023087953A1 publication Critical patent/WO2023087953A1/zh

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G06F 18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 - Machine learning
    • G06N 20/20 - Ensemble learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/0464 - Convolutional networks [CNN, ConvNet]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the embodiments of the present application relate to the field of machine learning, and in particular to a method, device and electronic device for searching a neural network integrated model.
  • Models based on deep neural networks have achieved remarkable progress in various tasks such as image recognition, speech recognition, and machine translation.
  • the predicted probability (softmax probability) of a single deep model generally has large calibration errors and low confidence.
  • the rejection ability of a single deep neural network is weak: it cannot reflect accurate uncertainty and is prone to being overconfident in wrong predictions.
  • Out-of-distribution (OOD) data refers to test samples from a distribution different from that of the training sample data. The difference may be due to a different environment in which the data was generated, or to the samples being corrupted or perturbed.
  • Therefore, the machine learning model must have the ability to reject recognition, i.e., to decline to make a prediction on such inputs.
  • a base learner refers to a single model in an ensemble of models.
  • An ensemble model combines the predictions of multiple base learners into a final prediction, forming a better model/prediction than any single base learner.
  • The integrated model can not only achieve higher test accuracy, but also produce better-calibrated prediction probabilities; especially on OOD data, it can show more accurately quantified uncertainty and higher robustness.
  • the ensemble model also has these advantages.
  • Deep ensemble models can effectively improve test accuracy and model calibration accuracy by combining multiple neural networks that share the same network architecture but use different weight initialisation values (initialisation) for training, and averaging their final predicted outputs (output logits).
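  • As an illustration of this averaging step, the following sketch (not taken from the disclosure; the array shapes and the choice to average softmax probabilities derived from the logits are assumptions) combines the outputs of several identically-architected member networks into one ensemble prediction:

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax over the class axis.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def deep_ensemble_predict(member_logits):
    """member_logits: shape (n_members, n_samples, n_classes), the output
    logits of networks trained from different weight initialisations.
    Returns the averaged class probabilities of the ensemble."""
    probs = softmax(np.asarray(member_logits))   # per-member probabilities
    return probs.mean(axis=0)                    # average over ensemble members

# Toy usage: 3 members, 2 samples, 4 classes.
rng = np.random.default_rng(0)
print(deep_ensemble_predict(rng.normal(size=(3, 2, 4))).round(3))
```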
  • Model calibration refers to keeping the predicted probability of an event outcome consistent with the actual empirical probability of the event. For example, in a binary classification task, if we take 100 pictures with a predicted probability of 0.7 and 70 of them have a true label of 1, the model's predicted probability is consistent with the real empirical probability; in other words, the model's predictions are accurate and reliable. In practical applications, especially high-risk applications, the prediction probability of the machine learning model is often used for user judgment or decision making, so the confidence of its predictions is very important.
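  • A minimal check of this notion of calibration, following the 0.7-bucket example above (illustrative only; the bin width and variable names are assumptions), compares the predicted probabilities in each bucket with the observed fraction of positive labels:

```python
import numpy as np

def empirical_calibration(pred_probs, labels, bins=10):
    """For a binary task, group predictions into probability bins and compare
    the mean predicted probability in each bin with the observed fraction of
    true label 1 (the empirical probability)."""
    pred_probs = np.asarray(pred_probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (pred_probs >= lo) & (pred_probs < hi)
        if mask.any():
            rows.append((pred_probs[mask].mean(), labels[mask].mean(), int(mask.sum())))
    return rows  # (mean predicted prob, empirical prob, count) per bin

# Toy usage: 100 samples predicted near 0.7, of which about 70 are truly label 1.
rng = np.random.default_rng(0)
p = np.clip(rng.normal(0.7, 0.02, size=100), 0, 1)
y = (rng.random(100) < 0.7).astype(int)
for mean_p, emp_p, n in empirical_calibration(p, y):
    print(f"predicted {mean_p:.2f} vs empirical {emp_p:.2f} over {n} samples")
```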
  • The performance of an ensemble model often depends on the diversity of the base learners it combines: the more the base models diverge from one another, the better the ensemble tends to be, so many ensemble methods try to promote diversity among the base learners.
  • The deep ensemble model increases diversity by changing the initial training weights of the basic learners; the hyper-deep ensemble model (hyper-deep ensemble) additionally changes the training hyperparameters on top of changing the initial weights, which increases diversity even further.
  • the basic learners of these deep integrated models all share the same neural network architecture, so a very natural extension to increase diversity is to use neural networks of different architectures to form an integrated model (architecture ensemble).
  • the integrated model is an integrated model composed of basic learners of multiple deep neural networks with different network architectures.
  • embodiments of the present application provide a method, device and terminal device for searching neural network integrated models.
  • The embodiment of the present application provides a method for searching the integrated model of neural network architecture. The method includes: obtaining a data set, where the data set includes samples and labels in a classification task; searching using a neural network architecture distribution search algorithm, which includes: determining the hyperparameters of a neural network architecture distribution; sampling a neural network architecture in the architecture distribution defined by the hyperparameters; training and evaluating the neural network architecture according to the samples and labels in the classification task to obtain a performance index; determining, according to the performance index, the neural network architecture distribution sharing the hyperparameters, and obtaining a candidate pool of basic learners, where a basic learner is a neural network architecture that meets the requirements of the architecture distribution and the neural network architecture is formed by repeated stacking of neural network architecture units; determining a proxy model, where the proxy model is used to predict the test performance of unevaluated neural network architectures; and predicting the test performance of the basic learners in the candidate pool through the proxy model, and determining the k basic learners required by the classification task to form an integrated model, where the size of the integrated model is k.
  • In this way, the number of evaluations of single neural network architectures and single integrated models is greatly reduced, thereby significantly reducing the difficulty and cost of searching the integrated model without reducing search quality; compared with a single deep neural network model, the integrated model is better at rejecting OOD data, so it is more robust to data distribution shifts.
  • The search using the neural network architecture distribution search algorithm includes: using the approximate neural architecture search via operation distribution (ANASOD) algorithm, which learns the probability distribution of operators, to perform the neural network architecture distribution search. In this way, a larger part of the search space can be traversed, greatly improving search efficiency.
  • The determination of the hyperparameters of the neural network architecture distribution includes: determining that the hyperparameter of the neural network architecture distribution is the ANASOD code; the ANASOD code is a vector indicating the probability distribution of the various operators in a neural network architecture unit, and the mapping between the ANASOD encoding and neural network architecture units is one-to-many. In this way, the NAS problem can be approximated by the operator probability distribution, and the search space can be greatly compressed.
  • The determination of the hyperparameters of the neural network architecture distribution includes: using a search strategy to optimize the hyperparameters of the neural network architecture distribution, where the search strategy is Bayesian optimization and is used to sample, in the next iteration, a neural network unit whose performance index better meets the requirements than that of the current neural network architecture unit. In this way, each selection and evaluation covers the architecture distribution defined by the hyperparameters, so a larger part of the search space can be traversed, which greatly improves search efficiency.
  • Sampling a neural network architecture in the architecture distribution defined by the hyperparameters includes: determining the specific number of each operator in the constituent units of the neural network architecture according to the operator probability distribution defined by the ANASOD code; and connecting the different operators according to the set search space to obtain the neural network architecture.
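  • A possible sketch of this sampling step (an illustration under assumptions: a three-operator vocabulary A/B/C, as in the examples later in this description, and a largest-remainder rounding of the encoding; it is not the claimed implementation) converts an ANASOD code into concrete operator counts and lays the operators out in a cell:

```python
import numpy as np

OPERATORS = ["A", "B", "C"]          # assumed operator vocabulary

def sample_cell_from_encoding(encoding, n_ops=10, rng=None):
    """encoding: probability of each operator appearing in the cell (the
    ANASOD code). Returns one concrete cell, i.e. a shuffled operator list
    whose empirical distribution matches the encoding."""
    rng = rng or np.random.default_rng()
    raw = np.asarray(encoding, dtype=float) * n_ops
    counts = np.floor(raw).astype(int)
    # Largest-remainder rounding so the counts sum exactly to n_ops.
    counts[np.argsort(raw - counts)[::-1][: n_ops - counts.sum()]] += 1
    cell = [op for op, c in zip(OPERATORS, counts) for _ in range(c)]
    rng.shuffle(cell)                # random placement within the cell
    return cell

# Toy usage: the encoding [0.5, 0.3, 0.2] yields 5 A's, 3 B's and 2 C's.
print(sample_cell_from_encoding([0.5, 0.3, 0.2], rng=np.random.default_rng(1)))
```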
  • An effective architecture conforming to the hyperparameter definition can be obtained as a performance proxy for the distribution of all neural network architectures sharing this ANASOD encoding α.
  • The training and evaluation of the neural network architecture unit on the data set to obtain performance indicators includes: training the neural network architecture on the training data set; and evaluating the neural network architecture on the verification data set to obtain the performance indicators; the training set data and verification set data belong to the data set.
  • In this way, the performance index y can be used as the performance index of all neural network architecture units sharing this ANASOD code α, which effectively avoids the high cost of repeatedly evaluating the performance of similar architecture units.
  • The searching using a neural network architecture distribution search (distributional NAS) algorithm further includes: determining a search strategy for the neural network architecture distribution according to the performance index and the hyperparameters. In this way, the search strategy is adjusted to determine the search strategy for the next iteration of the neural network architecture distribution search, and a neural network unit whose performance index better meets the requirements than that of the current neural network architecture unit is sampled in the next iteration.
  • The search using the neural network architecture distribution search (distributional NAS) algorithm also includes: determining the performance prediction values of the hyperparameters of other unknown distributions according to the hyperparameters and performance indicators of the neural network architecture distribution obtained each time.
  • the performance prediction value includes a mean value and a variance; a performance prediction strategy for neural network architecture distribution is determined according to the mean value and variance, and the performance prediction strategy is used to predict performance indicators of the neural network architecture distribution.
  • In this way, the performance prediction strategy (α_t, y_t) can be updated according to the performance prediction value, so as to determine the next search strategy.
  • The determining of the neural network architecture distribution sharing the hyperparameters according to the performance index, and obtaining the candidate pool of the basic learner, includes: determining a search strategy for the neural network architecture distribution according to the performance index and the hyperparameters; determining a performance prediction strategy for the neural network architecture distribution according to the performance index and the neural network architecture unit; and searching, according to the search strategy and performance prediction strategy, in the distribution of neural network architectures sharing the hyperparameters to determine the candidate pool of base learners.
  • the optimal neural network architecture distribution can be obtained.
  • a high-quality architecture distribution can generate a high-quality neural network architecture with similar performance and provide a good basic learner candidate pool.
  • The determining of the distribution of neural network architectures sharing the hyperparameters according to the performance indicators, and obtaining the candidate pool of the basic learner, includes: outputting a plurality of neural network architectures sharing the hyperparameters according to the multiple neural network architectures and corresponding performance indicators in the search history; determining the distribution of neural network architectures that meets the requirements according to the plurality of neural network architectures sharing the hyperparameters; and generating multiple neural network architecture units according to the distribution that meets the requirements, to obtain the generative distribution/candidate pool of the base learner.
  • using the neural network architecture distribution search method to learn the candidate pool/architecture distribution is more efficient and greatly reduces the number and cost of evaluating a single network architecture.
  • the determining the proxy model includes: obtaining the proxy model by training on the data set according to the neural network architecture unit and the performance index.
  • The predicting of the test performance of the basic learners in the candidate pool through the proxy model, and the determining of k basic learners that meet the requirements of the task scenario to form an integrated model, include: predicting the test performance of multiple basic learners in the candidate pool through the proxy model; performing a local search according to the prediction results to determine q estimated vertex architectures, where an estimated vertex architecture is a neural network architecture whose performance indicator predicted by the proxy model on the verification set is higher than those of its adjacent architectures; and combining the k architectures among the q estimated vertex architectures whose performance indicators meet the requirements to obtain an integrated model.
  • the optimal combination can be selected from the candidate pool, the difficulty of the extremely complex permutation and combination problem is reduced, and a high-quality integrated model can be searched out only by evaluating the combination of basic learners a few times.
  • Combining the k architectures whose performance indicators meet the requirements among the q estimated vertex architectures includes: sorting the q estimated vertex architectures by performance indicator from best to worst, and combining the top k architectures.
  • the optimal combination can be selected from the candidate pool, the difficulty of the extremely complex permutation and combination problem is reduced, and a high-quality integrated model can be searched out only by evaluating the combination of basic learners a few times.
  • Combining the k architectures whose performance indicators meet the requirements among the q estimated vertex architectures includes: using a greedy selection algorithm to traverse the q estimated vertex architectures and selecting k architectures one by one to add into the ensemble model.
  • the optimal combination can be selected from the candidate pool, and the possibility and complexity of permutation and combination are greatly reduced (selecting k from all basic learners in the candidate pool is reduced to selecting k from q basic learners ), a high-quality ensemble model can be searched for only by evaluating the combination of basic learners a few times.
  • The embodiment of the present application provides a device for searching an integrated model of neural network architecture. The device includes: a data acquisition module for acquiring a data set, where the data set includes samples and labels in classification tasks; an architecture distribution search module for searching using the neural network architecture distribution search algorithm, including: determining the hyperparameters of a neural network architecture distribution; sampling a neural network architecture in the architecture distribution defined by the hyperparameters; training and evaluating the neural network architecture according to the samples and labels of the classification task to obtain performance indicators; determining, according to the performance indicators, the neural network architecture distribution sharing the hyperparameters to obtain the candidate pool of the basic learner, where the basic learner is a neural network architecture that meets the requirements of the architecture distribution and the neural network architecture is formed by repeated stacking of neural network architecture units; and determining the proxy model, where the proxy model is used to predict the test performance of unevaluated neural network architectures; and an architecture integration model combination module for predicting the test performance of the basic learners in the candidate pool through the proxy model, and determining k basic learners that meet the requirements to form an integrated model.
  • An embodiment of the present application provides an electronic device, including a processor and a memory; the processor is configured to execute computer-executable instructions stored in the memory, and by executing the computer-executable instructions the processor implements the method described in any one embodiment of the first aspect.
  • an embodiment of the present application provides a storage medium, including a readable storage medium and a computer program stored in the readable storage medium, and the computer program is used to implement the method described in any one embodiment of the first aspect.
  • Fig. 1 is the flowchart of the integrated model search method of the first scheme
  • FIG. 2 is a system architecture diagram provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of the application of the method of searching the neural network integrated model provided by the embodiment of the present application to the picture classification scene;
  • FIG. 4 is a schematic diagram of the application of the method for searching the neural network integrated model provided by the embodiment of the present application to the scene of target detection and recognition;
  • Fig. 5 is the flowchart of the search neural network integration model provided by the embodiment of the present application.
  • Figure 6 is a comparison curve of test errors obtained on CIFAR10 for various benchmarks including DistriNAS-PM provided by this application;
  • Figure 7 is a schematic diagram of 15 kinds of interference/noise methods randomly selected and added to the CIFAR10 and CIFAR100 verification set pictures;
  • Figure 8 is the effect diagram after random selection of interference/noise added to the CIFAR10 and CIFAR100 verification set pictures
  • Figure 9 is a schematic diagram of the OOD verification comparison between DistriNAS-PM provided by this application and other search methods in the NAS-Bench-201 space;
  • Fig. 10 is a schematic diagram of an electronic device.
  • The first scheme is shown in Figure 1: random search (NES-RS) or an evolutionary algorithm (NES-RE) is used to search for architectures suitable as basic learners, so as to build a sufficiently large basic learner candidate pool, and then the greedy selection algorithm (GSA) is used to traverse the basic learners in the candidate pool and select the members that make up the final ensemble one by one.
  • The random search algorithm takes the objective function and the base learner candidate pool size n_pool as input, randomly samples n_pool architectural units in the NAS search space, and performs complete training and performance evaluation for each architectural unit to obtain its indicators.
  • the output is a pool of base learner candidates.
  • The evolutionary algorithm takes the objective function and the base learner candidate pool size n_pool as input, randomly samples n_init architectural units in the NAS search space, and performs complete training and performance evaluation for each architectural unit to obtain its indicators.
  • The n_parent architectural units with the best performance indicators are used as parent units. The following is executed iteratively until the termination criterion is reached: randomly sample B architectural units from the parent units and randomly mutate them to obtain B descendant architectural units; perform complete training and performance evaluation on the B descendant architectural units; traverse the B descendant architectural units, select the architectural unit that can form the integrated model that maximizes the objective function, add it to the parent architectural unit pool, and remove the oldest parent architectural unit so that the size of the parent architectural unit pool remains unchanged at n_parent. The output is a pool of base learner candidates.
  • The greedy algorithm takes the basic learner candidate pool and the architecture ensemble size k as input; it initializes the architecture ensemble as the neural network architecture with the lowest test error in the candidate pool and removes this architecture from the candidate pool. While the ensemble size is less than k, the following is executed iteratively: traverse the remaining basic learners in the candidate pool, add them to the existing architecture ensemble one by one, and evaluate the performance of each new architecture ensemble; select the architecture that leads to the greatest performance improvement, add it to the existing architecture ensemble, and remove it from the candidate pool. The output is the final ensemble model.
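  • A compact sketch of the greedy selection described above (illustrative only; the scoring callback and data structures are assumptions rather than the original code):

```python
def greedy_select_ensemble(candidates, k, ensemble_error):
    """candidates: architecture identifiers in the candidate pool.
    ensemble_error(members) -> validation error of the ensemble formed by
    `members` (lower is better). Returns the k selected members."""
    pool = list(candidates)
    # Initialise with the single architecture of lowest error.
    best = min(pool, key=lambda a: ensemble_error([a]))
    ensemble = [best]
    pool.remove(best)
    while len(ensemble) < k and pool:
        # Try adding each remaining candidate and keep the best addition.
        best = min(pool, key=lambda a: ensemble_error(ensemble + [a]))
        ensemble.append(best)
        pool.remove(best)
    return ensemble

# Toy usage with a made-up error function over dummy architectures 0..5.
errors = {0: 0.30, 1: 0.25, 2: 0.28, 3: 0.27, 4: 0.26, 5: 0.29}
fake_error = lambda members: sum(errors[m] for m in members) / len(members)
print(greedy_select_ensemble(list(errors), k=3, ensemble_error=fake_error))
```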
  • This method adopts the traditional NAS thinking and regards each neural network architecture in the architecture search space as a single individual, so each neural network architecture/basic learner selected into the candidate pool needs to be fully trained and evaluated.
  • the architecture search of such methods generally requires a candidate pool containing hundreds of neural network architectures, which leads to extremely high evaluation costs in the construction of candidate pools.
  • GPU time is a common unit to measure the amount of computation of an algorithm. It is the time that a single GPU needs to run to complete a task, specifically expressed as GPU-days, GPU-seconds, etc.
  • The integrated model has the following characteristics: the performance of the integrated model depends on the average performance of the basic learners in the ensemble; under the premise of ensuring excellent average performance, the greater the difference between the basic learners, i.e., the greater the diversity of the ensemble, the better the performance of the integrated model; the differences among the predicted outputs of the vertex architectures in the neural network architecture search space are relatively large, so an ensemble model composed of vertex architectures has high diversity.
  • the apex architecture refers to an architecture whose test accuracy is higher than that of other architectures directly adjacent to it.
  • a method for searching the integrated neural network model uses neural network architecture distribution search (distributional NAS) to determine the architecture distribution or candidate pool for generating the basic learner.
  • the surrogate model is used to predict the optimal neural network vertex architecture in the candidate pool, and these vertex architectures are combined to obtain an integrated model.
  • Neural network architecture distribution search groups neural networks with similar architectures, or belonging to the same distribution, together for evaluation, thereby avoiding repeated evaluation of single neural network architectures and greatly improving search efficiency.
  • the basic learners generated from the finally found architecture distribution have relatively close and excellent average performance, thus meeting the requirement that the performance of the integrated model depends on the average performance of the basic learners in the integrated model.
  • A method for searching neural network integrated models provided by the embodiment of the present application adopts a more efficient basic learner candidate pool generation scheme and basic learner combination scheme, which greatly reduces the search cost of the integrated model under the premise of ensuring performance, making the cost of searching the integrated model in a larger search space controllable and acceptable, and suitable for more actual production scenarios.
  • a proxy model is usually a simple model that is used to simulate an overly complex or black-box practical problem, and can also be used to quickly predict the output of a black-box or overly complex problem.
  • Surrogate models can quickly and relatively accurately predict the test performance of neural network architectures, thereby avoiding extensive training and evaluation to obtain their true performance.
  • The differences among the predicted outputs of vertex architectures are relatively large, and an integrated model composed of vertex architectures has high diversity. Combining these vertex architectures can ensure the diversity of the integrated model and avoid the evaluation cost of generating ensemble models by trying many different combinations of architectures.
  • The embodiment of the present application provides a method for searching an integrated model of a neural network that reduces the number of evaluations while preserving the above-mentioned characteristics of the integrated model, so that users can efficiently search for high-performance integrated models.
  • FIG. 2 is a system architecture diagram provided by an embodiment of the present application.
  • a method for searching an integrated neural network model provided in an embodiment of the present application can be widely applied to various system architectures and scenarios that require the use of a convolutional neural network.
  • the data collection device 10 obtains the required data or samples through various channels, and provides the data to the computing device 11 .
  • The method for searching the neural network integrated model provided by the embodiment of the present application runs on the computing device 11 to search the neural network architecture and train the finally searched architecture, and deploys the trained integrated model to devices in various application scenarios, such as personal computers 12, servers 13 and mobile devices 14, etc.
  • FIG. 3 is a schematic diagram of a method for searching an integrated neural network model provided by an embodiment of the present application applied to a scene of image classification.
  • users often store a large number of pictures in user albums of smartphones or other multimedia storage devices, and classifying them according to the information in the pictures can help users manage and search.
  • The method for searching the neural network integrated model provided by the embodiment of the present application can quickly pre-search and train the most suitable convolutional neural network integrated model on similar image classification tasks and deploy it on the smartphone, replacing network models manually designed by human experts, achieving higher classification accuracy and improving user experience.
  • FIG. 4 is a schematic diagram of the application of the method for searching the neural network integrated model provided by the embodiment of the present application to the target detection and recognition scene.
  • the integrated model searched by using the method of searching the neural network integrated model provided by the embodiment of the present application can identify the object target on the original picture, and output the picture after the detected target is marked.
  • the detection and recognition of objects in images or videos is widely used in tasks such as smart cities and autonomous driving.
  • The method for searching the neural network integrated model provided by the embodiment of the present application can, for various goals and constraints, such as the hardware constraints of mobile devices, find the most appropriate neural network backbones for object detection and recognition tasks in various scenarios, which are deployed on related devices to improve recognition accuracy.
  • the method for searching the neural network integrated model provided in the embodiment of the present application is also applicable to the picture classification scenario of medical images.
  • Machine learning systems have been applied to medical imaging to help medical staff make diagnoses through imaging data.
  • The convolutional neural network integrated model searched by the method of searching the neural network integrated model provided by the embodiment of the application can not only efficiently and accurately classify and diagnose according to the characteristics of the image, but also, because the integrated model has good calibration of prediction uncertainty, output the confidence of its diagnosis, helping doctors to filter cases that need manual confirmation.
  • the method for searching the neural network integrated model provided in the embodiment of the present application may be deployed on computing nodes of related devices. Its data and codes can be stored in various common storage devices in various computers. For the execution of instructions and function modules, steps other than performance evaluation can generally be performed by a central processing unit (CPU). Performance evaluation, on the other hand, involves training the neural network architecture, typically performed by a graphics processing unit (GPU).
  • The integrated model obtained by the method of searching the neural network integrated model provided in the embodiment of the present application can be deployed on various computers and mobile computing devices after training, and applied to scenarios such as image classification and target detection and recognition.
  • The method for searching the neural network integrated model includes: obtaining a data set, where the data set includes samples and labels in the classification task; searching using the neural network architecture distribution search (distributional NAS) algorithm, including: determining the hyperparameters of the neural network architecture distribution; sampling a neural network architecture in the architecture distribution defined by the hyperparameters; training and evaluating the neural network architecture according to the samples and labels in the classification task, and obtaining performance indicators; determining, according to the performance indicators, the neural network architecture distribution sharing the hyperparameters, and obtaining the candidate pool of the basic learner, where the basic learner is a neural network architecture that meets the requirements of the architecture distribution and the neural network architecture is formed by repeated stacking of neural network architecture units; determining the surrogate model, which is used to predict the test performance of unevaluated neural network architectures; and predicting the test performance of the basic learners in the candidate pool through the proxy model, and determining k basic learners that meet the requirements of the classification task scenario to form an integrated model, where the size of the integrated model is k.
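  • To make the flow of these steps concrete, the following end-to-end skeleton (a rough sketch only; every component is a stand-in stub with random scores, not the disclosed algorithm) wires the two stages together: a distribution search that proposes encodings and evaluates one sampled architecture per encoding, followed by proxy-guided selection of k basic learners:

```python
import numpy as np

rng = np.random.default_rng(0)
N_OPS, K = 3, 3                                    # operator types, ensemble size

def propose_encoding(history):
    # Stub search strategy: a random point on the simplex (a real system
    # would use Bayesian optimisation over the evaluated history).
    return rng.dirichlet(np.ones(N_OPS))

def sample_architecture(encoding):
    # Stub sampler: operator counts drawn from the encoding.
    return tuple(rng.multinomial(10, encoding))

def train_and_evaluate(architecture):
    # Stub evaluation: a real system trains the sampled network and measures
    # validation performance; here a random score stands in for it.
    return rng.random()

# Stage 1: neural network architecture distribution search.
history = []
for t in range(20):
    enc = propose_encoding(history)
    arch = sample_architecture(enc)
    history.append((enc, arch, train_and_evaluate(arch)))

best_enc = max(history, key=lambda h: h[2])[0]      # best-performing distribution
candidate_pool = [sample_architecture(best_enc) for _ in range(50)]

# Stage 2: proxy-guided combination (a stub proxy scores the candidates).
proxy_score = {a: rng.random() for a in set(candidate_pool)}
ensemble = sorted(set(candidate_pool), key=proxy_score.get, reverse=True)[:K]
print(ensemble)
```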
  • The evaluation of a single neural network architecture, that is, a basic learner, includes training the neural network architecture from scratch on the training data set and, after training, evaluating test accuracy and other performance on the verification data set.
  • the performance of the entire integrated model is evaluated on the validation data set.
  • the number of evaluations is a common indicator to measure the computational cost of an algorithm.
  • FIG. 5 is a flowchart of a method for searching an integrated neural network model provided by an embodiment of the present application. Each step in the method for searching an integrated neural network model provided by the embodiment of the present application will be explained in detail below with reference to FIG. 5 .
  • the embodiment of the present application provides a method for searching for a neural network integrated model, and performs the following steps 1-5 to search for a neural network architecture distribution to obtain a candidate pool of basic learners.
  • Step 1 get the dataset.
  • the existing data and the data corresponding to the target task are obtained.
  • a certain number of pictures and their correct labels in the picture classification task can be obtained as samples (ground truth) from existing datasets or other manually labeled data. These data can be used as training data set and/or test set data.
  • Step 2 determine the architecture search space.
  • the shape of the candidate neural network architecture and the goal of neural network architecture search are determined.
  • A common search space is defined based on the neural architecture cell, including defining the number of operators in the searched neural network architecture unit, the types of operators that can be selected, the maximum number of connections between operators, and the number of times the neural network architecture unit is stacked in the final neural network architecture.
  • For example, a NAS search space can be defined as: the number of operators in a neural network architecture unit a is 10, there are 3 types of operators to choose from (A, B and C), the maximum number of connections between operators is i, and the neural network architecture unit is stacked j times.
  • the objective function is defined depending on the objective task.
  • a common objective function is to maximize the task accuracy.
  • the accuracy rate on the verification set in the image classification problem can be defined as the objective function.
  • the objective function may also include other restrictions and objectives.
  • For example, the objective function is defined to minimize the floating-point operations (FLOPs) of the neural network under the premise of maximizing the accuracy of image classification, so as to apply to mobile devices with limited computing power.
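  • One way such a constrained objective might be encoded in practice is sketched below (a hypothetical scalarisation; the penalty form and FLOPs budget are not taken from the disclosure):

```python
def constrained_objective(val_accuracy, flops, flops_budget=600e6, penalty=10.0):
    """Score an architecture: favour high validation accuracy while keeping the
    floating-point operation count under a budget. Architectures over budget
    are penalised in proportion to how far they exceed it."""
    overshoot = max(0.0, flops - flops_budget) / flops_budget
    return val_accuracy - penalty * overshoot

# Toy usage: a slightly less accurate but much cheaper model can win.
print(constrained_objective(0.94, 550e6))   # within budget -> 0.94
print(constrained_objective(0.95, 900e6))   # 50% over budget -> 0.95 - 5.0
```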
  • the search space of the neural network architecture distribution defines the hyperparameters of the architecture distribution while defining the NAS search space, so as to determine the distribution probability of each operator appearing in the neural network architecture unit.
  • The neural network architecture distribution search can use the approximate neural architecture search via operation distribution (ANASOD) algorithm, which learns the probability distribution of operators. While defining the NAS search space, the corresponding ANASOD code α is defined as the hyperparameter of the architecture distribution.
  • For example, the search space of ANASOD can be defined as: the code α corresponding to each neural network architecture unit a is a vector located in a k-dimensional simplex space, and each value in the vector is the probability of the corresponding operator appearing in the neural network architecture unit a.
  • For example, a conventional neural network architecture unit has 10 operators and there are three types of operators to choose from: A, B, and C. If A appears 5 times, B appears 3 times, and C appears 2 times in the neural network architecture unit, the code α corresponding to the neural network architecture unit is [0.5, 0.3, 0.2].
  • The mapping between the encoding α and neural network architecture units is one-to-many, and multiple similar neural network architecture units share the same encoding α, thus greatly compressing the search space.
  • For example, the following neural network architecture units share the same code α.
  • Neural Network Architecture Unit 1 A-A-A-A-A-B-B-B-C-C
  • Neural Network Architecture Unit 2 A-A-A-A-B-A-B-B-C-C
  • Neural Network Architecture Unit 3 A-A-A-B-B-A-A-B-C-C
  • Other similar neural network architecture units obtained by permutation and combination are not listed one by one.
  • The test accuracy of different neural network architecture units with the same ANASOD code α on the test set is often very similar, indicating that approximating the NAS problem through the operator distribution probability alone is fairly accurate.
  • The ANASOD code α is defined as the vector of distribution probabilities of the various operators in the neural network architecture unit, and the sum of the probability distribution of the various operators in the neural network architecture unit is 1.
  • The ANASOD encoding α is located in a vector space that is smaller and easier to optimize. Based on this, a series of ANASOD algorithms that approximate NAS algorithms can be used in the search space corresponding to the code α, which greatly reduces the search difficulty and improves the search efficiency while basically keeping the search accuracy unaffected. Because a low-dimensional code α can correspond to a group of multiple approximate neural network architecture units, the ANASOD algorithm can directly search for multiple approximate neural network architecture units and apply them to the integrated model.
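  • For concreteness, the following sketch (illustrative code, not the patented procedure; it uses the A/B/C operator vocabulary of the example above) computes the ANASOD code α of a cell and shows that the example units above map to the same vector:

```python
from collections import Counter

def anasod_encoding(cell, vocabulary=("A", "B", "C")):
    """Return the operator-frequency vector of a cell; its entries sum to 1."""
    counts = Counter(cell)
    n = len(cell)
    return tuple(round(counts[op] / n, 3) for op in vocabulary)

unit_2 = "A-A-A-A-B-A-B-B-C-C".split("-")
unit_3 = "A-A-A-B-B-A-A-B-C-C".split("-")
print(anasod_encoding(unit_2))   # (0.5, 0.3, 0.2)
print(anasod_encoding(unit_3))   # (0.5, 0.3, 0.2) -- same code, one-to-many mapping
```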
  • the neural network architecture integration model search method provided in the embodiment of the present application may also use other neural network architecture distribution search methods.
  • step 3 a search strategy is used to recommend the hyperparameters of the new neural network architecture distribution, so as to determine the hyperparameters of the neural network architecture distribution.
  • the search strategy can be Bayesian optimization, and Bayesian optimization is used to recommend hyperparameters for new architecture distributions.
  • Bayesian optimization can be used to recommend a new ANASOD code α; an evolutionary algorithm can also be used to recommend the hyperparameters of a new neural network architecture distribution.
  • The search strategy can be expressed as: search strategy (α, y), where α is the ANASOD encoding/hyperparameter recommended by the Bayesian optimization model and y is the predicted performance index of the neural network architecture distribution. Its meaning can be interpreted as follows: search the neural network architecture distribution covered by the ANASOD code α, whose predicted performance index is y.
  • the performance prediction model used to predict the performance index of the neural network architecture distribution can be Gaussian process (gaussian process), Bayesian neural network (Bayesian neural network), random forest (random forest), etc.
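  • As an illustration of such a search strategy (a simplified sketch under assumptions: scikit-learn's GaussianProcessRegressor as the performance prediction model and an upper-confidence-bound acquisition over randomly drawn candidate encodings, neither of which is prescribed by the disclosure), the next encoding can be recommended from the evaluated (α, y) history as follows:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def recommend_next_encoding(history, n_candidates=500, beta=1.0, rng=None):
    """history: list of (encoding, performance) pairs already evaluated.
    Fits a GP to the history, predicts mean m and standard deviation v for
    random candidate encodings, and returns the one maximising m + beta * v."""
    rng = rng or np.random.default_rng()
    X = np.array([h[0] for h in history])
    y = np.array([h[1] for h in history])
    gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)
    candidates = rng.dirichlet(np.ones(X.shape[1]), size=n_candidates)
    mean, std = gp.predict(candidates, return_std=True)
    return candidates[np.argmax(mean + beta * std)]

# Toy usage: pretend distributions heavy in the first operator validate better.
rng = np.random.default_rng(0)
hist = [(e, e[0] + 0.01 * rng.normal()) for e in rng.dirichlet(np.ones(3), size=8)]
print(recommend_next_encoding(hist, rng=rng).round(2))   # leans towards operator 0
```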
  • the integrated model search method provided by the embodiment of the present application uses a search strategy to recommend a new neural network architecture distribution.
  • This recommendation of hyperparameters is completely different from the first scheme: the first scheme uses the NAS method to search for candidate architectures that are suitable as basic learners of the integrated model, and its search strategy directly recommends a new neural network architecture unit.
  • each selection and evaluation is the architecture distribution under the coverage of the hyperparameter definition, so a larger part of the search space can be traversed, which greatly improves the search efficiency.
  • Step 4 Randomly sample a neural network architecture in the architecture distribution defined by the hyperparameters, perform performance evaluation, and obtain performance indicators.
  • step 4 includes:
  • step 41 a neural network architecture is randomly sampled in the distribution defined by the hyperparameters.
  • According to the operator probability distribution defined by the ANASOD code α, the specific number of each operator in the neural network architecture unit is determined; according to the restrictions of the search space, the different operators are connected randomly to obtain the neural network architecture unit; after the neural network architecture unit is determined, it is stacked several times according to the definition of the search space to obtain the neural network architecture a. After confirming that the distribution probability of operators in the neural network architecture a conforms to the ANASOD code α, the neural network architecture a is used as a performance proxy for the distribution of all neural network architectures that share this ANASOD code α.
  • Step 42 train and evaluate the neural network architecture a on the data set to obtain performance indicators.
  • The neural network architecture a can be trained on the training data set obtained in step 1 according to a conventional neural network optimization method using the samples and labels of the classification task, and evaluated on the validation set according to the objective function defined in step 2 to obtain its performance metric y.
  • This process can be expressed as performance evaluation (y, a).
  • the training set data and the validation set data belong to the same distribution and belong to the same data set.
  • The performance evaluation of the neural network architecture distribution can be performed according to the performance evaluation (y, a), and its performance index y is used as the performance index of all architecture distributions that share this hyperparameter α.
  • Step 5 update the search strategy according to the performance index y; output the current optimal neural network architecture distribution according to the search history while updating the search strategy; update the neural network architecture performance proxy model.
  • the current optimal neural network architecture distribution is the neural network architecture distribution that meets the requirements.
  • In this way, a search strategy for the neural network architecture distribution can be determined according to the performance indicators and hyperparameters, so that neural network architecture units with higher performance indicators can be searched for in the next iteration. This includes the following steps:
  • Step 51 update the search strategy (α_t, y_t) according to the hyperparameters and performance indicators, and determine the search strategy.
  • t is the number of iterations.
  • The search strategy is adjusted according to the hyperparameter α_t and the performance index y_t of the neural network architecture distribution obtained in each search, and the search strategy for the neural network architecture distribution search in the next iteration is determined.
  • Step 52 Determine the performance prediction strategy for the distribution of the neural network architecture according to the performance index and the hyperparameters of the distribution of the neural network architecture.
  • The performance prediction strategy (α_t, y_t) can be updated according to the hyperparameter α_t and the performance index y_t of the neural network architecture distribution found each time, so as to determine the performance prediction strategy of the neural network architecture distribution.
  • The hyperparameter α_t of the neural network architecture distribution obtained by each search and its performance index, i.e. (α_t, y_t), are input into the performance prediction model, and the performance prediction value of any other unknown distribution hyperparameter can be output, including a mean value m and a variance v:
  • (m, v) = performance prediction model_t(α* | {(α_i, y_i), i = 1, 2, ..., t}), where α* is an unknown hyperparameter value and {(α_i, y_i)} refers to the history of hyperparameters that have been searched and evaluated.
  • Step 53 updating the performance prediction strategy according to the performance prediction value, so as to determine the next evaluation target (α_(t+1), y_(t+1)).
  • Step 54 outputting the current optimal neural network architecture distribution according to the search history.
  • the distribution of neural network architectures that meet requirements is determined according to the distribution of multiple neural network architectures in the historical search and the corresponding performance indicators.
  • For example, the search history has three architecture distribution hyperparameters and their performance: (α_1, y_1), (α_2, y_2), (α_3, y_3), where the performance index y_2 of (α_2, y_2) is optimal; then the architecture distribution sharing the hyperparameter α_2 is the neural network architecture distribution that meets the requirements.
  • Step 55 Generate a plurality of neural network architecture units according to the required neural network architecture distribution, and obtain the generation distribution/candidate pool of the basic learner.
  • many specific neural network architectures can be randomly generated according to the optimal neural network architecture distribution output by S54, and the generation distribution/candidate pool of the basic learner can be obtained.
  • the optimal neural network architecture distribution is the neural network architecture distribution that meets the requirements.
  • Step 56 determine the proxy model, the proxy model is used to predict the test performance of the unevaluated neural network architecture, to help predict and assist in the search for the optimal vertex architecture, so as to quickly generate a high-quality integrated model.
  • the agent model can be trained and updated according to the neural network architecture units and performance indicators evaluated during the t times of searching.
  • The proxy model can be a Gaussian process with WL graph kernel (GPWL) based on the Weisfeiler-Lehman graph kernel.
  • The prediction can be expressed as y = GPWL model(a | {(a_i, y_i), i = 1, 2, ..., t}), where a is the unknown neural network architecture, y is the architecture performance predicted by the proxy model, and {(a_i, y_i)} refers to the history of neural network architectures that have been searched and evaluated.
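  • The following much-simplified sketch (assumptions throughout: the cell is treated as a labelled path graph, a hand-rolled one-iteration Weisfeiler-Lehman label count is used as the feature, and a plain kernel posterior mean stands in for the full GP; it is not the GPWL model of the disclosure) illustrates the idea of predicting y for an unseen architecture a from the evaluated history {(a_i, y_i)}:

```python
import numpy as np
from collections import Counter

def wl_features(ops, iterations=1):
    """Treat a cell as a labelled path graph over its operators and count
    Weisfeiler-Lehman labels after a few relabelling iterations."""
    labels = list(ops)
    counts = Counter(labels)
    for _ in range(iterations):
        new = []
        for i, lab in enumerate(labels):
            neigh = sorted(labels[max(0, i - 1):i] + labels[i + 1:i + 2])
            new.append(lab + "|" + ",".join(neigh))
        labels = new
        counts.update(labels)
    return counts

def wl_kernel(c1, c2):
    # Inner product of the WL label count vectors of two architectures.
    return float(sum(c1[k] * c2[k] for k in set(c1) & set(c2)))

def predict_performance(query, archs, ys, noise=1e-3):
    """Posterior-mean style prediction: y* = k*^T (K + noise I)^-1 y."""
    feats = [wl_features(a) for a in archs]
    K = np.array([[wl_kernel(f, g) for g in feats] for f in feats])
    k_star = np.array([wl_kernel(wl_features(query), f) for f in feats])
    alpha = np.linalg.solve(K + noise * np.eye(len(archs)), np.array(ys, dtype=float))
    return float(k_star @ alpha)

# Toy usage: made-up history rewarding cells with more "A" operators.
history_archs = [list("AAABBCC"), list("AABBBCC"), list("AAAABCC")]
history_ys = [0.91, 0.88, 0.93]
print(round(predict_performance(list("AAAABBC"), history_archs, history_ys), 3))
```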
  • The method proposed in the embodiment of this application uses the neural network architecture distribution search method to learn the candidate pool/architecture distribution, which is more efficient and greatly reduces the number and cost of evaluating individual network architectures.
  • the ensemble model search in the first scheme mainly uses a greedy algorithm to select the basic learners in the combination one by one, but this method requires that the number of ensemble models evaluated is proportional to the size of the candidate pool and the size of the final ensemble model.
  • the method for searching the neural network integrated model in the embodiment of the present application performs the following step 6 to perform architecture sampling and search for architecture integration, and a high-quality integrated model can be searched out only by evaluating the combination of basic learners a few times (or even once). Step 6 will be described in detail below.
  • Step 6 Predict the test performance of the basic learners in the candidate pool through the proxy model, and determine k basic learners that meet the requirements of the classification task scenario to form an integrated model, and the size of the integrated model is k.
  • the generation distribution/candidate pool of the basic learner it is necessary to search for the most suitable k basic learners to form the integrated model, and the size of the integrated model is k.
  • step 6 obtains the integrated model through the following steps S61-S63
  • Step 61 use the surrogate model determined in Step 5 to directly predict the performance of other unevaluated architectures to avoid huge evaluation costs.
  • random sampling may be performed from the optimal architecture distribution, and then a local search (local search) may be performed based on predicted performance indicators starting from multiple sampled architectures.
  • Step 62 Determine q estimated vertex architectures according to the predicted performance indicators output by the proxy model.
  • the estimated vertex architectures are neural network architectures whose performance indicators predicted by the proxy model on the verification set are higher than those of adjacent architectures.
  • the adjacent architecture refers to the network architecture in which the arrangement of operators differs by only one bit.
  • As an example of adjacent architectures, consider three neural network architectures: architecture 1: A-A-B-C, architecture 2: A-B-B-C, architecture 3: A-A-C-C; architecture 1 and architecture 2 are adjacent architectures, and architecture 1 and architecture 3 are adjacent architectures.
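  • A small sketch of this neighbourhood notion and of the local search over it (illustrative; the proxy scoring function and the fixed A/B/C operator set are assumptions):

```python
def neighbours(arch, vocabulary=("A", "B", "C")):
    """All architectures whose operator sequence differs in exactly one position."""
    out = []
    for i, op in enumerate(arch):
        for alt in vocabulary:
            if alt != op:
                out.append(arch[:i] + (alt,) + arch[i + 1:])
    return out

def is_vertex(arch, proxy_score):
    # A vertex architecture scores at least as well as every adjacent architecture
    # according to the proxy model's prediction.
    return all(proxy_score(arch) >= proxy_score(n) for n in neighbours(arch))

def local_search(start, proxy_score):
    """Hill-climb from `start` until an estimated vertex architecture is reached."""
    current = start
    while True:
        best = max(neighbours(current), key=proxy_score)
        if proxy_score(best) <= proxy_score(current):
            return current          # no neighbour is better: a vertex
        current = best

# Toy usage: a proxy that simply rewards the number of "A" operators.
score = lambda a: a.count("A")
peak = local_search(("A", "A", "B", "C"), score)
print(peak, is_vertex(peak, score))   # ('A', 'A', 'A', 'A') True
```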
  • Step 63 Combine the k best performance indicators among the q estimated vertex architectures to obtain an integrated model.
  • the performance indicators of the q estimated vertex architectures can be sorted from best to worst, and k architectures whose performance indicators are at the top are combined to obtain an integrated model.
  • k of the q estimated vertex architectures whose performance indicators meet the requirements of the classification task can be combined to obtain an integrated model.
  • Classification task requirements can be the highest accuracy, the lowest error rate or the smallest loss function, etc.
  • A greedy algorithm can also be used to select k architectures one by one from the q estimated vertex architectures and combine them to obtain the final integrated model.
  • step S7 is executed to output the integrated model.
  • the maximum number of searches or the maximum search time may be defined as the termination criterion, and if the termination criterion has not been reached, the iterations from step 3 to step 6 are continued. After the termination criterion is reached, the iteration of the algorithm in the embodiment of the present application terminates.
  • the downstream can directly apply the output integrated model to the various scenarios mentioned above.
  • In Embodiment 1, the purpose is to search for a high-performance neural network ensemble model, and the training set data and the verification set data are set to belong to the same distribution.
  • the specific steps of the method for searching a neural network integrated model provided by the present application in the picture classification task scenario of Embodiment 1 are described in detail below.
  • data can be obtained from a common image classification dataset.
  • the neural network architecture can be searched by using the pictures in the training set and their manually marked labels, and verified on the verification set.
  • the total number of neural network architecture units included in the search space is 15,625, and after removing the isomorphic architecture units, there are a total of 6466 unique neural network architecture units.
  • the objective function of this application is to output the classification error rate of the integrated model on the CIFAR-10 validation set.
  • the integrated model expected to be found is an integrated model with a low classification error rate on the validation set.
  • the objective function of this application is also to output the classification error rate of the integrated model on the CIFAR-10 verification set.
  • the neural network integration model is searched based on ANASOD and Gaussian process.
  • the specific steps are as follows:
  • First, the ANASOD architecture distribution search is initialized with the hyperparameters of the initial architecture distribution, also known as the ANASOD encoding. The following steps are iterated until the termination criterion is reached:
  • The ANASOD neural network architecture distribution search method is used in the first stage of the architecture ensemble search (steps S703-S705), and the Gaussian process based on the Weisfeiler-Lehman graph kernel (gaussian process with WL graph kernel, GPWL) is used as the proxy model in the second stage (the proxy model updated in step S705) to assist the prediction of vertex models and the search for the final ensemble (step S706).
  • the method for searching the neural network integrated model provided by the embodiment of the present application is compared with the existing architecture integrated search benchmarks NES-RS and NES-RE.
  • Table 1 is the comparison data between the method of searching the neural network integrated model (referred to as DistriNAS-PM) provided by the embodiment of the present application and other methods on the NAS-Bench-201 search space. The result is the average validation set error rate (%) over 10 trials ( ⁇ 1 standard error).
  • the method for searching the neural network integrated model provided by the embodiment of the present application is recorded as DistriNAS-PM.
  • Other methods include NES-RS, NES-RE and Deep Ensemble, which are compared and evaluated on the datasets CIFAR10, CIFAR100 and ImageNet16-120 respectively, in terms of the number of network architectures evaluated, test error, and confidence.
  • the number N of network architectures evaluated by the method DistriNAS-PM proposed in the embodiment of the present application is 30.
  • The method DistriNAS-PM proposed in the embodiment of the present application needs less than 1/3 of the search cost to find an integrated model with a comparable or even lower test error.
  • the architecture integration found in the embodiment of this application not only has a low test error, but also the calibration degree and confidence of the model are comparable to the optimal integration model found by NES-RE.
  • The confidence is measured by negative log likelihood (NLL); the lower the NLL, the higher the degree of model calibration.
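  • For reference, the negative log likelihood used here as the confidence measure can be computed as in this short sketch (the standard formula; array names are assumptions):

```python
import numpy as np

def negative_log_likelihood(probs, labels, eps=1e-12):
    """probs: (n_samples, n_classes) predicted probabilities;
    labels: (n_samples,) integer class labels.
    Returns the mean NLL; lower values indicate better calibration."""
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0)
    picked = probs[np.arange(len(labels)), np.asarray(labels)]
    return float(-np.log(picked).mean())

# Toy usage: confident-and-correct beats confident-and-wrong.
p = np.array([[0.9, 0.1], [0.2, 0.8]])
print(negative_log_likelihood(p, [0, 1]))   # ~0.164
print(negative_log_likelihood(p, [1, 0]))   # ~1.956
```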
  • In addition, each of the basic learners in the architecture ensemble found in the embodiment of this application is trained with k different initial training weights to evaluate the effect of its corresponding deep ensemble.
  • the results in Table 1 once again prove that the integrated model composed of different architectures is better than the integrated model composed of different training initial weights (initialization).
  • Fig. 6 is a comparison curve of test errors obtained on CIFAR10 for various benchmarks including DistriNAS-PM provided by this application.
  • the test errors achieved by each benchmark on CIFAR10 are compared as the search progresses and the number of architecture evaluations gradually increases.
  • the comparison covers the ensemble models found by different methods in the NAS-Bench-201 search space (DistriNAS-PM, NES-RS and NES-RE), the deep ensemble (DeepEnsemble) corresponding to each architecture in the ensemble found by DistriNAS-PM, and the optimal base learner; the metric is the validation-set error rate (%) on the CIFAR-10 data set, the lower the better.
  • the x-axis of Figure 6 is the number of architecture evaluations.
  • as Figure 6 shows, the architecture-ensemble search method (DistriNAS-PM) proposed by the embodiment of the present application finds architecture ensembles with lower validation error faster than NES-RS and NES-RE, greatly reducing the search cost; comparing architecture ensembles with deep ensembles, the architecture ensembles achieve lower validation error; and all ensemble models significantly outperform the optimal single neural network architecture/base learner.
  • Table 2 shows the comparison data of the method (DistriNAS-PM) proposed in the embodiment of this application and NES-RS using different search costs on the CIFAR10 task in the DARTS search space.
  • the result is the average validation set error rate (%) over 3 trials.
  • the embodiment of this application can use less time and computational cost to find a neural network ensemble model with similar performance, which not only speeds up the search for ensemble models but also improves the accuracy of architecture-ensemble search.
  • the embodiment of the present application can search for a more accurate neural network architecture integration with the same time and cost. Applied in actual production, using the embodiment of the present application can search for a convolutional neural network architecture integration with higher accuracy and apply it to image classification tasks in a shorter time and with less calculation.
  • this beneficial effect arises mainly because the embodiment of the present application uses architecture-distribution search to find the base-learner candidate pool (first stage) and uses a combination of a proxy model and vertex architectures to find a high-quality ensemble model (second stage), greatly reducing the need to fully train and evaluate single neural network architectures (single base learners).
  • in Example 2, the robustness of the embodiment of the present application to OOD data is tested through an image-classification application.
  • OOD data is very common in many real-world applications, such as autonomous driving, medical image diagnosis, etc.
  • the embodiment of this application searches for a high-performance neural network integration model on common search spaces and datasets, where the common search spaces include DARTS and NAS-Bench-201, and the datasets include CIFAR10 and CIFAR100;
  • validation samples perturbed to different degrees by various kinds of noise are used to evaluate the test error, calibration and rejection ability, on OOD data, of the architecture ensembles found by the different methods.
  • Validation samples perturbed by various noises include CIFAR10-C, CIFAR100-C.
  • CIFAR10-C and CIFAR100-C are generated by adding one of 15 randomly chosen perturbation/noise types to the validation-set images of CIFAR10 and CIFAR100, respectively.
  • the perturbation/noise intensity (shift severity) has 5 levels from low to high; the higher the level, the larger the shift of the image-generating distribution, i.e. the larger the distribution gap between the perturbed image and the original image.
  • Figure 7 shows the 15 perturbation/noise types that may be randomly added to the CIFAR10 and CIFAR100 validation-set images.
  • Figure 8 shows the effect of the randomly selected perturbations/noise on the CIFAR10 and CIFAR100 validation-set images: the higher the perturbation/noise level, the larger the (distribution) difference between the perturbed image and the original image.
  • the method DistriNAS-PM proposed in the embodiment of this application is compared with the existing architecture integration search methods NES-RS and NES-RE on the OOD task of CIFAR100-C.
  • the architecture ensembles of NES-RS and NES-RE are found after evaluating 100 neural network architectures, while the DistriNAS-PM proposed in the embodiment of this application only needs to evaluate 30 neural network architectures.
  • Figure 9 is a schematic comparison of the OOD validation results of DistriNAS-PM provided by this application and other search methods in the NAS-Bench-201 space; it shows that, compared with NES-RS and NES-RE, DistriNAS-PM of the embodiment of this application finds, with less than 1/3 of the cost, an architecture ensemble with lower error and better model calibration (lower NLL) on the OOD validation set.
  • under perturbed data, the architecture ensemble found by DistriNAS-PM is even slightly better than those found by NES-RS and NES-RE in terms of test error and model calibration (NLL), which shows that the search method proposed in the embodiment of this application can not only find a good ensemble faster, but also find an ensemble that is more robust to OOD data.
  • Figure 9 also shows the performance of the optimal base learner (bright-coloured module) within each architecture ensemble found by the different methods; the test error and NLL achieved by all ensemble models on OOD data are clearly lower than those of the optimal single neural network architecture/base learner, which shows that the ensemble model described in this application has better rejection ability and robustness when the data are perturbed or the test-data distribution differs from the training-data distribution.
  • Table 3 is the comparison data of the method (DistriNAS-PM) proposed in the embodiment of this application and the NES-RS using different search costs on the CIFAR10-C task in the DARTS search space.
  • the result is the average validation set error rate (%) over 3 trials, as shown in Table 3.
  • the method for searching the neural network ensemble model provided by the embodiment of the present application can find a neural network ensemble model with similar performance in less time and with less computational cost; it speeds up the search for ensemble models and also improves the robustness of architecture-ensemble search to OOD data, obtaining lower validation error, better model calibration and more accurate uncertainty estimates.
  • using the embodiments of the present application can search for high-quality architecture integration suitable for high-risk or high-uncertainty usage scenarios in a shorter time and with a smaller amount of calculation.
  • the embodiment of the present application provides a method for efficiently searching a multi-neural-network ensemble model, based on an efficient two-stage search framework better suited to ensemble-model search: it uses neural-architecture distribution search instead of traditional NAS to quickly find a candidate pool of base learners, avoiding repeated evaluation of similar network architectures and improving search efficiency; and it uses a proxy model to quickly select optimal and diverse vertex models from the candidate pool by predicted (rather than actual) performance to form the target ensemble model (a code sketch of this two-stage loop follows this list).
  • the method in the embodiment of the present application can search the integrated model more efficiently, thereby greatly reducing the search cost, and greatly enhancing the feasibility of the integrated model search in more application scenarios.
  • the original method relies on traditional NAS and needs to search and evaluate the basic learners one by one to build a large enough candidate pool.
  • this method uses architecture distribution search to quickly build a candidate pool for basic learners, simplifying the search space and reducing the difficulty of searching.
  • the embodiment of the present application uses the proxy model to search for vertex models in the candidate pool, which is efficient while ensuring the diversity of the vertex models, so that the ensemble model achieves a large performance improvement over a single base learner.
  • the embodiments of the present application can be combined with various types of architecture-distribution search, are applicable to different search spaces, have strong generality, and can be applied to different scenarios and tasks.
  • besides convolutional-network architecture search, the embodiments can potentially be applied to other architecture-search tasks and to tasks where ensembling brings further gains, such as recurrent neural networks (RNN) commonly used in natural language processing and deep transformer self-attention networks commonly used in language and vision tasks, to obtain better-calibrated uncertainty estimates.
  • An embodiment of the present application provides a device for searching a neural-network-architecture ensemble model.
  • the device includes: a data acquisition module for acquiring a data set, the data set including samples and labels of a classification task; an architecture-distribution search module for searching with a neural-network architecture-distribution search algorithm, including: determining hyperparameters of the neural-network architecture distribution; sampling a neural network architecture from the architecture distribution defined by the hyperparameters; training and evaluating the neural network architecture according to the samples and labels of the classification task to obtain a performance metric; determining, according to the performance metric, the neural-network architecture distribution sharing the hyperparameters to obtain a candidate pool of base learners, a base learner being a neural network architecture that conforms to the architecture distribution, the neural network architecture being formed by repeatedly stacking neural-network architecture cells; and determining a proxy model used to predict the test performance of unevaluated neural network architectures; and an architecture-ensemble combination module for predicting, through the proxy model, the test performance of the base learners in the candidate pool and determining k base learners that meet the requirements of the classification task to form an ensemble model of size k.
  • An embodiment of the present application provides an electronic device 1000, as shown in FIG. 10, including a processor 1001 and a memory 1002; the processor 1001 is configured to execute the computer-executable instructions stored in the memory 1002, and by running these instructions the processor 1001 performs the method for searching a neural network structure based on evolutionary learning described in any of the above embodiments.
  • An embodiment of the present application provides a storage medium, including a readable storage medium and a computer program stored in the readable storage medium, the computer program being used to implement the method for searching a neural network structure based on evolutionary learning described in any of the above embodiments.
  • computer-readable media may include, but are not limited to: magnetic storage devices (e.g., hard disks, floppy disks, or tapes, etc.), optical disks (e.g., compact discs (compact discs, CDs), digital versatile discs (digital versatile discs, DVDs), etc.), smart cards and flash memory devices (for example, erasable programmable read-only memory (EPROM), card, stick or key drive, etc.).
  • various storage media described herein can represent one or more devices and/or other machine-readable media for storing information.
  • the term "machine-readable medium” may include, but is not limited to, wireless channels and various other media capable of storing, containing and/or carrying instructions and/or data.
  • the sequence numbers of the above processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and does not constitute any limitation on the implementation of the embodiments of this application.
  • the disclosed systems, devices and methods may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division into units is only a division by logical function; in actual implementation there may be other ways of division, for example multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • if the functions described above are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium.
  • the technical solution of the embodiments of the present application, in essence, or the part that contributes to the prior art, or part of the technical solution, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, an access-network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage media include: USB flash drives, removable hard disks, read-only memory (ROM), random access memory (RAM), magnetic disks, optical discs and other media that can store program code.
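For readers who want a concrete picture of the two-stage loop summarized in the bullets above (steps S703 to S706), the following Python sketch ties the pieces together. It is only a toy: the operator set, the synthetic scoring function and the random search strategy are stand-ins invented here, while the described method uses Bayesian optimisation over ANASOD encodings and a GPWL proxy model with local search, as detailed in the description below.

```python
import random

OPS = ["conv3x3", "avgpool3x3", "conv1x1", "skip", "zero"]   # assumed operator set
N_SLOTS = 6                                                  # operator slots per cell

def sample_architecture(theta):
    """Sample one concrete cell from the architecture distribution defined by encoding theta."""
    return tuple(random.choices(OPS, weights=theta, k=N_SLOTS))

def train_and_eval(arch):
    """Stand-in for full training and validation; returns a synthetic score."""
    return sum(op != "zero" for op in arch) / N_SLOTS + random.gauss(0.0, 0.02)

def propose_encoding(history):
    """Toy search strategy: perturb the best encoding seen so far
    (the described method uses Bayesian optimisation here)."""
    if not history:
        theta = [random.random() + 1e-3 for _ in OPS]
    else:
        best = max(history, key=lambda h: h[2])[0]
        theta = [max(1e-3, t + random.gauss(0.0, 0.1)) for t in best]
    total = sum(theta)
    return [t / total for t in theta]

def two_stage_search(k=3, n_iters=30, q=10):
    history = []                                      # (encoding, architecture, score)
    for _ in range(n_iters):                          # stage 1: distribution search (S703-S705)
        theta = propose_encoding(history)
        arch = sample_architecture(theta)
        history.append((theta, arch, train_and_eval(arch)))
    best_theta = max(history, key=lambda h: h[2])[0]  # stage 2: sample a pool from the best
    pool = {sample_architecture(best_theta) for _ in range(q)}   # distribution and keep the
    ranked = sorted(pool, key=train_and_eval, reverse=True)      # top-k (proxy model + local
    return ranked[:k]                                            # search in the real method, S706)

print(two_stage_search())
```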

Abstract

一种搜索神经网络架构集成模型的方法,包括:获取数据集,数据集包括分类任务中的样本和标注;使用神经网络架构分布搜索算法进行搜索,包括:确定神经网络架构分布的超参;在超参定义的架构分布中采样一个有效的(valid)神经网络架构;对神经网络架构在数据集上训练和评估,得到性能指标;根据性能指标确定共享超参的神经网络架构分布,获得基础学习器的候选池;基础学习器为符合所述架构分布要求的神经网络架构;确定代理模型;代理模型用于预测未评估的神经网络架构的测试性能;通过代理模型预测所述候选池中基础学习器的测试性能,确定符合任务场景要求的k个多元化(diverse)基础学习器组成集成模型,所述集成模型的大小为k。

Description

搜索神经网络集成模型的方法、装置和电子设备
本申请要求于2021年11月22日提交中国专利局、申请号为202111387843.8、申请名称为“搜索神经网络集成模型的方法、装置和电子设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请实施例涉及机器学习领域,尤其涉及一种搜索神经网络集成模型的方法、装置和电子设备。
背景技术
基于深度神经网络的模型在图像识别、语音识别和机器翻译等各种任务上取得了显著的进展。然而,单个深度模型的预测概率(softmax概率)一般校准误差较大,置信度较低。尤其是在面对测试数据分布和训练数据分布不一致时(OOD数据),单个深度神经网络的拒识能力较弱,不能反映出准确的不确定性(uncertainty),容易对错误的预测表现过度自信。这些问题大大限制了深度神经网络在高风险,高不确定性或涉及OOD数据的现实应用中的可靠性和鲁棒性。
其中,针对多分类问题,OOD数据指来自与训练样本数据分布不同的测试样本。该不同可能是由于数据生成环境不同,或者是样本受到损坏或者扰动。以自动驾驶为例,如果训练数据采集来自于晴天而测试数据来自于雨雪天,或者训练数据来自于郊区而测试数据来自于城市,那么测试数据相比于训练数据就是OOD(out-of-distribution)。针对该类数据,机器学习模型要具有拒识的能力。
基础学习器(base learner)是指集成模型组合中的单个模型。
集成模型(ensemble model)是通过将多个基础学习器的预测组合起来进行最终预测,形成一个更好的模型/预测。集成模型不但能达到更高的测试精度,而且还能拥有更好校准的预测概率,尤其是针对OOD数据能表现出更准确量化的不确定性和更高的鲁棒性。在深度神经网络模型上,集成模型也同样具有这些优势。
例如,深度集成模型(deep ensembles)通过将多个拥有相同网络架构但是不同的权重训练初始值(initialisation)的神经网络进行组合,将最终的预测输出(output logits)进行平均,能有效地提高测试精度和模型校准(model calibration)精度。
模型校准指让模型对事件结果的预测概率和事件的真实经验概率保持一致。例如在一个二分类的任务里,如果我们取出100张模型预测概率为0.7的图片,其中确实有70张图的真实标签为1,则说明模型的预测概率和真实经验概率是一致的。换而言之,模型的预测很准确可靠。在实际应用,尤其是高风险的应用中,机器学习模型的预测概率往往会被用于用户的判断或者决策制定,所以其预测的置信度非常重要。
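As an illustration of the calibration and confidence metrics used later in the text (NLL in particular), the following sketch computes the negative log likelihood and a single-bin calibration check on synthetic predictions mirroring the 0.7-probability example above. The two-class setting and all numbers are invented for illustration only.

```python
import numpy as np

def negative_log_likelihood(probs, labels):
    """Mean NLL of predicted class probabilities; lower NLL indicates better calibration/confidence."""
    eps = 1e-12
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + eps))

def bin_calibration(confidences, correct, lo=0.65, hi=0.75):
    """Empirical accuracy of predictions whose confidence falls in [lo, hi)."""
    mask = (confidences >= lo) & (confidences < hi)
    return correct[mask].mean() if mask.any() else float("nan")

# toy example mirroring the text: if about 70% of the images predicted with
# probability ~0.7 are actually correct, the model is well calibrated.
rng = np.random.default_rng(0)
probs = np.full((100, 2), [0.3, 0.7])          # predicted probability 0.7 for class 1
labels = (rng.random(100) < 0.7).astype(int)   # roughly 70 of 100 truly belong to class 1
print(negative_log_likelihood(probs, labels))
print(bin_calibration(probs[:, 1], labels == 1))
```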
集成模型的表现/性能往往取决于其组合中基础学习器的多样性:基础模型间的差异越大,集成效果往往更好,因此很多集成方法试图促进基础学习器间的多样性。如图1所示,深度集成模型通过改变基础学习器的训练初始权重来增加多样性;超参集成模型(hyper-deep ensemble)在改变初始权值的基础上又改变了训练的超参数,以此来进一步提高多样性。然而这些深度集成模型的基础学习器都共享同样的神经网络架构,所以一个非常自然的提高多样性的拓展就是使用不同架构的神经网络来组成集成模型(architecture ensemble)。
集成模型是由多个网络架构不同的深度神经网络的基础学习器组成的集成模型。
然而搜索由不同的网络架构组成的集成模型,复杂度远高于单个神经网络架构搜索(neural architecture search,NAS),因为搜索集成模型不仅要搜索性能较好的基础学习器的网络架构,还要探索不同的基础学习器之间的可能组合的性能。所以现有的方法需要完整评估大量的神经网络架构来生成一个基础学习器候选池,并通过贪心算法来评估不同组合生成的集成模型的测试性能。导致了搜索集成模型需要耗费大量的计算成本和时间。
发明内容
为了解决上述的问题,本申请实施例提供了一种搜索神经网络集成模型的方法、装置和终端设备。
第一方面,本申请实施例提供了一种搜索神经网络架构集成模型的方法,所述方法包括:获取数据集,所述数据集包括分类任务中的样本和标注;使用神经网络架构分布搜索算法进行搜索,包括:确定神经网络架构分布的超参;在所述超参定义的架构分布中采样一个神经网络架构;根据所述分类任务中的样本和标注对所述神经网络架构训练和评估,得到性能指标;根据所述性能指标确定共享所述超参的神经网络架构分布,获得基础学习器的候选池;所述基础学习器为符合所述架构分布要求的神经网络架构;所述神经网络架构由神经网络架构单元重复堆叠而成;确定代理模型;所述代理模型用于预测未评估的神经网络架构的测试性能;通过代理模型预测所述候选池中基础学习器的测试性能,确定符合所述分类任务要求的k个基础学习器组成集成模型,所述集成模型的大小为k。以此,大大减少了对单个神经网络架构和单个集成模型的评估次数,从而在不降低搜索质量的同时显著降低了结构集成模型的难度和成本;针对的集成模型相较于单个深度神经网络模型更擅长拒识OOD数据,因此对于数据分布扰动(distributional shift)更鲁棒。
作为一个可行的实施方式,所述使用神经网络架构分布搜索算法进行搜索,包括:使用基于学习算子概率分布的近似神经网络架构搜索(approximate neural architecture search via operation distribution,ANASOD)算法进行神经网络架构分布搜索。以此,可以对搜索空间中更大的部分进行遍历,极大地提升搜索效率。
作为一个可行的实施方式,所述确定神经网络架构分布的超参,包括:确定神经网络架构分布的超参为ANASOD编码;所述ANASOD编码为指示神经网络架构单元中各种算子的概率分布的向量,所述ANASOD编码和神经网络架构单元的映射是一对多。以此,可以通过算子概率分布对NAS问题进行的近似,极大地压缩搜索空间。
作为一个可行的实施方式,,所述确定神经网络架构分布的超参,包括:采用搜索策略优化神经网络架构分布的超参,所述搜索策略为贝叶斯优化,所述搜索策略用于在下一次迭代中采样到比当前的所述神经网络架构单元的性能指标更符合要求的神经网络单元。以此,每一次挑选和评估的都是超参定义覆盖下的架构分布,因此可以对搜索空间中更大的部分进行遍历,极大地提升搜索效率。
作为一个可行的实施方式,在所述超参定义的架构分布中采样一个神经网络架构,包括:根据所述ANASOD编码定义的算子概率分布,确定所述神经网络架构的组成单元中各个算子的具体数量;根据设定的搜索空间连接不同的算子来获得所述神经网络架构。以此可以获得一个符合超参定义的有效的架构,作为所有共享此ANASOD编码θ的神经网络架构分布的性能代理。
作为一个可行的实施方式,所述对所述神经网络架构单元在所述数据集上训练和评估,得到性能指标,包括:在训练数据集上训练所述神经网络架构;在验证数据集上评估所述神经网络架构,获得性能指标;所述训练集数据和验证集数据同属于所述数据集。以此,可以从每个架构分布中只采样和评估一个神经网络架构,并将其性能指标y作为所有共享此ANASOD编码θ的神经网络架构单元的性能指标,可以有效地避免重复评估相似架构单元性能带来的高额成本。
作为一个可行的实施方式,所述将所述使用神经网络架构分布搜索(distributional NAS)算法进行搜索,还包括:根据所述性能指标和所述超参确定所述神经网络架构分布的搜索策略。以此,调整搜索策略,确定下一次迭代中神经网络架构分布搜索的搜索策略,在下一次迭代中采样到比当前的所述神经网络架构单元的性能指标更符合要求的神经网络单元。
作为一个可行的实施方式,所述使用神经网络架构分布搜索(distributional NAS)算法进行搜索,还包括:根据每次搜索出来的神经网络架构分布的超参和性能指标确定其他未知分布的超参的性能预测值,包括均值和方差;根据所述均值和方差确定神经网络架构分布的性能预测策略,所述性能预测策略用于预测神经网络架构分布的性能指标。以此,可以根据性能预测值来更新性能预测策略(θ_t,y_t),从而确定下一次搜索策略。
作为一个可行的实施方式,所述根据所述性能指标确定共享所述超参的神经网络架构分布,获得基础学习器的候选池,包括:根据所述性能指标和所述超参确定所述神经网络架构分布的搜索策略;根据所述性能指标和所述神经网络架构单元确定所述神经网络架构分布的性能预测策略;根据所述搜索策略和性能预测策略,在共享所述超参的所述神经网络架构分布中搜索,确定基础学习器的候选池。以此,可以得到最优的神经网络架构分布,优质的架构分布能生成性能相近的优质神经网络架构,提供一个好的基础学习器候选池。
作为一个可行的实施方式,所述根据所述性能指标确定共享所述超参的神经网络架构分布,获得基础学习器的候选池,包括:根据历史搜索中的多个神经网络架构和对应性能指标输出多个共享所述超参的神经网络架构;根据所述多个共享所述超参的神经网络架构确定符合要求的神经网络架构分布;根据所述符合要求的神经网络架构分布,生成多个神经网络架构单元,获得基础学习器的生成分布/候选池。以此,用神 经网络架构分布搜索的方法来学习候选池/架构分布,更高效,且大大减少了评估单个网络架构的次数和成本。
作为一个可行的实施方式,所述确定代理模型,包括:根据所述神经网络架构单元和所述性能指标通过在所述数据集上训练,获得所述代理模型。以此,可以通过代理模型来直接预测其他未评估的架构的性能,以避免巨大的评估成本。
作为一个可行的实施方式,所述通过代理模型预测所述候选池中基础学习器的测试性能,确定符合任务场景要求的k个基础学习器组成集成模型,包括:通过代理模型预测所述候选池中多个基础学习器的测试性能;根据预测结果进行区域搜索(local search),确定q个预估的顶点架构,所述预估的顶点架构为所述代理模型在验证集上预测的性能指标高于相邻架构的神经网络架构;将所述q个预估的顶点架构中性能指标符合要求的k个架构进行组合,得到集成模型。以此,可以从候选池中挑选出最优组合,将复杂度极高的排列组合问题难度降低,只需评估极少次基础学习器的组合就可以搜索出优质集成模型。
作为一个可行的实施方式,将所述q个预估的顶点架构中性能指标符合要求的k个架构进行组合,包括:将q个预估的顶点架构的性能指标由优到劣排序,取性能指标位于前面的k个架构进行组合。以此,可以从候选池中挑选出最优组合,将复杂度极高的排列组合问题难度降低,只需评估极少次基础学习器的组合就可以搜索出优质集成模型。
作为一个可行的实施方式,将所述q个预估的顶点架构中性能指标符合要求的k个架构进行组合,包括:使用贪心算法(greedy selection algorithm),遍历q个预估的顶点架构,依此选取添加k个架构进入集成模型。以此,可以从候选池中挑选出最优组合,将排列组合的可能性和复杂度大大降低(从候选池的所有基础学习器中挑选k个减少到从q个基础学习器中挑选k个),只需评估极少次基础学习器的组合就可以搜索出优质集成模型。
第二方面,本申请实施例提供一种搜索神经网络架构集成模型的装置,所述装置包括:数据获取模块,用于获取数据集,所述数据集包括分类任务中的样本和标注;架构分布搜索模块,用于使用神经网络架构分布搜索算法进行搜索,包括:用于确定神经网络架构分布的超参;在所述超参定义的架构分布中采样一个神经网络架构;根据所述分类任务中的样本和标注对所述神经网络架构训练和评估,得到性能指标;根据所述性能指标确定共享所述超参的神经网络架构分布,获得基础学习器的候选池;所述基础学习器为符合所述架构分布要求的神经网络架构;所述神经网络架构由神经网络架构单元重复堆叠而成;确定代理模型;所述代理模型用于预测未评估的神经网络架构的测试性能;架构集成模型组合模块,用于通过代理模型预测所述候选池中基础学习器的测试性能,确定符合所述分类任务要求的k个基础学习器组成集成模型,所述集成模型的大小为k。
第三方面,本申请实施例提供一种电子装置,包括处理器和存储器;所述处理器用于执行所述存储器所存储的计算机执行指令,所述处理器运行所述计算机执行指令执行第一方面任意实施例所述的基于演化学习的神经网络结构搜索的方法。
第四方面,本申请实施例提供一种存储介质,包括可读存储介质和存储在所述可 读存储介质中的计算机程序,所述计算机程序用于实现第一方面任意一实施例所述的基于演化学习的神经网络结构搜索的方法。
附图说明
为了更清楚地说明本说明书披露的多个实施例的技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本说明书披露的多个实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其它的附图。
下面对实施例或现有技术描述中所需使用的附图作简单地介绍。
图1为第一方案的集成模型搜索方法流程图;
图2为本申请实施例提供的系统架构图;
图3为本申请实施例提供的搜索神经网络集成模型的方法应用于图片分类场景的示意图;
图4为本申请实施例提供的搜索神经网络集成模型的方法应用于目标检测识别场景的示意图;
图5为本申请实施例提供的搜索神经网络集成模型的流程框图;
图6为包括本申请提供的DistriNAS-PM在内的各个基准在CIFAR10上取得的测试误差对比曲线图;
图7为随机选择添加到CIFAR10和CIFAR100验证集图片上的15种干扰/噪音方式示意图;
图8为随机选择添加到CIFAR10和CIFAR100验证集图片上的干扰/噪音后的效果图;
图9为在NAS-Bench-201空间上本申请提供的DistriNAS-PM与其他搜索方法的OOD验证比较示意图;
图10为一种电子设备示意图。
具体实施方式
在以下的描述中,涉及到“一些实施例”,其描述了所有可能实施例的子集,但是可以理解,“一些实施例”可以是所有可能实施例的相同子集或不同子集,并且可以在不冲突的情况下相互结合。
在以下的描述中,所涉及的术语“第一\第二\第三等”或模块A、模块B、模块C等,仅用于区别类似的对象,不代表针对对象的特定排序,可以理解地,在允许的情况下可以互换特定的顺序或先后次序,以使这里描述的本申请实施例能够以除了在这里图示或描述的以外的顺序实施。
在以下的描述中,所涉及的表示步骤的标号,如S110、S120……等,并不表示一定会按此步骤执行,在允许的情况下可以互换前后步骤的顺序,或同时执行。
除非另有定义,本文所使用的所有的技术和科学术语与属于本申请的技术领域的技术人员通常理解的含义相同。本文中所使用的术语只是为了描述本申请实施例的目的,不是旨在限制本申请。
下面将结合附图,对与本申请实施例相关的技术方案进行描述。
第一方案如图1所示,采用随机搜索(random search,NES-RS)或者进化算法(evolutionary algorithm,NES-RE)来搜索适合作为基础学习器的架构,从而搭建一个足够大的基础学习器的候选池,然后采用贪心算法(greedy selection algorithm,GSA)来遍历候选池中的基础学习器,逐个挑选组成最终集合的成员。
其中,随机搜索算法是以目标函数和基础学习器候选池大小n pool为输入,在NAS搜索空间中随机采样n pool个架构单元,并对每个架构单元进行完整训练和性能评估获得其指标,输出为基础学习器候选池。
进化算法是以目标函数和基础学习器候选池大小n pool为输入,在NAS搜索空间中随机采样n init个架构单元,并对每个架构单元进行完整训练和性能评估获得其指标。将性能指标最好的n parent个架构单元作为父代单元。在达到终止标准前迭代执行:从父辈单元中随机采样B个架构单元,将其进行随机变异获得B个子代架构单元;对这个B个子代架构单元进行完全训练和性能评估;遍历这B个子代架构单元,从中挑选出能组成最大优化目标函数的集成模型的架构单元,将其加入父代架构单元池,并将最旧的父代架构单元去除以保证父代架构单元池大小不变仍然为n parent。输出为基础学习器候选池。
贪心算法是以基础学习器候选池和架构集成大小k为输入;初始化架构集成,将其设为候选池中测试误差最低的神经网络架构,并将此架构从候选池中去除,在架构集成大小小于k前迭代执行:遍历候选池中剩余的基础学习器,逐个将其添加到现有的架构集成中,评估新的架构集成的性能;选择能导致最大性能提升的架构加入现有的架构集成中,并将此架构从候选池中去除。输出为最终集成模型。
此方法采用了传统NAS的思维,将架构搜索空间中的每一个神经网络架构都看做是一个单一的个体,所以被选入候选池的每一个神经网络架构/基础学习器都需要经过完整训练和评估。而此类方法的架构搜索一般需要一个含有几百个神经网络架构的候选池,导致搭建候选池这一环节就需要极高的评估成本。
在从候选池中寻找最优的基础学习器组合/架构集成时,使用的贪心算法虽然能将原本高复杂度的排列组合问题大大简化,却依然需要评估大量可能的架构组合/架构集成,导致集成挑选这一环节又增加了很多评估成本。
因此此类搜索方法需要花费高昂的图形处理单元(graphics processing unit,GPU)时间和算力成本,才能找到高质量的集成模型。对于GPU时间和硬件资源的巨大要求往往限制了其在实际场景中的应用。
GPU时间(GPU time)是一个常见的衡量算法计算量的单位,为完成某任务单个GPU需运行的时间,具体表达为GPU-天数(GPU-days),GPU-秒数(GPU-seconds)等。
现有的自动搜索集成模型的方法搜索成本巨大,搜索效率低下。其根本原因是它们需要先构建一个足够大的基础学习器候选池,然后尝试候选池中基础学习器的多种组合来找到最优的集成模型。而构建一个足够大的基础学习器候选池需要完整训练及评估大量的神经网络架构,因为采用传统NAS的方法以单个神经网络架构作为搜索目标,很多极为相似的神经网络架构也需要被单独训练和评估。从候选池中基础学习器 的多种组合找到最优的集成模型则需要评估众多可能的架构组合,虽然使用贪心算法可以将NP-难度的排列组合问题大大简化,但是需要评估的组合数量依然和候选池中基础学习器的数量以及集成模型的规模成正比。
下面将结合本申请实施例中的附图,对与本申请实施例的技术方案进行描述。
在现有的神经网络架构搜索空间中,集成模型具有以下特性:集成模型的性能取决于集成中基础学习器的平均性能;在保证平均性能优异的前提下,基础学习器之间的差异即集成的多样性越大,集成模型的性能越好;神经网络架构搜索空间中的顶点架构的预测输出差异相对较大,由顶点架构组成的集成模型具有较高的多样性。其中,顶点架构是指测试精度高于其直接相邻的其它架构的测试精度的架构。
基于以上集成模型的特性,本申请实施例提供的一种搜索神经网络集成模型的方法使用神经网络架构分布搜索(distributional NAS)确定生成基础学习器的架构分布或候选池。根据生成基础学习器的架构分布或候选池,通过代理模型(surrogate model)来预测候选池中最优的神经网络顶点架构,将这些顶点架构进行组合,获得集成模型。
需要理解的是,神经网络架构分布搜索将架构相似或属于同一分布的神经网络整合起来进行评估,从而避免反复多次评估单个神经网络架构,大大提高搜索效率。同时,从最终找到的架构分布中生成的基础学习器都具有比较接近且优异的平均性能,从而满足了集成模型的性能取决于集成模型中基础学习器的平均性能的要求。
本申请实施例提供的一种搜索神经网络集成模型的方法采用更高效的基础学习器候选池生成方案和基础学习器组合方案,能够降低集成模型搜索成本,在保证性能的前提下,大大提高集成模型的搜索成本,使得在更大的搜索空间中搜索集成模型的成本可控可接受,适用于更多的实际生产场景下。
代理模型通常是一种简单模型,用于模拟过于复杂或者黑盒的实际问题,也可用于快速预测黑盒或者过于复杂的问题的输出。代理模型能快速且相对准确地预测神经网络架构的测试性能,从而避免了大量的训练和评估来获取它们的真实性能。基于神经网络架构搜索空间中的顶点架构的预测输出差异相对较大,而由顶点架构组成的集成模型具有较高的多样性,将这些顶点架构进行组合,能保证集成模型的多样性,避免多次尝试不同架构组合来生成的集成模型的评估成本。
因为集成模型的性能主要取决于其基础学习器的平均性能和多样性,本申请实施例提供的一种搜索神经网络集成模型的方法在减少评估次数的同时,针对性地保证了集成模型的上述特性,从而使得用户能高效地搜索到高性能的集成模型。
图2为本申请实施例提供的系统架构图。本申请实施例提供的一种搜索神经网络集成模型的方法可广泛应用于各种需要使用卷积神经网络的系统架构和场景。如图3所示,数据收集设备10通过各种途径获得所需要的数据或样本,并向计算设备11提供数据。本申请实施例提供的搜索神经网络集成模型的方法将在计算设备11上运行,进行神经网络架构的搜索和对最终搜索出的架构进行训练,并将训练后的集成模型部署到各种应用场景的设备上,如个人计算机12,服务器13和移动设备14等。
图3为本申请实施例提供的搜索神经网络集成模型的方法应用于图片分类场景的示意图。如图3所示,用户经常在智能手机或者其他多媒体储存设备的用户相册中存储大量图片,而根据图片内的信息对它们进行分类可以帮助用户管理与查找。本申请 实施例提供的搜索神经网络集成模型的方法可以快速地在相似的图片分类任务上预先搜索并训练出最适合的卷积神经网络的集成模型部署在智能手机上,以替代常见的人类专家手动设计的网络模型,从而达到更高的分类准确率,提升用户的使用体验。
图4为本申请实施例提供的搜索神经网络集成模型的方法应用于目标检测识别场景的示意图。如图4所示,应用本申请实施例提供的搜索神经网络集成模型的方法搜索到的集成模型,可以识别原有图片上的物体目标,输出带标记的目标检测后的图片。对图像或视频中物体目标的检测识别广泛应用于智慧城市、自动驾驶等任务中。与图3的应用场景相似,本申请实施例提供的搜索神经网络集成模型的方法可以针对于各种目标和限制,如移动设备的硬件限制,找到针对于各种场景中物体检测识别任务中最合适的神经网络骨架(neural network backbone),部署在相关设备上以提升识别精度。
本申请实施例提供的搜索神经网络集成模型的方法还适用于医学影像的图片分类情景。机器学习系统现已应用于医学影像方面帮助医护人员通过影像学资料进行诊断。经过已有数据的训练,通过本申请实施例提供的搜索神经网络集成模型的方法搜索到的卷积神经网络集成模型,不仅可以根据影像的特征高效且高精度地进行分类诊断,同时由于集成模型对预测的不确定性拥有较好的校准,也可以输出对其诊断的置信度,帮助医生筛选需要人工确认的案例。
本申请实施例提供的搜索神经网络集成模型的方法,可以部署在相关设备的计算节点上。其数据与代码可以储存在各种计算机中各种常见的储存器上。对于指令和功能模块的执行,除性能评估以外的步骤一般可以由中央处理器(central processing unit,CPU)执行。而性能评估涉及到对神经网络架构的训练,一般由图形处理单元(graphics processing unit,GPU)执行。本申请实施例提供的搜索神经网络集成模型的方法获得的集成模型在训练之后可以部署在各种计算机和移动计算设备上,应用于图片分类、
目标检测识别和医学影像的图片分类等各种需要卷积神经网络预测的任务中。
本申请实施例提供的搜索神经网络集成模型的方法,该方法包括:获取数据集,数据集包括分类任务中的样本和标注;使用神经网络架构分布搜索(distributional NAS)算法进行搜索,包括:确定神经网络架构分布的超参;在超参定义的架构分布中采样一个神经网络架构;根据分类任务中的样本和标注对神经网络架构训练和评估,得到性能指标;根据性能指标确定共享超参的神经网络架构分布,获得基础学习器的候选池;基础学习器为符合架构分布要求的神经网络架构;神经网络架构由神经网络架构单元重复堆叠而成;确定代理模型(surrogate model);代理模型用于预测未评估的神经网络架构的测试性能;通过代理模型预测候选池中基础学习器的测试性能,确定符合分类任务场景要求的k个基础学习器组成集成模型,集成模型的大小为k。
本申请实施例中会涉及两方面的评估:1)对于单个神经网络架构即基础学习器的评估包括了在训练数据集上从头开始训练该神经网络架构,并在训练结束后,在验证数据集上测评其测试精度等表现。2)对于单个集成模型的评估,在验证数据集上测评整个集成模型的表现。其中,评估次数(number of evaluations)是一个常见的衡量算法计算成本的指标。
图5为本申请实施例提供的搜索神经网络集成模型的方法的流程框图。下文将结 合图5详细解释本申请实施例提供一种搜索神经网络集成模型的方法中的每一步骤。如图5所示,本申请实施例提供一种搜索神经网络集成模型的方法,执行以下步骤1-5进行神经网络架构分布搜索获得基础学习器的候选池。
步骤1,获取数据集。
在一个可以实现的实施方式中,获取已有的数据和目标任务对应的数据。
示例性地,可以从现有的数据集或其他人工标注的数据中获得图片分类任务中一定数量的图片和其正确的标注作为样本(ground truth)。可以将这些数据作为训练数据集和/或测试集数据。
步骤2,确定架构搜索空间。
在一个可以实现的实施方式中,通过定义搜索空间和目标函数,确定备选神经网络架构的形态和神经网络架构搜索的目标。
其中,一种常见的搜索空间是基于神经网络架构单元(neural architecture cell)定义的,包括定义搜索神经网络架构单元中算子的数量、可供选择算子的类型、各个算子之间最大连接数量以及最终神经网络架构中神经网络架构单元堆叠的次数。
示例性地,可以定义一个NAS搜索空间为:一个神经网络架构单元a中算子的数量为10,可供选择算子的类型有3种:A、B、C;各个算子之间最大连接数量为i个,神经网络架构单元堆叠的次数为j次。
目标函数视目标任务而定义。常见的目标函数为最大化任务精度,示例性地,可以定义图片分类问题中验证集上的准确率为目标函数。目标函数也可以包括其他的限制和目标,示例性地,定义在最大化图片分类准确率的前提下最小化神经网络的每秒浮点运算量(floating-point operations per second,FLOPs)为目标函数,以应用于算力受限的移动设备。
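As a concrete, purely illustrative rendering of the search-space and objective definitions above, the sketch below encodes a cell-based search space and a FLOPs-constrained objective. The values chosen for the maximum number of connections, the stacking depth and the FLOPs budget, as well as the penalised form of the objective, are assumptions and not values taken from this application.

```python
from dataclasses import dataclass

@dataclass
class CellSearchSpace:
    """Cell-based search space: N operator slots, candidate operator types,
    a cap on connections, and how many times the cell is stacked into the full network."""
    n_slots: int = 10
    op_types: tuple = ("A", "B", "C")      # placeholder operator names from the example
    max_connections: int = 4               # assumed value for "i" in the text
    n_stacks: int = 8                      # assumed value for "j" in the text

def objective(val_accuracy: float, flops: float, flops_budget: float = 600e6) -> float:
    """One possible objective: maximise validation accuracy under a FLOPs budget
    (a simple penalised form; the text states only the goal, not this exact formula)."""
    penalty = max(0.0, flops - flops_budget) / flops_budget
    return val_accuracy - penalty

space = CellSearchSpace()
print(space, objective(0.94, 550e6))
```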
本申请实施例提供的搜索神经网络集成模型的方法使用神经网络架构分布搜索
(distributional NAS)算法进行搜索,确定架构搜索空间为神经网络架构分布搜索
(distributional NAS)的搜索空间。该神经网络架构分布分布的搜索空间在定义NAS搜索空间的同时定义架构分布的超参,以确定各个算子在神经网络架构单元中出现的分布概率。
在一个可以实现的实施方式中,神经网络架构分布搜索(distributional NAS)算法可以采用基于学习算子概率分布的近似神经网络架构搜索(approximate neural architecture search via operation distribution,ANASOD)算法进行神经网络架构分布搜索,在定义NAS搜索空间的同时,定义对应的ANASOD的编码θ为架构分布的超参。
在一个可以实现的实施方式中,可以定义ANASOD的搜索空间为:每一个神经网络架构单元a对应的编码θ是一个位于k维单纯形空间(simplex space)的向量,向量中的每个值为每种算子在神经网络架构单元a中出现的概率。
示例性地,一个常规的神经网络架构单元有10个算子,可供选择算子的类型有3种:A、B和C,其中A在神经网络架构单元中出现5次,B在神经网络架构单元中出现2次,C在神经网络架构单元中出现3次,则神经网络架构单元对应的编码θ为:[0.5,0.3,0.2]。
编码θ和神经网络架构单元的映射是一对多的,多个相似的神经网络架构单元共享同一个编码θ,从而极大地压缩搜索空间。例如以下神经网络架构单元共享同一个编码θ。
神经网络架构单元1:A-A-A-A-A-B-B-B-C-C
神经网络架构单元2:A-A-A-A-B-A-B-B-C-C
神经网络架构单元3:A-A-A-B-B-A-A-B-C-C
以上3个神经网络架构单元仅仅是共享编码θ=[0.5,0.3,0.2]的多个相似的神经网络架构单元中的一部分,其他的通过排列组合得到的相似的神经网络架构单元不在一一枚举。
与此同时,在NAS搜索空间中,拥有相同ANASOD编码θ的不同神经网络架构单元的在测试集上的测试精度表现往往极为相似,说明仅通过算子分布概率对NAS问题进行的近似是较为精确的。
可以理解的是,通过对现有的NAS搜索空间进行分析可知:搜索一个算子类型、数量和拓扑结构都已完全确定的精确解是不必要的。与之相反,一组具有相同算子类型和数量、相同的算子概率分布但拓扑结构不同的神经网络架构单元具有非常相似的性能表现。据此,将ANASOD编码θ定义为神经网络架构单元中各种算子的分布概率的向量,神经网络架构单元中各种算子的概率分布之和为1。
与NAS中庞大的排列组合原有搜索空间不同,ANASOD编码θ位于更易优化且更小的向量空间。基于此,可以在编码θ对应的搜索空间中使用一系列近似NAS算法的ANASOD算法,在基本保持搜索精度不受影响的情况下大大减少了搜索难度并提升了搜索效率。因为一个低维编码θ可对应一组多个近似的神经网络架构单元,ANASOD算法可以直接搜索多个近似的神经网络架构单元应用于集成模型。
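To make the ANASOD encoding concrete, the following sketch computes the encoding of a cell as the empirical distribution of its operators and verifies that the three similar cells listed above (5×A, 3×B, 2×C) all share the encoding [0.5, 0.3, 0.2]. The letter operator names are placeholders from the example, not real operator types.

```python
from collections import Counter

OP_TYPES = ("A", "B", "C")

def anasod_encoding(cell):
    """ANASOD encoding: the empirical operator distribution of a cell,
    i.e. a vector on the simplex whose entries sum to 1."""
    counts = Counter(cell)
    n = len(cell)
    return tuple(counts.get(op, 0) / n for op in OP_TYPES)

# the three similar cells from the text all map to the same encoding [0.5, 0.3, 0.2]
cells = [
    list("AAAAABBBCC"),
    list("AAAABABBCC"),
    list("AAABBAABCC"),
]
for c in cells:
    print(anasod_encoding(c))   # (0.5, 0.3, 0.2) for each: a one-to-many mapping
```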
本申请实施例提供的神经网络架构集成模型搜索方法也可以采用其他的神经网络架构分布搜索方法。
步骤3,采用搜索策略推荐新的神经网络架构分布的超参,以确定神经网络架构分布的超参。
在一个可以实现的实施方式中,搜索策略可以为贝叶斯优化,采用贝叶斯优化来推荐新的架构分布的超参。示例性地,可以采用贝叶斯优化推荐新的ANASOD的编码θ;还可以采用进化算法推荐新的神经网络架构分布的超参。
示例性地,搜索策略可以表示为:搜索策略(θ,y);其中θ为贝叶斯优化模型推荐的ANASOD的编码/超参,y为预测的神经网络架构分布的性能指标。其意义可以这样解读:搜索ANASOD的编码θ覆盖下的,预测的性能指标为y的神经网络架构分布。
用于预测神经网络架构分布的性能指标的性能预测模型可以是高斯过程(gaussian process)、贝叶斯神经网络(Bayesian neural network)、随机森林(random forest)等。
本申请实施例提供的采用搜索策略推荐新的神经网络架构分布的超参与第一方案的集成模型搜索方法完全不同;第一方案使用NAS的方法搜索适合成为集成模型基础学习器的候选架构,每次搜索策略直接推荐一个新的神经网络架构单元。而在本申请实施例提供的步骤3中每一次挑选和评估的都是超参定义覆盖下的架构分布,因此可 以对搜索空间中更大的部分进行遍历,极大地提升搜索效率。
步骤4,在超参定义的架构分布中随机采样一个神经网络架构,进行性能评估,得到性能指标。
在一个可以实现的实施方式中,在采用ANASOD优化/搜索架构分布(θ,a,y)时,从θ定义的架构分布中搜索和训练一个神经网络架构a,对其进行性能评估,并将其性能指标y作为拥有相同ANASOD编码θ的神经网络架构分布共同的性能指标。
在一个可以实现的实施方式中,步骤4包括:
步骤41,在超参定义的分布中随机采样一个神经网络架构。
在一个可以实现的实施方式中,在ANASOD的编码空间中,根据ANASOD编码θ定义的神经网络架构单元中的算子分布概率,确定神经网络架构单元中各个算子的具体数量;根据搜索空间的限制,随机连接不同的算子来获得神经网络架构单元;在确定神经网络架构单元后,根据搜索空间的定义堆叠该神经网络架构单元数次,得到神经网络架构a。在确定神经网络架构a中算子的分布概率符合ANASOD编码θ后,使用神经网络架构a作为所有共享此ANASOD编码θ的神经网络架构分布的性能代理。
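The sketch below shows one way to realise this sampling step: derive integer operator counts from the encoding θ by largest-remainder rounding, then place the operators randomly. The connection step is simplified to a random ordering of the slots; a real search space would also enforce the topology constraints mentioned above.

```python
import random

OP_TYPES = ("A", "B", "C")

def counts_from_encoding(theta, n_slots):
    """Turn operator probabilities into integer counts for a cell of n_slots operators
    (largest-remainder rounding so the counts always sum to n_slots)."""
    raw = [p * n_slots for p in theta]
    counts = [int(r) for r in raw]
    remainders = sorted(range(len(theta)), key=lambda i: raw[i] - counts[i], reverse=True)
    for i in remainders[: n_slots - sum(counts)]:
        counts[i] += 1
    return counts

def sample_cell(theta, n_slots=10, seed=None):
    """Sample one concrete cell realising the encoding theta: fix the operator counts,
    then arrange the operators randomly (stand-in for connecting them within the search space)."""
    rng = random.Random(seed)
    ops = [op for op, c in zip(OP_TYPES, counts_from_encoding(theta, n_slots)) for _ in range(c)]
    rng.shuffle(ops)
    return ops

print(sample_cell((0.5, 0.3, 0.2), seed=0))   # a random arrangement of 5 A's, 3 B's and 2 C's
```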
步骤42,对神经网络架构a在数据集上训练和评估,得到性能指标。
在一个可以实现的实施方式中,可以在步骤1获得的训练数据集上,根据分类任务中的样本和标注按照常规的神经网络优化方法训练该神经网络架构a,并根据步骤2中定义的目标函数在验证集上评估并获得其性能指标y。该过程可以表示为性能评估(y,a)。训练集数据和验证集数据属于同一分布,同属于一个数据集。
在一个可以实现的实施方式中,可以根据性能评估(y,a)进行神经网络架构分布的性能评估,将其性能指标y作为所有共享此超参θ的架构分布的性能指标。
由于同一架构分布生成的不同神经架构单元最终性能表现极为相似,因此在优化/搜索架构分布时,从每个架构分布中只采样和评估一个神经网络架构,并将其性能指标y作为所有共享此ANASOD编码θ的神经网络架构单元的性能指标,可以有效地避免重复评估相似架构单元性能带来的高额成本。
步骤5,根据性能指标y更新搜索策略;在更新搜索策略的同时根据搜索历史输出目前最优的神经网络架构分布;更新神经网络架构性能代理模型。目前最优的神经网络架构分布为符合要求的神经网络架构分布。
在一个可以实现的实施方式中,可以根据性能指标和超参确定神经网络架构分布的搜索策略,使其在下一次迭代中搜索到更高性能指标的神经网络架构单元。包括以下步骤:
步骤51,根据超参和性能指标更新搜索策略(θ_t,y_t),确定搜索策略。t为迭代次数。
示例性地,根据每次搜索出来的神经网络架构分布的超参θ_t和性能指标y_t调整搜索策略,确定下一次迭代中神经网络架构分布搜索的搜索策略。
步骤52,根据性能指标和神经网络架构分布的超参确定神经网络架构分布的性能预测策略。
在一个可以实现的实施方式中,可以根据每次搜索出来的神经网络架构分布的超参θ_t和性能指标y_t更新性能预测策略(θ_t,y_t),从而确定神经网络架构分布 的性能预测策略。
示例性地,将每次搜索出来的神经网络架构分布的超参θ_t和其性能指标(θ_t,y_t)输入性能预测模型,可以输出其他未知分布的超参的性能预测值,包括均值(mean)m值和方差(variance)v值:
第t次的性能预测值(m,v)=性能预测模型(θ|{θ_i,y_i}^t_{i=1,2,…t})。
其中,θ是未知的超参数值,{}部分指的是搜索评估过的超参历史。
步骤53,根据性能预测值来更新性能预测策略,从而确定下一次的评估目标(θ_(t+1),y_t+1)。
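The following sketch illustrates one possible instantiation of steps 52-53 under stated assumptions: a Gaussian process over the evaluated ANASOD encodings supplies the mean and variance, and an upper-confidence-bound rule over random simplex candidates picks the next encoding θ_(t+1). The text also allows Bayesian neural networks or random forests as the performance-prediction model; the UCB rule, the Matern kernel and the toy history here are illustrative choices, not the application's prescribed ones.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def propose_next_encoding(thetas, scores, n_candidates=500, beta=2.0, rng=None):
    """Fit a GP to the (encoding, score) history and pick the next encoding by
    maximising mean + beta * std over random candidates on the simplex."""
    if rng is None:
        rng = np.random.default_rng(0)
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(np.asarray(thetas), np.asarray(scores))
    cand = rng.dirichlet(np.ones(len(thetas[0])), size=n_candidates)   # points on the simplex
    mean, std = gp.predict(cand, return_std=True)
    return cand[np.argmax(mean + beta * std)]

# toy history of three evaluated encodings and their validation scores
history_theta = [[0.5, 0.3, 0.2], [0.2, 0.5, 0.3], [0.4, 0.4, 0.2]]
history_score = [0.91, 0.88, 0.90]
print(propose_next_encoding(history_theta, history_score))
```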
步骤54,根据搜索历史输出目前最优的神经网络架构分布。
在一个可以实现的实施方式中,根据历史搜索中的多个神经网络架构分布和对应性能指标确定符合要求的神经网络架构分布。
示例性地,搜索历史有3个架构分布的超参及其性能:(θ_1,y_1)、(θ_2,y_2)、(θ_3,y_3),其中(θ_2,y_2)的性能指标y_2为最优,则共享超参θ_2的架构分布为符合要求的神经网络架构分布。
步骤55,根据符合要求的神经网络架构分布,生成多个神经网络架构单元,获得基础学习器的生成分布/候选池。
在一个可以实现的实施方式中,可以根据S54输出的最优的神经网络架构分布,随机生成众多具体的神经网络架构,获得基础学习器的生成分布/候选池。其中最优的神经网络架构分布为符合要求的神经网络架构分布。
步骤56,确定代理模型,代理模型用于预测未评估的神经网络架构的测试性能,以帮助预测和辅助搜索最优的顶点架构,从而快速生成高质量的集成模型。
在一个可以实现的实施方式中,可以根据t次搜索过程中评估过的神经网络架构单元和性能指标训练和更新代理模型。
在一个可以实现的实施方式中,代理模型可以为基于weisfeiler-lehman图核的高斯过程(gaussian process with WL graph kernel,GPWL),示例性地,y=GPWL模型(a|{a_i,y_i}^t_{i=1,2,…t}),其中其中,a是未知的神经网络架构,y是代理模型预测的架构性能,{}部分指的是搜索评估过的神经网络架构历史。
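As a lightweight stand-in for the GPWL proxy model, the sketch below extracts Weisfeiler-Lehman subtree features from small labelled graphs (operators as node labels) and fits a Gaussian process with a dot-product kernel over those feature counts, which is essentially the WL subtree kernel up to the kernel's constant term. The graphs, error values and the sklearn-based implementation are illustrative assumptions rather than the exact GPWL used here.

```python
import numpy as np
from collections import Counter
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import DotProduct

def wl_features(labels, edges, h=2):
    """Weisfeiler-Lehman subtree feature counts for a small labelled graph
    (nodes = operators, edges = connections inside the cell)."""
    adj = {i: [] for i in range(len(labels))}
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    cur = list(labels)
    feats = Counter(cur)
    for _ in range(h):
        cur = [f"{cur[i]}|{'.'.join(sorted(cur[j] for j in adj[i]))}" for i in range(len(cur))]
        feats.update(cur)
    return feats

def to_matrix(feature_counters):
    keys = sorted({k for f in feature_counters for k in f})
    return np.array([[f.get(k, 0.0) for k in keys] for f in feature_counters])

# two tiny "cells" with known validation error and one unevaluated query cell
graphs = [
    (["conv3x3", "conv1x1", "skip"], [(0, 1), (1, 2)]),
    (["conv3x3", "avgpool", "skip"], [(0, 1), (1, 2)]),
    (["conv1x1", "conv1x1", "skip"], [(0, 1), (1, 2)]),
]
X = to_matrix([wl_features(lbl, e) for lbl, e in graphs])
y = np.array([0.09, 0.12])                      # errors of the two evaluated cells
gp = GaussianProcessRegressor(kernel=DotProduct(), normalize_y=True).fit(X[:2], y)
mean, std = gp.predict(X[2:], return_std=True)  # predicted error of the unevaluated cell
print(mean, std)
```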
根据神经网络架构分布搜索(distributional NAS)的工作以及实验验证,优质的架构分布往往能生成性能相近的优质神经网络架构。因此,对于集成模型搜索,一个好的架构分布很自然地提供了一个好的基础学习器候选池。相较于现有技术使用传统NAS来迭代评估单个神经网络架构后添加到候选池中,本申请实施例提出的用神经网络架构分布搜索的方法来学习候选池/架构分布,更高效,且大大减少了评估单个网络架构的次数和成本。
集成模型的效果不仅仅取决于基础学习器的性能,还取决于其组合中基础学习器的多样性。因此直接将性能最优的k个基础学习器组合起来,未必能生成最优的集成模型。而从候选池中挑选出大小为k的最优组合,是一个复杂度极高的排列组合问题。
第一方案中的集成模型搜索主要通过贪心算法来逐个选择组合中的基础学习器,但是这种方法需要评估的集成模型的数量和候选池大小以最终集成模型的大小成正比。
根据之前集成模型的研究以及进一步的实验分析,可以发现在基础学习器的平均 性能保证的前提下,组成集成模型的学习器之间的预测差异越大,集成模型的性能越好。而搜索空间中的顶点架构之间的预测差异往往较大。因此将k个最好的顶点架构组合能自然地保证集成模型的多样性,生成性能优异的架构集成。而要找到真正的顶点架构需要遍历评估搜索空间中所有的神经网络架构。
本申请实施例的搜索神经网络集成模型的方法执行如下步骤6进行架构采样和搜索架构集成,只需评估极少次(甚至1次)基础学习器的组合就可以搜索出优质集成模型。下面对步骤6详细阐述。
步骤6,通过代理模型预测候选池中基础学习器的测试性能,确定符合分类任务场景要求的k个基础学习器组成集成模型,集成模型的大小为k。
根据基础学习器的生成分布/候选池,需要从中搜索出最合适的k个基础学习器来组成集成模型,则集成模型的大小为k。
在一个可以实现的实施方式中,步骤6通过以下步骤S61-S63获得集成模型
步骤61,使用步骤5中确定的代理模型来直接预测其他未评估的架构的性能,以避免巨大的评估成本。
在一个可以实现的实施方式中,可以先从最优的架构分布中进行随机采样,然后从多个采样的架构开始,根据预测的性能指标进行区域搜索(local search)。
步骤62,根据代理模型输出的预测性能指标,确定q个预估的顶点架构,预估的顶点架构为代理模型在验证集上预测的性能指标高于相邻架构的神经网络架构。
其中,相邻架构是指算子排布只相差一位的网络架构。示例性地,有如下3个神经网络架构,架构1:A-A-B-C,架构2:A-B-B-C,架构3:A-A-C-C,架构1与架构2互为相邻架构,架构1与架构3互为相邻架构。
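A minimal sketch of the local search in step 62, assuming architectures are tuples of operators and neighbours differ in exactly one slot as defined above: hill-climb on the proxy model's predicted score until a predicted vertex (local optimum) architecture is reached. The toy predictor below stands in for the GPWL predictions.

```python
OP_TYPES = ("A", "B", "C")

def neighbours(cell):
    """Architectures whose operator assignment differs from `cell` in exactly one slot."""
    for i, op in enumerate(cell):
        for new_op in OP_TYPES:
            if new_op != op:
                yield cell[:i] + (new_op,) + cell[i + 1:]

def local_search_to_peak(cell, predict):
    """Hill-climb on the proxy's predicted score until no neighbour is better:
    the returned cell is a predicted 'vertex' (local optimum) architecture."""
    current, best = cell, predict(cell)
    improved = True
    while improved:
        improved = False
        for nb in neighbours(current):
            s = predict(nb)
            if s > best:
                current, best, improved = nb, s, True
    return current, best

# toy proxy: prefers cells with many "A" operators (stands in for GPWL predictions)
toy_predict = lambda cell: cell.count("A") - 0.1 * cell.count("C")
print(local_search_to_peak(("B", "C", "A", "B"), toy_predict))
```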
步骤63,将q个预估的顶点架构中性能指标最好的k个进行组合得到集成模型。
在一个可以实现的实施方式中,可将q个预估的顶点架构的性能指标由优到劣排序,取性能指标位于前面的k个架构进行组合,得到集成模型。在一个可以实现的实施方式中,可将q个预估的顶点架构中性能指标符合分类任务要求的k个进行组合,得到集成模型。分类任务要求可以是准确度最高,错误率最低或损失函数最小等。
在一个可以实现的实施方式中,可使用贪心算法,在q个预估顶点架构中逐个找出能带来现有集成模型性能最大提升的k个进行组合,得到最终集成模型。
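The two combination rules described in step 63 can be sketched as follows: a top-k selection by predicted performance, and a greedy alternative that repeatedly adds whichever vertex architecture most improves the (predicted) score of the current ensemble. The toy quality and diversity scores are invented purely for illustration.

```python
def top_k_ensemble(peaks, predict, k):
    """Simplest rule: rank the q predicted vertex architectures and keep the best k."""
    return sorted(peaks, key=predict, reverse=True)[:k]

def greedy_ensemble(peaks, ensemble_score, k):
    """Greedy alternative: add the vertex architecture that most improves the
    (predicted) score of the current ensemble, until it has k members."""
    ensemble, remaining = [], list(peaks)
    while len(ensemble) < k and remaining:
        best = max(remaining, key=lambda a: ensemble_score(ensemble + [a]))
        ensemble.append(best)
        remaining.remove(best)
    return ensemble

# toy example: score an ensemble by average member quality plus a diversity bonus
peaks = [("A", "A", "B"), ("A", "B", "B"), ("A", "A", "C"), ("C", "B", "B")]
quality = lambda a: a.count("A")
diversity = lambda ens: len(set(ens))
score = lambda ens: sum(map(quality, ens)) / len(ens) + 0.5 * diversity(ens)
print(top_k_ensemble(peaks, quality, k=2))
print(greedy_ensemble(peaks, score, k=2))
```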
最后,执行步骤S7,输出集成模型。
在一个可以实现的实施方式中,可以定义最大搜索次数或最大搜索时间为终止标准,如果终止标准还未达到,则继续步骤3至步骤6的迭代。在达到终止标准后,本申请实施例的算法迭代终止。下游可直接将输出的集成模型应用于前述所提到的各种场景。
实施例一
在图片分类任务中应用本申请实施例提供的搜索神经网络集成模型的方法,目的是在常见的搜索空间如DARTS和NAS-Bench-201上和数据集如CIFAR10,CIFAR100和ImageNet16-120上搜索高性能的神经网络集成模型,并设定训练集数据和验证集数据属于同一分布。下面详细地描述本申请提供的搜索神经网络集成模型的方法在实施例一的图片分类任务场景中的具体步骤。
S701,数据获取。
示例性地,可以从常见的图片分类数据集中获取数据。示例性地,CIFAR-10和CIFAR-100的训练集/验证集中各有50,000/10,000张图片,而ImageNet16-120中则有超过154,700/3,000张训练/验证集图片,可以从这些数据集中获取训练集数据和验证集数据。
在一个可以实现的实施方式中,可以使用训练集的图片及其人工标注的标签搜索神经网络架构,并在验证集上进行验证。
S702,定义搜索空间及目标函数。
在一个可以实现的实施方式中,可以定义搜索空间为NAS-Bench-201的搜索空间:其中有6个算子位置N=6,和5种算子类型选项k=5,5种算子类型为:3x3卷积、3x3平局池化、1x1卷积、残差连接和零连接。该搜索空间中包含的总神经网络架构单元数量有15,625个,去除同构(isomorphic)架构单元后,一共有6466个不同(unique)的神经网络架构单元。在这个搜索空间里,本申请的目标函数是输出集成模型在CIFAR-10验证集上的分类错误率。
在NAS-Bench-201搜索空间里,希望找到的集成模型是在验证集上分类错误率低的集成模型。
在一个可以实现的实施方式中,可以定义搜索空间为基于DARTS的搜索空间,其中有8个算子位置N=8,和7种类型选项k=7,其算子的具体类型为:3x3或5x5大小的可分卷积(separable convolution)、3x3或5x5大小的空洞卷积(dilated convolution)、3x3最大池化(max pooling)、3x3平均池化(average pooling)和残差连接(skip connection)。在这个搜索空间里,本申请的目标函数同样是输出集成模型在CIFAR-10验证集上的分类错误率。
下面,基于ANASOD和高斯过程进行神经网络集成模型的搜索,具体步骤详见以下:
以目标函数和集成模型大小k为输入,初始化架构分布搜索ANASOD及初始架构分布的超参,又称做ANASOD编码。在达到终止标准前迭代以下步骤:
S703,由搜索策略推荐架构分布的超参,即ANASOD编码。
S704,在ANASOD编码定义的算子分布中随机采样一个具体的神经网络架构,进行性能评估,并将评估结果作为所有共享此ANASOD编码的神经网络架构的性能代理。
S705,使用S704神经架构的评估结果更新分布搜索策略和代理模型,代理模型为基于weisfeiler-lehman图核的高斯过程结(gaussian process with WL graph kernel,GPWL)。
S706,从搜索策略推荐的当前最优架构分布/ANASOD编码中进行q个采样,使用代理模型评估每个采样的测试性能,然后从每个采样开始,运用局部搜索找到其附近的预估的顶点架构,从q个预估的顶点架构中选择预测性能最优的k个架构组成集成模型。
输出集成模型。
在本申请提供的实施例一中,在架构集成搜索的第一阶段采用了ANASOD神经网 络架构分布搜索方法,如步骤S703-S705,在第二阶段采用了基于weisfeiler-lehman图核的高斯过程结(gaussian process with WL graph kernel,GPWL)作为代理模型,如步骤S705中被更新的代理模型,来辅助顶点模型的预测和最终集成的搜索,如步骤S706。
在NAS-Bench-201的搜索空间上将本申请实施例提供的搜索神经网络集成模型的方法与现有的架构集成搜索标杆NES-RS和NES-RE进行对比。表1是在NAS-Bench-201搜索空间上,本申请实施例提供的搜索神经网络集成模型的方法(记为DistriNAS-PM)和其他方法的比较数据。结果是10次试验中的平均验证集错误率(%)(±1个标准误差)。表1中本申请实施例提供的搜索神经网络集成模型的方法记为DistriNAS-PM,其他方法包括NES-RS、NES-RE和Deep Ensemble,分别在数据集CIFAR10,CIFAR100和ImageNet16-120上比较评估数量、测试误差和置信度。
如表1可见,本申请实施例提出的方法DistriNAS-PM评估的网络架构的数量N为30,相比于NES-RS和NES-RE,本申请实施例提出的方法DistriNAS-PM只需要不到1/3的搜索成本就可以找到到测试误差(Error)相当甚至更低的集成模型。而且本申请实施例找到的架构集成不仅仅测试误差低,模型的校准程度、置信度也和NES-RE找到的最优集成模型相当。其中置信度由负对数似然(negative log likelihood,NLL)衡量,NLL越低,模型校准程度越高。
表1
Figure PCTCN2022123139-appb-000001
此外,表1的最后一行,添加了深度集成(deep ensemble)基准:将本申请实施例找到的架构集成中的基础学习器各自使用k个不同的训练初始权重进行训练来评估其对应的深度集成效果。表1的结果再次证明使用不同架构(architecture)组成的集成模型比使用不同训练初始权重(initialization)组成的集成模型更优。
图6为包括本申请提供的DistriNAS-PM在内的各个基准在CIFAR10上取得的测试误差对比曲线图。如图6所示,将各个基准在CIFAR10上取得的测试误差随着搜索的进行和架构评估的数量逐渐增加进行了对比。包括NAS-Bench-201搜索空间中不同方法搜索到的集成模型如DistriNAS-PM、NES-RS和NES-RE,以及本申请DistriNAS找到的架构集成中每个架构对应的深度集成模型(DeepEnsemble),和最优基础学习器,在CIFAR-10数据集上的验证集错误率(%)错误率越低越好。图8上x轴是架构评估的数量。
由图6可知,本申请实施例提出的架构集成搜索方法(distriNAS-pm)比NES-RS 和NES-RE能更快地找到验证误差更低的架构集成,大大减少了搜索成本;对比架构集成和深度集成,架构集成的验证取得的误差更低;所有的集成模型都明显优于最优的单个神经网络架构/基础学习器。
为了验证本申请实施例的搜索方法在更大,更真实的搜索空间上的效果,可以进一步在DARTS搜索空间上和使用不同搜索成本的NES-RS进行对比。表2为在DARTS搜索空间上,本申请实施例提出的方法(DistriNAS-PM)和使用不同搜索成本的NES-RS在CIFAR10任务上的比较数据。结果是3次试验中的平均验证集错误率(%)。
表2
Figure PCTCN2022123139-appb-000002
表2的结果显示,在DARTS搜索空间上,本申请实施例提出的架构集成搜索方法只需要1/8的搜索成本,就可以找到比于现有的标杆NES-RS所找到的更优的架构集成。
在本实施例的应用场景里,本申请实施例的流程与现有方法比,本申请实施例能用更少的时间和计算量成本搜索到相近表现的神经网络集成模型,既加快了集成模型的搜索效率也提高了架构集成搜索的精度。
相对于现有方法,本申请实施例用同样的时间和成本可以搜索到精度更优的神经网络架构集成。应用于实际生产中,使用本申请实施例可以在更短时间和更少的计算量的情况下搜索到精度更高的卷积神经网络架构集成以应用于图片分类任务。这种有益效果主要是因为本申请实施例采用了架构分布搜索来寻找基础学习器候选池(第一阶段)并使用代理模型和顶点架构的组合来寻找高质量集成模型(第二阶段),大大减少了大量训练和评估单个神经网络架构(单个基础学习器)的需求。
实施例二
实例二通过图片分类应用,检验本申请实施例对OOD数据的鲁棒性。OOD数据在很多现实应用中非常常见,例如自动驾驶,医疗图像诊断等等。本申请实施例在常见的搜索空间和数据集上搜索高性能的神经网络集成模型,其中常见的搜索空间包括DARTS和NAS-Bench-201,数据集包括CIFAR10和CIFAR100;采用被多种噪音进行不同程度扰动的验证样本来评估不同方法所搜索到的架构集成在OOD数据上的测试误差,校准程度以及拒识能力。被多种噪音扰动的验证样本包括CIFAR10-C、CIFAR100-C。
CIFAR10-C和CIFAR100-C是分别是通过向CIFAR10和CIFAR100的验证集图片添加15种扰动/噪音中的随机一种而生成。扰动/噪音的强度(shift severity)由低到高有5级,强度等级越高,表示对图像生成分布的移动越大,即被干扰的图像和原图的分布差异越大。
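For intuition, the sketch below applies one of the 15 corruption types (Gaussian noise) at increasing severity to a stand-in image and shows that higher severity produces a larger shift from the original. The sigma schedule is illustrative and not necessarily the one used to build CIFAR-10-C/CIFAR-100-C.

```python
import numpy as np

def gaussian_noise(image, severity):
    """Add Gaussian noise at one of 5 severity levels to an image in [0, 1]
    (one of the CIFAR-C corruption types; the sigma values here are assumed)."""
    sigmas = [0.04, 0.06, 0.08, 0.09, 0.10]
    sigma = sigmas[severity - 1]
    noisy = image + np.random.default_rng(0).normal(0.0, sigma, image.shape)
    return np.clip(noisy, 0.0, 1.0)

image = np.random.default_rng(1).random((32, 32, 3))    # stand-in for a CIFAR validation image
for s in (1, 3, 5):
    shifted = gaussian_noise(image, s)
    print(s, float(np.abs(shifted - image).mean()))     # larger severity -> larger shift from the original
```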
图7为随机选择添加到CIFAR10和CIFAR100验证集图片上的15种干扰/噪音方式。
图8为随机选择添加到CIFAR10和CIFAR100验证集图片上的干扰/噪音后的效果图;图8显示干扰/噪音的等级越高,生成的被干扰后的图片和原图的(分布)差异越大。
先在NAS-Bench-201的搜索空间上将本申请实施例提出的方法DistriNAS-PM与现有的架构集成搜索方法NES-RS和NES-RE在CIFAR100-C的OOD任务上进行对比。与实施例一场景的比较结果相同,NES-RS和NES-RE的架构集成是经过评估100个神经网络架构后找到的,而本申请实施例提出的DistriNAS-PM只需要评估30个神经网络架构。
图9为在NAS-Bench-201空间上本申请提供的DistriNAS-PM与其他搜索方法的OOD验证比较示意图;图9显示在NAS-Bench-201空间上,相比于NES-RS及NES-RE,本申请实施例DistriNAS-PM用低于1/3的成本就能找到在OOD验证集上误差更低、模型校准更好(NLL更低)的架构集成。
如图9所示,DistriNAS-PM找到的架构集成不仅在原图像(severity=0)上测试误差优于NES-RS并和NES-RE相当。在数据受不同程度(severity=2,4)的干扰的OOD情况下,DistriNAS-PM找到的架构集成甚至在测试误差和模型校准程度上(NLL)都略微优于NES-RS和NES-RE找到的架构集成,证明本申请实施例提出的搜索方法不但能更快地找到好的集成,而且能找到对OOD数据鲁棒性更高的集成。
值得一提的是在图9也展示了不同方法找的各自架构集成中最优基础学习器的性能(亮色模块),可以发现所有的集成模型在OOD数据上所达到的测试误差和NLL都明显低于最优单个神经网络架构/基础学习器,证明了本申请提及的集成模型在数据受干扰时或测试数据分布与训练数据分布不一致时,有更好的拒识能力和鲁棒性。
同样进一步在更大的DARTS搜索空间上进行搜索,在CIFAR10-C数据集上进行验证。表3是在DARTS搜索空间上,本申请实施例提出的方法(DistriNAS-PM)和使用不同搜索成本的NES-RS在CIFAR10-C任务上的比较数据。结果是3次试验中的平均验证集错误率(%),如表3所示。
表3的结果表明在DARTS空间上,本申请实施例提出的方法也能高效地搜索到对OOD数据测试误差更小,模型不确定性校准更好(NLL更低)及鲁棒性更好的架构集成。
表3
Figure PCTCN2022123139-appb-000003
在本实施例2的应用场景中,与现有方法比,本申请实施例提供的搜索神经网络集成模型的方法能用更少的时间和计算量成本搜索到相近表现的神经网络集成模型,既加快了集成模型的搜索效率,也提高了架构集成搜索对OOD数据的鲁棒性,获得更低的验证误差,更高的模型校准和更准确的不确定值。应用于实际生产中,使用本申请实施例可以在更短时间和更少的计算量的情况下搜索到适用于高风险或高不确定性的使用场景的优质架构集成。
本申请实施例提供的一种高效搜索多神经网络集成模型的方法,基于高效并更加适用于集成模型搜索的两阶段搜索框架,使用神经架构分布搜索而非传统NAS来快速找到一个基础学习器的候选池,以避免重复评估类似网络架构获得搜索效率的提升;使用代理模型在候选池中通过预测(而非实际)性能快速选择最优且多样化的顶点模型来组成目标集合模型。
本申请实施例方法与现有方法相比能更高效地进行集成模型的搜索,从而大大降低搜索成本,使得集成模型的搜索在更多的应用场景下的可行性大大增强。原有方法依靠传统NAS,需要对基础学习器逐个搜索和评估来搭建足够大的候选池,而本方法利用架构分布搜索可以快速搭建基础学习器的候选池,简化搜索空间、降低搜索难度。
在从基础学习器候选池搜索搭建最优集成模型时,本申请实施例使用代理模型在候选池中搜索顶点模型,在高效的同时确保顶点模型的多样化、确保集合模型相对于单个基础学习器可以获得较大的性能提升。
本申请实施例可以和各种类型的架构分布搜索相结合,且适用于不同地搜索空间,泛用性强、可应用于不同场景与任务。
本申请实施例除了可以应用在卷积神经网络的架构搜索上,也可潜在应用在其他类型的、有类似架构搜索单元结构的神经架构搜索的任务中,及其他使用集成模型可以带来进一步收益的任务中,如常用于自然语言处理的循环神经网络(recurrent neural network,RNN)架构和常用于自然语言处理和视觉任务的以transformer为代表的深度自注意力变化网络(deep transformer self-attention networks)中,以获得对不确定性较良好的度量。
本申请实施例提供一种搜索神经网络架构集成模型的装置,所述装置包括:数据获取模块,用于获取数据集,所述数据集包括分类任务中的样本和标注;架构分布搜索模块,用于使用神经网络架构分布搜索算法进行搜索,包括:用于确定神经网络架构分布的超参;在所述超参定义的架构分布中采样一个神经网络架构;根据所述分类任务中的样本和标注对所述神经网络架构训练和评估,得到性能指标;根据所述性能指标确定共享所述超参的神经网络架构分布,获得基础学习器的候选池;所述基础学习器为符合所述架构分布要求的神经网络架构;所述神经网络架构由神经网络架构单元重复堆叠而成;确定代理模型;所述代理模型用于预测未评估的神经网络架构的测试性能;架构集成模型组合,通过代理模型预测所述候选池中基础学习器的测试性能,确定符合所述分类任务要求的k个基础学习器组成集成模型,所述集成模型的大小为k。
本申请实施例提供一种电子装置1000,如图10所示,包括处理器1001和存储器1002;所述处理器1001用于执行所述存储器1002所存储的计算机执行指令,所述处理器1001运行所述计算机执行指令执行上述任意实施例所述的基于演化学习的神经 网络结构搜索的方法。
本申请实施例提供一种存储介质,包括可读存储介质和存储在所述可读存储介质中的计算机程序,所述计算机程序用于实现上述任意一实施例所述的基于演化学习的神经网络结构搜索的方法。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请实施例的范围。
此外,本申请实施例的各个方面或特征可以实现成方法、装置或使用标准编程和/或工程技术的制品。本申请中使用的术语“制品”涵盖可从任何计算机可读器件、载体或介质访问的计算机程序。例如,计算机可读介质可以包括,但不限于:磁存储器件(例如,硬盘、软盘或磁带等),光盘(例如,压缩盘(compact disc,CD)、数字通用盘(digital versatile disc,DVD)等),智能卡和闪存器件(例如,可擦写可编程只读存储器(erasable programmable read-only memory,EPROM)、卡、棒或钥匙驱动器等)。另外,本文描述的各种存储介质可代表用于存储信息的一个或多个设备和/或其它机器可读介质。术语“机器可读介质”可包括但不限于,无线信道和能够存储、包含和/或承载指令和/或数据的各种其它介质。
应当理解的是,在本申请的各种实施例中,上述各过程的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定,而不应对本申请实施例的实施过程构成任何限定。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统、装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
所述功能如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请实施例的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者接入网设备等)执行本申请实施例各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读 存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、磁碟或者光盘等各种可以存储程序代码的介质。
以上所述,仅为本发明的具体实施方式,但本发明的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本发明揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本发明的保护范围之内。因此,本发明的保护范围应以所述权利要求的保护范围为准。

Claims (17)

  1. 一种搜索神经网络架构集成模型的方法,其特征在于,所述方法包括:
    获取数据集,所述数据集包括分类任务中的样本和标注;
    使用神经网络架构分布搜索算法进行搜索,包括:确定神经网络架构分布的超参;在所述超参定义的架构分布中采样一个神经网络架构;根据所述分类任务中的样本和标注对所述神经网络架构训练和评估,得到性能指标;根据所述性能指标确定共享所述超参的预测的神经网络架构分布,获得基础学习器的候选池;所述基础学习器为符合所述架构分布要求的神经网络架构;所述神经网络架构由神经网络架构单元重复堆叠而成;确定代理模型;所述代理模型用于预测未评估的神经网络架构的测试性能;
    通过代理模型预测所述候选池中基础学习器的测试性能,确定符合所述分类任务要求的k个基础学习器组成集成模型,所述集成模型的大小为k。
  2. 根据权利要求1所述的搜索神经网络架构集成模型的方法,其特征在于,所述使用神经网络架构分布搜索算法进行搜索,包括:
    使用基于学习算子概率分布的近似神经网络架构搜索(approximate neural architecture search via operation distribution,ANASOD)算法进行神经网络架构分布搜索。
  3. 根据权利要求1或2所述的搜索神经网络架构集成模型的方法,其特征在于,所述确定神经网络架构分布的超参,包括:
    确定神经网络架构分布的超参为ANASOD编码;所述ANASOD编码为指示神经网络架构单元中各种算子的概率分布的向量,所述ANASOD编码和神经网络架构单元的映射是一对多。
  4. 根据权利要求1或2所述的搜索神经网络架构集成模型的方法,其特征在于,所述确定神经网络架构分布的超参,包括:
    采用搜索策略优化神经网络架构分布的超参,所述搜索策略为贝叶斯优化,所述搜索策略用于在下一次迭代中采样到比当前的所述神经网络架构单元的性能指标更符合要求的神经网络单元。
  5. 根据权利要求3所述的搜索神经网络架构集成模型的方法,其特征在于,在所述超参定义的架构分布中采样一个神经网络架构,包括:
    根据所述ANASOD编码定义的算子概率分布,确定所述神经网络架构的组成单元中各个算子的具体数量;
    根据设定的搜索空间连接不同的算子来获得所述有效神经网络架构。
  6. 根据权利要求1或2所述的搜索神经网络架构集成模型的方法,其特征在于,所述对所述神经网络架构单元在所述数据集上训练和评估,得到性能指标,包括:
    在训练数据集上训练所述神经网络架构;
    在验证数据集上评估所述神经网络架构,获得性能指标;所述训练集数据和验证集数据同属于所述数据集。
  7. 根据权利要求1或2所述的搜索神经网络架构集成模型的方法,其特征在于,所述将所述使用神经网络架构分布搜索(distributional NAS)算法进行搜索,还包括:
    根据所述预测的神经网络架构分布的性能指标和超参确定所述神经网络架构分布 的搜索策略。
  8. 根据权利要求1或2所述的搜索神经网络架构集成模型的方法,其特征在于,所述使用神经网络架构分布搜索(distributional NAS)算法进行搜索,还包括:
    根据每次搜索出来的神经网络架构分布的超参和性能指标确定其他未知分布的超参的性能预测值,包括均值和方差;
    根据所述均值和方差确定神经网络架构分布的性能预测策略,所述性能预测策略用于预测神经网络架构分布的性能指标。
  9. 根据权利要求1或2所述的搜索神经网络架构集成模型的方法,其特征在于,所述根据所述性能指标确定共享所述超参的神经网络架构分布,获得基础学习器的候选池,包括:
    根据所述性能指标和所述超参确定所述神经网络架构分布的搜索策略;
    根据所述性能指标和所述神经网络架构单元确定所述神经网络架构分布的性能预测策略;
    根据所述搜索策略和性能预测策略,在共享所述超参的所述神经网络架构分布中搜索,确定基础学习器的候选池。
  10. 根据权利要求1或2所述的搜索神经网络架构集成模型的方法,其特征在于,所述根据所述性能指标确定共享所述超参的神经网络架构分布,获得基础学习器的候选池,包括:
    根据历史搜索中的多个神经网络架构和对应性能指标输出多个共享所述超参的神经网络架构;
    根据所述多个共享所述超参的神经网络架构确定符合要求的神经网络架构分布;
    根据所述符合要求的神经网络架构分布,生成多个神经网络架构单元,获得基础学习器的生成分布/候选池。
  11. 根据权利要求1或2所述的搜索神经网络架构集成模型的方法,其特征在于,所述确定代理模型,包括:
    根据所述神经网络架构单元和所述性能指标通过在所述数据集上训练,获得所述代理模型。
  12. 根据权利要求1或2所述的搜索神经网络架构集成模型的方法,其特征在于,所述通过代理模型预测所述候选池中基础学习器的测试性能,确定符合任务场景要求的k个基础学习器组成集成模型,包括:
    通过代理模型预测所述候选池中多个基础学习器的测试性能;
    根据预测结果进行区域搜索(local search),确定q个预估的顶点架构,所述预估的顶点架构为所述代理模型在验证集上预测的性能指标高于相邻架构的神经网络架构;
    将所述q个预估的顶点架构中性能指标符合要求的k个架构进行组合,得到集成模型。
  13. 根据权利要求12所述的搜索神经网络架构集成模型的方法,其特征在于,将所述q个预估的顶点架构中性能指标符合要求的k个架构进行组合,包括:
    将q个预估的顶点架构的性能指标由优到劣排序,取性能指标位于前面的k个架 构进行组合。
  14. 根据权利要求12所述的搜索神经网络架构集成模型的方法,其特征在于,将所述q个预估的顶点架构中性能指标符合要求的k个架构进行组合,包括:
    使用贪心算法(greedy selection algorithm),遍历q个预估的顶点架构,逐个添加k个架构组成集成模型。
  15. 一种搜索神经网络架构集成模型的装置,其特征在于,所述装置包括:
    数据获取模块,用于获取数据集,所述数据集包括分类任务中的样本和标注;
    架构分布搜索模块,用于使用神经网络架构分布搜索算法进行搜索,包括:用于确定神经网络架构分布的超参;在所述超参定义的架构分布中采样一个神经网络架构;根据所述分类任务中的样本和标注对所述神经网络架构训练和评估,得到性能指标;根据所述性能指标确定共享所述超参的神经网络架构分布,获得基础学习器的候选池;所述基础学习器为符合所述架构分布要求的神经网络架构;所述神经网络架构由神经网络架构单元重复堆叠而成;确定代理模型;所述代理模型用于预测未评估的神经网络架构的测试性能;
    架构集成模型组合模块,用于通过代理模型预测所述候选池中基础学习器的测试性能,确定符合所述分类任务要求的k个基础学习器组成集成模型,所述集成模型的大小为k。
  16. 一种电子装置,其特征在于,包括:处理器和存储器;所述处理器用于执行所述存储器所存储的计算机执行指令,所述处理器运行所述计算机执行指令,执行权利要求1-14所述的基于演化学习的神经网络结构搜索的方法。
  17. 一种存储介质,其特征在于,包括可读存储介质和存储在所述可读存储介质中的计算机程序,所述计算机程序用于实现权利要求1-14所述的基于演化学习的神经网络结构搜索的方法。
PCT/CN2022/123139 2021-11-22 2022-09-30 搜索神经网络集成模型的方法、装置和电子设备 WO2023087953A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111387843.8 2021-11-22
CN202111387843.8A CN116151319A (zh) 2021-11-22 2021-11-22 搜索神经网络集成模型的方法、装置和电子设备

Publications (1)

Publication Number Publication Date
WO2023087953A1 true WO2023087953A1 (zh) 2023-05-25

Family

ID=86356560

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/123139 WO2023087953A1 (zh) 2021-11-22 2022-09-30 搜索神经网络集成模型的方法、装置和电子设备

Country Status (2)

Country Link
CN (1) CN116151319A (zh)
WO (1) WO2023087953A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117152568A (zh) * 2023-11-01 2023-12-01 常熟理工学院 深度集成模型的生成方法、装置和计算机设备
CN117787444A (zh) * 2024-02-27 2024-03-29 西安羚控电子科技有限公司 一种面向集群对抗场景的智能算法快速集成方法及装置
CN117787444B (zh) * 2024-02-27 2024-05-17 西安羚控电子科技有限公司 一种面向集群对抗场景的智能算法快速集成方法及装置

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110232434A (zh) * 2019-04-28 2019-09-13 吉林大学 一种基于属性图优化的神经网络架构评估方法
US20190286984A1 (en) * 2018-03-13 2019-09-19 Google Llc Neural architecture search by proxy
CN110909877A (zh) * 2019-11-29 2020-03-24 百度在线网络技术(北京)有限公司 神经网络模型结构搜索方法、装置、电子设备及存储介质
CN111406267A (zh) * 2017-11-30 2020-07-10 谷歌有限责任公司 使用性能预测神经网络的神经架构搜索
CN111814966A (zh) * 2020-08-24 2020-10-23 国网浙江省电力有限公司 神经网络架构搜索方法、神经网络应用方法、设备及存储介质
CN113298233A (zh) * 2021-05-21 2021-08-24 南京大学 一种基于代理模型的渐进式深度集成架构搜索方法
CN113344174A (zh) * 2021-04-20 2021-09-03 湖南大学 基于概率分布的高效神经网络结构搜索方法

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111406267A (zh) * 2017-11-30 2020-07-10 谷歌有限责任公司 使用性能预测神经网络的神经架构搜索
US20190286984A1 (en) * 2018-03-13 2019-09-19 Google Llc Neural architecture search by proxy
CN110232434A (zh) * 2019-04-28 2019-09-13 吉林大学 一种基于属性图优化的神经网络架构评估方法
CN110909877A (zh) * 2019-11-29 2020-03-24 百度在线网络技术(北京)有限公司 神经网络模型结构搜索方法、装置、电子设备及存储介质
CN111814966A (zh) * 2020-08-24 2020-10-23 国网浙江省电力有限公司 神经网络架构搜索方法、神经网络应用方法、设备及存储介质
CN113344174A (zh) * 2021-04-20 2021-09-03 湖南大学 基于概率分布的高效神经网络结构搜索方法
CN113298233A (zh) * 2021-05-21 2021-08-24 南京大学 一种基于代理模型的渐进式深度集成架构搜索方法

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117152568A (zh) * 2023-11-01 2023-12-01 常熟理工学院 深度集成模型的生成方法、装置和计算机设备
CN117152568B (zh) * 2023-11-01 2024-01-30 常熟理工学院 深度集成模型的生成方法、装置和计算机设备
CN117787444A (zh) * 2024-02-27 2024-03-29 西安羚控电子科技有限公司 一种面向集群对抗场景的智能算法快速集成方法及装置
CN117787444B (zh) * 2024-02-27 2024-05-17 西安羚控电子科技有限公司 一种面向集群对抗场景的智能算法快速集成方法及装置

Also Published As

Publication number Publication date
CN116151319A (zh) 2023-05-23

Similar Documents

Publication Publication Date Title
WO2022083624A1 (zh) 一种模型的获取方法及设备
CN109241317B (zh) 基于深度学习网络中度量损失的行人哈希检索方法
CN110991311B (zh) 一种基于密集连接深度网络的目标检测方法
CN109671102B (zh) 一种基于深度特征融合卷积神经网络的综合式目标跟踪方法
CN108132968A (zh) 网络文本与图像中关联语义基元的弱监督学习方法
CN107391512B (zh) 知识图谱预测的方法和装置
CN110503161B (zh) 一种基于弱监督yolo模型的矿石泥团目标检测方法和系统
CN108536784B (zh) 评论信息情感分析方法、装置、计算机存储介质和服务器
WO2022126448A1 (zh) 一种基于演化学习的神经网络结构搜索方法和系统
CN111008224B (zh) 一种基于深度多任务表示学习的时间序列分类和检索方法
CN112381227B (zh) 神经网络生成方法、装置、电子设备及存储介质
CN110716792A (zh) 一种目标检测器及其构建方法和应用
Lin et al. Hypergraph optimization for multi-structural geometric model fitting
CN113139651A (zh) 基于自监督学习的标签比例学习模型的训练方法和设备
WO2023087953A1 (zh) 搜索神经网络集成模型的方法、装置和电子设备
CN107451617B (zh) 一种图转导半监督分类方法
Wan et al. Confnet: predict with confidence
CN117237733A (zh) 一种结合自监督和弱监督学习的乳腺癌全切片图像分类方法
CN114897085A (zh) 一种基于封闭子图链路预测的聚类方法及计算机设备
WO2022100607A1 (zh) 一种神经网络结构确定方法及其装置
CN113592008B (zh) 小样本图像分类的系统、方法、设备及存储介质
CN108229692B (zh) 一种基于双重对比学习的机器学习识别方法
CN112509017A (zh) 一种基于可学习差分算法的遥感影像变化检测方法
CN114664391A (zh) 一种分子特征确定的方法、相关装置以及设备
CN115812210A (zh) 用于增强机器学习分类任务的性能的方法和设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22894491

Country of ref document: EP

Kind code of ref document: A1