WO2021081809A1 - Network structure search method, apparatus, storage medium and computer program product - Google Patents

Network structure search method, apparatus, storage medium and computer program product Download PDF

Info

Publication number
WO2021081809A1
Authority
WO
WIPO (PCT)
Prior art keywords
training
network structure
jumper
general map
sub
Prior art date
Application number
PCT/CN2019/114361
Other languages
English (en)
French (fr)
Inventor
蒋阳
庞磊
胡湛
Original Assignee
深圳市大疆创新科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市大疆创新科技有限公司 filed Critical 深圳市大疆创新科技有限公司
Priority to CN201980031708.4A priority Critical patent/CN112106077A/zh
Priority to PCT/CN2019/114361 priority patent/WO2021081809A1/zh
Publication of WO2021081809A1 publication Critical patent/WO2021081809A1/zh

Links

Images

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/044 - Recurrent networks, e.g. Hopfield networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent

Definitions

  • This application relates to the field of Artificial Intelligence (AI), and more specifically, to a method, device, storage medium, and computer program product for searching a network structure.
  • Neural network is the foundation of AI. As the performance of neural network continues to improve, its network structure is becoming more and more complex. A neural network can be used normally after training. The training process of the neural network is mainly to adjust the operation in each layer of the neural network and the connection relationship between each layer, so that the neural network can output correct results . Among them, the above connection relationship may also be referred to as skip or shortcut.
  • One method of training neural networks is to train neural networks using Efficient Neural Architecture Search (ENAS).
  • In the process of training a neural network with ENAS, the controller continuously samples network structures of the neural network, tries the influence of different network structures on the output result, and uses the output result of the network structure obtained from the previous sampling to determine the network structure for the next sampling, until the neural network converges.
  • Compared with manually tuning a neural network, training with ENAS can improve training efficiency; however, the training efficiency of ENAS still needs to be improved.
  • the embodiments of the present application provide a method and device for searching a network structure, a computer storage medium, and a computer program product.
  • In a first aspect, a network structure search method is provided, which includes: a general map training step: training a first general map according to a first network structure and training data to generate a second general map; and a network structure training step: determining several test sub-graphs from the second general map according to the first network structure; testing the several test sub-graphs with test data to generate a feedback result; determining a jumper constraint item according to the feedback result; and updating the first network structure according to the feedback result and the jumper constraint item.
  • the above method can be applied to chips, mobile terminals or servers.
  • During the training of a neural network, different training stages have different requirements for jumper density. For example, in the initial training stage of some neural networks, a controller with a higher jumper density is needed so that the search space can be explored as fully as possible and bias in the network structure is avoided; in the later training stages of some neural networks, the randomness of the neural network has been greatly reduced and the whole search space no longer needs to be explored, so a controller with a lower jumper density can be used for training in order to reduce the consumption of resources (including computing resources and time).
  • Since the jumper constraint item in the above method is determined based on the feedback result of the current training stage, the jumper constraint item is related to the current training stage, which makes the current jumper density of the controller better adapted to the current training stage. Good training results can therefore be achieved while resource consumption is reduced and training efficiency is improved, which is especially suitable for mobile devices.
  • a network structure search device is provided, and the device is configured to execute the method in the above-mentioned first aspect.
  • In a third aspect, a network structure search device is provided, which includes a memory and a processor; the memory is used to store instructions, the processor is used to execute the instructions stored in the memory, and execution of the instructions stored in the memory causes the processor to execute the method of the first aspect.
  • In a fourth aspect, a chip is provided, which includes a processing module and a communication interface; the processing module is used to control the communication interface to communicate with the outside, and the processing module is also used to implement the method of the first aspect.
  • In a fifth aspect, a computer-readable storage medium is provided, on which a computer program is stored; when the computer program is executed by a computer, the computer implements the method of the first aspect.
  • In a sixth aspect, a computer program product containing instructions is provided; when the instructions are executed by a computer, the computer implements the method of the first aspect.
  • FIG. 1 is a schematic flowchart of a network structure search method provided by the present application
  • Fig. 2 is a schematic diagram of a general map and a sub-graph provided by the present application.
  • FIG. 3 is a schematic flowchart of another network structure search method provided by the present application.
  • FIG. 4 is a schematic flowchart of another network structure search method provided by this application.
  • Fig. 5 is a schematic diagram of a network structure search device provided by the present application.
  • The terms "first" and "second" are only used to distinguish different objects and cannot be understood as implying other limitations.
  • For example, the "first general map" and the "second general map" denote two different general maps, and no other limitation exists.
  • "A plurality of" means two or more, unless specifically defined otherwise.
  • The term "connection" should be understood in a broad sense: for example, it may be a fixed connection, a detachable connection, or an integral connection; it may be a mechanical connection, an electrical connection, or mutual communication; and it may be a direct connection, an indirect connection through an intermediary, an internal communication between two elements, or an interaction between two elements.
  • network structure search is a technology that uses algorithms to automatically design neural network models.
  • the network structure search is to search out the structure of the neural network model.
  • the neural network model to be searched for the network structure is a convolutional neural network (Convolutional Neural Networks, CNN).
  • the problem to be solved by the network structure search is to determine the operations between nodes in the neural network model. Different combinations of operations between nodes correspond to different network structures. Further, the nodes in the neural network model can be understood as the characteristic layer in the neural network model. The operation between two nodes refers to the operation required to transform the feature data on one node into the feature data on the other node. The operations mentioned in this application may be other neural network operations such as convolution operations, pooling operations, or fully connected operations. It can be considered that the operation between two nodes constitutes the operation layer between these two nodes. Generally, there are multiple searchable operations on the operation layer between two nodes, that is, multiple candidate operations. The purpose of the network structure search is to determine an operation on each operation layer.
  • For example, conv3*3, conv5*5, depthwise3*3, depthwise5*5, maxpool3*3, and averagepool3*3 are defined as the search space; that is, each layer operation of the network structure is sampled from these six choices.
  • As shown in Figure 1, after NAS has established a search space, it usually uses a controller (a neural network) to sample a network structure A from the search space, and then trains a child network with architecture A to determine a feedback quantity such as the accuracy R, where the accuracy can also be referred to as a predicted value; then, the gradient of p is computed and scaled with R to update the controller, that is, R is used as a reward to update the controller.
  • the updated controller is used to sample the search space to obtain a new network structure, and the above steps are performed cyclically until a convergent sub-network is obtained.
  • the controller can be a Recurrent Neural Network (RNN), or a CNN or a Long-Short Term Memory (LSTM) neural network. This application does not limit the specific form of the controller.
  • As shown in Figure 2, the overall graph is composed of the operations represented by the nodes and the jumpers between the operations, where the operations represented by the nodes can be all operations in the search space.
  • In the process of using weight-sharing-based ENAS, the controller can search for a network structure in the search space, for example, determine the operation of each node and the connection relationships between the nodes from the search space, and determine a sub-graph from the overall graph based on the searched final network structure.
  • The operations connected by the bold arrows in Fig. 2 are an example of the final sub-graph, where node 1 is the input node of the final sub-graph, and nodes 3 and 6 are the output nodes of the final sub-graph.
  • After the overall graph has been trained, the parameters of the overall graph can be fixed, and the controller is then trained.
  • For example, the controller can search for a network structure in the search space, obtain a sub-graph from the overall graph based on that network structure, input the test (valid) data into the sub-graph to obtain a predicted value, and update the controller with the predicted value.
  • Because weight-sharing-based ENAS shares the parameters that can be shared each time it searches for a network structure, the efficiency of network structure search is improved. For example, in the example shown in Figure 2, node 1, node 3, and node 6 were searched last time and the searched network structure was trained; if node 1, node 2, node 3, and node 6 are searched this time, then the parameters related to node 1, node 3, and node 6 obtained last time can be applied to the training of the network structure searched this time. In this way, efficiency is improved through weight sharing.
  • ENAS can increase the efficiency of NAS by more than 1000 times. However, the following problem arises in actual use: the predicted value of the sub-graph keeps changing, and as training proceeds the controller's prediction of the sub-graph becomes more and more accurate, that is, the predicted value of the sub-graph gradually increases, while the coefficient of the jumper constraint item in the controller's parameter update formula is fixed; therefore, the constraint force generated by the jumper constraint item keeps decreasing as the sub-graph is trained. The jumper constraint item reflects the jumper density of the controller, so a gradually decreasing constraint force means that the jumper density of the controller gradually increases.
  • this application provides a network structure search method, as shown in FIG. 3, the method includes:
  • Whole graph training step S12: train the first overall graph according to the first network structure and training (train) data to generate a second overall graph;
  • Network structure training step S14: determine several test sub-graphs from the second general map according to the first network structure; test the several test sub-graphs with test data to generate a feedback result; determine a jumper constraint item according to the feedback result; and update the first network structure according to the feedback result and the jumper constraint item.
  • The first network structure can be the network structure of the controller in any training phase; for example, it can be the network structure of a controller that has never been updated, or the network structure of a controller that has already been updated several times.
  • “several” refers to at least one, for example, several times refers to at least once, and several test subgraphs refer to at least one test subgraph.
  • The first overall graph can be a neural network with a preset number of layers.
  • For example, if the preset number of layers is 4 and each layer of the neural network corresponds to a search space containing 6 operations, the first overall graph can be a network structure composed of 24 operations and the connection relationships among those 24 operations, with each layer of the network structure containing 6 operations.
  • the second overall graph may not be a convergent overall graph. However, after training with training data, the randomness of the second overall graph is generally less than the randomness of the first overall graph.
  • When training the first overall graph, the first training sub-graph can be determined from the first overall graph through the first network structure; a batch of the training data is input into the first training sub-graph to generate a first training result; and the first overall graph is trained according to the first training result to generate the second overall graph.
  • For example, the first overall graph may be trained using the first training result and a back propagation (BP) algorithm.
  • The first network structure may use the method shown in FIG. 2 to determine the first training sub-graph from the first overall graph, and the method shown in FIG. 2 may be used to update the first network structure.
  • After the overall graph training step and the network structure training step have been executed several times in a loop, a final overall graph and a final network structure (i.e., the final controller) are generated; a final sub-graph is determined in the final overall graph through the final network structure, and the final sub-graph is a network structure conforming to the preset scene.
  • the first network structure may be an LSTM neural network, and the search space includes, for example, conv3*3, conv5*5, depthwise3*3, depthwise5*5, maxpool3*3, and averagepool3*3.
  • If the preset number of layers of the network structure to be searched is 20, a 20-layer first overall graph needs to be constructed, and each layer of the first overall graph contains all operations of the search space.
  • Each layer of the network structure to be searched corresponds to one time step of the LSTM neural network; without considering jumpers, the LSTM neural network needs to execute 20 time steps.
  • Each time a time step is executed, the cell of the LSTM neural network outputs a hidden state, which can be encoded and mapped to a vector of dimension 6; the vector corresponds to the search space, the 6 dimensions corresponding to the 6 operations of the search space. The vector is then processed by the softmax function into a probability distribution, and the LSTM neural network samples according to this probability distribution to obtain the operation of the current layer of the network structure to be searched. Repeating the above process yields a network structure (i.e., a sub-graph).
  • Figure 4 shows an example of determining the network structure.
  • In Figure 4, the blank rectangles represent the cells of the LSTM neural network, the squares containing "conv3*3" and similar labels represent the operations of the corresponding layers in the network structure to be searched, and the circles represent the connection relationships between the layers.
  • The LSTM neural network can encode the hidden state and map it to a vector of dimension 6; the vector passes through a normalized exponential function (softmax) and becomes a probability distribution, and sampling according to this probability distribution gives the operation of the current layer.
  • For example, in the process of executing the first time step, the input quantity (for example, a random value) fed into the cell of the LSTM neural network is normalized into a vector by the softmax function and then translated into an operation (conv3×3); conv3×3 is used as the input of the cell of the LSTM neural network when the second time step is executed, and the hidden state generated by the first time step is also used as an input when the second time step is executed.
  • The above two inputs are processed to obtain circle 1, and circle 1 indicates that the output of the current operation layer (the operation layer corresponding to node 2) and the output of the first operation layer (the operation layer corresponding to node 1) are concatenated together.
  • Similarly, circle 1 is used as the input of the cell of the LSTM neural network when the third time step is executed, and the hidden state generated by the second time step is also used as an input when the third time step is executed; the two inputs are processed to obtain sep5×5. By analogy, a network structure is finally obtained.
  • a sub-graph is determined from the first general map based on the network structure, and a batch of data in the training data is input into the aforementioned sub-graph to generate training results so as to train the first general map based on the training results. Therefore, the subgraph determined by the first network structure from the first general graph may be referred to as a training subgraph.
  • For example, the controller updates the parameters of the training sub-graph according to the training result and the BP algorithm, completing one iteration; since the training sub-graph belongs to the first general graph, updating the parameters of the training sub-graph is equivalent to updating the parameters of the first general graph. The controller can then determine a training sub-graph again from the first general graph after one iteration, input another batch of training data into that training sub-graph to generate another training result, and use that training result and the BP algorithm to update the training sub-graph again, completing another iteration. After all the training data have been used, the second overall graph is obtained.
  • After the second overall graph is obtained, its parameters can be fixed and the controller is then trained. The first network structure can determine a sub-graph from the second general map according to the method shown in Figure 4, and this sub-graph can be called a test sub-graph; a batch of test data is input into the test sub-graph to obtain a feedback result (for example, a predicted value).
  • The feedback result can be used directly to update the first network structure, or the average of multiple feedback results can be used to update the first network structure, where the multiple test results are obtained by inputting multiple batches of test data into the test sub-graph.
  • In the process of updating the first network structure, the jumper constraint item may be determined according to the feedback result, and the first network structure is then updated according to the jumper constraint item and the feedback result.
  • Since the jumper constraint item is related to the feedback result of the current training stage, the current jumper density determined based on the jumper constraint item is better adapted to the current training stage, so that a network structure of higher credibility can be searched for while training efficiency is improved.
  • the size of the above-mentioned jumper constraint item is positively correlated with the size of the feedback result.
  • In the initial training phase, the controller needs to fully explore the search space to avoid a large bias that would prevent the overall graph from converging in subsequent iterations; therefore, the jumper density of the controller should not be too small, that is, the value of the jumper constraint item should not be too large. After several iterations, the randomness of the overall graph decreases, that is, the probability that some operations may be sampled decreases; in this case, continuing to sample with a controller of high jumper density will result in lower training efficiency and a waste of computing power, so a larger jumper constraint item is needed to update the controller.
  • Since the feedback result (the prediction accuracy of the test sub-graph) generally keeps increasing as training proceeds, making the size of the jumper constraint item positively correlated with the size of the feedback result causes the value of the jumper constraint item to keep increasing as the training stage progresses, so that a balance between performance (that is, the performance of the sub-graph), training efficiency, and computing power is achieved in the network structure search.
  • Optionally, the jumper constraint item includes cos(1-R_k)^n, where R_k is a feedback result and n is a hyperparameter related to the application scenario.
  • The value of n can be a real number greater than 0; for example, n can be 10, 20, 30, 40, 50, 60, 70, or 100, and n can also take a value greater than or equal to 100.
  • Optionally, the jumper constraint item includes the KL divergence between the current jumper density and a preset desired jumper density, and the current jumper density can be obtained based on the feedback result of the test sub-graph.
  • the feedback result includes the prediction accuracy rate of the test data by the test subgraph.
  • Optionally, the processor may update the first network structure according to formula (1), in which a_t is the operation sampled at the t-th time step, P(a_t|a_(t-1):1; θ_c) is the probability of sampling that operation, m is the number of feedback results used for one update of the first network structure, T is the number of layers of the second overall graph, and λ is a hyperparameter, generally set to 0.0001 in classification tasks; different values can be set according to the specific task.
  • The meaning of formula (1) is to maximize R_k while minimizing the KL divergence, that is, to maximize R_k while keeping the current jumper density consistent with the desired jumper density.
  • Compared with the prior art, formula (1) adds a coefficient α, which is positively correlated with R_k.
  • In the initial training phase, the predicted values generated by the test sub-graphs that the controller determines from the overall graph are not accurate enough; therefore, R_k is small, α is also small, and the penalty of the jumper constraint item is small. The updated controller has a large jumper density and can fully explore the search space, avoiding large bias in the initial training phase. As training proceeds, the accuracy of the predicted values generated by the test sub-graphs determined from the overall graph improves, R_k gradually increases, α gradually increases, and the penalty of the jumper constraint item also gradually increases.
  • The updated controller then has a smaller jumper density and no longer fully explores the search space (since the randomness of the overall graph has decreased, the controller does not need to fully explore it), which improves training efficiency; in addition, since the updated controller no longer fully explores the search space, excessive FLOPS are not needed, which reduces computing power consumption.
  • α is an adaptive coefficient that can balance the performance of the network structure (that is, the performance of the sub-graph) against computing power consumption during the network structure search, and it is especially suitable for mobile devices with weak processing capabilities.
  • The jumper constraint items containing α above are only examples; any jumper constraint item that can be adaptively adjusted according to the training stage falls within the protection scope of this application.
  • the network structure search apparatus includes hardware structures and/or software modules corresponding to various functions.
  • the present application can be implemented in the form of hardware or a combination of hardware and computer software. Whether a certain function is executed by hardware or computer software-driven hardware depends on the specific application and design constraint conditions of the technical solution. Professionals and technicians can use different methods for each specific application to implement the described functions, but such implementation should not be considered beyond the scope of this application.
  • This application can divide the network structure search device into functional units according to the above method examples. For example, each function can be divided into each functional unit, or two or more functions can be integrated into one functional unit.
  • the above functional units can be implemented in the form of hardware or software. It should be noted that the division of units in this application is illustrative, and is only a logical function division, and there may be other division methods in actual implementation.
  • Figure 5 shows a schematic structural diagram of a network structure search device provided by the present application.
  • the dotted line in Figure 5 indicates that the unit is an optional unit.
  • the apparatus 500 may be used to implement the methods described in the foregoing method embodiments.
  • the apparatus 500 may be a software module, a chip, a terminal device, or other electronic devices.
  • the device 500 includes one or more processing units 501, and the one or more processing units 501 can support the device 500 to implement the method in the method embodiment corresponding to FIG. 3.
  • the processing unit 501 may be a software processing unit, a general-purpose processor, or a special-purpose processor.
  • the processing unit 501 may be used to control the device 500, execute a software program (for example, a software program for implementing the method described in the first aspect), and process data (for example, a predicted value).
  • the device 500 may further include a communication unit 505 to implement signal input (reception) and output (transmission).
  • the apparatus 500 may be a software module
  • the communication unit 505 may be an interface function of the software module.
  • the software module can run on the processor or control circuit.
  • the device 500 may be a chip, and the communication unit 505 may be an input and/or output circuit of the chip, or the communication unit 505 may be a communication interface of the chip, and the chip may be a component of a terminal device or other electronic equipment.
  • the processing unit 501 may execute:
  • General map training step: train the first general map according to the first network structure and training data to generate a second general map
  • Network structure training step: determine several test sub-graphs from the second general map according to the first network structure; test the several test sub-graphs with test data to generate a feedback result; determine a jumper constraint item based on the feedback result; and update the first network structure according to the feedback result and the jumper constraint item.
  • the size of the jumper constraint item is positively correlated with the size of the feedback result.
  • the jumper constraint item includes cos(1-R_k)^n, where R_k is the feedback result and n is a hyperparameter related to an application scenario.
  • the jumper constraint item includes a KL divergence between the current jumper density and a preset desired jumper density.
  • the current jumper density is obtained based on the several test sub-graphs.
  • the feedback result includes the prediction accuracy of the several test sub-graphs on the test data.
  • the processing unit 501 is specifically configured to: determine a first training sub-graph in the first general map through the first network structure; input a batch of data from the training data into the first training sub-graph to generate a first training result; and train the first general map according to the first training result to generate the second general map.
  • the processing unit 501 is further configured to: generate a final general map and a final network structure after the general map training step and the network structure training step have been executed several times in a loop; and determine a final sub-graph in the final general map through the final network structure, the final sub-graph being a network structure conforming to a preset scene.
  • the processing unit 501 may be a central processing unit (CPU), a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), or another programmable logic device, for example, a discrete gate, a transistor logic device, or a discrete hardware component.
  • the device 500 may include one or more storage units 502, in which a program 504 is stored (for example, a software program containing the method described in the second aspect); the program 504 can be run by the processing unit 501 to generate instructions 503, so that the processing unit 501 executes the method described in the foregoing method embodiments according to the instructions 503.
  • the storage unit 502 may also store data (for example, predicted value and jumper density).
  • the processing unit 501 may also read data stored in the storage unit 502, and the data may be stored at the same storage address as the program 504, or the data may be stored at a different storage address from the program 504.
  • the processing unit 501 and the storage unit 502 may be provided separately or integrated together, for example, integrated on a single board or a system-on-chip (SOC).
  • the present application also provides a computer program product, which, when executed by the processing unit 501, implements the method described in any of the embodiments of the present application.
  • the computer program product may be stored in the storage unit 502, for example, a program 504.
  • the program 504 is finally converted into an executable object file that can be executed by the processing unit 501 through processing processes such as preprocessing, compilation, assembly, and linking.
  • the computer program product can be transmitted from one computer-readable storage medium to another computer-readable storage medium.
  • For example, it can be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (such as infrared, radio, or microwave).
  • This application also provides a computer-readable storage medium (for example, the storage unit 502) on which a computer program is stored, and when the computer program is executed by a computer, the method described in any embodiment of the present application is implemented.
  • the computer program can be a high-level language program or an executable target program.
  • the computer-readable storage medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a digital video disc (DVD)), a semiconductor medium (for example, a solid state disk (SSD)), or the like.
  • the computer-readable storage medium may be volatile memory or non-volatile memory, or the computer-readable storage medium may include both volatile memory and non-volatile memory.
  • the non-volatile memory can be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory.
  • the volatile memory may be random access memory (RAM), which is used as an external cache.
  • By way of example and not limitation, many forms of RAM are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchlink dynamic random access memory (SLDRAM), and direct rambus random access memory (DR RAM).
  • It should be understood that, in the embodiments of this application, the size of the sequence numbers of the processes does not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of this application.
  • the system, device, and method disclosed in the embodiments provided in this application may be implemented in other ways. For example, some features of the method embodiments described above may be ignored or not implemented.
  • the device embodiments described above are merely illustrative.
  • the division of units is only a logical function division. In actual implementation, there may be other division methods, and multiple units or components may be combined or integrated into another system.
  • the coupling between the units or the coupling between the components may be direct coupling or indirect coupling, and the foregoing coupling includes electrical, mechanical, or other forms of connection.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

This application provides a network structure search method, including: a whole-graph training step: training a first whole graph according to a first network structure and training data to generate a second whole graph; and a network structure training step: determining several test sub-graphs from the second whole graph according to the first network structure; testing the several test sub-graphs with test data to generate a feedback result; determining a jumper constraint item according to the feedback result; and updating the first network structure according to the feedback result and the jumper constraint item. During the training of a neural network, different training stages have different requirements for jumper density. The jumper constraint item in the above method is determined based on the feedback result of the current training stage, so the jumper density in the above method is related to the current training stage; this makes the current jumper density of the controller better adapted to the current training stage, so that good training results can be obtained while training efficiency is improved.

Description

Network structure search method, apparatus, storage medium and computer program product
Copyright Notice
The disclosure of this patent document contains material that is subject to copyright protection. The copyright belongs to the copyright owner. The copyright owner has no objection to the reproduction by anyone of this patent document or this patent disclosure as it appears in the official records and files of the Patent and Trademark Office.
Technical Field
This application relates to the field of Artificial Intelligence (AI), and more specifically, to a network structure search method, apparatus, storage medium and computer program product.
Background
Neural networks are the foundation of AI. As the performance of neural networks keeps improving, their network structures are becoming more and more complex. A neural network may need to be trained before it can be used normally; the training process mainly adjusts the operations within each layer of the neural network and the connection relationships between layers so that the neural network can output correct results. The above connection relationships may also be referred to as skips or shortcuts.
One method of training a neural network is to use Efficient Neural Architecture Search (ENAS). In the process of training a neural network with ENAS, the controller continuously samples network structures of the neural network, tries the influence of different network structures on the output result, and uses the output result of the network structure obtained from the previous sampling to determine the network structure for the next sampling, until the neural network converges.
Compared with manually tuning a neural network, training with ENAS can improve the training efficiency of the neural network; however, the training efficiency of training a neural network with ENAS still needs to be improved.
Summary
Embodiments of this application provide a network structure search method and apparatus, a computer storage medium, and a computer program product.
In a first aspect, a network structure search method is provided, including: a whole-graph training step: training a first whole graph according to a first network structure and training data to generate a second whole graph; and a network structure training step: determining several test sub-graphs from the second whole graph according to the first network structure; testing the several test sub-graphs with test data to generate a feedback result; determining a jumper constraint item according to the feedback result; and updating the first network structure according to the feedback result and the jumper constraint item.
The above method can be applied to a chip, a mobile terminal, or a server. During the training of a neural network, different training stages have different requirements for jumper density. For example, in the initial training stage of some neural networks, a controller with a higher jumper density is needed so that the search space can be explored as fully as possible and bias in the network structure is avoided; in the later training stages of some neural networks, the randomness of the neural network has been greatly reduced and the whole search space no longer needs to be explored, so a controller with a lower jumper density can be used for training in order to reduce the consumption of resources (including computing resources and time). Since the jumper constraint item in the above method is determined based on the feedback result of the current training stage, the jumper constraint item is related to the current training stage, which makes the current jumper density of the controller better adapted to the current training stage. Good training results can therefore be obtained while resource consumption is reduced and training efficiency is improved, which is especially suitable for mobile devices.
In a second aspect, a network structure search apparatus is provided, and the apparatus is configured to execute the method of the first aspect.
In a third aspect, a network structure search apparatus is provided, which includes a memory and a processor; the memory is used to store instructions, the processor is used to execute the instructions stored in the memory, and execution of the instructions stored in the memory causes the processor to execute the method of the first aspect.
In a fourth aspect, a chip is provided, which includes a processing module and a communication interface; the processing module is used to control the communication interface to communicate with the outside, and the processing module is further used to implement the method of the first aspect.
In a fifth aspect, a computer-readable storage medium is provided, on which a computer program is stored; when the computer program is executed by a computer, the computer implements the method of the first aspect.
In a sixth aspect, a computer program product containing instructions is provided; when the instructions are executed by a computer, the computer implements the method of the first aspect.
Brief Description of the Drawings
Fig. 1 is a schematic flowchart of a network structure search method provided by this application;
Fig. 2 is a schematic diagram of a whole graph and a sub-graph provided by this application;
Fig. 3 is a schematic flowchart of another network structure search method provided by this application;
Fig. 4 is a schematic flowchart of yet another network structure search method provided by this application;
Fig. 5 is a schematic diagram of a network structure search apparatus provided by this application.
Detailed Description
The technical solutions in the embodiments of this application are described below with reference to the accompanying drawings.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field to which this application belongs. The terms used in the specification of this application are only for the purpose of describing specific embodiments and are not intended to limit this application.
In the description of this application, it should be understood that the terms "first" and "second" are only used to distinguish different objects and cannot be understood as implying other limitations. For example, the "first whole graph" and the "second whole graph" denote two different whole graphs, and no other limitation exists. In addition, in the description of this application, "a plurality of" means two or more, unless specifically defined otherwise.
In the description of this application, it should be noted that, unless otherwise expressly specified and limited, the term "connection" should be understood in a broad sense; for example, it may be a fixed connection, a detachable connection, or an integral connection; it may be a mechanical connection, an electrical connection, or mutual communication; it may be a direct connection or an indirect connection through an intermediary; and it may be an internal communication between two elements or an interaction between two elements. Those of ordinary skill in the art can understand the specific meaning of the above term in this application according to the specific situation.
The following disclosure provides many different implementations or examples for realizing the solutions of this application. To simplify the disclosure, specific components or steps are described in the examples below. Of course, the examples below are not intended to limit this application. In addition, this application may reuse reference numerals and/or reference letters in different examples; such repetition is for the purpose of simplicity and clarity and does not in itself indicate a relationship between the various implementations and/or configurations discussed.
The implementations of this application are described in detail below. The implementations described below are exemplary and are only used to explain this application; they should not be construed as limiting this application.
In recent years, machine learning algorithms, especially deep learning algorithms, have developed rapidly and been widely applied. As model performance keeps improving, model structures are becoming more and more complex. In non-automated machine learning algorithms, these structures need to be designed and tuned manually by machine learning experts, which is a very tedious process. Moreover, as application scenarios and model structures become more and more complex, it becomes increasingly difficult to obtain the optimal model for an application scenario. In this context, automated machine learning (AutoML) has received wide attention from academia and industry, especially Neural Architecture Search (NAS).
Specifically, network structure search is a technology that uses algorithms to automatically design neural network models. Network structure search aims to search out the structure of a neural network model. In the implementations of this application, the neural network model whose network structure is to be searched is a convolutional neural network (CNN).
The problem to be solved by network structure search is to determine the operations between nodes in the neural network model. Different combinations of operations between nodes correspond to different network structures. Further, a node in the neural network model can be understood as a feature layer in the neural network model. An operation between two nodes refers to the operation required to transform the feature data on one of the nodes into the feature data on the other node. The operations mentioned in this application may be convolution operations, pooling operations, fully connected operations, or other neural network operations. The operation between two nodes can be regarded as constituting the operation layer between the two nodes. Generally, the operation layer between two nodes has multiple operations available for searching, that is, multiple candidate operations. The purpose of network structure search is to determine one operation for each operation layer.
For example, conv3*3, conv5*5, depthwise3*3, depthwise5*5, maxpool3*3, averagepool3*3, etc. are defined as the search space. That is to say, the operation of each layer of the network structure is sampled from these six choices.
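For illustration only, the following minimal Python sketch shows such a search space as a data structure; the six operation names come from the example above, while the layer count and the helper sample_architecture are hypothetical and are not part of the claimed method.

```python
import random

# Illustrative only: the six operation names come from the example above;
# NUM_LAYERS and sample_architecture are hypothetical stand-ins.
SEARCH_SPACE = ["conv3*3", "conv5*5", "depthwise3*3",
                "depthwise5*5", "maxpool3*3", "averagepool3*3"]
NUM_LAYERS = 4  # preset number of layers of the structure to be searched

def sample_architecture(rng: random.Random) -> list:
    """A candidate network structure: one operation chosen per layer."""
    return [rng.choice(SEARCH_SPACE) for _ in range(NUM_LAYERS)]

print(sample_architecture(random.Random(0)))
```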
As shown in Fig. 1, after establishing a search space, NAS usually uses a controller (a neural network) to sample a network structure A from the search space, and then trains a child network with architecture A to determine a feedback quantity such as an accuracy R, where the accuracy can also be referred to as a predicted value; then, the gradient of p is computed and scaled with R to update the controller, that is, R is used as a reward to update the controller.
Then, the updated controller is used to sample a new network structure from the search space, and the above steps are executed in a loop until a converged sub-network is obtained.
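For readers unfamiliar with this loop, the toy sketch below imitates it with a deliberately simplified controller (independent per-layer logits instead of an RNN) and a stand-in reward function; none of the names, numbers, or the reward defined here come from this application.

```python
import math, random

OPS = ["conv3*3", "conv5*5", "depthwise3*3",
       "depthwise5*5", "maxpool3*3", "averagepool3*3"]
NUM_LAYERS, LR = 4, 0.1
rng = random.Random(0)

# Toy controller: independent logits per layer (a real controller would be an RNN/LSTM).
logits = [[0.0] * len(OPS) for _ in range(NUM_LAYERS)]

def softmax(xs):
    m = max(xs); e = [math.exp(x - m) for x in xs]
    s = sum(e); return [v / s for v in e]

def sample():
    return [rng.choices(range(len(OPS)), weights=softmax(layer))[0] for layer in logits]

def toy_reward(arch):
    # Stand-in for "train a child network and measure accuracy R";
    # it simply prefers conv3*3 so the loop has something to learn.
    return sum(1.0 for a in arch if a == 0) / len(arch)

for _ in range(200):
    arch = sample()
    R = toy_reward(arch)                         # reward / predicted value
    for t, a in enumerate(arch):                 # REINFORCE: gradient of log P(a_t) scaled by R
        p = softmax(logits[t])
        for i in range(len(OPS)):
            grad = (1.0 if i == a else 0.0) - p[i]
            logits[t][i] += LR * R * grad

print([OPS[i] for i in sample()])                # architecture favoured after training
```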
In the example of Fig. 1, the controller may be a recurrent neural network (RNN), a CNN, or a long short-term memory (LSTM) neural network. This application does not limit the specific form of the controller.
However, training a sub-network structure to convergence is time-consuming. Related technologies have therefore produced several methods for improving the efficiency of NAS, for example, efficient architecture search by network transformation and ENAS via parameter sharing (weight sharing). Among them, weight-sharing-based ENAS is the more widely used.
As shown in Fig. 2, the whole graph is composed of the operations represented by the nodes and the jumpers between the operations, where the operations represented by the nodes may be all operations of the search space. When using weight-sharing-based ENAS, the controller can search for a network structure in the search space, for example, determine the operation of each node and the connection relationships between nodes from the search space, and determine a sub-graph from the whole graph based on the final searched network structure. The operations connected by the bold arrows in Fig. 2 are an example of the final sub-graph, where node 1 is the input node of the final sub-graph and nodes 3 and 6 are its output nodes.
When using weight-sharing-based ENAS, each time a network structure is sampled and a sub-graph is determined from the whole graph based on that network structure, the sub-graph is no longer trained directly to convergence; instead, it is trained with one mini-batch of data. For example, a back propagation (BP) algorithm can be used to update the parameters of the sub-graph, completing one iteration. Since the sub-graph belongs to the whole graph, updating the parameters of the sub-graph is equivalent to updating the parameters of the whole graph. After many iterations, the whole graph can eventually converge. Note that convergence of the whole graph is not equivalent to convergence of a sub-graph.
After the whole graph has been trained, the parameters of the whole graph can be fixed and the controller is then trained. For example, the controller can search for a network structure in the search space, obtain a sub-graph from the whole graph based on that network structure, input the test (valid) data into the sub-graph to obtain a predicted value, and update the controller with the predicted value.
Because weight-sharing-based ENAS shares the parameters that can be shared each time it searches for a network structure, the efficiency of network structure search is improved. For example, in the example shown in Fig. 2, node 1, node 3, and node 6 were searched the previous time and the searched network structure was trained; if node 1, node 2, node 3, and node 6 are searched this time, then the parameters related to node 1, node 3, and node 6 obtained last time can be applied to the training of the network structure searched this time. In this way, efficiency is improved through weight sharing.
ENAS can improve the efficiency of NAS by a factor of more than 1000. However, the following problem appears in actual use: the predicted value of the sub-graph keeps changing, and as training proceeds the controller's prediction of the sub-graph becomes more and more accurate, that is, the predicted value of the sub-graph gradually increases; meanwhile, the coefficient of the jumper constraint item in the controller's parameter update formula is fixed, so the constraint force produced by the jumper constraint item keeps decreasing as the sub-graph is trained. The jumper constraint item reflects the jumper density of the controller, and a gradually decreasing constraint force means that the jumper density of the controller gradually increases; too many jumpers increase the controller's floating-point operations per second (FLOPS), which lowers the update efficiency of the controller and in turn affects the efficiency of determining the final sub-graph. In addition, if the initial value of the jumper constraint item is set to a large value, then, because the predicted value of the sub-graph is small in the initial stage of whole-graph training, the constraint force of the jumper constraint item becomes too strong, so that the search space cannot be fully explored when the controller is updated, leading to a large bias in the controller's network structure.
On this basis, this application provides a network structure search method. As shown in Fig. 3, the method includes:
a whole graph training step S12: training a first whole graph according to a first network structure and training (train) data to generate a second whole graph;
a network structure training step S14: determining several test sub-graphs from the second whole graph according to the first network structure; testing the several test sub-graphs with test data to generate a feedback result; determining a jumper constraint item according to the feedback result; and updating the first network structure according to the feedback result and the jumper constraint item.
The first network structure may be the network structure of the controller in any training stage; for example, it may be the network structure of a controller that has never been updated, or the network structure of a controller that has already been updated several times.
In this application, "several" means at least one; for example, "several times" means at least once, and "several test sub-graphs" means at least one test sub-graph.
The first whole graph may be a neural network with a preset number of layers. For example, if the preset number of layers is 4 and each layer of the neural network corresponds to a search space containing 6 operations, the first whole graph may be a network structure composed of 24 operations and the connection relationships among those 24 operations, with each layer of the network structure containing 6 operations.
The second whole graph may not be a converged whole graph; however, after training with the training data, the randomness of the second whole graph is generally smaller than that of the first whole graph.
When training the first whole graph, the first network structure can determine a first training sub-graph in the first whole graph; a batch of the training data is input into the first training sub-graph to generate a first training result; and the first whole graph is trained according to the first training result to generate the second whole graph. For example, the first whole graph may be trained using the first training result and a back propagation (BP) algorithm.
The first network structure may use the method shown in Fig. 2 to determine the first training sub-graph in the first whole graph, and the first network structure may be updated using the method shown in Fig. 2. After the whole graph training step and the network structure training step have been executed several times in a loop, a final whole graph and a final network structure (that is, the final controller) are generated; a final sub-graph is determined in the final whole graph through the final network structure, and the final sub-graph is a network structure conforming to a preset scene.
The process of updating the first network structure is described in detail below.
The first network structure may be an LSTM neural network, and the search space includes, for example, conv3*3, conv5*5, depthwise3*3, depthwise5*5, maxpool3*3, and averagepool3*3.
If the preset number of layers of the network structure to be searched is 20, a 20-layer first whole graph needs to be constructed, and each layer of the first whole graph contains all operations of the search space. Each layer of the network structure to be searched corresponds to one time step of the LSTM neural network; without considering jumpers, the LSTM neural network needs to execute 20 time steps. Each time a time step is executed, the cell of the LSTM neural network outputs a hidden state; an encoding operation can be applied to the hidden state to map it to a vector of dimension 6, which corresponds to the search space, the 6 dimensions corresponding to the 6 operations of the search space. The vector is then processed by the softmax function into a probability distribution, and the LSTM neural network samples according to this probability distribution to obtain the operation of the current layer of the network structure to be searched. Repeating the above process yields a network structure (that is, a sub-graph).
Fig. 4 shows an example of determining a network structure.
The blank rectangles represent the cells of the LSTM neural network, the squares containing "conv3*3" and similar labels represent the operations of the corresponding layers in the network structure to be searched, and the circles represent the connection relationships between layers.
The LSTM neural network can apply an encoding operation to the hidden state and map it to a vector of dimension 6; the vector passes through a normalized exponential function (softmax) and becomes a probability distribution, and sampling according to this probability distribution gives the operation of the current layer.
For example, in the process of executing the first time step, the input quantity (for example, a random value) fed into the cell of the LSTM neural network is normalized into a vector by the softmax function and then translated into an operation (conv3×3); conv3×3 serves as the input of the cell of the LSTM neural network when the second time step is executed, and the hidden state generated by the first time step also serves as an input when the second time step is executed. The above two inputs are processed to obtain circle 1, and circle 1 indicates that the output of the current operation layer (the operation layer corresponding to node 2) and the output of the first operation layer (the operation layer corresponding to node 1) are concatenated together.
Similarly, circle 1 serves as the input of the cell of the LSTM neural network when the third time step is executed, the hidden state generated by the second time step also serves as an input when the third time step is executed, and the two inputs are processed to obtain sep5×5. A network structure is finally obtained by analogy.
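As an illustration only, the sketch below imitates this per-time-step sampling with a hand-rolled recurrent cell in place of a real LSTM; the fixed random "encoding" matrix, the hidden size, and the independent Bernoulli skip decisions are simplifying assumptions, not the controller described above.

```python
import math, random

OPS = ["conv3*3", "conv5*5", "depthwise3*3",
       "depthwise5*5", "maxpool3*3", "averagepool3*3"]
NUM_LAYERS, HIDDEN = 4, 8
rng = random.Random(0)
W_enc = [[rng.gauss(0, 0.5) for _ in range(HIDDEN)] for _ in OPS]   # hidden -> 6 logits
W_emb = [[rng.gauss(0, 0.5) for _ in range(HIDDEN)] for _ in OPS]   # op -> next input

def softmax(xs):
    m = max(xs); e = [math.exp(x - m) for x in xs]
    return [v / sum(e) for v in e]

def sample_subgraph():
    hidden = [0.0] * HIDDEN
    ops, skips = [], []
    for t in range(NUM_LAYERS):
        # "encode" the hidden state into a 6-dimensional vector, softmax, then sample
        logits = [sum(w * h for w, h in zip(row, hidden)) for row in W_enc]
        op = rng.choices(range(len(OPS)), weights=softmax(logits))[0]
        ops.append(OPS[op])
        # one binary skip decision towards each earlier layer (toy assumption)
        skips.append([rng.random() < 0.5 for _ in range(t)])
        # fold the chosen operation back into the hidden state (toy recurrence)
        hidden = [math.tanh(h + e) for h, e in zip(hidden, W_emb[op])]
    return ops, skips

print(sample_subgraph())
```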
Subsequently, a sub-graph is determined from the first whole graph based on that network structure, and a batch of data from the training data is input into the sub-graph to generate a training result, so that the first whole graph can be trained based on the training result. The sub-graph determined by the first network structure from the first whole graph may therefore be called a training sub-graph.
For example, the controller updates the parameters of the training sub-graph according to the training result and the BP algorithm, completing one iteration; since the training sub-graph belongs to the first whole graph, updating the parameters of the training sub-graph is equivalent to updating the parameters of the first whole graph. The controller can then determine a training sub-graph again from the first whole graph after one iteration, input another batch of training data into that training sub-graph to generate another training result, and use that training result and the BP algorithm to update the training sub-graph again, completing another iteration. After all the training data have been used, the second whole graph is obtained.
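The weight-sharing idea of this step can be illustrated roughly as follows; the dictionary of per-(layer, operation) parameters and the toy "training" update are hypothetical stand-ins for the shared whole-graph parameters and one batch of back propagation.

```python
import random

OPS = ["conv3*3", "conv5*5", "depthwise3*3",
       "depthwise5*5", "maxpool3*3", "averagepool3*3"]
NUM_LAYERS = 4
rng = random.Random(0)

# The whole graph holds one parameter per (layer, operation); sub-graphs share them.
whole_graph = {(layer, op): 0.0 for layer in range(NUM_LAYERS) for op in OPS}

def sample_training_subgraph():
    return [rng.choice(OPS) for _ in range(NUM_LAYERS)]

def train_on_batch(subgraph, batch):
    # stand-in for BP on one batch: only the sampled operations are touched,
    # and the updates remain available to every later sub-graph
    for layer, op in enumerate(subgraph):
        whole_graph[(layer, op)] += 0.01 * len(batch)

training_data = [[rng.random() for _ in range(8)] for _ in range(10)]  # 10 toy batches
for batch in training_data:         # one pass over the data: first -> second whole graph
    subgraph = sample_training_subgraph()
    train_on_batch(subgraph, batch)

print(sum(1 for v in whole_graph.values() if v != 0.0), "of", len(whole_graph), "ops updated")
```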
After the second whole graph is obtained, its parameters can be fixed and the controller is then trained. The first network structure can determine a sub-graph from the second whole graph according to the method shown in Fig. 4; this sub-graph may be called a test sub-graph. A batch of test data is input into the test sub-graph to obtain a feedback result (for example, a predicted value). The feedback result can be used directly to update the first network structure, or the average of multiple feedback results can be used to update the first network structure, where the multiple test results are obtained by inputting multiple batches of test data into the test sub-graph.
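A minimal sketch of this feedback step, assuming a toy evaluate_batch stand-in for running the test sub-graph on a batch of test data, is:

```python
import random

rng = random.Random(0)

def evaluate_batch(test_subgraph, batch):
    # stand-in: a real implementation would run the sub-graph on the batch
    return 0.7 + 0.1 * rng.random()

def feedback_result(test_subgraph, test_batches):
    accs = [evaluate_batch(test_subgraph, b) for b in test_batches]
    return sum(accs) / len(accs)          # averaged predicted value R_k

test_batches = [object()] * 5
print(round(feedback_result(["conv3*3"] * 4, test_batches), 3))
```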
In the process of updating the first network structure, the jumper constraint item can be determined according to the feedback result, and the first network structure is then updated according to the jumper constraint item and the feedback result.
Since the jumper constraint item is related to the feedback result of the current training stage, the current jumper density determined based on the jumper constraint item is better adapted to the current training stage, so that a network structure of higher credibility can be searched for while training efficiency is improved.
Optionally, the size of the above jumper constraint item is positively correlated with the size of the feedback result.
In this application, "positively correlated" means that the value of one parameter increases as the value of the other parameter increases, or that the value of one parameter decreases as the value of the other parameter decreases.
In the initial training stage, the controller needs to fully explore the search space to avoid a large bias that would prevent the whole graph from converging in subsequent iterations; therefore, the jumper density of the controller should not be too small, that is, the value of the jumper constraint item should not be too large. After several iterations, the randomness of the whole graph decreases, that is, the probability that some operations may be sampled decreases. In this case, continuing to sample with a controller of high jumper density would lower training efficiency and waste computing power; a larger jumper constraint item is therefore needed to update the controller.
Since the feedback result (the prediction accuracy of the test sub-graph) generally keeps increasing as the training stage progresses, making the size of the jumper constraint item positively correlated with the size of the feedback result causes the value of the jumper constraint item to keep increasing as the training stage progresses, thereby achieving a balance between performance (that is, the performance of the sub-graph), training efficiency, and computing power in the network structure search.
Optionally, the jumper constraint item includes cos(1-R_k)^n, where R_k is the feedback result and n is a hyperparameter related to the application scenario. The value of n can be a real number greater than 0; for example, n can be 10, 20, 30, 40, 50, 60, 70, or 100; optionally, n can also take a value greater than or equal to 100.
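The adaptive behaviour of this term can be seen numerically in the sketch below; the choice n=50 is only an example value within the range mentioned above, not a value prescribed by this application.

```python
import math

def alpha(r_k: float, n: float = 50) -> float:
    """Adaptive coefficient assumed from the text: cos(1 - R_k) ** n."""
    return math.cos(1.0 - r_k) ** n

# alpha, and hence the penalty of the jumper constraint item, grows as R_k increases
for r in (0.1, 0.5, 0.9, 0.99):
    print(f"R_k={r:.2f} -> alpha={alpha(r):.4f}")
```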
Optionally, the jumper constraint item includes the KL divergence (Kullback-Leibler divergence) between the current jumper density and a preset desired jumper density; for example, the jumper constraint item is
Figure PCTCN2019114361-appb-000001
where α=cos(1-R_k)^n, λ is a hyperparameter, θ_c is a parameter of the first network structure, q is the preset desired jumper density, and p is the current jumper density. The current jumper density can be obtained based on the feedback result of the test sub-graph. The feedback result includes the prediction accuracy of the test sub-graph on the test data.
Optionally, the processor may update the first network structure according to formula (1).
Figure PCTCN2019114361-appb-000002
where a_t is the operation sampled at the t-th time step, P(a_t|a_(t-1):1; θ_c) is the probability of sampling that operation, m is the number of feedback results used for one update of the first network structure, T is the number of layers of the second whole graph, and λ is a hyperparameter, generally set to 0.0001 in classification tasks; different values can be set according to the specific task.
The meaning of formula (1) is: maximize R_k while minimizing the KL divergence; that is, maximize R_k while keeping the current jumper density consistent with the desired jumper density.
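The exact expression of formula (1) is reproduced as an image above; the sketch below is therefore only one plausible reading assembled from the surrounding description (a REINFORCE-style term that maximizes R_k, minus a KL-divergence penalty between the desired and current jumper densities scaled by λ and α), not a verbatim implementation of the formula.

```python
import math

def kl_bernoulli(q: float, p: float) -> float:
    """KL(q || p) for two Bernoulli densities (jumper present / absent)."""
    eps = 1e-8
    p = min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def controller_objective(log_probs, rewards, current_density,
                         desired_density=0.5, lam=1e-4, n=50):
    """log_probs: per sampled sub-graph, the list of log P(a_t | a_(t-1):1; theta_c).
    desired_density=0.5 and n=50 are illustrative assumptions."""
    m = len(rewards)
    reward_term = sum(r * sum(lp) for lp, r in zip(log_probs, rewards)) / m
    r_mean = sum(rewards) / m
    alpha = math.cos(1.0 - r_mean) ** n          # adaptive coefficient
    penalty = alpha * lam * kl_bernoulli(desired_density, current_density)
    return reward_term - penalty                 # maximise R_k while keeping p close to q

# toy usage: two sampled sub-graphs, four time steps each
print(controller_objective(log_probs=[[-1.2, -0.8, -1.5, -0.9]] * 2,
                           rewards=[0.82, 0.86],
                           current_density=0.62))
```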
In the prior art, since R_k gradually increases as the whole graph converges and the penalty produced by λ remains unchanged, q can usually only be set to between 0.4 and 0.6 when constraining jumpers in the actual iteration process. Here, q is calculated as (the number of all current jumpers)/(the number of all connectable jumpers) and takes a value between 0 and 1; in the initial state, the jumpers are randomly connected and the jumper density is 0.5.
Compared with the prior art, formula (1) adds a coefficient α, which is positively correlated with R_k. In the initial training stage, the predicted values generated by the test sub-graphs that the controller determines from the whole graph are not accurate enough; therefore, R_k is small, α is also small, and the penalty of the jumper constraint item is small, so the updated controller has a relatively large jumper density, can fully explore the search space, and avoids large bias in the initial training stage. As training proceeds, the accuracy of the predicted values generated by the test sub-graphs that the controller determines from the whole graph improves, R_k gradually increases, α also gradually increases, and the penalty of the jumper constraint item gradually increases as well; the updated controller then has a smaller jumper density and no longer fully explores the search space (since the randomness of the whole graph has decreased, the controller no longer needs to fully explore it), which improves training efficiency. In addition, since the updated controller no longer fully explores the search space, excessive FLOPS are not needed, which reduces computing power consumption. It can be seen from the above that α is an adaptive coefficient that can balance the performance of the network structure (that is, the performance of the sub-graph) against computing power consumption during the network structure search, and it is especially suitable for mobile devices with limited processing capability.
It should be noted that the jumper constraint items containing α above are only examples; any jumper constraint item that can be adaptively adjusted according to the training stage falls within the protection scope of this application.
Examples of the network structure search method provided by this application have been described in detail above. It can be understood that, in order to implement the above functions, the network structure search apparatus includes hardware structures and/or software modules corresponding to the respective functions. Those skilled in the art will readily appreciate that, in combination with the units and algorithm steps of the examples described in the embodiments disclosed herein, this application can be implemented in the form of hardware or a combination of hardware and computer software. Whether a function is executed by hardware or by computer software driving hardware depends on the specific application and design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each specific application, but such implementations should not be considered to go beyond the scope of this application.
This application may divide the network structure search apparatus into functional units according to the above method examples; for example, each function may be divided into a separate functional unit, or two or more functions may be integrated into one functional unit. The above functional units may be implemented in the form of hardware or in the form of software. It should be noted that the division of units in this application is illustrative and is only a division by logical function; other division manners are possible in actual implementation.
Fig. 5 shows a schematic structural diagram of a network structure search apparatus provided by this application. The dotted line in Fig. 5 indicates that the unit is optional. The apparatus 500 can be used to implement the methods described in the foregoing method embodiments. The apparatus 500 may be a software module, a chip, a terminal device, or another electronic device.
The apparatus 500 includes one or more processing units 501, and the one or more processing units 501 can support the apparatus 500 in implementing the method in the method embodiment corresponding to Fig. 3. The processing unit 501 may be a software processing unit, a general-purpose processor, or a special-purpose processor. The processing unit 501 can be used to control the apparatus 500, execute software programs (for example, a software program for implementing the method described in the first aspect), and process data (for example, predicted values). The apparatus 500 may further include a communication unit 505 to implement the input (reception) and output (transmission) of signals.
For example, the apparatus 500 may be a software module, and the communication unit 505 may be an interface function of the software module. The software module can run on a processor or a control circuit.
For another example, the apparatus 500 may be a chip, and the communication unit 505 may be an input and/or output circuit of the chip, or the communication unit 505 may be a communication interface of the chip, and the chip may serve as a component of a terminal device or another electronic device.
In the apparatus 500, the processing unit 501 may execute:
a whole graph training step: training a first whole graph according to a first network structure and training data to generate a second whole graph;
a network structure training step: determining several test sub-graphs from the second whole graph according to the first network structure; testing the several test sub-graphs with test data to generate a feedback result; determining a jumper constraint item according to the feedback result; and updating the first network structure according to the feedback result and the jumper constraint item.
Optionally, the size of the jumper constraint item is positively correlated with the size of the feedback result.
Optionally, the jumper constraint item includes cos(1-R_k)^n, where R_k is the feedback result and n is a hyperparameter related to the application scenario.
Optionally, 0<n≤100.
Optionally, the jumper constraint item includes the KL divergence between the current jumper density and a preset desired jumper density.
Optionally, the jumper constraint item is
Figure PCTCN2019114361-appb-000003
where α=cos(1-R_k)^n, λ is a hyperparameter, θ_c is a parameter of the first network structure, q is the preset desired jumper density, and p is the current jumper density.
Optionally, the current jumper density is obtained based on the several test sub-graphs.
Optionally, the feedback result includes the prediction accuracy of the several test sub-graphs on the test data.
Optionally, the processing unit 501 is specifically configured to: determine a first training sub-graph in the first whole graph through the first network structure; input a batch of data from the training data into the first training sub-graph to generate a first training result; and train the first whole graph according to the first training result to generate the second whole graph.
Optionally, the processing unit 501 is further configured to: after the whole graph training step and the network structure training step have been executed several times in a loop, generate a final whole graph and a final network structure; and determine a final sub-graph in the final whole graph through the final network structure, the final sub-graph being a network structure conforming to a preset scene.
Those skilled in the art can clearly understand that, for convenience and brevity of description, reference may be made to the relevant descriptions in the method embodiments corresponding to Fig. 1 to Fig. 4 for the specific working processes and effects of the above apparatus and units; for brevity, details are not repeated here.
As an optional implementation, the above steps can be completed by logic circuits in the form of hardware or by instructions in the form of software. For example, the processing unit 501 may be a central processing unit (CPU), a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), or another programmable logic device, for example, a discrete gate, a transistor logic device, or a discrete hardware component.
The apparatus 500 may include one or more storage units 502 in which a program 504 is stored (for example, a software program containing the method described in the second aspect); the program 504 can be run by the processing unit 501 to generate instructions 503, so that the processing unit 501 executes the method described in the foregoing method embodiments according to the instructions 503. Optionally, the storage unit 502 may also store data (for example, predicted values and jumper densities). Optionally, the processing unit 501 may also read data stored in the storage unit 502; the data may be stored at the same storage address as the program 504 or at a different storage address from the program 504.
The processing unit 501 and the storage unit 502 may be provided separately or integrated together, for example, integrated on a single board or a system on chip (SOC).
This application also provides a computer program product which, when executed by the processing unit 501, implements the method described in any embodiment of this application.
The computer program product may be stored in the storage unit 502, for example, as the program 504, and the program 504 is finally converted, through processing such as preprocessing, compilation, assembly, and linking, into an executable object file that can be executed by the processing unit 501.
The computer program product can be transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, it can be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (such as infrared, radio, or microwave).
This application also provides a computer-readable storage medium (for example, the storage unit 502) on which a computer program is stored; when the computer program is executed by a computer, the method described in any embodiment of this application is implemented. The computer program may be a high-level language program or an executable object program.
The computer-readable storage medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a digital video disc (DVD)), a semiconductor medium (for example, a solid state disk (SSD)), or the like. For example, the computer-readable storage medium may be a volatile memory or a non-volatile memory, or the computer-readable storage medium may include both a volatile memory and a non-volatile memory. The non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which is used as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM).
It should be understood that, in the embodiments of this application, the sequence numbers of the processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of this application.
The term "and/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate the following three cases: A exists alone, both A and B exist, and B exists alone. In addition, the character "/" herein generally indicates an "or" relationship between the associated objects.
The systems, apparatuses, and methods disclosed in the embodiments provided in this application may be implemented in other ways. For example, some features of the method embodiments described above may be ignored or not performed. The apparatus embodiments described above are merely illustrative; the division of units is only a division by logical function, and other division manners are possible in actual implementation; multiple units or components may be combined or integrated into another system. In addition, the coupling between units or between components may be direct coupling or indirect coupling, and the above coupling includes electrical, mechanical, or other forms of connection.
In summary, the above are only some embodiments of this application and are not intended to limit the protection scope of this application. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of this application shall be included in the protection scope of this application.

Claims (23)

  1. A network structure search method, characterized by comprising:
    a whole graph training step: training a first whole graph according to a first network structure and training data to generate a second whole graph;
    a network structure training step: determining several test sub-graphs from the second whole graph according to the first network structure; testing the several test sub-graphs with test data to generate a feedback result; determining a jumper constraint item according to the feedback result; and updating the first network structure according to the feedback result and the jumper constraint item.
  2. The method according to claim 1, characterized in that the size of the jumper constraint item is positively correlated with the size of the feedback result.
  3. The method according to claim 2, characterized in that the jumper constraint item includes cos(1-R_k)^n, where R_k is the feedback result and n is a hyperparameter related to the application scenario.
  4. The method according to claim 3, characterized in that 0<n≤100.
  5. The method according to any one of claims 1 to 4, characterized in that the jumper constraint item includes the KL divergence between the current jumper density and a preset desired jumper density.
  6. The method according to claim 5, characterized in that the jumper constraint item is
    Figure PCTCN2019114361-appb-100001
    where α=cos(1-R_k)^n, λ is a hyperparameter, θ_c is a parameter of the first network structure, q is the preset desired jumper density, and p is the current jumper density.
  7. The method according to claim 5 or 6, characterized in that the current jumper density is obtained based on the several test sub-graphs.
  8. The method according to any one of claims 1 to 7, characterized in that the feedback result includes the prediction accuracy of the several test sub-graphs on the test data.
  9. The method according to any one of claims 1 to 8, characterized in that training the first whole graph according to the first network structure and the training data to generate the second whole graph comprises:
    determining a first training sub-graph in the first whole graph through the first network structure;
    inputting a batch of the training data into the first training sub-graph to generate a first training result;
    training the first whole graph according to the first training result to generate the second whole graph.
  10. The method according to any one of claims 1 to 9, characterized by further comprising:
    after the whole graph training step and the network structure training step have been executed several times in a loop, generating a final whole graph and a final network structure;
    determining a final sub-graph in the final whole graph through the final network structure, the final sub-graph being a network structure conforming to a preset scene.
  11. A network structure search apparatus, characterized by comprising a processing unit configured to execute:
    a whole graph training step: training a first whole graph according to a first network structure and training data to generate a second whole graph;
    a network structure training step: determining several test sub-graphs from the second whole graph according to the first network structure; testing the several test sub-graphs with test data to generate a feedback result; determining a jumper constraint item according to the feedback result; and updating the first network structure according to the feedback result and the jumper constraint item.
  12. The apparatus according to claim 11, characterized in that the size of the jumper constraint item is positively correlated with the size of the feedback result.
  13. The apparatus according to claim 12, characterized in that the jumper constraint item includes cos(1-R_k)^n, where R_k is the feedback result and n is a hyperparameter related to the application scenario.
  14. The apparatus according to claim 13, characterized in that 0<n≤100.
  15. The apparatus according to any one of claims 11 to 14, characterized in that the jumper constraint item includes the KL divergence between the current jumper density and a preset desired jumper density.
  16. The apparatus according to claim 15, characterized in that the jumper constraint item is
    Figure PCTCN2019114361-appb-100002
    where α=cos(1-R_k)^n, λ is a hyperparameter, θ_c is a parameter of the first network structure, q is the preset desired jumper density, and p is the current jumper density.
  17. The apparatus according to claim 15 or 16, characterized in that the current jumper density is obtained based on the several test sub-graphs.
  18. The apparatus according to any one of claims 11 to 17, characterized in that the feedback result includes the prediction accuracy of the several test sub-graphs on the test data.
  19. The apparatus according to any one of claims 11 to 18, characterized in that the processing unit is specifically configured to:
    determine a first training sub-graph in the first whole graph through the first network structure;
    input a batch of the training data into the first training sub-graph to generate a first training result;
    train the first whole graph according to the first training result to generate the second whole graph.
  20. The apparatus according to any one of claims 11 to 19, characterized in that the processing unit is further configured to:
    after the whole graph training step and the network structure training step have been executed several times in a loop, generate a final whole graph and a final network structure;
    determine a final sub-graph in the final whole graph through the final network structure, the final sub-graph being a network structure conforming to a preset scene.
  21. A network structure search device, characterized by comprising: a memory and a processor, wherein the memory is configured to store instructions, the processor is configured to execute the instructions stored in the memory, and execution of the instructions stored in the memory causes the processor to execute the method according to any one of claims 1 to 10.
  22. A computer storage medium, characterized in that a computer program is stored thereon, and when the computer program is executed by a computer, the computer executes the method according to any one of claims 1 to 10.
  23. A computer program product containing instructions, characterized in that, when the instructions are executed by a computer, the computer executes the method according to any one of claims 1 to 10.
PCT/CN2019/114361 2019-10-30 2019-10-30 网络结构搜索的方法、装置、存储介质和计算机程序产品 WO2021081809A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201980031708.4A CN112106077A (zh) 2019-10-30 2019-10-30 网络结构搜索的方法、装置、存储介质和计算机程序产品
PCT/CN2019/114361 WO2021081809A1 (zh) 2019-10-30 2019-10-30 网络结构搜索的方法、装置、存储介质和计算机程序产品

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/114361 WO2021081809A1 (zh) 2019-10-30 2019-10-30 网络结构搜索的方法、装置、存储介质和计算机程序产品

Publications (1)

Publication Number Publication Date
WO2021081809A1 true WO2021081809A1 (zh) 2021-05-06

Family

ID=73750057

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/114361 WO2021081809A1 (zh) 2019-10-30 2019-10-30 网络结构搜索的方法、装置、存储介质和计算机程序产品

Country Status (2)

Country Link
CN (1) CN112106077A (zh)
WO (1) WO2021081809A1 (zh)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190026639A1 (en) * 2017-07-21 2019-01-24 Google Llc Neural architecture search for convolutional neural networks
WO2019084560A1 (en) * 2017-10-27 2019-05-02 Google Llc SEARCH FOR NEURONAL ARCHITECTURES
CN109934336A (zh) * 2019-03-08 2019-06-25 江南大学 基于最优结构搜索的神经网络动态加速平台设计方法及神经网络动态加速平台
CN110009048A (zh) * 2019-04-10 2019-07-12 苏州浪潮智能科技有限公司 一种神经网络模型的构建方法以及设备


Also Published As

Publication number Publication date
CN112106077A (zh) 2020-12-18

Similar Documents

Publication Publication Date Title
US12099927B2 (en) Asynchronous neural network training
WO2017181866A1 (en) Making graph pattern queries bounded in big graphs
WO2021254114A1 (zh) 构建多任务学习模型的方法、装置、电子设备及存储介质
US20180060301A1 (en) End-to-end learning of dialogue agents for information access
WO2021244354A1 (zh) 神经网络模型的训练方法和相关产品
CN111406264A (zh) 神经架构搜索
WO2016064576A1 (en) Tagging personal photos with deep networks
US20220012307A1 (en) Information processing device, information processing system, information processing method, and storage medium
US20190138929A1 (en) System and method for automatic building of learning machines using learning machines
JP2020060922A (ja) ハイパーパラメータチューニング方法、装置及びプログラム
WO2020237689A1 (zh) 网络结构搜索的方法及装置、计算机存储介质和计算机程序产品
KR102511225B1 (ko) 인공지능 추론모델을 경량화하는 방법 및 시스템
KR102499517B1 (ko) 최적 파라미터 결정 방법 및 시스템
WO2022252694A1 (zh) 神经网络优化方法及其装置
US20200226458A1 (en) Optimizing artificial neural network computations based on automatic determination of a batch size
CN117009539A (zh) 知识图谱的实体对齐方法、装置、设备及存储介质
TWI758223B (zh) 具有動態最小批次尺寸之運算方法,以及用於執行該方法之運算系統及電腦可讀儲存媒體
CN117669700A (zh) 深度学习模型训练方法和深度学习模型训练系统
WO2021081809A1 (zh) 网络结构搜索的方法、装置、存储介质和计算机程序产品
EP4414901A1 (en) Model weight acquisition method and related system
WO2021146977A1 (zh) 网络结构搜索方法和装置
CN116978450A (zh) 蛋白质数据的处理方法、装置、电子设备及存储介质
CN114722490A (zh) 一种基于混合增点与区间缩减的代理模型全局优化方法
WO2020237687A1 (zh) 网络结构搜索的方法及装置、计算机存储介质和计算机程序产品
Seward et al. First order generative adversarial networks

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19950715

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19950715

Country of ref document: EP

Kind code of ref document: A1