JP2023552048A

JP2023552048A - Neural architecture scaling for hardware acceleration

Info

Publication number: JP2023552048A
Application number: JP2023524743A
Authority: JP
Inventors: Li, Andrew; Li, Sheng; Tan, Mingxing; Pang, Ruoming; Cheng, Liqun; Le, Quoc V.; Jouppi, Norman Paul
Original assignee: Google LLC
Current assignee: Google LLC
Priority date: 2021-01-15
Filing date: 2021-07-29
Publication date: 2023-12-14
Also published as: TW202230221A; CN116261734A; WO2022154829A1; EP4217928A1

Abstract

Methods, systems, and apparatus, including computer-readable media, for scaling neural network architectures on hardware accelerators. A method includes receiving training data and information specifying target computing resources, and performing, using the training data, a neural architecture search over a search space to identify an architecture for a base neural network. A plurality of scaling parameter values for scaling the base neural network can be identified. This identification can include repeatedly selecting a plurality of candidate scaling parameter values and determining a performance metric for the base neural network scaled according to the candidate scaling parameter values, where the performance metric is determined according to a plurality of second objectives that include a latency objective. An architecture for a scaled neural network can be determined using the architecture of the base neural network scaled according to the plurality of scaling parameter values.

Description

Cross-Reference to Related Applications
This application is a continuation of U.S. Patent Application No. 17/175,029, filed February 12, 2021, which claims the benefit under 35 U.S.C. § 119(e) of U.S. Patent Application No. 63/137,926, filed January 15, 2021, the disclosures of which are incorporated herein by reference.

Background
Neural networks are machine learning models that include one or more layers of nonlinear operations to predict an output for a received input. In addition to an input layer and an output layer, some neural networks include one or more hidden layers. The output of each hidden layer can be input to another hidden layer or to the output layer of the neural network. Each layer of the neural network can generate a respective output from a received input according to the values of one or more model parameters for that layer. The model parameters can be weights or biases, determined by a training algorithm, that cause the neural network to generate accurate output.

Summary
A system implemented according to aspects of the present disclosure can reduce the latency of neural network architectures by searching candidate neural network architectures according to each candidate's computational requirements (for example, FLOPS), operational intensity, and execution efficiency. As described herein, computational requirements, operational intensity, and execution efficiency together have been found to be the root cause of the latency of a neural network on target computing resources, including latency during inference, as opposed to computational requirements alone determining latency. Based on this observed relationship between latency and computation, operational intensity, and execution efficiency, aspects of the disclosure provide techniques for performing neural architecture search and scaling, both through latency-aware compound scaling and by expanding the space from which candidate neural networks are searched.

In addition, the system can perform compound scaling to uniformly scale multiple parameters of a neural network according to multiple objectives. This can improve the performance of the scaled neural network over approaches in which a single objective is considered or in which the scaling parameters of the neural network are searched separately. Latency-aware compound scaling can be used to quickly build a family of neural network architectures, each scaled from an initial scaled neural network architecture according to different values and each potentially suited to a different use case.

According to aspects of the disclosure, a computer-implemented method includes a method for determining an architecture for a neural network. The method includes: receiving, by one or more processors, training data corresponding to a neural network task and information specifying target computing resources; performing, by the one or more processors and according to a plurality of first objectives, a neural architecture search over a search space using the training data to identify an architecture for a base neural network; and identifying, by the one or more processors, a plurality of scaling parameter values for scaling the base neural network according to the information specifying the target computing resources and a plurality of scaling parameters of the base neural network. The identifying includes repeatedly selecting a plurality of candidate scaling parameter values and determining a performance metric for the base neural network scaled according to the candidate scaling parameter values, where the performance metric is determined according to a plurality of second objectives that include a latency objective. The method can further include generating, by the one or more processors, an architecture for a scaled neural network using the architecture of the base neural network scaled according to the plurality of scaling parameter values.

The foregoing and other implementations can each optionally include one or more of the following features, alone or in combination.

The plurality of first objectives for performing the neural architecture search can be the same as the plurality of second objectives for identifying the plurality of scaling parameter values.

The plurality of first objectives and the plurality of second objectives can include an accuracy objective corresponding to the accuracy of the output of the base neural network.

The performance metric can correspond at least in part to a measure of the latency between the base neural network receiving an input and generating an output, when the base neural network is scaled according to the plurality of candidate scaling parameter values and deployed on the target computing resources.

The latency objective can correspond to a minimum latency between the base neural network receiving an input and generating an output when the base neural network is deployed on the target computing resources.

The search space can include candidate neural network layers, each candidate neural network layer configured to perform one or more operations. The search space can include candidate neural network layers that include different activation functions.

The architecture of the base neural network can include a plurality of candidate components, each component having a plurality of neural network layers. The search space can include a plurality of candidate components of candidate neural network layers, including a first component of candidate network layers that include a first activation function and a second component of candidate network layers that include a second activation function different from the first activation function.

The information specifying the target computing resources can specify one or more hardware accelerators, and the method can further include executing the scaled neural network on the one or more hardware accelerators to perform the neural network task.

The target computing resources can include first target computing resources, and the plurality of scaling parameter values can be a plurality of first scaling parameter values. The method can further include receiving, by the one or more processors, information specifying second target computing resources different from the first target computing resources, and identifying a plurality of second scaling parameter values for scaling the base neural network according to the information specifying the second target computing resources, where the plurality of second scaling parameter values differ from the plurality of first scaling parameter values.

The plurality of scaling parameter values can be a plurality of first scaling parameter values, and the method can further include generating a scaled neural network architecture from the base neural network architecture scaled using a plurality of second scaling parameter values, where the second scaling parameter values are generated according to the plurality of first scaling parameter values and one or more compound coefficients that uniformly modify the value of each of the first scaling parameter values.

The base neural network can be a convolutional neural network, and the plurality of scaling parameters can include one or more of a depth of the base neural network, a width of the base neural network, and a resolution of the input to the base neural network.

According to another aspect, a method for determining an architecture for a neural network includes: receiving, by one or more processors, information specifying target computing resources; receiving, by the one or more processors, data specifying an architecture for a base neural network; and identifying, by the one or more processors, a plurality of scaling parameter values for scaling the base neural network according to the information specifying the target computing resources and a plurality of scaling parameters of the base neural network. The identifying includes repeatedly selecting a plurality of candidate scaling parameter values and determining a performance metric for the base neural network scaled according to the candidate scaling parameter values, where the performance metric is determined according to a plurality of objectives that include a latency objective. The method further includes generating, by the one or more processors, an architecture for a scaled neural network using the architecture of the base neural network scaled according to the plurality of scaling parameter values.

The plurality of objectives can be a plurality of second objectives, and receiving the data specifying the architecture for the base neural network can include: receiving, by the one or more processors, training data corresponding to a neural network task; and performing, by the one or more processors and according to a plurality of first objectives, a neural architecture search over a search space using the training data to identify the architecture for the base neural network.

Other implementations include computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the operations of the methods.

The drawings show the following, in order:
A block diagram of a family of scaled neural network architectures for deployment in a data center housing hardware accelerators on which the deployed neural networks operate.
A flow diagram of an example process for generating scaled neural network architectures for execution on target computing resources.
An example process for latency-aware compound scaling of a base neural network architecture.
A block diagram of a NAS-LACS (neural architecture search, latency-aware compound scaling) system according to aspects of the disclosure.
A block diagram of an example environment for implementing the NAS-LACS system.

Detailed Description
Overview
The technology described herein generally relates to scaling neural networks for execution on different target computing resources, such as different hardware accelerators. A neural network can be scaled according to multiple different performance objectives. These objectives can include the separate objectives of minimizing processing time, referred to herein as latency, and of maximizing the accuracy of the neural network when it is scaled for execution on the target computing resources.

In general, a neural architecture search (NAS) system can be deployed to select a neural network architecture from a given search space of candidate architectures according to one or more objectives. One common objective is the accuracy of the neural network. In general, a system implementing NAS techniques favors networks that are more accurate after training over networks that are less accurate. After a base neural network is selected following NAS, the base neural network can be scaled according to one or more scaling parameters. Scaling can include searching for one or more scaling parameter values for scaling the base neural network, for example by searching for the scaling parameters in a coefficient search space made up of numbers. Scaling can include increasing or decreasing the number of layers in the neural network, or the size of each layer, to make effective use of the computational and/or memory resources available for deploying the neural network.

A commonly held belief about neural architecture search and scaling is that the computational requirements of a network, measured for example in FLOPS (floating-point operations per second), needed to process an input through the neural network are proportional to the latency between sending an input to the network and receiving an output. That is, a neural network with low computational requirements (low FLOPS) is believed to generate output faster than a network with high computational requirements (high FLOPS), because fewer operations are performed overall. Many NAS systems therefore select neural networks with low computational requirements. However, it has become clear that the relationship between computational requirements and latency is not proportional, because other characteristics of a neural network, such as operational intensity, parallelism, and execution efficiency, can affect its overall latency.
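The point that operation count alone does not determine latency can be illustrated with a roofline-style estimate. The sketch below is not taken from this disclosure: the function name, the accelerator peak figures, and the two example networks are hypothetical, and the roofline bound is a simplification that ignores parallelism and scheduling effects.

```python
def estimated_latency_s(op_count, bytes_moved, peak_ops_per_s,
                        peak_bytes_per_s, execution_efficiency=1.0):
    """Roofline-style bound: the slower of compute time and memory time.

    All arguments are illustrative assumptions, not values from the patent.
    """
    if not 0.0 < execution_efficiency <= 1.0:
        raise ValueError("execution_efficiency must be in (0, 1]")
    compute_time = op_count / (peak_ops_per_s * execution_efficiency)
    memory_time = bytes_moved / peak_bytes_per_s
    return max(compute_time, memory_time)


# Hypothetical accelerator: 100e12 operations/s peak, 1e12 bytes/s bandwidth.
PEAK_OPS = 100e12
PEAK_BYTES = 1e12

# Network A: fewer operations, low operational intensity (10 ops per byte).
latency_a = estimated_latency_s(2e9, 2e8, PEAK_OPS, PEAK_BYTES)
# Network B: twice the operations, high operational intensity (200 ops per byte).
latency_b = estimated_latency_s(4e9, 2e7, PEAK_OPS, PEAK_BYTES)

print(latency_a, latency_b)
```

Under these made-up numbers, network B performs twice as many operations as network A yet has the lower estimated latency, because network A is limited by memory traffic rather than by compute.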

The technology described herein provides for latency-aware compound scaling (LACS) and for expanding the search space of candidate neural networks from which a neural network is selected. To expand the search space, operations and architectures can be added to it. These operations and architectures are "hardware accelerator-friendly" in that adding them can yield the higher operational intensity, execution efficiency, and parallelism suited to deployment on various types of hardware accelerators. Such operations can include space-to-depth operations, space-to-batch operations, fused convolution structures, and activation functions that are searched per component.
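As a concrete illustration of one of the operations named above, the following is a minimal pure-Python sketch of a space-to-depth rearrangement on a height × width × channels nested list. The function name and the channel ordering (block rows first, then block columns) are assumptions made for this example; the disclosure does not specify a layout.

```python
def space_to_depth(image, block):
    """Move each block x block spatial patch into the channel dimension.

    `image` is a nested list indexed as image[row][col][channel].
    The output has shape (H / block, W / block, block * block * C).
    """
    height, width = len(image), len(image[0])
    if height % block or width % block:
        raise ValueError("height and width must be divisible by block")
    output = []
    for i in range(0, height, block):
        out_row = []
        for j in range(0, width, block):
            channels = []
            for di in range(block):
                for dj in range(block):
                    channels.extend(image[i + di][j + dj])
            out_row.append(channels)
        output.append(out_row)
    return output


# A 2 x 2 single-channel image becomes a 1 x 1 image with 4 channels.
tiny = [[[1], [2]],
        [[3], [4]]]
print(space_to_depth(tiny, 2))  # [[[1, 2, 3, 4]]]
```

The rearrangement keeps every input value while shrinking the spatial dimensions and growing the channel dimension, which is one way such an operation can present denser work to a hardware accelerator.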

Latency-aware compound scaling of a neural network can improve scaling of the neural network over conventional approaches that do not optimize for latency. Instead, LACS can be used to identify scaling parameter values for a scaled neural network that is both accurate and operates with low latency on the target computing resources.

The technology further provides for multi-objective scaling that shares its objectives with the neural architecture search performed using NAS or a similar technique. The scaled neural network can be identified according to the same objectives that were used when searching for the base neural network. As a result, rather than treating the base architecture search stage and the scaling stage as tasks with separate objectives, the scaled neural network can be optimized for performance across both stages.

LACS can be integrated with existing NAS systems, at least because the same objectives can be used for both searching and scaling, which makes it possible to build an end-to-end system for determining a scaled neural network architecture. Furthermore, a family of scaled neural network architectures can be identified faster than with searching-without-scaling approaches, in which a neural network architecture is searched for but is not scaled for deployment on the target computing resources.

The technology described herein can provide neural networks that improve on those identified by conventional search-and-scaling approaches that do not use LACS. In addition, families of neural networks with different trade-offs between objectives, such as model accuracy and inference latency, can be generated quickly for application to a variety of use cases. The technology can also identify a neural network for performing a particular task faster, while the identified neural network can perform with better accuracy than neural networks identified using other approaches. This is at least because the neural networks identified by the searching and scaling described herein account for characteristics that can affect latency, such as operational intensity and execution efficiency, rather than only the computational requirements of the network. In this way, the identified neural network can operate faster at inference time without sacrificing accuracy.

The technology can further provide a generally applicable framework for quickly migrating existing neural networks to improved computing resource environments. For example, the LACS and NAS described herein can be applied when migrating the execution of an existing neural network, selected for a data center with particular hardware, to a data center that uses different hardware. In this regard, a family of neural networks for performing the task of the existing neural network can be quickly identified and deployed on the new data center hardware. This application can be particularly useful in fields of rapid deployment that require state-of-the-art hardware to run efficiently, such as networks that perform computer vision or other image processing tasks.

FIG. 1 is a block diagram of a family 103 of scaled neural network architectures 104A-104N for deployment in a data center 115 housing hardware accelerators 116. The deployed neural networks operate on the hardware accelerators 116. A hardware accelerator 116 can be any type of processor, such as a CPU, GPU, or FPGA, or an ASIC such as a TPU. According to aspects of the disclosure, the family 103 of scaled neural network architectures can be generated from a base neural network architecture 101.

The architecture of a neural network refers to the characteristics that define the neural network. For example, the architecture can include the characteristics of the different neural network layers that make up a network, how those layers process input, and how the layers interact with one another. For example, the architecture of a convolutional neural network (ConvNet) can specify a discrete convolutional layer that receives input image data, followed by a pooling layer, followed by a fully connected layer that generates output according to a neural network task, such as classifying the content of the input image data. The architecture of a neural network can also specify the types of operations performed within each layer. For example, the architecture of a ConvNet may specify that ReLU activation functions are used in the fully connected layers of the network.

NAS can be used to identify the base neural network architecture 101, according to a set of objectives, from a search space of candidate neural network architectures. As described in more detail herein, the search space of candidate neural network architectures can be expanded to include different network components, different operations, and different layers from which a base network satisfying the objectives can be identified.

The set of objectives used to identify the base neural network architecture 101 can also be applied to identify scaling parameter values for each of the neural networks 104A-104N in the family 103. The base neural network architecture 101 and the scaled neural network architectures 104A-104N can be characterized by a number of parameters, which are scaled to varying degrees in the scaled neural network architectures 104A-104N. In FIG. 1, the neural networks 101 and 104A are shown with three scaling parameters: D, the number of layers in the neural network; W, the width of a neural network layer or the number of neurons in the layer; and R, the size of the input processed by the neural network at a given layer.
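To make the roles of D, W, and R concrete, the sketch below applies three scaling values to a toy description of a base network. The `BaseArchitecture` structure, the rounding rules (ceiling for layer counts, channel widths rounded to a multiple of 8), and the example numbers are all assumptions chosen for illustration; the disclosure does not prescribe them.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class BaseArchitecture:
    """Toy description: a list of (num_layers, channels) stages plus input size."""
    stages: tuple
    resolution: int


def round_channels(channels, multiple=8):
    """Round to the nearest multiple, never below one multiple (assumed rule)."""
    return max(multiple, int(channels + multiple / 2) // multiple * multiple)


def scale_architecture(base, depth, width, resolution):
    """Scale layer counts by `depth`, channels by `width`, input size by `resolution`."""
    stages = tuple(
        (math.ceil(depth * layers), round_channels(width * channels))
        for layers, channels in base.stages
    )
    return BaseArchitecture(stages, int(round(resolution * base.resolution)))


base = BaseArchitecture(stages=((2, 32), (3, 64)), resolution=224)
scaled = scale_architecture(base, depth=1.5, width=1.25, resolution=1.1)
print(scaled)
```

Rounding channel counts to a hardware-friendly multiple is a common practical choice, but it is only one option and is not stated in the text above.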

As described in more detail herein, a system configured to perform LACS can search a coefficient search space 108 to identify multiple sets of scaling parameter values. Each scaling parameter value is a coefficient in the coefficient search space, which can be, for example, the set of positive real numbers. Each candidate network 107A-107N is scaled from the base neural network 101 according to candidate coefficient values identified as part of the search of the coefficient search space 108. The system can apply any of a variety of search techniques for identifying candidate coefficients, such as Pareto frontier search or grid search. For each candidate network 107A-107N, the system can evaluate a performance metric for that candidate network in performing the neural network task. The performance metric can be based on multiple objectives, including a latency objective that measures the latency of the candidate network between receiving an input and generating a corresponding output as part of performing the neural network task.
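The search over the coefficient space can be sketched as a small grid search. Everything below is a hedged illustration: `toy_evaluate` is a hypothetical stand-in for training a scaled candidate and measuring it on the target hardware, and the scoring function (accuracy multiplied by a power of the latency-to-target ratio) is one commonly used soft-constraint form that is assumed here, not quoted from the passage above.

```python
import itertools


def toy_evaluate(depth, width, resolution):
    """Hypothetical proxy for accuracy and latency of a scaled candidate."""
    cost = depth * width ** 2 * resolution ** 2
    accuracy = 1.0 - 0.3 / cost ** 0.5   # diminishing returns with size
    latency_ms = 5.0 * cost              # larger candidates run slower
    return accuracy, latency_ms


def performance(accuracy, latency_ms, target_ms, weight=-0.07):
    """Assumed multi-objective score: reward accuracy, penalise latency."""
    return accuracy * (latency_ms / target_ms) ** weight


def grid_search(grid, target_ms):
    """Try every (depth, width, resolution) triple and keep the best score."""
    best_score, best_coeffs = None, None
    for coeffs in itertools.product(grid, repeat=3):
        accuracy, latency_ms = toy_evaluate(*coeffs)
        score = performance(accuracy, latency_ms, target_ms)
        if best_score is None or score > best_score:
            best_score, best_coeffs = score, coeffs
    return best_score, best_coeffs


grid = [1.0, 1.1, 1.2, 1.3]
score, coeffs = grid_search(grid, target_ms=10.0)
print(coeffs, round(score, 4))
```

In a real system the evaluation step dominates the cost, which is why the choice of search technique and the size of the coefficient grid matter.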

After the search for scaling parameter values over the coefficient search space 108 has been performed, the system can receive a scaled neural network architecture 109. The scaled neural network architecture 109 is scaled from the base neural network 101 using the scaling parameter values that yielded the highest performance metric among the candidate networks 107A-107N identified during the search of the coefficient search space.

From the scaled neural network architecture 109, the system can generate the family 103 of scaled neural network architectures 104A-104N. The family 103 can be generated by scaling the scaled neural network architecture 109 according to different values. Each scaling parameter value of the scaled neural network architecture 109 can be scaled uniformly to generate the other scaled neural network architectures in the family 103. For example, the scaling parameter values of the scaled neural network architecture 109 can each be increased by a factor of two. The scaled neural network architecture 109 can be scaled for different values, or "compound coefficients", each applied uniformly to every scaling parameter value of the scaled neural network architecture 109. In some implementations, the scaled neural network architecture 109 is scaled in other ways, for example by scaling each scaling parameter value separately, to generate the scaled neural network architectures in the family 103.
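Generating a family by applying compound coefficients uniformly can be sketched as follows. The `ScalingValues` type, the choice to apply a coefficient by multiplication (consistent with the "factor of two" example above), and the numeric values are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingValues:
    """Scaling parameter values of a scaled architecture (illustrative)."""
    depth: float
    width: float
    resolution: float


def apply_compound_coefficient(values, coefficient):
    """Apply one coefficient uniformly to every scaling parameter value."""
    return ScalingValues(
        depth=values.depth * coefficient,
        width=values.width * coefficient,
        resolution=values.resolution * coefficient,
    )


def build_family(values, coefficients):
    """One family member per compound coefficient."""
    return [apply_compound_coefficient(values, c) for c in coefficients]


# Hypothetical values found by the coefficient search.
searched = ScalingValues(depth=1.2, width=1.1, resolution=1.15)
family = build_family(searched, [1.0, 1.5, 2.0])
for member in family:
    print(member)
```

Because only the single coefficient changes between family members, no new coefficient search is needed per member, which is what makes building the family fast.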

By scaling scaled neural network architecture 109 according to different values, different neural network architectures for performing a task according to different use cases can be generated quickly. The different use cases may be specified as different trade-offs among the multiple objectives used to identify scaled neural network architecture 109. For example, one scaled neural network architecture may be identified as meeting a higher accuracy threshold at the cost of higher latency during execution. Another scaled neural network architecture may be identified as meeting a lower accuracy threshold while executing with lower latency on hardware accelerator 116. Yet another scaled neural network architecture may be identified as balancing the trade-off between accuracy and latency on hardware accelerator 116.

As an example, to perform a computer vision task such as object recognition, a neural network architecture may need to generate output in real time or near real time as part of an application that continuously receives video or image data and is tasked with identifying objects of a particular class in the received data. For this example task, the required accuracy may be lower, so a scaled neural network architecture with a suitable trade-off of lower latency and lower accuracy may be deployed to perform the task.

As another example, a neural network architecture may be tasked with classifying every object in a scene received from image or video data. In this example, if latency in performing the task is not considered as important as performing the task accurately, a scaled neural network with higher accuracy at the expense of latency may be deployed. In other examples, a scaled neural network architecture that balances the trade-offs among accuracy, latency, and other objectives may be deployed, for instance where no particular trade-off is identified or desired in performing the neural network task.

Scaled neural network architecture 104N is scaled using scaling parameter values different from those used to scale scaled neural network architecture 109 to obtain scaled neural network 104A, and may represent a different use case, for example one in which accuracy is preferred over latency at inference.

The LACS and NAS techniques described herein may generate family 103 for hardware accelerator 116, and may also receive additional training data and information specifying multiple different computing resources, such as different types of hardware accelerators. In addition to generating family 103 for hardware accelerator 116, the system may search for a base neural network architecture and generate a family of scaled neural network architectures for other hardware accelerators. For example, for a GPU and a TPU, the system may generate separate model families, each optimized for the accuracy-latency trade-off on the GPU and on the TPU, respectively. In some implementations, the system may generate multiple scaled families from the same base neural network architecture.

Example Methods
FIG. 2 is a flow diagram of an example process 200 for generating a scaled neural network architecture for execution on target computing resources. Example process 200 may be performed on a system of one or more processors in one or more locations. For example, the NAS-LACS (neural architecture search with latency-aware compound scaling) system described herein may perform process 200.

As shown in block 210, the system receives training data corresponding to a neural network task. A neural network task is a machine learning task that can be performed by a neural network. The scaled neural network may be configured to receive any type of data input and to generate output for performing the neural network task. As examples, the output may be any kind of score, classification, or regression output based on the input. Correspondingly, the neural network task may be a scoring task, a classification task, and/or a regression task for predicting an output given an input. These tasks may correspond to a variety of applications in processing images, video, text, speech, or other types of data.

The received training data may be in any form suitable for training a neural network according to one of a variety of learning techniques. Learning techniques for training a neural network may include supervised, unsupervised, and semi-supervised learning techniques. For example, the training data may include multiple training examples that the neural network can receive as input. A training example may be labeled with a known output corresponding to the output that a neural network appropriately trained to perform the particular neural network task is expected to generate. For example, if the neural network task is a classification task, the training examples may be images labeled with one or more classes that categorize the subjects depicted in the images.

As shown in block 220, the system receives information specifying target computing resources. The target computing resource data may specify characteristics of the computing resources on which at least part of the neural network may be deployed. The computing resources may be housed in one or more data centers or other physical locations hosting various types of hardware devices. Examples of hardware types include central processing units (CPUs), graphics processing units (GPUs), edge or mobile computing devices, field programmable gate arrays (FPGAs), and various types of application-specific integrated circuits (ASICs).

Some devices may be configured for hardware acceleration, which can include devices configured to perform certain types of operations efficiently. These hardware accelerators, which include, for example, GPUs and tensor processing units (TPUs), may implement special features for hardware acceleration. Example features include configurations for performing operations commonly associated with executing machine learning models, such as matrix multiplication. As further examples, these special features may include matrix multiply-and-accumulate units available in different types of GPUs, and matrix multiply units available in TPUs.

The target computing resource data may include data for one or more sets of target computing resources. A set of target computing resources refers to a collection of computing devices on which a neural network is to be deployed. Information specifying a set of target computing resources may indicate the type and quantity of hardware accelerators or other computing devices in the target set. A target set may include devices of the same type or of different types. For example, a set of target computing resources may define hardware characteristics and quantity for a particular type of hardware accelerator, including its processing capability, throughput, and memory capacity. As described herein, the system may generate a family of scaled neural network architectures for each device specified in the set of target computing resources.
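One way to picture the information specifying a set of target computing resources is as a small record per device type. The following Python sketch is purely illustrative: the class names, field names, units, and numbers are all assumptions, since the disclosure does not define a data format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class AcceleratorSpec:
    """Hypothetical description of one device type in a target set."""
    kind: str                # e.g. "GPU" or "TPU"
    count: int               # quantity of devices of this type
    processing_power: float  # placeholder unit
    throughput: float        # placeholder unit
    memory_capacity: float   # placeholder unit


@dataclass(frozen=True)
class TargetResourceSet:
    """Hypothetical set of target computing resources."""
    name: str
    accelerators: Tuple[AcceleratorSpec, ...]

    def device_kinds(self) -> List[str]:
        # Distinct device types; a family could be generated for each one.
        return sorted({spec.kind for spec in self.accelerators})

    def total_devices(self) -> int:
        return sum(spec.count for spec in self.accelerators)


# All values below are placeholders, not measurements of real hardware.
mixed_set = TargetResourceSet(
    name="example-datacenter",
    accelerators=(
        AcceleratorSpec("GPU", 8, 100.0, 900.0, 32.0),
        AcceleratorSpec("TPU", 4, 200.0, 1200.0, 16.0),
    ),
)
```

A target set may mix device types, as here, or hold a single type; `device_kinds()` lists the types for which separate families might be generated.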

In addition, the target computing resource data may specify different sets of target computing resources, reflecting, for example, different possible configurations of the computing resources housed in a data center. From the training data and the target computing resource data, the system may generate a family of neural network architectures. Each architecture may be generated from a base neural network identified by the system.

As shown in block 230, the system may use the training data to perform a neural architecture search over a search space to identify the architecture of a base neural network. The system may use any of a variety of NAS techniques, such as techniques based on reinforcement learning, evolutionary search, or differentiable search. In some implementations, the system may directly receive data specifying the architecture of the base neural network, without, for example, receiving training data and performing NAS as described herein. For example, an initial candidate neural network architecture can be predetermined based on an analysis of the types of neural network architectures suited to performing the particular neural network task. As another example, an initial candidate neural network architecture can be selected at random from a search space of candidate neural networks. As yet another example, at least some characteristics of an initial candidate neural network can be selected based on similar characteristics of previously trained neural networks, such as a similar or identical number of layers or similar weight values.

The search space refers to the candidate neural networks, or portions of candidate neural networks, that can potentially be selected as part of the base neural network architecture. A portion of a candidate neural network architecture may refer to a component of the neural network. The architecture of a neural network may be defined in terms of multiple components of the neural network, each component including one or more neural network layers. Characteristics of the neural network layers can be defined in the architecture at the component level, meaning that the architecture can define particular operations for a component so that every neural network layer in the component implements the same operations defined for that component. A component may also be defined in the architecture by the number of layers it contains.

As part of performing NAS, the system may repeatedly identify a candidate neural network, obtain performance metrics corresponding to multiple objectives, and evaluate the candidate neural network according to each of those performance metrics. As part of obtaining performance metrics such as the accuracy and latency of a candidate neural network, the system may train the candidate using the received training data. Once training is complete, the system may evaluate the candidate neural network architecture, determine its performance metrics, and compare them against those of the current best candidate.

The system may repeat this search process, selecting a candidate neural network, training the network, and comparing its performance metrics, until a stopping criterion is reached. The stopping criterion may be a predetermined minimum performance threshold met by the current candidate network. In addition or alternatively, the stopping criterion may be a maximum number of search iterations or a maximum amount of time allocated for performing the search. The stopping criterion may also be a condition under which the performance of the neural network has converged, for example when the performance of a subsequent iteration differs from that of the previous iteration by less than a threshold.
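The stopping criteria listed above can be combined in a single check, sketched here in Python. The function name `should_stop`, its parameters, and the use of one scalar performance value per iteration are assumptions made for illustration.

```python
import time
from typing import Optional, Sequence


def should_stop(
    history: Sequence[float],
    *,
    min_performance: Optional[float] = None,
    max_iterations: Optional[int] = None,
    deadline: Optional[float] = None,
    convergence_tol: Optional[float] = None,
) -> bool:
    """Return True if any configured stopping criterion is met.

    `history` holds one (hypothetical) scalar performance value per completed
    search iteration; `deadline` is a time.monotonic() timestamp.
    """
    if not history:
        return False
    # Criterion 1: the current candidate meets a minimum performance threshold.
    if min_performance is not None and history[-1] >= min_performance:
        return True
    # Criterion 2: the maximum number of search iterations has been reached.
    if max_iterations is not None and len(history) >= max_iterations:
        return True
    # Criterion 3: the time allocated to the search has elapsed.
    if deadline is not None and time.monotonic() >= deadline:
        return True
    # Criterion 4: performance changed by less than a threshold since the last iteration.
    if (
        convergence_tol is not None
        and len(history) >= 2
        and abs(history[-1] - history[-2]) < convergence_tol
    ):
        return True
    return False
```

A search loop would append each iteration's result to `history` and call `should_stop` before selecting the next candidate.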

In the context of optimizing different performance metrics of a neural network, such as accuracy and latency, the stopping criterion may specify a threshold range predetermined to be "optimal." For example, the threshold range for optimal latency may be a threshold range from the theoretical or measured minimum latency achieved by the target computing resources. The theoretical or measured minimum latency may be based on physical characteristics of the computing resources, such as the minimum time required for components of the computing resources to physically read and process incoming data. In some implementations, latency is instead held to an absolute minimum, for example as close to zero delay as physically possible, and is not based on a target latency measured or calculated from the target computing resources.

The system may be configured to use a machine learning model or another technique to select the next candidate neural network architecture, where the selection may be based at least in part on learned characteristics of different candidate neural networks that are likely to perform well under the objectives of the particular neural network task.

In some examples, the system may use a multi-objective reward mechanism to identify the base neural network architecture, as follows.

To measure the accuracy of a candidate neural network, the system may use a training set to train the candidate to perform the neural network task. The system may split the training data into a training set and a validation set, for example by an 80/20 split. For example, the system may apply a supervised learning technique to calculate the error between the output generated by the candidate neural network and the ground-truth labels of the training examples processed by the network. The system can use any of a variety of loss or error functions appropriate to the type of task for which the neural network is being trained, such as cross-entropy loss for classification tasks or mean squared error for regression tasks. The gradient of the error with respect to changes in the weights of the candidate neural network may be calculated, for example using the backpropagation algorithm, and the weights of the neural network may be updated. The system may be configured to train the candidate neural network until a stopping criterion is met, such as a number of training iterations, a maximum period of time, convergence, or a minimum accuracy threshold.
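The 80/20 split and the two loss functions named above can be sketched in plain Python. This is a simplified illustration with hypothetical function names; it assumes a shuffled split with a fixed seed, single-example cross-entropy over class probabilities, and no particular machine learning framework.

```python
import math
import random
from typing import List, Sequence, Tuple


def split_train_validation(
    examples: Sequence, train_fraction: float = 0.8, seed: int = 0
) -> Tuple[List, List]:
    # Shuffle a copy so the caller's data is untouched, then cut at 80% by default.
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def cross_entropy(predicted_probs: Sequence[float], true_class: int) -> float:
    # Classification loss for one example: negative log-probability of the true class.
    return -math.log(predicted_probs[true_class])


def mean_squared_error(predictions: Sequence[float], targets: Sequence[float]) -> float:
    # Regression loss: average squared difference between predictions and targets.
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(predictions)


train_set, validation_set = split_train_validation(range(10))
```

In a real training loop, the gradient of one of these losses with respect to the weights would be obtained by backpropagation, which this sketch leaves out.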

The system may generate performance metrics for the accuracy and latency of a candidate neural network architecture on the target computing resources, in addition to other performance metrics, including the operational intensity of the candidate base neural network when deployed on the target computing resources and/or the execution efficiency of the candidate base neural network on the target computing resources. In some implementations, in addition to accuracy and latency, the performance measure of the candidate base neural network is based at least in part on operational intensity and/or execution efficiency.

Latency, operational intensity, and execution efficiency may be defined as follows.

Rather than searching only for networks with reduced computation, the system can search for a base neural network architecture by simultaneously searching for candidate neural network architectures with improved operational intensity, execution efficiency, and computation requirements, thereby improving the latency of the final neural network. The system can be configured to operate in this manner to reduce the overall latency of the final base neural network architecture.

In addition, particularly where the target computing resources are data center hardware accelerators, the candidate architecture search space from which the system selects the base neural network architecture can be expanded to broaden the variety of available candidate neural networks that are likely to perform accurately with low inference latency on the target computing resources.

Expanding the search space as described can increase the number of candidate neural network architectures that are better suited to deployment on data center accelerators, making it possible to identify base neural network architectures that would not have been candidates in a search space not expanded according to aspects of this disclosure. In examples in which the target computing resources specify hardware accelerators such as GPUs and TPUs, the search space can be expanded with candidate architectures, or with portions of architectures such as components or operations, that improve operational intensity, parallelism, and/or execution efficiency.

In one example expansion, the search space can be expanded to include neural network architecture components having layers that implement one of various types of activation functions. For TPUs and GPUs, it has been found that activation functions such as ReLU or swish typically have low operational intensity and are typically memory-bound on these types of hardware accelerators. Because executing activation functions in a neural network is generally bound by the total amount of memory available on the target computing resources, executing these functions can have an outsized negative impact on the inference speed of the end-to-end network.

One example expansion of the search space with respect to activation functions is to introduce into the search space activation functions fused with their associated discrete convolutions. Because activation functions are generally element-wise operations that run on the units of a hardware accelerator configured for vector operations, they can be executed in parallel with the discrete convolutions, which are typically matrix-based operations that run on the matrix units of the hardware accelerator. These fused activation function-convolution operations are selectable by the system as candidate neural network components as part of the base neural network architecture search described herein. Any of a variety of activation functions can be used, including swish, ReLU (rectified linear unit), sigmoid, tanh, and softmax.

Various components that include layers of fused activation function-convolution operations can be added to the search space, and the components can differ by the type of activation function used. For example, one component of activation function-convolution layers may include the ReLU activation function, while another component may include the swish activation function. Because different hardware accelerators may operate more efficiently with different activation functions, it has been found that expanding the search space to include fused activation function-convolutions for multiple types of activation functions can further improve identification of the base neural network architecture best suited to performing the neural network task at hand.
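As a rough illustration of such an expansion, the Python sketch below enumerates fused activation function-convolution components and appends them to an existing list of candidate components. The dictionary layout, the operation name `"fused_conv_act"`, and the kernel sizes are hypothetical; the sketch only models how candidates are enumerated, not how a fused kernel is executed on an accelerator.

```python
from itertools import product
from typing import Dict, List, Sequence, Union

Component = Dict[str, Union[str, int]]


def fused_conv_activation_components(
    activations: Sequence[str], kernel_sizes: Sequence[int]
) -> List[Component]:
    # One candidate component per (activation, kernel size) pair.
    return [
        {"op": "fused_conv_act", "activation": act, "kernel": k}
        for act, k in product(activations, kernel_sizes)
    ]


def expand_search_space(
    base_space: Sequence[Component], extra: Sequence[Component]
) -> List[Component]:
    # Append new components, skipping any that are already present.
    expanded = list(base_space)
    for component in extra:
        if component not in expanded:
            expanded.append(component)
    return expanded


# Hypothetical starting space with unfused convolution components only.
base_space: List[Component] = [{"op": "conv", "kernel": 1}, {"op": "conv", "kernel": 3}]
fused = fused_conv_activation_components(["relu", "swish"], [1, 3])
search_space = expand_search_space(base_space, fused)
```

A search over `search_space` could then select a ReLU-fused component for one accelerator and a swish-fused component for another, depending on which runs more efficiently.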

In addition to the components with different activation functions described herein, the search space can also be expanded with other fused convolution structures, further enriching it with convolutions of different shapes, types, and sizes. The different convolution structures may be components added as part of a candidate neural network architecture, and may include an expansion layer of 1×1 convolutions, a depth-wise convolution, a projection layer of 1×1 convolutions, and other operations such as activation functions, batch normalization functions, and/or skip connections.

Accordingly, the NAS search space can be expanded in other ways to include operations that can take advantage of the parallelism available on hardware accelerators. The search space may include one or more operations for fusing a depth-wise convolution with an adjacent 1×1 convolution, and operations for reshaping the input of the neural network. For example, the input to a candidate neural network may be a tensor. A tensor is a data structure that can represent multiple values according to different ranks. For example, a first-order tensor may be a vector, a second-order tensor may be a matrix, a third-order tensor may be a three-dimensional matrix, and so on. Fusing depth-wise convolutions can be beneficial because depth-wise operations generally have low operational intensity, and fusing them with an adjacent convolution can raise the operational intensity closer to the maximum capability of the hardware accelerator.

The search space may also include operations that reshape an input tensor by moving elements of the tensor to different locations in memory on the target computing resources. In addition or alternatively, the operations may copy elements to different locations in memory.

In some implementations, the system is configured to receive a base neural network to scale directly, and does not perform NAS or any other search to identify the base neural network. In some implementations, multiple devices individually perform at least parts of process 200, for example by identifying the base neural network on one device and scaling the base neural network on another device as described herein.

As shown in block 240 of FIG. 2, the system may identify multiple scaling parameter values for scaling the base neural network according to the information specifying the target computing resources and multiple scaling parameters. The system can use latency-aware compound scaling as described here, so that the accuracy and latency of scaled candidate neural networks serve as the objectives for searching the scaling parameter values of the base neural network. For example, the system may apply process 300 for latency-aware compound scaling at step 240 of FIG. 2 to identify the multiple scaling parameter values.

In general, scaling techniques are applied in conjunction with NAS to identify a neural network that is scaled for deployment on target computing resources. Model scaling can be used together with NAS to search more efficiently for a family of neural networks supporting a variety of use cases. Under a scaling approach, different techniques can be used to search for different values of scaling parameters such as the depth, width, and resolution of a neural network. Scaling can be done by searching for a value for each scaling parameter separately, or by searching for a uniform set of values that adjusts multiple scaling parameters together. The former is sometimes referred to as simple scaling, and the latter as compound scaling.

Scaling techniques that use accuracy as the sole objective cannot yield neural networks scaled with proper consideration of the performance and speed impact of deployment on specialized hardware such as data center accelerators. As described in more detail with reference to FIG. 3, LACS may use both an accuracy objective and a latency objective, which may be shared as the same objectives used to identify the base neural network architecture.

FIG. 3 is an example process 300 for latency-aware compound scaling of a base neural network architecture. The example process 300 may be performed on a system or device of one or more processors in one or more locations. For example, process 300 may be performed on a NAS-LACS system as described herein. For example, the system may perform process 300 as part of block 240 to identify a plurality of scaling parameter values for scaling the base neural network.

In a compound scaling approach, the scaling coefficients for the individual scaling parameters are searched for together. The system can apply any of a variety of search techniques for identifying a tuple of coefficients, such as a Pareto frontier search or a grid search. While searching for the tuple of coefficients, the system may search according to the same objectives used to identify the base neural network architecture, as described herein with reference to FIG. 2. In addition, the multiple objectives may include both accuracy and latency, and can be expressed as follows:

As part of determining the performance measure, the system may further train and tune the candidate scaled neural network architecture using the received training data. The system may determine a performance measure for the trained scaled neural network and, according to block 330, determine whether the performance measure meets a performance threshold. The performance measure and the performance threshold may be, respectively, a composite of multiple performance metrics and a set of multiple performance thresholds. For example, the system may determine a single performance measure from both the accuracy and the inference latency metrics of the scaled neural network, or it may determine separate performance metrics for the different objectives and compare each metric against its corresponding performance threshold.
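The two options in the preceding paragraph, a single composite measure versus separate per-objective thresholds, can be sketched as follows. The multiplicative form `accuracy * (latency / target) ** weight` and the default `weight` are assumptions borrowed from common multi-objective NAS practice for illustration only; they are not presented here as the formula used in this disclosure.

```python
def composite_measure(accuracy, latency_ms, target_latency_ms, weight=-0.07):
    """Fold accuracy and latency into one number.

    Assumed form: accuracy is discounted when latency exceeds the target.
    `weight` is a hypothetical, negative tuning constant.
    """
    return accuracy * (latency_ms / target_latency_ms) ** weight


def meets_separate_thresholds(accuracy, latency_ms, min_accuracy, max_latency_ms):
    """Compare each metric against its own threshold."""
    return accuracy >= min_accuracy and latency_ms <= max_latency_ms


# A candidate exactly on the latency target keeps its raw accuracy.
on_target = composite_measure(0.8, latency_ms=10.0, target_latency_ms=10.0)

# A slower candidate with the same accuracy scores lower.
too_slow = composite_measure(0.8, latency_ms=20.0, target_latency_ms=10.0)
```

A composite measure gives the search a single number to rank candidates by, while separate thresholds make each constraint explicit and easier to audit.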

If the performance measure meets the performance threshold, process 300 ends. Otherwise, the process continues, and the system selects a new plurality of candidate scaling parameter values according to block 310. For example, the system may select a new tuple of scaling coefficients from the coefficient search space based at least in part on the previously selected candidate tuples and their corresponding performance measures. In some implementations, the system searches multiple tuples of coefficients and then performs a finer-grained search, according to the multiple objectives, in the neighborhood of each candidate tuple.

The system may implement any of a variety of techniques for iteratively searching the space of candidate coefficients, for example grid search, reinforcement learning, or evolutionary search. As described above, the system may continue searching for scaling parameter values until a stopping criterion is reached, such as convergence or a maximum number of iterations.
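The select, evaluate, and check loop of blocks 310 to 330 can be sketched with a plain grid search. This is a minimal sketch, not the disclosed implementation: `toy_evaluate` is a hypothetical stand-in for training a scaled candidate and measuring it on the target hardware, and its formulas, the grid values, and the thresholds are invented for illustration.

```python
import itertools


def toy_evaluate(coefs):
    """Hypothetical stand-in for training and measuring one scaled candidate."""
    depth, width, resolution = coefs
    accuracy = 0.70 + 0.05 * (depth - 1) + 0.04 * (width - 1) + 0.03 * (resolution - 1)
    latency_ms = 10.0 * depth * width * width * resolution * resolution
    return accuracy, latency_ms


def search(grid, min_accuracy, max_latency_ms, max_iterations=100):
    """Return the first coefficient tuple that meets both thresholds, else None."""
    for i, coefs in enumerate(itertools.product(*grid)):
        if i >= max_iterations:          # stopping criterion: iteration budget
            break
        accuracy, latency_ms = toy_evaluate(coefs)   # block 320
        if accuracy >= min_accuracy and latency_ms <= max_latency_ms:  # block 330
            return coefs, accuracy, latency_ms
    return None


# One list of candidate coefficients per scaling parameter (depth, width, resolution).
GRID = ([1.0, 1.5, 2.0],) * 3
result = search(GRID, min_accuracy=0.752, max_latency_ms=65.0)
```

Reinforcement learning or evolutionary search would replace the fixed enumeration with a selection step that depends on previously evaluated tuples; the threshold check and the stopping criterion stay the same.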

In some implementations, the system may be tuned according to one or more controller parameter values. The controller parameter values may be tuned by hand, learned by machine learning techniques, or a combination of both. The controller parameters may affect the relative effect of each objective on the overall performance measure of a candidate tuple. In some examples, particular values, or particular relationships between values, in a candidate tuple may be favored or disfavored based on learned characteristics of ideal scaling coefficients that are reflected at least in part in the controller parameter values.

According to block 340, the system generates one or more groups of scaling parameter values from the selected candidate scaling parameter values according to one or more objective trade-offs. An objective trade-off may represent different thresholds for each objective, such as accuracy and latency, and may be met by different scaled neural networks. For example, one objective trade-off may have a higher threshold for network accuracy but a lower threshold for inference latency (that is, a more accurate network with higher latency). As another example, an objective trade-off may have a lower threshold for network accuracy but a higher threshold for inference latency (that is, a less accurate network with lower latency). As yet another example, an objective trade-off may strike a balance between accuracy and latency performance.

For each objective trade-off, the system may identify a group of scaling parameter values that the system can use to scale the base neural network architecture so that it meets the objective trade-off. That is, the system may repeat the selection shown in block 310, the determination of the performance measure shown in block 320, and the determination of whether the performance measure meets the performance threshold shown in block 330, with the difference that the performance threshold is defined by the objective trade-off. In some implementations, rather than searching for tuples for the base neural network architecture, the system may search for candidate scaling coefficient tuples for the base neural network architecture as already scaled according to the candidate scaling parameter values that first met the performance measure for the multiple objectives at block 330.
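Grouping candidates by objective trade-off can be sketched as below. The candidate tuples, their accuracy and latency figures, and the three named trade-offs are hypothetical values made up for illustration; in practice each candidate would come from the search and measurement steps described above.

```python
# Each candidate: (scaling coefficient tuple, accuracy, latency in ms). Values are invented.
CANDIDATES = [
    ((1.0, 1.0, 1.0), 0.70, 10.0),
    ((1.2, 1.1, 1.1), 0.76, 18.0),
    ((1.5, 1.3, 1.2), 0.81, 35.0),
    ((2.0, 1.6, 1.4), 0.84, 80.0),
]

# Each trade-off: (minimum accuracy, maximum latency in ms). Names are hypothetical.
TRADEOFFS = {
    "low_latency": (0.65, 12.0),
    "balanced": (0.75, 40.0),
    "high_accuracy": (0.83, 100.0),
}


def pick(candidates, min_accuracy, max_latency_ms):
    """Most accurate candidate that satisfies the trade-off, or None if none does."""
    feasible = [c for c in candidates if c[1] >= min_accuracy and c[2] <= max_latency_ms]
    return max(feasible, key=lambda c: c[1], default=None)


# One member of the scaled family per objective trade-off.
family = {name: pick(CANDIDATES, *limits) for name, limits in TRADEOFFS.items()}
```

Picking the most accurate feasible candidate is only one possible rule; a deployment that prioritizes cost could instead pick the fastest feasible candidate.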

Returning to FIG. 2, as shown in block 250, the NAS-LACS system may generate one or more architectures for scaled neural networks using the architecture of the base neural network scaled according to the plurality of scaling parameter values. The scaled neural network architectures may form a family of neural networks generated from the base neural network architecture and different scaling parameter values.

If the information specifying the target computing resources includes one or more sets of multiple target computing resources, for example multiple different types of hardware accelerators, the system may repeat process 200 and process 300 for each hardware accelerator to generate scaled neural network architectures that each correspond to a target set. For each hardware accelerator, the system may generate a family of scaled neural network architectures according to different objective trade-offs between latency and accuracy, or between latency, accuracy, and other objectives (including, in particular, the operational intensity and execution efficiency described with reference to (3)).

In some implementations, the system may generate multiple families of scaled neural network architectures from the same base neural network architecture. This approach can be useful when different target devices share similar hardware characteristics, and it allows the corresponding scaled family for each device to be identified more quickly, at least because the search for the base neural network architecture, for example as shown in the process of FIG. 2, is performed only once.

Example System
FIG. 4 is a block diagram of a NAS-LACS (neural architecture search with latency-aware compound scaling) system 400 according to aspects of the present disclosure. System 400 is configured to receive training data 401 for performing a neural network task and target computing resource data 402 specifying target computing resources. As described herein with reference to FIGS. 1-3, system 400 may be configured to implement techniques for generating a family of scaled neural network architectures.

System 400 may be configured to receive input data according to a user interface. For example, system 400 may receive the data as part of a call to an API (application program interface) that exposes system 400. As described herein with reference to FIG. 5, system 400 can be implemented on one or more computing devices. Input to system 400 may be provided, for example, through a storage medium, including remote storage connected to the one or more computing devices over a network, or through a user interface on a client computing device coupled to system 400.

System 400 may be configured to output scaled neural network architectures 409, such as a family of scaled neural network architectures. The scaled neural network architectures 409 may be sent as output, for example for display on a user display, and optionally visualized according to the shape and size of each neural network layer defined in the architecture. In some implementations, system 400 may be configured to provide the scaled neural network architectures 409 as a set of computer-readable instructions, such as one or more computer programs. The set of computer-readable instructions may be executed by the target computing resources to implement the scaled neural network architectures 409.

A computer program may be written in any type of programming language and according to any programming paradigm, for example declarative, procedural, assembly, object-oriented, data-oriented, functional, or imperative. A computer program may be written to perform one or more different functions and to operate within a computing environment, for example on a physical device, on a virtual machine, or across multiple devices. A computer program may also implement the functionality described herein, for example functionality performed by a system, engine, module, or model.

In some implementations, system 400 is configured to forward data for the scaled neural network architectures 409 to one or more other devices configured to translate the architecture into an executable program written in a computer programming language, optionally as part of a framework for generating machine learning models. System 400 may also be configured to send data corresponding to the scaled neural network architectures 409 to a storage device for storage and later retrieval.

System 400 may include a NAS engine 405. NAS engine 405 and the other components of system 400 may be implemented as one or more computer programs, specially configured electronic circuitry, or any combination of these. NAS engine 405 may be configured to receive the training data 401 and the target computing resource data 402 and to generate a base neural network architecture 407. The base neural network architecture 407 may be sent to a LACS engine 415. NAS engine 405 can implement any of the variety of neural architecture search techniques described herein with reference to FIGS. 1-3. The system may be configured, according to aspects of the present disclosure, to perform NAS using multiple objectives, including the inference latency and the accuracy of candidate neural networks when executed on the target computing resources. System 400 may include a performance measurement engine 410 for determining the performance metrics that NAS engine 405 can use in searching for the base neural network architecture.

Performance measurement engine 410 may be configured to receive the architecture of a candidate base neural network and to generate performance metrics according to the objectives that NAS engine 405 uses to perform NAS. The performance metrics may provide an overall performance measure of the candidate neural network according to the multiple objectives. To determine the accuracy of the candidate base neural network, performance measurement engine 410 may run the candidate on a validation set of training examples, obtained, for example, by holding out a portion of the training data 401.
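Holding out part of the training data as a validation set, and scoring a candidate on it, can be sketched as follows. The function names, the default split fraction, and the fixed seed are assumptions for illustration; `predict` stands in for a trained candidate network.

```python
import random


def holdout_split(examples, validation_fraction=0.1, seed=0):
    """Shuffle a copy of the examples and reserve a fraction for validation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_validation = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


def accuracy(predict, validation):
    """Fraction of (input, label) pairs the candidate predicts correctly."""
    correct = sum(1 for x, label in validation if predict(x) == label)
    return correct / len(validation)


# Toy data: the label is the parity of the input.
examples = [(i, i % 2) for i in range(100)]
train, validation = holdout_split(examples, validation_fraction=0.25)
perfect_score = accuracy(lambda x: x % 2, validation)
```

The held-out examples must not be used for training the candidate, otherwise the accuracy metric overstates how well the candidate generalizes.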

To measure latency, performance measurement engine 410 may communicate with computing resources corresponding to the target computing resources specified by data 402. For example, if the target computing resource data 402 specifies a TPU as the target resource, performance measurement engine 410 may send the candidate base neural network for execution on a corresponding TPU. The TPU may be housed in a data center that communicates with the one or more processors implementing system 400, for example over a network described in more detail with reference to FIG. 5.

Performance measurement engine 410 may receive latency information indicating the latency between the target computing resource receiving an input and generating an output. The latency information may be measured directly on the target computing resource in the field and sent to performance measurement engine 410, or it may be measured by performance measurement engine 410 itself. If performance measurement engine 410 measures the latency, engine 410 may be configured to compensate for latency that is not attributable to processing by the candidate base neural network, for example network latency in communicating with the target computing resource. As another example, performance measurement engine 410 may estimate the latency of processing an input through the candidate base neural network based on previous measurements of the target computing resource and on hardware characteristics of the target computing resource.
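Measuring latency from the engine side while compensating for network latency can be sketched as below. This is a minimal sketch under stated assumptions: `run_inference` is a hypothetical callable that sends one input to the target resource and waits for the output, `network_round_trip_s` is assumed to have been measured separately, and taking the median of several runs is a design choice of this sketch rather than something the disclosure specifies.

```python
import statistics
import time


def measure_latency(run_inference, sample, network_round_trip_s=0.0,
                    repeats=5, clock=time.perf_counter):
    """Median wall-clock time per call, minus a known network round trip.

    `clock` is injectable so that the function can be exercised deterministically.
    """
    timings = []
    for _ in range(repeats):
        start = clock()
        run_inference(sample)
        timings.append(clock() - start)
    # Clamp at zero in case the round-trip estimate exceeds the measured time.
    return max(0.0, statistics.median(timings) - network_round_trip_s)


# Local example with no network hop: time a trivial stand-in for inference.
local_latency = measure_latency(lambda x: sum(x), list(range(1000)))
```

The median is less sensitive than the mean to occasional slow runs, such as a first call that includes warm-up work.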

Performance measurement engine 410 may generate performance metrics for other characteristics of a candidate neural network architecture, such as its operational intensity and execution efficiency. As described herein with reference to FIGS. 1-3, inference latency may be determined as a function of FLOPS (the computational requirement), execution efficiency, and operational intensity, and in some implementations system 400 searches for and scales neural networks, directly or indirectly, based on these additional characteristics.
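One way to relate latency to the computational requirement, operational intensity, and execution efficiency is a roofline-style estimate. The sketch below is an assumption made for illustration, not the formula of this disclosure: it takes operational intensity as FLOPs per byte of memory traffic, caps the ideal compute rate at the lower of the hardware peak and the memory-bound rate, and scales it by execution efficiency. All parameter names and numbers are hypothetical.

```python
def estimate_latency_s(flops, bytes_accessed, peak_flops_per_s,
                       memory_bandwidth_bytes_per_s, execution_efficiency):
    """Roofline-style latency estimate in seconds (assumed model)."""
    if not 0.0 < execution_efficiency <= 1.0:
        raise ValueError("execution_efficiency must be in (0, 1]")
    # Operational intensity: useful work per byte moved.
    intensity = flops / bytes_accessed
    # The ideal rate is limited by either compute or memory bandwidth.
    ideal_rate = min(peak_flops_per_s, intensity * memory_bandwidth_bytes_per_s)
    achieved_rate = ideal_rate * execution_efficiency
    return flops / achieved_rate


# Memory-bound case: low intensity, so bandwidth limits the achievable rate.
memory_bound = estimate_latency_s(1e9, 1e7, 1e12, 1e9, execution_efficiency=0.5)

# Compute-bound case: same work with far less memory traffic.
compute_bound = estimate_latency_s(1e9, 1e5, 1e12, 1e9, execution_efficiency=0.5)
```

Under this model, two networks with the same FLOPs can have very different latencies, which is why FLOPs alone is a weak proxy for speed on an accelerator.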

Once the performance metrics are generated, performance measurement engine 410 may send the metrics to NAS engine 405. NAS engine 405 may then repeat a new search for new candidate base neural network architectures until a stopping criterion is reached, as described herein with reference to FIG. 2.

In some examples, NAS engine 405 is tuned according to one or more controller parameters that adjust how NAS engine 405 selects the next candidate base neural network architecture. The controller parameters can be tuned by hand according to the desired characteristics of a neural network for a particular neural network task. In some examples, the controller parameters can be learned through a variety of machine learning techniques, and NAS engine 405 may implement one or more machine learning models trained to select base neural network architectures according to multiple objectives, such as latency and accuracy. For example, NAS engine 405 may implement a recurrent neural network trained to use features of previous candidate base neural networks and the multiple objectives to predict candidate base networks that are more likely to meet those objectives. The neural network can be trained using performance metrics and training data labeled to indicate the final base neural architecture selected, given a training data set associated with a neural network task and data for the target computing resources.

LACS engine 415 may be configured to perform latency-aware compound scaling as described according to aspects of the present disclosure. LACS engine 415 is configured to receive data 407 specifying the base neural network architecture from NAS engine 405. Like NAS engine 405, LACS engine 415 may communicate with performance measurement engine 410 to obtain performance metrics for candidate scaled neural network architectures. As described herein with reference to FIGS. 1-3, LACS engine 415 may maintain in memory a search space of different tuples of scaling coefficients, and may be configured to scale a final scaled architecture to quickly obtain a family of scaled neural network architectures. In some implementations, LACS engine 415 is configured to perform other forms of scaling, for example simple scaling, while using the multiple objectives, including latency, that NAS engine 405 uses.

FIG. 5 is a block diagram of an example environment 500 for implementing NAS-LACS system 400. System 400 may be implemented on one or more devices having one or more processors in one or more locations, such as in a server computing device 515. A client computing device 512 and the server computing device 515 may be communicatively coupled to one or more storage devices 530 over a network 560. The storage device(s) 530 may be a combination of volatile and non-volatile memory and may be at the same physical location as the computing devices 512, 515 or at a different one. For example, the storage device(s) 530 may include any type of non-transitory computer-readable medium capable of storing information, such as a hard drive, solid-state drive, tape drive, optical storage, memory card, ROM, RAM, DVD, CD-ROM, write-capable memory, or read-only memory.

Server computing device 515 may include one or more processors 513 and memory 514. Memory 514 may store information accessible by the processor(s) 513, including instructions 521 that can be executed by the processor(s) 513. Memory 514 may also include data 523 that can be retrieved, manipulated, or stored by the processor(s) 513. Memory 514 may be any type of non-transitory computer-readable medium capable of storing information accessible by the processor(s) 513, such as volatile and non-volatile memory. The processor(s) 513 may include one or more CPUs (central processing units), one or more GPUs (graphics processing units), one or more FPGAs (field-programmable gate arrays), and/or one or more ASICs (application-specific integrated circuits), such as TPUs (tensor processing units).

Instructions 521 may include one or more instructions that, when executed by the processor(s) 513, cause the one or more processors to perform the actions defined by the instructions. Instructions 521 can be stored in object code format for direct processing by the processor(s) 513, or in other formats, including interpretable scripts or collections of independent source code modules that are interpreted on demand or compiled in advance. Instructions 521 may include instructions for implementing system 400 consistent with aspects of the present disclosure. System 400 can be executed using the processor(s) 513 and/or using other processors located remotely from server computing device 515.

Data 523 may be retrieved, stored, or modified by the processor(s) 513 according to instructions 521. Data 523 can be stored in computer registers, in a relational or non-relational database as a table having a plurality of different fields and records, or as JSON, YAML, proto, or XML documents. Data 523 may also be formatted in a computer-readable format such as, but not limited to, binary values, ASCII, or Unicode. Moreover, data 523 may include information sufficient to identify relevant information, such as numbers, descriptive text, proprietary codes, pointers, and references to data stored in other memories, including other network locations, or information that is used by a function to calculate relevant data.

Client computing device 512 may also be configured similarly to server computing device 515, with one or more processors 516, memory 517, instructions 518, and data 519. Client computing device 512 may also include a user output 526 and a user input 524. User input 524 may include any appropriate mechanism or technique for receiving input from a user, such as a keyboard, mouse, mechanical actuators, soft actuators, touchscreen, microphone, and sensors.

Server computing device 515 may be configured to transmit data to client computing device 512, and client computing device 512 may be configured to display at least a portion of the received data on a display implemented as part of user output 526. User output 526 can also be used to display an interface between client computing device 512 and server computing device 515. Alternatively or in addition, user output 526 may include one or more speakers, transducers, or other audio outputs, or a haptic interface or other tactile feedback that provides non-visual and non-audible information to a platform user of client computing device 512.

Although FIG. 5 illustrates processors 513, 516 and memories 514, 517 as being within computing devices 515, 512, the components described herein, including processors 513, 516 and memories 514, 517, may include multiple processors and memories that can operate in different physical locations and not within the same computing device. For example, some of the instructions 521, 518 and data 523, 519 can be stored on a removable SD card and the rest within a read-only computer chip. Some or all of the instructions and data can be stored in a location physically remote from, yet still accessible by, processors 513, 516. Similarly, processors 513, 516 may include a collection of processors that can operate concurrently and/or sequentially. Computing devices 515, 512 may each include one or more internal clocks that provide timing information, which can be used to measure the time taken by operations and programs run by computing devices 515, 512.

Server computing device 515 may be connected over network 560 to a data center 550 housing hardware accelerators 551A-551N. Data center 550 may be one of multiple data centers or other facilities in which various types of computing devices, such as hardware accelerators, are located. As described herein, the computing resources housed in data center 550 may be specified as part of the target computing resources for deploying scaled neural network architectures.

Server computing device 515 may be configured to receive requests from client computing device 512 to process data on computing resources in data center 550. For example, environment 500 may be part of a computing platform configured to provide a variety of services to users through various user interfaces and/or APIs that expose the platform services. One or more of the services may be a machine learning framework or a set of tools for generating neural networks or other machine learning models according to a specified task and training data. Client computing device 512 may receive and transmit data specifying target computing resources to be allocated for executing a neural network trained to perform a particular neural network task. According to aspects of the present disclosure described herein with reference to FIGS. 1-4, NAS-LACS system 400 may receive the data specifying the target computing resources and the training data and, in response, generate a family of scaled neural network architectures for deployment on the target computing resources.

As another example of services that a platform implementing environment 500 may provide, server computing device 515 may maintain a variety of families of scaled neural network architectures according to the different target computing resources that may be available in data center 550. For example, server computing device 515 may maintain different families for deploying neural networks on the various types of TPUs and/or GPUs housed in data center 550 or otherwise available for processing.
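
The sketch below illustrates why different accelerators may end up with different scaling parameter values. It is a toy example under stated assumptions: the accelerator names, the latency and accuracy proxies, the candidate grid, and the use of a hard latency budget (rather than any particular multi-objective reward) are all hypothetical and are not taken from this disclosure.

```python
import itertools

# Hypothetical cost, in milliseconds per unit of work, for two accelerator types.
MS_PER_WORK_UNIT = {"accelerator_fast": 0.5, "accelerator_slow": 1.5}


def estimated_latency_ms(accelerator, depth, width, resolution):
    """Toy latency proxy: work grows linearly in depth, quadratically in width and resolution."""
    work = depth * width ** 2 * resolution ** 2
    return MS_PER_WORK_UNIT[accelerator] * work


def estimated_accuracy(depth, width, resolution):
    """Toy accuracy proxy that saturates as model capacity grows."""
    capacity = depth * width * resolution
    return capacity / (1.0 + capacity)


def select_scaling(accelerator, budget_ms, grid=(1.0, 1.2, 1.4)):
    """Pick the candidate with the best accuracy proxy that fits the latency budget."""
    feasible = [
        candidate
        for candidate in itertools.product(grid, repeat=3)
        if estimated_latency_ms(accelerator, *candidate) <= budget_ms
    ]
    if not feasible:
        raise ValueError(f"no candidate meets {budget_ms} ms on {accelerator}")
    # Prefer higher accuracy; break ties in favor of lower latency.
    return max(
        feasible,
        key=lambda c: (estimated_accuracy(*c), -estimated_latency_ms(accelerator, *c)),
    )


# One set of scaling values per accelerator type, all under the same budget.
families = {name: select_scaling(name, budget_ms=3.0) for name in MS_PER_WORK_UNIT}
for name, (depth, width, resolution) in families.items():
    print(name, depth, width, resolution)
```

Because the same candidate costs more time on the slower accelerator, the selection for it is smaller, which is the motivation for keeping a separate family per hardware type.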

Devices 512, 515 and data center 550 can communicate directly or indirectly over network 560. For example, using a network socket, client computing device 512 can connect to a service operating in data center 550 through an Internet protocol. Devices 515, 512 can set up listening sockets that may accept an initiating connection for sending and receiving information. Network 560 itself may include various configurations and protocols, including the Internet, the World Wide Web, intranets, virtual private networks, wide area networks, local networks, and private networks using communication protocols proprietary to one or more companies. Network 560 can support a variety of short-range and long-range connections. The short-range and long-range connections may be made over different bandwidths, such as 2.402 GHz to 2.480 GHz (commonly associated with the Bluetooth (registered trademark) standard) and 2.4 GHz and 5 GHz (commonly associated with the Wi-Fi (registered trademark) communication protocol), or with a variety of communication standards, such as the LTE (registered trademark) standard for wireless broadband communication. In addition or alternatively, network 560 can support wired connections between devices 512, 515 and data center 550, including over various types of Ethernet (registered trademark) connection.
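
A minimal sketch of the listening-socket arrangement described above, using Python's standard `socket` module on the local loopback interface. The message contents and the single-request server are assumptions made for illustration; a real platform would use its own protocol and service endpoints.

```python
import socket
import threading


def serve_once(listener):
    """Accept one initiating connection, read a request, and acknowledge it."""
    conn, _ = listener.accept()
    with conn:
        request = conn.recv(1024)
        conn.sendall(b"ack:" + request)


# The "server" sets up a listening socket; port 0 lets the OS pick a free port.
listener = socket.create_server(("127.0.0.1", 0))
port = listener.getsockname()[1]
server_thread = threading.Thread(target=serve_once, args=(listener,))
server_thread.start()

# The "client" connects to the listening socket and exchanges data.
with socket.create_connection(("127.0.0.1", port)) as client:
    client.sendall(b"target=accelerator")  # hypothetical request payload
    reply = client.recv(1024)

server_thread.join()
listener.close()
print(reply)
```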

Although a single server computing device 515, a single client computing device 512, and a single data center 550 are shown in FIG. 5, it is understood that aspects of the disclosure can be implemented according to a variety of configurations and quantities of computing devices, including in paradigms for sequential or parallel processing, or over a distributed network of multiple devices. In some implementations, aspects of the disclosure can be performed on a single device connected to multiple hardware accelerators configured for processing neural networks, or on any combination of the above.

Exemplary Use Cases
As described herein, aspects of the disclosure provide for generating an architecture for a neural network scaled from a base neural network according to a multi-objective approach. Examples of neural network tasks follow.

As an example, the input to the neural network can be in the form of images or video. A neural network can be configured to extract, identify, and generate features as part of processing a given input, for example as part of a computer vision task. A neural network trained to perform this type of neural network task can be trained to generate an output classification from a set of different potential classifications. In addition or alternatively, the neural network can be trained to output a score corresponding to an estimated probability that a subject identified in the image or video belongs to a certain class.
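
The relationship between per-class scores and a single output classification can be sketched as follows. The use of a softmax function, the class labels, and the raw scores are assumptions chosen for illustration; the disclosure does not prescribe a particular scoring function.

```python
import math


def softmax(logits):
    """Convert raw network outputs into scores that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


CLASSES = ["cat", "dog", "bird"]   # hypothetical set of possible classifications
logits = [2.0, 0.5, -1.0]          # hypothetical raw outputs for one image

scores = softmax(logits)
best_index = max(range(len(scores)), key=scores.__getitem__)
predicted = CLASSES[best_index]

for label, score in zip(CLASSES, scores):
    print(f"{label}: {score:.3f}")
print("predicted:", predicted)
```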

As another example, the input to the neural network can be data files corresponding to a particular format, e.g., HTML files, word processing documents, or formatted metadata obtained from other types of data, such as metadata for image files. A neural network task in this context can be to classify, score, or otherwise predict some characteristic about the received input. For example, a neural network can be trained to predict the probability that received input includes text relating to a particular subject. Also, as part of performing a particular task, the neural network can be trained to generate text predictions, for example as part of a tool for auto-completion of text in a document as the document is being composed. A neural network can also be trained for predicting a translation of text in an input document to a target language, for example as a message is being composed.

Other types of input documents can be data relating to characteristics of a network of interconnected devices. These input documents can include activity logs, as well as records concerning access privileges for different computing devices to access different sources of potentially sensitive data. A neural network can be trained for processing these and other types of documents for predicting ongoing and future security breaches to the network. For example, the neural network can be trained to predict intrusion into the network by a malicious actor.

As another example, the input to a neural network can be audio input, including streamed audio, pre-recorded audio, and audio as part of a video or other source or media. A neural network task in the audio context can include speech recognition, including isolating speech from other identified sources of audio and/or enhancing characteristics of identified speech to make it easier to hear. A neural network can be trained to predict an accurate translation of input speech to a target language in real time, for example as part of a translation tool.

In addition to data input, including the various types of data described herein, a neural network can also be trained to process features corresponding to given input. Features are values, e.g., numerical or categorical values, which relate to some characteristic of the input. For example, in the context of an image, a feature of the image can relate to the RGB value of each pixel in the image. A neural network task in the image/video context can be to classify the contents of an image or video, for example for the presence of different people, places, or things. Neural networks can be trained to extract and select relevant features to be processed to generate an output for a given input, and can also be trained to generate new features based on learned relationships between various characteristics of the input data.
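
As a small illustration of features derived from per-pixel RGB values, the sketch below turns a tiny image into a flat feature vector and into a per-channel summary. The 2x2 image, the normalization to the range 0 to 1, and the function names are hypothetical choices made for this example.

```python
# A hypothetical 2x2 image; each pixel is an (R, G, B) tuple with values 0-255.
image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]


def pixel_features(image):
    """Flatten the image into one normalized feature per pixel channel."""
    return [channel / 255.0 for row in image for pixel in row for channel in pixel]


def mean_channel_features(image):
    """Derive three summary features: the mean red, green, and blue intensity."""
    pixels = [pixel for row in image for pixel in row]
    return [
        sum(pixel[c] for pixel in pixels) / (len(pixels) * 255)
        for c in range(3)
    ]


raw = pixel_features(image)
summary = mean_channel_features(image)
print(len(raw), summary)
```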

Aspects of this disclosure can be implemented in digital circuits, computer-readable storage media, as one or more computer programs, or a combination of one or more of the foregoing. The computer-readable storage media can be non-transitory, e.g., as one or more instructions executable by processor(s) and stored on a tangible storage device.

In this specification, the phrase "configured to" is used in different contexts related to computer systems, hardware, or part of a computer program. When a system is said to be configured to perform one or more operations, this means that the system has appropriate software, firmware, and/or hardware installed on it that, when in operation, causes the system to perform the one or more operations. When hardware is said to be configured to perform one or more operations, this means that the hardware includes one or more circuits that, when in operation, receive input and generate output according to the input and corresponding to the one or more operations. When a computer program is said to be configured to perform one or more operations, this means that the computer program includes one or more program instructions that, when executed by one or more computers, cause the one or more computers to perform the one or more operations.

Although the operations shown in the drawings and recited in the claims are shown in a particular order, it is understood that the operations can be performed in orders different from those shown, and that some operations can be omitted, performed more than once, and/or performed in parallel with other operations. Further, the separation of different system components configured for performing different operations should not be understood as requiring the components to be separated. The components, modules, programs, and engines described can be integrated together as a single system or be part of multiple systems.

Unless expressly stated otherwise, most of the alternative examples above are not mutually exclusive, but may be implemented in various combinations to achieve unique advantages. As these and other variations and combinations of the features discussed above can be utilized without departing from the subject matter defined by the appended claims, the foregoing description of the embodiments should be taken by way of illustration rather than by way of limitation of the subject matter defined by the claims. In addition, the provision of the examples described herein, as well as clauses phrased as "such as," "including," and the like, should not be interpreted as limiting the subject matter of the claims to the specific examples. Rather, the examples are intended to illustrate only one of many possible embodiments. Further, the same reference numbers in different drawings can identify the same or similar elements.

Claims

A computer-implemented method for determining an architecture of a neural network, the method comprising:
receiving, by one or more processors, information specifying a target computing resource;
receiving, by the one or more processors, data specifying an architecture of a base neural network;
identifying, by the one or more processors, a plurality of scaling parameter values for scaling the base neural network according to the information specifying the target computing resource and a plurality of scaling parameters of the base neural network, the identifying comprising:
selecting a plurality of candidate scaling parameter values, and
determining a performance metric of the base neural network scaled according to the plurality of candidate scaling parameter values, wherein the performance metric is determined according to a plurality of objectives including a latency objective; and
generating, by the one or more processors, an architecture of a scaled neural network using the architecture of the base neural network scaled according to the plurality of scaling parameter values.

The method of claim 1, wherein the plurality of objectives is a plurality of second objectives, and
wherein receiving the data specifying the architecture of the base neural network comprises:
receiving, by the one or more processors, training data corresponding to a neural network task; and
performing, by the one or more processors and using the training data, a neural architecture search of a search space to identify the architecture of the base neural network according to a plurality of first objectives.

The method of claim 2, wherein the search space includes candidate neural network layers, each candidate neural network layer configured to perform one or more operations, and
wherein the search space includes candidate neural network layers that include different activation functions.

The method of claim 3, wherein the architecture of the base neural network includes a plurality of candidate components, each component having a plurality of neural network layers, and
wherein the search space includes a plurality of candidate components of candidate neural network layers, including a first component of candidate network layers that include a first activation function and a second component of candidate network layers that include a second activation function different from the first activation function.

The method of claim 2, wherein the plurality of first objectives for performing the neural architecture search is the same as the plurality of second objectives for identifying the plurality of scaling parameter values.

The method, wherein the plurality of first objectives and the plurality of second objectives include an accuracy objective corresponding to an accuracy of the output of the base neural network when trained using the training data.

The method of claim 1, wherein the performance metric corresponds at least in part to a measure of latency between the base neural network receiving an input and generating an output when the base neural network is scaled according to the plurality of candidate scaling parameter values and deployed on the target computing resource.

The method of claim 1, wherein the latency objective corresponds to a minimum latency between the base neural network receiving an input and generating an output when the base neural network is deployed on the target computing resource.

The method of claim 1, wherein the information specifying the target computing resource specifies one or more hardware accelerators, and
wherein the method further comprises executing the scaled neural network on the one or more hardware accelerators to perform the neural network task.

The method of claim 9, wherein the target computing resource is a first target computing resource and the plurality of scaling parameter values is a plurality of first scaling parameter values, and
wherein the method further comprises:
receiving, by the one or more processors, information specifying a second target computing resource different from the first target computing resource; and
identifying a plurality of second scaling parameter values for scaling the base neural network according to the information specifying the second target computing resource, wherein the plurality of second scaling parameter values is different from the plurality of first scaling parameter values.

The method of claim 1, wherein the plurality of scaling parameter values is a plurality of first scaling parameter values, and
wherein the method further comprises generating an architecture of a scaled neural network from the architecture of the base neural network scaled using a plurality of second scaling parameter values, the second scaling parameter values being generated according to the plurality of first scaling parameter values and one or more compound coefficients that uniformly modify the value of each of the first scaling parameter values.

The method of claim 1, wherein the base neural network is a convolutional neural network, and the plurality of scaling parameters includes one or more of a depth of the base neural network, a width of the base neural network, and a resolution of input to the base neural network.

A system comprising:
one or more processors; and
one or more storage devices coupled to the one or more processors and storing instructions that, when executed by the one or more processors, cause the one or more processors to perform operations for determining an architecture of a neural network, the operations comprising:
receiving information specifying a target computing resource;
receiving data specifying an architecture of a base neural network;
identifying a plurality of scaling parameter values for scaling the base neural network according to the information specifying the target computing resource and a plurality of scaling parameters of the base neural network, the identifying comprising:
selecting a plurality of candidate scaling parameter values, and
determining a performance metric of the base neural network scaled according to the plurality of candidate scaling parameter values, wherein the performance metric is determined according to a plurality of objectives including a latency objective; and
generating an architecture of a scaled neural network using the architecture of the base neural network scaled according to the plurality of scaling parameter values.

The system of claim 13, wherein the plurality of objectives is a plurality of second objectives, and
wherein receiving the data specifying the architecture of the base neural network comprises:
receiving training data corresponding to a neural network task; and
performing, using the training data, a neural architecture search of a search space to identify the architecture of the base neural network according to a plurality of first objectives.

The system of claim 14, wherein the search space includes candidate neural network layers, each candidate neural network layer configured to perform one or more operations, and
wherein the search space includes candidate neural network layers that include different activation functions.

The system of claim 14, wherein the plurality of first objectives for performing the neural architecture search is the same as the plurality of second objectives for identifying the plurality of scaling parameter values.

15. The plurality of first objectives and the plurality of second objectives include an accuracy rate objective corresponding to the accuracy rate of the output of the basic neural network when trained using the training data. system.

The system of claim 13, wherein the performance metric corresponds at least in part to a measure of latency between the base neural network receiving an input and generating an output when the base neural network is scaled according to the plurality of candidate scaling parameter values and deployed on the target computing resource.

The system of claim 13, wherein the latency objective corresponds to a minimum latency between the base neural network receiving an input and generating an output when the base neural network is deployed on the target computing resource.

One or more non-transitory computer-readable storage media storing instructions that, when executed by one or more processors, cause the one or more processors to perform operations for determining an architecture of a neural network, the operations comprising:
receiving information specifying a target computing resource;
receiving data specifying an architecture of a base neural network;
identifying a plurality of scaling parameter values for scaling the base neural network according to the information specifying the target computing resource and a plurality of scaling parameters of the base neural network, the identifying comprising:
selecting a plurality of candidate scaling parameter values, and
determining a performance metric of the base neural network scaled according to the plurality of candidate scaling parameter values, wherein the performance metric is determined according to a plurality of objectives including a latency objective; and
generating an architecture of a scaled neural network using the architecture of the base neural network scaled according to the plurality of scaling parameter values.