JP7428075B2

JP7428075B2 - Object detection device, object detection method and terminal equipment

Info

Publication number: JP7428075B2
Application number: JP2020092988A
Authority: JP
Inventors: 昊康; タヌ・ジミン
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2019-06-21
Filing date: 2020-05-28
Publication date: 2024-02-06
Anticipated expiration: 2040-05-28
Also published as: JP2021002333A; CN112116032A

Description

The present invention relates to the field of information technology.

In recent years, deep learning has driven great progress in computer vision research. Deep learning refers to a collection of algorithms that apply various machine learning algorithms to hierarchical neural networks in order to solve problems involving images, text, and other data. Feature learning, the core of deep learning, aims to remove the long-standing need to design features by hand by acquiring hierarchical feature information through a hierarchical neural network.

Several deep learning methods are currently in wide use. The YOLO network, for example, is a promising deep learning method for object recognition and detection. The YOLO-Darknet53 network, which uses darknet53 as its backbone network structure, offers multi-scale object detection and a strong classifier, and therefore has faster processing speed and higher recognition accuracy than a single-stage approach. Here, the darknet53 structure is used for feature extraction.

Note that the above description of the technical background is given only so that the technical solution of the present invention can be understood clearly and completely by those skilled in the art. These solutions are described merely as background to the present invention and should not be taken to be well known to those skilled in the art.

However, neural networks with high recognition accuracy, such as the YOLO-Darknet53 network, have wide and deep layers and therefore place high memory and processing-speed demands on the processor. For example, the YOLO-Darknet53 network requires 739.8M FLOPs (FLoating point Operations Per Second, the number of operations executed per second), a processing time of 1375.8 ms per frame on a central processing unit (CPU), and a processing time of 37.0 ms per frame on a graphics processing unit (GPU). Terminal equipment such as an in-vehicle device, by contrast, can handle only about 100M FLOPs. For this reason, current neural networks with high recognition accuracy may not be applicable to mobile devices.
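To put the figures quoted above in perspective, the short calculation below converts the per-frame latencies to frame rates and compares the quoted operation count with the roughly 100M budget. It is only an illustrative sketch: the constant and function names are assumptions introduced here, and the numbers are exactly those stated in the paragraph.

```python
# Figures quoted in the description above for YOLO-Darknet53.
YOLO_DARKNET53_OPS_M = 739.8   # quoted "FLOPs" figure, in millions
CPU_MS_PER_FRAME = 1375.8      # quoted CPU latency per frame
GPU_MS_PER_FRAME = 37.0        # quoted GPU latency per frame
TERMINAL_BUDGET_OPS_M = 100.0  # approximate budget of an in-vehicle device


def frames_per_second(ms_per_frame: float) -> float:
    """Convert a per-frame latency in milliseconds to frames per second."""
    if ms_per_frame <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / ms_per_frame


def budget_ratio(required_m: float, available_m: float) -> float:
    """Return how many times the requirement exceeds the available budget."""
    if available_m <= 0:
        raise ValueError("budget must be positive")
    return required_m / available_m


if __name__ == "__main__":
    print(f"CPU: {frames_per_second(CPU_MS_PER_FRAME):.2f} frames/s")
    print(f"GPU: {frames_per_second(GPU_MS_PER_FRAME):.2f} frames/s")
    ratio = budget_ratio(YOLO_DARKNET53_OPS_M, TERMINAL_BUDGET_OPS_M)
    print(f"Over budget by a factor of {ratio:.1f}")
```

With these figures the network runs at well under one frame per second on the CPU and exceeds the quoted terminal budget roughly sevenfold, which is the gap the invention sets out to close.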

Embodiments of the present invention provide an object detection device, an object detection method, and terminal equipment. Because every convolutional layer in the shuffle unit used for feature extraction has the same number of input channels and output channels, features do not need to be expanded or compressed, which reduces the memory and performance demands on the processor and improves processing speed. In addition, because the shuffle unit has at least one depthwise separable convolutional layer, processor requirements such as FLOPs are greatly reduced while recognition accuracy remains roughly the same as that of networks such as YOLO-Darknet53. A detection method that is lightweight, fast, and highly accurate can therefore be provided, one that can be applied to terminal equipment with limited memory and performance and still achieve excellent recognition results.

A first aspect of the embodiments of the present invention provides an object detection device including: a feature extraction unit that extracts features from an input image; and a detection unit that detects an object in the input image based on the features extracted by the feature extraction unit, wherein the feature extraction unit includes at least one shuffle unit, the shuffle unit includes a plurality of convolutional layers, each of the plurality of convolutional layers has the same number of input channels and output channels, and the plurality of convolutional layers includes at least one depthwise separable convolutional layer.

A second aspect of the embodiments of the present invention provides terminal equipment including the device according to the first aspect of the embodiments of the present invention.

A third aspect of the embodiments of the present invention provides an object detection method including: a step in which a feature extraction unit extracts features from an input image; and a step in which a detection unit detects an object in the input image based on the features extracted by the feature extraction unit, wherein the feature extraction unit includes at least one shuffle unit, the shuffle unit includes a plurality of convolutional layers, each of the plurality of convolutional layers has the same number of input channels and output channels, and the plurality of convolutional layers includes at least one depthwise separable convolutional layer.

The advantageous effects of the present invention are as follows. Because every convolutional layer in the shuffle unit used for feature extraction has the same number of input channels and output channels, features do not need to be expanded or compressed, which reduces the memory and performance demands on the processor and improves processing speed. In addition, because the shuffle unit has at least one depthwise separable convolutional layer, processor requirements such as FLOPs are greatly reduced while recognition accuracy remains roughly the same as that of networks such as YOLO-Darknet53. A detection method that is lightweight, fast, and highly accurate can therefore be provided, one that can be applied to terminal equipment with limited memory and performance and still achieve excellent recognition results.

Specific embodiments of the present invention are disclosed in detail in the following description and drawings, which indicate the ways in which the principles of the invention may be employed. The embodiments of the present invention are not limited in scope thereby. They include various alterations, modifications, and equivalents within the spirit and terms of the appended claims.

Features described and/or illustrated for one embodiment may be used in the same or a similar way in one or more other embodiments, may be combined with features of other embodiments, or may replace features of other embodiments.

Note that the term "include/have", as used in this text, indicates the presence of a feature, element, step, or component, and does not exclude the presence or addition of one or more other features, elements, steps, or components.

The drawings included herein are provided to aid understanding of the embodiments of the present invention, constitute a part of this specification, illustrate the embodiments, and together with the written description explain the principles of the invention. The drawings described here merely illustrate embodiments of the present invention, and those skilled in the art can readily derive other drawings from them.

FIG. 1 is a diagram showing an object detection device according to Example 1 of the present invention.
FIG. 2 is a diagram showing detection results of an input image by the object detection device 10 according to Example 1 of the present invention.
FIG. 3 is a diagram showing the feature extraction unit 100 according to Example 1 of the present invention.
FIG. 4 is a diagram showing the first shuffle unit 103 according to Example 1 of the present invention.
FIG. 5 is a diagram showing the second shuffle unit 104 according to Example 1 of the present invention.
FIG. 6 is a diagram showing terminal equipment according to Example 2 of the present invention.
FIG. 7 is a block diagram showing the system configuration of the terminal equipment according to Example 2 of the present invention.
FIG. 8 is a diagram showing an object detection method according to Example 3 of the present invention.

These and other features of the present invention will become clear from the drawings and the following description. The specification and drawings disclose specific embodiments of the invention, that is, some of the embodiments that follow the principles of the invention. The present invention is not limited to the described embodiments and includes all modifications, variations, and equivalents within the scope of the claims.

<Example 1>
An embodiment of the present invention provides an object detection device. FIG. 1 is a diagram showing an object detection device according to Example 1 of the present invention.

As shown in FIG. 1, the object detection device 10 includes a feature extraction unit 100 and a detection unit 200.

The feature extraction unit 100 extracts features from the input image.

The detection unit 200 detects an object in the input image based on the features extracted by the feature extraction unit 100.

Here, the feature extraction unit 100 includes at least one shuffle unit. The shuffle unit includes a plurality of convolutional layers, each of the plurality of convolutional layers has the same number of input channels and output channels, and the plurality of convolutional layers includes at least one depthwise separable convolutional layer.

FIG. 2 is a diagram showing detection results of an input image by the object detection device 10 according to Example 1 of the present invention. As shown in FIG. 2, the object detection device 10 can accurately detect each object in the image.

According to this example, because every convolutional layer in the shuffle unit used for feature extraction has the same number of input channels and output channels, features do not need to be expanded or compressed, which reduces the memory and performance demands on the processor and improves processing speed. In addition, because the shuffle unit has at least one depthwise separable convolutional layer, processor requirements such as FLOPs are greatly reduced while recognition accuracy remains roughly the same as that of networks such as YOLO-Darknet53. A detection method that is lightweight, fast, and highly accurate can therefore be provided, one that can be applied to terminal equipment with limited memory and performance and still achieve excellent recognition results.

In this example, the input image may be an image acquired in real time or an image acquired in advance. For example, the input images are video captured by an in-vehicle device, and each input image corresponds to one frame of the video.

In this example, the feature extraction unit 100 extracts features from the input image. The feature extraction unit 100 includes at least one shuffle unit. The shuffle unit includes a plurality of convolutional layers, each of the plurality of convolutional layers has the same number of input channels and output channels, and the plurality of convolutional layers includes at least one depthwise separable convolution layer.

In this example, the at least one shuffle unit may include at least one first shuffle unit and/or at least one second shuffle unit. Here, the first shuffle unit is a shuffle unit with a stride of 1, and the second shuffle unit is a shuffle unit with a stride of 2.

The configuration of the feature extraction unit 100 of this example is described below by way of example.

FIG. 3 is a diagram showing the feature extraction unit 100 according to Example 1 of the present invention. As shown in FIG. 3, the feature extraction unit 100 includes a first convolutional layer 101, a pooling layer 102, a plurality of first shuffle units 103, and a plurality of second shuffle units 104.

The first convolutional layer 101 processes the input image.

The pooling layer 102 performs pooling on the features output by the first convolutional layer 101.

In this example, the first convolutional layer 101 and the pooling layer 102 may use conventional structures.

In this example, the number and ordering of the first shuffle units 103 and the second shuffle units 104 may be set according to actual needs. In other words, a predetermined rule may be determined according to actual needs, and the number and ordering of the first shuffle units 103 and the second shuffle units 104 may be determined according to that rule.

As shown in FIG. 3, "type" indicates the type of each layer of the feature extraction unit, "filters" (the channel parameter) indicates the channel size, "size" indicates the size of the feature map processed by each layer, "stride" indicates the stride of each layer, and "output" indicates the size of the output features.

In this example, the channel parameter, size, stride, and output feature size of each layer may be determined according to actual needs.

Also, as shown in FIG. 3, a number combined with "×" indicates how many times the layer is repeated. For example, "7×" indicates that the corresponding layer is repeated seven times, and "3×" indicates that the corresponding layer is repeated three times.

As shown in FIG. 3, the first convolutional layer 101 extracts features from the input image, and the extracted features are input to the pooling layer 102 for pooling. The pooled features are input to the plurality of first shuffle units 103 and the plurality of second shuffle units 104, arranged in order, for shuffle processing, and the extracted features are output to the detection unit 200 for detection.
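The repeat notation and data flow described for FIG. 3 can be sketched as a small expansion routine that turns a layer table into the ordered list of stages an input passes through. The table below is purely hypothetical: the description states that the number and order of the shuffle units are chosen according to actual needs, so the layer names and repeat counts here are assumptions for illustration, not the contents of FIG. 3.

```python
from typing import List, Sequence, Tuple

# Hypothetical layer table as (layer type, repeat count); not taken from FIG. 3.
LAYER_TABLE: List[Tuple[str, int]] = [
    ("conv", 1),             # first convolutional layer 101
    ("pool", 1),             # pooling layer 102
    ("shuffle_stride2", 1),  # second shuffle unit 104
    ("shuffle_stride1", 3),  # "3x": first shuffle unit 103 repeated three times
    ("shuffle_stride2", 1),
    ("shuffle_stride1", 7),  # "7x": first shuffle unit 103 repeated seven times
]


def expand(table: Sequence[Tuple[str, int]]) -> List[str]:
    """Expand (type, repeat) rows into the ordered sequence of stages."""
    stages: List[str] = []
    for layer_type, repeat in table:
        if repeat < 1:
            raise ValueError("repeat count must be at least 1")
        stages.extend([layer_type] * repeat)
    return stages


if __name__ == "__main__":
    for index, stage in enumerate(expand(LAYER_TABLE)):
        print(index, stage)
```

The output of the last stage is what would be handed to the detection unit 200.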

The configurations of the first shuffle unit 103 and the second shuffle unit 104 are each described below by way of example.

FIG. 4 is a diagram showing the first shuffle unit 103 according to Example 1 of the present invention. As shown in FIG. 4, the first shuffle unit 103 includes a first channel split module 401, a second convolutional layer 402, a first depthwise separable convolutional layer 403, a third convolutional layer 404, a first merging module 405, and a first shuffle module 406.

The first channel split module 401 splits the features input to the first shuffle unit 103 into a first partial feature and a second partial feature.

The second convolutional layer 402 processes the second partial feature.

The first depthwise separable convolutional layer 403 processes the second partial feature processed by the second convolutional layer 402.

The third convolutional layer 404 processes the second partial feature processed by the first depthwise separable convolutional layer 403.

The first merging module 405 merges the first partial feature with the second partial feature processed by the third convolutional layer 404.

The first shuffle module 406 performs shuffle processing on the merged first and second partial features.

As shown in FIG. 4, the input features are split by the first channel split module 401 into two parts, the first partial feature and the second partial feature. The first partial feature undergoes no processing and is input to the first merging module 405 through the left branch. The second partial feature enters the right branch and is first input to the 1×1 second convolutional layer 402. The features output by the second convolutional layer 402 are normalized and activated, then input to the 3×3 first depthwise separable convolutional layer 403. The features obtained by the first depthwise separable convolutional layer 403 are normalized, then input to the 1×1 third convolutional layer 404. The features output by the third convolutional layer 404 are normalized and activated, then output to the first merging module 405. The first merging module 405 merges the first partial feature and the second partial feature and inputs the merged features to the first shuffle module 406. The first shuffle module 406 performs shuffle processing on the merged first and second partial features and outputs the result.
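The channel bookkeeping of the first shuffle unit can be sketched in plain Python by tracking channel labels rather than pixel data. This is an illustration under stated assumptions, not the patented implementation: the description does not say how the channels are split, merged, or shuffled, so the equal split, the concatenation, and the group-interleaving shuffle (the scheme familiar from ShuffleNet) are assumptions, and the convolution branch is replaced by a placeholder that only tags the labels.

```python
from typing import Callable, List, Sequence, Tuple

Channels = List[str]


def channel_split(channels: Sequence[str]) -> Tuple[Channels, Channels]:
    """Split channels into two halves (equal halves are an assumption)."""
    if len(channels) % 2 != 0:
        raise ValueError("expected an even number of channels")
    half = len(channels) // 2
    return list(channels[:half]), list(channels[half:])


def channel_shuffle(channels: Sequence[str], groups: int = 2) -> Channels:
    """Interleave channels across groups (ShuffleNet-style, assumed here)."""
    if len(channels) % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    per_group = len(channels) // groups
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]


def first_shuffle_unit(channels: Sequence[str],
                       right_branch: Callable[[Channels], Channels]) -> Channels:
    """Trace channel labels through modules 401 to 406 of FIG. 4."""
    first, second = channel_split(channels)    # channel split module 401
    processed = right_branch(second)           # layers 402 -> 403 -> 404
    if len(processed) != len(second):
        raise ValueError("branch must keep the channel count unchanged")
    merged = first + processed                 # merging module 405 (concatenation assumed)
    return channel_shuffle(merged, groups=2)   # shuffle module 406


def mark_processed(channels: Channels) -> Channels:
    """Placeholder for the convolution branch: tags each channel label."""
    return [name + "'" for name in channels]


if __name__ == "__main__":
    print(first_shuffle_unit(["c0", "c1", "c2", "c3"], mark_processed))
```

For the four-channel input in the example, the untouched channels c0 and c1 end up interleaved with the processed channels c2' and c3', so the next unit's split sees a mix of both branches.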

FIG. 5 is a diagram showing the second shuffle unit 104 according to Example 1 of the present invention. As shown in FIG. 5, the input features include a third partial feature and a fourth partial feature, and the second shuffle unit 104 includes a second depthwise separable convolutional layer 501, a fourth convolutional layer 502, a fifth convolutional layer 503, a third depthwise separable convolutional layer 504, a sixth convolutional layer 505, a second merging module 506, and a second shuffle module 507.

The second depthwise separable convolutional layer 501 processes the third partial feature.

The fourth convolutional layer 502 processes the third partial feature processed by the second depthwise separable convolutional layer 501.

The fifth convolutional layer 503 processes the fourth partial feature.

The third depthwise separable convolutional layer 504 processes the fourth partial feature processed by the fifth convolutional layer 503.

The sixth convolutional layer 505 processes the fourth partial feature processed by the third depthwise separable convolutional layer 504.

The second merging module 506 merges the third partial feature processed by the fourth convolutional layer 502 with the fourth partial feature processed by the sixth convolutional layer 505.

The second shuffle module 507 performs shuffle processing on the merged third and fourth partial features.

As shown in FIG. 5, the input third partial feature enters the left branch and is first input to the 3×3 second depthwise separable convolutional layer 501. The features output by the second depthwise separable convolutional layer 501 are normalized, then input to the 1×1 fourth convolutional layer 502. The features output by the fourth convolutional layer 502 are normalized and activated, then output to the second merging module 506. The input fourth partial feature enters the right branch and is first input to the 1×1 fifth convolutional layer 503. The features output by the fifth convolutional layer 503 are normalized and activated, then input to the 3×3 third depthwise separable convolutional layer 504. The features output by the third depthwise separable convolutional layer 504 are normalized, then input to the 1×1 sixth convolutional layer 505. The features output by the sixth convolutional layer 505 are normalized and activated, then output to the second merging module 506. The second merging module 506 merges the third partial feature and the fourth partial feature and inputs the merged features to the second shuffle module 507. The second shuffle module 507 performs shuffle processing on the merged third and fourth partial features and outputs the result.
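Because both branches of the second shuffle unit feed the same merging module, they must produce feature maps of the same spatial size. The sketch below traces that size through each branch. Two points are assumptions rather than statements from the description: that the stride of 2 is applied in the 3×3 depthwise layers (501 and 504), and that each layer pads by half its kernel size. The layer names are labels introduced here for readability.

```python
from typing import List, Sequence, Tuple

Layer = Tuple[str, int, int]  # (label, kernel size, stride)

# Branches of FIG. 5; placing stride 2 on the 3x3 layers is an assumption.
LEFT_BRANCH: List[Layer] = [("dwconv_501", 3, 2), ("conv_502", 1, 1)]
RIGHT_BRANCH: List[Layer] = [
    ("conv_503", 1, 1),
    ("dwconv_504", 3, 2),
    ("conv_505", 1, 1),
]


def conv_output_size(size: int, kernel: int, stride: int) -> int:
    """Spatial output size of a convolution with padding of kernel // 2."""
    padding = kernel // 2
    return (size + 2 * padding - kernel) // stride + 1


def branch_output_size(size: int, layers: Sequence[Layer]) -> int:
    """Apply each layer of a branch in turn and return the final size."""
    for _label, kernel, stride in layers:
        size = conv_output_size(size, kernel, stride)
    return size


if __name__ == "__main__":
    side = 52  # hypothetical input feature-map side length
    left = branch_output_size(side, LEFT_BRANCH)
    right = branch_output_size(side, RIGHT_BRANCH)
    print(f"left branch: {left}, right branch: {right}")
```

Under these assumptions both branches halve the spatial size, so their outputs can be merged directly, whereas the stride-1 unit of FIG. 4 leaves the size unchanged.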

In this example, the first convolutional layer 101, the second convolutional layer 402, the third convolutional layer 404, the fourth convolutional layer 502, the fifth convolutional layer 503, and the sixth convolutional layer 505 may be ordinary convolutional layers. The first depthwise separable convolutional layer 403, the second depthwise separable convolutional layer 501, and the third depthwise separable convolutional layer 504 may be conventional depthwise separable convolutional layers.

Each of the above convolutional layers has the same number of input channels and output channels; that is, no convolutional layer needs to expand or compress features, which reduces the memory and performance demands on the processor and improves processing speed.
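The saving from a depthwise separable layer can be made concrete with the standard multiply-accumulate counts for convolutions. The sketch below compares an ordinary k×k convolution with a k×k depthwise step followed by a 1×1 convolution, with equal input and output channel counts as in the text. The feature-map size and channel count are hypothetical values chosen for illustration, and reading the 3×3 layer as a per-channel (depthwise) filter followed by a separate 1×1 layer is an interpretation of the description, not a statement from it.

```python
def standard_conv_macs(height: int, width: int, channels: int, kernel: int) -> int:
    """Multiply-accumulates of an ordinary convolution with C inputs and C outputs."""
    return height * width * channels * channels * kernel * kernel


def depthwise_conv_macs(height: int, width: int, channels: int, kernel: int) -> int:
    """Multiply-accumulates of a depthwise step: one k x k filter per channel."""
    return height * width * channels * kernel * kernel


def pointwise_conv_macs(height: int, width: int, channels: int) -> int:
    """Multiply-accumulates of a 1 x 1 convolution with C inputs and C outputs."""
    return height * width * channels * channels


if __name__ == "__main__":
    h, w, c, k = 52, 52, 116, 3  # hypothetical feature-map size and channel count
    standard = standard_conv_macs(h, w, c, k)
    separable = depthwise_conv_macs(h, w, c, k) + pointwise_conv_macs(h, w, c)
    print(f"standard:  {standard}")
    print(f"separable: {separable}")
    print(f"reduction: {standard / separable:.2f}x")
```

With equal channel counts the reduction factor simplifies to C·k² / (k² + C), so it approaches k² (nine for a 3×3 kernel) as the channel count grows.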

The configuration of the feature extraction unit 100 of this example has been described above by way of example.

After the feature extraction unit 100 has extracted features from the input image, the detection unit 200 detects an object in the input image based on the features extracted by the feature extraction unit 100.

In this example, the detection unit 200 may use a conventional network structure. For example, the detection unit 200 includes a YOLO (You Only Look Once) network, which detects objects in the input image based on the extracted features. For the principle and process of object detection by the YOLO network, reference may be made to the prior art; the description is omitted here.

Table 1 compares the parameters of the object detection device according to the embodiment of the present invention with those of a conventional network.

In Table 1, the first column gives the parameters of the conventional YOLO-Darknet53, the second column gives the parameters of the object detection device 10 of this example, and the third column gives the parameters of the object detection device 10' of this example. The object detection device 10 and the object detection device 10' have the same structure but different parameters; for example, the channel parameter of the object detection device 10' is smaller than that of the object detection device 10. FLOPs indicates the number of operations executed per second, CPU indicates the processing speed on a CPU, GPU indicates the processing speed on a GPU, mAP indicates the mean recognition accuracy, APperson indicates the recognition accuracy for persons, APbicycle indicates the recognition accuracy for bicycles, APcar indicates the recognition accuracy for cars, APbus indicates the recognition accuracy for buses, APvan indicates the recognition accuracy for box-type trucks, and APtruck indicates the recognition accuracy for flat-type trucks. As Table 1 shows, the object detection device 10 and the object detection device 10' of this example maintain roughly the same recognition accuracy as the YOLO-Darknet53 network while greatly reducing the memory and performance demands on the processor compared with the YOLO-Darknet53 network.

<Example 2>
An embodiment of the present invention further provides terminal equipment. FIG. 6 is a diagram showing the terminal equipment according to Example 2 of the present invention. As shown in FIG. 6, the terminal equipment 600 includes an object detection device 601, which is the same as that described in Example 1; its description is omitted here.

FIG. 7 is a block diagram showing the system configuration of the terminal device according to Example 2 of the present invention. As shown in FIG. 7, the terminal device 700 may include a central processing unit (central control unit) 701 and a storage device 702, and the storage device 702 is connected to the central processing unit 701. The figure is merely exemplary; other types of configurations may be used to supplement or replace this configuration in order to implement telecommunication functions or other functions.

As shown in FIG. 7, the terminal device 700 may further include an input unit 703, a display 704, and a power supply 705.

In one aspect, the functions of the object detection device of Example 1 may be integrated into the central processing unit 701. The central processing unit 701 may be configured to extract features from an input image with a feature extraction unit and to detect an object in the input image with a detection unit, based on the features extracted by the feature extraction unit. The feature extraction unit includes at least one shuffle unit, the shuffle unit includes a plurality of convolutional layers, the number of input channels and the number of output channels of each of the plurality of convolutional layers are the same, and the plurality of convolutional layers includes at least one depthwise separable convolutional layer.

For example, the at least one shuffle unit includes at least one first shuffle unit and/or at least one second shuffle unit, where the first shuffle unit is a shuffle unit with a stride of 1 and the second shuffle unit is a shuffle unit with a stride of 2.
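The practical difference between a stride of 1 and a stride of 2 is the spatial size of the output feature map. The following minimal sketch uses the standard convolution output-size formula; the 3x3 kernel, the padding of 1, and the 56-pixel input are assumptions chosen for illustration and are not stated in this section.

```python
def conv_output_size(input_size, kernel_size=3, stride=1, padding=1):
    """Spatial output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


# With a 3x3 kernel and padding 1 (assumed values), a stride of 1 keeps
# the feature map size, while a stride of 2 halves it.
print(conv_output_size(56, stride=1))  # 56
print(conv_output_size(56, stride=2))  # 28
```

Under these assumptions, a first shuffle unit leaves the resolution unchanged and a second shuffle unit downsamples by a factor of two in each spatial dimension.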

In another aspect, the object detection device described in Example 1 may be configured separately from the central processing unit 701. For example, the object detection device may be a chip connected to the central processing unit 701, and the functions of the object detection device may be realized under the control of the central processing unit 701.

The terminal device 700 in this embodiment does not need to include all of the components shown in FIG. 7.

As shown in FIG. 7, the central processing unit 701, also referred to as a controller or an operation control unit, may include a microprocessor or another processing device and/or logic device. The central processing unit 701 receives input and controls the operation of each part of the terminal device 700.

The storage device 702 may be, for example, one or more of a buffer, a flash memory, a hard disk, a removable medium, a volatile memory, a nonvolatile memory, or another suitable device. The central processing unit 701 may execute a program stored in the storage device 702 to store or process information. The other components are similar to those of the prior art, and their description is omitted here. Each part of the terminal device 700 may be implemented by dedicated hardware, firmware, software, or a combination thereof without departing from the scope of the present invention.

<Example 3>
An embodiment of the present invention further provides an object detection method, which corresponds to the object detection device described in Example 1. FIG. 8 is a diagram showing the object detection method according to Example 3 of the present invention. As shown in FIG. 8, the method includes the following steps.

Step 801: The feature extraction unit extracts features from the input image.

Step 802: The detection unit detects an object in the input image based on the features extracted by the feature extraction unit.

Here, the feature extraction unit includes at least one shuffle unit, the shuffle unit includes a plurality of convolutional layers, the number of input channels and the number of output channels of each of the plurality of convolutional layers are the same, and the plurality of convolutional layers includes at least one depthwise separable convolutional layer.
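The data flow through a stride-1 shuffle unit (channel division, processing of one part, merging, and shuffling) can be sketched on channel labels alone. This is an illustration under stated assumptions, not the implementation of the embodiment: the division into two equal halves, the ShuffleNet-style interleaving used for the shuffle, and all function names (`channel_split`, `channel_shuffle`, `stride1_shuffle_unit`) are hypothetical. The convolutional branch is replaced by a stand-in that leaves the channel count unchanged, consistent with the requirement that each convolutional layer has equal numbers of input and output channels.

```python
def channel_split(channels):
    """Divide the channels into a first and a second part (equal halves assumed)."""
    half = len(channels) // 2
    return channels[:half], channels[half:]


def channel_shuffle(channels, groups=2):
    """Interleave channels taken from `groups` equally sized groups.

    Equivalent to viewing the list as a (groups x per_group) grid and
    reading it column by column (ShuffleNet-style shuffle, assumed here).
    """
    if len(channels) % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    per_group = len(channels) // groups
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]


def stride1_shuffle_unit(channels, branch=lambda part: part):
    """Split, process the second part, merge, then shuffle.

    `branch` stands in for the convolution, depthwise separable
    convolution, convolution sequence; it must keep the channel count.
    """
    first, second = channel_split(channels)
    processed = branch(second)
    if len(processed) != len(second):
        raise ValueError("branch must preserve the number of channels")
    merged = first + processed
    return channel_shuffle(merged, groups=2)


labels = ["a0", "a1", "a2", "a3", "b0", "b1", "b2", "b3"]
print(stride1_shuffle_unit(labels))
# ['a0', 'b0', 'a1', 'b1', 'a2', 'b2', 'a3', 'b3']
```

After the shuffle, channels from the unprocessed part and the processed part alternate, so the next unit's channel division mixes information from both parts.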

In this embodiment, the specific implementation of each of the above steps is the same as that described in Example 1, and its description is omitted here.

According to this embodiment, all convolutional layers in the shuffle unit used for feature extraction have the same number of input channels and output channels, so features do not need to be expanded or compressed. This reduces the memory and performance requirements placed on the processor and increases the processing speed. In addition, because the shuffle unit has at least one depthwise separable convolutional layer, the requirements placed on the processor, such as FLOPs, are significantly reduced, and compared with networks such as YOLO-Darknet53 the recognition accuracy is substantially maintained. A detection method that is lightweight, fast, and accurate can therefore be provided; it can be applied to terminal devices with limited memory and performance and can achieve good recognition results.
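The saving from a depthwise separable convolution can be estimated with simple arithmetic. The sketch below counts multiply-accumulate operations for a single layer with equal numbers of input and output channels, assuming the conventional decomposition into a depthwise k x k convolution followed by a 1x1 pointwise convolution. The feature map size, channel count, and kernel size are assumed values for illustration, and the function names are hypothetical.

```python
def standard_conv_macs(height, width, channels, kernel_size=3):
    """Multiply-accumulates of a standard convolution with C_in == C_out."""
    return height * width * channels * channels * kernel_size * kernel_size


def depthwise_separable_macs(height, width, channels, kernel_size=3):
    """Multiply-accumulates of a depthwise convolution plus a 1x1 pointwise convolution."""
    depthwise = height * width * channels * kernel_size * kernel_size
    pointwise = height * width * channels * channels
    return depthwise + pointwise


# Assumed layer shape for illustration only.
h, w, c, k = 52, 52, 128, 3
ratio = depthwise_separable_macs(h, w, c, k) / standard_conv_macs(h, w, c, k)

# The ratio simplifies to 1/C + 1/k^2, independent of the feature map size.
print(f"cost relative to a standard convolution: {ratio:.4f}")
print(f"closed form 1/C + 1/k^2:                 {1 / c + 1 / k**2:.4f}")
```

For a 3x3 kernel the ratio approaches 1/9 as the channel count grows, which is the source of the reduction in processor requirements described above.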

An embodiment of the present invention further provides a computer-readable program that, when executed in an object detection device or a terminal device, causes a computer to execute the object detection method described in Example 3 in the object detection device or the terminal device.

An embodiment of the present invention further provides a storage medium that stores a computer-readable program for causing a computer to execute the object detection method described in Example 3 in an object detection device or a terminal device.

The object detection method executed in the object detection device or the terminal device, described with reference to the embodiments of the present invention, may be implemented in hardware, in a software module executed by a processor, or in a combination of the two. For example, one or more of the functional blocks shown in FIG. 1, and/or one or more combinations of those functional blocks, may correspond to software modules of a computer program flow or to hardware modules. The software modules may correspond to the respective steps shown in FIG. 8. The hardware modules may be realized by implementing the software modules in hardware, for example with a field-programmable gate array (FPGA).

The software module may reside in a RAM memory, a flash memory, a ROM memory, an EPROM memory, an EEPROM memory, a register, a hard disk, a removable hard disk, a CD-ROM, or any other form of storage medium known to those skilled in the art. The storage medium may be connected to the processor so that the processor can read information from and write information to the storage medium, or the storage medium may be a component of the processor. The processor and the storage medium are located in an ASIC. The software module may be stored in the memory of a mobile terminal or on a memory card inserted into the mobile terminal. For example, if the terminal device uses a relatively large-capacity MEGA-SIM card or a large-capacity flash memory device, the software module may be stored on the MEGA-SIM card or the large-capacity flash memory device.

One or more of the functional blocks in the functional block diagram of FIG. 1, and/or one or more combinations of those functional blocks, may be implemented as a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or any suitable combination thereof, for performing the functions described in this application. They may also be implemented as a combination of computing devices, for example a combination of a DSP and a microprocessor, a combination of a plurality of microprocessors, one or more microprocessors in communication with a DSP, or any other such configuration.

The present invention has been described above with reference to specific embodiments, but the above description is merely illustrative and does not limit the scope of protection of the present invention. Various modifications and changes may be made to the present invention without departing from its spirit and principles, and these modifications and changes also fall within the scope of the present invention.

Claims

1. An object detection device comprising:
a feature extraction unit that extracts features from an input image; and
a detection unit that detects an object in the input image based on the features extracted by the feature extraction unit, wherein
the feature extraction unit includes at least one shuffle unit,
the shuffle unit includes a plurality of convolutional layers,
the number of input channels and the number of output channels of each convolutional layer of the plurality of convolutional layers are the same,
the plurality of convolutional layers includes at least one depthwise separable convolutional layer,
the at least one shuffle unit includes at least one first shuffle unit and/or at least one second shuffle unit,
the first shuffle unit is a shuffle unit with a stride of 1, and
the second shuffle unit is a shuffle unit with a stride of 2.

2. The device according to claim 1, wherein the feature extraction unit further includes:
a first convolutional layer that processes the input image; and
a pooling layer that performs pooling on the features output by the first convolutional layer and inputs the pooled features to the first shuffle unit or the second shuffle unit.

3. The device according to claim 1, wherein the first shuffle unit includes:
a first channel division module that divides the features input to the first shuffle unit into a first partial feature and a second partial feature;
a second convolutional layer that processes the second partial feature;
a first depthwise separable convolutional layer that processes the second partial feature processed by the second convolutional layer;
a third convolutional layer that processes the second partial feature processed by the first depthwise separable convolutional layer;
a first merging module that merges the first partial feature and the second partial feature processed by the third convolutional layer; and
a first shuffle module that performs shuffling on the merged first partial feature and second partial feature.

4. The device according to claim 1, wherein
the features input to the second shuffle unit include a third partial feature and a fourth partial feature, and
the second shuffle unit includes:
a second depthwise separable convolutional layer that processes the third partial feature;
a fourth convolutional layer that processes the third partial feature processed by the second depthwise separable convolutional layer;
a fifth convolutional layer that processes the fourth partial feature;
a third depthwise separable convolutional layer that processes the fourth partial feature processed by the fifth convolutional layer;
a sixth convolutional layer that processes the fourth partial feature processed by the third depthwise separable convolutional layer;
a second merging module that merges the third partial feature processed by the fourth convolutional layer and the fourth partial feature processed by the sixth convolutional layer; and
a second shuffle module that performs shuffling on the merged third partial feature and fourth partial feature.

5. The device according to claim 1, wherein
the at least one shuffle unit includes a plurality of the first shuffle units and a plurality of the second shuffle units, and
the plurality of first shuffle units and the plurality of second shuffle units are rearranged according to a predetermined rule.

6. The device according to any one of claims 1 to 5, wherein the detection unit includes a YOLO network.

7. A terminal device comprising the device according to any one of claims 1 to 6.

8. An object detection method comprising:
a step in which a feature extraction unit extracts features from an input image; and
a step in which a detection unit detects an object in the input image based on the features extracted by the feature extraction unit, wherein
the feature extraction unit includes at least one shuffle unit,
the shuffle unit includes a plurality of convolutional layers,
the number of input channels and the number of output channels of each convolutional layer of the plurality of convolutional layers are the same,
the plurality of convolutional layers includes at least one depthwise separable convolutional layer,
the at least one shuffle unit includes at least one first shuffle unit and/or at least one second shuffle unit,
the first shuffle unit is a shuffle unit with a stride of 1, and
the second shuffle unit is a shuffle unit with a stride of 2.