JPWO2020181685A5

Info

Publication number: JPWO2020181685A5
Application number: JP2021502766A
Authority: JP
Publication date: 2022-06-03
Anticipated expiration: 2039-06-25

Description

The present invention relates to an in-vehicle video target detection method based on deep learning and belongs to the field of video image processing technology.

Detecting and tracking vehicles, pedestrians, and other obstacles ahead of the vehicle while driving, and thereby analyzing the behavior of the vehicle ahead, is the foundation of a safe driving support system. Conventional target detection methods usually consist of the following main steps: extracting target features, training a corresponding classifier, searching with a sliding window, and filtering duplicates and false positives. In such methods the sliding-window selection strategy is untargeted, the time complexity is high, the windows are redundant, hand-crafted features have poor robustness, and the classifier is unreliable. In addition, the data cannot be trained flexibly to learn effective features on demand and complete a specific detection task.

An object of the present invention is to overcome the above deficiencies of the prior art and to provide an in-vehicle video target detection method based on deep learning.

To achieve the above object, the present invention adopts the following technical solution.

An in-vehicle video target detection method based on deep learning includes the following steps.

Step 1) Align the pixels in the depth coordinate system with the color coordinate system, extract features from the depth image and the color image separately with a CNN, and concatenate the feature maps output by the respective convolutional layers along the channel dimension to obtain the final RGB-D feature, which serves as the fused convolutional feature map. The RGB-D feature obtained by this concatenation is the fused convolutional feature map shared by the RPN and Fast R-CNN, and its matrix form is as follows:

\[ Y_{merge}(i, j, k) = \begin{cases} Y_{RGB}(i, j, k), & 0 \le k \le c-1 \\ Y_{depth}(i, j, k-c), & c \le k \le 2c-1 \end{cases} \]

where
i, j, k: index variables, with i ∈ [0, h-1], j ∈ [0, w-1], k ∈ [0, 2c-1]
h: height of the feature map
w: width of the feature map
c: number of channels (given as the three RGB channels)
Y_RGB(i, j, k): color image feature
Y_depth(i, j, k-c): depth image feature
Y_merge(i, j, k): concatenated (fused) image feature
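The channel-wise concatenation in step 1 can be sketched in a few lines. This is a minimal illustration in plain Python, assuming the two feature maps are nested lists indexed as [i][j][k]; the function name fuse_rgbd and the toy values are hypothetical and not part of the patent.

```python
def fuse_rgbd(y_rgb, y_depth):
    """Concatenate two h x w x c feature maps along the channel axis.

    Both inputs are nested lists indexed as [i][j][k]; the result is
    indexed the same way, with k running over 0 .. 2c-1.
    """
    h, w, c = len(y_rgb), len(y_rgb[0]), len(y_rgb[0][0])
    if (len(y_depth), len(y_depth[0]), len(y_depth[0][0])) != (h, w, c):
        raise ValueError("RGB and depth feature maps must have the same shape")
    # Channels 0..c-1 come from the color branch, c..2c-1 from the depth branch.
    return [[y_rgb[i][j] + y_depth[i][j] for j in range(w)] for i in range(h)]


# Toy example with h=1, w=2, c=3 (values are illustrative only).
rgb = [[[1, 2, 3], [4, 5, 6]]]
depth = [[[10, 20, 30], [40, 50, 60]]]
merged = fuse_rgbd(rgb, depth)
```

For any k at or above c, merged[i][j][k] equals depth[i][j][k - c], which is the second case of the matrix form.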

Step 2) Create a region proposal network (RPN). The RPN includes one 3×3 convolutional layer and two parallel 1×1 convolutional layers. The fused convolutional feature map is input to the 3×3 convolutional layer, anchor points are set pixel by pixel on the input feature map for each preset size, and anchor bounding boxes of predetermined dimensions are generated at each anchor point.

The generated anchor bounding boxes are input to the two parallel 1×1 convolutional layers, which perform bounding box regression and foreground/background classification and output, respectively, the position and the foreground/background confidence of each anchor bounding box. From the anchor bounding boxes so obtained, a predetermined number of regions with the highest predicted confidence are selected according to preset conditions, yielding the final region proposal set C.

Create a Fast R-CNN model. The Fast R-CNN model consists of two ROI pooling layers, one fully connected layer, and two parallel fully connected layers, which output the confidence of the region and the regressed bounding box position, respectively. The fused convolutional feature map is input to the Fast R-CNN model, which outputs the position, class, and confidence of each target in the image.

Step 3) Construct the cost function for training the RPN and the cost function for training the Fast R-CNN network.

Step 4) Randomly initialize all new layers with weights drawn from a zero-mean Gaussian distribution whose standard deviation is set to a predetermined value.

Step 5) Train the model by alternately training the two networks, the RPN and Fast R-CNN, using the backpropagation algorithm and the stochastic gradient descent algorithm, adjusting the weights of the neural network layer by layer in turn according to preset parameters.

Step 6) Test the Faster R-CNN model that has been trained on the previously obtained training set, and select hard samples using the hard-sample discriminant. The hard-sample discriminant is as follows:

where
L_IoU: bounding box regression error
L_score: classification error
o: overlap (intersection) ratio between the sample and the target
k: sensitivity coefficient with respect to the threshold
o and p take values in the range 0 to 1.

Step 7) Add the hard samples produced in step 6) to the training set, retrain the network, and repeat steps 5 to 7 to finally obtain the optimal Faster R-CNN model.

Step 8) Process the in-vehicle video images actually collected, input them to the trained Faster R-CNN model, and output the class, confidence, and position of each target in the image.

The effects of the present invention are as follows.

First, building on a proposal-based convolutional neural network model, the present invention provides a target detection model supplemented with depth information. The improved Faster R-CNN adds a depth information channel: features are extracted from the color image and the depth image by CNNs of identical structure, the two CNNs are arranged in parallel, and the original color image feature map is concatenated with the depth image feature map to obtain the final image feature. Compared with conventional algorithms, the image features of the present invention are richer and supplement detailed information about the vehicle without a risk of increased time cost, meeting the need for improved target detection in complex scenes.

Second, the present invention adds a hard-sample mining strategy to the training stage, so that the model pays more attention to hard samples than before and better distinguishes vehicles from vehicle-like background, thereby improving accuracy.

Third, the Faster R-CNN algorithm used in the present invention, which extracts proposal anchor bounding boxes with a shared convolutional network, offers markedly better real-time performance. The algorithm abandons conventional region proposal algorithms and extracts anchor bounding boxes with the convolutional layers of the deep network, which greatly reduces time cost.

FIG. 1 is a process chart of an embodiment of the present invention. FIG. 2 is a training process chart of the improved Faster R-CNN algorithm of an embodiment of the present invention.

The present invention is further described below with reference to the drawings. The following embodiment is intended only to explain the technical solution of the present invention more clearly and does not limit the scope of protection of the present invention.

The present invention aims to provide an in-vehicle video target detection method based on deep learning. On the basis of Faster R-CNN, a depth image feature map is added to supplement detailed information about the vehicle; the same convolutional neural network used to extract color image features is adopted, with the color image channel and the depth image channel arranged in parallel; the extracted features are concatenated to obtain the final RGB-D feature; and a hard-sample mining strategy is added to training. Together these improve the detection accuracy of the algorithm for small and hard targets in complex traffic scenes.

FIG. 1 is a process chart of an embodiment of the method of the present invention.

The present invention may be carried out with a training sample set and a test sample set obtained in advance, or the training set and test set may be created as needed. In this embodiment, the training sample set and test sample set are created from the KITTI dataset, which constitutes step 1 below, using the format and evaluation tools of the PASCAL VOC dataset. First, the KITTI classes are converted. PASCAL VOC has 20 classes, but the important detection targets in urban traffic scenes are cars, pedestrians, and traffic signs, so the dataset is divided into these three classes. Next, the label information is converted: the label files are converted from txt files to xml files, other information in the labels is removed, and only the three classes are retained. Finally, the training set and test set are generated.
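The label conversion described above might look roughly like the following sketch. It assumes the usual KITTI object label layout (object type in the first field, 2D box left/top/right/bottom in fields 5 to 8) and emits only a reduced subset of the PASCAL VOC annotation elements. The class mapping CLASS_MAP and the function name kitti_to_voc are hypothetical; the patent does not state which KITTI types map to which of the three retained classes.

```python
import xml.etree.ElementTree as ET

# Hypothetical, illustrative mapping from KITTI types to retained classes.
CLASS_MAP = {"Car": "car", "Pedestrian": "pedestrian"}


def kitti_to_voc(label_text, filename, class_map):
    """Convert the text of one KITTI label file to a VOC-style XML string."""
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = filename
    for line in label_text.splitlines():
        fields = line.split()
        # Skip malformed lines and every class that is not retained.
        if len(fields) < 8 or fields[0] not in class_map:
            continue
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = class_map[fields[0]]
        box = ET.SubElement(obj, "bndbox")
        # Fields 4..7 hold left, top, right, bottom in pixels.
        for tag, value in zip(("xmin", "ymin", "xmax", "ymax"), fields[4:8]):
            ET.SubElement(box, tag).text = str(int(round(float(value))))
    return ET.tostring(root, encoding="unicode")


sample = (
    "Car 0.00 0 -1.58 587.01 173.33 614.12 200.12 1.65 1.67 3.64 -0.65 1.71 46.70 -1.59\n"
    "DontCare -1 -1 -10 503.89 169.71 590.61 190.13 -1 -1 -1 -1000 -1000 -1000 -10"
)
xml_text = kitti_to_voc(sample, "000001.png", CLASS_MAP)
```

All other label fields (truncation, occlusion, 3D dimensions, and so on) are dropped, matching the statement that only the class information needed for the three classes is retained.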

As shown in FIG. 1, the method according to the present invention includes the following steps.

Step 2) Create an improved Faster R-CNN model that integrates the region proposal network (Region Proposal Network, RPN) and the Fast R-CNN network.

2.1) Convolutional neural network (CNN)

First, the pixels in the depth coordinate system are aligned with the color coordinate system. For the CNN, the feature extraction network of the ZF model is selected, and two CNNs of identical structure are arranged in parallel (the original color image channel is channel 1, and the depth image channel arranged in parallel is channel 2). After the CNNs extract features from the two kinds of image, both feature maps have size h × w × c (where h and w are the height and width of the feature map, and c is the number of channels, given as the three RGB channels). The color image feature and the depth image feature are fused as two channels, and the fused feature map has size h × w × 2c.

2.2) The region proposal network RPN includes one 3×3 convolutional layer and two parallel 1×1 convolutional layers.

The fused convolutional feature map is input to the 3×3 convolutional layer, anchor points are set pixel by pixel on the input feature map for each preset size, and anchor bounding boxes of predetermined dimensions are generated at each anchor point. In this embodiment, three scales and three aspect ratios are used, so each anchor point produces k = 3 × 3 = 9 anchor bounding boxes of different dimensions, for a total of h × w × k anchor bounding boxes.
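The anchor count h × w × k can be checked with a short sketch. The stride, the three scales, and the three aspect ratios used here are assumptions borrowed from common Faster R-CNN practice; the patent only states that three of each are used. The function name make_anchors is hypothetical.

```python
import math


def make_anchors(h, w, stride=16, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (cx, cy, bw, bh) anchor boxes, len(scales) * len(ratios) per cell."""
    anchors = []
    for i in range(h):
        for j in range(w):
            # Center of feature-map cell (i, j) in input-image pixels.
            cx = (j + 0.5) * stride
            cy = (i + 0.5) * stride
            for s in scales:
                for r in ratios:
                    # Keep the area s*s fixed while the ratio bh/bw = r varies.
                    bw = s / math.sqrt(r)
                    bh = s * math.sqrt(r)
                    anchors.append((cx, cy, bw, bh))
    return anchors


anchors = make_anchors(h=2, w=3)
```

With h=2 and w=3 the list holds 2 × 3 × 9 anchor boxes, each described by the same four parameters (center x, center y, width, height) that the regression branch refines.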

The two parallel 1×1 convolutional layers perform bounding box regression and foreground/background classification on the anchor bounding boxes produced by the preceding layer, and output, respectively, the position and the foreground/background confidence of each anchor bounding box. The anchor bounding box position comprises four parameters: the center coordinates x and y, the width w', and the height h'.

2.3) From the anchor bounding boxes obtained in 2.1), a predetermined number of regions satisfying preset conditions are selected. In this embodiment, the anchor bounding boxes are sorted in descending order of softmax score and the top 2000 regions are retained; the non-maximum suppression (NMS) algorithm is then applied to select the 300 regions with the highest predicted confidence, yielding the final region proposal set C.
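The selection in 2.3) can be sketched as a sort followed by greedy non-maximum suppression. The corner box format (x1, y1, x2, y2) and the IoU threshold of 0.7 are assumptions made for illustration; only the 2000 and 300 counts come from the text. The names iou and select_proposals are hypothetical.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def select_proposals(boxes, scores, pre_nms_top_n=2000, post_nms_top_n=300,
                     iou_thresh=0.7):
    """Return indices of the proposals kept after score ranking and NMS."""
    # Rank by score in descending order and keep the top pre_nms_top_n boxes.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    order = order[:pre_nms_top_n]
    keep = []
    for i in order:
        # Keep a box only if it does not overlap a higher-scoring kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
            if len(keep) == post_nms_top_n:
                break
    return keep


boxes = [(0, 0, 10, 10), (0, 0, 10, 9), (20, 20, 30, 30)]
scores = [0.8, 0.9, 0.7]
kept = select_proposals(boxes, scores)
```

In the toy input, the first box overlaps the higher-scoring second box with an IoU of 0.9 and is suppressed, so only the second and third boxes remain.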

2.4) Fast R-CNN consists of two ROI pooling layers, one fully connected layer, and two parallel fully connected layers, which output the confidence of the region and the regressed bounding box position, respectively.

The ROI pooling layer performs a pooling operation on the region proposal set C and the fused convolutional feature map: each ROI is mapped from the input image to the corresponding position on the feature map, the mapped region is divided into sections of equal size, and max pooling is applied to each section.
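The divide-and-max-pool step can be illustrated on a single-channel feature map. This sketch assumes the ROI has already been mapped to feature-map cell coordinates and uses a 2 × 2 output grid for readability; the output size and the function name roi_max_pool are hypothetical.

```python
def roi_max_pool(feature, roi, out_h=2, out_w=2):
    """Max-pool one ROI of a 2-D feature map into an out_h x out_w grid.

    feature: list of rows; roi: (r0, c0, r1, c1) in feature-map cells,
    with r1 and c1 exclusive.
    """
    r0, c0, r1, c1 = roi
    pooled = []
    for a in range(out_h):
        # Row range of section a; every section covers at least one row.
        rs = r0 + (a * (r1 - r0)) // out_h
        re = max(rs + 1, r0 + ((a + 1) * (r1 - r0)) // out_h)
        row = []
        for b in range(out_w):
            # Column range of section b; at least one column.
            cs = c0 + (b * (c1 - c0)) // out_w
            ce = max(cs + 1, c0 + ((b + 1) * (c1 - c0)) // out_w)
            row.append(max(feature[r][c]
                           for r in range(rs, re) for c in range(cs, ce)))
        pooled.append(row)
    return pooled


feature = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
pooled = roi_max_pool(feature, roi=(0, 0, 4, 4))
```

Because every ROI is reduced to the same grid size regardless of its original extent, the following fully connected layers always receive a fixed-length input.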

The fully connected layer merges the outputs of the ROI pooling layers and feeds them to the two parallel fully connected layers, which perform region classification and bounding box regression on the anchor bounding boxes and output the position, class, and confidence of each target in the image.

In this embodiment, the cost function for training the RPN is as follows:

where
An anchor bounding box whose Intersection over Union (IoU) with the ground truth (i.e., the calibrated true data) is the highest, or is at least 0.7, is labeled as a positive sample.
P_i: predicted confidence
P_i^*: label value; 1 denotes a positive sample and 0 a negative sample
i: index of the anchor bounding box
N_cls: total number of anchor bounding boxes
N_reg: number of positive samples
t_i: predicted correction of the anchor bounding box
t_i^*: actual correction of the anchor bounding box
L_cls: classification cost
L_reg: bounding box regression cost
λ: balancing weight
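The formula itself is not reproduced in this text. The symbols defined above match those of the multi-task loss commonly used to train the RPN in Faster R-CNN, so the cost function plausibly has the following form. This is an assumption based on that standard formulation, not a transcription of the patent's equation.

```latex
L(\{P_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(P_i, P_i^*)
  + \lambda \, \frac{1}{N_{reg}} \sum_i P_i^* \, L_{reg}(t_i, t_i^*)
```

Under this form, the factor P_i^* switches the regression term off for negative samples, so only positive anchor bounding boxes contribute to the bounding box regression cost.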

In this embodiment, the cost function for training the Fast R-CNN network is as follows:

where
u: class u
t^u: predicted bounding box regression correction for class u
v: actual correction
L_cls: classification cost
L_reg: bounding box regression cost
p: predicted classification result
λ: balancing weight
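The formula is likewise missing from this text. The listed symbols correspond to the standard Fast R-CNN multi-task loss, which suggests the form below. This is a hedged reconstruction; in particular, the indicator [u ≥ 1], which assumes that class 0 denotes background, comes from the standard formulation and is not stated in the patent.

```latex
L(p, u, t^u, v) = L_{cls}(p, u) + \lambda \, [u \ge 1] \, L_{reg}(t^u, v)
```

Here [u ≥ 1] equals 1 for object classes and 0 for background, so the regression cost is only applied to regions labeled as targets.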

Step 4) All new layers are randomly initialized with weights drawn from a zero-mean Gaussian distribution whose standard deviation is set according to the parameters used for training and fine-tuning the standard ZF model.
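A minimal sketch of this initialization follows, using the standard deviation of 0.01 that the claims give for step 4. The layer shape, the seed, and the function name init_layer are hypothetical and chosen only for illustration.

```python
import random


def init_layer(n_out, n_in, std=0.01, seed=None):
    """Return an n_out x n_in weight matrix drawn from N(0, std^2)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]


# Illustrative shape only; the patent does not give layer sizes.
weights = init_layer(n_out=4, n_in=256, seed=0)
mean = sum(sum(row) for row in weights) / (4 * 256)
```

Layers taken over from the pretrained ZF model would keep their weights; only the newly added layers are initialized this way.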

Step 5) Train the model by alternately training the two networks, the RPN and Fast R-CNN, using the backpropagation algorithm and the stochastic gradient descent algorithm, adjusting the weights of the neural network layer by layer in turn. The initial learning rate of the network is set to 0.01, the minimum learning rate to 0.0001, the momentum to 0.9, the weight decay coefficient to 0.0005, and the dropout value to 0.5. The specific steps are as follows.

(1) Train the RPN model with the backpropagation algorithm and the stochastic gradient descent algorithm; this stage runs for 80,000 iterations.

(2) Use the anchor bounding boxes generated by the RPN as input to Fast R-CNN and train it independently; this stage runs for 40,000 iterations.

(3) Initialize the RPN from the result of (2), fix the shared convolutional layers (set their learning rate to 0), and update the parameters of the RPN; this stage runs for 80,000 iterations.

(4) Keep the shared convolutional layers fixed (learning rate 0), fine-tune the Fast R-CNN network, and update the parameters of its fully connected layers; this stage runs for 40,000 iterations.
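The four-stage schedule and the solver settings can be written down as plain data, together with one momentum-SGD update. The hyperparameter values and iteration counts are taken from the text; the update rule (velocity form with L2 weight decay) is a common convention assumed here, and the names SOLVER, Stage, SCHEDULE, and sgd_momentum_step are hypothetical.

```python
from collections import namedtuple

# Hyperparameters quoted in step 5.
SOLVER = {"base_lr": 0.01, "min_lr": 0.0001, "momentum": 0.9,
          "weight_decay": 0.0005, "dropout": 0.5}

# network: which network is updated; freeze_shared: True means the shared
# convolutional layers have learning rate 0 during that stage.
Stage = namedtuple("Stage", "network freeze_shared iterations")

SCHEDULE = [
    Stage("RPN", False, 80000),         # (1)
    Stage("Fast R-CNN", False, 40000),  # (2)
    Stage("RPN", True, 80000),          # (3)
    Stage("Fast R-CNN", True, 40000),   # (4)
]


def sgd_momentum_step(w, grad, velocity, lr, frozen=False):
    """One momentum-SGD update with L2 weight decay on a flat weight list."""
    if frozen:
        # A learning rate of 0 leaves frozen layers unchanged.
        return list(w), list(velocity)
    new_v = [SOLVER["momentum"] * v - lr * (g + SOLVER["weight_decay"] * wi)
             for wi, g, v in zip(w, grad, velocity)]
    new_w = [wi + v for wi, v in zip(w, new_v)]
    return new_w, new_v


total_iterations = sum(stage.iterations for stage in SCHEDULE)
w1, v1 = sgd_momentum_step([1.0], [0.5], [0.0], SOLVER["base_lr"])
```

Freezing the shared layers in stages (3) and (4) is what lets the two networks end up using one common set of convolutional features.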

Step 6) Test the roughly trained Faster R-CNN model on the training set, and select hard samples using the hard-sample discriminant of the present invention.

Step 7) Add the hard samples produced in step 6) to the training set, retrain the network, and repeat steps 5 to 7 to strengthen the network's ability to discriminate hard samples, finally obtaining the optimal Faster R-CNN model. The training process is shown in FIG. 2.
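The loop over steps 5 to 7 can be outlined as follows. Because the hard-sample discriminant is not reproduced in this text, it is passed in as a caller-supplied function; train, evaluate, is_hard, the candidate pool, and the round limit are all hypothetical placeholders, not the patent's definitions.

```python
def mine_hard_samples(train_set, candidates, train, evaluate, is_hard,
                      max_rounds=3):
    """Alternate training and hard-sample selection until no new ones appear."""
    model = None
    for _ in range(max_rounds):
        model = train(train_set)                # step 5: (re)train
        results = evaluate(model, candidates)   # step 6: test the model
        hard = [s for s, r in zip(candidates, results)
                if is_hard(r) and s not in train_set]
        if not hard:
            break
        train_set = train_set + hard            # step 7: add hard samples
    return model, train_set


# Stub functions standing in for real training and evaluation.
model, final_set = mine_hard_samples(
    train_set=[1, 2],
    candidates=[1, 6, 7],
    train=lambda samples: len(samples),
    evaluate=lambda model, samples: list(samples),
    is_hard=lambda result: result > 5,
)
```

In a real pipeline, is_hard would implement the discriminant built from L_IoU, L_score, o, k, and p as defined in step 6.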

The present invention takes full account of the missed detections of small targets that occur with the Faster R-CNN algorithm, and improves the accuracy of vehicle recognition in complex traffic scenes through depth image feature fusion and hard-sample mining.

The target detection algorithm based on convolutional neural networks used in the present invention can train on data flexibly, learning effective features on demand to complete a specific detection task. The R-CNN algorithm is a target detection algorithm that combines anchor bounding box proposals with a convolutional neural network; because the region proposal algorithm generates a large number of proposal boxes at a high time cost, it leaves considerable room for improvement in real-time performance and accuracy. The Faster R-CNN algorithm, which extracts proposal anchor bounding boxes with a shared convolutional network, markedly improves real-time performance: it abandons conventional region proposal algorithms and extracts anchor bounding boxes with the convolutional layers of the deep network, which greatly reduces time cost. However, in complex scenes with many small targets, missed detections are pronounced, so there is still considerable room for improvement.

The above is only a preferred embodiment of the present invention. A person of ordinary skill in the art may make improvements or modifications without departing from the technical principles of the present invention, and such improvements and modifications also fall within the scope of protection of the present invention.

Claims

A deep learning-based in-vehicle video target detection method comprising the following steps:
Step 1) Align the pixels in the depth coordinate system with the color coordinate system, extract features from the depth image and the color image separately with a CNN, and concatenate the feature maps output by the respective convolutional layers along the channel dimension to obtain the final RGB-D feature as a fused convolutional feature map, where the RGB-D feature obtained by the concatenation is the fused convolutional feature map shared by the RPN and Fast R-CNN and has the following matrix form:

\[ Y_{merge}(i, j, k) = \begin{cases} Y_{RGB}(i, j, k), & 0 \le k \le c-1 \\ Y_{depth}(i, j, k-c), & c \le k \le 2c-1 \end{cases} \]

where
i, j, k: index variables, with i ∈ [0, h-1], j ∈ [0, w-1], k ∈ [0, 2c-1]
h: height of the feature map
w: width of the feature map
c: number of channels (given as the three RGB channels)
Y_RGB(i, j, k): color image feature
Y_depth(i, j, k-c): depth image feature
Y_merge(i, j, k): concatenated (fused) image feature.
Create a region proposal network (RPN). The RPN contains one 3×3 convolutional layer and two parallel 1×1 convolutional layers. The fused convolutional feature map is input to the 3×3 convolutional layer, anchor points are set pixel by pixel on the input feature map for each preset size, and anchor bounding boxes of predetermined dimensions are generated at each anchor point.
The generated anchor bounding boxes are input to the two parallel 1×1 convolutional layers, which perform bounding box regression and foreground/background classification and output, respectively, the position and the foreground/background confidence of each anchor bounding box. From the anchor bounding boxes so obtained, a preset number of regions satisfying predetermined conditions are selected, yielding the final region proposal set C.
Step 2) Create a Fast R-CNN model.
The Fast R-CNN model consists of two ROI pooling layers, one fully connected layer, and two parallel fully connected layers, which output the confidence of the region and the regressed bounding box position, respectively. The fused convolutional feature map is input to the Fast R-CNN model, which outputs the position, class, and confidence of each target in the image.
Step 3) Construct the cost function for training the RPN and the cost function for training the Fast R-CNN network.
Step 4) Randomly initialize all new layers with weights drawn from a zero-mean Gaussian distribution whose standard deviation is set to a predetermined value.
Step 5) Train the model by alternately training the two networks, the RPN and Fast R-CNN, using the backpropagation algorithm and the stochastic gradient descent algorithm, adjusting the weights of the neural network layer by layer in turn according to preset parameters.
Step 6) Test the Faster R-CNN model that has been trained on the previously obtained training set, and select hard samples using the hard-sample discriminant.
The hard-sample discriminant is as follows:

where
L_IoU: bounding box regression error
L_score: classification error
o: overlap (intersection) ratio between the sample and the target
k: sensitivity coefficient with respect to the threshold
o and p take values in the range 0 to 1.
Step 7) Add the hard samples produced in step 6) to the training set, retrain the network, and repeat steps 5 to 7 to obtain the optimal Faster R-CNN model.
Step 8) Process the in-vehicle video images actually collected, input them to the trained Faster R-CNN model, and output the class, confidence, and position of each target in the image.

The in-vehicle video target detection method based on deep learning according to claim 1, wherein the cost function for training the RPN is as follows:

where
An anchor bounding box whose Intersection over Union (IoU) with the calibrated true data is the highest, or is at least 0.7, is labeled as a positive sample.
P_i: predicted confidence
P_i^*: label value; 1 denotes a positive sample and 0 a negative sample
i: index of the anchor bounding box
N_cls: total number of anchor bounding boxes
N_reg: number of positive samples
t_i: predicted correction of the anchor bounding box
t_i^*: actual correction of the anchor bounding box
L_cls: classification cost
L_reg: bounding box regression cost
λ: balancing weight

The in-vehicle video target detection method based on deep learning according to claim 1, wherein the cost function for training the Fast R-CNN network is as follows.

The in-vehicle video target detection method based on deep learning according to claim 1, wherein the specific steps of step 5) are as follows:
(1) Train the RPN model with the backpropagation algorithm and the stochastic gradient descent algorithm; this stage runs for 80,000 iterations.
(2) Use the anchor bounding boxes generated by the RPN as input to Fast R-CNN and train it independently; this stage runs for 40,000 iterations.
(3) Initialize the RPN from the result of (2), fix the shared convolutional layers (set their learning rate to 0), and update the parameters of the RPN; this stage runs for 80,000 iterations.
(4) Keep the shared convolutional layers fixed (learning rate 0), fine-tune the Fast R-CNN network, and update the parameters of its fully connected layers; this stage runs for 40,000 iterations.

The in-vehicle video target detection method based on deep learning according to claim 1, wherein the parameter settings in step 5) include setting the initial learning rate of the network to 0.01, the minimum learning rate to 0.0001, the momentum to 0.9, the weight decay coefficient to 0.0005, and the dropout value to 0.5.

The in-vehicle video target detection method based on deep learning according to claim 1, wherein the method of obtaining the training set and the test set in advance includes the following steps:
Create the training sample set and the test sample set from the KITTI dataset.
Using the PASCAL VOC format, convert the KITTI classes and divide the KITTI dataset into three classes: cars, pedestrians, and traffic signs.
Convert the label information: convert the label files from txt files to xml files, remove other information in the labels, retain only the three classes, and finally generate the training set and the test set.

The in-vehicle video target detection method based on deep learning according to claim 1, wherein the method of selecting a preset number of regions satisfying predetermined conditions is as follows:
The obtained anchor bounding boxes are sorted in descending order of softmax score, the top 2000 regions are retained, and a predetermined number of regions with the highest predicted confidence are then selected by the non-maximum suppression algorithm.

The in-vehicle video target detection method based on deep learning according to claim 1, wherein the standard deviation set in step 4) is 0.01.