CN107527031A - SSD-based indoor target detection method
- Publication number: CN107527031A (application number CN201710724937.7A)
- Authority
- CN
- China
- Prior art keywords
- module
- convolution
- modules
- dimension
- input
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/35—Categorising the entire scene, e.g. birthday party or wedding scene
- G06V20/36—Indoor scenes
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/52—Surveillance or monitoring of activities, e.g. for recognising suspicious objects
Abstract
The invention discloses an SSD-based indoor target detection method, belonging to the field of target recognition; it is a novel application of convolutional neural networks to indoor target detection. The invention addresses indoor target detection, but no annotated database meeting the requirements of the experiments existed, so a database of indoor targets was constructed. All images in the database are sampled from a wide-angle viewpoint, consistent with the normal field of view of an intelligent robot, while varying in background, illumination, and image size. Common indoor targets (refrigerators, televisions, beds, dining tables, chairs, sofas, coffee tables, toilets, washstands, bathtubs, cups, etc.) were manually annotated. The obtained images are used to train the feature extraction network and the detector, and the trained feature extraction network and detector are then used to detect the targets to be recognized.
Description
Technical Field
The present invention belongs to the field of target recognition and is a novel application of convolutional neural networks to indoor target detection.
Background Art
Indoor target recognition is the process of separating and identifying one target from other targets in an indoor environment. It is of great significance for applications such as the development of household service robots, the design of home monitoring systems, and the retrieval of interior design schemes. However, traditional methods meet neither the required detection accuracy nor the required speed, so a more efficient and accurate detection method is urgently needed.
In the field of computer vision, many methods can be applied to indoor target detection. Traditional target detection methods rely on hand-crafted features such as SIFT and HOG, combined with classifiers such as SVM and AdaBoost. However, these hand-designed features have limited power to represent images, so recognition accuracy and localization precision are often low and rarely meet the requirements of practical applications. Deep learning, a newer branch of machine learning, has brought clear improvements in both speed and accuracy. In recent years deep learning has developed rapidly and has shown good performance in automatic speech recognition, natural language processing, and image processing. In image processing, deep learning is widely used for image classification, target detection, and semantic segmentation; convolutional neural networks are its core and deserve much of the credit for its success. In target detection, models such as Faster R-CNN, YOLO, and SSD all achieve good results, with convolutional neural networks playing an irreplaceable role. At the same time, recurrent neural networks have also developed well in recent years. Recurrent convolutional networks have a unique advantage over convolutional neural networks in processing time sequences, and models such as LSTM have greatly alleviated the vanishing-gradient problem of recurrent networks.
For indoor target detection there are some relatively good existing solutions. Detection based on Faster R-CNN uses a shared convolutional network to form an RPN (Region Proposal Network) that directly predicts region proposals; most of the RPN's computation runs on the GPU and the convolutional network is shared with the Fast R-CNN part, which greatly speeds up detection, but it still cannot meet real-time requirements. Detection based on YOLO (You Only Look Once) regresses target locations directly, but because it uses no prior-box mechanism its detection accuracy is not very high. There is also the detection method based on SSD (Single Shot MultiBox Detector), which likewise regresses target locations and categories directly; its accuracy improves on YOLO but still falls short of the requirements of the present task.
Neural networks have millions of parameters and are prone to overfitting. To mitigate this, a model is generally pretrained on a large-scale dataset and then fine-tuned on the specific small-scale dataset. In addition, dropout can be used to disable some hidden-layer units during training, which also helps prevent overfitting.
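As a minimal illustration of these two measures, the sketch below assumes PyTorch and torchvision (the patent names no framework); the backbone choice, head sizes, and learning rate are placeholders, not values from the patent.

```python
import torch
import torch.nn as nn
from torchvision import models

# Pretrain-then-fine-tune with dropout, as described above (a sketch under
# assumed PyTorch/torchvision, not the patent's actual training code).
backbone = models.vgg16(pretrained=True)  # weights pretrained on a large dataset

# Replace the classifier head for the task's ten classes; nn.Dropout randomly
# disables hidden units during training to reduce overfitting.
backbone.classifier = nn.Sequential(
    nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True), nn.Dropout(p=0.5),
    nn.Linear(4096, 10),
)

# Fine-tune at a small learning rate on the task-specific small dataset.
optimizer = torch.optim.SGD(backbone.parameters(), lr=1e-4, momentum=0.9)
```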
Summary of the Invention
The present invention proposes an improved SSD-based indoor target detection method aimed at the application scenarios of home monitoring and service robots. Extensive experiments were carried out on the established database to detect common indoor furniture.
The present invention addresses indoor target detection. Since no annotated database meeting the requirements of the experiments existed, a database of indoor targets was constructed; all of its images are sampled from a wide-angle viewpoint, consistent with the normal field of view of an intelligent robot, while varying in background, illumination, and image size. Common indoor targets (refrigerators, televisions, beds, dining tables, chairs, sofas, coffee tables, toilets, washstands, bathtubs, cups, etc.) were manually annotated. The obtained images are used to train the feature extraction network and the detector, and the trained feature extraction network and detector are then used to detect the targets to be recognized. The technical solution of the present invention is therefore an SSD-based indoor target detection method comprising:
Step 1: acquire the indoor target image to be detected;
Step 2: build a feature extraction network and use it to extract the global features of the target image;
Step 3: feed the global features obtained in Step 2 into the SSD detector to obtain the corresponding detection results.
The method is characterized in that the feature extraction network of Step 2 comprises three input modules, first through eleventh convolution modules, first through fifth pooling modules, two context information extraction modules, and one normalization module. The three input modules are an input module for the image to be detected and first and second flag-bit information input modules; the image to be detected serves as the input of the first convolution module. The first convolution module, first pooling module, second convolution module, second pooling module, third convolution module, third pooling module, fourth convolution module, fourth pooling module, fifth convolution module, fifth pooling module, sixth convolution module, seventh convolution module, eighth convolution module, ninth convolution module, tenth convolution module, and eleventh convolution module are cascaded in sequence. In addition, the output of the fourth convolution module, together with the output of the first flag-bit information input module, is fed into the normalization module, whose output is fed into the first context information extraction module; likewise, the output of the seventh convolution module, together with the output of the second flag-bit information input module, is fed into the second context information extraction module. Finally, the outputs of the first and second context information extraction modules and of the eighth through eleventh convolution modules are taken as the extracted global features.
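The cascade and the two context taps can be summarized by the structural sketch below, assuming PyTorch; the flag-bit inputs are omitted, and the convolution, pooling, normalization, and context submodules are taken as given (their internals follow the SSD300/VGG16 layout that the text mirrors, which is an assumption).

```python
import torch.nn as nn

class FeatureExtractor(nn.Module):
    # Structural sketch of the claimed feature extraction network. convs holds
    # convolution modules 1..11, pools holds pooling modules 1..5; l2norm,
    # context1, and context2 are the normalization module and the two context
    # information extraction modules.
    def __init__(self, convs, pools, l2norm, context1, context2):
        super().__init__()
        self.convs = nn.ModuleList(convs)
        self.pools = nn.ModuleList(pools)
        self.l2norm = l2norm
        self.context1, self.context2 = context1, context2

    def forward(self, x):
        feats = []
        for i in range(5):              # conv modules 1..5, each followed by a pool
            x = self.convs[i](x)
            if i == 3:                  # tap the 4th conv module's output
                feats.append(self.context1(self.l2norm(x)))
            x = self.pools[i](x)
        x = self.convs[5](x)            # conv module 6
        x = self.convs[6](x)            # conv module 7
        feats.append(self.context2(x))  # second context tap
        for i in range(7, 11):          # conv modules 8..11
            x = self.convs[i](x)
            feats.append(x)
        return feats                    # six global feature maps for the SSD detector
```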
Further, each context information extraction module comprises two convolution modules, one cascade module, one horizontal feature extraction branch, and one vertical feature extraction branch. The output of the first convolution module is fed to both the horizontal and the vertical feature extraction branches; the outputs of the two branches are jointly fed to the cascade module, and the output of the cascade module is fed to the second convolution module. The cascade module is a single concatenation layer. This scheme extracts context information in both the horizontal and vertical directions, effectively capturing the global information of the image.
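A sketch of this module follows, again assuming PyTorch; `DirectionalLSTM` stands in for the two branches and is sketched after the branch description below, and the channel and hidden sizes are assumptions.

```python
import torch
import torch.nn as nn

class ContextModule(nn.Module):
    # 1x1 conv -> parallel horizontal/vertical branches -> concatenation
    # (the cascade module) -> 1x1 conv, as described above.
    def __init__(self, channels, hidden):
        super().__init__()
        self.conv_in = nn.Conv2d(channels, channels, kernel_size=1, stride=1, padding=0)
        self.row_branch = DirectionalLSTM(channels, hidden, axis="row")
        self.col_branch = DirectionalLSTM(channels, hidden, axis="col")
        self.conv_out = nn.Conv2d(2 * hidden, channels, kernel_size=1, stride=1, padding=0)

    def forward(self, x):
        x = self.conv_in(x)
        rows = self.row_branch(x)       # horizontal context
        cols = self.col_branch(x)       # vertical context
        return self.conv_out(torch.cat([rows, cols], dim=1))
```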
Further, the two convolution modules in the context information extraction module are both convolution layers with kernel size 1*1, stride 1, and padding 0. This scheme extracts image features effectively and helps keep the network from diverging.
Further, both the horizontal and the vertical feature extraction branches consist of a dimension conversion module, a dimension merging module, a recurrent convolution module, and a dimension restoration module cascaded in sequence. The dimension conversion module permutes the dimensions of the input feature map; the dimension merging module reduces the number of dimensions; and the dimension restoration module restores the reduced dimensions to their original number and order. The recurrent convolution module is an LSTM layer (long short-term memory network), whose processing direction is controlled by the dimension conversion module of the corresponding branch. This scheme makes full use of the recurrent convolutional network's ability to process time sequences, applying it flexibly to feature extraction from a single image, and LSTM effectively suppresses vanishing gradients.
Further, the input feature map has dimensions n*c*w*h, where n is the number of images, c the number of channels, w the image width, and h the image height. The dimension conversion module of the horizontal branch permutes the dimension order to h*n*w*c, and that of the vertical branch permutes it to w*n*h*c; the dimension merging modules of both branches merge the second and third dimensions into one. The main purpose of this scheme is to control the direction in which the LSTM inside the recurrent convolution module extracts image features: the horizontal feature extraction branch feeds each row of the image into the recurrent convolution module as a time sequence, and the vertical feature extraction branch feeds each column of the image into it as a time sequence.
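The branch referenced in the context-module sketch above can itself be sketched as follows. It keeps PyTorch's usual n*c*h*w layout rather than the n*c*w*h order stated in the claim, so the exact permutations differ in detail; the hidden size and the class name are assumptions.

```python
import torch.nn as nn

class DirectionalLSTM(nn.Module):
    # One feature extraction branch: permute dimensions (dimension conversion),
    # fold two dimensions together (dimension merging), run an LSTM along the
    # remaining spatial axis, then restore an n*channels*h*w layout
    # (dimension restoration).
    def __init__(self, channels, hidden, axis="row"):
        super().__init__()
        self.axis = axis
        self.lstm = nn.LSTM(channels, hidden)   # expects (seq, batch, features)

    def forward(self, x):
        n, c, h, w = x.shape
        if self.axis == "row":                  # each row as a time sequence over w
            seq = x.permute(3, 0, 2, 1).reshape(w, n * h, c)
            out, _ = self.lstm(seq)             # (w, n*h, hidden)
            out = out.reshape(w, n, h, -1).permute(1, 3, 2, 0)
        else:                                   # each column as a time sequence over h
            seq = x.permute(2, 0, 3, 1).reshape(h, n * w, c)
            out, _ = self.lstm(seq)             # (h, n*w, hidden)
            out = out.reshape(h, n, w, -1).permute(1, 3, 0, 2)
        return out                              # (n, hidden, h, w)
```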
Further, the first and second convolution modules each consist of two convolution layers with kernel size 3*3, stride 1, and padding 1; the third through fifth convolution modules each consist of three convolution layers with kernel size 3*3, stride 1, and padding 1; the sixth convolution module consists of one convolution layer with kernel size 3*3, stride 1, and padding 6; the seventh convolution module consists of one convolution layer with kernel size 1*1, stride 1, and padding 1; the eighth and ninth convolution modules each consist of one convolution layer with kernel size 1*1, stride 1, and padding 0 followed by one convolution layer with kernel size 3*3, stride 2, and padding 1; the tenth and eleventh convolution modules each consist of one convolution layer with kernel size 1*1, stride 1, and padding 0 followed by one convolution layer with kernel size 3*3, stride 1, and padding 0. This scheme extracts image features thoroughly, enabling the subsequent detector to detect targets effectively.
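A small helper in the same assumed PyTorch style can instantiate these modules. The kernel, stride, and padding values come from the paragraph above; the channel counts, and the dilation on the sixth module (which the standard SSD uses alongside padding 6 to preserve spatial size), are assumptions.

```python
import torch.nn as nn

def conv_module(specs, in_c):
    # Build one convolution module from a list of (out_channels, kernel,
    # stride, padding, dilation) layer specs, each layer followed by ReLU.
    layers = []
    for out_c, k, s, p, d in specs:
        layers += [nn.Conv2d(in_c, out_c, k, stride=s, padding=p, dilation=d),
                   nn.ReLU(inplace=True)]
        in_c = out_c
    return nn.Sequential(*layers)

conv1 = conv_module([(64, 3, 1, 1, 1)] * 2, in_c=3)      # two 3x3/s1/p1 layers
conv3 = conv_module([(256, 3, 1, 1, 1)] * 3, in_c=128)   # three 3x3/s1/p1 layers
conv6 = conv_module([(1024, 3, 1, 6, 6)], in_c=512)      # 3x3/s1/p6, dilation assumed
conv8 = conv_module([(256, 1, 1, 0, 1), (512, 3, 2, 1, 1)], in_c=1024)  # 1x1 then 3x3/s2
conv10 = conv_module([(128, 1, 1, 0, 1), (256, 3, 1, 0, 1)], in_c=512)  # 1x1 then 3x3/p0
```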
On the basis of existing detection methods, the present invention improves the feature extraction part, flexibly applying a recurrent convolutional network to add context information to the feature maps. Among existing methods, SSD is chosen as the detection scheme as the best balance of accuracy and speed. SSD draws on multiple feature maps; by attaching the proposed module to appropriately chosen feature maps, detection accuracy is improved while speed is maintained, better satisfying practical needs.
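For context on how the six feature maps are consumed, a minimal SSD-style multibox head is sketched below (PyTorch assumed; the anchor counts and channel sizes are illustrative, not values from the patent).

```python
import torch
import torch.nn as nn

class MultiBoxHead(nn.Module):
    # For each of the six feature maps, a 3x3 conv predicts, per anchor box,
    # 4 location offsets and (num_classes + 1) confidence scores.
    def __init__(self, in_channels, num_anchors, num_classes=10):
        super().__init__()
        self.loc = nn.ModuleList(
            nn.Conv2d(c, a * 4, 3, padding=1)
            for c, a in zip(in_channels, num_anchors))
        self.conf = nn.ModuleList(
            nn.Conv2d(c, a * (num_classes + 1), 3, padding=1)
            for c, a in zip(in_channels, num_anchors))

    def forward(self, feats):
        locs = [l(f).permute(0, 2, 3, 1).flatten(1) for l, f in zip(self.loc, feats)]
        confs = [c(f).permute(0, 2, 3, 1).flatten(1) for c, f in zip(self.conf, feats)]
        return torch.cat(locs, dim=1), torch.cat(confs, dim=1)
```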
Brief Description of the Drawings
Fig. 1 is the overall network structure of the present invention;
Fig. 2 is the context information extraction module of the present invention;
Fig. 3 shows some of the test results.
Detailed Description
The work of the present invention is divided into two parts, training and testing, and is organized into six steps:
Step 1, building the database: an indoor database is built for the problem under study. The images are selected from interior design websites; they have wide viewing angles, so the spatial relationships among common indoor targets are clearly visible. After excluding unsuitable images, more than 6000 images were collected, with two thirds used as training samples and one third as test samples. Because the sample count is small, random cropping and mirroring are applied to the images during training to enlarge the sample set and avoid overfitting.
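A minimal sketch of these augmentations, assuming torchvision (the crop size and parameters are placeholders); in a real detection pipeline the ground-truth boxes would have to be transformed together with the image.

```python
from torchvision import transforms

# Random crops plus horizontal mirroring, as described in Step 1
# (image-only sketch; box-aware augmentation is needed for detection).
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(300),      # 300x300 input assumed (SSD300 style)
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
])
```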
Step 2, annotating the sample targets: all images in the database are manually annotated, marking each target with its ground truth, i.e. its position and category; the categories are labeled 0, 1, 2, ..., 9, for ten categories in all.
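For illustration, one annotated image might be represented as below; the field names and label names are hypothetical, not the patent's actual storage format.

```python
# A hypothetical ground-truth record for one image: each object carries a
# class index in 0..9 and a bounding box (x1, y1, x2, y2 in pixels).
annotation = {
    "image": "living_room_0042.jpg",
    "objects": [
        {"class_id": 5, "name": "sofa", "bbox": [120, 210, 480, 430]},
        {"class_id": 6, "name": "coffee_table", "bbox": [300, 380, 520, 470]},
    ],
}
```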
Step 3, improving the SSD model: indoor objects are closely related to one another and to their environment, since they are placed according to habits shared by human beings. Exploiting this property of indoor objects, the SSD model is improved so that global information is added to the detection of a specific target. The feature extraction network extracts image features; exploiting the LSTM's ability to process time sequences, the feature maps are split by rows and by columns and fed into the recurrent convolutional network for processing, and the row-processed and column-processed feature maps are then concatenated and convolved into a single feature map, which is fed into the SSD detection network for detection. Because high-level features lose much of the global information, while very low-level features carry too much data and much redundant information, the present invention attaches the context information extraction modules to the feature maps output by the fourth and seventh convolution modules.
Step 4, pretraining the model: because the network model is large, has many parameters, and the samples are few, the model is first trained on the large ImageNet database to obtain a pretrained model and so prevent overfitting.
Step 5, training the improved model: building on Step 4, training continues on our own database. The data are preprocessed before training to enlarge the sample set, and training yields the final network model.
Step 6, testing the model: the trained model is tested on PASCAL VOC2007 and on our own database, yielding the positions and categories of the detected targets. The test results show that, compared with SSD, the present invention markedly improves indoor target detection accuracy while still completing detection quickly, satisfying the intended use.
Through the above six steps, the SSD-based indoor target detection method is improved: the context information of the image is exploited more fully, the network's target detection capability in indoor scenes is effectively improved, and its detection speed is preserved. Some test results are shown in Fig. 3.
Claims (6)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710724937.7A CN107527031B (en) | 2017-08-22 | 2017-08-22 | SSD-based indoor target detection method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710724937.7A CN107527031B (en) | 2017-08-22 | 2017-08-22 | SSD-based indoor target detection method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107527031A true CN107527031A (en) | 2017-12-29 |
CN107527031B CN107527031B (en) | 2021-02-05 |
Family
ID=60681878
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710724937.7A Active CN107527031B (en) | 2017-08-22 | 2017-08-22 | SSD-based indoor target detection method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107527031B (en) |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108256567A (en) * | 2018-01-12 | 2018-07-06 | 环球大数据科技有限公司 | A kind of target identification method and system based on deep learning |
CN108537117A (en) * | 2018-03-06 | 2018-09-14 | 哈尔滨思派科技有限公司 | A kind of occupant detection method and system based on deep learning |
CN108710897A (en) * | 2018-04-24 | 2018-10-26 | 江苏科海智能系统有限公司 | A kind of online general target detecting system in distal end based on SSD-T |
CN109241872A (en) * | 2018-08-20 | 2019-01-18 | 电子科技大学 | Image, semantic fast partition method based on multistage network |
CN109284760A (en) * | 2018-08-10 | 2019-01-29 | 杭州群核信息技术有限公司 | A kind of furniture detection method and device based on depth convolutional neural networks |
CN109308458A (en) * | 2018-08-31 | 2019-02-05 | 电子科技大学 | A method to improve the detection accuracy of small objects based on feature spectrum scale transformation |
CN109472315A (en) * | 2018-11-15 | 2019-03-15 | 江苏木盟智能科技有限公司 | A kind of object detection method and system separating convolution based on depth |
CN109741318A (en) * | 2018-12-30 | 2019-05-10 | 北京工业大学 | A single-stage multi-scale real-time detection method for specific targets based on effective receptive field |
CN109840502A (en) * | 2019-01-31 | 2019-06-04 | 深兰科技(上海)有限公司 | A kind of method and apparatus carrying out target detection based on SSD model |
CN110222593A (en) * | 2019-05-18 | 2019-09-10 | 四川弘和通讯有限公司 | A kind of vehicle real-time detection method based on small-scale neural network |
CN111460919A (en) * | 2020-03-13 | 2020-07-28 | 华南理工大学 | A monocular vision road target detection and distance estimation method based on improved YOLOv3 |
CN112446372A (en) * | 2020-12-08 | 2021-03-05 | 电子科技大学 | Text detection method based on channel grouping attention mechanism |
TWI738009B (en) * | 2019-06-20 | 2021-09-01 | 和碩聯合科技股份有限公司 | Object detection system and object detection method |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103198489A (en) * | 2013-04-27 | 2013-07-10 | 哈尔滨工业大学 | Automatic detection method of salient object based on salience density and edge response |
CN106845549A (en) * | 2017-01-22 | 2017-06-13 | 珠海习悦信息技术有限公司 | A kind of method and device of the scene based on multi-task learning and target identification |
US10216949B1 (en) * | 2013-09-20 | 2019-02-26 | Amazon Technologies, Inc. | Dynamic quorum membership changes |
- 2017-08-22: CN application CN201710724937.7A granted as CN107527031B (Active)
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103198489A (en) * | 2013-04-27 | 2013-07-10 | 哈尔滨工业大学 | Automatic detection method of salient object based on salience density and edge response |
US10216949B1 (en) * | 2013-09-20 | 2019-02-26 | Amazon Technologies, Inc. | Dynamic quorum membership changes |
CN106845549A (en) * | 2017-01-22 | 2017-06-13 | 珠海习悦信息技术有限公司 | A kind of method and device of the scene based on multi-task learning and target identification |
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108256567B (en) * | 2018-01-12 | 2020-08-25 | 环球大数据科技有限公司 | Target identification method and system based on deep learning |
CN108256567A (en) * | 2018-01-12 | 2018-07-06 | 环球大数据科技有限公司 | A kind of target identification method and system based on deep learning |
CN108537117A (en) * | 2018-03-06 | 2018-09-14 | 哈尔滨思派科技有限公司 | A kind of occupant detection method and system based on deep learning |
CN108710897A (en) * | 2018-04-24 | 2018-10-26 | 江苏科海智能系统有限公司 | A kind of online general target detecting system in distal end based on SSD-T |
CN109284760A (en) * | 2018-08-10 | 2019-01-29 | 杭州群核信息技术有限公司 | A kind of furniture detection method and device based on depth convolutional neural networks |
CN109241872A (en) * | 2018-08-20 | 2019-01-18 | 电子科技大学 | Image, semantic fast partition method based on multistage network |
CN109241872B (en) * | 2018-08-20 | 2022-03-18 | 电子科技大学 | Image semantic fast segmentation method based on multistage network |
CN109308458A (en) * | 2018-08-31 | 2019-02-05 | 电子科技大学 | A method to improve the detection accuracy of small objects based on feature spectrum scale transformation |
CN109308458B (en) * | 2018-08-31 | 2022-03-15 | 电子科技大学 | A method to improve the detection accuracy of small objects based on feature spectrum scale transformation |
CN109472315B (en) * | 2018-11-15 | 2021-09-24 | 江苏木盟智能科技有限公司 | Target detection method and system based on depth separable convolution |
CN109472315A (en) * | 2018-11-15 | 2019-03-15 | 江苏木盟智能科技有限公司 | A kind of object detection method and system separating convolution based on depth |
CN109741318A (en) * | 2018-12-30 | 2019-05-10 | 北京工业大学 | A single-stage multi-scale real-time detection method for specific targets based on effective receptive field |
CN109741318B (en) * | 2018-12-30 | 2022-03-29 | 北京工业大学 | Real-time detection method of single-stage multi-scale specific target based on effective receptive field |
CN109840502B (en) * | 2019-01-31 | 2021-06-15 | 深兰科技(上海)有限公司 | Method and device for target detection based on SSD model |
CN109840502A (en) * | 2019-01-31 | 2019-06-04 | 深兰科技(上海)有限公司 | A kind of method and apparatus carrying out target detection based on SSD model |
CN110222593A (en) * | 2019-05-18 | 2019-09-10 | 四川弘和通讯有限公司 | A kind of vehicle real-time detection method based on small-scale neural network |
TWI738009B (en) * | 2019-06-20 | 2021-09-01 | 和碩聯合科技股份有限公司 | Object detection system and object detection method |
CN111460919A (en) * | 2020-03-13 | 2020-07-28 | 华南理工大学 | A monocular vision road target detection and distance estimation method based on improved YOLOv3 |
CN111460919B (en) * | 2020-03-13 | 2023-05-26 | 华南理工大学 | Monocular vision road target detection and distance estimation method based on improved YOLOv3 |
CN112446372A (en) * | 2020-12-08 | 2021-03-05 | 电子科技大学 | Text detection method based on channel grouping attention mechanism |
Also Published As
Publication number | Publication date |
---|---|
CN107527031B (en) | 2021-02-05 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |