WO2022141721A1 - Multimodal unsupervised pedestrian pixel-level semantic labeling method and system - Google Patents

Multimodal unsupervised pedestrian pixel-level semantic labeling method and system

Info

Publication number
WO2022141721A1
Authority
WO
WIPO (PCT)
Prior art keywords
point cloud
image acquisition
cloud information
acquisition device
information
Application number
PCT/CN2021/074232
Other languages
French (fr)
Chinese (zh)
Inventor
彭鹭斌
苏松志
苏松剑
蔡国榕
陈延艺
陈延行
Original Assignee
罗普特科技集团股份有限公司
罗普特(厦门)系统集成有限公司
Application filed by 罗普特科技集团股份有限公司 and 罗普特(厦门)系统集成有限公司
Publication of WO2022141721A1 publication Critical patent/WO2022141721A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/103 Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00 Image enhancement or restoration
    • G06T 5/20 Image enhancement or restoration using local operators
    • G06T 5/30 Erosion or dilatation, e.g. thinning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/30 Determination of transform parameters for the alignment of images, i.e. image registration
    • G06T 7/33 Determination of transform parameters for the alignment of images, i.e. image registration using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10048 Infrared image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30196 Human being; Person

Definitions

  • the present disclosure relates to the technical field of object detection, in particular to a multimodal unsupervised pedestrian pixel-level semantic labeling method and system.
  • Pedestrian detection is a classic problem in computer vision, and its related technologies can be applied in fields such as video surveillance and autonomous driving.
  • the current common method is to first capture a large number of samples containing pedestrians, then manually mark the pedestrians' positions in the pictures as training data; finally, supervised learning methods (such as support vector machines or deep learning) are used to train a classifier that distinguishes pedestrian from non-pedestrian areas.
  • according to the format of the input data, pedestrian detection technology can be divided into methods based on two-dimensional images (color and grayscale), methods based on three-dimensional point clouds, and methods based on infrared imaging. From a technical point of view, it can be divided into holistic methods, part-based methods and local-patch methods. Most of the above methods rely on supervised classification techniques from machine learning, which require the positions of pedestrians in the pictures to be annotated and therefore consume considerable human, material and financial resources.
  • the present disclosure proposes a multimodal unsupervised pedestrian pixel-level semantic annotation method and system, which eliminates the trouble of manually labeling pedestrian samples.
  • a multimodal unsupervised pedestrian pixel-level semantic annotation method including:
  • S1: Perform three-dimensional reconstruction on the unmanned monitoring scene, and obtain the initial point cloud information of the monitoring scene;
  • S2: Use the Tof image acquisition device to obtain the first point cloud information in the monitoring scene, register it with the initial point cloud information and perform a set difference operation to obtain the second point cloud information, and project the second point cloud information onto the horizontal plane to obtain the person point cloud information set;
  • step S1 specifically includes:
  • the Structure-from-Motion three-dimensional reconstruction algorithm is applied to the M images to reconstruct the monitoring scene in three dimensions and obtain the initial point cloud information.
  • with the SfM algorithm, the three-dimensional structure can be recovered from the projected two-dimensional motion field of a moving object or scene.
  • the first point cloud information and the initial point cloud information are registered using an iterative closest point algorithm. With this step, images acquired by different acquisition devices can be registered.
  • the second point cloud information is projected onto the XY plane of the three-dimensional coordinate system, several circular areas are obtained based on the Hough transform, and the point cloud information belonging to the same circular area is added to the person point cloud information set.
  • after the binarized image is dilated and eroded, the method further includes removing regions whose pixel area is smaller than the second threshold. With this step, the connected regions can be obtained from the image.
  • the first threshold is taken from the range of 20*40 to 80*160;
  • the second threshold is taken from the range of 1000 to 8196.
  • a Tof image acquisition device, an infrared image acquisition device and an RGB image acquisition device are respectively installed in the monitoring scene, and the initial point cloud information is used to calculate the positional relationship and attitude information of each of the three devices.
  • using the positional relationship and attitude information of the image acquisition devices of the three modalities facilitates the later conversion of feature point clouds.
  • the specific acquisition methods of the positional relationship and attitude information include:
  • the first transformation matrix between the Tof image acquisition device and the infrared image acquisition device, the second transformation matrix between the Tof image acquisition device and the RGB image acquisition device, and the third transformation matrix between the infrared image acquisition device and the RGB image acquisition device are obtained according to the positional relationship, the attitude information and the internal parameters of the image acquisition devices.
  • step S4 specifically includes:
  • an intersection operation is performed on the pixels of the first projection region set and the second projection region set.
  • a computer-readable storage medium having stored thereon one or more computer programs that, when executed by a computer processor, implement any of the methods described above.
  • a multimodal unsupervised pedestrian pixel-level semantic annotation system comprising:
  • Initial point cloud information acquisition unit: configured to perform three-dimensional reconstruction on an unmanned monitoring scene and obtain the initial point cloud information of the monitoring scene;
  • Person point cloud information set acquisition unit: configured to use the Tof image acquisition device to obtain the first point cloud information in the monitoring scene, register it with the initial point cloud information and perform a set difference operation to obtain the second point cloud information, and project the second point cloud information onto the horizontal plane to obtain the person point cloud information set;
  • Connected region information set acquisition unit: configured to dilate and erode the binarized image obtained by thresholding the scene information acquired by the infrared image acquisition device, to obtain the connected region information set;
  • Human body region set acquisition unit: configured to project the person point cloud information set and the connected region information set into the image plane space of the RGB image acquisition device, using the calibrated positional relationship between the cameras, to perform a set intersection operation, and to obtain the corresponding human body region set in response to the common pixels exceeding the first threshold.
  • the present disclosure proposes a multimodal unsupervised pedestrian pixel-level semantic annotation method and system, which combines the advantages of the different camera modalities of the Tof image acquisition device, the infrared image acquisition device and the RGB image acquisition device, and can effectively extract the human-body pixels in a scene.
  • pixel-level annotation information can be automatically provided for use by machine learning algorithms.
  • FIG. 1 is an exemplary system architecture diagram to which the present application can be applied;
  • FIG. 2 is a flowchart of a multimodal unsupervised pedestrian pixel-level semantic labeling method according to an embodiment of the present application
  • FIG. 3 is a frame diagram of a multimodal unsupervised pedestrian pixel-level semantic annotation system according to an embodiment of the present application
  • FIG. 4 is a schematic structural diagram of a computer system suitable for implementing the electronic device according to the embodiment of the present application.
  • FIG. 1 shows an exemplary system architecture 100 to which the multimodal unsupervised pixel-level semantic annotation method for pedestrians according to embodiments of the present application can be applied.
  • the system architecture 100 may include a data server 101 , a network 102 and a main server 103 .
  • the network 102 is the medium used to provide the communication link between the data server 101 and the main server 103 .
  • the network 102 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
  • the main server 103 may be a server that provides various services, such as a data processing server that processes the information uploaded by the data server 101 .
  • the data processing server can detect pedestrians and store the detection results in the database.
  • the multimodal unsupervised pedestrian pixel-level semantic annotation method provided by the embodiments of the present application is generally executed by the main server 103; accordingly, the apparatus for semantic analysis of small data sets is generally installed in the main server 103.
  • the data server and the main server may be hardware or software.
  • when they are hardware, each can be implemented as a distributed server cluster composed of multiple servers, or as a single server.
  • when they are software, each can be implemented as multiple pieces of software or software modules (such as software or software modules for providing distributed services), or as a single piece of software or software module.
  • FIG. 2 shows a flowchart of the multi-modal unsupervised pedestrian pixel-level semantic annotation method according to an embodiment of the present application. As shown in Figure 2, the method includes:
  • S201: Perform three-dimensional reconstruction on an unmanned monitoring scene, and obtain initial point cloud information of the monitoring scene.
  • in the unmanned monitoring scene, the Structure-from-Motion three-dimensional reconstruction algorithm is applied to the M images to reconstruct the monitoring scene in three dimensions and obtain the initial point cloud information.
  • the goal of Structure from Motion (SfM) is to automatically recover camera motion and scene structure using two or more scenes. It is a self-calibration technology that can automatically complete camera tracking and motion matching.
  • S202: Use the Tof image acquisition device to obtain the first point cloud information in the monitoring scene, register it with the initial point cloud information and perform a set difference operation to obtain the second point cloud information, and project the second point cloud information onto the horizontal plane to obtain the person point cloud information set.
  • the first point cloud information and the initial point cloud information are registered by the iterative closest point algorithm.
  • in this step, pedestrians are allowed to enter the monitoring scene; the second point cloud information is projected onto the XY plane of the three-dimensional coordinate system established in step S201, several circular areas are obtained based on the Hough transform, and the point cloud information belonging to the same circular area is added to the person point cloud information set.
  • S203: Dilate and erode the binarized image obtained by thresholding the scene information acquired by the infrared image acquisition device, to obtain the connected region information set. After dilation and erosion, regions whose pixel area is smaller than the second threshold are further removed, where the second threshold is taken from the range of 1000 to 8196.
  • S204: Using the calibrated positional relationship between the cameras, respectively project the person point cloud information set and the connected region information set into the image plane space of the RGB image acquisition device to perform a set intersection operation; in response to the common pixels exceeding the first threshold, the corresponding human body region set is obtained.
  • a Tof image acquisition device, an infrared image acquisition device and an RGB image acquisition device are respectively installed in the monitoring scene, and the initial point cloud information is used to calculate the positional relationship and attitude information of the Tof image acquisition device, the infrared image acquisition device and the RGB image acquisition device.
  • according to the positional relationship, the attitude information and the internal parameters of the image acquisition devices, the first transformation matrix between the Tof image acquisition device and the infrared image acquisition device, the second transformation matrix between the Tof image acquisition device and the RGB image acquisition device, and the third transformation matrix between the infrared image acquisition device and the RGB image acquisition device are obtained.
  • the multimodal unsupervised pedestrian pixel-level semantic annotation method can specifically implement pedestrian detection and automatic annotation through the following steps, using three acquisition devices: a Time-of-Flight camera (camera A), an infrared thermal imaging camera (camera B) and an RGB color camera (camera C).
  • in the following description, they are referred to as camera A, camera B and camera C.
  • Step 1: Select a monitoring scene, take an arbitrary point P on the ground, and establish a three-dimensional coordinate system XYZ, where the X axis lies in the horizontal plane and points in a certain direction, the Z axis is perpendicular to the ground and points toward the center of the earth, and the Y axis lies in the horizontal plane perpendicular to the X axis, with its direction determined by the right-hand rule;
  • Step 2: In the X-axis direction, select a point every 100 cm, for a total of m points, as the horizontal shooting positions of camera C, denoted Q1, Q2, ..., Qm; in the Z-axis direction, select a point every 50 cm, denoted P1, P2, ..., Pn, as the vertical shooting heights of the camera; at each of the m*n positions, select one shooting angle every k degrees for the pitch, yaw and roll angles respectively.
  • Step 3 Use the Structure-from-Motion technology to reconstruct the scene three-dimensionally for the M images obtained in Step 2, so as to obtain the point cloud information of the scene, which is recorded as Scene_Point_Cloud_BG.
  • Step 4: Install camera A, camera B and camera C in the scene respectively, and use the scene point cloud information from Step 3 to calculate the mutual positional relationship between cameras A, B and C. In this step, it must be ensured that there are no moving objects in the scene.
  • step 4c: According to the camera pose information obtained in steps 4a and 4b, and the respective internal parameters of cameras A, B and C, obtain the transformation matrix Tab between camera A and camera B, the transformation matrix Tac between camera A and camera C, and the transformation matrix Tbc between camera B and camera C.
  • Step 5 After completing the above steps 1-4, open the scene and allow pedestrians to enter the scene.
  • Use camera A to obtain the 3D point cloud information Scene_Point_Cloud_New in the scene, and use the ICP algorithm again to register Scene_Point_Cloud_New and Scene_Point_Cloud_BG; after registration, perform the set difference operation on the two point cloud sets to obtain a new point cloud Scene_Point_Cloud_FG.
  • Project Scene_Point_Cloud_FG on the XY plane and obtain several circular areas C1, C2, ..., Cp based on Hough transform.
  • the point cloud information corresponding to the same circular area Ci is denoted as Person_i.
  • Step 6: Use the scene information captured by camera B to obtain a binarized image Camera_B_Image_Binary after thresholding; perform dilation and erosion operations on Camera_B_Image_Binary, and remove regions whose pixel area is smaller than the threshold thr (set according to the actual scene, in the range 1000-8196); record the obtained connected-region information as R1, R2, ..., Rq.
  • Step 8: Perform a set intersection operation on the two region sets obtained in Step 7: {Region_From_A_1, ..., Region_From_A_p} and {Region_From_B_1, ..., Region_From_B_q}.
  • when the number of common pixels between Region_From_A_i and Region_From_B_j exceeds the threshold thr_region (set from 20x40 to 80x160), the corresponding human body region Region_From_C_k is obtained, where k ≥ 1 and k ≤ min(p, q).
  • FIG. 3 shows a frame diagram of a multi-modal unsupervised pixel-level semantic annotation system for pedestrians according to an embodiment of the present application.
  • the system specifically includes an initial point cloud information acquisition unit 301 , a person point cloud information collection acquisition unit 302 , a connected area information collection acquisition unit 303 , and a human body area collection acquisition unit 304 .
  • the initial point cloud information acquisition unit 301 is configured to perform three-dimensional reconstruction of an unmanned monitoring scene, and obtain initial point cloud information of the monitoring scene;
  • the person point cloud information set acquisition unit 302 is configured to use the Tof image acquisition device to obtain the first point cloud information in the monitoring scene, register it with the initial point cloud information and perform a set difference operation to obtain the second point cloud information, and project the second point cloud information onto the horizontal plane to obtain the person point cloud information set;
  • the connected region information set acquisition unit 303 is configured to dilate and erode the binarized image obtained by thresholding the scene information acquired by the infrared image acquisition device, to obtain the connected region information set;
  • the human body region set acquisition unit 304 is configured to project the person point cloud information set and the connected region information set into the image plane space of the RGB image acquisition device, using the calibrated positional relationship between the cameras, to perform a set intersection operation, and to obtain the corresponding human body region set in response to the common pixels exceeding the first threshold.
  • FIG. 4 shows a schematic structural diagram of a computer system 400 suitable for implementing the electronic device of the embodiment of the present application.
  • the electronic device shown in FIG. 4 is only an example, and should not impose any limitations on the functions and scope of use of the embodiments of the present application.
  • a computer system 400 includes a central processing unit (CPU) 401, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 402 or a program loaded from a storage section 408 into a random access memory (RAM) 403.
  • in the RAM 403, various programs and data required for the operation of the system 400 are also stored.
  • the CPU 401, the ROM 402, and the RAM 403 are connected to each other through a bus 404.
  • An input/output (I/O) interface 405 is also connected to the bus 404.
  • the following components are connected to the I/O interface 405: an input section 406 including a keyboard, a mouse, and the like; an output section 407 including a liquid crystal display (LCD), a speaker, and the like; a storage section 408 including a hard disk and the like; and a communication section 409 including a network interface card, a modem, and the like.
  • the communication section 409 performs communication processing via a network such as the Internet.
  • a drive 410 is also connected to the I/O interface 405 as needed.
  • a removable medium 411 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, etc., is mounted on the drive 410 as needed so that a computer program read therefrom is installed into the storage section 408 as needed.
  • embodiments of the present disclosure include a computer program product comprising a computer program carried on a computer-readable storage medium, the computer program containing program code for performing the method illustrated in the flowchart.
  • the computer program may be downloaded and installed from the network via the communication portion 409 and/or installed from the removable medium 411 .
  • the computer-readable storage medium of the present application may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the above two.
  • the computer-readable storage medium can be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
  • a computer-readable storage medium can be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code therein. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • a computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer-readable storage medium may be transmitted using any suitable medium including, but not limited to, wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
  • Computer program code for performing the operations of the present application may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
  • each block in the flowchart or block diagrams may represent a module, a program segment, or a portion of code that contains one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks therein, can be implemented by dedicated hardware-based systems that perform the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
  • the modules involved in the embodiments of the present application may be implemented in software or in hardware.
  • the present application also provides a computer-readable storage medium.
  • the computer-readable storage medium may be included in the electronic device described in the above embodiments, or may exist separately without being assembled into the electronic device.
  • the above computer-readable storage medium carries one or more programs, and when the one or more programs are executed by the electronic device, the electronic device: performs three-dimensional reconstruction on the unmanned monitoring scene and obtains the initial point cloud information of the monitoring scene; uses the Tof image acquisition device to obtain the first point cloud information in the monitoring scene, registers it with the initial point cloud information and performs a set difference operation to obtain the second point cloud information, and projects the second point cloud information onto the horizontal plane to obtain the person point cloud information set; dilates and erodes the binarized image obtained by thresholding the scene information acquired by the infrared image acquisition device to obtain the connected region information set; and projects the person point cloud information set and the connected region information set into the image plane space of the RGB image acquisition device to perform a set intersection operation.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

Provided in the present disclosure is a multimodal unsupervised pedestrian pixel-level semantic labeling method and system, the method comprising: performing three-dimensional reconstruction on an unmanned monitoring scene, and acquiring initial point cloud information of the monitoring scene; acquiring first point cloud information in the monitoring scene by using a Tof image acquisition device, registering the first point cloud information with the initial point cloud information and then executing a set difference operation to acquire second point cloud information, and projecting the second point cloud information onto a horizontal plane to obtain a person point cloud information set; dilating and eroding a binarized image obtained by thresholding scene information acquired by an infrared image acquisition device to obtain a connected area information set; using the positional relationships between calibrated cameras to project the person point cloud information set and the connected area information set into the image plane space of an RGB image acquisition device so as to perform a set intersection operation, and acquiring a corresponding human body area set when the common pixels exceed a first threshold. The method and system fully integrate the advantages of cameras having different modalities, and can effectively extract human pixel points in the scene.

Description

A Multimodal Unsupervised Pedestrian Pixel-Level Semantic Annotation Method and System
Related Applications
This application claims priority to Chinese Patent Application No. 202011615688.6, filed on December 30, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the technical field of object detection, and in particular to a multimodal unsupervised pedestrian pixel-level semantic labeling method and system.
Background
Pedestrian detection is a classic problem in computer vision, and its related technologies can be applied in fields such as video surveillance and autonomous driving. The current common method is to first capture a large number of samples containing pedestrians, then manually mark the pedestrians' positions in the pictures as training data; finally, supervised learning methods (such as support vector machines or deep learning) are used to train a classifier that distinguishes pedestrian from non-pedestrian areas. With the development of deep learning technology, the number of required training samples keeps growing, and labeling a large number of samples is time-consuming and labor-intensive.
According to the format of the input data, pedestrian detection technology can be divided into methods based on two-dimensional images (color and grayscale), methods based on three-dimensional point clouds, and methods based on infrared imaging. From a technical point of view, it can be divided into holistic methods, part-based methods and local-patch methods. Most of the above methods rely on supervised classification techniques from machine learning, which require the positions of pedestrians in the pictures to be annotated and therefore consume considerable human, material and financial resources.
公开内容public content
为了解决现有技术中需要耗费大量的人力物力和财力标注出行人在图片中的位置的技术问题,本公开提出了一种多模态无监督的行人像素级语义标注方法和系统,免去人工标注行人样本的麻烦。In order to solve the technical problem in the prior art that a lot of manpower, material resources and financial resources are required to mark the location of pedestrians in pictures, the present disclosure proposes a multimodal unsupervised pixel-level semantic annotation method and system for pedestrians, which eliminates the need for manual The trouble of labeling pedestrian samples.
根据本公开的一个方面,提出了一种多模态无监督的行人像素级语义标注方法,包括:According to an aspect of the present disclosure, a multimodal unsupervised pedestrian pixel-level semantic annotation method is proposed, including:
S1:对无人的监控场景进行三维重建,获取监控场景的初始点云信息;S1: Perform 3D reconstruction on the unmanned monitoring scene, and obtain the initial point cloud information of the monitoring scene;
S2:利用Tof图像采集设备获取监控场景中的第一点云信息,将其与初始点云信息配准后进行集合的差运算,获得第二点云信息,并将第二点云信息在水平面上进行投影,获得人员点云信息集合;S2: Use the Tof image acquisition device to obtain the first point cloud information in the monitoring scene, register it with the initial point cloud information, and perform a set difference operation to obtain the second point cloud information, and place the second point cloud information on the horizontal plane. Projection on it to obtain a collection of personnel point cloud information;
S3:对红外图像采集设备获取的场景信息阈值化后的二值化图像进行膨胀和腐蚀,获得连通区域信息集合;以及S3: Dilate and corrode the binarized image obtained by the thresholding of the scene information obtained by the infrared image acquisition device to obtain a set of connected area information; and
S4:分别将人员点云信息集合和连通区域信息集合,利用已经标定的相机之间的位置关系,投影到RGB图像采集设备的图像平面空间中进行集合的交集运算,响应于共同像素超过第一阈值时,获取对应的人体区域集合。S4: respectively project the personnel point cloud information set and the connected area information set into the image plane space of the RGB image acquisition device by using the positional relationship between the cameras that have been calibrated to perform the intersection operation of the sets, and in response to the common pixel exceeding the first When the threshold is set, the corresponding set of human body regions is obtained.
In some specific embodiments, step S1 specifically includes:
taking an arbitrary origin in the unmanned monitoring scene and establishing a three-dimensional coordinate system;
setting m*n points at intervals in the x-axis and z-axis directions as the image acquisition positions of the RGB image acquisition device, selecting shooting angles every k degrees for the pitch, yaw and roll angles respectively, and collecting M=m*n*(180/k)*(180/k)*(180/k) images;
applying the Structure-from-Motion three-dimensional reconstruction algorithm to the M images to reconstruct the monitoring scene in three dimensions and obtain the initial point cloud information. With the SfM algorithm, the three-dimensional structure can be recovered from the projected two-dimensional motion field of a moving object or scene.
In some specific embodiments, the first point cloud information and the initial point cloud information are registered using the iterative closest point algorithm. With this step, images acquired by different acquisition devices can be registered.
In some specific embodiments, the second point cloud information is projected onto the XY plane of the three-dimensional coordinate system, several circular areas are obtained based on the Hough transform, and the point cloud information belonging to the same circular area is added to the person point cloud information set.
In some specific embodiments, after the binarized image is dilated and eroded in step S3, the method further includes removing regions whose pixel area is smaller than a second threshold. With this step, the connected regions can be obtained from the image.
In some specific embodiments, the first threshold is taken from the range of 20*40 to 80*160, and the second threshold from the range of 1000 to 8196.
In some specific embodiments, a Tof image acquisition device, an infrared image acquisition device and an RGB image acquisition device are respectively installed in the monitoring scene, and the initial point cloud information is used to calculate the positional relationship and attitude information of the Tof image acquisition device, the infrared image acquisition device and the RGB image acquisition device. Using the positional relationship and attitude information of the image acquisition devices of the three modalities facilitates the later conversion of feature point clouds.
In some specific embodiments, the positional relationship and attitude information are obtained as follows:
using the Tof image acquisition device to obtain a depth image of the monitoring scene, and combining it with the initial point cloud information to obtain the degree-of-freedom pose of the Tof image acquisition device in the monitoring scene by means of the iterative closest point algorithm;
using the infrared image acquisition device and the RGB image acquisition device to obtain color images of the monitoring scene, and, using SIFT descriptors and the Bag-of-Words feature description algorithm, obtaining the position and attitude information of the infrared image acquisition device and the RGB image acquisition device from the collected images and the initial point cloud information based on the Bundle Adjustment algorithm.
In some specific embodiments, according to the positional relationship, the attitude information and the internal parameters of the image acquisition devices, a first transformation matrix between the Tof image acquisition device and the infrared image acquisition device, a second transformation matrix between the Tof image acquisition device and the RGB image acquisition device, and a third transformation matrix between the infrared image acquisition device and the RGB image acquisition device are obtained.
In some specific embodiments, step S4 specifically includes:
using the second transformation matrix to project the person point cloud information into the image plane space of the RGB image acquisition device according to the camera imaging principle, to obtain a first projection region set;
using the third transformation matrix to project the connected region information into the image plane space of the RGB image acquisition device, to obtain a second projection region set;
performing an intersection operation on the pixels of the first projection region set and the second projection region set.
According to a second aspect of the present disclosure, a computer-readable storage medium is proposed, on which one or more computer programs are stored; when executed by a computer processor, the one or more computer programs implement any of the methods described above.
According to a third aspect of the present disclosure, a multimodal unsupervised pedestrian pixel-level semantic annotation system is proposed, the system including:
an initial point cloud information acquisition unit, configured to perform three-dimensional reconstruction on an unmanned monitoring scene and obtain initial point cloud information of the monitoring scene;
a person point cloud information set acquisition unit, configured to use a Tof image acquisition device to obtain first point cloud information in the monitoring scene, register it with the initial point cloud information and perform a set difference operation to obtain second point cloud information, and project the second point cloud information onto the horizontal plane to obtain a person point cloud information set;
a connected region information set acquisition unit, configured to dilate and erode the binarized image obtained by thresholding the scene information acquired by an infrared image acquisition device, to obtain a connected region information set; and
a human body region set acquisition unit, configured to project the person point cloud information set and the connected region information set into the image plane space of an RGB image acquisition device, using the calibrated positional relationship between the cameras, to perform a set intersection operation, and to obtain the corresponding human body region set in response to the common pixels exceeding a first threshold.
The present disclosure proposes a multimodal unsupervised pedestrian pixel-level semantic annotation method and system that combines the advantages of the different camera modalities of the Tof image acquisition device, the infrared image acquisition device and the RGB image acquisition device, and can effectively extract the human-body pixels in a scene. In pedestrian detection tasks, pixel-level annotation information can be provided automatically for use by machine learning algorithms.
Description of the Drawings
The accompanying drawings are included to provide a further understanding of the embodiments; they are incorporated into and constitute a part of this specification, illustrate the embodiments and, together with the description, serve to explain the principles of the present disclosure. Other embodiments and many of the intended advantages of the embodiments will be readily appreciated as they become better understood by reference to the following detailed description. Other features, objects and advantages of the present application will become more apparent from the detailed description of non-limiting embodiments made with reference to the following drawings:
FIG. 1 is an exemplary system architecture diagram to which the present application can be applied;
FIG. 2 is a flowchart of a multimodal unsupervised pedestrian pixel-level semantic annotation method according to an embodiment of the present application;
FIG. 3 is a frame diagram of a multimodal unsupervised pedestrian pixel-level semantic annotation system according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a computer system suitable for implementing the electronic device of an embodiment of the present application.
Detailed Description
The present application will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the related disclosure, not to limit it. It should also be noted that, for convenience of description, only the parts related to the relevant disclosure are shown in the drawings.
It should be noted that, in the case of no conflict, the embodiments in the present application and the features in the embodiments may be combined with each other. The present application will be described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.
FIG. 1 shows an exemplary system architecture 100 to which the multimodal unsupervised pedestrian pixel-level semantic annotation method of embodiments of the present application can be applied.
As shown in FIG. 1, the system architecture 100 may include a data server 101, a network 102 and a main server 103. The network 102 is the medium that provides the communication link between the data server 101 and the main server 103, and may include various connection types, such as wired or wireless communication links or fiber optic cables.
The main server 103 may be a server that provides various services, for example a data processing server that processes the information uploaded by the data server 101. The data processing server can detect pedestrians and store the detection results in a database.
It should be noted that the multimodal unsupervised pedestrian pixel-level semantic annotation method provided by the embodiments of the present application is generally executed by the main server 103; accordingly, the apparatus for semantic analysis of small data sets is generally installed in the main server 103.
It should be noted that the data server and the main server may be hardware or software. When they are hardware, each can be implemented as a distributed server cluster composed of multiple servers, or as a single server. When they are software, each can be implemented as multiple pieces of software or software modules (for example, software or software modules for providing distributed services), or as a single piece of software or software module.
It should be understood that the numbers of data servers, networks and main servers in FIG. 1 are merely illustrative. There may be any number of terminal devices, networks and servers according to implementation needs.
According to an embodiment of the present application, FIG. 2 shows a flowchart of the multimodal unsupervised pedestrian pixel-level semantic annotation method. As shown in FIG. 2, the method includes:
S201: Perform three-dimensional reconstruction on the unmanned monitoring scene and obtain the initial point cloud information of the monitoring scene. In the unmanned monitoring scene, the Structure-from-Motion three-dimensional reconstruction algorithm is applied to the M images to reconstruct the monitoring scene in three dimensions and obtain the initial point cloud information. The goal of Structure from Motion (SfM) is to automatically recover the camera motion and scene structure from two or more views; it is a self-calibrating technique that can automatically perform camera tracking and motion matching.
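The disclosure names Structure from Motion but no particular implementation. As an illustration only, the following is a minimal two-view sketch of the SfM core using OpenCV, assuming a known shared intrinsic matrix K; a full reconstruction would chain such pairs over all M images and refine the result with bundle adjustment.

```python
# Minimal two-view sketch of the SfM core (OpenCV assumed): match features,
# recover relative pose from the essential matrix, triangulate scene points.
import cv2
import numpy as np

def two_view_points(img1, img2, K):
    """Triangulate scene points from two overlapping grayscale views.

    K is the shared 3x3 intrinsic matrix (assumed known here).
    Returns an Nx3 array of reconstructed 3D points.
    """
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)

    # Brute-force matching; keep the strongest correspondences.
    matches = sorted(cv2.BFMatcher(cv2.NORM_L2).match(des1, des2),
                     key=lambda m: m.distance)[:500]
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

    # Relative camera motion (R, t) from the essential matrix.
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC)
    _, R, t, mask = cv2.recoverPose(E, pts1, pts2, K, mask=mask)

    # Triangulate the inlier correspondences into 3D.
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([R, t])
    good = mask.ravel() > 0
    pts4d = cv2.triangulatePoints(P1, P2, pts1[good].T, pts2[good].T)
    return (pts4d[:3] / pts4d[3]).T
```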
S202: Use the Tof image acquisition device to obtain the first point cloud information in the monitoring scene, register it with the initial point cloud information and perform a set difference operation to obtain the second point cloud information, and project the second point cloud information onto the horizontal plane to obtain the person point cloud information set. The first point cloud information and the initial point cloud information are registered by the iterative closest point algorithm. In this step, pedestrians are allowed to enter the monitoring scene; the second point cloud information is projected onto the XY plane of the three-dimensional coordinate system of step S201, several circular areas are obtained based on the Hough transform, and the point cloud information belonging to the same circular area is added to the person point cloud information set.
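The registration and set difference of S202 can be sketched as follows with Open3D (an assumed library choice; the disclosure does not prescribe one). Since an exact set difference is not meaningful on real sensor data, the difference is approximated by a nearest-neighbour distance test; icp_dist and diff_dist are illustrative parameters.

```python
# Sketch of S202's registration and "set difference" with Open3D.
import numpy as np
import open3d as o3d

def foreground_cloud(scene_new, scene_bg, icp_dist=0.2, diff_dist=0.05):
    """Return Scene_Point_Cloud_FG: points of scene_new not present in scene_bg."""
    # Register the live cloud to the background cloud (iterative closest point).
    reg = o3d.pipelines.registration.registration_icp(
        scene_new, scene_bg, icp_dist, np.eye(4),
        o3d.pipelines.registration.TransformationEstimationPointToPoint())
    scene_new.transform(reg.transformation)

    # Keep the points whose nearest background neighbour is farther than
    # diff_dist; this approximates the set difference of the two clouds.
    dists = np.asarray(scene_new.compute_point_cloud_distance(scene_bg))
    return scene_new.select_by_index(np.where(dists > diff_dist)[0])
```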
S203: Dilate and erode the binarized image obtained by thresholding the scene information acquired by the infrared image acquisition device, to obtain the connected region information set. After the binarized image is dilated and eroded, regions whose pixel area is smaller than the second threshold are further removed to obtain the connected region information set, where the second threshold is taken from the range of 1000 to 8196.
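A hedged OpenCV sketch of S203: the binarization level and kernel size are assumed values; only the area bound (the second threshold, 1000 to 8196) comes from the disclosure.

```python
# Threshold, clean up with dilation/erosion, keep large connected regions.
import cv2
import numpy as np

def infrared_regions(ir_image, binarize_at=128, thr=1000):
    """Return connected-region masks R1..Rq from an 8-bit infrared frame."""
    _, binary = cv2.threshold(ir_image, binarize_at, 255, cv2.THRESH_BINARY)
    kernel = np.ones((5, 5), np.uint8)
    binary = cv2.erode(cv2.dilate(binary, kernel), kernel)  # dilate, then erode

    # Connected-component labelling; label 0 is the background.
    n, labels, stats, _ = cv2.connectedComponentsWithStats(binary)
    return [labels == i for i in range(1, n)
            if stats[i, cv2.CC_STAT_AREA] >= thr]
```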
S204: Using the calibrated positional relationship between the cameras, respectively project the person point cloud information set and the connected region information set into the image plane space of the RGB image acquisition device to perform a set intersection operation; in response to the common pixels exceeding the first threshold, the corresponding human body region set is obtained.
In a specific embodiment, a Tof image acquisition device, an infrared image acquisition device and an RGB image acquisition device are respectively installed in the monitoring scene, and the initial point cloud information is used to calculate the positional relationship and attitude information of the Tof image acquisition device, the infrared image acquisition device and the RGB image acquisition device:
using the Tof image acquisition device to obtain a depth image of the monitoring scene, and combining it with the initial point cloud information to obtain the degree-of-freedom pose of the Tof image acquisition device in the monitoring scene by means of the iterative closest point algorithm;
using the infrared image acquisition device and the RGB image acquisition device to obtain color images of the monitoring scene, and, using SIFT descriptors and the Bag-of-Words feature description algorithm, obtaining the position and attitude information of the infrared image acquisition device and the RGB image acquisition device from the collected images and the initial point cloud information based on the Bundle Adjustment algorithm.
According to the positional relationship, the attitude information and the internal parameters of the image acquisition devices, the first transformation matrix between the Tof image acquisition device and the infrared image acquisition device, the second transformation matrix between the Tof image acquisition device and the RGB image acquisition device, and the third transformation matrix between the infrared image acquisition device and the RGB image acquisition device are obtained. The second transformation matrix is used to project the person point cloud information into the image plane space of the RGB image acquisition device according to the camera imaging principle, to obtain the first projection region set; the third transformation matrix is used to project the connected region information into the image plane space of the RGB image acquisition device, to obtain the second projection region set; an intersection operation is performed on the pixels of the first projection region set and the second projection region set, and the regions whose common pixels exceed the first threshold are taken as the human body region set, the first threshold being taken from the range of 20*40 to 80*160.
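As an illustration of the projection step, the sketch below assumes the second and third transformation matrices are 4x4 homogeneous transforms into the RGB camera's frame and that Kc is that camera's 3x3 intrinsic matrix; project_to_mask is a hypothetical helper name, not the patent's terminology.

```python
# Project a point cloud into the RGB camera's image plane as a boolean mask.
import numpy as np

def project_to_mask(points, T, Kc, shape):
    """Rasterize an Nx3 cloud into a boolean mask of (height, width) shape."""
    pts_c = (T @ np.c_[points, np.ones(len(points))].T)[:3]  # 3xN, camera frame
    z = pts_c[2]
    u, v = np.round((Kc @ pts_c)[:2] / z).astype(int)        # pinhole projection
    mask = np.zeros(shape, bool)
    ok = (z > 0) & (u >= 0) & (u < shape[1]) & (v >= 0) & (v < shape[0])
    mask[v[ok], u[ok]] = True
    return mask
```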
According to a specific embodiment of the present disclosure, the multimodal unsupervised pedestrian pixel-level semantic annotation method can implement pedestrian detection and automatic annotation through the following steps. Three acquisition devices are used: a Time-of-Flight camera (camera A), an infrared thermal imaging camera (camera B) and an RGB color camera (camera C); in the following description they are referred to as camera A, camera B and camera C.
Step 1: Select a monitoring scene, take an arbitrary point P on the ground, and establish a three-dimensional coordinate system XYZ, where the X axis lies in the horizontal plane and points in a certain direction, the Z axis is perpendicular to the ground and points toward the center of the earth, and the Y axis lies in the horizontal plane perpendicular to the X axis, with its direction determined by the right-hand rule;
Step 2: In the X-axis direction, select a point every 100 cm, for a total of m points, as the horizontal shooting positions of camera C, denoted Q1, Q2, ..., Qm; in the Z-axis direction, select a point every 50 cm, denoted P1, P2, ..., Pn, as the vertical shooting heights of the camera; at each of the m*n positions, select one shooting angle every k degrees for the pitch, yaw and roll angles respectively. Camera C thus collects M=m*n*(180/k)*(180/k)*(180/k) images at different positions and angles.
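As a worked check of the image count (with illustrative values, not taken from the disclosure):

```python
# M = m*n*(180/k)^3 images from Step 2; the values below are illustrative.
m, n, k = 10, 4, 30
M = m * n * (180 // k) ** 3
print(M)  # 10*4*6*6*6 = 8640 images
```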
Step 3: For the M images obtained in Step 2, use the Structure-from-Motion technique to reconstruct the scene in three dimensions, thereby obtaining the point cloud information of the scene, recorded as Scene_Point_Cloud_BG.
Step 4: Install camera A, camera B and camera C in the scene respectively, and use the scene point cloud information from Step 3 to calculate the mutual positional relationship between cameras A, B and C. In this step, it must be ensured that there are no moving objects in the scene.
4a) After installing camera A, obtain the depth image of the scene, denoted Depth_Image; taking Depth_Image and Scene_Point_Cloud_BG as input, use the iterative closest point (ICP) algorithm to solve for the six-degree-of-freedom pose of camera A in the scene (three rotation angles and three translation coordinates).
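Before the ICP solve in step 4a, Depth_Image has to be lifted to a point cloud. A minimal back-projection sketch, assuming camera A's pinhole intrinsics fx, fy, cx, cy are known and depth is in metres:

```python
# Back-project a depth image to a point cloud for the ICP solve of step 4a.
import numpy as np

def depth_to_cloud(depth, fx, fy, cx, cy):
    """Lift an HxW depth map (metres, 0 = invalid) to an Nx3 cloud."""
    v, u = np.nonzero(depth > 0)          # pixel coordinates of valid samples
    z = depth[v, u]
    return np.c_[(u - cx) * z / fx, (v - cy) * z / fy, z]
```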
4b) After cameras B and C are installed, obtain color images of the scene, denoted Color_Image_B and Color_Image_C respectively. Using the SIFT descriptor and the Bag-of-Words feature description method, and based on the M collected images and the scene point cloud information, solve the position and attitude information of cameras B and C with the Bundle Adjustment algorithm.
4c) From the camera poses obtained in steps 4a and 4b, together with the respective intrinsic parameters of cameras A, B, and C, obtain the transformation matrix Tab between camera A and camera B, the transformation matrix Tac between camera A and camera C, and the transformation matrix Tbc between camera B and camera C.
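With the 6-DOF poses expressed as 4x4 homogeneous world-to-camera matrices T_a, T_b, T_c (a common convention, assumed here), the pairwise transformation matrices of step 4c follow by composition:

```python
import numpy as np

# Placeholders; in practice these are the poses solved in steps 4a and 4b.
T_a = np.eye(4)
T_b = np.eye(4)
T_c = np.eye(4)

def relative_transform(T_src: np.ndarray, T_dst: np.ndarray) -> np.ndarray:
    """Map camera-src coordinates into camera-dst coordinates.

    T_src and T_dst are 4x4 world-to-camera matrices: p_src = T_src @ p_world,
    hence p_dst = T_dst @ inv(T_src) @ p_src.
    """
    return T_dst @ np.linalg.inv(T_src)

Tab = relative_transform(T_a, T_b)  # camera A -> camera B
Tac = relative_transform(T_a, T_c)  # camera A -> camera C
Tbc = relative_transform(T_b, T_c)  # camera B -> camera C
```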
Step 5: After steps 1-4 are completed, open the scene and allow pedestrians to enter. Use camera A to obtain the three-dimensional point cloud information of the scene, Scene_Point_Cloud_New, and use the ICP algorithm again to register Scene_Point_Cloud_New with Scene_Point_Cloud_BG; after registration, perform a set difference operation on the two point cloud sets to obtain a new point cloud, Scene_Point_Cloud_FG. Project Scene_Point_Cloud_FG onto the XY plane and obtain several circular regions C1, C2, ..., Cp based on the Hough transform. The point cloud information corresponding to the same circular region Ci is denoted Person_i.
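A sketch of the foreground extraction and circle detection in step 5 (Open3D and OpenCV are assumptions of this example, as are the 5 cm difference threshold, the 1 cm grid resolution, and the Hough parameters):

```python
import numpy as np
import cv2
import open3d as o3d

scene_bg = o3d.io.read_point_cloud("sfm/scene_point_cloud_bg.ply")
scene_new = o3d.io.read_point_cloud("scene_point_cloud_new.ply")  # after ICP registration

# Set difference: keep points of the new cloud farther than 5 cm from the background.
dists = np.asarray(scene_new.compute_point_cloud_distance(scene_bg))
fg = scene_new.select_by_index(np.where(dists > 0.05)[0])  # Scene_Point_Cloud_FG

# Rasterize the XY projection of the foreground into an occupancy image (1 cm cells).
xy = np.asarray(fg.points)[:, :2]
pix = ((xy - xy.min(axis=0)) / 0.01).astype(int)
grid = np.zeros(pix.max(axis=0) + 1, dtype=np.uint8)
grid[pix[:, 0], pix[:, 1]] = 255
grid = cv2.GaussianBlur(grid, (9, 9), 2)  # smooth before the Hough transform

# Hough transform for the circular regions C1..Cp (pedestrian footprints).
circles = cv2.HoughCircles(grid, cv2.HOUGH_GRADIENT, dp=1, minDist=30,
                           param1=50, param2=10, minRadius=10, maxRadius=40)
```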
Step 6: From the scene information captured by camera B, obtain a binarized image Camera_B_Image_Binary by thresholding; perform dilation and erosion operations on Camera_B_Image_Binary, and remove regions whose pixel area is smaller than a threshold thr (thr is set according to the actual scene, within the range 1000-8196). Denote the resulting connected-region information as R1, R2, ..., Rq.
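Step 6 maps directly onto standard OpenCV operations; the sketch below assumes an 8-bit thermal frame and example values for the binarization threshold and morphology kernel:

```python
import numpy as np
import cv2

thermal = cv2.imread("camera_b_frame.png", cv2.IMREAD_GRAYSCALE)

# Thresholding: warm (pedestrian-like) pixels become foreground.
_, binary = cv2.threshold(thermal, 128, 255, cv2.THRESH_BINARY)  # Camera_B_Image_Binary

# Dilation then erosion to consolidate fragmented warm blobs.
kernel = np.ones((5, 5), np.uint8)
binary = cv2.dilate(binary, kernel)
binary = cv2.erode(binary, kernel)

# Keep connected regions whose pixel area reaches thr (scene-dependent, 1000-8196).
thr = 1000
count, labels, stats, _ = cv2.connectedComponentsWithStats(binary)
regions = [labels == i for i in range(1, count)
           if stats[i, cv2.CC_STAT_AREA] >= thr]  # boolean masks R1..Rq
```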
Step 7: Using the transformation matrix Tac between cameras A and C obtained in step 4, project the point cloud information Person_i (i = 1, 2, ..., p) from step 5 into the image plane space of camera C according to the camera imaging principle, and denote the corresponding regions as Region_From_A_i (i = 1, 2, ..., p). Using the transformation matrix Tbc between cameras B and C obtained in step 4, project the regions Rj (j = 1, 2, ..., q) from step 6 into the image plane space of camera C, and denote the corresponding regions as Region_From_B_j (j = 1, 2, ..., q).
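The "camera imaging principle" here is the standard pinhole projection; a minimal sketch (camera C's intrinsic matrix K_c and image shape are assumed to be known from calibration):

```python
import numpy as np

def project_to_camera_c(points_a: np.ndarray, Tac: np.ndarray,
                        K_c: np.ndarray, shape: tuple) -> np.ndarray:
    """Project Person_i points (camera-A coordinates) onto camera C's image plane.

    points_a: (N, 3) array; Tac: 4x4 camera-A-to-camera-C transform;
    K_c: 3x3 intrinsic matrix of camera C; shape: (height, width) of C's image.
    Returns a boolean mask of occupied pixels, i.e. Region_From_A_i.
    """
    homog = np.c_[points_a, np.ones(len(points_a))]  # (N, 4) homogeneous points
    cam_c = (Tac @ homog.T).T[:, :3]                 # transform into camera C frame
    cam_c = cam_c[cam_c[:, 2] > 0]                   # keep points in front of C
    uv = (K_c @ cam_c.T).T
    uv = (uv[:, :2] / uv[:, 2:3]).astype(int)        # perspective divide

    mask = np.zeros(shape, dtype=bool)
    inside = ((uv[:, 0] >= 0) & (uv[:, 0] < shape[1]) &
              (uv[:, 1] >= 0) & (uv[:, 1] < shape[0]))
    mask[uv[inside, 1], uv[inside, 0]] = True
    return mask
```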
Step 8: Perform a set intersection operation on the two region sets obtained in step 7: {Region_From_A_1, ..., Region_From_A_p} and {Region_From_B_1, ..., Region_From_B_q}. When the number of common pixels between Region_From_A_i and Region_From_B_j exceeds a threshold thr_region (set between 20x40 and 80x160), the corresponding human body region Region_From_C_k is obtained, where k ≥ 1 and k ≤ min(p, q).
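Step 8 reduces to counting the overlap of boolean masks; a sketch (thr_region here uses the lower bound 20*40 = 800 common pixels as an example value):

```python
import numpy as np

def fuse_regions(regions_a, regions_b, thr_region=20 * 40):
    """Intersect the two projected region sets; keep overlaps above thr_region.

    regions_a, regions_b: lists of boolean masks of identical shape
    (Region_From_A_i and Region_From_B_j). Returns the human body region
    masks Region_From_C_k, i.e. the automatically labeled pixel-level regions.
    """
    human_regions = []
    for mask_a in regions_a:
        for mask_b in regions_b:
            common = mask_a & mask_b
            if common.sum() > thr_region:
                human_regions.append(common)
    return human_regions
```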
Continuing to refer to FIG. 3, FIG. 3 shows a framework diagram of a multimodal unsupervised pedestrian pixel-level semantic labeling system according to an embodiment of the present application. The system specifically includes an initial point cloud information acquisition unit 301, a person point cloud information set acquisition unit 302, a connected-region information set acquisition unit 303, and a human body region set acquisition unit 304.
In a specific embodiment, the initial point cloud information acquisition unit 301 is configured to perform three-dimensional reconstruction of an unmanned monitoring scene and obtain initial point cloud information of the monitoring scene; the person point cloud information set acquisition unit 302 is configured to use a Tof image acquisition device to obtain first point cloud information of the monitoring scene, register it with the initial point cloud information and then perform a set difference operation to obtain second point cloud information, and project the second point cloud information onto the horizontal plane to obtain a person point cloud information set; the connected-region information set acquisition unit 303 is configured to dilate and erode a binarized image obtained by thresholding scene information acquired by an infrared image acquisition device, to obtain a connected-region information set; and the human body region set acquisition unit 304 is configured to respectively project the person point cloud information set and the connected-region information set into the image plane space of an RGB image acquisition device and perform a set intersection operation, and, in response to the common pixels exceeding a first threshold, obtain a corresponding human body region set.
Referring next to FIG. 4, it shows a schematic structural diagram of a computer system 400 suitable for implementing the electronic device of an embodiment of the present application. The electronic device shown in FIG. 4 is merely an example and should not impose any limitation on the functions or scope of use of the embodiments of the present application.
As shown in FIG. 4, the computer system 400 includes a central processing unit (CPU) 401, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 402 or a program loaded from a storage section 408 into a random access memory (RAM) 403. The RAM 403 also stores various programs and data required for the operation of the system 400. The CPU 401, the ROM 402, and the RAM 403 are connected to one another through a bus 404. An input/output (I/O) interface 405 is also connected to the bus 404.
The following components are connected to the I/O interface 405: an input section 406 including a keyboard, a mouse, and the like; an output section 407 including a liquid crystal display (LCD), a speaker, and the like; a storage section 408 including a hard disk and the like; and a communication section 409 including a network interface card such as a LAN card or a modem. The communication section 409 performs communication processing via a network such as the Internet. A drive 410 is also connected to the I/O interface 405 as needed. A removable medium 411, such as a magnetic disk, an optical disc, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 410 as needed, so that a computer program read from it can be installed into the storage section 408 as needed.
In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a computer-readable storage medium, the computer program containing program code for performing the method illustrated in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section 409, and/or installed from the removable medium 411. When the computer program is executed by the central processing unit (CPU) 401, the above-described functions defined in the method of the present application are performed. It should be noted that the computer-readable storage medium of the present application may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present application, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in combination with an instruction execution system, apparatus, or device. In the present application, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code therein. Such a propagated data signal may take a variety of forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device. Program code contained on a computer-readable storage medium may be transmitted by any appropriate medium, including but not limited to wireless, wireline, optical cable, RF, or any suitable combination of the above.
Computer program code for performing the operations of the present application may be written in one or more programming languages or combinations thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the accompanying drawings illustrate the possible architectures, functions, and operations of systems, methods, and computer program products according to various embodiments of the present application. In this regard, each block in a flowchart or block diagram may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical function. It should also be noted that in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functions involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The modules involved in the embodiments of the present application may be implemented in software or in hardware.
As another aspect, the present application further provides a computer-readable storage medium, which may be included in the electronic device described in the above embodiments, or may exist separately without being assembled into the electronic device. The computer-readable storage medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: perform three-dimensional reconstruction of an unmanned monitoring scene to obtain initial point cloud information of the monitoring scene; use a Tof image acquisition device to obtain first point cloud information of the monitoring scene, register it with the initial point cloud information and then perform a set difference operation to obtain second point cloud information, and project the second point cloud information onto the horizontal plane to obtain a person point cloud information set; dilate and erode a binarized image obtained by thresholding scene information acquired by an infrared image acquisition device, to obtain a connected-region information set; and respectively project the person point cloud information set and the connected-region information set into the image plane space of an RGB image acquisition device and perform a set intersection operation, and, in response to the common pixels exceeding a first threshold, obtain a corresponding human body region set.
The above description is merely a preferred embodiment of the present application and an illustration of the technical principles employed. Those skilled in the art should understand that the scope of disclosure involved in the present application is not limited to technical solutions formed by the specific combination of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalent features without departing from the above disclosed concept, for example, technical solutions formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) the present application.

Claims (12)

  1. A multimodal unsupervised pedestrian pixel-level semantic labeling method, characterized in that it comprises:
    S1: performing three-dimensional reconstruction of an unmanned monitoring scene to obtain initial point cloud information of the monitoring scene;
    S2: using a Tof image acquisition device to obtain first point cloud information of the monitoring scene, registering it with the initial point cloud information and then performing a set difference operation to obtain second point cloud information, and projecting the second point cloud information onto the horizontal plane to obtain a person point cloud information set;
    S3: dilating and eroding a binarized image obtained by thresholding scene information acquired by an infrared image acquisition device, to obtain a connected-region information set; and
    S4: projecting the person point cloud information set and the connected-region information set, using the positional relationships between the already calibrated cameras, into the image plane space of an RGB image acquisition device and performing a set intersection operation, and, in response to the common pixels exceeding a first threshold, obtaining a corresponding human body region set.
  2. The multimodal unsupervised pedestrian pixel-level semantic labeling method according to claim 1, characterized in that step S1 specifically comprises:
    selecting an arbitrary origin in the unmanned monitoring scene and establishing a three-dimensional coordinate system;
    setting m*n points at intervals along the x-axis and z-axis directions as image acquisition positions of the RGB image acquisition device, selecting shooting angles at intervals of k degrees for the pitch, yaw, and roll angles respectively, and collecting M = m*n*(180/k)*(180/k)*(180/k) images;
    performing three-dimensional reconstruction of the monitoring scene from the M images using the Structure-from-Motion three-dimensional reconstruction algorithm, and obtaining the initial point cloud information.
  3. The multimodal unsupervised pedestrian pixel-level semantic labeling method according to claim 1, characterized in that the first point cloud information and the initial point cloud information are registered using the iterative closest point algorithm.
  4. The multimodal unsupervised pedestrian pixel-level semantic labeling method according to claim 2, characterized in that the second point cloud information is projected onto the XY plane of the three-dimensional coordinate system, several circular regions are obtained based on the Hough transform, and the point cloud information corresponding to the same circular region is included in the person point cloud information set.
  5. The multimodal unsupervised pedestrian pixel-level semantic labeling method according to claim 1, characterized in that in step S3, after the binarized image is dilated and eroded, the method further comprises removing regions whose pixel area is smaller than a second threshold.
  6. The multimodal unsupervised pedestrian pixel-level semantic labeling method according to claim 5, characterized in that the first threshold is taken from the range of 20*40-80*160, and the second threshold is taken from the range of 1000-8196.
  7. The multimodal unsupervised pedestrian pixel-level semantic labeling method according to claim 1, characterized in that a Tof image acquisition device, an infrared image acquisition device, and an RGB image acquisition device are respectively installed in the monitoring scene, and the positional relationships and attitude information of the Tof image acquisition device, the infrared image acquisition device, and the RGB image acquisition device are respectively calculated using the initial point cloud information.
  8. The multimodal unsupervised pedestrian pixel-level semantic labeling method according to claim 7, characterized in that the specific manner of obtaining the positional relationships and the attitude information comprises:
    using the Tof image acquisition device to obtain a depth image of the monitoring scene, and obtaining the degree-of-freedom pose of the Tof image acquisition device in the monitoring scene by the iterative closest point algorithm in combination with the initial point cloud information;
    using the infrared image acquisition device and the RGB image acquisition device to obtain color images of the monitoring scene, and, using the SIFT descriptor and the Bag-of-Words feature description algorithm, obtaining the position and attitude information of the infrared image acquisition device and the RGB image acquisition device based on the Bundle Adjustment algorithm, according to the collected images and the initial point cloud information.
  9. The multimodal unsupervised pedestrian pixel-level semantic labeling method according to claim 7 or 8, characterized in that a first transformation matrix between the Tof image acquisition device and the infrared image acquisition device, a second transformation matrix between the Tof image acquisition device and the RGB image acquisition device, and a third transformation matrix between the infrared image acquisition device and the RGB image acquisition device are obtained according to the positional relationships, the attitude information, and the intrinsic parameters of the image acquisition devices.
  10. The multimodal unsupervised pedestrian pixel-level semantic labeling method according to claim 9, characterized in that step S4 specifically comprises:
    using the second transformation matrix to project the person point cloud information onto the image plane space of the RGB image acquisition device according to the camera imaging principle, to obtain a first projection region set;
    using the third transformation matrix to project the connected-region information onto the image plane space of the RGB image acquisition device, to obtain a second projection region set;
    performing an intersection operation on the pixels of the first projection region set and the second projection region set.
  11. A computer-readable storage medium on which one or more computer programs are stored, characterized in that, when executed by a computer processor, the one or more computer programs implement the method of any one of claims 1 to 10.
  12. A multimodal unsupervised pedestrian pixel-level semantic labeling system, characterized in that the system comprises:
    an initial point cloud information acquisition unit, configured to perform three-dimensional reconstruction of an unmanned monitoring scene and obtain initial point cloud information of the monitoring scene;
    a person point cloud information set acquisition unit, configured to use a Tof image acquisition device to obtain first point cloud information of the monitoring scene, register it with the initial point cloud information and then perform a set difference operation to obtain second point cloud information, and project the second point cloud information onto the horizontal plane to obtain a person point cloud information set;
    a connected-region information set acquisition unit, configured to dilate and erode a binarized image obtained by thresholding scene information acquired by an infrared image acquisition device, to obtain a connected-region information set; and
    a human body region set acquisition unit, configured to respectively project the person point cloud information set and the connected-region information set into the image plane space of an RGB image acquisition device and perform a set intersection operation, and, in response to the common pixels exceeding a first threshold, obtain a corresponding human body region set.
PCT/CN2021/074232 2020-12-30 2021-01-28 Multimodal unsupervised pedestrian pixel-level semantic labeling method and system WO2022141721A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011615688.6 2020-12-30
CN202011615688.6A CN112766061A (en) 2020-12-30 2020-12-30 Multi-mode unsupervised pedestrian pixel-level semantic annotation method and system

Publications (1)

Publication Number Publication Date
WO2022141721A1

Family

ID=75697793

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/074232 WO2022141721A1 (en) 2020-12-30 2021-01-28 Multimodal unsupervised pedestrian pixel-level semantic labeling method and system

Country Status (2)

Country Link
CN (1) CN112766061A (en)
WO (1) WO2022141721A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080088623A1 (en) * 2006-10-13 2008-04-17 Richard William Bukowski Image-mapped point cloud with ability to accurately represent point coordinates
CN110456363A (en) * 2019-06-17 2019-11-15 北京理工大学 The target detection and localization method of three-dimensional laser radar point cloud and infrared image fusion
CN111160278A (en) * 2019-12-31 2020-05-15 河南中原大数据研究院有限公司 Face texture structure data acquisition method based on single image sensor
CN111260773A (en) * 2020-01-20 2020-06-09 深圳市普渡科技有限公司 Three-dimensional reconstruction method, detection method and detection system for small obstacles


Also Published As

Publication number Publication date
CN112766061A (en) 2021-05-07

Similar Documents

Publication Publication Date Title
JP7221089B2 (en) Stable simultaneous execution of location estimation and map generation by removing dynamic traffic participants
US10607369B2 (en) Method and device for interactive calibration based on 3D reconstruction in 3D surveillance system
Arth et al. Instant outdoor localization and slam initialization from 2.5 d maps
CN106791710B (en) Target detection method and device and electronic equipment
WO2018196396A1 (en) Person re-identification method based on consistency constraint feature learning
US20210342990A1 (en) Image coordinate system transformation method and apparatus, device, and storage medium
Xue et al. Panoramic Gaussian Mixture Model and large-scale range background substraction method for PTZ camera-based surveillance systems
WO2021258579A1 (en) Image splicing method and apparatus, computer device, and storage medium
Santos et al. 3D plant modeling: localization, mapping and segmentation for plant phenotyping using a single hand-held camera
Xia et al. Zoom better to see clearer: Human part segmentation with auto zoom net
Pintore et al. Recovering 3D existing-conditions of indoor structures from spherical images
WO2022237048A1 (en) Pose acquisition method and apparatus, and electronic device, storage medium and program
He et al. Ground and aerial collaborative mapping in urban environments
CN108229281B (en) Neural network generation method, face detection device and electronic equipment
Gupta et al. Augmented reality system using lidar point cloud data for displaying dimensional information of objects on mobile phones
CN112861776A (en) Human body posture analysis method and system based on dense key points
JP2013037539A (en) Image feature amount extraction device and program thereof
KR101817440B1 (en) The 3d model-based object recognition techniques and systems with a multi-camera
Debaque et al. Thermal and visible image registration using deep homography
US9392146B2 (en) Apparatus and method for extracting object
CN111860084B (en) Image feature matching and positioning method and device and positioning system
WO2022141721A1 (en) Multimodal unsupervised pedestrian pixel-level semantic labeling method and system
Hanzla et al. Smart Traffic Monitoring through Drone Images via Yolov5 and Kalman Filter
Viguier et al. Resilient mobile cognition: Algorithms, innovations, and architectures
JP2017207960A (en) Image analysis device, image analysis method, and program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21912501

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21912501

Country of ref document: EP

Kind code of ref document: A1