WO2022141721A1 - Multimodal unsupervised pedestrian pixel-level semantic labeling method and system - Google Patents

Multimodal unsupervised pedestrian pixel-level semantic labeling method and system

Info

Publication number
WO2022141721A1
Authority
WO
WIPO (PCT)
Prior art keywords
point cloud
image acquisition
cloud information
acquisition device
information
Application number
PCT/CN2021/074232
Other languages
French (fr)
Chinese (zh)
Inventor
彭鹭斌
苏松志
苏松剑
蔡国榕
陈延艺
陈延行
Original Assignee
罗普特科技集团股份有限公司
罗普特(厦门)系统集成有限公司
Application filed by 罗普特科技集团股份有限公司 and 罗普特(厦门)系统集成有限公司
Publication of WO2022141721A1 publication Critical patent/WO2022141721A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/103 Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00 Image enhancement or restoration
    • G06T 5/20 Image enhancement or restoration using local operators
    • G06T 5/30 Erosion or dilatation, e.g. thinning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/30 Determination of transform parameters for the alignment of images, i.e. image registration
    • G06T 7/33 Determination of transform parameters for the alignment of images, i.e. image registration using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10048 Infrared image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30196 Human being; Person

Definitions

  • the present disclosure relates to the technical field of object detection, in particular to a multimodal unsupervised pedestrian pixel-level semantic labeling method and system.
  • Pedestrian detection is a classic problem in computer vision, and its related technologies can be applied in fields such as video surveillance and autonomous driving.
  • the current common method is to first capture a large number of samples containing pedestrians, then manually mark the pedestrians' positions in the pictures as training data; finally, supervised learning methods (such as support vector machines or deep learning) are used to train a classifier that distinguishes pedestrian from non-pedestrian areas.
  • according to the format of the input data, pedestrian detection technology can be divided into methods based on two-dimensional images (color and grayscale), methods based on three-dimensional point clouds, and methods based on infrared imaging. From a technical point of view, it can be divided into holistic methods, part-based methods and local-patch methods. Most of the above methods rely on supervised classification techniques from machine learning, which require the positions of pedestrians in the pictures to be annotated and therefore consume considerable human, material and financial resources.
  • the present disclosure proposes a multimodal unsupervised pedestrian pixel-level semantic annotation method and system, which eliminates the trouble of manually labeling pedestrian samples.
  • a multimodal unsupervised pedestrian pixel-level semantic annotation method including:
  • S1: Perform three-dimensional reconstruction on the unmanned monitoring scene, and obtain the initial point cloud information of the monitoring scene;
  • S2: Use the Tof image acquisition device to obtain the first point cloud information in the monitoring scene, register it with the initial point cloud information and perform a set difference operation to obtain the second point cloud information, and project the second point cloud information onto the horizontal plane to obtain the person point cloud information set;
  • step S1 specifically includes:
  • the Structure-from-Motion three-dimensional reconstruction algorithm is applied to the M images to reconstruct the monitoring scene in three dimensions and obtain the initial point cloud information.
  • with the SfM algorithm, the three-dimensional structure can be recovered from the projected two-dimensional motion field of a moving object or scene.
  • the first point cloud information and the initial point cloud information are registered using an iterative closest point algorithm. With this step, images acquired by different acquisition devices can be registered.
  • the second point cloud information is projected onto the XY plane of the three-dimensional coordinate system, several circular areas are obtained based on the Hough transform, and the point cloud information belonging to the same circular area is added to the person point cloud information set.
  • after the binarized image is dilated and eroded, the method further includes removing regions whose pixel area is smaller than the second threshold. With this step, the connected regions can be obtained from the image.
  • the first threshold is taken from the range of 20*40 to 80*160;
  • the second threshold is taken from the range of 1000 to 8196.
  • a Tof image acquisition device, an infrared image acquisition device and an RGB image acquisition device are respectively installed in the monitoring scene, and the initial point cloud information is used to calculate the positional relationship and attitude information of each of the three devices.
  • using the positional relationship and attitude information of the image acquisition devices of the three modalities facilitates the later conversion of feature point clouds.
  • the specific acquisition methods of the positional relationship and attitude information include:
  • the first transformation matrix between the Tof image acquisition device and the infrared image acquisition device, the second transformation matrix between the Tof image acquisition device and the RGB image acquisition device, and the third transformation matrix between the infrared image acquisition device and the RGB image acquisition device are obtained according to the positional relationship, the attitude information and the internal parameters of the image acquisition devices.
  • step S4 specifically includes:
  • an intersection operation is performed on the pixels of the first projection region set and the second projection region set.
  • a computer-readable storage medium having stored thereon one or more computer programs that, when executed by a computer processor, implement any of the methods described above.
  • a multimodal unsupervised pedestrian pixel-level semantic annotation system comprising:
  • Initial point cloud information acquisition unit: configured to perform three-dimensional reconstruction on an unmanned monitoring scene and obtain the initial point cloud information of the monitoring scene;
  • Person point cloud information set acquisition unit: configured to use the Tof image acquisition device to obtain the first point cloud information in the monitoring scene, register it with the initial point cloud information and perform a set difference operation to obtain the second point cloud information, and project the second point cloud information onto the horizontal plane to obtain the person point cloud information set;
  • Connected region information set acquisition unit: configured to dilate and erode the binarized image obtained by thresholding the scene information acquired by the infrared image acquisition device, to obtain the connected region information set;
  • Human body region set acquisition unit: configured to project the person point cloud information set and the connected region information set into the image plane space of the RGB image acquisition device, using the calibrated positional relationship between the cameras, to perform a set intersection operation, and to obtain the corresponding human body region set in response to the common pixels exceeding the first threshold.
  • the present disclosure proposes a multimodal unsupervised pedestrian pixel-level semantic annotation method and system, which combines the advantages of the different camera modalities of the Tof image acquisition device, the infrared image acquisition device and the RGB image acquisition device, and can effectively extract the human-body pixels in a scene.
  • pixel-level annotation information can be automatically provided for use by machine learning algorithms.
  • FIG. 1 is an exemplary system architecture diagram to which the present application can be applied;
  • FIG. 2 is a flowchart of a multimodal unsupervised pedestrian pixel-level semantic labeling method according to an embodiment of the present application
  • FIG. 3 is a frame diagram of a multimodal unsupervised pedestrian pixel-level semantic annotation system according to an embodiment of the present application
  • FIG. 4 is a schematic structural diagram of a computer system suitable for implementing the electronic device according to the embodiment of the present application.
  • FIG. 1 shows an exemplary system architecture 100 to which the multimodal unsupervised pixel-level semantic annotation method for pedestrians according to embodiments of the present application can be applied.
  • the system architecture 100 may include a data server 101 , a network 102 and a main server 103 .
  • the network 102 is the medium used to provide the communication link between the data server 101 and the main server 103 .
  • the network 102 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
  • the main server 103 may be a server that provides various services, such as a data processing server that processes the information uploaded by the data server 101 .
  • the data processing server can detect pedestrians and store the detection results in the database.
  • the multimodal unsupervised pedestrian pixel-level semantic annotation method provided by the embodiments of the present application is generally executed by the main server 103; accordingly, the apparatus for semantic analysis of small data sets is generally installed in the main server 103.
  • the data server and the main server may be hardware or software.
  • when they are hardware, each can be implemented as a distributed server cluster composed of multiple servers, or as a single server.
  • when they are software, each can be implemented as multiple pieces of software or software modules (such as software or software modules for providing distributed services), or as a single piece of software or software module.
  • FIG. 2 shows a flowchart of the multi-modal unsupervised pedestrian pixel-level semantic annotation method according to an embodiment of the present application. As shown in Figure 2, the method includes:
  • S201: Perform three-dimensional reconstruction on an unmanned monitoring scene, and obtain initial point cloud information of the monitoring scene.
  • in the unmanned monitoring scene, the Structure-from-Motion three-dimensional reconstruction algorithm is applied to the M images to reconstruct the monitoring scene in three dimensions and obtain the initial point cloud information.
  • the goal of Structure from Motion (SfM) is to automatically recover camera motion and scene structure using two or more scenes. It is a self-calibration technology that can automatically complete camera tracking and motion matching.
  • S202: Use the Tof image acquisition device to obtain the first point cloud information in the monitoring scene, register it with the initial point cloud information and perform a set difference operation to obtain the second point cloud information, and project the second point cloud information onto the horizontal plane to obtain the person point cloud information set.
  • the first point cloud information and the initial point cloud information are registered by the iterative closest point algorithm.
  • in this step, pedestrians are allowed to enter the monitoring scene; the second point cloud information is projected onto the XY plane of the three-dimensional coordinate system established in step S201, several circular areas are obtained based on the Hough transform, and the point cloud information belonging to the same circular area is added to the person point cloud information set.
  • S203: Dilate and erode the binarized image obtained by thresholding the scene information acquired by the infrared image acquisition device, to obtain the connected region information set. After dilation and erosion, regions whose pixel area is smaller than the second threshold are further removed, where the second threshold is taken from the range of 1000 to 8196.
  • S204: Using the calibrated positional relationship between the cameras, respectively project the person point cloud information set and the connected region information set into the image plane space of the RGB image acquisition device to perform a set intersection operation; in response to the common pixels exceeding the first threshold, the corresponding human body region set is obtained.
  • a Tof image acquisition device, an infrared image acquisition device and an RGB image acquisition device are respectively installed in the monitoring scene, and the initial point cloud information is used to calculate the positional relationship and attitude information of the Tof image acquisition device, the infrared image acquisition device and the RGB image acquisition device.
  • according to the positional relationship, the attitude information and the internal parameters of the image acquisition devices, the first transformation matrix between the Tof image acquisition device and the infrared image acquisition device, the second transformation matrix between the Tof image acquisition device and the RGB image acquisition device, and the third transformation matrix between the infrared image acquisition device and the RGB image acquisition device are obtained.
  • the multimodal unsupervised pedestrian pixel-level semantic annotation method can specifically implement pedestrian detection and automatic annotation through the following steps, using three acquisition devices: a Time-of-Flight camera (camera A), an infrared thermal imaging camera (camera B) and an RGB color camera (camera C).
  • in the following description, they are referred to as camera A, camera B and camera C.
  • Step 1: Select a monitoring scene, take an arbitrary point P on the ground, and establish a three-dimensional coordinate system XYZ, where the X axis lies in the horizontal plane and points in a certain direction, the Z axis is perpendicular to the ground and points toward the center of the earth, and the Y axis lies in the horizontal plane perpendicular to the X axis, with its direction determined by the right-hand rule;
  • Step 2: In the X-axis direction, select a point every 100 cm, for a total of m points, as the horizontal shooting positions of camera C, denoted Q1, Q2, ..., Qm; in the Z-axis direction, select a point every 50 cm, denoted P1, P2, ..., Pn, as the vertical shooting heights of the camera; at each of the m*n positions, select one shooting angle every k degrees for the pitch, yaw and roll angles respectively.
  • Step 3 Use the Structure-from-Motion technology to reconstruct the scene three-dimensionally for the M images obtained in Step 2, so as to obtain the point cloud information of the scene, which is recorded as Scene_Point_Cloud_BG.
  • Step 4: Install camera A, camera B and camera C in the scene respectively, and use the scene point cloud information from Step 3 to calculate the mutual positional relationship between cameras A, B and C. In this step, it must be ensured that there are no moving objects in the scene.
  • step 4c: According to the camera pose information obtained in steps 4a and 4b, and the respective internal parameters of cameras A, B and C, obtain the transformation matrix Tab between camera A and camera B, the transformation matrix Tac between camera A and camera C, and the transformation matrix Tbc between camera B and camera C.
  • Step 5 After completing the above steps 1-4, open the scene and allow pedestrians to enter the scene.
  • Use camera A to obtain the 3D point cloud information Scene_Point_Cloud_New in the scene, and use the ICP algorithm again to register Scene_Point_Cloud_New and Scene_Point_Cloud_BG; after registration, perform the set difference operation on the two point cloud sets to obtain a new point cloud Scene_Point_Cloud_FG.
  • Project Scene_Point_Cloud_FG on the XY plane and obtain several circular areas C1, C2, ..., Cp based on Hough transform.
  • the point cloud information corresponding to the same circular area Ci is denoted as Person_i.
  • Step 6: Use the scene information captured by camera B to obtain a binarized image Camera_B_Image_Binary after thresholding; perform dilation and erosion operations on Camera_B_Image_Binary, and remove regions whose pixel area is smaller than the threshold thr (set according to the actual scene, in the range 1000-8196); record the obtained connected-region information as R1, R2, ..., Rq.
  • Step 8: Perform a set intersection operation on the two region sets obtained in Step 7: {Region_From_A_1, ..., Region_From_A_p} and {Region_From_B_1, ..., Region_From_B_q}.
  • when the number of common pixels between Region_From_A_i and Region_From_B_j exceeds the threshold thr_region (set from 20x40 to 80x160), the corresponding human body region Region_From_C_k is obtained, where k ≥ 1 and k ≤ min(p, q).
  • FIG. 3 shows a frame diagram of a multi-modal unsupervised pixel-level semantic annotation system for pedestrians according to an embodiment of the present application.
  • the system specifically includes an initial point cloud information acquisition unit 301 , a person point cloud information collection acquisition unit 302 , a connected area information collection acquisition unit 303 , and a human body area collection acquisition unit 304 .
  • the initial point cloud information acquisition unit 301 is configured to perform three-dimensional reconstruction of an unmanned monitoring scene, and obtain initial point cloud information of the monitoring scene;
  • the person point cloud information set acquisition unit 302 is configured to use the Tof image acquisition device to obtain the first point cloud information in the monitoring scene, register it with the initial point cloud information and perform a set difference operation to obtain the second point cloud information, and project the second point cloud information onto the horizontal plane to obtain the person point cloud information set;
  • the connected region information set acquisition unit 303 is configured to dilate and erode the binarized image obtained by thresholding the scene information acquired by the infrared image acquisition device, to obtain the connected region information set;
  • the human body region set acquisition unit 304 is configured to project the person point cloud information set and the connected region information set into the image plane space of the RGB image acquisition device, using the calibrated positional relationship between the cameras, to perform a set intersection operation, and to obtain the corresponding human body region set in response to the common pixels exceeding the first threshold.
  • FIG. 4 shows a schematic structural diagram of a computer system 400 suitable for implementing the electronic device of the embodiment of the present application.
  • the electronic device shown in FIG. 4 is only an example, and should not impose any limitations on the functions and scope of use of the embodiments of the present application.
  • a computer system 400 includes a central processing unit (CPU) 401, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 402 or a program loaded from a storage section 408 into a random access memory (RAM) 403.
  • in the RAM 403, various programs and data required for the operation of the system 400 are also stored.
  • the CPU 401, the ROM 402, and the RAM 403 are connected to each other through a bus 404.
  • An input/output (I/O) interface 405 is also connected to the bus 404.
  • the following components are connected to the I/O interface 405: an input section 406 including a keyboard, a mouse, and the like; an output section 407 including a liquid crystal display (LCD), a speaker, and the like; a storage section 408 including a hard disk and the like; and a communication section 409 including a network interface card, a modem, and the like.
  • the communication section 409 performs communication processing via a network such as the Internet.
  • a drive 410 is also connected to the I/O interface 405 as needed.
  • a removable medium 411 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, etc., is mounted on the drive 410 as needed so that a computer program read therefrom is installed into the storage section 408 as needed.
  • embodiments of the present disclosure include a computer program product comprising a computer program carried on a computer-readable storage medium, the computer program containing program code for performing the method illustrated in the flowchart.
  • the computer program may be downloaded and installed from the network via the communication portion 409 and/or installed from the removable medium 411 .
  • the computer-readable storage medium of the present application may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the above two.
  • the computer-readable storage medium can be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
  • a computer-readable storage medium can be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code therein. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • a computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer-readable storage medium may be transmitted using any suitable medium including, but not limited to, wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
  • Computer program code for performing the operations of the present application may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
  • each block in the flowchart or block diagrams may represent a module, a program segment, or a portion of code that contains one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks therein, can be implemented by dedicated hardware-based systems that perform the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
  • the modules involved in the embodiments of the present application may be implemented in software or in hardware.
  • the present application also provides a computer-readable storage medium.
  • the computer-readable storage medium may be included in the electronic device described in the above embodiments, or may exist separately without being assembled into the electronic device.
  • the above computer-readable storage medium carries one or more programs, and when the one or more programs are executed by the electronic device, the electronic device: performs three-dimensional reconstruction on the unmanned monitoring scene and obtains the initial point cloud information of the monitoring scene; uses the Tof image acquisition device to obtain the first point cloud information in the monitoring scene, registers it with the initial point cloud information and performs a set difference operation to obtain the second point cloud information, and projects the second point cloud information onto the horizontal plane to obtain the person point cloud information set; dilates and erodes the binarized image obtained by thresholding the scene information acquired by the infrared image acquisition device to obtain the connected region information set; and projects the person point cloud information set and the connected region information set into the image plane space of the RGB image acquisition device to perform a set intersection operation.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

Provided in the present disclosure is a multimodal unsupervised pedestrian pixel-level semantic labeling method and system, the method comprising: performing three-dimensional reconstruction on an unmanned monitoring scene, and acquiring initial point cloud information of the monitoring scene; acquiring first point cloud information in the monitoring scene by using a Tof image acquisition device, registering the first point cloud information with the initial point cloud information and then executing a set difference operation to acquire second point cloud information, and projecting the second point cloud information onto a horizontal plane to obtain a person point cloud information set; dilating and eroding a binarized image obtained by thresholding scene information acquired by an infrared image acquisition device to obtain a connected area information set; using the positional relationships between calibrated cameras to project the person point cloud information set and the connected area information set into the image plane space of an RGB image acquisition device so as to perform a set intersection operation, and acquiring a corresponding human body area set when the common pixels exceed a first threshold. The method and system fully integrate the advantages of cameras having different modalities, and can effectively extract human pixel points in the scene.

Description

A Multimodal Unsupervised Pedestrian Pixel-Level Semantic Annotation Method and System
Related Applications
This application claims priority to Chinese Patent Application No. 202011615688.6, filed on December 30, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the technical field of object detection, and in particular to a multimodal unsupervised pedestrian pixel-level semantic labeling method and system.
Background
Pedestrian detection is a classic problem in computer vision, and its related technologies can be applied in fields such as video surveillance and autonomous driving. The current common method is to first capture a large number of samples containing pedestrians, then manually mark the pedestrians' positions in the pictures as training data; finally, supervised learning methods (such as support vector machines or deep learning) are used to train a classifier that distinguishes pedestrian from non-pedestrian areas. With the development of deep learning technology, the number of required training samples keeps growing, and labeling a large number of samples is time-consuming and labor-intensive.
According to the format of the input data, pedestrian detection technology can be divided into methods based on two-dimensional images (color and grayscale), methods based on three-dimensional point clouds, and methods based on infrared imaging. From a technical point of view, it can be divided into holistic methods, part-based methods and local-patch methods. Most of the above methods rely on supervised classification techniques from machine learning, which require the positions of pedestrians in the pictures to be annotated and therefore consume considerable human, material and financial resources.
公开内容public content
为了解决现有技术中需要耗费大量的人力物力和财力标注出行人在图片中的位置的技术问题,本公开提出了一种多模态无监督的行人像素级语义标注方法和系统,免去人工标注行人样本的麻烦。In order to solve the technical problem in the prior art that a lot of manpower, material resources and financial resources are required to mark the location of pedestrians in pictures, the present disclosure proposes a multimodal unsupervised pixel-level semantic annotation method and system for pedestrians, which eliminates the need for manual The trouble of labeling pedestrian samples.
根据本公开的一个方面,提出了一种多模态无监督的行人像素级语义标注方法,包括:According to an aspect of the present disclosure, a multimodal unsupervised pedestrian pixel-level semantic annotation method is proposed, including:
S1:对无人的监控场景进行三维重建,获取监控场景的初始点云信息;S1: Perform 3D reconstruction on the unmanned monitoring scene, and obtain the initial point cloud information of the monitoring scene;
S2:利用Tof图像采集设备获取监控场景中的第一点云信息,将其与初始点云信息配准后进行集合的差运算,获得第二点云信息,并将第二点云信息在水平面上进行投影,获得人员点云信息集合;S2: Use the Tof image acquisition device to obtain the first point cloud information in the monitoring scene, register it with the initial point cloud information, and perform a set difference operation to obtain the second point cloud information, and place the second point cloud information on the horizontal plane. Projection on it to obtain a collection of personnel point cloud information;
S3:对红外图像采集设备获取的场景信息阈值化后的二值化图像进行膨胀和腐蚀,获得连通区域信息集合;以及S3: Dilate and corrode the binarized image obtained by the thresholding of the scene information obtained by the infrared image acquisition device to obtain a set of connected area information; and
S4:分别将人员点云信息集合和连通区域信息集合,利用已经标定的相机之间的位置关系,投影到RGB图像采集设备的图像平面空间中进行集合的交集运算,响应于共同像素超过第一阈值时,获取对应的人体区域集合。S4: respectively project the personnel point cloud information set and the connected area information set into the image plane space of the RGB image acquisition device by using the positional relationship between the cameras that have been calibrated to perform the intersection operation of the sets, and in response to the common pixel exceeding the first When the threshold is set, the corresponding set of human body regions is obtained.
In some specific embodiments, step S1 specifically includes:
taking an arbitrary origin in the unmanned monitoring scene and establishing a three-dimensional coordinate system;
setting m*n points at intervals in the x-axis and z-axis directions as the image acquisition positions of the RGB image acquisition device, selecting shooting angles every k degrees for the pitch, yaw and roll angles respectively, and collecting M=m*n*(180/k)*(180/k)*(180/k) images;
applying the Structure-from-Motion three-dimensional reconstruction algorithm to the M images to reconstruct the monitoring scene in three dimensions and obtain the initial point cloud information. With the SfM algorithm, the three-dimensional structure can be recovered from the projected two-dimensional motion field of a moving object or scene.
In some specific embodiments, the first point cloud information and the initial point cloud information are registered using the iterative closest point algorithm. With this step, images acquired by different acquisition devices can be registered.
In some specific embodiments, the second point cloud information is projected onto the XY plane of the three-dimensional coordinate system, several circular areas are obtained based on the Hough transform, and the point cloud information belonging to the same circular area is added to the person point cloud information set.
In some specific embodiments, after the binarized image is dilated and eroded in step S3, the method further includes removing regions whose pixel area is smaller than a second threshold. With this step, the connected regions can be obtained from the image.
In some specific embodiments, the first threshold is taken from the range of 20*40 to 80*160, and the second threshold from the range of 1000 to 8196.
In some specific embodiments, a Tof image acquisition device, an infrared image acquisition device and an RGB image acquisition device are respectively installed in the monitoring scene, and the initial point cloud information is used to calculate the positional relationship and attitude information of the Tof image acquisition device, the infrared image acquisition device and the RGB image acquisition device. Using the positional relationship and attitude information of the image acquisition devices of the three modalities facilitates the later conversion of feature point clouds.
In some specific embodiments, the positional relationship and attitude information are obtained as follows:
using the Tof image acquisition device to obtain a depth image of the monitoring scene, and combining it with the initial point cloud information to obtain the degree-of-freedom pose of the Tof image acquisition device in the monitoring scene by means of the iterative closest point algorithm;
using the infrared image acquisition device and the RGB image acquisition device to obtain color images of the monitoring scene, and, using SIFT descriptors and the Bag-of-Words feature description algorithm, obtaining the position and attitude information of the infrared image acquisition device and the RGB image acquisition device from the collected images and the initial point cloud information based on the Bundle Adjustment algorithm.
In some specific embodiments, according to the positional relationship, the attitude information and the internal parameters of the image acquisition devices, a first transformation matrix between the Tof image acquisition device and the infrared image acquisition device, a second transformation matrix between the Tof image acquisition device and the RGB image acquisition device, and a third transformation matrix between the infrared image acquisition device and the RGB image acquisition device are obtained.
In some specific embodiments, step S4 specifically includes:
using the second transformation matrix to project the person point cloud information into the image plane space of the RGB image acquisition device according to the camera imaging principle, to obtain a first projection region set;
using the third transformation matrix to project the connected region information into the image plane space of the RGB image acquisition device, to obtain a second projection region set;
performing an intersection operation on the pixels of the first projection region set and the second projection region set.
According to a second aspect of the present disclosure, a computer-readable storage medium is proposed, on which one or more computer programs are stored; when executed by a computer processor, the one or more computer programs implement any of the methods described above.
According to a third aspect of the present disclosure, a multimodal unsupervised pedestrian pixel-level semantic annotation system is proposed, the system including:
an initial point cloud information acquisition unit, configured to perform three-dimensional reconstruction on an unmanned monitoring scene and obtain initial point cloud information of the monitoring scene;
a person point cloud information set acquisition unit, configured to use a Tof image acquisition device to obtain first point cloud information in the monitoring scene, register it with the initial point cloud information and perform a set difference operation to obtain second point cloud information, and project the second point cloud information onto the horizontal plane to obtain a person point cloud information set;
a connected region information set acquisition unit, configured to dilate and erode the binarized image obtained by thresholding the scene information acquired by an infrared image acquisition device, to obtain a connected region information set; and
a human body region set acquisition unit, configured to project the person point cloud information set and the connected region information set into the image plane space of an RGB image acquisition device, using the calibrated positional relationship between the cameras, to perform a set intersection operation, and to obtain the corresponding human body region set in response to the common pixels exceeding a first threshold.
The present disclosure proposes a multimodal unsupervised pedestrian pixel-level semantic annotation method and system that combines the advantages of the different camera modalities of the Tof image acquisition device, the infrared image acquisition device and the RGB image acquisition device, and can effectively extract the human-body pixels in a scene. In pedestrian detection tasks, pixel-level annotation information can be provided automatically for use by machine learning algorithms.
Description of the Drawings
The accompanying drawings are included to provide a further understanding of the embodiments; they are incorporated into and constitute a part of this specification, illustrate the embodiments and, together with the description, serve to explain the principles of the present disclosure. Other embodiments and many of the intended advantages of the embodiments will be readily appreciated as they become better understood by reference to the following detailed description. Other features, objects and advantages of the present application will become more apparent from the detailed description of non-limiting embodiments made with reference to the following drawings:
FIG. 1 is an exemplary system architecture diagram to which the present application can be applied;
FIG. 2 is a flowchart of a multimodal unsupervised pedestrian pixel-level semantic annotation method according to an embodiment of the present application;
FIG. 3 is a frame diagram of a multimodal unsupervised pedestrian pixel-level semantic annotation system according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a computer system suitable for implementing the electronic device of an embodiment of the present application.
Detailed Description
The present application will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the related disclosure, not to limit it. It should also be noted that, for convenience of description, only the parts related to the relevant disclosure are shown in the drawings.
It should be noted that, in the case of no conflict, the embodiments in the present application and the features in the embodiments may be combined with each other. The present application will be described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.
FIG. 1 shows an exemplary system architecture 100 to which the multimodal unsupervised pedestrian pixel-level semantic annotation method of embodiments of the present application can be applied.
As shown in FIG. 1, the system architecture 100 may include a data server 101, a network 102 and a main server 103. The network 102 is the medium that provides the communication link between the data server 101 and the main server 103, and may include various connection types, such as wired or wireless communication links or fiber optic cables.
The main server 103 may be a server that provides various services, for example a data processing server that processes the information uploaded by the data server 101. The data processing server can detect pedestrians and store the detection results in a database.
It should be noted that the multimodal unsupervised pedestrian pixel-level semantic annotation method provided by the embodiments of the present application is generally executed by the main server 103; accordingly, the apparatus for semantic analysis of small data sets is generally installed in the main server 103.
It should be noted that the data server and the main server may be hardware or software. When they are hardware, each can be implemented as a distributed server cluster composed of multiple servers, or as a single server. When they are software, each can be implemented as multiple pieces of software or software modules (for example, software or software modules for providing distributed services), or as a single piece of software or software module.
It should be understood that the numbers of data servers, networks and main servers in FIG. 1 are merely illustrative. There may be any number of terminal devices, networks and servers according to implementation needs.
According to an embodiment of the present application, FIG. 2 shows a flowchart of the multimodal unsupervised pedestrian pixel-level semantic annotation method. As shown in FIG. 2, the method includes:
S201: Perform three-dimensional reconstruction on the unmanned monitoring scene and obtain the initial point cloud information of the monitoring scene. In the unmanned monitoring scene, the Structure-from-Motion three-dimensional reconstruction algorithm is applied to the M images to reconstruct the monitoring scene in three dimensions and obtain the initial point cloud information. The goal of Structure from Motion (SfM) is to automatically recover the camera motion and scene structure from two or more views; it is a self-calibrating technique that can automatically perform camera tracking and motion matching.
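The disclosure names Structure from Motion but no particular implementation. As an illustration only, the following is a minimal two-view sketch of the SfM core using OpenCV, assuming a known shared intrinsic matrix K; a full reconstruction would chain such pairs over all M images and refine the result with bundle adjustment.

```python
# Minimal two-view sketch of the SfM core (OpenCV assumed): match features,
# recover relative pose from the essential matrix, triangulate scene points.
import cv2
import numpy as np

def two_view_points(img1, img2, K):
    """Triangulate scene points from two overlapping grayscale views.

    K is the shared 3x3 intrinsic matrix (assumed known here).
    Returns an Nx3 array of reconstructed 3D points.
    """
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)

    # Brute-force matching; keep the strongest correspondences.
    matches = sorted(cv2.BFMatcher(cv2.NORM_L2).match(des1, des2),
                     key=lambda m: m.distance)[:500]
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

    # Relative camera motion (R, t) from the essential matrix.
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC)
    _, R, t, mask = cv2.recoverPose(E, pts1, pts2, K, mask=mask)

    # Triangulate the inlier correspondences into 3D.
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([R, t])
    good = mask.ravel() > 0
    pts4d = cv2.triangulatePoints(P1, P2, pts1[good].T, pts2[good].T)
    return (pts4d[:3] / pts4d[3]).T
```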
S202: Use the Tof image acquisition device to obtain the first point cloud information in the monitoring scene, register it with the initial point cloud information and perform a set difference operation to obtain the second point cloud information, and project the second point cloud information onto the horizontal plane to obtain the person point cloud information set. The first point cloud information and the initial point cloud information are registered by the iterative closest point algorithm. In this step, pedestrians are allowed to enter the monitoring scene; the second point cloud information is projected onto the XY plane of the three-dimensional coordinate system of step S201, several circular areas are obtained based on the Hough transform, and the point cloud information belonging to the same circular area is added to the person point cloud information set.
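The registration and set difference of S202 can be sketched as follows with Open3D (an assumed library choice; the disclosure does not prescribe one). Since an exact set difference is not meaningful on real sensor data, the difference is approximated by a nearest-neighbour distance test; icp_dist and diff_dist are illustrative parameters.

```python
# Sketch of S202's registration and "set difference" with Open3D.
import numpy as np
import open3d as o3d

def foreground_cloud(scene_new, scene_bg, icp_dist=0.2, diff_dist=0.05):
    """Return Scene_Point_Cloud_FG: points of scene_new not present in scene_bg."""
    # Register the live cloud to the background cloud (iterative closest point).
    reg = o3d.pipelines.registration.registration_icp(
        scene_new, scene_bg, icp_dist, np.eye(4),
        o3d.pipelines.registration.TransformationEstimationPointToPoint())
    scene_new.transform(reg.transformation)

    # Keep the points whose nearest background neighbour is farther than
    # diff_dist; this approximates the set difference of the two clouds.
    dists = np.asarray(scene_new.compute_point_cloud_distance(scene_bg))
    return scene_new.select_by_index(np.where(dists > diff_dist)[0])
```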
S203: Dilate and erode the binarized image obtained by thresholding the scene information acquired by the infrared image acquisition device, to obtain the connected region information set. After the binarized image is dilated and eroded, regions whose pixel area is smaller than the second threshold are further removed to obtain the connected region information set, where the second threshold is taken from the range of 1000 to 8196.
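A hedged OpenCV sketch of S203: the binarization level and kernel size are assumed values; only the area bound (the second threshold, 1000 to 8196) comes from the disclosure.

```python
# Threshold, clean up with dilation/erosion, keep large connected regions.
import cv2
import numpy as np

def infrared_regions(ir_image, binarize_at=128, thr=1000):
    """Return connected-region masks R1..Rq from an 8-bit infrared frame."""
    _, binary = cv2.threshold(ir_image, binarize_at, 255, cv2.THRESH_BINARY)
    kernel = np.ones((5, 5), np.uint8)
    binary = cv2.erode(cv2.dilate(binary, kernel), kernel)  # dilate, then erode

    # Connected-component labelling; label 0 is the background.
    n, labels, stats, _ = cv2.connectedComponentsWithStats(binary)
    return [labels == i for i in range(1, n)
            if stats[i, cv2.CC_STAT_AREA] >= thr]
```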
S204: Using the calibrated positional relationship between the cameras, respectively project the person point cloud information set and the connected region information set into the image plane space of the RGB image acquisition device to perform a set intersection operation; in response to the common pixels exceeding the first threshold, the corresponding human body region set is obtained.
In a specific embodiment, a Tof image acquisition device, an infrared image acquisition device and an RGB image acquisition device are respectively installed in the monitoring scene, and the initial point cloud information is used to calculate the positional relationship and attitude information of the Tof image acquisition device, the infrared image acquisition device and the RGB image acquisition device:
using the Tof image acquisition device to obtain a depth image of the monitoring scene, and combining it with the initial point cloud information to obtain the degree-of-freedom pose of the Tof image acquisition device in the monitoring scene by means of the iterative closest point algorithm;
using the infrared image acquisition device and the RGB image acquisition device to obtain color images of the monitoring scene, and, using SIFT descriptors and the Bag-of-Words feature description algorithm, obtaining the position and attitude information of the infrared image acquisition device and the RGB image acquisition device from the collected images and the initial point cloud information based on the Bundle Adjustment algorithm.
According to the positional relationship, the attitude information and the internal parameters of the image acquisition devices, the first transformation matrix between the Tof image acquisition device and the infrared image acquisition device, the second transformation matrix between the Tof image acquisition device and the RGB image acquisition device, and the third transformation matrix between the infrared image acquisition device and the RGB image acquisition device are obtained. The second transformation matrix is used to project the person point cloud information into the image plane space of the RGB image acquisition device according to the camera imaging principle, to obtain the first projection region set; the third transformation matrix is used to project the connected region information into the image plane space of the RGB image acquisition device, to obtain the second projection region set; an intersection operation is performed on the pixels of the first projection region set and the second projection region set, and the regions whose common pixels exceed the first threshold are taken as the human body region set, the first threshold being taken from the range of 20*40 to 80*160.
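As an illustration of the projection step, the sketch below assumes the second and third transformation matrices are 4x4 homogeneous transforms into the RGB camera's frame and that Kc is that camera's 3x3 intrinsic matrix; project_to_mask is a hypothetical helper name, not the patent's terminology.

```python
# Project a point cloud into the RGB camera's image plane as a boolean mask.
import numpy as np

def project_to_mask(points, T, Kc, shape):
    """Rasterize an Nx3 cloud into a boolean mask of (height, width) shape."""
    pts_c = (T @ np.c_[points, np.ones(len(points))].T)[:3]  # 3xN, camera frame
    z = pts_c[2]
    u, v = np.round((Kc @ pts_c)[:2] / z).astype(int)        # pinhole projection
    mask = np.zeros(shape, bool)
    ok = (z > 0) & (u >= 0) & (u < shape[1]) & (v >= 0) & (v < shape[0])
    mask[v[ok], u[ok]] = True
    return mask
```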
According to a specific embodiment of the present disclosure, the multimodal unsupervised pedestrian pixel-level semantic annotation method can implement pedestrian detection and automatic annotation through the following steps. Three acquisition devices are used: a Time-of-Flight camera (camera A), an infrared thermal imaging camera (camera B) and an RGB color camera (camera C); in the following description they are referred to as camera A, camera B and camera C.
Step 1: Select a monitoring scene, take an arbitrary point P on the ground, and establish a three-dimensional coordinate system XYZ, where the X axis lies in the horizontal plane and points in a certain direction, the Z axis is perpendicular to the ground and points toward the center of the earth, and the Y axis lies in the horizontal plane perpendicular to the X axis, with its direction determined by the right-hand rule;
Step 2: In the X-axis direction, select a point every 100 cm, for a total of m points, as the horizontal shooting positions of camera C, denoted Q1, Q2, ..., Qm; in the Z-axis direction, select a point every 50 cm, denoted P1, P2, ..., Pn, as the vertical shooting heights of the camera; at each of the m*n positions, select one shooting angle every k degrees for the pitch, yaw and roll angles respectively. Camera C thus collects M=m*n*(180/k)*(180/k)*(180/k) images at different positions and angles.
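As a worked check of the image count (with illustrative values, not taken from the disclosure):

```python
# M = m*n*(180/k)^3 images from Step 2; the values below are illustrative.
m, n, k = 10, 4, 30
M = m * n * (180 // k) ** 3
print(M)  # 10*4*6*6*6 = 8640 images
```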
Step 3: For the M images obtained in Step 2, use the Structure-from-Motion technique to reconstruct the scene in three dimensions, thereby obtaining the point cloud information of the scene, recorded as Scene_Point_Cloud_BG.
Step 4: Install camera A, camera B and camera C in the scene respectively, and use the scene point cloud information from Step 3 to calculate the mutual positional relationship between cameras A, B and C. In this step, it must be ensured that there are no moving objects in the scene.
4a) After installing camera A, obtain the depth image of the scene, denoted Depth_Image; taking Depth_Image and Scene_Point_Cloud_BG as input, use the iterative closest point (ICP) algorithm to solve for the six-degree-of-freedom pose of camera A in the scene (three rotation angles and three translation coordinates).
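Before the ICP solve in step 4a, Depth_Image has to be lifted to a point cloud. A minimal back-projection sketch, assuming camera A's pinhole intrinsics fx, fy, cx, cy are known and depth is in metres:

```python
# Back-project a depth image to a point cloud for the ICP solve of step 4a.
import numpy as np

def depth_to_cloud(depth, fx, fy, cx, cy):
    """Lift an HxW depth map (metres, 0 = invalid) to an Nx3 cloud."""
    v, u = np.nonzero(depth > 0)          # pixel coordinates of valid samples
    z = depth[v, u]
    return np.c_[(u - cx) * z / fx, (v - cy) * z / fy, z]
```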
4b) After cameras B and C are installed, obtain color images of the scene, denoted Color_Image_B and Color_Image_C respectively. Using the SIFT descriptor and the Bag-of-Words feature description method, and based on the M collected images and the scene point cloud information, solve the position and attitude information of cameras B and C with the Bundle Adjustment algorithm.
4c) From the camera poses obtained in steps 4a and 4b, together with the respective intrinsic parameters of cameras A, B, and C, obtain the transformation matrix Tab between camera A and camera B, the transformation matrix Tac between camera A and camera C, and the transformation matrix Tbc between camera B and camera C.
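With the 6-DOF poses expressed as 4x4 homogeneous world-to-camera matrices T_a, T_b, T_c (a common convention, assumed here), the pairwise transformation matrices of step 4c follow by composition:

```python
import numpy as np

# Placeholders; in practice these are the poses solved in steps 4a and 4b.
T_a = np.eye(4)
T_b = np.eye(4)
T_c = np.eye(4)

def relative_transform(T_src: np.ndarray, T_dst: np.ndarray) -> np.ndarray:
    """Map camera-src coordinates into camera-dst coordinates.

    T_src and T_dst are 4x4 world-to-camera matrices: p_src = T_src @ p_world,
    hence p_dst = T_dst @ inv(T_src) @ p_src.
    """
    return T_dst @ np.linalg.inv(T_src)

Tab = relative_transform(T_a, T_b)  # camera A -> camera B
Tac = relative_transform(T_a, T_c)  # camera A -> camera C
Tbc = relative_transform(T_b, T_c)  # camera B -> camera C
```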
Step 5: After steps 1-4 are completed, open the scene and allow pedestrians to enter. Use camera A to obtain the three-dimensional point cloud information of the scene, Scene_Point_Cloud_New, and use the ICP algorithm again to register Scene_Point_Cloud_New with Scene_Point_Cloud_BG; after registration, perform a set difference operation on the two point cloud sets to obtain a new point cloud, Scene_Point_Cloud_FG. Project Scene_Point_Cloud_FG onto the XY plane and obtain several circular regions C1, C2, ..., Cp based on the Hough transform. The point cloud information corresponding to the same circular region Ci is denoted Person_i.
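A sketch of the foreground extraction and circle detection in step 5 (Open3D and OpenCV are assumptions of this example, as are the 5 cm difference threshold, the 1 cm grid resolution, and the Hough parameters):

```python
import numpy as np
import cv2
import open3d as o3d

scene_bg = o3d.io.read_point_cloud("sfm/scene_point_cloud_bg.ply")
scene_new = o3d.io.read_point_cloud("scene_point_cloud_new.ply")  # after ICP registration

# Set difference: keep points of the new cloud farther than 5 cm from the background.
dists = np.asarray(scene_new.compute_point_cloud_distance(scene_bg))
fg = scene_new.select_by_index(np.where(dists > 0.05)[0])  # Scene_Point_Cloud_FG

# Rasterize the XY projection of the foreground into an occupancy image (1 cm cells).
xy = np.asarray(fg.points)[:, :2]
pix = ((xy - xy.min(axis=0)) / 0.01).astype(int)
grid = np.zeros(pix.max(axis=0) + 1, dtype=np.uint8)
grid[pix[:, 0], pix[:, 1]] = 255
grid = cv2.GaussianBlur(grid, (9, 9), 2)  # smooth before the Hough transform

# Hough transform for the circular regions C1..Cp (pedestrian footprints).
circles = cv2.HoughCircles(grid, cv2.HOUGH_GRADIENT, dp=1, minDist=30,
                           param1=50, param2=10, minRadius=10, maxRadius=40)
```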
Step 6: From the scene information captured by camera B, obtain a binarized image Camera_B_Image_Binary by thresholding; perform dilation and erosion operations on Camera_B_Image_Binary, and remove regions whose pixel area is smaller than a threshold thr (thr is set according to the actual scene, within the range 1000-8196). Denote the resulting connected-region information as R1, R2, ..., Rq.
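Step 6 maps directly onto standard OpenCV operations; the sketch below assumes an 8-bit thermal frame and example values for the binarization threshold and morphology kernel:

```python
import numpy as np
import cv2

thermal = cv2.imread("camera_b_frame.png", cv2.IMREAD_GRAYSCALE)

# Thresholding: warm (pedestrian-like) pixels become foreground.
_, binary = cv2.threshold(thermal, 128, 255, cv2.THRESH_BINARY)  # Camera_B_Image_Binary

# Dilation then erosion to consolidate fragmented warm blobs.
kernel = np.ones((5, 5), np.uint8)
binary = cv2.dilate(binary, kernel)
binary = cv2.erode(binary, kernel)

# Keep connected regions whose pixel area reaches thr (scene-dependent, 1000-8196).
thr = 1000
count, labels, stats, _ = cv2.connectedComponentsWithStats(binary)
regions = [labels == i for i in range(1, count)
           if stats[i, cv2.CC_STAT_AREA] >= thr]  # boolean masks R1..Rq
```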
Step 7: Using the transformation matrix Tac between cameras A and C obtained in step 4, project the point cloud information Person_i (i = 1, 2, ..., p) from step 5 into the image plane space of camera C according to the camera imaging principle, and denote the corresponding regions as Region_From_A_i (i = 1, 2, ..., p). Using the transformation matrix Tbc between cameras B and C obtained in step 4, project the regions Rj (j = 1, 2, ..., q) from step 6 into the image plane space of camera C, and denote the corresponding regions as Region_From_B_j (j = 1, 2, ..., q).
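The "camera imaging principle" here is the standard pinhole projection; a minimal sketch (camera C's intrinsic matrix K_c and image shape are assumed to be known from calibration):

```python
import numpy as np

def project_to_camera_c(points_a: np.ndarray, Tac: np.ndarray,
                        K_c: np.ndarray, shape: tuple) -> np.ndarray:
    """Project Person_i points (camera-A coordinates) onto camera C's image plane.

    points_a: (N, 3) array; Tac: 4x4 camera-A-to-camera-C transform;
    K_c: 3x3 intrinsic matrix of camera C; shape: (height, width) of C's image.
    Returns a boolean mask of occupied pixels, i.e. Region_From_A_i.
    """
    homog = np.c_[points_a, np.ones(len(points_a))]  # (N, 4) homogeneous points
    cam_c = (Tac @ homog.T).T[:, :3]                 # transform into camera C frame
    cam_c = cam_c[cam_c[:, 2] > 0]                   # keep points in front of C
    uv = (K_c @ cam_c.T).T
    uv = (uv[:, :2] / uv[:, 2:3]).astype(int)        # perspective divide

    mask = np.zeros(shape, dtype=bool)
    inside = ((uv[:, 0] >= 0) & (uv[:, 0] < shape[1]) &
              (uv[:, 1] >= 0) & (uv[:, 1] < shape[0]))
    mask[uv[inside, 1], uv[inside, 0]] = True
    return mask
```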
Step 8: Perform a set intersection operation on the two region sets obtained in step 7: {Region_From_A_1, ..., Region_From_A_p} and {Region_From_B_1, ..., Region_From_B_q}. When the number of common pixels between Region_From_A_i and Region_From_B_j exceeds a threshold thr_region (set between 20x40 and 80x160), the corresponding human body region Region_From_C_k is obtained, where k ≥ 1 and k ≤ min(p, q).
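Step 8 reduces to counting the overlap of boolean masks; a sketch (thr_region here uses the lower bound 20*40 = 800 common pixels as an example value):

```python
import numpy as np

def fuse_regions(regions_a, regions_b, thr_region=20 * 40):
    """Intersect the two projected region sets; keep overlaps above thr_region.

    regions_a, regions_b: lists of boolean masks of identical shape
    (Region_From_A_i and Region_From_B_j). Returns the human body region
    masks Region_From_C_k, i.e. the automatically labeled pixel-level regions.
    """
    human_regions = []
    for mask_a in regions_a:
        for mask_b in regions_b:
            common = mask_a & mask_b
            if common.sum() > thr_region:
                human_regions.append(common)
    return human_regions
```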
Continuing to refer to FIG. 3, FIG. 3 shows a framework diagram of a multimodal unsupervised pedestrian pixel-level semantic labeling system according to an embodiment of the present application. The system specifically includes an initial point cloud information acquisition unit 301, a person point cloud information set acquisition unit 302, a connected-region information set acquisition unit 303, and a human body region set acquisition unit 304.
In a specific embodiment, the initial point cloud information acquisition unit 301 is configured to perform three-dimensional reconstruction of an unmanned monitoring scene and obtain initial point cloud information of the monitoring scene; the person point cloud information set acquisition unit 302 is configured to use a Tof image acquisition device to obtain first point cloud information of the monitoring scene, register it with the initial point cloud information and then perform a set difference operation to obtain second point cloud information, and project the second point cloud information onto the horizontal plane to obtain a person point cloud information set; the connected-region information set acquisition unit 303 is configured to dilate and erode a binarized image obtained by thresholding scene information acquired by an infrared image acquisition device, to obtain a connected-region information set; and the human body region set acquisition unit 304 is configured to respectively project the person point cloud information set and the connected-region information set into the image plane space of an RGB image acquisition device and perform a set intersection operation, and, in response to the common pixels exceeding a first threshold, obtain a corresponding human body region set.
Referring next to FIG. 4, it shows a schematic structural diagram of a computer system 400 suitable for implementing the electronic device of an embodiment of the present application. The electronic device shown in FIG. 4 is merely an example and should not impose any limitation on the functions or scope of use of the embodiments of the present application.
As shown in FIG. 4, the computer system 400 includes a central processing unit (CPU) 401, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 402 or a program loaded from a storage section 408 into a random access memory (RAM) 403. The RAM 403 also stores various programs and data required for the operation of the system 400. The CPU 401, the ROM 402, and the RAM 403 are connected to one another through a bus 404. An input/output (I/O) interface 405 is also connected to the bus 404.
The following components are connected to the I/O interface 405: an input section 406 including a keyboard, a mouse, and the like; an output section 407 including a liquid crystal display (LCD), a speaker, and the like; a storage section 408 including a hard disk and the like; and a communication section 409 including a network interface card such as a LAN card or a modem. The communication section 409 performs communication processing via a network such as the Internet. A drive 410 is also connected to the I/O interface 405 as needed. A removable medium 411, such as a magnetic disk, an optical disc, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 410 as needed, so that a computer program read from it can be installed into the storage section 408 as needed.
In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a computer-readable storage medium, the computer program containing program code for performing the method illustrated in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section 409, and/or installed from the removable medium 411. When the computer program is executed by the central processing unit (CPU) 401, the above-described functions defined in the method of the present application are performed. It should be noted that the computer-readable storage medium of the present application may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present application, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in combination with an instruction execution system, apparatus, or device. In the present application, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code therein. Such a propagated data signal may take a variety of forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device. Program code contained on a computer-readable storage medium may be transmitted by any appropriate medium, including but not limited to wireless, wireline, optical cable, RF, or any suitable combination of the above.
Computer program code for performing the operations of the present application may be written in one or more programming languages or combinations thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the accompanying drawings illustrate the possible architectures, functions, and operations of systems, methods, and computer program products according to various embodiments of the present application. In this regard, each block in a flowchart or block diagram may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical function. It should also be noted that in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functions involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The modules involved in the embodiments of the present application may be implemented in software or in hardware.
As another aspect, the present application further provides a computer-readable storage medium, which may be included in the electronic device described in the above embodiments, or may exist separately without being assembled into the electronic device. The computer-readable storage medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: perform three-dimensional reconstruction of an unmanned monitoring scene to obtain initial point cloud information of the monitoring scene; use a Tof image acquisition device to obtain first point cloud information of the monitoring scene, register it with the initial point cloud information and then perform a set difference operation to obtain second point cloud information, and project the second point cloud information onto the horizontal plane to obtain a person point cloud information set; dilate and erode a binarized image obtained by thresholding scene information acquired by an infrared image acquisition device, to obtain a connected-region information set; and respectively project the person point cloud information set and the connected-region information set into the image plane space of an RGB image acquisition device and perform a set intersection operation, and, in response to the common pixels exceeding a first threshold, obtain a corresponding human body region set.
The above description is merely a preferred embodiment of the present application and an illustration of the technical principles employed. Those skilled in the art should understand that the scope of disclosure involved in the present application is not limited to technical solutions formed by the specific combination of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalent features without departing from the above disclosed concept, for example, technical solutions formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) the present application.

Claims (12)

  1. A multimodal unsupervised pedestrian pixel-level semantic labeling method, characterized in that it comprises:
    S1: performing three-dimensional reconstruction of an unmanned monitoring scene to obtain initial point cloud information of the monitoring scene;
    S2: using a Tof image acquisition device to obtain first point cloud information of the monitoring scene, registering it with the initial point cloud information and then performing a set difference operation to obtain second point cloud information, and projecting the second point cloud information onto the horizontal plane to obtain a person point cloud information set;
    S3: dilating and eroding a binarized image obtained by thresholding scene information acquired by an infrared image acquisition device, to obtain a connected-region information set; and
    S4: projecting the person point cloud information set and the connected-region information set, using the positional relationships between the already calibrated cameras, into the image plane space of an RGB image acquisition device and performing a set intersection operation, and, in response to the common pixels exceeding a first threshold, obtaining a corresponding human body region set.
  2. The multimodal unsupervised pedestrian pixel-level semantic labeling method according to claim 1, characterized in that step S1 specifically comprises:
    selecting an arbitrary origin in the unmanned monitoring scene and establishing a three-dimensional coordinate system;
    setting m*n points at intervals along the x-axis and z-axis directions as image acquisition positions of the RGB image acquisition device, selecting shooting angles at intervals of k degrees for the pitch, yaw, and roll angles respectively, and collecting M = m*n*(180/k)*(180/k)*(180/k) images;
    performing three-dimensional reconstruction of the monitoring scene from the M images using the Structure-from-Motion three-dimensional reconstruction algorithm, and obtaining the initial point cloud information.
  3. The multimodal unsupervised pedestrian pixel-level semantic labeling method according to claim 1, characterized in that the first point cloud information and the initial point cloud information are registered using the iterative closest point algorithm.
  4. The multimodal unsupervised pedestrian pixel-level semantic labeling method according to claim 2, characterized in that the second point cloud information is projected onto the XY plane of the three-dimensional coordinate system, several circular regions are obtained based on the Hough transform, and the point cloud information corresponding to the same circular region is included in the person point cloud information set.
  5. The multimodal unsupervised pedestrian pixel-level semantic labeling method according to claim 1, characterized in that in step S3, after the binarized image is dilated and eroded, the method further comprises removing regions whose pixel area is smaller than a second threshold.
  6. The multimodal unsupervised pedestrian pixel-level semantic labeling method according to claim 5, characterized in that the first threshold is taken from the range of 20*40-80*160, and the second threshold is taken from the range of 1000-8196.
  7. The multimodal unsupervised pedestrian pixel-level semantic labeling method according to claim 1, characterized in that a Tof image acquisition device, an infrared image acquisition device, and an RGB image acquisition device are respectively installed in the monitoring scene, and the positional relationships and attitude information of the Tof image acquisition device, the infrared image acquisition device, and the RGB image acquisition device are respectively calculated using the initial point cloud information.
  8. The multimodal unsupervised pedestrian pixel-level semantic labeling method according to claim 7, characterized in that the specific manner of obtaining the positional relationships and the attitude information comprises:
    using the Tof image acquisition device to obtain a depth image of the monitoring scene, and obtaining the degree-of-freedom pose of the Tof image acquisition device in the monitoring scene by the iterative closest point algorithm in combination with the initial point cloud information;
    using the infrared image acquisition device and the RGB image acquisition device to obtain color images of the monitoring scene, and, using the SIFT descriptor and the Bag-of-Words feature description algorithm, obtaining the position and attitude information of the infrared image acquisition device and the RGB image acquisition device based on the Bundle Adjustment algorithm, according to the collected images and the initial point cloud information.
  9. The multimodal unsupervised pedestrian pixel-level semantic labeling method according to claim 7 or 8, characterized in that a first transformation matrix between the Tof image acquisition device and the infrared image acquisition device, a second transformation matrix between the Tof image acquisition device and the RGB image acquisition device, and a third transformation matrix between the infrared image acquisition device and the RGB image acquisition device are obtained according to the positional relationships, the attitude information, and the intrinsic parameters of the image acquisition devices.
  10. The multimodal unsupervised pedestrian pixel-level semantic labeling method according to claim 9, characterized in that step S4 specifically comprises:
    using the second transformation matrix to project the person point cloud information onto the image plane space of the RGB image acquisition device according to the camera imaging principle, to obtain a first projection region set;
    using the third transformation matrix to project the connected-region information onto the image plane space of the RGB image acquisition device, to obtain a second projection region set;
    performing an intersection operation on the pixels of the first projection region set and the second projection region set.
  11. A computer-readable storage medium on which one or more computer programs are stored, characterized in that, when executed by a computer processor, the one or more computer programs implement the method of any one of claims 1 to 10.
  12. A multimodal unsupervised pedestrian pixel-level semantic labeling system, characterized in that the system comprises:
    an initial point cloud information acquisition unit, configured to perform three-dimensional reconstruction of an unmanned monitoring scene and obtain initial point cloud information of the monitoring scene;
    a person point cloud information set acquisition unit, configured to use a Tof image acquisition device to obtain first point cloud information of the monitoring scene, register it with the initial point cloud information and then perform a set difference operation to obtain second point cloud information, and project the second point cloud information onto the horizontal plane to obtain a person point cloud information set;
    a connected-region information set acquisition unit, configured to dilate and erode a binarized image obtained by thresholding scene information acquired by an infrared image acquisition device, to obtain a connected-region information set; and
    a human body region set acquisition unit, configured to respectively project the person point cloud information set and the connected-region information set into the image plane space of an RGB image acquisition device and perform a set intersection operation, and, in response to the common pixels exceeding a first threshold, obtain a corresponding human body region set.
PCT/CN2021/074232 2020-12-30 2021-01-28 Multimodal unsupervised pedestrian pixel-level semantic labeling method and system WO2022141721A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011615688.6 2020-12-30
CN202011615688.6A CN112766061A (en) 2020-12-30 2020-12-30 Multi-mode unsupervised pedestrian pixel-level semantic annotation method and system

Publications (1)

Publication Number Publication Date
WO2022141721A1

Family

ID=75697793

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/074232 WO2022141721A1 (en) 2020-12-30 2021-01-28 Multimodal unsupervised pedestrian pixel-level semantic labeling method and system

Country Status (2)

Country Link
CN (1) CN112766061A (en)
WO (1) WO2022141721A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080088623A1 (en) * 2006-10-13 2008-04-17 Richard William Bukowski Image-mapped point cloud with ability to accurately represent point coordinates
CN110456363A (en) * 2019-06-17 2019-11-15 北京理工大学 The target detection and localization method of three-dimensional laser radar point cloud and infrared image fusion
CN111160278A (en) * 2019-12-31 2020-05-15 河南中原大数据研究院有限公司 Face texture structure data acquisition method based on single image sensor
CN111260773A (en) * 2020-01-20 2020-06-09 深圳市普渡科技有限公司 Three-dimensional reconstruction method, detection method and detection system for small obstacles


Also Published As

Publication number Publication date
CN112766061A (en) 2021-05-07

Similar Documents

Publication Publication Date Title
JP7221089B2 (en) Stable simultaneous execution of location estimation and map generation by removing dynamic traffic participants
US10607369B2 (en) Method and device for interactive calibration based on 3D reconstruction in 3D surveillance system
Arth et al. Instant outdoor localization and slam initialization from 2.5 d maps
CN106791710B (en) Target detection method and device and electronic equipment
WO2018196396A1 (en) Person re-identification method based on consistency constraint feature learning
US20210342990A1 (en) Image coordinate system transformation method and apparatus, device, and storage medium
Xue et al. Panoramic Gaussian Mixture Model and large-scale range background substraction method for PTZ camera-based surveillance systems
WO2021258579A1 (en) Image splicing method and apparatus, computer device, and storage medium
Santos et al. 3D plant modeling: localization, mapping and segmentation for plant phenotyping using a single hand-held camera
Xia et al. Zoom better to see clearer: Human part segmentation with auto zoom net
Pintore et al. Recovering 3D existing-conditions of indoor structures from spherical images
WO2022237048A1 (en) Pose acquisition method and apparatus, and electronic device, storage medium and program
He et al. Ground and aerial collaborative mapping in urban environments
CN108229281B (en) Neural network generation method, face detection device and electronic equipment
Gupta et al. Augmented reality system using lidar point cloud data for displaying dimensional information of objects on mobile phones
CN112861776A (en) Human body posture analysis method and system based on dense key points
JP2013037539A (en) Image feature amount extraction device and program thereof
KR101817440B1 (en) The 3d model-based object recognition techniques and systems with a multi-camera
Debaque et al. Thermal and visible image registration using deep homography
US9392146B2 (en) Apparatus and method for extracting object
CN111860084B (en) Image feature matching and positioning method and device and positioning system
WO2022141721A1 (en) Multimodal unsupervised pedestrian pixel-level semantic labeling method and system
Hanzla et al. Smart Traffic Monitoring through Drone Images via Yolov5 and Kalman Filter
Viguier et al. Resilient mobile cognition: Algorithms, innovations, and architectures
JP2017207960A (en) Image analysis device, image analysis method, and program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21912501

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21912501

Country of ref document: EP

Kind code of ref document: A1