CN111680670A - Cross-modal human head detection method and device - Google Patents

Cross-modal human head detection method and device

Info

Publication number
CN111680670A
Authority
CN
China
Prior art keywords
image
modality
background
image frame
human head
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010808291.2A
Other languages
Chinese (zh)
Other versions
CN111680670B (en)
Inventor
Chen Junyi
Chu Yiwen
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changsha Xiaogu Technology Co ltd
Original Assignee
Changsha Xiaogu Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changsha Xiaogu Technology Co ltd
Priority to CN202010808291.2A
Publication of CN111680670A
Application granted
Publication of CN111680670B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects

Abstract

The embodiment of the invention provides a cross-modal human head detection method and apparatus, a terminal device, and a computer-readable medium. The method comprises the following steps: acquiring a current image frame of a fixed scene in a plurality of image modalities and a historical image frame set of the fixed scene in a 3D image modality among the plurality of image modalities; performing background modeling according to the historical image frame set of the fixed scene in the 3D image modality, and acquiring a non-background mask of the current image frame of the fixed scene in the 3D image modality according to the background modeling; extracting non-background regions of the current image frame in the plurality of modalities by using the non-background mask, and performing channel fusion on those non-background regions to obtain a multi-modal fusion image; and inputting the multi-modal fusion image into a trained deep learning network model for human head detection. The method can reduce the influence of the external environment on image detection and greatly reduce the human head misrecognition rate.

Description

Cross-modal human head detection method and device
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a fixed-scene cross-modal human head detection method and apparatus based on a 3D (three-dimensional) stereo camera, as well as a corresponding terminal device and computer-readable medium.
Background
In recent years, technologies in the field of artificial intelligence, such as computer vision and machine learning, have developed rapidly. Computer Vision (CV) is the science of how to make machines "see": it uses cameras and computers in place of human eyes to identify, track, and measure targets, and further processes the resulting images so that the output is better suited to human observation or to transmission to an instrument for detection. As a scientific discipline, computer vision studies theories and techniques that attempt to build artificial intelligence systems capable of extracting information from images or multidimensional data. CV technology generally includes image processing, image recognition, image semantic understanding, image retrieval, OCR (optical character recognition), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, and simultaneous localization and mapping, and also covers common biometric technologies such as face recognition and fingerprint recognition.
Human head detection is a fundamental task underlying many head-related applications, such as person identification, visual tracking, and action recognition. In large video-monitored venues such as hotels and airports, the position of every person must be detected and the total number of people counted from the detected heads. In unsupervised and crowded environments in particular, the probability of accidents increases, so the number of people must be restricted or the detected heads used in subsequent tasks. Head detection is thus a widely used task: in complex scenes, the presence of a person must be recognized through head detection. As a subclass of object detection that must locate the position of each head in a picture, it places higher demands on detector performance.
Human head detection may be considered a particular form of object detection. Many convolutional neural network (CNN) based object detection methods have been adapted to the head detection task and achieved significant performance gains. Nevertheless, head detection remains very challenging. In complex scenes, occlusion of heads, illumination interference, and scene blur cause many false alarms and missed detections, greatly reducing accuracy, so existing detectors cannot meet the requirements of practical applications; further research on head detection is therefore necessary.
Disclosure of Invention
In view of this, embodiments of the present invention provide a cross-modal human head detection method, apparatus, terminal device, and computer-readable medium, which can reduce the influence of the external environment on image detection and greatly reduce the human head misrecognition rate.
The first aspect of the embodiments of the present invention provides a cross-modal human head detection method, including:
acquiring a current image frame of a fixed scene in a plurality of image modalities and a historical image frame set of the fixed scene in a 3D image modality in the plurality of image modalities;
performing background modeling according to the historical image frame set of the fixed scene in the 3D image modality, and acquiring a non-background mask of the current image frame of the fixed scene in the 3D image modality according to the background modeling;
extracting non-background regions of the current image frame in the plurality of image modalities by using the non-background mask, and performing channel fusion on the non-background regions of the current image frame in the plurality of image modalities to obtain a multi-modal fusion image;
and inputting the multi-modal fusion image into a trained deep learning network model so as to perform head detection on the current image frame of the fixed scene.
A second aspect of an embodiment of the present invention provides a cross-modal human head detection apparatus, including:
the image processing device comprises an acquisition module, a processing module and a display module, wherein the acquisition module is used for acquiring a current image frame of a fixed scene in a plurality of image modalities and a historical image frame set of the fixed scene in a 3D image modality in the plurality of image modalities;
the modeling module is used for performing background modeling according to the historical image frame set of the fixed scene in the 3D image modality and acquiring a non-background mask of the current image frame of the fixed scene in the 3D image modality according to the background modeling;
the fusion module is used for extracting non-background regions of the current image frame in the plurality of image modalities by using the non-background mask, and performing channel fusion on the non-background regions of the current image frame in the plurality of image modalities to obtain a multi-modal fusion image;
and the detection module is used for inputting the multi-modal fusion image into a trained deep learning network model so as to perform head detection on the current image frame of the fixed scene.
A third aspect of the embodiments of the present invention provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the cross-modal head detection method when executing the computer program.
A fourth aspect of the embodiments of the present invention provides a computer-readable medium storing a computer program which, when executed by a processor, implements the steps of the cross-modal head detection method.
In the cross-modal human head detection method provided by the embodiment of the invention, current image frames of a fixed scene in a plurality of image modalities and a historical image frame set of the fixed scene in a 3D image modality among the plurality of image modalities are acquired; background modeling is performed according to the historical image frame set of the fixed scene in the 3D image modality, and a non-background mask of the current image frame of the fixed scene in the 3D image modality is obtained according to the background modeling; non-background regions of the current image frames in the plurality of modalities are extracted by using the non-background mask, and channel fusion is performed on these non-background regions to obtain a multi-modal fusion image; finally, the multi-modal fusion image is input into a trained deep learning network model for human head detection. This reduces the influence of the external environment on image detection and greatly reduces the human head misrecognition rate.
Drawings
In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a flowchart of a cross-modal human head detection method according to an embodiment of the present invention;
FIG. 2 is a flow chart of another cross-modal head detection method provided by an embodiment of the invention;
fig. 3 is a schematic structural diagram of a cross-modal human head detection apparatus according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a detailed structure of the acquisition module in FIG. 3;
FIG. 5 is a schematic diagram of a detailed structure of the modeling module in FIG. 3;
fig. 6 is a schematic structural diagram of another cross-modal human head detection apparatus according to an embodiment of the present invention;
fig. 7 is a schematic diagram of a terminal device according to an embodiment of the present invention.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.
In order to explain the technical means of the present invention, the following description will be given by way of specific examples.
Referring to fig. 1, fig. 1 is a flowchart of a cross-modal human head detection method according to an embodiment of the present invention. As shown in fig. 1, the cross-modal human head detection method of the present embodiment includes the following steps:
S101: acquiring a current image frame of a fixed scene in a plurality of image modalities and a historical image frame set of the fixed scene in a 3D image modality of the plurality of image modalities.
In the embodiment of the present invention, the image acquisition devices corresponding to the plurality of image modalities may be synchronized in time and space. Specifically, the plurality of image modalities include a 3D image modality, and further include at least one of a visible light image modality, an infrared image modality, an ultrasonic image modality, a point cloud image modality obtained by a millimeter-wave radar or a laser radar, and the like. Accordingly, the corresponding image acquisition devices may be, for example, a 3D stereo camera capable of acquiring depth information, an infrared camera, an ultrasonic image acquisition device, a millimeter-wave radar, or a laser radar. The method for performing time and space synchronization on different image acquisition devices is the same as in the prior art and is therefore not detailed here. A current image frame of a fixed scene in the plurality of image modalities and a historical image frame set of the fixed scene in the 3D image modality can then be acquired based on the temporal and spatial synchronization. In another embodiment of the present invention, historical image frame sets of the fixed scene in image modalities other than the 3D image modality may also be acquired.
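As a concrete illustration of S101, the following minimal Python sketch grabs temporally aligned frames from several modality streams and keeps a rolling history of 3D (depth) frames for the background modeling of S102. The capture interface (objects with a read() method), the modality key "3d", and the history length are assumptions of this sketch, not part of the patented implementation.

```python
# Sketch of S101: synchronized multi-modal frame grabbing (assumed API).
from collections import deque

HISTORY_LEN = 200  # number of historical 3D frames kept (assumed value)
depth_history = deque(maxlen=HISTORY_LEN)  # historical image frame set (3D)

def grab_synchronized_frames(captures):
    """captures: dict mapping modality name -> capture object whose read()
    returns (ok, frame). Devices are assumed already synchronized in time
    and space, so frames read in the same round are aligned."""
    frames = {}
    for modality, cap in captures.items():
        ok, frame = cap.read()
        if not ok:
            raise RuntimeError(f"failed to read a frame from {modality}")
        frames[modality] = frame
    # accumulate the 3D-modality frame into the history used in S102
    depth_history.append(frames["3d"].copy())
    return frames
```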
S102: performing background modeling according to the historical image frame set of the fixed scene in the 3D image modality, and acquiring a non-background mask of the current image frame of the fixed scene in the 3D image modality according to the background modeling.
In the embodiment of the present invention, background modeling may be performed on the fixed scene according to the depth information in the historical image frame set of the fixed scene in the 3D image modality, where the depth information in the historical image frame set refers to the depth information in each historical image frame of the set. The specific background modeling method is the same as in the prior art, for example Gaussian mixture background modeling, and is therefore not detailed here. Then, a non-background mask of the current image frame of the fixed scene in the 3D image modality may be obtained according to the background modeling, and noise in the non-background mask may be eliminated through morphological opening and closing operations. The methods for obtaining the non-background mask from the background modeling and for eliminating noise through opening and closing operations are the same as in the prior art and are therefore not repeated here.
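By way of illustration only, S102 might be realized with OpenCV's Gaussian mixture background subtractor, as sketched below. The patent names Gaussian mixture background modeling; the specific functions, parameters, and the assumption that depth frames have been converted to 8-bit single-channel images are choices of this sketch.

```python
# Sketch of S102: background model from historical 3D frames, then a
# denoised non-background mask for the current 3D frame.
import cv2
import numpy as np

bg_model = cv2.createBackgroundSubtractorMOG2(history=200, detectShadows=False)
for depth_frame in depth_history:  # historical frame set from the S101 sketch
    bg_model.apply(depth_frame, learningRate=-1)  # -1: automatic learning rate

def non_background_mask(current_depth):
    """Return a binary non-background mask for the current 3D frame,
    denoised by morphological opening then closing."""
    mask = bg_model.apply(current_depth, learningRate=0)  # 0: freeze the model
    mask = (mask > 0).astype(np.uint8) * 255
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)   # remove specks
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)  # fill small holes
    return mask
```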
S103: extracting non-background regions of the current image frame in the plurality of modalities by using the non-background mask, and performing channel fusion on the non-background regions of the current image frame in the plurality of modalities to obtain a multi-modal fusion image.
In the embodiment of the present invention, the non-background mask obtained in S102 may be used to extract the non-background region of the current image frame of the fixed scene in the 3D image modality, as well as the non-background regions of the current image frames of the fixed scene in the other image modalities; the non-background region of the current image frame in the 3D image modality and the non-background regions of the current image frames in the other image modalities are then channel-fused to obtain a multi-modal fusion image of the non-background region of the fixed scene.
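A minimal sketch of S103 follows, assuming the modality frames are already spatially registered to the 3D frame; the dictionary layout and the fixed channel ordering are illustrative assumptions.

```python
# Sketch of S103: mask out background in every modality, then stack the
# masked images along the channel axis into one multi-modal fusion image.
import numpy as np

def fuse_modalities(frames, mask):
    """frames: dict modality -> HxW or HxWxC image, spatially aligned;
    mask: HxW binary mask from S102 (non-zero = non-background)."""
    binary = (mask > 0)
    channels = []
    for modality in sorted(frames):      # fixed, reproducible channel order
        img = frames[modality]
        if img.ndim == 2:                # promote single-channel images
            img = img[..., np.newaxis]
        channels.append(img * binary[..., np.newaxis])  # zero the background
    return np.concatenate(channels, axis=-1)            # fusion image
```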
S104: inputting the multi-modal fusion image into a trained deep learning network model for human head detection.
In the embodiment of the invention, after the multi-modal fusion image of the non-background area of the fixed scene is obtained, the multi-modal fusion image can be input to a trained deep learning network model for human head detection. The deep learning network model herein may include, for example, a Tiny-yolo convolutional neural network model, other convolutional network models, residual network models, and the like.
In a test example of the present invention, as shown in Table 1, 1592 test images containing complex scenes (at most 5 people per image) were detected by two methods, with a threshold of 0.5: a detection covering at least 0.5 of a head is counted as 1, and anything below 0.5 is ignored. In the first method, the 2063 heads to be detected in the 1592 single infrared (IR) images were detected with the prior art; the result was 245 false detections and 54 missed detections. In the second method, the 2063 heads to be detected in the 1592 IR&3D images (meaning the 1592 images comprise both IR images and 3D images) were detected with the cross-modal head detection method provided by the embodiment of the invention; the result was 23+160 false detections and 80 missed detections. The method fusing the depth information of the 3D images is therefore more stable and has a low false detection rate: the 160 false detections all lie at image edges, where there is no depth information and repeated scenes occur; the missed detections, higher than with the existing single-IR method, mostly arose because the lens was occluded in the test scene and the lens edge carries no depth information. In addition, because the training data were not diverse enough, neither method could detect heads in some scenes.
Table 1

1592 test images     IR image modality    IR&3D image modality
To be detected       2063                 2063
False detections     245                  23+160
Missed detections    54                   80
With the cross-modal human head detection method provided in fig. 1, current image frames of a fixed scene in a plurality of image modalities and a historical image frame set of the fixed scene in the 3D image modality may be acquired; background modeling may be performed according to the historical image frame set, and a non-background mask of the current image frame in the 3D image modality obtained according to the background modeling; non-background regions of the current image frames in the plurality of modalities may be extracted with the mask and channel-fused into a multi-modal fusion image; finally, the fusion image may be input into a trained deep learning network model for head detection. This eliminates the influence of a complex background on the detection result while combining the advantages of multi-modal data, so detection accuracy can be improved; moreover, since the amount of multi-modal data after background elimination is greatly reduced, the computation load drops and the detection speed rises.
Referring to fig. 2, fig. 2 is a flowchart of another cross-modal head detection method according to an embodiment of the present invention. The other cross-modal head detection method shown in fig. 2 is optimized based on the cross-modal head detection method shown in fig. 1. In addition to the steps S101 to S104 in fig. 1, the cross-modal human head detection method shown in fig. 2 further includes, before step S101:
S201: training the deep learning network model by utilizing manually labeled multi-modal fusion images.
Specifically, for edge embedded devices, the conventional tiny-yolo convolutional neural network has a large memory footprint, many parameters, and a heavy computation load. The embodiment of the invention therefore prunes the redundant, computationally inefficient layers of the conventional tiny-yolo convolutional neural network to design a high-performance convolutional feature extraction network. The tiny-yolo convolutional neural network model is thereby compressed from 33.1 Mb to 6.5 Mb, running about 2 times faster than the conventional tiny-yolo network. The improved tiny-yolo convolutional neural network comprises 12 convolution operations and 5 max pooling (MaxPooling) operations; most convolutions use a 3×3 kernel with stride 1, and a few use a 1×1 kernel with stride 1. Each convolution operation may be followed by Leaky-ReLU activation. The improved tiny-yolo deep convolutional network finally computes detection results at 2 scales (13×13 and 26×26); the results of the 2 scales can be output and merged, and a non-maximum suppression algorithm applied after merging to improve detection accuracy. In the embodiment of the invention, the improved tiny-yolo convolutional neural network can be trained with manually labeled multi-modal fusion images. Specifically, a plurality of 3D images of a monitored scene may first be acquired, together with a plurality of images of the monitored scene in at least one other image modality (for example, 18047 RGB images, 18047 infrared images, and 18047 3D images of the monitored scene); then background modeling is performed with empty-scene 3D images of the monitored scene (for example, 200 empty-scene 3D images selected from the 18047 3D images) to obtain a 3D background mask; non-background regions in the 3D images and in the images of the at least one other modality are extracted with the 3D background mask and fused, and the heads in these images are manually labeled to obtain head samples; the head samples are then fed into the improved tiny-yolo convolutional neural network for training to obtain an original head deep convolutional model. To increase the number of head samples and improve training accuracy, all head samples may undergo data enhancement, and the enhanced samples are fed into the original head deep convolutional model for training to obtain the final network parameters, from which the trained improved tiny-yolo convolutional neural network model is obtained.
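For illustration, the following PyTorch sketch builds a network with exactly the stated budget: 12 convolutions, 5 max poolings, mostly 3×3/stride-1 kernels plus 1×1 bottlenecks, Leaky-ReLU activations, and detection outputs at 13×13 and 26×26 for a 416×416 input. The channel widths, anchor count, input channel count, and layer ordering are assumptions of this sketch, not the patented layout.

```python
# Structural sketch of an improved tiny-yolo-style head detector.
import torch.nn as nn

def conv(in_ch, out_ch, k):
    """Conv + BatchNorm + Leaky-ReLU block with stride 1."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, stride=1, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1, inplace=True),
    )

class TinyYoloHeadDetector(nn.Module):
    """Single-class (head) detector; each cell predicts B boxes x (4+1+1)."""
    def __init__(self, in_channels=4, num_anchors=3):  # 4 channels: assumed fusion depth
        super().__init__()
        out_ch = num_anchors * (4 + 1 + 1)
        pool = lambda: nn.MaxPool2d(2, 2)
        # 416 -> 208 -> 104 -> 52 -> 26 via 4 poolings; 5 convolutions
        self.stage1 = nn.Sequential(
            conv(in_channels, 16, 3), pool(),
            conv(16, 32, 3), pool(),
            conv(32, 64, 3), pool(),
            conv(64, 128, 3), pool(),
            conv(128, 256, 3),
        )
        # 26 -> 13 via the 5th pooling; 3 more convolutions (one 1x1 bottleneck)
        self.stage2 = nn.Sequential(
            pool(),
            conv(256, 512, 3),
            conv(512, 256, 1),
            conv(256, 512, 3),
        )
        # two detection heads, 2 convolutions each -> 12 convolutions in total
        self.head13 = nn.Sequential(conv(512, 256, 3), nn.Conv2d(256, out_ch, 1))
        self.head26 = nn.Sequential(conv(256, 128, 3), nn.Conv2d(128, out_ch, 1))

    def forward(self, x):            # x: (N, in_channels, 416, 416) fusion image
        f26 = self.stage1(x)         # (N, 256, 26, 26)
        f13 = self.stage2(f26)       # (N, 512, 13, 13)
        return self.head13(f13), self.head26(f26)  # 13x13 and 26x26 scales
```

The two outputs would then be decoded, merged, and filtered with non-maximum suppression as described above.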
It is noted that data enhancement increases the number of image samples through image-transforming operations, which may include, for example, adjusting image brightness, flipping the image, randomly jittering the image color, and randomly jittering the image depth. After the improved tiny-yolo convolutional neural network model is trained, in S104 the multi-modal fusion image may be input into the trained improved tiny-yolo convolutional neural network model for head detection.
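A hedged sketch of the listed data enhancement operations follows (brightness adjustment, image flipping, random color jitter, random depth jitter). The probabilities, jitter ranges, and the assumption that the first three channels hold color and the fourth holds depth, with pixel values normalized to [0, 1], are illustrative only.

```python
# Sketch of the data enhancement applied to head samples before training.
import numpy as np

def augment(fusion_img, boxes, rng=None):
    """fusion_img: HxWxC float array in [0, 1]; boxes: Nx4 [x1, y1, x2, y2]."""
    rng = rng or np.random.default_rng()
    img = fusion_img.astype(np.float32).copy()
    boxes = boxes.copy()
    h, w = img.shape[:2]
    if rng.random() < 0.5:                        # adjust image brightness
        img *= rng.uniform(0.7, 1.3)
    if rng.random() < 0.5:                        # flip the image horizontally
        img = img[:, ::-1]
        boxes[:, [0, 2]] = w - boxes[:, [2, 0]]   # mirror box x-coordinates
    if rng.random() < 0.5:                        # randomly jitter image color
        img[..., :3] *= rng.uniform(0.9, 1.1, size=3)
    if rng.random() < 0.5 and img.shape[-1] > 3:  # randomly jitter image depth
        img[..., 3] += rng.normal(0.0, 0.02, size=(h, w))
    return np.clip(img, 0.0, 1.0), boxes
```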
In addition to the advantages of the cross-modal human head detection method provided in fig. 1, the method provided in fig. 2 further increases head detection speed through the improved tiny-yolo detection network model adopted in S201 and S104; this model can be deployed on edge artificial intelligence devices, widening the range of applications, for example to pedestrian counting.
Referring to fig. 3, fig. 3 is a schematic structural diagram of a cross-modal human head detection apparatus according to an embodiment of the present invention. As shown in fig. 3, the cross-modal human head detection apparatus 30 of the present embodiment includes an obtaining module 301, a modeling module 302, a fusion module 303, and a detection module 304, which are respectively configured to execute the specific methods of S101, S102, S103, and S104 in fig. 1; details can be found in the related introduction of fig. 1 and are only briefly described here:
an obtaining module 301, configured to obtain a current image frame of a fixed scene in a plurality of image modalities and a historical image frame set of the fixed scene in a 3D image modality of the plurality of image modalities.
The modeling module 302 is configured to perform background modeling according to the historical image frame set of the fixed scene in the 3D image modality, and obtain a non-background mask of a current image frame of the fixed scene in the 3D image modality according to the background modeling.
The fusion module 303 is configured to extract the non-background regions of the current image frame in the multiple modalities by using the non-background mask, and perform channel fusion on the non-background regions of the current image frame in the multiple modalities to obtain a multi-modality fusion image.
And the detection module 304 is configured to input the multi-modal fusion image to the trained deep learning network model for human head detection.
Further, referring to fig. 4, the obtaining module 301 may specifically include:
a synchronization unit 3011, configured to perform temporal and spatial synchronization on image acquisition devices corresponding to multiple image modalities.
An obtaining unit 3012, configured to obtain, based on the temporal and spatial synchronization, a current image frame of a fixed scene in the plurality of image modalities and a historical image frame set of the fixed scene in a 3D image modality of the plurality of image modalities; the plurality of image modalities comprise a 3D image modality and at least one of a visible light image modality, an infrared image modality, an ultrasonic image modality, a point cloud image modality obtained by a millimeter wave radar and a laser radar.
Further, referring to fig. 5, the modeling module 302 may specifically include:
a modeling unit 3021, configured to perform background modeling on the fixed scene according to depth information of a historical image frame set of the fixed scene in a 3D image modality.
A masking unit 3022, configured to obtain a non-background mask of a current image frame of the fixed scene in a 3D image modality based on the background modeling.
A noise elimination unit 3023, configured to eliminate noise in the non-background mask through morphological opening and closing operations.
The cross-modal head detection apparatus provided in fig. 3 may acquire current image frames of a fixed scene in a plurality of image modalities and a historical image frame set of the fixed scene in the 3D image modality; perform background modeling according to the historical image frame set and obtain a non-background mask of the current image frame in the 3D image modality according to the background modeling; extract non-background regions of the current image frames in the plurality of modalities with the mask and channel-fuse them into a multi-modal fusion image; and finally input the fusion image into a trained deep learning network model for head detection. This eliminates the influence of a complex background on the detection result while combining the advantages of multi-modal data, so detection accuracy can be improved; moreover, since the amount of multi-modal data after background elimination is greatly reduced, the computation load drops and the detection speed rises.
Referring to fig. 6, fig. 6 is a schematic structural diagram of another cross-modal human head detection apparatus according to an embodiment of the present invention. The cross-modal head detection apparatus 60 shown in fig. 6 is optimized based on the cross-modal head detection apparatus 30 provided in fig. 3. In addition to the acquisition module 301, the modeling module 302, the fusion module 303, and the detection module 304 of the apparatus 30, the cross-modal head detection apparatus 60 includes:
the training module 601 is configured to train the deep learning network model by using a manually labeled multi-modal fusion image before the obtaining module 301 obtains a current image frame of a fixed scene in a plurality of image modalities and a historical image frame set of the fixed scene in a 3D image modality of the plurality of image modalities. More specifically, the training module is configured to execute the specific method in S201 in fig. 2, and details can be referred to in the related description of fig. 2, and therefore are not described herein again.
In addition to the advantages of the cross-modal human head detection apparatus provided in fig. 3, the improved tiny-yolo detection network model adopted by the training module 601 and the detection module 304 can further increase the human head detection speed, and the improved tiny-yolo detection network model can be deployed in edge artificial intelligence equipment, so that the application range can be increased.
Fig. 7 is a schematic diagram of a terminal device according to an embodiment of the present invention. As shown in fig. 7, the terminal device 7 of this embodiment includes: a processor 70, a memory 71, and a computer program 72 stored in the memory 71 and executable on the processor 70, such as a program for performing cross-modal human head detection. When executing the computer program 72, the processor 70 implements the steps in the above method embodiments, e.g., S101 to S104 shown in fig. 1. Alternatively, when executing the computer program 72, the processor 70 implements the functions of the modules/units in the above device embodiments, such as the functions of modules 301 to 304 shown in fig. 3.
Illustratively, the computer program 72 may be partitioned into one or more modules/units, which are stored in the memory 71 and executed by the processor 70 to implement the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, used to describe the execution of the computer program 72 in the terminal device 7. For example, the computer program 72 may be partitioned into the acquisition module 301, the modeling module 302, the fusion module 303, and the detection module 304 (modules in a virtual device), whose specific functions are as follows:
an obtaining module 301, configured to obtain a current image frame of a fixed scene in a plurality of image modalities and a historical image frame set of the fixed scene in a 3D image modality of the plurality of image modalities.
The modeling module 302 is configured to perform background modeling according to the historical image frame set of the fixed scene in the 3D image modality, and obtain a non-background mask of a current image frame of the fixed scene in the 3D image modality according to the background modeling.
The fusion module 303 is configured to extract the non-background regions of the current image frame in the multiple modalities by using the non-background mask, and perform channel fusion on the non-background regions of the current image frame in the multiple modalities to obtain a multi-modality fusion image.
And the detection module 304 is configured to input the multi-modal fusion image to the trained deep learning network model for human head detection.
The terminal device 7 may be a desktop computer, a notebook, a palm computer, a cloud server, or other computing devices. The terminal device 7 may include, but is not limited to, a processor 70, a memory 71. It will be appreciated by those skilled in the art that fig. 7 is merely an example of a terminal device 7 and does not constitute a limitation of the terminal device 7 and may comprise more or less components than shown, or some components may be combined, or different components, for example the terminal device may further comprise input output devices, network access devices, buses, etc.
The Processor 70 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component, etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 71 may be an internal storage unit of the terminal device 7, such as a hard disk or a memory of the terminal device 7. The memory 71 may also be an external storage device of the terminal device 7, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the terminal device 7. Further, the memory 71 may also include both an internal storage unit of the terminal device 7 and an external storage device. The memory 71 is used for storing the computer programs and other programs and data required by the terminal device 7. The memory 71 may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other ways. For example, the above-described embodiments of the apparatus/terminal device are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. Based on such understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of the method embodiments may be implemented. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution medium, and the like. It should be noted that the computer readable medium may contain content that is subject to appropriate increase or decrease as required by legislation and patent practice in jurisdictions, for example, in some jurisdictions, computer readable media does not include electrical carrier signals and telecommunications signals as is required by legislation and patent practice.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims (10)

1. A cross-modal human head detection method is characterized by comprising the following steps:
acquiring a current image frame of a fixed scene in a plurality of image modalities and a historical image frame set of the fixed scene in a 3D image modality in the plurality of image modalities;
performing background modeling according to the historical image frame set of the fixed scene in the 3D image modality, and acquiring a non-background mask of a current image frame of the fixed scene in the 3D image modality according to the background modeling;
extracting non-background regions of the current image frame in the plurality of image modalities by using the non-background mask, and performing channel fusion on the non-background regions of the current image frame in the plurality of image modalities to obtain a multi-modal fusion image;
and inputting the multi-modal fusion image into a trained deep learning network model for human head detection.
2. The cross-modal head detection method according to claim 1, wherein the acquiring of a current image frame of a fixed scene in a plurality of image modalities and a historical image frame set of the fixed scene in a 3D image modality of the plurality of image modalities comprises:
carrying out time and space synchronization on image acquisition equipment corresponding to a plurality of image modalities;
based on the temporal and spatial synchronization, obtaining a current image frame of a fixed scene in the plurality of image modalities and a historical image frame set of the fixed scene in a 3D image modality of the plurality of image modalities; the plurality of image modalities comprise a 3D image modality and at least one of a visible light image modality, an infrared image modality, an ultrasonic image modality, a point cloud image modality obtained by a millimeter wave radar and a laser radar.
3. The method according to claim 1, wherein the background modeling according to the historical image frame set of the fixed scene in the 3D image modality, and obtaining the non-background mask of the current image frame of the fixed scene in the 3D image modality according to the background modeling comprises:
performing background modeling on the fixed scene according to the depth information of the historical image frame set of the fixed scene in the 3D image modality;
obtaining a non-background mask of a current image frame of the fixed scene in a 3D image modality based on the background modeling;
and eliminating the noise of the non-background mask through an opening and closing operation.
4. The cross-modal head detection method according to any one of claims 1 to 3, wherein, before the acquiring of a current image frame of a fixed scene in a plurality of image modalities and a historical image frame set of the fixed scene in a 3D image modality of the plurality of image modalities, the method further comprises:
training the deep learning network model by utilizing a manually labeled multi-modal fusion image.
5. The cross-modal human head detection method according to claim 4, wherein the deep learning network model is an improved tiny-yolo convolutional neural network comprising 12 convolution operations and 5 max pooling operations, a majority of the convolution operations having 3×3 convolution kernels with stride 1 and a minority having 1×1 convolution kernels with stride 1, and wherein the training of the deep learning network model by utilizing the manually labeled multi-modal fusion image comprises:
acquiring a plurality of 3D images of a monitoring scene and a plurality of images of the monitoring scene in at least one other image modality except for the 3D image modality;
performing background modeling by using a plurality of empty-scene 3D images corresponding to the monitored scene to obtain a 3D background mask, extracting non-background regions in the plurality of 3D images and in the plurality of images of the at least one other image modality through the 3D background mask, fusing these non-background regions, and manually labeling the heads in the plurality of 3D images and the plurality of images of the at least one other image modality to obtain head samples;
putting the human head sample into an improved tiny-yolo convolutional neural network for training to obtain an original human head deep convolutional model;
carrying out data enhancement processing on all head samples to increase the number of head samples, sending the enhanced head samples into the original head deep convolutional model for training to obtain final network parameters, and obtaining the trained improved tiny-yolo convolutional neural network model from the final network parameters; the data enhancement processing increases the number of image samples through image-transforming operations, which comprise adjusting image brightness, flipping the image, randomly jittering the image color, and randomly jittering the image depth.
6. A cross-modal head detection device, comprising:
the acquisition module is used for acquiring a current image frame of a fixed scene in a plurality of image modalities and a historical image frame set of the fixed scene in a 3D image modality of the plurality of image modalities;
the modeling module is used for performing background modeling according to the historical image frame set of the fixed scene in the 3D image modality and acquiring a non-background mask of the current image frame of the fixed scene in the 3D image modality according to the background modeling;
the fusion module is used for extracting non-background regions of the current image frame in the plurality of image modalities by using the non-background mask, and performing channel fusion on the non-background regions of the current image frame in the plurality of image modalities to obtain a multi-modal fusion image;
and the detection module is used for inputting the multi-modal fusion image into the trained deep learning network model for human head detection.
7. The cross-modal head detection apparatus of claim 6, further comprising:
the training module is used for training the deep learning network model by utilizing a manually labeled multi-modal fusion image before the acquisition module acquires a current image frame of a fixed scene in a plurality of image modalities and a historical image frame set of the fixed scene in a 3D image modality of the plurality of image modalities.
8. The cross-modal human head detection apparatus according to claim 7, wherein the deep learning network model is an improved tiny-yolo convolutional neural network comprising 12 convolution operations and 5 max pooling operations, a majority of the convolution operations having 3×3 convolution kernels with stride 1 and a minority having 1×1 convolution kernels with stride 1, and wherein the training module comprises:
the system comprises an acquisition unit, a processing unit and a display unit, wherein the acquisition unit is used for acquiring a plurality of 3D images of a monitoring scene and a plurality of images of the monitoring scene in at least one other image modality except for a 3D image modality;
the labeling unit is used for performing background modeling by using a plurality of empty scene 3D images corresponding to the monitoring scenes to obtain a 3D background mask, extracting non-background regions in the plurality of 3D images and the plurality of images in the at least one other image modality through the 3D background mask, and fusing the plurality of 3D images and the non-background regions in the plurality of images in the at least one other image modality to manually label heads in the plurality of 3D images and the plurality of images in the at least one other image modality to obtain a head sample;
the training unit is used for putting the human head sample into the improved tiny-yolo convolutional neural network for training to obtain an original human head deep convolutional model;
the enhancement unit is used for carrying out data enhancement processing on all head samples to increase the number of head samples, sending the enhanced head samples into the original head deep convolutional model for training to obtain final network parameters, and obtaining the trained improved tiny-yolo convolutional neural network model from the final network parameters; the data enhancement processing increases the number of image samples through image-transforming operations, which comprise adjusting image brightness, flipping the image, randomly jittering the image color, and randomly jittering the image depth.
9. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1-5 when executing the computer program.
10. A computer-readable medium, in which a computer program is stored which, when executed by a processor, carries out the steps of the method according to any one of claims 1 to 5.
CN202010808291.2A 2020-08-12 2020-08-12 Cross-modal human head detection method and device Active CN111680670B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010808291.2A CN111680670B (en) 2020-08-12 2020-08-12 Cross-modal human head detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010808291.2A CN111680670B (en) 2020-08-12 2020-08-12 Cross-modal human head detection method and device

Publications (2)

Publication Number Publication Date
CN111680670A (en) 2020-09-18
CN111680670B (en) 2020-12-01

Family

ID=72458303

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010808291.2A Active CN111680670B (en) 2020-08-12 2020-08-12 Cross-mode human head detection method and device

Country Status (1)

Country Link
CN (1) CN111680670B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113516147A (en) * 2020-12-21 2021-10-19 阿里巴巴集团控股有限公司 Printing error detection method, device, system and storage medium
CN115170449A (en) * 2022-06-30 2022-10-11 陕西科技大学 Method, system, device and medium for generating multi-mode fusion scene graph

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120230568A1 (en) * 2011-03-09 2012-09-13 Siemens Aktiengesellschaft Method and System for Model-Based Fusion of Multi-Modal Volumetric Images
CN102800126A (en) * 2012-07-04 2012-11-28 浙江大学 Method for recovering real-time three-dimensional body posture based on multimodal fusion
CN109543719A (en) * 2018-10-30 2019-03-29 浙江大学 Uterine neck atypia lesion diagnostic model and device based on multi-modal attention model

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120230568A1 (en) * 2011-03-09 2012-09-13 Siemens Aktiengesellschaft Method and System for Model-Based Fusion of Multi-Modal Volumetric Images
CN102800126A (en) * 2012-07-04 2012-11-28 浙江大学 Method for recovering real-time three-dimensional body posture based on multimodal fusion
CN109543719A (en) * 2018-10-30 2019-03-29 浙江大学 Uterine neck atypia lesion diagnostic model and device based on multi-modal attention model

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113516147A (en) * 2020-12-21 2021-10-19 阿里巴巴集团控股有限公司 Printing error detection method, device, system and storage medium
CN113516147B (en) * 2020-12-21 2024-03-05 阿里巴巴集团控股有限公司 Printing error detection method, device, system and storage medium
CN115170449A (en) * 2022-06-30 2022-10-11 陕西科技大学 Method, system, device and medium for generating multi-mode fusion scene graph
CN115170449B (en) * 2022-06-30 2023-09-22 陕西科技大学 Multi-mode fusion scene graph generation method, system, equipment and medium

Also Published As

Publication number Publication date
CN111680670B (en) 2020-12-01

Similar Documents

Publication Publication Date Title
CN109558864B (en) Face key point detection method, device and storage medium
US20190392587A1 (en) System for predicting articulated object feature location
Han et al. Fast saliency-aware multi-modality image fusion
CN106648078B (en) Multi-mode interaction method and system applied to intelligent robot
Hsieh et al. A kinect-based people-flow counting system
Durga et al. A ResNet deep learning based facial recognition design for future multimedia applications
Baig et al. Text writing in the air
CN111680670B (en) Cross-mode human head detection method and device
CN113221771B (en) Living body face recognition method, device, apparatus, storage medium and program product
Nath et al. Real time sign language interpreter
CN112487844A (en) Gesture recognition method, electronic device, computer-readable storage medium, and chip
CN113255516A (en) Living body detection method and device and electronic equipment
Zhou et al. A study on attention-based LSTM for abnormal behavior recognition with variable pooling
CN111199198A (en) Image target positioning method, image target positioning device and mobile robot
CN113128368B (en) Method, device and system for detecting character interaction relationship
CN112686122B (en) Human body and shadow detection method and device, electronic equipment and storage medium
US9501710B2 (en) Systems, methods, and media for identifying object characteristics based on fixation points
US20210097377A1 (en) Method and apparatus for image recognition
CN109461203B (en) Gesture three-dimensional image generation method and device, computer equipment and storage medium
CN110633666A (en) Gesture track recognition method based on finger color patches
CN112348112B (en) Training method and training device for image recognition model and terminal equipment
Shamalik et al. Real time human gesture recognition: methods, datasets and strategies
Jin et al. Visual detection of tobacco packaging film based on apparent features
Jadhav et al. GoogLeNet application towards gesture recognition for ASL character identification
Shivashankara et al. An american sign language recognition system using bounding box and palm FEATURES extraction techniques

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information
CB03 Change of inventor or designer information

Inventor after: Chu Yiwen

Inventor after: Chen Junyi

Inventor before: Chen Junyi

Inventor before: Chu Yiwen

GR01 Patent grant
GR01 Patent grant