WO2022100470A1 - Systems and methods for target detection

Publication number: WO2022100470A1
Application number: PCT/CN2021/127860
Authority: WO
Inventors: Jinpeng SU
Original assignee: Zhejiang Dahua Technology Co., Ltd.
Priority date: 2020-11-13
Filing date: 2021-11-01
Publication date: 2022-05-19
Also published as: CN112541395A

Systems and methods for target detection are provided. The system may obtain a visible light image. The system may further obtain a target detection result by performing a target detection on the visible light image based on a target detection model. The performing a target detection may include detecting one or more objects in the visible light image according to one or more detection boxes each of which corresponds to one of convolution layers in the target detection model. The one or more detection boxes may be determined according to a predetermined manner.

SYSTEMS AND METHODS FOR TARGET DETECTION

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority of Chinese Patent Application No. 202011272920.0 filed on November 13, 2020, the contents of which are hereby incorporated by reference.

TECHNICAL FIELD

The present disclosure generally relates to image processing, and in particular, to systems and methods for target detection.

BACKGROUND

Target detection is an important, highly challenging, and practical task in the field of computer vision. Target detection may be regarded as a combination of image classification and image positioning. For example, for a given image, a target detection system needs to identify a target in the image and determine a position of the target. Since the count of targets in the image is uncertain, and the precise position(s) of the target(s) need to be determined, target detection is more complicated than image classification. Target detection may be used in unmanned driving, region monitoring, and other fields, which have high requirements for the accuracy and reliability of target detection and monitoring.

Therefore, the present disclosure provides methods and systems for target detection to increase the accuracy and reliability of the target detection.

SUMMARY

An aspect of the present disclosure relates to a method for target detection. The method may include obtaining a visible light image. The method may further include obtaining a target detection result by performing a target detection on the visible light image based on a target detection model. The performing a target detection may include detecting one or more objects in the visible light image according to one or more detection boxes each of which corresponds to one of convolution layers in the target detection model. The one or more detection boxes may be determined according to a predetermined manner.

In some embodiments, the detection box may be determined according to a first process. The first process may include obtaining a plurality of candidate detection boxes; and determining the one or more detection boxes by adjusting a portion of the plurality of candidate detection boxes according to the predetermined manner based on sizes of feature maps each of which corresponds to one of the convolution layers in the target detection model.

In some embodiments, the target detection model may be obtained according to a second process. The second process may include obtaining a plurality of training samples, each training sample among the plurality of training samples including a sample image and a label corresponding to the sample image; and obtaining the target detection model by training an initial target detection model based on the plurality of training samples. The training an initial target detection model may include obtaining a plurality of candidate detection boxes; determining one or more sample detection boxes by adjusting a portion of the plurality of candidate detection boxes according to the predetermined manner based on sizes of feature maps each of which corresponds to one of the convolution layers in the target detection model; and modifying an identification ratio between positive and negative samples of the one or more sample detection boxes based on a loss function to reduce a difference between a detection result of the one or more detection boxes and the label.

In some embodiments, the performing a target detection on the visible light image based on a target detection model may include obtaining a first image by processing the visible light image based on an image processing model; wherein the image processing model is configured to enhance the visible light image; and performing the target detection by inputting the first image into the target detection model.

In some embodiments, the performing a target detection on the visible light image based on a target detection model may include obtaining a second image by processing the visible light image based on an image enhancement algorithm; and performing the target detection by inputting the second image into the target detection model.

In some embodiments, the method may further include obtaining an infrared image corresponding to the visible light image; determining, based on a first detection result of the visible light image and the infrared image, whether a region corresponding to the first detection result of the visible light image satisfies a condition; in response to that the region corresponding to the first detection result of the visible light image satisfies the condition, obtaining a second detection result of the infrared image by performing the target detection on the infrared image; and determining the target detection result based on the first detection result of the visible light image and the second detection result of the infrared image.

In some embodiments, the determining, based on a first detection result of the visible light image and the infrared image, whether a region corresponding to the first detection result of the visible light image satisfies a preset condition may include matching the first detection result of the visible light image to the infrared image; determining whether a temperature of the region is changed in the matched infrared image; and in response to that the temperature of the region is changed in the matched infrared image, determining that the region satisfies the condition.

In some embodiments, the method may further include determining, based on the target detection result, a target tracking box that identifies an object in the target detection result; and tracking the object in the target detection result based on the target tracking box.

In some embodiments, the determining, based on the target detection result, a target tracking box that identifies an object in the target detection result may include obtaining, based on the target detection result, a detection box set that identifies the object in the target detection result; wherein the detection box set includes a first detection box set, a second detection box set, and a third detection box set, the first detection box set being marked with a penalty coefficient, the second detection box set being marked with no penalty coefficient; obtaining a target detection box set by screening the detection box set; and determining the target tracking box based on feature information of one or more detection boxes in the target detection box set.

In some embodiments, the obtaining a target detection box set by screening the detection box set may include obtaining one or more confidence levels each of which corresponds to one of one or more detection boxes in the detection box set; and determining the target detection box set by screening, based on the one or more confidence levels, the one or more detection boxes in the detection box set.
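Merely by way of illustration, the confidence-based screening described above may be sketched as follows. The `DetectionBox` type, the `screen_by_confidence` function, and the 0.5 threshold are assumptions introduced for this sketch; the present disclosure does not specify a particular threshold or data structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectionBox:
    # Corner coordinates (x1, y1, x2, y2) in pixels (hypothetical layout).
    x1: float
    y1: float
    x2: float
    y2: float
    confidence: float  # confidence level associated with this box


def screen_by_confidence(boxes, min_confidence=0.5):
    """Keep only the boxes whose confidence level reaches min_confidence.

    The default of 0.5 is an illustrative assumption, not a disclosed value.
    """
    return [box for box in boxes if box.confidence >= min_confidence]


detection_box_set = [
    DetectionBox(10, 10, 50, 80, 0.92),
    DetectionBox(12, 11, 52, 79, 0.40),
    DetectionBox(200, 40, 260, 120, 0.75),
]
target_detection_box_set = screen_by_confidence(detection_box_set)
kept_confidences = [box.confidence for box in target_detection_box_set]
print(kept_confidences)  # [0.92, 0.75]
```

A fixed threshold is only one possible screening rule; other rules (e.g., keeping a specified count of top-ranked boxes) would fit the same description.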

In some embodiments, the determining the target tracking box based on feature information of one or more detection boxes in the target detection box set may include obtaining the one or more detection boxes in the target detection box set and feature information of the target detection result; determining one or more similarity degrees based on the feature information of the one or more detection boxes in the target detection box set and the feature information of the target detection result; and determining the target tracking box based on the one or more similarity degrees and the one or more confidence levels corresponding to the one or more detection boxes in the detection box set.
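Merely by way of illustration, one possible way to combine similarity degrees and confidence levels is sketched below. The use of cosine similarity as the similarity degree, the product of similarity and confidence as the combined score, and the function names are all assumptions made for this sketch; the present disclosure does not prescribe them.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector carries no direction to compare
    return dot / (norm_a * norm_b)


def select_tracking_box(candidates, target_feature):
    """Pick the candidate with the highest similarity * confidence score.

    candidates: list of (box_id, feature_vector, confidence) tuples.
    The product rule is an illustrative assumption.
    """
    best_id, best_score = None, float("-inf")
    for box_id, feature, confidence in candidates:
        score = cosine_similarity(feature, target_feature) * confidence
        if score > best_score:
            best_id, best_score = box_id, score
    return best_id, best_score


target_feature = [1.0, 0.0]
candidates = [
    ("a", [1.0, 0.0], 0.6),  # same direction as the target, moderate confidence
    ("b", [0.0, 1.0], 0.9),  # orthogonal to the target, high confidence
    ("c", [1.0, 1.0], 0.8),  # partially similar
]
tracking_box_id, tracking_score = select_tracking_box(candidates, target_feature)
print(tracking_box_id)  # a
```

In this sketch, box "b" has the highest confidence level but is not selected because its features do not resemble the target, which shows why both quantities are considered together.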

Another aspect of the present disclosure relates to a system for target detection. The system may include at least one storage device and at least one processor. The at least one storage device may include a set of instructions. The at least one processor may be in communication with the at least one storage device. When executing the set of instructions, the at least one processor may be directed to perform operations. The operations may include obtaining a visible light image. The operations may further include obtaining a target detection result by performing a target detection on the visible light image based on a target detection model. The performing a target detection may include detecting one or more objects in the visible light image according to one or more detection boxes each of which corresponds to one of convolution layers in the target detection model. The one or more detection boxes may be determined according to a predetermined manner.

In some embodiments, to perform a target detection on the visible light image based on a target detection model, the at least one processor may be directed to perform operations including obtaining a first image by processing the visible light image based on an image processing model; wherein the image processing model is configured to enhance the visible light image; and performing the target detection by inputting the first image into the target detection model.

In some embodiments, to perform a target detection on the visible light image based on a target detection model, the at least one processor may be directed to perform operations including obtaining a second image by processing the visible light image based on an image enhancement algorithm; and performing the target detection by inputting the second image into the target detection model.

In some embodiments, the operations may further include obtaining an infrared image corresponding to the visible light image; determining, based on a first detection result of the visible light image and the infrared image, whether a region corresponding to the first detection result of the visible light image satisfies a condition; in response to that the region corresponding to the first detection result of the visible light image satisfies the condition, obtaining a second detection result of the infrared image by performing the target detection on the infrared image; and determining the target detection result based on the first detection result of the visible light image and the second detection result of the infrared image.

In some embodiments, to determine, based on a first detection result of the visible light image and the infrared image, whether a region corresponding to the first detection result of the visible light image satisfies a preset condition, the at least one processor may be directed to perform operations including matching the first detection result of the visible light image to the infrared image; determining whether a temperature of the region is changed in the matched infrared image; and in response to that the temperature of the region is changed in the matched infrared image, determining that the region satisfies the condition.

In some embodiments, the operations may further include determining, based on the target detection result, a target tracking box that identifies an object in the target detection result; and tracking the object in the target detection result based on the target tracking box.

In some embodiments, to determine, based on the target detection result, a target tracking box that identifies an object in the target detection result, the at least one processor may be directed to perform operations including obtaining, based on the target detection result, a detection box set that identifies the object in the target detection result; wherein the detection box set includes a first detection box set, a second detection box set, and a third detection box set, the first detection box set being marked with a penalty coefficient, the second detection box set being marked with no penalty coefficient; obtaining a target detection box set by screening the detection box set; and determining the target tracking box based on feature information of one or more detection boxes in the target detection box set.

In some embodiments, to obtain a target detection box set by screening the detection box set, the at least one processor may be directed to perform operations including obtaining one or more confidence levels each of which corresponds to one of one or more detection boxes in the detection box set; and determining the target detection box set by screening, based on the one or more confidence levels, the one or more detection boxes in the detection box set.

In some embodiments, to determine the target tracking box based on feature information of one or more detection boxes in the target detection box set, the at least one processor may be directed to perform operations including obtaining the one or more detection boxes in the target detection box set and feature information of the target detection result; determining one or more similarity degrees based on the feature information of the one or more detection boxes in the target detection box set and the feature information of the target detection result; and determining the target tracking box based on the one or more similarity degrees and the one or more confidence levels corresponding to the one or more detection boxes in the detection box set.

Still another aspect of the present disclosure relates to a non-transitory computer readable medium. The non-transitory computer readable medium may include executable instructions that, when executed by at least one processor, direct the at least one processor to perform a method for target detection. The method may include obtaining a visible light image. The method may further include obtaining a target detection result by performing a target detection on the visible light image based on a target detection model. The performing a target detection may include detecting one or more objects in the visible light image according to one or more detection boxes each of which corresponds to one of convolution layers in the target detection model. The one or more detection boxes may be determined according to a predetermined manner.

Still another aspect of the present disclosure relates to a method for target detection and tracking. The method may include obtaining image information of a target monitoring region, wherein the image information includes a target visible light image and a target infrared image of the target monitoring region, and the target infrared image corresponds to the target visible light image. The method may include obtaining a first target detection result of the target visible light image by performing a first target detection on the target visible light image. The method may also include obtaining a second target detection result of the target infrared image by performing a second target detection on the target infrared image in response to that the target infrared image satisfies a first condition. The method may include obtaining a third target detection result of the target monitoring region based on the first target detection result and the second target detection result, wherein the third target detection result includes position information and type information of an object in the target monitoring region. The method may further include tracking the object in the third target detection result.

In some embodiments, the performing a first target detection on the target visible light image may include obtaining an enhancement region image by performing an enhancement operation on the target visible light image, and obtaining a first target region image by performing a resolution adjustment operation on the enhancement region image; obtaining a second target region image output by a target image processing model by inputting the target visible light image into the target image processing model; obtaining a first detection result and a second detection result by performing the first target detection on the first target region image and the second target region image, respectively; wherein each of the first detection result and the second detection result includes a first type and coordinates of the object in the target monitoring region; and determining the first target detection result of the target visible light image based on the first detection result and the second detection result.

In some embodiments, the obtaining a first detection result and a second detection result by performing the first target detection on the first target region image and the second target region image, respectively, may include obtaining the first detection result by inputting the first target region image to a target first detection model; wherein the target first detection model is configured to obtain a position of a detection box for identifying the detected object in images input into the target first detection model; and obtaining the second detection result by inputting the second target region image to the target first detection model.

In some embodiments, before the inputting the first target region image to a target first detection model, the method may include obtaining the target first detection model by training a first detection model. The obtaining the target first detection model by training a first detection model may include obtaining a prediction ratio of length to width of each pixel point of the object in the first target region image and the second target region image by performing a cluster operation on a ratio of length to width of a target label of each of the images input into the target first detection model; determining a range of a prediction box on each convolution layer based on a scale formula and the prediction ratio of length to width of each pixel point of the object, wherein the prediction box is configured to determine the position of the detection box for identifying the detected object; and modifying an identification ratio between positive and negative samples so that the prediction box corresponds to the object, and determining the first detection model after the prediction box is modified as the target first detection model.

In some embodiments, the scale formula may include

$$S_k = S_{\min} + \frac{S_{\max} - S_{\min}}{W_{\max}} \times \frac{W_{m+1-k}}{W_1}, \quad k \in [1, m],$$

where $m$ is a count of the convolution layers, $W$ is a length or width of the object in the image, $S_{\max} = 0.95$, $S_{\min} = 0.05$, $W_{\max} = 39$, $W_1 = 1$, $S_{\max}$ is a maximum area of a preset target label, $S_{\min}$ is a minimum area of the preset target label, and $S_k$ is an area of the detected prediction box.
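Merely by way of illustration, the scale formula may be evaluated as sketched below. The sketch assumes that the last factor of the formula reads W_{m+1-k} / W_1, i.e., that W carries a per-layer index. The list of W values is hypothetical and chosen only so that W_1 = 1 and every value stays below W_max = 39; the function name is likewise an assumption.

```python
def prediction_box_scales(w_values, s_min=0.05, s_max=0.95, w_max=39):
    """Evaluate S_k = S_min + (S_max - S_min) / W_max * (W_{m+1-k} / W_1).

    w_values[i - 1] holds W_i, so w_values[0] is W_1.
    Returns [S_1, ..., S_m].
    """
    m = len(w_values)
    w_1 = w_values[0]
    scales = []
    for k in range(1, m + 1):
        w = w_values[m - k]  # W_{m+1-k} expressed as a 0-based index
        scales.append(s_min + (s_max - s_min) / w_max * (w / w_1))
    return scales


# Hypothetical per-layer W values (not taken from the disclosure).
w_values = [1, 3, 5, 10, 19, 38]
scales = prediction_box_scales(w_values)
for k, s_k in enumerate(scales, start=1):
    print(f"S_{k} = {s_k:.4f}")
```

Under these assumed values, every S_k falls between S_min and S_max, and S_k decreases as k increases.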

In some embodiments, the determining the first target detection result of the target visible light image based on the first detection result and the second detection result may include, when the first detection result represents a first group of object detection boxes detected in the first target region image, and the second detection result represents a second group of object detection boxes detected in the second target region image, obtaining a third group of object detection boxes by combining the first group of object detection boxes and the second group of object detection boxes, wherein the first target detection result includes the third group of object detection boxes.

In some embodiments, before the obtaining image information of a target monitoring region, the method may further include obtaining a training sample image set and a sample enhancement image set each of which corresponds to a training sample image in the training sample image set, wherein the sample enhancement image in the sample enhancement image set is obtained by performing an image enhancement on a region where each sample object in the training sample image corresponding to the training sample image set is located; training an image processing model using the training sample image set, wherein the image processing model includes a plurality of convolutional layers, parameters in the image processing model are initialized by randomly sampling from a Gaussian distribution with a mean of 0 and a standard deviation of 1, the image processing model is configured to identify the sample object in the training sample image in the training sample image set, perform the image enhancement on the region where the sample object is located, and output the image obtained by performing the image enhancement on the region where the sample object is located; and when a value of a loss function between the image output by the image processing model and the sample enhancement image in the sample enhancement image set satisfies a second condition, determining the image processing model as the target image processing model.
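Merely by way of illustration, initializing the parameters of a convolutional layer by randomly sampling from a Gaussian distribution with a mean of 0 and a standard deviation of 1 may be sketched as follows. The layer dimensions, the nested-list weight layout, the seed, and the function name are assumptions made for this sketch.

```python
import random


def init_conv_weights(out_channels, in_channels, kernel_size, seed=None):
    """Return weights[out][in][row][col], each drawn from N(mean=0, std=1)."""
    rng = random.Random(seed)  # a private generator keeps the sketch reproducible
    return [
        [
            [
                [rng.gauss(0.0, 1.0) for _ in range(kernel_size)]
                for _ in range(kernel_size)
            ]
            for _ in range(in_channels)
        ]
        for _ in range(out_channels)
    ]


# Hypothetical layer: 16 output channels, 16 input channels, 3x3 kernels.
weights = init_conv_weights(16, 16, 3, seed=0)

# Flatten the weights and check the sample statistics.
flat = [w for out_c in weights for in_c in out_c for row in in_c for w in row]
mean = sum(flat) / len(flat)
std = (sum((w - mean) ** 2 for w in flat) / len(flat)) ** 0.5
print(len(flat), round(mean, 2), round(std, 2))
```

With 2304 sampled parameters, the sample mean and standard deviation are expected to lie close to 0 and 1, respectively.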

In some embodiments, before the determining the image processing model as the target image processing model, the method may further include determining the value of the loss function between the image output by the image processing model and the sample enhancement image in the sample enhancement image set; and when the value of the loss function is less than a predetermined threshold, determining that the value of the loss function satisfies the second condition.

In some embodiments, the obtaining a second target detection result of the target infrared image by performing a second target detection on the target infrared image in response to that the target infrared image satisfies a first condition may include obtaining a sixth region image in the target infrared image based on the first target detection result, wherein the sixth region image is configured to represent a temperature of a target region in the target infrared image, and the target region is configured to represent a region corresponding to the first target detection result; determining whether a temperature of the object in the target region is changed based on the sixth region image; and in response to determining that the temperature of the object in the target region is changed, determining that the target infrared image satisfies the first condition, and obtaining the second target detection result by performing the second target detection on the sixth region image.

In some embodiments, the determining whether a temperature of the object in the target region is changed based on the sixth region image may include, when a temperature of the target region in the target infrared image represented by the sixth region image is different from an initial temperature of the target region, determining that the temperature of the object in the target region is changed.
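Merely by way of illustration, the temperature comparison described above may be sketched as follows. Representing the infrared image as a two-dimensional grid of temperatures, summarizing the target region by its mean temperature, and using a tolerance of 0.5 degrees are assumptions made for this sketch; the present disclosure only states that the region temperature is compared with an initial temperature.

```python
def region_mean_temperature(thermal_map, box):
    """Mean temperature inside box = (x1, y1, x2, y2), end-exclusive."""
    x1, y1, x2, y2 = box
    values = [thermal_map[y][x] for y in range(y1, y2) for x in range(x1, x2)]
    return sum(values) / len(values)


def temperature_changed(thermal_map, box, initial_temperature, tolerance=0.5):
    """True if the region temperature differs from the initial temperature.

    The tolerance absorbs sensor noise and is an illustrative assumption.
    """
    current = region_mean_temperature(thermal_map, box)
    return abs(current - initial_temperature) > tolerance


# Hypothetical 4x4 thermal map in degrees Celsius with a warm 2x2 block.
thermal_map = [
    [20.0, 20.0, 20.0, 20.0],
    [20.0, 36.0, 36.0, 20.0],
    [20.0, 36.0, 36.0, 20.0],
    [20.0, 20.0, 20.0, 20.0],
]
warm_region_changed = temperature_changed(thermal_map, (1, 1, 3, 3), 20.0)
cold_region_changed = temperature_changed(thermal_map, (0, 0, 1, 1), 20.0)
print(warm_region_changed, cold_region_changed)  # True False
```

Only a region for which the check returns True would then be passed to the second target detection.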

In some embodiments, the obtaining the second target detection result by performing the second target detection on the sixth region image may include obtaining a second type and coordinates of the object by performing an object type detection on the object the temperature of which is changed in the sixth region image, wherein the second target detection result includes the second type and the coordinates of the object.

In some embodiments, the obtaining a third target detection result of the target monitoring region based on the first target detection result and the second target detection result may include, when the first target detection result includes the third group of object detection boxes and the type of the object in the third group of object detection boxes, and a first object detection box in the third group of object detection boxes corresponds to the object the temperature of which is changed, determining whether the first type of the object in the first group of object detection boxes is the same as the second type of the object the temperature of which is changed; wherein the third group of object detection boxes is obtained by performing the first target detection on the target visible light image, and the second type is obtained by performing the second target detection of the target infrared image; in response to that the first type is the same as the second type, determining a first matching result, wherein the first matching result is configured to indicate that the first object detection box matches the second target detection result, and the third target detection result includes the first matching result; or in response to that the first type is different from the second type, determining a second matching result, wherein the second matching result is configured to indicate that the first object detection box does not match the second target detection result, and the third target detection result includes the second matching result.
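Merely by way of illustration, the comparison between the first type (from the visible light image) and the second type (from the infrared image) may be sketched as follows. The dictionary layout of the matching result, the type labels, and the function name are assumptions made for this sketch.

```python
def match_detection(first_object_box, first_type, second_type):
    """Compare the visible-light type with the infrared type for one box.

    Returns a first matching result (matched=True) when the types agree,
    or a second matching result (matched=False) when they differ.
    """
    return {
        "box": first_object_box,
        "visible_type": first_type,
        "infrared_type": second_type,
        "matched": first_type == second_type,
    }


# Hypothetical box (x1, y1, x2, y2) and type labels.
first_matching_result = match_detection((10, 10, 50, 80), "person", "person")
second_matching_result = match_detection((10, 10, 50, 80), "person", "vehicle")
print(first_matching_result["matched"], second_matching_result["matched"])  # True False
```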

In some embodiments, the tracking the object in the third target detection result may include obtaining the third target detection result of the target monitoring region; obtaining a target detection box configured to identify the object in the target monitoring region based on the third target detection result; tracking the object in the target monitoring region based on the target detection box, and recording a motion trajectory of the object; determining whether the motion trajectory of the object includes an anomaly based on the motion trajectory of the object; and issuing a prompt regarding the anomaly when the motion trajectory of the object includes the anomaly.

In some embodiments, the obtaining a target detection box configured to identify the object in the target monitoring region based on the third target detection result may include obtaining, based on the third target detection result, a detection box set that identifies the object, wherein the detection box set includes a first detection box set, a second detection box set, and a third detection box set, the first detection box set being marked with a penalty coefficient, the second detection box set being marked with no penalty coefficient; obtaining a target detection box by performing a target operation on the detection box set; determining a final confidence level of the target detection box by performing feature extraction on the target detection box; designating the target detection box with a highest final confidence level as a target tracking box; and tracking the object in the target monitoring region based on the target tracking box.

In some embodiments, the obtaining a target detection box by performing a target operation on the detection box set may include arranging the first detection box set, the second detection box set, and the third detection box set in descending order of an initial confidence level; selecting a specified count of detection boxes that are arranged at a top in the first detection box set, the second detection box set, and the third detection box set, and removing repeated detection boxes therein; obtaining a union of remaining selected detection boxes, and designating the union as a fourth detection box set; and obtaining the target detection box by screening the fourth detection box set based on a predetermined rule.
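Merely by way of illustration, the sorting, top-count selection, duplicate removal, and union operations described above may be sketched as follows. Treating two detection boxes as repeated only when their coordinates are identical, keeping the higher initial confidence level of a repeated box, and the function name are assumptions made for this sketch. The final screening based on the predetermined rule is not shown.

```python
def build_fourth_detection_box_set(first_set, second_set, third_set, top_n):
    """Merge the top_n boxes of each set into one duplicate-free set.

    Each input set is a list of (box, initial_confidence) pairs, where box
    is an (x1, y1, x2, y2) tuple. Returns pairs sorted by confidence,
    highest first.
    """
    selected = []
    for box_set in (first_set, second_set, third_set):
        # Arrange in descending order of initial confidence, keep the top_n.
        ranked = sorted(box_set, key=lambda item: item[1], reverse=True)
        selected.extend(ranked[:top_n])

    # Remove repeated boxes, keeping the highest confidence seen for each.
    best = {}
    for box, confidence in selected:
        if box not in best or confidence > best[box]:
            best[box] = confidence

    # The union of the remaining boxes forms the fourth detection box set.
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


first_set = [((0, 0, 10, 10), 0.9), ((5, 5, 15, 15), 0.3), ((1, 1, 9, 9), 0.6)]
second_set = [((0, 0, 10, 10), 0.8), ((20, 20, 30, 30), 0.7)]
third_set = [((40, 40, 50, 50), 0.5)]

fourth_set = build_fourth_detection_box_set(first_set, second_set, third_set, top_n=2)
for box, confidence in fourth_set:
    print(box, confidence)
```

In this example, the box (5, 5, 15, 15) is dropped by the top-count selection, and the box (0, 0, 10, 10), which appears in two sets, is kept once.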

Still another aspect of the present disclosure relates to a device for target detection and tracking. The device for target detection and tracking may include an image acquisition module, a first target processing module, a second target processing module, and a third target processing module. The image acquisition module may be configured to obtain image information of a target monitoring region, wherein the image information includes a target visible light image and a target infrared image of the target monitoring region, and the target infrared image corresponds to the target visible light image. The first target processing module may be configured to obtain a first target detection result of the target visible light image by performing a first target detection on the target visible light image. The second target processing module may be configured to obtain a second target detection result of the target infrared image by performing a second target detection on the target infrared image in response to that the target infrared image satisfies a first condition. The third target processing module may be configured to obtain a third target detection result of the target monitoring region based on the first target detection result and the second target detection result, wherein the third target detection result includes position information and type information of an object in the target monitoring region.

Still another aspect of the present disclosure relates to a storage medium. The storage medium may store computer instructions that, when read by a computer, direct the computer to perform the method for target detection and tracking.

Still another aspect of the present disclosure relates to an electronic device. The device may include at least one processor and at least one storage. The at least one storage may be configured to store computer instructions; and the at least one processor may be configured to execute at least a portion of the computer instructions to implement the method for target detection and tracking.

Additional features will be set forth in part in the description which follows, and in part will become apparent to those skilled in the art upon examination of the following and the accompanying drawings or may be learned by production or operation of the examples. The features of the present disclosure may be realized and attained by practice or use of various aspects of the methodologies, instrumentalities, and combinations set forth in the detailed examples discussed below.

BRIEF DESCRIPTION OF THE DRAWINGS

The present disclosure is further described in terms of exemplary embodiments. These exemplary embodiments are described in detail with reference to the drawings. These embodiments are non-limiting exemplary embodiments, in which like reference numerals represent similar structures throughout the several views of the drawings, and wherein:

FIG. 1 is a schematic diagram illustrating an exemplary detection system according to some embodiments of the present disclosure;

FIG. 2 is a block diagram illustrating an exemplary processing device according to some embodiments of the present disclosure;

FIG. 3 is a flowchart illustrating an exemplary process for target detection according to some embodiments of the present disclosure;

FIG. 4 is a flowchart illustrating an exemplary process for target detection according to some embodiments of the present disclosure;

FIG. 5 is a flowchart illustrating an exemplary process for determining a target detection result according to some embodiments of the present disclosure;

FIG. 6 is a flowchart illustrating an exemplary process for target tracking according to some embodiments of the present disclosure; and

FIG. 7 is a flowchart illustrating an exemplary process for tracking an object or alarming according to some embodiments of the present disclosure.

DETAILED DESCRIPTION

In the following detailed description, numerous specific details are set forth by way of examples in order to provide a thorough understanding of the relevant disclosure. However, it should be apparent to those skilled in the art that the present disclosure may be practiced without such details. In other instances, well-known methods, procedures, systems, components, and/or circuitry have been described at a relatively high level, without detail, in order to avoid unnecessarily obscuring aspects of the present disclosure. Various modifications to the disclosed embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the present disclosure. Thus, the present disclosure is not limited to the embodiments shown, but is to be accorded the widest scope consistent with the claims.

It will be understood that the terms “system, ” “engine, ” “unit, ” “module, ” and/or “block” used herein are one method to distinguish different components, elements, parts, sections, or assemblies of different levels in ascending order. However, the terms may be displaced by other expressions if they may achieve the same purpose.

Generally, the words “module, ” “unit, ” or “block” used herein, refer to logic embodied in hardware or firmware, or to a collection of software instructions. A module, a unit, or a block described herein may be implemented as software and/or hardware and may be stored in any type of non-transitory computer-readable medium or other storage devices. In some embodiments, a software module/unit/block may be compiled and linked into an executable program. It will be appreciated that software modules can be callable from other modules/units/blocks or from themselves, and/or may be invoked in response to detected events or interrupts. Software modules/units/blocks configured for execution on computing devices may be provided on a computer-readable medium, such as a compact disc, a digital video disc, a flash drive, a magnetic disc, or any other tangible medium, or as a digital download (and can be originally stored in a compressed or installable format that needs installation, decompression, or decryption prior to execution) . Such software code may be stored, partially or fully, on a storage device of the executing computing device, for execution by the computing device. Software instructions may be embedded in firmware, such as an EPROM. It will be further appreciated that hardware modules (or units or blocks) may be included in connected logic components, such as gates and flip-flops, and/or can be included in programmable units, such as programmable gate arrays or processors. The modules (or units or blocks) or computing device functionality described herein may be implemented as software modules (or units or blocks) , but may be represented in hardware or firmware. In general, the modules (or units or blocks) described herein refer to logical modules (or units or blocks) that may be combined with other modules (or units or blocks) or divided into sub-modules (or sub-units or sub-blocks) despite their physical organization or storage.

It will be understood that when a unit, an engine, a module, or a block is referred to as being “on, ” “connected to, ” or “coupled to” another unit, engine, module, or block, it may be directly on, connected or coupled to, or communicate with the other unit, engine, module, or block, or an intervening unit, engine, module, or block may be present, unless the context clearly indicates otherwise. As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items.

The terminology used herein is for the purposes of describing particular examples and embodiments only and is not intended to be limiting. As used herein, the singular forms “a, ” “an, ” and “the” may be intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “include” and/or “comprise, ” when used in this disclosure, specify the presence of integers, devices, behaviors, stated features, steps, elements, operations, and/or components, but do not exclude the presence or addition of one or more other integers, devices, behaviors, features, steps, elements, operations, components, and/or groups thereof.

In addition, it should be understood that in the description of the present disclosure, the terms “first, ” “second, ” or the like, are only used for the purpose of differentiation, and cannot be interpreted as indicating or implying relative importance, nor can be understood as indicating or implying the order.

The flowcharts used in the present disclosure illustrate operations that systems implement according to some embodiments of the present disclosure. It is to be expressly understood that the operations of the flowcharts may not be implemented in order. Instead, the operations may be implemented in an inverted order or simultaneously. Moreover, one or more other operations may be added to the flowcharts, and one or more operations may be removed from the flowcharts.

Target detection is an important part of image processing and computer vision, and can be widely used in various scenarios. For example, the target detection may be used in unmanned aerial vehicle (UAV) detection, video detection, unmanned driving, face detection, etc. Taking the UAV detection as an example, a UAV may be an unmanned aircraft operated by a radio remote control device and a self-provided program control device, or operated autonomously, completely or intermittently, by an onboard computer. The UAV may be well suited for use in a dangerous scenario or a scenario where the operation environment is harsh, which can greatly reduce the risk of injury to staff. There are some regions that people, vehicles, or other targets cannot approach. Since a camera in such a region cannot achieve full 360-degree three-dimensional panoramic monitoring, a UAV may be used for real-time aerial monitoring to improve operation efficiency and provide real-time warning, which is becoming more and more important. With the development of deep learning, more and more research combines UAVs with image recognition techniques, and UAVs will play an increasingly important role in society.

However, during detection, an object often occupies a small proportion of a detection region, and the performance of current target identification networks is not ideal. Once the object is identified incorrectly, serious consequences may follow. For example, in unmanned driving, if a pedestrian is detected as an inanimate object or is not detected at all, so that the vehicle fails to avoid the pedestrian or brake in time, the personal safety of the pedestrian may be seriously threatened. As another example, in the UAV detection, if the object is under-reported or misreported, great trouble may be brought to subsequent target tracking or alarming. Therefore, some embodiments of the present disclosure provide methods for target detection. When the object is detected by a target detection model, a detection box that matches a size of the object in an image may be used for detection and identification, which improves the detection accuracy in complex situations, for example, when the target is small or the background is complex.

It should be noted that the above examples are merely for illustration and are not intended to limit the application scenarios of the technical solution disclosed in the present disclosure. The technical solution disclosed in the present disclosure is described in detail below with reference to the drawings.

FIG. 1 is a schematic diagram illustrating an exemplary detection system 100 according to some embodiments of the present disclosure.

As shown in FIG. 1, the detection system 100 may include a server 110, a detection device 120, a storage device 130, and a network 140.

The server 110 may be configured to manage resources and process data and/or information from at least one component or external data source of the detection system 100. The server 110 may execute program instructions based on the data, information, and/or processing results to perform one or more functions described in the present disclosure. In some embodiments, the server 110 may be a single server or a server group. The server group may be centralized or distributed (e.g., the server 110 may be a distributed system). The server group may be dedicated or provided by other devices or systems at the same time. In some embodiments, the server 110 may be local or remote. In some embodiments, the server 110 may be implemented on a cloud platform or provided in a virtual manner. Merely by way of example, the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an inter-cloud, a multi-cloud, or the like, or any combination thereof.

In some embodiments, the server 110 may include a processing device 112. The processing device 112 may process data and/or information obtained from other devices or system components. The processing device 112 may process the information, data, and/or processed results to perform one or more functions described in the present disclosure. For example, the processing device 112 may process image data 150 (e.g., an image, a video) based on a target detection model to obtain a target detection result 160. The image data 150 may include various types of images such as visible light images, infrared images, etc. In some embodiments, the image data 150 may be obtained by the detection device 120. In some embodiments, the processing device 112 may include one or more processing devices (e.g., single-core processing device(s) or multi-core processor(s)). In some embodiments, the processing device 112 may include a central processing unit (CPU), an application-specific instruction-set processor (ASIP), a graphics processing unit (GPU), a physics processing unit (PPU), a digital signal processor (DSP), a field programmable gate array (FPGA), a programmable logic device (PLD), a controller, a microcontroller, a reduced instruction set computer (RISC), a microprocessor, or the like, or combinations thereof. In some embodiments, the processing device 112 may be integrated into the server 110 or the detection device 120.

The detection device 120 may include one or more devices with functions of capturing images or videos. The detection device 120 may be configured to monitor a region (e.g., a community, a school, a shopping mall, a parking lot, etc. ) . For example, the detection device 120 may be configured to obtain the image data 150, and monitor the region based on the obtained image data 150. As a further example, when the image or video obtained by the detection device 120 involves an anomaly occurring in a scene represented in the image or video, the detection device 120 may detect the anomaly, and provide feedback (e.g., call the police, give an alarm) related to the anomaly in time. The detection device 120 may include a visible light imaging device, an infrared imaging device, or the like, or any combination thereof. Exemplary visible light imaging devices may include a visible light camera, a visible light video recorder, a visible light video sensor, or the like, or any combination thereof. Exemplary infrared imaging devices may include an infrared camera, an infrared video recorder, an infrared video sensor, or the like, or any combination thereof. In some embodiments, the detection device 120 may be mounted on any mobile or fixed platform. For example, the platform may include a UAV, a balloon, a vehicle, a building, a high tower, and other platforms.

The storage device 130 may be configured to store data and/or instructions. The storage device 130 may include one or more storage components, and each storage component may be an independent device or a part of other devices. For example, the storage device 130 may be integrated in the server 110 or the detection device 120. In some embodiments, the storage device 130 may include a random access memory (RAM), a read-only memory (ROM), a mass storage, a removable storage, a volatile read-and-write memory, or the like, or any combination thereof. Exemplary mass storages may include magnetic disks, optical disks, solid-state disks, etc. The data refers to a digital representation of information, which can include various types such as binary data, text data, image data, video data, etc. Instructions refer to programs that can control devices or components to perform functions. In some embodiments, the storage device 130 may be implemented on a cloud platform.

In some embodiments, the network 140 may be configured to connect to each component of the detection system 100 and/or connect the detection system 100 and an external resource portion. The network 140 may be configured to implement communication between components of the detection system 100 and/or between each component of the detection system 100 and an external resource portion to facilitate the exchange of information and/or data. In some embodiments, the network 140 may include a wired network, a wireless network, or a combination thereof. For example, the network 140 may include a cable network, a fiber network, a telecommunication network, Internet, a local area network (LAN) , a wide area network (WAN) , a wireless local area network (WLAN) , a metropolitan area network (MAN) , a public switched telephone network (PSTN) , a Bluetooth network, Zigbee, near field communication (NFC) , an intra-device bus, an intra-device line, a cable connection, or the like, or any combination thereof. A network connection between the components may be in one of the manners, or in multiple manners. In some embodiments, the network 140 may include a point-to-point topology structure, a shared topology structure, a centralized topology structure, or the like, or any combination thereof. In some embodiments, the network 140 may include one or more network access points. For example, the network 140 may include a wired or wireless network access point, such as base station and/or network exchange points 140-1, 140-2, etc. One or more components of the detection system 100 may be connected to the network 140 to exchange data and/or information through these network access points.

The server 110 may communicate with the processing device 112, the detection device 120, and the storage device 130 via the network 140 to obtain data and/or information. The server 110 may execute program instructions based on the obtained data, information, and/or processing results to achieve the acquisition of the target detection result. The storage device 130 may store data and/or information in operations of the method for target detection. An information transmission relationship between the devices is provided for the purposes of illustration, and not intended to limit the scope of the present disclosure.

FIG. 2 is a block diagram illustrating an exemplary processing device according to some embodiments of the present disclosure. The processing device 112 may include an image acquisition module 210 and a target detection module 220.

The image acquisition module 210 may be configured to obtain a visible light image. The visible light image may refer to an image that is captured by a visible light imaging device using visible light emitted from the object. More descriptions regarding the obtaining of the visible light image may be found elsewhere in the present disclosure, for example, operation 310 in FIG. 3 and relevant descriptions thereof. In some embodiments, the image acquisition module 210 may be configured to obtain an infrared image. More descriptions regarding the obtaining of the infrared image may be found elsewhere in the present disclosure, for example, FIG. 4 and relevant descriptions thereof.

The target detection module 220 may be configured to obtain a target detection result by performing a target detection on the visible light image based on a target detection model. The target detection model refers to a mathematical model for identifying object(s) in an image. The target detection result refers to the result that the target detection model produces by identifying the object(s) in the image after the image is input into the target detection model. More descriptions regarding the obtaining of the target detection result may be found elsewhere in the present disclosure, for example, operation 320 in FIG. 3 and relevant descriptions thereof.

The modules in the processing device 112 may be connected to or communicate with each other via a wired connection or a wireless connection. The wired connection may include a metal cable, an optical cable, a hybrid cable, or the like, or any combination thereof. The wireless connection may include a Local Area Network (LAN), a Wide Area Network (WAN), a Bluetooth, a ZigBee, a Near Field Communication (NFC), or the like, or any combination thereof. In some embodiments, two or more of the modules may be combined as a single module, and any one of the modules may be divided into two or more units. For example, the image acquisition module 210 may include a visible light image acquisition module and an infrared image acquisition module. In some embodiments, the processing device 112 may include one or more additional modules. For example, the processing device 112 may also include a transmission module configured to transmit signals (e.g., electrical signals, electromagnetic signals) to one or more components (e.g., the detection device 120, the storage device 130) of the detection system 100. As another example, the processing device 112 may include a storage module (not shown) used to store information and/or data (e.g., the visible light image, the infrared image, the target detection result, etc.).

FIG. 3 is a flowchart illustrating an exemplary process for target detection according to some embodiments of the present disclosure. In some embodiments, process 300 may be executed by the detection system 100. For example, the process 300 may be implemented as a set of instructions (e.g., an application) stored in a storage device (e.g., the storage device 130) . In some embodiments, the processing device 112 (e.g., one or more modules illustrated in FIG. 2) may execute the set of instructions and may accordingly be directed to perform the process 300. The operations of the illustrated process presented below are intended to be illustrative. In some embodiments, the process 300 may be accomplished with one or more additional operations not described and/or without one or more of the operations discussed. Additionally, the order of the operations of process 300 illustrated in FIG. 3 and described below is not intended to be limiting.

In 310, the processing device 112 (e.g., the image acquisition module 210) may obtain a visible light image.

The visible light image may refer to an image that is captured by a visible light imaging device using visible light emitted from the object. Exemplary visible light imaging devices may include a visible light camera, a visible light video recorder, a visible light video sensor, or the like, or any combination thereof. In some embodiments, the visible light image may include a color image, a grayscale image, or the like, or any combination thereof. In some embodiments, the visible light image may represent a scene in a physical region (i.e., a region) to be detected (or a portion of the region to be detected) . For example, a detection device may be fixed to monitor a region (e.g., a community, a school, a shopping mall, a parking lot, etc. ) , and an image obtained by the detection device may correspond to the region (or a portion of the region) and represent a scene in the region. As another example, a detection device may be disposed on a movable device (e.g., a UAV) . The movable device may be caused to move in the region or close to the region, and the image obtained by the detection device may correspond to the region (or a portion of the region) .

In some embodiments, the visible light image may be directly obtained by a visible light imaging device (e.g., the detection device 120) . For example, a visible light camera may capture a visible light image, and the processing device 112 may obtain the visible light image through the network 140.

In some embodiments, the visible light image may be obtained by extracting an image frame from a video captured by the imaging device. For example, a visible light video recorder may capture a visible light video, and the processing device 112 may obtain one or more frames from the visible light video.

In some embodiments, the visible light image may also be obtained by reading from a storage device (e.g., the storage device 130) or a database, or by calling an interface related to image data.

In 320, the processing device 112 (e.g., the target detection module 220) may obtain a target detection result based on the visible light image and a target detection model.

The target detection model refers to a mathematical model for identifying object (s) in an image. A type of the image may include a visible image, an infrared image, etc. The object in the image may include a pedestrian, a vehicle, a building, a face, etc. The identification of an object in an image may include distinguishing the object from others (e.g., other objects and/or background) in the image and/or determining a characteristic (e.g., a position, a size, a type) of the object. The object may also be referred to as a target, and the “object” and “target” may be interchangeable in the present disclosure.

The target detection result may indicate a characteristic of an object represented in the image data. The characteristic of the object may include a type, a size, a position, etc., of the object in the visible light image. The type of the object may include the pedestrian, the vehicle, etc. In some embodiments, the target detection result may indicate whether an object in the visible light image and an object represented in other types of images (e.g., an infrared image) belong to a same type. In some embodiments, the size of an object may be denoted by a size of a target detection box for identifying the object. For example, the size of the target detection box for identifying the object in the visible light image may be 10×10, which indicates that an area of the object in the visible light image is 100 pixels. In some embodiments, the target detection box for identifying the object may include a bounding box that encloses the object. In some embodiments, the position of the object may be denoted as position coordinates of pixel point(s) that represent the object in the visible light image. In some embodiments, the position of the object may be denoted as a position of the target detection box for identifying the object. For example, the position of the object may be denoted as coordinates of the center of the target detection box, or coordinates of vertexes of the target detection box, etc.
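The two position conventions mentioned above (center coordinates and vertex coordinates) can be related by a short sketch. The class name `DetectionBox`, its field names, and the image coordinate convention (origin at the top-left, y increasing downward) are assumptions made for illustration and are not defined in the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectionBox:
    """Axis-aligned box described by its center and its width/height in pixels.

    Hypothetical helper for illustration; not part of the disclosure.
    """
    cx: float
    cy: float
    width: float
    height: float

    def area(self) -> float:
        """Area of the box in pixels."""
        return self.width * self.height

    def vertices(self) -> list[tuple[float, float]]:
        """Corners ordered top-left, top-right, bottom-right, bottom-left."""
        half_w, half_h = self.width / 2, self.height / 2
        return [
            (self.cx - half_w, self.cy - half_h),
            (self.cx + half_w, self.cy - half_h),
            (self.cx + half_w, self.cy + half_h),
            (self.cx - half_w, self.cy + half_h),
        ]


# A 10x10 target detection box centered at (50, 40).
box = DetectionBox(cx=50.0, cy=40.0, width=10.0, height=10.0)
print(box.area())       # 100.0, matching the 10x10 example above
print(box.vertices())
```

Either representation carries the same information, so a downstream tracking or alarm step may use whichever is more convenient.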

In some embodiments, the processing device 112 may input the visible light image into the target detection model for target detection, and then the target detection model may output the target detection result.

In some embodiments, the target detection model may include a single shot multibox detector (SSD) algorithm model, a region convolutional neural network (RCNN) , a fast RCNN algorithm model, a faster RCNN algorithm model, a “you only look once” (YOLO) algorithm model (e.g., YOLO v1, YOLO v2, YOLO v3, etc. ) , etc.

In some embodiments, the target detection model may include convolution layers. After the processing device 112 inputs the visible light image into the target detection model, the target detection model may detect the object (s) in the visible light image according to one or more detection boxes each of which corresponds to one of the convolution layers of the target detection model. The detection box (es) may be determined according to a predetermined manner.

The detection box may refer to a candidate detection box with a ratio of length to width that is preset and is configured to identify the object. The ratio of length to width may refer to a ratio between a length and a width of the detection box. For example, the one or more detection boxes may include a box with a size of 10×10, a box with a size of 10×8, a box with a size of 8×10, etc. In some embodiments, the detection box(es) may be configured to detect whether the visible light image includes the object(s) and/or where the detected object is located in the visible light image. For example, the one or more detection boxes may be configured to determine whether object(s) are in the image. As another example, the one or more detection boxes may be configured to determine position(s) of the object(s). When a detection box detects an object and locates the object, the detection box may also be referred to as the target detection box. The target detection box may be configured to mark and enclose the detected object.

The predetermined manner may be associated with the size of the one or more detection boxes. The predetermined manner may include a predetermined rule, a predetermined scale formula, etc. In some embodiments, one of the one or more detection boxes may be obtained by other manners, such as clustering, manual designation, etc.

In some embodiments, one of the detection boxes may be determined based on a candidate detection box according to a predetermined rule. For example, the detection box may be determined by adjusting the size of the candidate detection box according to the predetermined rule. The predetermined rule may indicate that the candidate detection box is continuously adjusted from small to large and/or from large to small based on a certain size interval. For example, the predetermined rule may be that the candidate detection box is adjusted from large to small based on the size interval of 1. If a size of the candidate detection box is 10×10, the one or more detection boxes (e.g., 9×9, 8×8, 7×7, etc.) may be obtained after the adjustment.
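The large-to-small adjustment in the example above can be sketched as follows. The function name `shrink_candidate` and its `count` parameter are hypothetical; the disclosure does not state how many adjusted boxes are generated, so the stopping point here is an assumption.

```python
def shrink_candidate(size: int, interval: int = 1, count: int = 3) -> list[tuple[int, int]]:
    """Shrink a square size x size candidate box step by step.

    Returns up to `count` boxes, each `interval` smaller than the previous
    one, and stops early rather than produce a non-positive size.
    """
    boxes = []
    current = size - interval
    while current > 0 and len(boxes) < count:
        boxes.append((current, current))
        current -= interval
    return boxes


print(shrink_candidate(10))  # [(9, 9), (8, 8), (7, 7)]
```

An adjustment from small to large would follow the same pattern with the interval added instead of subtracted.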

By setting different sizes of the detection boxes, a probability that the object in the image matches one of the detection boxes may be improved, thereby increasing the accuracy that the target detection model detects the object in the image.

The predetermined scale formula refers to a predetermined formula for adjusting the length and the width of the detection box. The scale may be understood as a ratio by which the detection box is to be resized. Merely by way of example, the processing device 112 may determine the detection box based on the predetermined scale formula according to the following embodiments.

In some embodiments, the processing device 112 may obtain a plurality of candidate detection boxes, and determine the one or more detection boxes by adjusting a portion of the plurality of candidate detection boxes according to the predetermined manner based on sizes of feature maps each of which corresponds to one of the convolution layers in the target detection model.

The ratio of length to width and/or values of the length and width of each of the plurality of candidate detection boxes may be predetermined.

In some embodiments, the plurality of candidate detection boxes may be obtained by manually determining the size of the candidate detection box. In some embodiments, the plurality of candidate detection boxes may be obtained by clustering ratios of length to width of target labels in a data set based on a clustering algorithm. For example, ratios of length to width of target labels in a data set (e.g., visual object classes (VOC) , common objects in context (COCO) , Kitti data set, etc. ) may be clustered using a K-Means++ algorithm, and five different sizes of detection boxes (whose ratios of length to width are different) may be selected as the plurality of candidate detection boxes.
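As a rough illustration of the clustering step, the sketch below clusters length-to-width ratios with a small pure-Python K-Means using K-Means++ seeding. The function name `kmeans_pp_1d`, the label values, and the fixed random seed are all assumptions for illustration; an actual implementation would typically run on the labels of a real data set and might cluster (length, width) pairs rather than scalar ratios.

```python
import random


def kmeans_pp_1d(values, k, iterations=50, seed=0):
    """Cluster scalar values into k groups and return the sorted cluster centers."""
    if len(set(values)) < k:
        raise ValueError("need at least k distinct values")
    rng = random.Random(seed)
    # K-Means++ seeding: first center uniformly at random, each later center
    # with probability proportional to its squared distance to the nearest
    # center chosen so far.
    centers = [rng.choice(values)]
    while len(centers) < k:
        weights = [min((v - c) ** 2 for c in centers) for v in values]
        centers.append(rng.choices(values, weights=weights, k=1)[0])
    # Lloyd iterations: assign each value to its nearest center, then move
    # each center to the mean of its group (empty groups keep their center).
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(k), key=lambda i: (v - centers[i]) ** 2)
            groups[nearest].append(v)
        new_centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
        if new_centers == centers:
            break
        centers = new_centers
    return sorted(centers)


# Hypothetical (length, width) target labels in pixels, not taken from any data set.
labels = [(40, 20), (42, 20), (20, 40), (22, 40), (30, 30), (31, 30),
          (60, 20), (62, 20), (20, 60), (21, 60), (45, 30), (46, 30)]
ratios = [length / width for length, width in labels]
centers = kmeans_pp_1d(ratios, k=5)
print(centers)  # five representative ratios of length to width
```

Each returned center can then serve as the ratio of length to width of one candidate detection box.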

The feature map corresponding to one of the convolution layers refers to an output result of the convolution layer. The output result (e.g., output data) of the convolution layer may be present in a three-dimensional form, and the output result may be regarded as a superposition of a plurality of two-dimensional images. For example, the output data of the convolution layer may be present as A×B×C, where, A×B represents a two-dimensional image, and C represents a count (number) of superposed two-dimensional images. Each two-dimensional image may have a corresponding length and a corresponding width. In some embodiments, the length of the feature map may be the same as the width of the feature map.

The predetermined scale formula may be associated with the size of the feature map corresponding to the convolution layer of the target detection model. In some embodiments, input data of the predetermined scale formula may be the sizes of the feature maps corresponding to the convolution layers of the target detection model, and a calculation result of the predetermined scale formula may be a proportional coefficient of adjustment for the ratios of length to width of the candidate detection box.

Merely by way of example, the predetermined scale formula may be shown as Equation (1) :

$$S_k = S_{\min} + \frac{S_{\max} - S_{\min}}{W_{\max}} \times \frac{W_{m+1-k}}{W_1}, \quad k \in [1, m], \tag{1}$$

where $S_k$ represents the proportional coefficient of the adjustment for the ratios of length to width of the candidate detection box, $W$ is a length or width (the length is equal to the width) of the feature map of the object corresponding to the convolution layer of the target detection model, $S_{\max}$ is a maximum value of the scale, $S_{\min}$ is a minimum value of the scale, $W_{\max}$ is a maximum value of the length or the width of the feature map, $W_{\min}$ is a minimum value of the length or the width of the feature map, $k$ is an index of the network layer where the convolution layer is located, and $m$ is a count of the convolution layers. In some embodiments, $S_{\max}=0.95$, $S_{\min}=0.05$, $W_{\max}=39$, and $W_1=1$.

In some embodiments, the count of convolution layers of the target detection model may be 6. That is, m in the Equation (1) may be 6. An exemplary scale range may be obtained according to the Equation (1) , which may be shown in Table 1. The proportional coefficient S _k may be within the scale range.
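The computation can be sketched numerically under one reading of Equation (1), namely that its last factor is the feature-map size of layer m+1-k divided by W_1. That reading, the function name `scale_coefficients`, and the feature-map side lengths used below are assumptions; the side lengths are hypothetical values chosen only so the code runs, and the printed coefficients are not claimed to reproduce Table 1.

```python
def scale_coefficients(widths, s_min=0.05, s_max=0.95, w_max=39.0, w_1=1.0):
    """Return [S_1, ..., S_m] for m = len(widths).

    Assumed reading of Equation (1):
        S_k = S_min + (S_max - S_min) / W_max * W_{m+1-k} / W_1
    widths[i] holds W_{i+1}, the side length of the square feature map
    of convolution layer i + 1.
    """
    m = len(widths)
    step = (s_max - s_min) / w_max
    # W_{m+1-k} is widths[m - k] with zero-based indexing.
    return [s_min + step * widths[m - k] / w_1 for k in range(1, m + 1)]


# Hypothetical feature-map side lengths for m = 6 convolution layers.
widths = [38, 19, 10, 5, 3, 1]
for k, s_k in enumerate(scale_coefficients(widths), start=1):
    print(f"S_{k} = {s_k:.4f}")
```

The default arguments are the exemplary values stated with Equation (1); any other values can be passed in to explore how the scale range changes.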

Table 1 shows a comparison between a scale range of different convolutional layers of the target detection model improved by the predetermined scale formula and a scale range of different convolutional layers of the target detection model not improved by the predetermined scale formula.

Table 1

According to Table 1, the minimum scale and the maximum scale obtained based on the predetermined scale formula are smaller than the initial minimum scale and the initial maximum scale, respectively. With smaller scales, a larger count of detection boxes with different sizes may be obtained, so that a small object is more likely to fit one of the detection boxes when object(s) in the image are identified, thereby improving the detection accuracy of the small object. The small object refers to an object whose area in the visible light image is less than a specified value. For example, in the COCO data set, if an area of an object is less than 32×32, the object may be determined as the small object. As another example, in a pedestrian library (e.g., CityPerson), when a resolution of an initial image is 1024×2048, an object whose height is less than 75 pixels may be defined as the small object.
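The two small-object criteria quoted above can be written as simple predicates. The function names and default thresholds below are illustrative; the thresholds are the ones stated in the text (an area below 32×32 pixels, or a height below 75 pixels in a 1024×2048 image).

```python
def is_small_by_area(width_px: float, height_px: float, max_area: float = 32 * 32) -> bool:
    """Area-based criterion: the object covers fewer than `max_area` pixels."""
    return width_px * height_px < max_area


def is_small_by_height(height_px: float, max_height: float = 75) -> bool:
    """Height-based criterion, e.g., for pedestrians in a 1024x2048 image."""
    return height_px < max_height


print(is_small_by_area(10, 10))   # True: 100 pixels is below 1024
print(is_small_by_area(40, 40))   # False: 1600 pixels is not below 1024
print(is_small_by_height(60))     # True
```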

In some embodiments, the processing device 112 may obtain one or more detection boxes by adjusting the ratio of length to width of the candidate detection box. For example, the processing device 112 may obtain one or more detection boxes by multiplying the proportional coefficient determined based on the predetermined scale formula with the size of the candidate detection box. The size of the candidate detection box may be fixed.

However, in the visible light image, the size of the object that needs to be detected may be changed, and the size of the object may not match the size of the candidate detection box. Accordingly, in the embodiment, the plurality of detection boxes may be obtained based on the predetermined scale formula, the sizes of the feature maps corresponding to the convolutional layers, and the size of the candidate detection box, so that the detection boxes with different sizes may be obtained to match the size of the object to be detected in the visible light image, and then accurate detection results may be obtained.
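A minimal sketch of the multiplication described above follows: each convolution layer has one proportional coefficient, and every candidate detection box is scaled by it. The function name `boxes_per_layer` and the coefficient values are hypothetical and serve only to show the structure of the result.

```python
def boxes_per_layer(candidates, coefficients):
    """Scale every candidate (length, width) box by each layer's coefficient.

    Returns a dict mapping the 1-based layer index k to its detection boxes.
    """
    return {
        k: [(length * s_k, width * s_k) for length, width in candidates]
        for k, s_k in enumerate(coefficients, start=1)
    }


candidates = [(10, 10), (10, 8), (8, 10)]   # fixed candidate sizes from the text
coefficients = [0.5, 1.0]                   # hypothetical S_1 and S_2
result = boxes_per_layer(candidates, coefficients)
print(result[1])  # [(5.0, 5.0), (5.0, 4.0), (4.0, 5.0)]
print(result[2])  # [(10.0, 10.0), (10.0, 8.0), (8.0, 10.0)]
```

Because the candidate sizes stay fixed, all variation in detection box size across layers comes from the coefficients.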

In some embodiments, the processing device 112 may obtain the target detection model from a storage device offline. In some embodiments, the processing device 112 may obtain the target detection model online. The target detection model may be obtained according to a training process.

In some embodiments, a processing device that is the same as or different from the processing device 112 may obtain a plurality of training samples. Each training sample among the plurality of training samples may include a sample image and a label corresponding to the sample image. The sample image may include a sample visible light image, a partial region image (e.g., a regional image including the small target) in a sample visible light image, etc. The label may include a type, a size, a position, etc., of the detection box including a sample object in the sample visible light image.

The sample image may be obtained from image acquisition by imaging devices, public data sets, etc. The label may be obtained by manual annotation, automatic labeling, etc., which is not limited herein.

The processing device may obtain the target detection model by training an initial target detection model based on the plurality of training samples. For example, sample images may be input into the initial target detection model, and prediction results of the initial target detection model may be obtained. By adjusting parameters of the initial target detection model to reduce a difference between the prediction results and the labels, the initial target detection model (i.e., parameter values of the initial target detection model) may be iteratively updated, and finally, the trained target detection model may be obtained. The initial target detection model may refer to a model whose parameter values are initialized.

In some embodiments, the training of the initial target detection model may include the following operations.

The processing device may obtain a plurality of candidate detection boxes. In some embodiments, the processing device may obtain a prediction ratio of length to width of each pixel point of the sample object in the sample image by performing a cluster operation on a ratio of length to width of a label corresponding to the sample image. The ratio of length to width may include a ratio of a length value to a width value. In some embodiments, the ratio of length to width may include the length value and the width value. In some embodiments, the processing device may cluster ratios of length to width of target labels in the data set using a clustering algorithm, and select five (or more or fewer) ratios of length to width to obtain the prediction ratio of length to width of each pixel point of the feature map. Other clustering algorithms may also be used, as long as a clustering function can be implemented.
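The cluster operation on label ratios can be illustrated with a one-dimensional k-means over length-to-width ratios. This is a minimal sketch under assumptions: the text does not fix the clustering algorithm in this passage, and the function name `cluster_ratios`, the deterministic initialization, and the iteration limit are all hypothetical.

```python
def cluster_ratios(label_sizes, k=5, iterations=50):
    """Cluster length-to-width ratios of labeled boxes into k representative ratios.

    label_sizes: iterable of (length, width) pairs taken from the labels.
    """
    data = sorted(length / width for length, width in label_sizes)
    if not data or k < 1:
        raise ValueError("need at least one label and k >= 1")
    # Deterministic initialization: evenly spaced samples of the sorted ratios.
    centers = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for value in data:
            nearest = min(range(k), key=lambda j: abs(value - centers[j]))
            clusters[nearest].append(value)
        # An empty cluster keeps its previous center.
        updated = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
        if updated == centers:
            break
        centers = updated
    return sorted(centers)
```

Calling `cluster_ratios(labels, k=5)` would return five ratios that could then be assigned to each pixel point of the feature map.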

The processing device may define areas of detection boxes corresponding to the convolution layers based on the predetermined scale formula and the prediction ratio of length to width of each pixel point of the object. More descriptions regarding the clustering algorithm and the predetermined scale formula may be found above, which is not repeated herein.

The processing device may modify an identification ratio of positive samples to negative samples of the one or more sample detection boxes based on a loss function to reduce a difference between the detection result of the one or more detection boxes and the label. The positive sample may refer to a sample detection box that detects a sample object in the sample image. The negative sample may refer to a sample detection box that detects no sample object in the sample image. The size of the detection box may correspond to the object after the identification ratio is modified. When the target detection is performed, the detection box may be matched to the object to be identified to improve the accuracy of the target detection.

In some embodiments, the processing device may balance the positive and negative samples using a focal loss algorithm. The focal loss may have a balancing effect on the positive and negative samples, so that the target detection box may be identified, and the accuracy of the detection may be improved.
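For reference, the commonly published binary form of the focal loss is FL(p_t) = -α_t (1 - p_t)^γ log(p_t). The sketch below implements that general form for a single detection box; the default values α = 0.25 and γ = 2 are the usual literature defaults, not values stated in the present disclosure, and the function name is hypothetical.

```python
import math


def focal_loss(probability, is_positive, alpha=0.25, gamma=2.0):
    """Binary focal loss for one sample detection box.

    probability: predicted probability that the box contains an object.
    is_positive: 1 for a positive sample, 0 for a negative sample.
    """
    p_t = probability if is_positive == 1 else 1.0 - probability
    alpha_t = alpha if is_positive == 1 else 1.0 - alpha
    p_t = min(max(p_t, 1e-12), 1.0)  # guard against log(0)
    # (1 - p_t) ** gamma down-weights samples that are already well classified.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

Because easy negative samples (for example, background boxes predicted with a probability near 0) contribute almost nothing, the many negative samples are less likely to dominate the few positive ones.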

In some embodiments, when a count of iterations of the training of the target detection model reaches a predetermined count, or the loss function value is converged, the training may be stopped to obtain the target detection model.

When performing the target detection, the processing device 112 may perform the target detection through a variety of manners based on the target detection model. Merely by way of example, the processing device 112 may perform the target detection on the visible light image according to the following embodiments.

In some embodiments, the processing device 112 may obtain a first image by processing the visible light image based on an image processing model. The image processing model may be configured to enhance the visible light image. For example, the image processing model may be configured to enhance features of the visible light image. The processing device 112 may perform the target detection by inputting the first image into the target detection model.

The first image may be obtained after performing an enhancement operation on the features of the visible light image based on the image processing model. For example, the first image may be obtained after performing the enhancement operation on the visible light image or a partial region (e.g., a target region including the object) of the visible light image based on the image processing model. The enhancement operation based on the image processing model may enhance the features of the small object in the visible light image, and may denoise and geometrically repair the visible light image, which reduces false detections and missed detections due to unclear features of the small object in the image, thereby improving the detection accuracy. For example, under the action of weighted average, by calculating a difference between the pixel points of the object and pixel points of surrounding areas, illumination changes in the visible light image may be estimated, edge information in the image may be enhanced, and problems (e.g., insufficient edge sharpness, abrupt shadow boundaries, distortion of partial colors, unclear texture, etc.) may be reduced, thereby improving the accuracy of the image. The features of the small object refer to attribute information of the small object in the visible light image, which can be identified by the computer. In some embodiments, the attribute information of the small object may be used to distinguish the small object from other objects in the visible light image. For example, the attribute information of the small object may include the edge information, the color, the shape, etc., of the small object in the image.

In some embodiments, the processing device 112 may input the visible light image into the image processing model, and the image processing model may output the first image after the features of the visible light image are enhanced. In some embodiments, the processing device 112 may input an image including a target region in the visible light image to the image processing model, and the image processing model may output the first image. The target region may be a portion of the visible light image. The target region may be divided from the visible light image based on a segmentation algorithm. In some embodiments, the processing device 112 may enlarge the target region in the visible light image. Merely by way of example, a magnification for enlarging the target region in the visible light image may be 2 times, or other specified times. The features of the visible light image may be enhanced to obtain more features of the object. For example, for small objects in the visible light image, new features may be detected after the target region is enhanced, thereby improving the accuracy of the target detection.

In some embodiments, the image processing model may be obtained by a process according to the following embodiments.

A processing device that is the same as or different from the processing device 112 may obtain a plurality of first training samples. Each of the plurality of first training samples may include a first sample image and a first label. The first sample image may include a visible light sample image and/or an image including the target region in the visible light sample image. The first label may include an enhancement image corresponding to the first sample image, such as an image whose features are enhanced by enhancing the target region including the object. The first label may be obtained by mapping the first sample image (e.g., the mapping may be implemented in code).

The processing device may train an initial image processing model using the plurality of first training samples. The initial image processing model may include a model with certain model parameters. For example, the model parameters of the image processing model may be initialized by randomly sampling from a Gaussian distribution with a mean of 0 and a standard deviation of 1.

In some embodiments, the initial image processing model may be a neural network model, such as a deep learning network model. The image processing model may include a plurality of convolutional layers, and convolutional parameters of each convolution layer may be shown in Table 2 below.

Table 2 shows the convolutional parameters of each convolution layer in the image processing model.

Table 2

It should be noted that in the above examples, the count of convolution layers in the image processing model is six. The count of the convolution layers and the setting of the parameters of the image processing model may be adjusted according to needs, which is not limited herein. For example, in addition to the count of filters, the filter size, the step length, and the padding size, the parameters may include an initial value of a weight value (e.g., 0), an initial value of an offset item (e.g., 0), a count of groups (e.g., 1), etc.

Detailed parameters of each convolution layer in the image processing model are shown in Table 2. In Table 2, since the step size is set to 2, there is an overlap between a front convolution portion and a rear convolution portion. Since the padding size is equal to (the filter size-1)/2, the processing device may fill the image according to the padding size to compensate for the reduction of the image after the convolution operation, thereby keeping a width and a height of the image unchanged. For example, if the filter size is 5, the padding size may be 2, and each of the four edges may be expanded by 2 pixels. That is, the width and the height may each be expanded by 4 pixels to prevent the feature map from shrinking after the convolution operation.
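The relation between the filter size, the padding size, and the size of the output can be checked with the standard convolution arithmetic, output = floor((input + 2 × padding - filter) / step) + 1. The sketch below is a generic calculation, not taken from Table 2; the function names are hypothetical. Under this formula, a padding of (filter size - 1)/2 preserves the width and height exactly when the step size is 1, while a larger step size yields a smaller output.

```python
def conv_output_size(input_size, filter_size, stride, padding):
    """Spatial size of a convolution output along one dimension."""
    return (input_size + 2 * padding - filter_size) // stride + 1


def same_padding(filter_size):
    """Padding size equal to (filter size - 1) / 2 for an odd filter size."""
    return (filter_size - 1) // 2
```

For a 300-pixel input and a filter size of 5, `conv_output_size(300, 5, 1, same_padding(5))` gives 300, whereas without padding the result would be 296.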

In the embodiment, the count of convolution layers is small, which can reduce calculation consumption while maintaining excellent convolutional effects. Therefore, the operation efficiency may be improved, and the accuracy of the model detection may be ensured.

In some embodiments, the processing device may train the initial image processing model through a model training manner, for example, a gradient descent manner, etc. When the training meets one or more termination conditions, the training may be stopped to obtain the image processing model. The termination conditions may include that a count of iterations reaches a predetermined count, a loss function is converged, etc. In some embodiments, a root mean square error (RMSE) function may be used as the loss function of the image processing model, which can be shown in Equation (2):

$$\mathrm{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^2} \qquad (2)$$

where RMSE represents a value of the loss function, $m$ represents a total count of training samples, $y_i$ represents a processing result of the image processing model for the $i$-th training sample, and $\hat{y}_i$ represents the first label corresponding to the $i$-th first sample image. In the embodiment, the loss function may increase the sharpness of an image processed using the image processing model by 0.2 times, thereby improving the accuracy of the target detection.
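A root mean square error between processing results and first labels can be computed as in the sketch below. It uses the standard RMSE definition; treating each processing result and label as a single number (for example, one pixel value) is a simplification for illustration, and the function name is hypothetical.

```python
import math


def rmse(results, labels):
    """Root mean square error between model outputs and their labels."""
    if len(results) != len(labels) or not results:
        raise ValueError("results and labels must be non-empty and equal in length")
    squared_error = sum((r - t) ** 2 for r, t in zip(results, labels))
    return math.sqrt(squared_error / len(results))
```

Training would stop, for example, when this value no longer decreases or when the predetermined count of iterations is reached.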

In some embodiments, the processing device 112 may obtain a second image by processing the visible light image based on an image enhancement algorithm, and perform the target detection by inputting the second image into the target detection model.

The second image may be obtained after image enhancement based on an image enhancement algorithm. The image enhancement algorithm may include a single scale retinex (SSR) algorithm, a multi-scale retinex (MSR) algorithm, etc., which is not limited herein.
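As an illustration of the single scale retinex idea, the sketch below computes log(I) - log(G * I) for a small grayscale image, where G * I is a Gaussian-blurred copy of the image that approximates the illumination. It is a minimal pure-Python sketch under assumptions: the value of sigma, the replicated borders, the +1 offset that avoids log(0), and all function names are illustrative choices, not details given in the present disclosure.

```python
import math


def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian weights covering about three sigmas per side."""
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def gaussian_blur(image, sigma):
    """Separable Gaussian blur with replicated borders (image: list of rows)."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    height, width = len(image), len(image[0])

    def clamp(index, size):
        return min(max(index, 0), size - 1)

    horizontal = [[sum(kernel[k + radius] * image[y][clamp(x + k, width)]
                       for k in range(-radius, radius + 1))
                   for x in range(width)] for y in range(height)]
    return [[sum(kernel[k + radius] * horizontal[clamp(y + k, height)][x]
                 for k in range(-radius, radius + 1))
             for x in range(width)] for y in range(height)]


def single_scale_retinex(image, sigma=2.0):
    """Return log(I + 1) - log(blur(I) + 1) for every pixel."""
    blurred = gaussian_blur(image, sigma)
    return [[math.log(image[y][x] + 1.0) - math.log(blurred[y][x] + 1.0)
             for x in range(len(image[0]))] for y in range(len(image))]
```

A multi-scale variant would typically average the outputs obtained for several values of sigma.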

In some embodiments, an image that is processed based on the image enhancement algorithm may refer to the visible light image or a partial region (e.g., a target region including the object) of the visible light image. For example, the processing device may process an image of the target region including the object in the visible light image or an enlarged image of the image of the target region based on the image enhancement algorithm. The image processed based on the image enhancement algorithm may be the same as the image input to the image processing model for processing in the above embodiments.

In some embodiments, a resolution of the image processed based on the image processing model may be the same as a resolution of the image processed based on the image enhancement algorithm. For example, the resolution of the image of the target region may be 300×300. When a resolution of the first image is different from a resolution of the second image, the resolutions of the first image and the second image may be adjusted, so that the resolutions of the first image and the second image are the same. The adjusting of the resolutions refers to uniformly scaling the resolutions of the images including the small object in the visible light image that is processed to obtain the first image and the second image to a specified resolution. For example, a resolution of the image of the small object may be transformed to 300×300 from 128×96. The transformation may be implemented by a pixel converter, a resolution conversion algorithm, and other techniques, which are not repeated herein. The transforming of the resolution of the obtained image to a specified resolution may help to quickly identify and detect the object from the obtained image, and reduce identification errors caused by different resolutions.
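The scaling of an image to a specified resolution (e.g., from 128×96 to 300×300) can be sketched with nearest-neighbor sampling. Nearest-neighbor is chosen only because it is short to write; the text does not specify the resolution conversion algorithm, and the function name is hypothetical.

```python
def resize_nearest(image, new_width, new_height):
    """Resize an image (list of rows) by nearest-neighbor sampling."""
    height, width = len(image), len(image[0])
    resized = []
    for y in range(new_height):
        source_y = min(height - 1, y * height // new_height)
        row = [image[source_y][min(width - 1, x * width // new_width)]
               for x in range(new_width)]
        resized.append(row)
    return resized
```

Calling `resize_nearest(image, 300, 300)` on an image with 96 rows and 128 columns returns 300 rows of 300 pixels each. Note that scaling a 128×96 image to 300×300 changes the aspect ratio of its content.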

The transforming of the resolution of the image may help to quickly identify the image on the premise of ensuring accuracy. In some embodiments, when a resolution of the obtained visible light image is 128×96, the resolution of the obtained visible light image may be transformed to 300×300 to facilitate identification. If the resolution is transformed to 600×600, the identification efficiency may be reduced, and the recognition accuracy may be improved. If the resolution is transformed to 100×100, the identification efficiency may be improved, and the recognition accuracy may be reduced.

In some embodiments, the processing device 112 may directly designate a detection result obtained based on the first image as a first detection result of the visible light image, or directly designate a detection result obtained based on the second image as the first detection result of the visible light image. In some embodiments, the processing device 112 may combine the detection result of the first image and the detection result of the second image to obtain a combination result according to a combination mode, and designate the combination (or fused) detection result as the first detection result of the visible light image.

In some embodiments, the combination mode may be that the detection result of the first image is mapped to the detection result of the second image. For example, the detection result of the first image may be marked in the detection result of the second image. As another example, the detection result of the second image may be marked in the detection result of the first image. As still another example, the detection result of the first image and the detection result of the second image may be marked in the visible light image or the processed image. In some embodiments, a combination result may be processed using a non-maximum suppression (NMS) algorithm to obtain the first detection result of the visible light image.
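The combination of two detection results followed by non-maximum suppression can be sketched as follows. The box format (x1, y1, x2, y2, score), the intersection-over-union threshold of 0.5, and the function names are assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2, ...)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Greedy NMS: keep a box only if it does not overlap a higher-scoring kept box."""
    kept = []
    for det in sorted(detections, key=lambda d: d[4], reverse=True):
        if all(iou(det, k) <= iou_threshold for k in kept):
            kept.append(det)
    return kept


def combine_results(first_image_result, second_image_result, iou_threshold=0.5):
    """Pool the detections of the first and second images, then suppress duplicates."""
    return nms(list(first_image_result) + list(second_image_result), iou_threshold)
```

When both images yield a box around the same object, only the higher-scoring box survives, while a box found in only one of the images is kept.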

In some embodiments, the processing device 112 may directly use the first detection result of the visible light image as the target detection result. In some embodiments, the processing device 112 may perform a second detection or further detections based on the first detection result of the visible light image, and obtain the target detection result. Taking the second detection as an example, the processing device 112 may determine whether the second detection is required based on the first detection result of the visible light image. When the second detection is required, the second detection may be performed based on the infrared image corresponding to the visible light image, and the target detection result may be determined based on the first detection result and the second detection result.

In some embodiments, the processing device 112 may obtain an infrared image corresponding to the visible light image. The infrared image corresponding to the visible light image may indicate that the infrared image and the visible light image represent the same scene in a region. For example, the infrared image and the visible light image may be obtained simultaneously. As another example, an object in the infrared image may be the same as an object in the visible light image. The processing device 112 may determine, based on the first detection result of the visible light image and the infrared image, whether a region corresponding to the first detection result of the visible light image satisfies a condition. In response to that the region corresponding to the first detection result of the visible light image satisfies the condition, the processing device 112 may obtain a second detection result of the infrared image by performing the target detection on the infrared image. Finally, the processing device 112 may determine the target detection result based on the first detection result of the visible light image and the second detection result of the infrared image. More descriptions regarding the infrared detection may be found elsewhere in the present disclosure, e.g., FIG. 4 and the descriptions thereof.

Through the second detection, the infrared image detection may cooperate with the visible light image detection. A joint determination of the infrared image detection and the visible light image detection may effectively improve the accuracy of the target detection, so that subsequent target tracking and alarm processing may be reliable.

In some embodiments, the target detection result may be used to track the detected object in the visible light image. For example, the processing device 112 may obtain a motion trajectory of the object by tracking the object, and determine whether the motion trajectory of the object includes an anomaly based on the motion trajectory of the object. When the motion trajectory of the object includes the anomaly, the processing device 112 may perform an alarm operation. More descriptions regarding the tracking of the object in the target detection result and the alarming may be found elsewhere in the present disclosure, e.g., FIG. 6 and the descriptions thereof.

In some embodiments of the present disclosure, the object in the image may be detected based on an improved model for target detection. Since the detection box used by the target detection model for target detection is a detection box adjusted based on the size of the feature map corresponding to each convolutional layer, the detection box may be matched with the object in the image, thereby improving the accuracy of the detection result. In addition, before the target detection is performed using the target detection model, the image enhancement may be performed on the visible light image or the image of the target region in the visible light image using the image processing model and/or the image enhancement algorithm, so that the features in the image are extracted, and the accuracy of target detection is also improved. Besides, the accuracy and reliability of the target detection may be further improved in a combination of the visible light detection and the infrared image detection.

FIG. 4 is a flowchart illustrating an exemplary process for target detection according to some embodiments of the present disclosure. In some embodiments, process 400 may be executed by the detection system 100. For example, the process 400 may be implemented as a set of instructions (e.g., an application) stored in a storage device (e.g., the storage device 130) . In some embodiments, the processing device 112 (e.g., one or more modules illustrated in FIG. 2) may execute the set of instructions and may accordingly be directed to perform the process 400. The operations of the illustrated process presented below are intended to be illustrative. In some embodiments, the process 400 may be accomplished with one or more additional operations not described and/or without one or more of the operations discussed. Additionally, the order of the operations of process 400 illustrated in FIG. 4 and described below is not intended to be limiting.

In 410, the processing device 112 (e.g., the image acquisition module 210) may obtain an infrared image corresponding to a visible light image.

The infrared image refers to a viewable image reflecting the temperature of objects and scenes. Infrared radiation from the objects and scenes may be received by an infrared imaging device, and the infrared radiation may be converted into the infrared image reflecting the temperature of the objects and scenes through a photoelectric conversion. Exemplary infrared imaging devices may include an infrared camera, an infrared video recorder, an infrared video sensor, or the like, or any combination thereof.

In some embodiments, the correspondence between the infrared image and the visible light image may mean that the infrared image and the visible light image represent the same scene in a same space. That is, the infrared image and the visible light image may be obtained simultaneously. In some embodiments, the visible light image may include the same object as the infrared image. For example, a visible light imaging device and an infrared imaging device may be mounted at a same position or at two positions adjacent to each other. In addition, imaging parameters (e.g., exposure time, interval time, etc.) and/or mounting parameters (e.g., a mounting angle, a mounting height, etc.) of the visible light imaging device may be adjusted to be the same as or substantially similar to those of the infrared imaging device. Therefore, an imaging region of the visible light imaging device may be the same as or substantially similar to an imaging region of the infrared imaging device. When imaging of the visible light imaging device and imaging of the infrared imaging device are performed at the same time, the infrared image and the corresponding visible light image may be obtained.

In some embodiments, the infrared image may be obtained by the infrared thermal imaging device (e.g., an infrared thermal camera, an infrared thermal imaging sensor, etc. ) at a same time that the visible light image is obtained using an imaging device.

In 420, the processing device 112 (e.g., the target detection module 220) may determine, based on the infrared image and a first detection result that is determined based on the visible light image, whether a region corresponding to the first detection result satisfies a condition.

The first detection result refers to a result obtained by performing target detection on the visible light image. In some embodiments, the first detection result may be obtained by directly performing the target detection on the visible light image through the target detection model. In some embodiments, the first detection result may be a result obtained by performing, through the target detection model, the target detection on a first image processed through an image processing model. In some embodiments, the first detection result may be a result obtained by performing, through the target detection model, the target detection on a second image processed through an image enhancement algorithm. In some embodiments, the first detection result may also be a combination of the results obtained by performing the target detection on the first image and the second image. More descriptions for the first detection result may be found elsewhere in the present disclosure (e.g., FIG. 3 and the descriptions thereof).

The first detection result may include a position of an object detected from the visible light image. The region corresponding to the first detection result may indicate the position of the object in the visible light image.

The condition may relate to whether a temperature in the infrared image at different moments has changed in the region where the object is located in the first detection result of the visible light image. The condition may be satisfied if the temperature in that region has changed between the different moments. Since different objects reflect different temperatures in the infrared image, when there is a change in the position of a pedestrian or a vehicle in a region, the corresponding position change may be reflected in the infrared image through temperature changes. For example, in a first infrared image (which may be captured at any earlier time), a temperature of a position may be 10 degrees Celsius. At a current moment (e.g., a time corresponding to the visible light image), the temperature of the position may be 36.5 degrees Celsius. In response to that the temperature changes, the processing device 112 may determine that the region including the position meets the condition.

In some embodiments, the processing device 112 may determine whether the region corresponding to the first detection result of the visible light image meets the condition in the following manner.

The processing device 112 may match the first detection result of the visible light image to the infrared image. The matching between the first detection result of the visible light image and the infrared image may also be referred to as matching the visible light image with the detected object to the infrared image to determine a position of the object or the region in the infrared image. The processing device 112 may determine the temperature of the region based on the infrared image and the position of the region in the infrared image.

The processing device 112 may determine whether a temperature of the region is changed in the matched infrared image. In response to that the temperature changes, the processing device 112 may determine that the region meets the condition.

In some embodiments, the processing device 112 may obtain an initial infrared image corresponding to the infrared image. For example, the initial infrared image may include first few frames, dozens of frames, or hundreds of frames before the infrared image is acquired. A temperature of the region in the initial infrared image may be compared with the temperature of the region in the infrared image. If the temperatures are different (or a temperature difference is greater than a threshold to eliminate an impact of environmental changes) , the processing device 112 may determine that the temperature has changed and meets the condition.
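The comparison between the initial infrared image and the current infrared image can be sketched as a thresholded difference of region temperatures. Representing each infrared image as a two-dimensional list of temperatures in degrees Celsius, using the mean over the region, and the threshold of 2 degrees are all assumptions for illustration; the function names are hypothetical.

```python
def region_mean(temperature_map, region):
    """Mean temperature inside region = (x1, y1, x2, y2), end-exclusive."""
    x1, y1, x2, y2 = region
    values = [temperature_map[y][x] for y in range(y1, y2) for x in range(x1, x2)]
    return sum(values) / len(values)


def temperature_changed(initial_map, current_map, region, threshold=2.0):
    """True if the region temperature differs by more than the threshold."""
    difference = region_mean(current_map, region) - region_mean(initial_map, region)
    return abs(difference) > threshold
```

With the example values from the text, a region that reads 10 degrees Celsius in the initial infrared image and 36.5 degrees Celsius in the current one would meet the condition.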

If the condition is met, operation 430 may be performed. If the condition is not met, the first detection result of the visible light image needs to be re-determined, and the target detection may not be performed on the infrared image.

In 430, the processing device 112 (e.g., the target detection module 220) may obtain a second detection result of the infrared image by performing the target detection on the infrared image.

The second detection result refers to a detection result obtained by performing the target detection on the infrared image. Similar to the first detection result, the second detection result may include a type, a size, and a position, etc., of the detection object.

In some embodiments, the processing device 112 may perform the target detection on the infrared image in a similar manner to the target detection of the visible light image. For example, the target detection may be performed by using an infrared target detection model that is trained by a plurality of sample infrared images and corresponding labels. The training mode of the infrared target detection model may be the same as the training mode of the target detection model, which is not described herein.

In some embodiments, the processing device 112 may perform the target detection on the infrared image in a different manner to the target detection of the visible light image. For example, the target detection of the visible light image may be performed based on the target detection model of the SSD algorithm, and the target detection of the infrared image may be based on other algorithms (e.g., Faster RCNN, YOLO V2&V3, etc. ) , which is not limited herein.

In 440, the processing device 112 (e.g., the target detection module 220) may determine the target detection result based on the first detection result of the visible light image and the second detection result of the infrared image.

In some embodiments, the processing device 112 may compare the second detection result with the first detection result, and determine whether the first detection result is the same as the second detection result. For example, the processing device 112 may determine whether the type, size, position, etc., of the object in the first detection result are the same as the type, size, position, etc., of the object in the second detection result, respectively. In response to that the type, size, and position of the object in the first detection result are the same as the type, size, and position of the object in the second detection result, the first detection result may be correct, and the type, size, and position of the object in the first detection result and/or the second detection result may be designated as the target detection result. In response to that at least one of the type, size, and position of the object in the first detection result is different from that of the object in the second detection result, at least one of the first detection result and the second detection result may be incorrect, and the processing device 112 may repeat operation 410 to obtain a new first detection result.
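The comparison of the two detection results can be sketched as below. In practice, two detectors rarely return identical coordinates, so this sketch treats size and position as "the same" when the boxes overlap strongly; that tolerance (an intersection-over-union threshold of 0.5), the `Detection` structure, and the function names are assumptions rather than details from the present disclosure.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Detection:
    label: str                               # type of the object
    box: Tuple[float, float, float, float]   # (x1, y1, x2, y2): size and position


def box_iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def results_match(first, second, iou_threshold=0.5):
    """True if both results agree on the type and, approximately, on the box."""
    return (first.label == second.label
            and box_iou(first.box, second.box) >= iou_threshold)
```

If `results_match` returns True, either result could be designated as the target detection result; otherwise the detection would be repeated.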

In the embodiment, a joint determination of the first detection result of the visible light image and the second detection result of the infrared image may effectively reduce a false detection caused by a single detection based on the visible light image, so that the final detection result may be reliable.

FIG. 5 is a flowchart illustrating an exemplary process for determining a target detection result according to some embodiments of the present disclosure. In some embodiments, process 500 may be executed by the detection system 100. For example, the process 500 may be implemented as a set of instructions (e.g., an application) stored in a storage device (e.g., the storage device 130) . In some embodiments, the processing device 112 (e.g., one or more modules illustrated in FIG. 2) may execute the set of instructions and may accordingly be directed to perform the process 500. The operations of the illustrated process presented below are intended to be illustrative. In some embodiments, the process 500 may be accomplished with one or more additional operations not described and/or without one or more of the operations discussed. Additionally, the order of the operations of process 500 illustrated in FIG. 5 and described below is not intended to be limiting.

In 510, the processing device 112 may obtain a visible light image.

In 520, the processing device 112 may determine a first detection result of the visible light image by performing a target detection on the visible light image based on a target detection model.

In 530, the processing device 112 may obtain a target region corresponding to the first detection result in the infrared image by matching the first detection result of the visible light image into the infrared image.

In 540, the processing device 112 may determine whether a temperature of the target region in the infrared image is abnormal based on the infrared image.

If the temperature is abnormal, the process may proceed to operation 550 to perform the target detection on the infrared image and obtain a second detection result of the infrared image.

If the temperature is normal, the process may proceed to operation 580 to end a current operation.

In 560, the processing device 112 may determine whether the second detection result of the infrared image is the same as the first detection result of the visible light image.

In response to that the second detection result of the infrared image is the same as the first detection result of the visible light image, the process may proceed to operation 570 to determine the target detection result.

In response to that the second detection result of the infrared image is different from the first detection result of the visible light image, the process may proceed to operation 580 to end the current operation.

More descriptions regarding operations in FIG. 5 may be found elsewhere in the present disclosure, e.g., FIG. 3 to FIG. 4 and the descriptions thereof, which are not repeated herein.
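The control flow of operations 510 to 580 may be sketched as a single function. The function name `process_500` and the four callables passed to it (`detect`, `match_region`, `temperature_abnormal`, `same_result`) are hypothetical placeholders for the detection model, the visible-to-infrared matching, the temperature check, and the result comparison; none of them is defined by the disclosure.

```python
def process_500(visible_image, infrared_image, detect, match_region,
                temperature_abnormal, same_result):
    """Return the target detection result, or None when the process ends early.

    All four callables are assumed to be supplied by the caller.
    """
    first = detect(visible_image)                         # operation 520
    region = match_region(first, infrared_image)          # operation 530
    if not temperature_abnormal(region, infrared_image):  # operation 540
        return None                                       # operation 580
    second = detect(infrared_image)                       # operation 550
    if same_result(first, second):                        # operation 560
        return first                                      # operation 570
    return None                                           # operation 580
```

Returning `None` stands in for ending the current operation; a caller could then obtain a new visible light image and start again.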

FIG. 6 is a flowchart illustrating an exemplary process for target tracking according to some embodiments of the present disclosure. In some embodiments, process 600 may be executed by the detection system 100. For example, the process 600 may be implemented as a set of instructions (e.g., an application) stored in a storage device (e.g., the storage device 130). In some embodiments, the processing device 112 (e.g., one or more modules illustrated in FIG. 2) may execute the set of instructions and may accordingly be directed to perform the process 600. The operations of the illustrated process presented below are intended to be illustrative. In some embodiments, the process 600 may be accomplished with one or more additional operations not described and/or without one or more of the operations discussed. Additionally, the order of the operations of process 600 illustrated in FIG. 6 and described below is not intended to be limiting.

In 610, the processing device 112 (e.g., the target detection module 220) may obtain, based on a target detection result, a detection box set that identifies an object in the target detection result.

The detection box set refers to a set of multiple detection boxes for identifying the object. The object may be an object included in the target detection result.

In some embodiments, the detection box set may include a first detection box set, a second detection box set, and a third detection box set.

The first detection box set and the second detection box set may refer to sets of detection boxes for identifying the object. The first detection box set and the second detection box set may be obtained from tracking images based on the object in the target detection result using a first target tracking algorithm. The tracking image refers to an image including the object. In some embodiments, the tracking images may be obtained at different times by a visible light imaging device.

The third detection box set may refer to a set of detection boxes for identifying the object. The third detection box set may be obtained from the tracking image based on the object in the target detection result using a second target tracking algorithm.

In some embodiments, the first target tracking algorithm may be different from the second target tracking algorithm. For example, the first target tracking algorithm may be a deep learning neural network (e.g., Siamese region proposal network (SiamRPN)) algorithm, and the second target tracking algorithm may be a kernelized correlation filter (KCF) tracking algorithm. In some embodiments, the first target tracking algorithm and the second target tracking algorithm may be other algorithm types. For example, the first target tracking algorithm may be a deep learning neural network SiamRPN++ algorithm, etc., which is not limited herein. In some embodiments, more detection box sets may be obtained based on a third target tracking algorithm and/or a fourth target tracking algorithm.

Merely by way of example, the following illustrates how to obtain a detection box set, taking the first target tracking algorithm as the deep learning neural network SiamRPN algorithm and the second target tracking algorithm as the KCF tracking algorithm. In some embodiments, for a first input, the processing device 112 may input the target detection result and a first tracking image to the SiamRPN algorithm and the KCF tracking algorithm, and obtain an output result for the first input accordingly. The SiamRPN algorithm may output a detection box in the first detection box set, and the KCF tracking algorithm may output a detection box in the third detection box set. Starting from a second input, the input data may be the output result of the previous round and a new tracking image, and the output result may be the detection box corresponding to the input data. After a plurality of rounds of calculation, the first detection box set and the third detection box set may be obtained.
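The round-by-round procedure above, in which each tracker receives its own previous output together with a new tracking image, may be sketched as follows. The function `run_trackers` and the two tracker callables are hypothetical stand-ins; they do not call any real SiamRPN or KCF implementation.

```python
def run_trackers(initial_box, tracking_images, tracker_a, tracker_b):
    """Collect one box per tracking image from each of two trackers.

    tracker_a and tracker_b are assumed callables taking (previous_box, image)
    and returning a new box. The first round uses the box from the target
    detection result; later rounds use each tracker's previous output.
    """
    first_set, third_set = [], []
    prev_a = prev_b = initial_box
    for image in tracking_images:
        prev_a = tracker_a(prev_a, image)   # stand-in for the first algorithm
        prev_b = tracker_b(prev_b, image)   # stand-in for the second algorithm
        first_set.append(prev_a)
        third_set.append(prev_b)
    return first_set, third_set
```

Keeping a separate previous box per tracker means an error in one algorithm does not propagate into the other set.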

In some embodiments, the second detection box set may be obtained based on the first detection box set. The first detection box set may be marked with a penalty coefficient, and the second detection box set may be marked with no penalty coefficient. The penalty coefficient may increase the probability that the detection box selected to identify the object is an accurate box. For example, if there are 10 detection boxes with penalty coefficients, the detection boxes may be selected according to the magnitude of the penalty coefficient corresponding to each detection box. The penalty coefficient of a correct detection box may be smaller, so the probability that the detection box is selected may be higher. The penalty coefficient of a wrong detection box may be larger, so the probability that the detection box is selected may be lower. Therefore, the correct detection box may be selected with a higher probability.

In the embodiment, the detection box sets (e.g., the first detection box set, the second detection box set, the third detection box set) may be obtained by the deep learning neural network SiamRPN algorithm and the KCF tracking algorithm, respectively, and the target detection box set may then be obtained based on the detection box sets. Because the target detection box is determined in two different manners, the accuracy of tracking the object by the target detection box may be increased, which avoids inaccurate tracking of the object caused by a single detection box acquisition manner.

When the deep learning neural network SiamRPN algorithm outputs the detection box, a score corresponding to the detection box may also be output. The penalty coefficient may be calculated based on the score, for example, by multiplying a difference between 1 and the score by a fixed value.

In some embodiments, the processing device 112 may copy the obtained first detection box set, then remove the penalty coefficients from the detection boxes to obtain the second detection box set.
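As a minimal sketch of the two steps above, the penalty coefficient may be computed as a fixed value times (1 − score), and the second detection box set may be derived by copying the first set without its penalties. The `TrackedBox` record, the function names, and the `scale` parameter (the fixed value) are assumptions introduced for illustration.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class TrackedBox:
    """Hypothetical record for one tracker output."""
    box: Tuple[float, float, float, float]   # (x1, y1, x2, y2)
    score: float                             # tracker score in [0, 1]
    penalty: Optional[float] = None          # None means no penalty coefficient


def penalty_coefficient(score: float, scale: float = 1.0) -> float:
    """Fixed value times (1 - score): a higher score gives a smaller penalty."""
    return scale * (1.0 - score)


def with_penalties(boxes: List[TrackedBox], scale: float = 1.0) -> List[TrackedBox]:
    """First detection box set: every box marked with a penalty coefficient."""
    return [replace(b, penalty=penalty_coefficient(b.score, scale)) for b in boxes]


def without_penalties(first_set: List[TrackedBox]) -> List[TrackedBox]:
    """Second detection box set: a copy of the first set with penalties removed."""
    return [replace(b, penalty=None) for b in first_set]
```

Because `replace` returns new records, the first set keeps its penalties after the second set is derived from it.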

In 620, the processing device 112 (e.g., the target detection module 220) may obtain a target detection box set by screening the detection box set.

The target detection box set refers to a set of remaining detection boxes after screening the detection boxes in the detection box set.

In some embodiments, the processing device 112 may screen based on confidence levels corresponding to the detection boxes in the first detection box set, the second detection box set, and the third detection box set.

For example, the processing device 112 may obtain one or more confidence levels each of which corresponds to one of one or more detection boxes in the detection box set (e.g., the first detection box set, the second detection box set, and the third detection box set). The confidence level may be obtained based on the score when the target tracking algorithm outputs the detection box.

The processing device 112 may screen the detection boxes in the detection box set (e.g., the first detection box set, the second detection box set, and the third detection box set) based on the confidence levels, and determine the target detection box set. In some embodiments, the processing device 112 may select a preset count (such as 100) of detection boxes with the highest confidence levels in each detection box set according to the confidence levels, and then perform a fusion operation on the detection boxes in each set. For example, the first detection box set may be fused to the third detection box set, and the second detection box set may be fused to the third detection box set, and then a union of the two fused detection box sets may be taken. In some embodiments, repeated detection boxes may be removed during the union.
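The selection, fusion, and union steps above may be sketched as follows. The disclosure does not define the fusion operation, so this sketch models it as simple concatenation, which is an assumption; the function names and the representation of a scored box as a `(box, confidence)` tuple are also illustrative.

```python
def top_k(scored_boxes, k=100):
    """Keep the k entries with the highest confidence; entries are (box, confidence)."""
    return sorted(scored_boxes, key=lambda item: item[1], reverse=True)[:k]


def fuse(set_a, set_b):
    """Assumed fusion: concatenate two detection box sets."""
    return list(set_a) + list(set_b)


def union_without_duplicates(*groups):
    """Union of several sets, dropping repeated boxes while keeping order."""
    seen, merged = set(), []
    for group in groups:
        for item in group:
            if item not in seen:
                seen.add(item)
                merged.append(item)
    return merged


def screen_by_confidence(first_set, second_set, third_set, k=100):
    """Top-k per set, fuse first+third and second+third, then take the union."""
    first_top, second_top, third_top = (
        top_k(s, k) for s in (first_set, second_set, third_set)
    )
    return union_without_duplicates(
        fuse(first_top, third_top), fuse(second_top, third_top)
    )
```

Duplicates are detected by exact equality of the `(box, confidence)` tuples, which is the simplest reading of "repeated detection boxes".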

For the detection box set obtained after the union, screening may also be performed based on a preset screening algorithm (e.g., a non-maximum suppression (NMS) algorithm, etc.) to remove detection boxes that have a high similarity degree to one another, and finally, the target detection box set may be obtained. For example, 5 detection boxes may be finally obtained and used as target detection boxes in the target detection box set.
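For readers unfamiliar with NMS, a standard greedy non-maximum suppression may be sketched as follows. The overlap measure (intersection over union) and the threshold value of 0.5 are conventional choices assumed here; the disclosure does not state which variant or threshold is used.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of boxes kept by greedy non-maximum suppression.

    Boxes are visited in descending score order; a box is kept only if it
    does not overlap any already-kept box by more than iou_threshold.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

Lowering `iou_threshold` suppresses more aggressively and leaves fewer target detection boxes.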

In 630, the processing device 112 (e.g., the target detection module 220) may determine a target tracking box based on feature information of one or more target detection boxes in the target detection box set.

The target tracking box refers to a target detection box for tracking an object.

The feature information refers to information obtained by performing feature extraction on the object in each of the target detection boxes in the target detection box set.

In some embodiments, the processing device 112 may obtain the feature information by various feature extraction manners. For example, the feature information may be obtained by a feature extraction network.

In some embodiments, the processing device 112 may obtain a similarity degree by comparing the feature information and the initial feature information of the object to be tracked, and determine the target tracking box based on the similarity degree. The initial feature information may be extracted based on the object in the target detection result.

For example, the processing device 112 may obtain the one or more target detection boxes in the target detection box set and the initial feature information extracted based on the object in the target detection result.

The processing device 112 may determine one or more similarity degrees based on the feature information of the object in the one or more target detection boxes in the target detection box set and the initial feature information extracted based on the object in the target detection result. In some embodiments, the similarity degree may be determined in various manners, for example, by calculating a cosine distance, a Euclidean distance, etc., between the feature information and the initial feature information.
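The two measures named above may be computed on feature vectors as in the following sketch. It assumes the feature information is a flat list of numbers of equal length; how the feature extraction network produces such vectors is outside this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0   # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between two feature vectors (0.0 means identical)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

Note the opposite orientation of the two measures: a larger cosine similarity indicates a closer match, whereas a larger Euclidean distance indicates a poorer one, so a distance would need to be converted before being used as a similarity degree.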

The processing device 112 may determine the target tracking box based on the one or more similarity degrees and the one or more confidence levels corresponding to the one or more target detection boxes in the detection box set. In some embodiments, the processing device 112 may determine the target tracking box based on a similarity degree between an object corresponding to the target detection box in the target detection box set and an object corresponding to the target detection result, and a respective corresponding confidence level. Merely by way of example, the processing device 112 may determine the target tracking box by a weighted summation. If the confidence levels corresponding to the target detection boxes in the target detection box set are $\{s_1, s_2, s_3, s_4, s_5\}$, $s_i \in (0, 1)$, $i \in \{1, 2, 3, 4, 5\}$, and the similarity degrees between the target detection boxes and the target detection result are $\{y_1, y_2, y_3, y_4, y_5\}$, $y_i \in (0, 1)$, $i \in \{1, 2, 3, 4, 5\}$, a final confidence score of each target detection box may be represented as $\mathrm{score}_i = 0.6 \, s_i + 0.4 \, y_i$, $i \in \{1, 2, 3, 4, 5\}$. The final confidence scores of the target detection boxes may be arranged in descending order, and the target detection box with the highest final confidence score may be selected as the target tracking box.
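The weighted summation may be sketched as follows, using the weights 0.6 and 0.4 from the example above. The function name and the sample values in the usage note are illustrative only.

```python
def select_tracking_box(confidences, similarities, w_conf=0.6, w_sim=0.4):
    """Return (best_index, final_scores) for the weighted sum w_conf*s_i + w_sim*y_i."""
    if len(confidences) != len(similarities):
        raise ValueError("confidences and similarities must have the same length")
    if not confidences:
        raise ValueError("at least one target detection box is required")
    scores = [w_conf * s + w_sim * y for s, y in zip(confidences, similarities)]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best, scores
```

For instance, with confidence levels `[0.9, 0.8, 0.7, 0.6, 0.5]` and similarity degrees `[0.2, 0.9, 0.5, 0.4, 0.3]`, the second box is selected: its lower confidence is outweighed by its much higher similarity to the initial object.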

The target tracking box may be used to track a motion trajectory of the object to determine whether a behavior of the object is abnormal, for example, crossing a pre-set perimeter, staying in place for a long time, etc. If the behavior of the object is abnormal, an alarm may be triggered. The alarm may be presented by a text, a sound, a light, or the like, or any combination thereof.
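The two example behaviors may be checked on a trajectory of box centers as in the following sketch. The rectangular perimeter, the `radius` and `min_points` parameters, and both function names are assumptions made for illustration; the disclosure does not specify how either behavior is detected.

```python
import math


def crosses_perimeter(trajectory, perimeter):
    """True if consecutive points switch between inside and outside the perimeter.

    trajectory: list of (x, y) centers; perimeter: assumed axis-aligned
    rectangle (x1, y1, x2, y2).
    """
    x1, y1, x2, y2 = perimeter
    states = [x1 <= x <= x2 and y1 <= y <= y2 for x, y in trajectory]
    return any(a != b for a, b in zip(states, states[1:]))


def stays_in_place(trajectory, radius, min_points):
    """True if the last min_points centers all lie within radius of the
    first center of that window (a simple loitering heuristic)."""
    if min_points < 1 or len(trajectory) < min_points:
        return False
    window = trajectory[-min_points:]
    ax, ay = window[0]
    return all(math.hypot(x - ax, y - ay) <= radius for x, y in window)
```

With frames sampled at a fixed rate, `min_points` corresponds to a dwell time, so "a long time" becomes a configurable count of frames.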

FIG. 7 is a flowchart illustrating an exemplary process for tracking an object or alarming according to some embodiments of the present disclosure. In some embodiments, process 700 may be executed by the detection system 100. For example, the process 700 may be implemented as a set of instructions (e.g., an application) stored in a storage device (e.g., the storage device 130). In some embodiments, the processing device 112 (e.g., one or more modules illustrated in FIG. 2) may execute the set of instructions and may accordingly be directed to perform the process 700. The operations of the illustrated process presented below are intended to be illustrative. In some embodiments, the process 700 may be accomplished with one or more additional operations not described and/or without one or more of the operations discussed. Additionally, the order of the operations of process 700 illustrated in FIG. 7 and described below is not intended to be limiting.

In 710, the processing device 112 may obtain a target detection result.

In 720, the processing device 112 may obtain a detection box set that identifies an object in the target detection result. The detection box set may include a first detection box set, a second detection box set, and a third detection box set.

In 730, the processing device 112 may determine a target tracking box based on the first detection box set, the second detection box set, and the third detection box set.

In 740, the processing device 112 may analyze, based on the target tracking box, a motion trajectory of the object.

In 750, the processing device 112 may determine whether there is an abnormal behavior of the object based on the motion trajectory of the object.

If there is abnormal behavior, the process may proceed to operation 760, in which the processing device 112 may perform an alarm operation.

If there is no abnormal behavior, operation 710 may be repeated, and the processing device 112 may re-track the object or obtain a new target detection result to track other objects.

More descriptions regarding operations in FIG. 7 may be found elsewhere in the present disclosure, e.g., FIG. 5 and the descriptions thereof, which are not repeated herein.
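One pass through operations 720 to 760 may be sketched as follows. The function `process_700_step`, the helper `box_center`, and the callables `get_box_sets`, `pick_tracking_box`, `is_abnormal`, and `alarm` are hypothetical placeholders for the steps described above, not components defined by the disclosure.

```python
def box_center(box):
    """Center (x, y) of an (x1, y1, x2, y2) box."""
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)


def process_700_step(detection_result, get_box_sets, pick_tracking_box,
                     trajectory, is_abnormal, alarm):
    """Run one tracking round; return True if an alarm was raised.

    trajectory is a list of centers that this function extends in place.
    """
    first, second, third = get_box_sets(detection_result)      # operation 720
    tracking_box = pick_tracking_box(first, second, third)     # operation 730
    trajectory.append(box_center(tracking_box))                # operation 740
    if is_abnormal(trajectory):                                 # operation 750
        alarm()                                                 # operation 760
        return True
    return False   # the caller may repeat from operation 710
```

A caller would invoke this once per new target detection result, keeping the same `trajectory` list between calls for a given object.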

It should be noted that the description of each process is merely for illustration and description, without limiting the scope of the present disclosure. For those skilled in the art, the process may be modified and varied under the guidance of the present disclosure. However, these modifications and variations are still within the scope of the present disclosure. For example, a storage operation, etc. may be added.

Having thus described the basic concepts, it may be rather apparent to those skilled in the art after reading this detailed disclosure that the foregoing detailed disclosure is intended to be presented by way of example only and is not limiting. Various alterations, improvements, and modifications may occur and are intended for those skilled in the art, though not expressly stated herein. These alterations, improvements, and modifications are intended to be suggested by this disclosure, and are within the spirit and scope of the exemplary embodiments of this disclosure.

Moreover, certain terminology has been used to describe embodiments of the present disclosure. For example, the terms “one embodiment,” “an embodiment,” and/or “some embodiments” mean that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. Therefore, it is emphasized and should be appreciated that two or more references to “an embodiment” or “one embodiment” or “an alternative embodiment” in various portions of this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined as suitable in one or more embodiments of the present disclosure.

Further, it will be appreciated by one skilled in the art that aspects of the present disclosure may be illustrated and described herein in any of a number of patentable classes or contexts, including any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof. Accordingly, aspects of the present disclosure may be implemented entirely in hardware, entirely in software (including firmware, resident software, micro-code, etc.), or in an implementation combining software and hardware, all of which may generally be referred to herein as a “unit,” “module,” or “system.” Furthermore, aspects of the present disclosure may take the form of a computer program product embodied in one or more computer readable media having computer-readable program code embodied thereon.

A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including electromagnetic, optical, or the like, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that may communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable signal medium may be transmitted using any appropriate medium, including wireless, wireline, optical fiber cable, RF, or the like, or any suitable combination of the foregoing.

Computer program code for carrying out operations for aspects of the present disclosure may be written in a combination of one or more programming languages, including an object oriented programming language such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET, Python or the like, conventional procedural programming languages, such as the "C" programming language, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP, ABAP, dynamic programming languages such as Python, Ruby, and Groovy, or other programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider) or in a cloud computing environment or offered as a service such as a Software as a Service (SaaS).

Furthermore, the recited order of processing elements or sequences, or the use of numbers, letters, or other designations thereof, are not intended to limit the claimed processes and methods to any order except as may be specified in the claims. Although the above disclosure discusses through various examples what is currently considered to be a variety of useful embodiments of the disclosure, it is to be understood that such detail is solely for that purpose, and that the appended claims are not limited to the disclosed embodiments, but, on the contrary, are intended to cover modifications and equivalent arrangements that are within the spirit and scope of the disclosed embodiments. For example, although the implementation of various components described above may be embodied in a hardware device, it may also be implemented as a software only solution, e.g., an installation on an existing server or mobile device.

Similarly, it should be appreciated that in the foregoing description of embodiments of the present disclosure, various features are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various embodiments. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed subject matter requires more features than are expressly recited in each claim. Rather, claimed subject matter may lie in less than all features of a single foregoing disclosed embodiment.


1. A method for target detection, comprising:

obtaining a visible light image;

obtaining a target detection result by performing a target detection on the visible light image based on a target detection model; wherein

the performing a target detection includes detecting one or more objects in the visible light image according to one or more detection boxes each of which corresponds to one of convolution layers in the target detection model, the one or more detection boxes being determined according to a predetermined manner.
2. The method of claim 1, wherein the detection box is determined according to a first process, the first process including:

obtaining a plurality of candidate detection boxes; and

determining the one or more detection boxes by adjusting a portion of the plurality of candidate detection boxes according to the predetermined manner based on sizes of feature maps each of which corresponds to one of the convolution layers in the target detection model.
3. The method of claim 1, wherein the target detection model is obtained according to a second process, the second process including:

obtaining a plurality of training samples, each training sample among the plurality of training samples including a sample image and a label corresponding to the sample image; and

obtaining the target detection model by training an initial target detection model based on the plurality of training samples; wherein

the training an initial target detection model includes:

obtaining a plurality of candidate detection boxes;

determining one or more sample detection boxes by adjusting a portion of the plurality of candidate detection boxes according to the predetermined manner based on sizes of feature maps each of which corresponds to one of the convolution layers in the target detection model; and

modifying an identification ratio between positive and negative samples of the one or more sample detection boxes based on a loss function to reduce a difference between a detection result of the one or more detection boxes and the label.
4. The method of claim 1, wherein the performing a target detection on the visible light image based on a target detection model includes:

obtaining a first image by processing the visible light image based on an image processing model; wherein the image processing model is configured to enhance the visible light image; and

performing the target detection by inputting the first image into the target detection model.
5. The method of claim 1, wherein the performing a target detection on the visible light image based on a target detection model includes:

obtaining a second image by processing the visible light image based on an image enhancement algorithm; and

performing the target detection by inputting the second image into the target detection model.
6. The method of claim 1, wherein the method further includes:

obtaining an infrared image corresponding to the visible light image;

determining, based on a first detection result of the visible light image and the infrared image, whether a region corresponding to the first detection result of the visible light image satisfies a condition;

in response to that the region corresponding to the first detection result of the visible light image satisfies the condition, obtaining a second detection result of the infrared image by performing the target detection on the infrared image; and

determining the target detection result based on the first detection result of the visible light image and the second detection result of the infrared image.
7. The method of claim 6, wherein the determining, based on a first detection result of the visible light image and the infrared image, whether a region corresponding to the first detection result of the visible light image satisfies a preset condition includes:

matching the first detection result of the visible light image to the infrared image;

determining whether a temperature of the region is changed in the matched infrared image; and

in response to that the temperature of the region is changed in the matched infrared image, determining that the region satisfies the condition.
8. The method of claim 1, wherein the method further includes:

determining, based on the target detection result, a target tracking box that identifies an object in the target detection result; and

tracking the object in the target detection result based on the target tracking box.
9. The method of claim 8, wherein the determining, based on the target detection result, a target tracking box that identifies an object in the target detection result includes:

obtaining, based on the target detection result, a detection box set that identifies the object in the target detection result; wherein the detection box set includes a first detection box set, a second detection box set, and a third detection box set, the first detection box set being marked with a penalty coefficient, the second detection box set being marked with no penalty coefficient;

obtaining a target detection box set by screening the detection box set; and

determining the target tracking box based on feature information of one or more detection boxes in the target detection box set.
10. The method of claim 9, wherein the obtaining a target detection box set by screening the detection box set includes:

obtaining one or more confidence levels each of which corresponds to one of one or more detection boxes in the detection box set; and

determining the target detection box set by screening, based on the one or more confidence levels, the one or more detection boxes in the detection box set.
11. The method of claim 9, wherein the determining the target tracking box based on feature information of one or more detection boxes in the target detection box set includes:

obtaining the one or more detection boxes in the target detection box set and feature information of the target detection result;

determining one or more similarity degrees based on the feature information of the one or more detection boxes in the target detection box set and the feature information of the target detection result; and

determining the target tracking box based on the one or more similarity degrees and the one or more confidence levels corresponding to the one or more detection boxes in the detection box set.
12. A system for target detection, comprising:

at least one storage device including a set of instructions; and

at least one processor in communication with the at least one storage device, wherein when executing the set of instructions, the at least one processor is directed to perform operations including:

obtaining a visible light image;

obtaining a target detection result by performing a target detection on the visible light image based on a target detection model; wherein

the performing a target detection includes detecting one or more objects in the visible light image according to one or more detection boxes each of which corresponds to one of convolution layers in the target detection model, the one or more detection boxes being determined according to a predetermined manner.
13. The system of claim 12, wherein the detection box is determined according to a first process, the first process including:

obtaining a plurality of candidate detection boxes; and

determining the one or more detection boxes by adjusting a portion of the plurality of candidate detection boxes according to the predetermined manner based on sizes of feature maps each of which corresponds to one of the convolution layers in the target detection model.
14. The system of claim 12, wherein the target detection model is obtained according to a second process, the second process including:

obtaining a plurality of training samples, each training sample among the plurality of training samples including a sample image and a label corresponding to the sample image; and

obtaining the target detection model by training an initial target detection model based on the plurality of training samples; wherein

the training an initial target detection model includes:

obtaining a plurality of candidate detection boxes;

determining one or more sample detection boxes by adjusting a portion of the plurality of candidate detection boxes according to the predetermined manner based on sizes of feature maps each of which corresponds to one of the convolution layers in the target detection model; and

modifying an identification ratio between positive and negative samples of the one or more sample detection boxes based on a loss function to reduce a difference between a detection result of the one or more detection boxes and the label.
15. The system of claim 12, wherein to perform a target detection on the visible light image based on a target detection model, the at least one processor is directed to perform operations including:

obtaining a first image by processing the visible light image based on an image processing model; wherein the image processing model is configured to enhance the visible light image; and

performing the target detection by inputting the first image into the target detection model.
16. The system of claim 12, wherein to perform a target detection on the visible light image based on a target detection model, the at least one processor is directed to perform operations including:

obtaining a second image by processing the visible light image based on an image enhancement algorithm; and

performing the target detection by inputting the second image into the target detection model.
17. The system of claim 12, wherein the operations further include:

obtaining an infrared image corresponding to the visible light image;

determining, based on a first detection result of the visible light image and the infrared image, whether a region corresponding to the first detection result of the visible light image satisfies a condition;

in response to that the region corresponding to the first detection result of the visible light image satisfies the condition, obtaining a second detection result of the infrared image by performing the target detection on the infrared image; and

determining the target detection result based on the first detection result of the visible light image and the second detection result of the infrared image.
18. The system of claim 17, wherein to determine, based on a first detection result of the visible light image and the infrared image, whether a region corresponding to the first detection result of the visible light image satisfies a preset condition, the at least one processor is directed to perform operations including:

matching the first detection result of the visible light image to the infrared image;

determining whether a temperature of the region is changed in the matched infrared image; and

in response to that the temperature of the region is changed in the matched infrared image, determining that the region satisfies the condition.
The system of claim 12, wherein the operations further include:

determining, based on the target detection result, a target tracking box that identifies an object in the target detection result; and

tracking the object in the target detection result based on the target tracking box.
The system of claim 19, wherein to determine, based on the target detection result, a target tracking box that identifies an object in the target detection result, the at least one processor is directed to perform operations including:

obtaining, based on the target detection result, a detection box set that identifies the object in the target detection result; wherein the detection box set includes a first detection box set, a second detection box set, and a third detection box set, the first detection box set being marked with a penalty coefficient, the second detection box set being marked with no penalty coefficient;

obtaining a target detection box set by screening the detection box set; and

determining the target tracking box based on feature information of one or more detection boxes in the target detection box set.
The system of claim 20, wherein to obtain a target detection box set by screening the detection box set, the at least one processor is directed to perform operations including:

obtaining one or more confidence levels each of which corresponds to one of one or more detection boxes in the detection box set; and

determining the target detection box set by screening, based on the one or more confidence levels, the one or more detection boxes in the detection box set.
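For illustration only, the screening by confidence level might be sketched as follows. The `DetectionBox` record, the `screen_by_confidence` helper, and the threshold of 0.5 are assumptions made for this sketch; the claim specifies neither a threshold nor a data layout.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DetectionBox:
    """Hypothetical box record: corner coordinates plus a confidence level."""
    x1: float
    y1: float
    x2: float
    y2: float
    confidence: float


def screen_by_confidence(boxes: List[DetectionBox],
                         threshold: float = 0.5) -> List[DetectionBox]:
    """Keep only boxes whose confidence level reaches the assumed threshold."""
    return [box for box in boxes if box.confidence >= threshold]


# Made-up detection box set used only to exercise the helper.
detection_box_set = [
    DetectionBox(10, 10, 50, 60, 0.92),
    DetectionBox(12, 11, 52, 58, 0.40),
    DetectionBox(200, 80, 260, 150, 0.75),
]
target_detection_box_set = screen_by_confidence(detection_box_set)
```

Other screening rules (for example, keeping a fixed count of top-ranked boxes) would fit the same claim language equally well.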
The system of claim 20, wherein to determine the target tracking box based on feature information of one or more detection boxes in the target detection box set, the at least one processor is directed to perform operations including:

obtaining the one or more detection boxes in the target detection box set and feature information of the target detection result;

determining one or more similarity degrees based on the feature information of the one or more detection boxes in the target detection box set and the feature information of the target detection result; and

determining the target tracking box based on the one or more similarity degrees and the one or more confidence levels corresponding to the one or more detection boxes in the detection box set.
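One possible reading of this selection step is sketched below. The claim does not say how a similarity degree is computed or how it is combined with a confidence level; cosine similarity and a simple product score are assumptions of this sketch, and `select_tracking_box` is a hypothetical helper name.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors (assumed similarity degree)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_tracking_box(box_features: List[Sequence[float]],
                        confidences: List[float],
                        target_feature: Sequence[float]) -> int:
    """Return the index of the box with the highest similarity * confidence score."""
    if not box_features:
        raise ValueError("the target detection box set is empty")
    scores = [
        cosine_similarity(feature, target_feature) * confidence
        for feature, confidence in zip(box_features, confidences)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Made-up features and confidence levels for three candidate boxes.
box_features = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
confidences = [0.9, 0.8, 0.95]
target_feature = [0.6, 0.8]
best_index = select_tracking_box(box_features, confidences, target_feature)
```

Here the second box wins because its features align exactly with the target, even though the third box has a higher confidence level.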
A non-transitory computer readable medium, comprising executable instructions that, when executed by at least one processor, direct the at least one processor to perform a method for target detection, the method comprising:

obtaining a visible light image;

obtaining a target detection result by performing a target detection on the visible light image based on a target detection model; wherein

the performing a target detection includes detecting one or more objects in the visible light image according to one or more detection boxes each of which corresponds to one of convolution layers in the target detection model, the one or more detection boxes being determined according to a predetermined manner.
A method for target detection and tracking, comprising:

obtaining image information of a target monitoring region, wherein the image information includes a target visible light image and a target infrared image of the target monitoring region, and the target infrared image corresponds to the target visible light image;

obtaining a first target detection result of the target visible light image by performing a first target detection on the target visible light image;

obtaining a second target detection result of the target infrared image by performing a second target detection on the target infrared image in response to that the target infrared image satisfies a first condition;

obtaining a third target detection result of the target monitoring region based on the first target detection result and the second target detection result, wherein the third target detection result includes position information and type information of an object in the target monitoring region; and

tracking the object in the third target detection result.
The method of claim 24, wherein the performing a first target detection on the target visible light image includes:

obtaining an enhancement region image by performing an enhancement operation on the target visible light image, and obtaining a first target region image by performing a resolution adjustment operation on the enhancement region image;

obtaining a second target region image output by a target image processing model by inputting the target visible light image into the target image processing model;

obtaining a first detection result and a second detection result by performing the first target detection on the first target region image and the second target region image, respectively; wherein each of the first detection result and the second detection result includes a first type and coordinates of the object in the target monitoring region; and

determining the first target detection result of the target visible light image based on the first detection result and the second detection result.
The method of claim 25, wherein the obtaining a first detection result and a second detection result by performing the first target detection on the first target region image and the second target region image, respectively, includes:

obtaining the first detection result by inputting the first target region image to a target first detection model; wherein the target first detection model is configured to obtain a position of a detection box for identifying the detected object in images input into the target first detection model; and

obtaining the second detection result by inputting the second target region image to the target first detection model.
The method of claim 26, wherein before the inputting the first target region image to a target first detection model, the method includes:

obtaining the target first detection model by training a first detection model; wherein the obtaining the target first detection model by training a first detection model includes:

obtaining a prediction ratio of length to width of each pixel point of the object in the first target region image and the second target region image by performing a cluster operation on a ratio of length to width of a target label of each of the images input into the target first detection model;

determining a range of a prediction box on each convolution layer based on a scale formula and the prediction ratio of length to width of each pixel point of the object, wherein the prediction box is configured to determine the position of the detection box for identifying the detected object; and

modifying an identification ratio between positive and negative samples so that the prediction box corresponds to the object, and determining the first detection model after the prediction box is modified as the target first detection model.
The method of claim 27, wherein the scale formula includes:

$$S_k = S_{\min} + \frac{S_{\max} - S_{\min}}{W_{\max}} \times \frac{W_{m+1-k}}{W_1}, \quad k \in [1, m],$$

where m is a count of the convolution layers, W is a length or width of the object in the image, $S_{\max} = 0.95$, $S_{\min} = 0.05$, $W_{\max} = 39$, $W_1 = 1$, $S_{\max}$ is a maximum area of a preset target label, $S_{\min}$ is a minimum area of the preset target label, and $S_k$ is an area of the detected prediction box.
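As a worked illustration, the scale formula can be evaluated numerically. The sketch below reads the formula as S_k = S_min + ((S_max - S_min) / W_max) × (W_(m+1-k) / W_1), which is an interpretation of the subscripts, and it assumes that W_1 through W_m are supplied as a list. The list values and the `prediction_box_scales` helper are hypothetical; only the constants 0.95, 0.05, 39, and 1 come from the claim.

```python
from typing import List, Sequence

S_MIN = 0.05   # minimum area of the preset target label (from the claim)
S_MAX = 0.95   # maximum area of the preset target label (from the claim)
W_MAX = 39     # from the claim


def prediction_box_scales(widths: Sequence[float],
                          s_min: float = S_MIN,
                          s_max: float = S_MAX,
                          w_max: float = W_MAX) -> List[float]:
    """Return [S_1, ..., S_m], where widths[0] is W_1 and widths[-1] is W_m."""
    m = len(widths)
    w_1 = widths[0]
    scales = []
    for k in range(1, m + 1):
        w = widths[m - k]  # W_(m+1-k) expressed as a 0-based index
        scales.append(s_min + (s_max - s_min) / w_max * (w / w_1))
    return scales


# Hypothetical per-layer sizes with W_1 = 1 and W_m = W_max = 39.
example_widths = [1, 3, 5, 10, 19, 39]
scales = prediction_box_scales(example_widths)
```

Under these assumed sizes, k = 1 uses W_m = 39 and yields S_max, while k = m uses W_1 = 1 and yields a value slightly above S_min.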
The method of claim 25, wherein the determining the first target detection result of the target visible light image based on the first detection result and the second detection result includes:

when the first detection result represents a first group of object detection boxes detected in the first target region image, and the second detection result represents a second group of object detection boxes detected in the second target region image, obtaining a third group of object detection boxes by combining the first group of object detection boxes and the second group of object detection boxes, wherein the first target detection result includes the third group of object detection boxes.
The method of claim 25, wherein before the obtaining image information of a target monitoring region, the method further includes:

obtaining a training sample image set and a sample enhancement image set, wherein each sample enhancement image in the sample enhancement image set corresponds to a training sample image in the training sample image set and is obtained by performing an image enhancement on a region where each sample object in the corresponding training sample image is located;

training an image processing model using the training sample image set, wherein the image processing model includes a plurality of convolutional layers, parameters in the image processing model are initialized by randomly sampling from a Gaussian distribution with a mean of 0 and a standard deviation of 1, the image processing model is configured to identify the sample object in the training sample image in the training sample image set, perform the image enhancement on the region where the sample object is located, and output the image obtained by performing the image enhancement on the region where the sample object is located; and

when a value of a loss function between the image output by the image processing model and the sample enhancement image in the sample enhancement image set satisfies a second condition, determining the image processing model as the target image processing model.
The method of claim 30, wherein before the determining the image processing model as the target image processing model, the method further includes:

determining the value of the loss function between the image output by the image processing model and the sample enhancement image in the sample enhancement image set; and

when the value of the loss function is less than a predetermined threshold, determining that the value of the loss function satisfies the second condition.
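The initialization and the stopping test described in the two preceding claims might be sketched as follows. The claims name a Gaussian initialization with a mean of 0 and a standard deviation of 1 and a loss value below a predetermined threshold, but they do not name the loss function; mean squared error, the threshold of 0.01, and all helper names are assumptions of this sketch.

```python
import random
from typing import List, Sequence


def init_parameters(count: int, seed: int = 0) -> List[float]:
    """Sample parameters from a Gaussian with mean 0 and standard deviation 1."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(count)]


def mse_loss(output: Sequence[float], reference: Sequence[float]) -> float:
    """Mean squared error between model output and the sample enhancement image
    (the choice of loss is an assumption)."""
    if len(output) != len(reference) or not output:
        raise ValueError("images must be non-empty and of equal size")
    return sum((o - r) ** 2 for o, r in zip(output, reference)) / len(output)


def satisfies_second_condition(loss_value: float, threshold: float) -> bool:
    """The second condition holds when the loss is below the threshold."""
    return loss_value < threshold


# Hypothetical 3x3 convolution kernel and flattened pixel values.
kernel = init_parameters(9)
model_output = [0.20, 0.50, 0.90]
sample_enhancement = [0.25, 0.50, 0.85]
loss_value = mse_loss(model_output, sample_enhancement)
done = satisfies_second_condition(loss_value, threshold=0.01)
```

In a real training loop the check would run after each update, and training would stop once it returns true.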
The method of claim 24, wherein the obtaining a second target detection result of the target infrared image by performing a second target detection on the target infrared image in response to that the target infrared image satisfies a first condition includes:

obtaining a sixth region image in the target infrared image based on the first target detection result, wherein the sixth region image is configured to represent a temperature of a target region in the target infrared image, and the target region is configured to represent a region corresponding to the first target detection result;

determining whether a temperature of the object in the target region is changed based on the sixth region image; and

in response to determining that the temperature of the object in the target region is changed, determining that the target infrared image satisfies the first condition, and obtaining the second target detection result by performing the second target detection on the sixth region image.
The method of claim 32, wherein the determining whether a temperature of the object in the target region is changed based on the sixth region image includes:

when a temperature of the target region in the target infrared image represented by the sixth region image is different from an initial temperature of the target region, determining that the temperature of the object in the target region is changed.
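A minimal sketch of the temperature comparison follows. It assumes the infrared image is available as a grid of per-pixel temperatures and that the region is summarized by its mean; the tolerance of 0.5 is also an assumption, since the claim only requires the temperature to differ from the initial temperature. The helper names are hypothetical.

```python
from typing import List, Tuple

Region = Tuple[int, int, int, int]  # (x1, y1, x2, y2), end-exclusive


def region_mean_temperature(infrared: List[List[float]], box: Region) -> float:
    """Mean temperature of the pixels inside the region (assumed summary)."""
    x1, y1, x2, y2 = box
    values = [infrared[row][col]
              for row in range(y1, y2)
              for col in range(x1, x2)]
    if not values:
        raise ValueError("the region is empty")
    return sum(values) / len(values)


def temperature_changed(current: float, initial: float,
                        tolerance: float = 0.5) -> bool:
    """True when the region temperature differs from the initial temperature
    by more than the assumed tolerance."""
    return abs(current - initial) > tolerance


# Made-up 4x4 infrared frame: 20.0 background with a 36.0 patch in the centre.
frame = [[20.0] * 4 for _ in range(4)]
for row in (1, 2):
    for col in (1, 2):
        frame[row][col] = 36.0

target_region = (1, 1, 3, 3)
current_temperature = region_mean_temperature(frame, target_region)
changed = temperature_changed(current_temperature, initial=20.0)
```

A tolerance avoids treating sensor noise as a change; setting it to zero reproduces a strict "different from" comparison.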
The method of claim 32, wherein the obtaining the second target detection result by performing the second target detection on the sixth region image includes:

obtaining a second type and coordinates of the object by performing an object type detection on the object the temperature of which is changed in the sixth region image, wherein the second target detection result includes the second type and the coordinates of the object.
The method of claim 24, wherein the obtaining a third target detection result of the target monitoring region based on the first target detection result and the second target detection result includes:

when the first target detection result includes the third group of object detection boxes and the type of the object in the third group of object detection boxes, and a first object detection box in the third group of object detection boxes corresponds to the object the temperature of which is changed, determining whether the first type of the object in the first group of object detection boxes is the same as the second type of the object the temperature of which is changed; wherein the third group of object detection boxes is obtained by performing the first target detection on the target visible light image, and the second type is obtained by performing the second target detection on the target infrared image;

in response to that the first type is the same as the second type, determining a first matching result, wherein the first matching result is configured to indicate that the first object detection box matches the second target detection result, and the third target detection result includes the first matching result; or

in response to that the first type is different from the second type, determining a second matching result, wherein the second matching result is configured to indicate that the first object detection box does not match the second target detection result, and the third target detection result includes the second matching result.
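The type comparison in this claim reduces to an equality test, which might be sketched as below. The `MatchingResult` record, the `match_types` helper, and the string type labels are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MatchingResult:
    """Hypothetical record of whether a visible-light box matches the infrared result."""
    box_index: int
    matched: bool


def match_types(box_index: int, first_type: str, second_type: str) -> MatchingResult:
    """First matching result when the types agree, second matching result otherwise."""
    return MatchingResult(box_index=box_index, matched=(first_type == second_type))


# Made-up type labels from the visible light and infrared detections.
first_matching_result = match_types(0, "person", "person")
second_matching_result = match_types(1, "person", "vehicle")
```

Either result would then be carried in the third target detection result.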
The method of claim 24, wherein the tracking the object in the third target detection result includes:

obtaining the third target detection result of the target monitoring region;

obtaining a target detection box configured to identify the object in the target monitoring region based on the third target detection result;

tracking the object in the target monitoring region based on the target detection box, and recording a motion trajectory of the object;

determining whether the motion trajectory of the object includes an anomaly based on the motion trajectory of the object; and

prompting the anomaly when the motion trajectory of the object includes the anomaly.
The method of claim 36, wherein the obtaining a target detection box configured to identify the object in the target monitoring region based on the third target detection result includes:

obtaining, based on the third target detection result, a detection box set that identifies the object, wherein the detection box set includes a first detection box set, a second detection box set, and a third detection box set, the first detection box set being marked with a penalty coefficient, the second detection box set being marked with no penalty coefficient;

obtaining a target detection box by performing a target operation on the detection box set;

determining a final confidence level of the target detection box by performing feature extraction on the target detection box;

designating the target detection box with a highest final confidence level as a target tracking box; and

tracking the object in the target monitoring region based on the target tracking box.
The method of claim 37, wherein the obtaining a target detection box by performing a target operation on the detection box set includes:

arranging the first detection box set, the second detection box set, and the third detection box set in descending order of an initial confidence level;

selecting a specified count of detection boxes that are arranged at a top in the first detection box set, the second detection box set, and the third detection box set, and removing repeated detection boxes therein;

obtaining a union of remaining selected detection boxes, and designating the union as a fourth detection box set; and

obtaining the target detection box by screening the fourth detection box set based on a predetermined rule.
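The four steps above can be sketched as follows, under stated assumptions: a box is a tuple of coordinates plus an initial confidence level, two boxes count as repeated when their coordinates are identical, and the predetermined rule is a minimum confidence of 0.5. None of these details is fixed by the claim, and the helper names are hypothetical.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int, float]  # (x1, y1, x2, y2, initial confidence)


def build_fourth_set(box_sets: Sequence[Sequence[Box]], top_n: int) -> List[Box]:
    """Rank each set by initial confidence, keep the top_n boxes of each,
    drop repeated boxes, and return the union."""
    seen = set()
    fourth_set: List[Box] = []
    for box_set in box_sets:
        ranked = sorted(box_set, key=lambda box: box[4], reverse=True)
        for box in ranked[:top_n]:
            coords = box[:4]
            if coords not in seen:  # assumed duplicate test: same coordinates
                seen.add(coords)
                fourth_set.append(box)
    return fourth_set


def screen_fourth_set(fourth_set: Sequence[Box],
                      min_confidence: float = 0.5) -> List[Box]:
    """Assumed predetermined rule: keep boxes at or above a minimum confidence."""
    return [box for box in fourth_set if box[4] >= min_confidence]


# Made-up first, second, and third detection box sets.
first_set = [(0, 0, 10, 10, 0.90), (5, 5, 15, 15, 0.30), (1, 1, 9, 9, 0.60)]
second_set = [(0, 0, 10, 10, 0.85), (20, 20, 30, 30, 0.70)]
third_set = [(40, 40, 50, 50, 0.20), (20, 20, 30, 30, 0.65)]

fourth_set = build_fourth_set([first_set, second_set, third_set], top_n=2)
target_boxes = screen_fourth_set(fourth_set)
```

When a repeated box appears in several sets, this sketch keeps the copy from the earliest set; keeping the highest-confidence copy would be an equally plausible choice.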
A device for target detection and tracking, comprising:

an image acquisition module, configured to obtain image information of a target monitoring region, wherein the image information includes a target visible light image and a target infrared image of the target monitoring region, and the target infrared image corresponds to the target visible light image;

a first target processing module, configured to obtain a first target detection result of the target visible light image by performing a first target detection on the target visible light image;

a second target processing module, configured to obtain a second target detection result of the target infrared image by performing a second target detection on the target infrared image in response to that the target infrared image satisfies a first condition; and

a third target processing module, configured to obtain a third target detection result of the target monitoring region based on the first target detection result and the second target detection result, wherein the third target detection result includes position information and type information of an object in the target monitoring region.
A storage medium, the storage medium storing computer instructions that, when the computer instructions are read by a computer, direct the computer to perform the method for target detection and tracking of any one of claims 24-38.
An electronic device, wherein the device includes at least one processor and at least one storage;

the at least one storage is configured to store computer instructions; and

the at least one processor is configured to execute at least a portion of the computer instructions to implement the method for target detection and tracking of any one of claims 24-38.