CN111801689A - System for real-time object detection and recognition using image and size features - Google Patents

System for real-time object detection and recognition using image and size features

Info

Publication number
CN111801689A
CN111801689A CN201980016839.5A
Authority
CN
China
Prior art keywords
confidence score
cnn
modified
target
input image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201980016839.5A
Other languages
Chinese (zh)
Inventor
陈洋
D·科斯拉
R·M·乌伦布罗克
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
HRL Laboratories LLC
Original Assignee
HRL Laboratories LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by HRL Laboratories LLC
Publication of CN111801689A
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133Distances to prototypes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133Distances to prototypes
    • G06F18/24143Distances to neighbourhood prototypes, e.g. restricted Coulomb energy networks [RCEN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/254Fusion techniques of classification results, e.g. of results related to same input data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/10Image acquisition
    • G06V10/12Details of acquisition arrangements; Constructional details thereof
    • G06V10/14Optical characteristics of the device performing the acquisition or on the illumination arrangements
    • G06V10/143Sensing or illuminating at different wavelengths
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/10Image acquisition
    • G06V10/12Details of acquisition arrangements; Constructional details thereof
    • G06V10/14Optical characteristics of the device performing the acquisition or on the illumination arrangements
    • G06V10/145Illumination specially adapted for pattern recognition, e.g. using gratings
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/50Extraction of image or video features by performing operations within image blocks; by using histograms, e.g. histogram of oriented gradients [HoG]; by summing image-intensity values; Projection analysis
    • G06V10/507Summing image-intensity values; Histogram projection analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • G06V20/13Satellite images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects

Abstract

An object recognition system is described. Using an Integrated Channel Feature (ICF) detector, the system extracts a candidate target region from an input image of a scene surrounding a platform, the candidate target region having an associated raw confidence score representing a candidate object. A modified confidence score is generated based on the detected position and height of the candidate object. Using a trained Convolutional Neural Network (CNN) classifier, the candidate target region is classified based on the modified confidence score, resulting in a classified object. The classified objects are tracked using a multi-target tracker to ultimately classify each classified object as either a target or a non-target. If a classified object is a target, a device is controlled based on the target.

Description

System for real-time object detection and recognition using image and size features
Government rights
This invention was made with government support under U.S. contract number W15P7T-10-D413. The government has certain rights in the invention.
Cross Reference to Related Applications
This application is a continuation-in-part of U.S. application No. 15/883,822, filed January 30, 2018, which is a non-provisional application of U.S. provisional application No. 62/479,204, filed March 30, 2017, the entire contents of which are incorporated herein by reference.
This application is also a non-provisional patent application of U.S. provisional application No. 62/659,100, filed April 17, 2018, which is incorporated herein by reference in its entirety.
Background
(1) Field of the invention
The present invention relates to object detection systems, and more particularly to object detection and recognition systems using image and size features.
(2) Description of the related Art
Object detection and recognition systems are commonly used in autonomous vehicles and reconnaissance systems to quickly and automatically detect and identify objects within a field of view. Conventional object detection and recognition systems attempt to identify objects based on their image characteristics alone. While such systems can function, they are limited in that recognition cannot be verified against dimensional characteristics.
Other attempts have been made to determine dimensions using estimates of the ground plane. See, for example, Dragon R. and Van Gool L., "Ground plane estimation using a hidden Markov model," in Proceedings of the 27th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014), pp. 4026-4033, June 23-28, 2014, Columbus, Ohio, the entire contents of which are incorporated herein by reference. However, this approach often fails when obstructions, shadows, or other problems prevent a clear view of the open ground (e.g., a vehicle traveling in a forested area).
Other prior art uses camera calibration to determine object height from an image. See, e.g., G. Fuhr, C. R. Jung, and M. B. de Paula, "On the Use of Calibration for Pedestrian Detection in On-Board Vehicular Cameras," in Proceedings of the 2016 29th SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI), São Paulo, 2016, pp. 80-87, the entire contents of which are incorporated herein by reference. This calibration method fails when no camera calibration is available.
Accordingly, there is a continuing need for an object detection and recognition system that learns from both image and position data, is robust across data types, and is accurate over a variety of target sizes and positions.
Disclosure of Invention
The present disclosure provides an object recognition system. In various embodiments, the system includes a memory and one or more processors. The memory includes executable instructions encoded thereon such that, when executed, the one or more processors perform operations as described herein. For example, using an Integrated Channel Feature (ICF) detector, the system extracts a candidate target region from an input image of a scene surrounding the platform, the candidate target region having an associated raw confidence score representing a candidate object. A modified confidence score is generated based on the detected position and height of the candidate object. Using a trained Convolutional Neural Network (CNN) classifier, the candidate target region is classified based on the modified confidence score, resulting in a classified object. The classified objects are tracked using a multi-target tracker to ultimately classify each classified object as either a target or a non-target. If a classified object is a target, a device may be controlled based on the target.
In another aspect, the ICF detector calculates channel feature vectors for image frames of the video, and wherein, for each image frame, the ICF classifier is applied at a plurality of image scales and across the entire image frame.
In yet another aspect, the CNN classifier is implemented as an interactive software module including a CNN interface and a CNN server, wherein the CNN interface displays results received from the CNN server.
In another aspect, the trained CNN is used for both electro-optical (EO) and Infrared (IR) image classification.
In yet another aspect, the input image is divided into a plurality of horizontal bands, and ground truth objects in the input image are divided into the same number of groups based on which band their positions fall within, the objects in each group being used to estimate a mean and a standard deviation of an object height distribution in the input image.
In another aspect, the process of generating the modified confidence score uses a weighted Gaussian according to the following equations:

wf = exp(−(h − m)² / (2 × (N × σ)²))

and

modified confidence score = original confidence score × wf,

where h denotes the height of the candidate object in the input image, m and σ denote the mean and standard deviation, respectively, of the object height distribution for the corresponding horizontal band of the input image, exp(.) denotes an exponential function, N is a multiplier, and × denotes multiplication.
In yet another aspect, the process of generating the modified confidence score uses a weighted gate according to the following equations:

wf = 1 if |h − m| ≤ N × σ, and wf = 0 otherwise,

and

modified confidence score = original confidence score × wf,

where h denotes the height of the candidate object in the input image, m and σ denote the mean and standard deviation, respectively, of the object height distribution for the corresponding horizontal band of the input image, N is a multiplier, and × denotes multiplication.
In yet another aspect, the one or more processors perform the following: classifying the candidate target region based on the modified confidence score using a modified convolutional network (CNN-2) classifier, resulting in a modified classification object; and fusing the modified classification objects with classification objects from the trained CNN classifier for processing by the multi-target tracker.
Finally, the present invention also includes a computer program product and a computer implemented method. The computer program product includes computer-readable instructions stored on a non-transitory computer-readable medium that are executable by a computer having one or more processors such that, when the instructions are executed, the one or more processors perform the operations listed herein. Alternatively, a computer-implemented method includes acts that cause a computer to execute such instructions and perform the resulting operations.
Drawings
The objects, features and advantages of the present invention will be readily understood from the following detailed description of the various aspects of the invention with reference to the following drawings, in which:
FIG. 1 is a block diagram illustrating components of a system in accordance with various embodiments of the invention;
FIG. 2 is an illustration of a computer program product embodying an aspect of the present invention;
FIG. 3 is a system block diagram according to various embodiments of the invention;
FIG. 4A is a block diagram of a system according to various embodiments of the invention;
FIG. 4B is a system block diagram illustrating a modified convolutional network classifier in accordance with various embodiments of the present invention;
FIG. 5 is an image illustrating the division of an image frame into N horizontal strips, where ground truth objects are grouped into N bins;
FIG. 6A is an illustration showing an example height distribution of objects in 88 training sequences for a side-facing sensor;
FIG. 6B is an illustration showing an example height distribution of objects in 88 training sequences for a front facing sensor;
FIG. 7 is a graph illustrating a comparison of weighted and unweighted detection scores for 30 test sequences;
FIG. 8 is a graph comparing the post-CNN (second-stage) Receiver Operating Characteristic (ROC) curves of the Gaussian and gated detection-score weighting methods against the baseline ROC for unweighted scores;
FIG. 9A is an illustration showing an example height distribution of ground truth MAN objects (e.g., people, dismounts) in 88 training sequences for a side-facing sensor;
FIG. 9B is an illustration showing an example height distribution of ground truth MAN objects in 88 training sequences for a front-facing sensor;
FIG. 10 is a graph comparing the weighted and unweighted detection-score ROCs for 30 EO test sequences, for both pre-CNN (first stage + size filtering) and post-CNN (second stage) results; and
FIG. 11 is a block diagram illustrating control of a device according to various embodiments.
Detailed Description
The present invention relates to object detection systems, and more particularly to object detection and recognition systems using image and size features. The following description is presented to enable any person skilled in the art to make and use the invention and is incorporated in the context of a particular application. Various modifications and uses in different applications will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to a wide range of aspects. Thus, the present invention is not intended to be limited to the aspects shown, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
In the following detailed description, numerous specific details are set forth in order to provide a more thorough understanding of the invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced without limitation to these specific details. In other instances, well-known structures and devices are shown in block diagram form, rather than in detail, in order to avoid obscuring the present invention.
The reader's attention is directed to all papers and documents which are filed concurrently with this specification and which are open to public inspection with this specification, and the contents of all such papers and documents are incorporated herein by reference. All the features disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise. Thus, unless expressly stated otherwise, each feature disclosed is one example only of a generic series of equivalent or similar features.
Furthermore, any element in a claim that does not explicitly state "means for" performing a specified function, or "step for" performing a specified function, is not to be interpreted as a "means" or "step" clause as specified in 35 U.S.C. Section 112, Paragraph 6. In particular, the use of "step of" or "act of" in the claims herein is not intended to invoke the provisions of 35 U.S.C. Section 112, Paragraph 6.
Before describing the present invention in detail, an outline of its principal aspects is first provided. Next, an introduction gives the reader a general understanding of the invention. Specific details of various embodiments of the invention are then provided to give an understanding of the specific aspects. Next, example implementations with experimental results are provided. Finally, example implementations illustrating practical applications of the system are described.
(1) Main aspects of the invention
Various embodiments of the present invention include three "principal" aspects. The first is a system for object detection and recognition. The system is typically in the form of a computer system operating software or in the form of a "hard-coded" instruction set. This system may be incorporated into a wide variety of devices that provide different functionalities. The second principal aspect is a method, typically in the form of software, operated using a data processing system (computer). The third principal aspect is a computer program product. The computer program product generally represents computer-readable instructions stored on a non-transitory computer-readable medium such as an optical storage device, e.g., a compact disc (CD) or digital versatile disc (DVD), or a magnetic storage device such as a floppy disk or magnetic tape. Other, non-limiting examples of computer-readable media include hard disks, read-only memory (ROM), and flash-type memories. These aspects will be described in more detail below.
A block diagram depicting an example of the system of the present invention (i.e., computer system 100) is provided in fig. 1. The computer system 100 is configured to perform calculations, processes, operations, and/or functions associated with a program or algorithm. In one aspect, certain processes and steps discussed herein are implemented as a series of instructions (e.g., a software program) that reside within a computer-readable storage unit and are executed by one or more processors of the computer system 100. When executed, the instructions cause the computer system 100 to perform particular actions and exhibit particular behavior, such as those described herein.
Computer system 100 may include an address/data bus 102 configured to communicate information. In addition, one or more data processing units (such as one or more processors 104) are coupled to the address/data bus 102. The processor 104 is configured to process information and instructions. In an aspect, the processor 104 is a microprocessor. Alternatively, the processor 104 may be a different type of processor, such as a parallel processor, an Application Specific Integrated Circuit (ASIC), a Programmable Logic Array (PLA), a Complex Programmable Logic Device (CPLD), or a Field Programmable Gate Array (FPGA).
Computer system 100 is configured to utilize one or more data storage units. The computer system 100 may include a volatile memory unit 106 (e.g., random access memory ("RAM"), static RAM, dynamic RAM, etc.) coupled to the address/data bus 102, wherein the volatile memory unit 106 is configured to store information and instructions for the processor 104. The computer system 100 may also include a non-volatile memory unit 108 (e.g., read only memory ("ROM"), programmable ROM ("PROM"), erasable programmable ROM ("EPROM"), electrically erasable programmable ROM "EEPROM," flash memory, etc.) coupled to the address/data bus 102, wherein the non-volatile memory unit 108 is configured to store static information and instructions for the processor 104. Alternatively, the computer system 100 may execute instructions retrieved from an online data storage unit (such as in "cloud" computing). In an aspect, computer system 100 may also include one or more interfaces, such as interface 110, coupled to address/data bus 102. The one or more interfaces are configured to enable the computer system 100 to interface with other electronic devices and computer systems. The communication interfaces implemented by the one or more interfaces may include wired communication techniques (e.g., serial cable, modem, network adapter, etc.) and/or wireless communication techniques (e.g., wireless modem, wireless network adapter, etc.).
In one aspect, the computer system 100 may include an input device 112 coupled to the address/data bus 102, wherein the input device 112 is configured to communicate information and command selections to the processor 100. According to one aspect, the input device 112 is an alphanumeric input device (such as a keyboard) that may include alphanumeric and/or function keys. Alternatively, input device 112 may be an input device other than an alphanumeric input device. In an aspect, the computer system 100 may include a cursor control device 114 coupled with the address/data bus 102, wherein the cursor control device 114 is configured to communicate user input information and/or command selections to the processor 100. In one aspect, cursor control device 114 is implemented using a device such as a mouse, trackball, trackpad, optical tracking device, or touchscreen. Notwithstanding the foregoing, in one aspect, cursor control device 114 is directed and/or activated via input from input device 112 (such as in response to use of particular keys and key sequence commands associated with input device 112). In an alternative aspect, cursor control device 114 is configured to be guided or directed by voice commands.
In an aspect, computer system 100 may also include one or more optional computer-usable data storage devices, such as storage device 116, coupled to the address/data bus 102. Storage device 116 is configured to store information and/or computer-executable instructions. In one aspect, the storage device 116 is a storage device such as a magnetic or optical disk drive (e.g., a hard disk drive ("HDD"), a floppy disk, a compact disk read-only memory ("CD-ROM"), or a digital versatile disk ("DVD")). According to one aspect, a display device 118 is coupled to the address/data bus 102, wherein the display device 118 is configured to display video and/or graphics.
Computer system 100 presented herein is an example computing environment in accordance with an aspect. However, non-limiting examples of computer system 100 are not strictly limited to computer systems. For example, an aspect provides that the computer system 100 represents one type of data processing analysis that may be used in accordance with aspects described herein. Other computing systems may also be implemented. Indeed, the spirit and scope of the present technology is not limited to any single data processing environment. Thus, in one aspect, one or more operations of aspects of the present technology are controlled or implemented using computer-executable instructions (such as program modules) executed by a computer. In one implementation, such program modules include routines, programs, objects, components, and/or data structures that are configured to perform particular tasks or implement particular abstract data types. In addition, an aspect provides for implementing one or more aspects of the technology by utilizing one or more distributed computing environments, such as where tasks are performed by remote processing devices that are linked through a communications network, or such as where multiple program modules are located in both local and remote computer storage media (including memory storage devices).
An illustrative diagram of a computer program product (i.e., a storage device) embodying the present invention is shown in FIG. 2. The computer program product is shown as a floppy disk 200 or an optical disk 202 such as a CD or DVD. However, as previously mentioned, the computer program product generally represents computer readable instructions stored on any compatible non-transitory computer readable medium. The term "instructions" as used in relation to the present invention generally represents a set of operations to be performed on a computer and may represent multiple pieces of an entire program or separate separable software modules. Non-limiting examples of "instructions" include computer program code (source code or object code) and "hard-coded" electronic devices (i.e., computer operations encoded into a computer chip). "instructions" are stored on any non-transitory computer readable medium, such as in the memory of a computer or in floppy disks, CD-ROMs, and flash drives. In either case, the instructions are encoded on a non-transitory computer-readable medium.
(2) Introduction
The present disclosure provides an object detection and recognition system that uses both image and size/location features. The system extends the disclosure of U.S. application No.15/883,822 which uses only image features. The system of the present disclosure is operable to: 1) learning from the image and location data to accurately detect and recognize the target; 2) performing confidence adjustment on the detection result based on the position data; and 3) combine all of the above into an integrated system as a single pipeline. Upon reviewing the system and corresponding performance evaluation described below, it is apparent that the present disclosure provides significant technical improvements to the field and techniques for object detection and recognition.
(3) Details of various embodiments
As shown in fig. 3, the system of the present disclosure improves upon a three-stage cascade classifier for target recognition in EO and IR video from static or mobile platforms. Specific details regarding the three-stage classifier can be found in U.S. application No. 15/883,822. The first stage is an Integrated Channel Features (ICF) detector 300 that receives the video and runs fast detection (e.g., greater than 15 frames per second) to provide high-confidence candidate target regions and scores, the candidates (e.g., "MAN" or human targets, or other objects of interest) being bounding boxes in the video. ICF is based on aggregating "channel features" and training small decision trees using these features. The basic features can be seen as a mapping from the original pixel values (RGB/IR) to more informative features such as oriented gradients, Haar features, regional differences, or simple color-space transforms. The output of the ICF detector 300 is the detected target box positions and associated confidence levels. The ICF detector 300 computes channel feature vectors for the image frames of the video and, for each image frame, an ICF classifier is applied at multiple image scales and across the entire image frame.
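For illustration only, the following sketch computes a simple set of ICF-style channels (color, gradient magnitude, and oriented-gradient bins) for one frame. This is a minimal sketch of the channel-feature idea described above, not the patented detector; the channel set, bin count, and function names are assumptions.

```python
import numpy as np

def compute_channel_features(frame_rgb, num_orient_bins=6):
    """Illustrative ICF-style channels: color, gradient magnitude, and
    oriented-gradient channels (a sketch, not the patented detector)."""
    gray = frame_rgb.astype(np.float32).mean(axis=2)

    # Image gradients via central differences.
    gy, gx = np.gradient(gray)
    magnitude = np.sqrt(gx ** 2 + gy ** 2)
    orientation = np.arctan2(gy, gx) % np.pi  # unsigned orientation in [0, pi)

    # Each orientation bin becomes a channel holding the gradient magnitude
    # of the pixels whose orientation falls in that bin.
    bin_edges = np.linspace(0.0, np.pi, num_orient_bins + 1)
    orient_channels = [
        magnitude * ((orientation >= lo) & (orientation < hi))
        for lo, hi in zip(bin_edges[:-1], bin_edges[1:])
    ]

    # Stack color channels, gradient magnitude, and oriented gradients.
    channels = [frame_rgb[..., c].astype(np.float32) for c in range(3)]
    channels.append(magnitude)
    channels.extend(orient_channels)
    return np.stack(channels, axis=0)  # shape: (3 + 1 + num_orient_bins, H, W)
```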
In contrast to the three-stage classifier described in U.S. application No. 15/883,822, the system of the present disclosure adds a target size filter 302. The target size filter 302 is applied to the output of the first stage to adjust the confidence score based on a comparison of the expected target box size with the detected target box size. The candidate bounding boxes with the modified confidence scores are then fed to the second stage, which is a Convolutional Neural Network (CNN) classifier 304 that outputs the target class, location, and confidence.
In various embodiments, CNN classifier 304 is implemented as an interactive software module that includes a CNN interface and a CNN server (e.g., one or more processors and corresponding memory), where the CNN interface displays results received from the CNN server. The CNN interface acquires the candidate target box information from the ICF detector 300 and extracts an image region from the input video, and hands it to the CNN server for classification. Upon receiving the results from the CNN server, the CNN interface may display the results in real time, and may also record the results into a disk file and provide an output target box for further processing.
The third stage is a multi-target tracker (MTT) 306, which tracks the target boxes from the CNN stage (i.e., CNN classifier 304) to produce the final target classification, location, and confidence score. In an alternative embodiment, the tracker results are fed to a comparator for further processing by the CNN stage.
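The overall cascade can be summarized as a simple per-frame processing loop. The sketch below is a hypothetical orchestration of the stages described above; the detector, size filter, classifier, and tracker objects and their methods are placeholders, not APIs from the disclosure.

```python
def process_frame(frame, icf_detector, size_filter, cnn_classifier, tracker):
    """One pass of the cascade: ICF detection, size-based score adjustment,
    CNN classification, then multi-target tracking (hypothetical sketch)."""
    candidates = []
    for box, raw_score in icf_detector.detect(frame):
        # Target size filter: weight the raw ICF score by how well the
        # detected box height matches the expected height at this image position.
        wf = size_filter.weight(box)
        modified_score = raw_score * wf
        if modified_score <= 0.0:
            continue  # suppressed by the size constraint (wf = 0)
        # Second stage: CNN classification of the candidate image region.
        label, cnn_confidence = cnn_classifier.classify(frame, box)
        candidates.append((box, label, cnn_confidence))
    # Third stage: the multi-target tracker produces the final
    # target / non-target decisions, locations, and confidence scores.
    return tracker.update(candidates)
```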
The idea of imposing an object size constraint is as follows. When the camera observes a dismounted person or other object of interest (e.g., a "MAN" target) on a flat surface, the height of the subject in the image is directly related to the subject's distance from the camera, which in turn is reflected in the position of the subject's feet in the image. Given the image row at which the subject's feet are located, the system can accurately calculate (or predict) the subject's height in the image from the subject's physical height, the intrinsic characteristics of the camera, the height of the camera above the ground, and the camera tilt. By comparing the predicted height with the height of the detection box in the image, the system can assign a confidence to the detection based on the degree of match between the two.
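For illustration of the geometric relation just described (the disclosure instead takes the empirical route explained next), a flat-ground pinhole-camera model predicts the pixel height of an upright subject from the image row of its feet. The function below is a rough sketch under assumed geometry and parameter names, using a small-angle approximation for the final projection.

```python
import math

def predicted_pixel_height(foot_row, image_rows, focal_px,
                           cam_height_m, cam_tilt_rad, subject_height_m):
    """Expected pixel height of an upright subject whose feet appear on
    `foot_row`, under a flat-ground pinhole-camera model (illustrative only)."""
    cy = image_rows / 2.0
    # Viewing angle below the horizontal: camera tilt plus the angle of the
    # foot pixel below the optical axis.
    angle = cam_tilt_rad + math.atan2(foot_row - cy, focal_px)
    if angle <= 0.0:
        return 0.0  # foot row is at or above the horizon; no ground intersection
    ground_range = cam_height_m / math.tan(angle)  # distance to the subject
    # Small-angle approximation: pixel height ~ focal length * height / range.
    return focal_px * subject_height_m / max(ground_range, 1e-6)
```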
Rather than constructing an analytical formula, the system empirically estimates the relationship between the bottom position of the object (e.g., the foot position) and the object's height (i.e., up to the top of the object) in the image. To do so, and as shown in fig. 5, an image frame 500 (e.g., 640 × 480 pixels) is divided into N (e.g., 16) horizontal strips 502, and the ground truth objects are grouped into N bins 504 according to where the bottom (or foot) of the ground truth box falls in the image. From the set of objects in each bin, the system can estimate their height distribution. For example, a normal (Gaussian) distribution with mean and standard deviation (m, σ) is used to represent the distribution, although, as described and illustrated, the distribution is typically conveniently shown via a histogram plot.
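A minimal sketch of the empirical estimation just described: ground truth boxes are binned by the image row of their bottom edge into N horizontal bands, and a per-band mean and standard deviation of box height are computed. Function and variable names are illustrative, not from the disclosure.

```python
import numpy as np

def estimate_band_height_stats(gt_boxes, image_height=480, num_bands=16):
    """Per-band (mean, std) of ground truth box heights, keyed by the band
    containing the box bottom. gt_boxes: iterable of (top_row, bottom_row)."""
    band_size = image_height / float(num_bands)
    heights_per_band = {b: [] for b in range(num_bands)}
    for top_row, bottom_row in gt_boxes:
        band = min(int(bottom_row // band_size), num_bands - 1)
        heights_per_band[band].append(bottom_row - top_row)

    stats = {}
    for band, heights in heights_per_band.items():
        if len(heights) >= 2:  # need enough samples to estimate a distribution
            stats[band] = (float(np.mean(heights)), float(np.std(heights)))
        # Bands without enough ground truth samples get no entry;
        # detections landing there later receive wf = 0.
    return stats
```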
To use the above empirical height distributions to influence the confidence of object detections, and thereby improve system performance, the detection confidence score from the ICF detector 300 is modified. Post-processing is used to evaluate the impact of this method on the system Receiver Operating Characteristic (ROC), so no actual re-detection or re-classification is required. This evaluation involves only the first two stages (i.e., the ICF detector 300 and the CNN classifier 304) and the target size filter 302, and does not involve the third stage (i.e., the MTT 306). To modify the confidence score, the system calculates a multiplicative weight factor (wf) based on the position and height of the detection and applies it to the detection confidence score. Two methods of calculating wf are described further below.
In a second embodiment, as shown in fig. 4A, the system learns to predict object classes by combining image and position/size information to train an alternative neural network that produces classification results. In this embodiment, the top-row processing employs the same deep convolutional network (CNN-1) classifier 304 as in the second stage, while the second row employs a modified convolutional network (CNN-2) classifier 400. The modified convolutional network classifier 400 outputs a target class, location, and confidence, which are fused 402 with the target class, location, and confidence from the CNN classifier 304 and provided to the MTT 306, which tracks the target boxes for the final target classification, location, and confidence score.
The modified convolutional network classifier is expanded and further illustrated in fig. 4B. After receiving input 404 from the ICF 300, the 1024-D (dimensional) features 408 from the final convolutional layer (i.e., deep convolutional layer 406) are augmented with the target size and position 410 and fed to a fully connected (FC) layer 412 that precedes the classifier layer. The modified CNN-2 400 may be used together with the original CNN-1 304, and the results of CNN-1 304 and CNN-2 400 may be fused 402 to arrive at the final decision. Alternatively, CNN-2 400 may be used in place of CNN-1 304 while maintaining the same process flow. The fusion 402 may be achieved by combining the probability distributions of CNN-1 and CNN-2 over the set of classes to be classified; for example, the per-class weights from the two CNNs are simply averaged and then renormalized so that the weights sum to 1.0.
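The following PyTorch-style sketch illustrates the CNN-2 idea: the 1024-D feature vector from the final convolutional layer is concatenated with the normalized target size and position before a fully connected layer and classifier, and the class posteriors of CNN-1 and CNN-2 are fused by averaging and renormalizing. The layer sizes, geometry encoding, and fusion weights are assumptions, not the exact architecture of the disclosure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SizeAwareHead(nn.Module):
    """Classifier head taking 1024-D conv features plus box size/position
    (an illustrative sketch of the CNN-2 idea, not the exact architecture)."""
    def __init__(self, num_classes, conv_dim=1024, geom_dim=4):
        super().__init__()
        self.fc = nn.Linear(conv_dim + geom_dim, 512)
        self.classifier = nn.Linear(512, num_classes)

    def forward(self, conv_features, box_geometry):
        # conv_features: (B, 1024) from the final convolutional layer.
        # box_geometry:  (B, 4), e.g., normalized (x, y, width, height).
        x = torch.cat([conv_features, box_geometry], dim=1)
        x = F.relu(self.fc(x))
        return F.softmax(self.classifier(x), dim=1)

def fuse_class_probabilities(p_cnn1, p_cnn2):
    """Fuse two class posteriors by averaging and renormalizing so that
    the fused weights sum to 1.0, as described above."""
    fused = 0.5 * (p_cnn1 + p_cnn2)
    return fused / fused.sum(dim=1, keepdim=True)
```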
(3.1) Method 1: Weighted Gaussian
A first method of modifying the confidence score uses a weighted Gaussian according to the following equations:

wf = exp(−(h − m)² / (2 × (N × σ)²))

and

new_score = original score × wf,

where h represents the height of the detected object in the image whose detection confidence score is to be modified, and m and σ represent the mean and standard deviation, respectively, of the object height distribution for the corresponding image band. exp(.) represents an exponential function. In the absence of a histogram for the corresponding image strip (i.e., no estimate of m and σ), the multiplicative weight factor is set to zero (i.e., wf = 0.0). Further, N = {1, 2, 3, 4, ...} is a multiplier used to relax the detection size constraint. The multiplicative weight factor (wf) is then multiplied with the original score to derive the new, modified confidence score (i.e., new_score).
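A minimal sketch of Method 1, assuming the Gaussian weight takes the standard form implied by the symbols above (with the multiplier N widening the standard deviation); band_stats is the per-band (m, σ) table estimated from ground truth, and all names are illustrative:

```python
import math

def gaussian_weight(h, band_stats, band, N=4):
    """Multiplicative weight factor wf for Method 1 (weighted Gaussian).
    Returns 0.0 when no height distribution was estimated for the band."""
    if band not in band_stats:
        return 0.0
    m, sigma = band_stats[band]
    return math.exp(-((h - m) ** 2) / (2.0 * (N * sigma) ** 2))

# new_score = original_score * gaussian_weight(detected_height, band_stats, band)
```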
It was determined experimentally that N = 1 is very restrictive and does not result in significant improvement in system performance, while N = 4 provides most of the benefit. Fig. 6A and 6B show plots 604 of the target height 600 distributions of 88 training sequences from IR data sets of side-facing (fig. 6A) and front-facing (fig. 6B) sensors, respectively, based on ground truth (GT) information from human annotations. The height distributions 600 are collected in 16 horizontal bands of image height across 480 rows (represented as 16 plots in each of fig. 6A and 6B) and plotted as 25-bin histograms together with their Gaussian approximations (mean and standard deviation). The histogram of each band is labeled with the image rows it covers. The absence of a plotted histogram means that there are not enough GT target samples to support histogram estimation for the corresponding band.
For further understanding, fig. 6A and 6B illustrate two things. First, where histogram plots are absent for a band, objects appearing in that band are unlikely regardless of object height. Second, for bands with a histogram plot, the histogram is Gaussian-like. That is, the potential target height can be well modeled by a Gaussian distribution. Thus, once the target height distribution for a particular band has been estimated, the likelihood that a target detection is true can be estimated from this distribution. These observations support the approach proposed here as Method 1.
(3.2) Method 2: Weighted Gating
A second method of modifying the confidence score uses a weighted gate according to the following equation:

wf = 1 if |h − m| ≤ N × σ; otherwise wf = 0.

Essentially, this gates the detected box height (h) around the mean (m) of the GT box heights for the corresponding image strip. If the detected height is within the gate, the original score is retained; otherwise, the original score is cleared, effectively preventing further processing of that detection. This approach is called "weighted gating." As with Method 1, in the absence of a histogram for the corresponding image strip (i.e., no estimate of m and σ), wf = 0. As in the case above, once the multiplicative weight factor (wf) is determined, it is multiplied with the original score to yield the new, modified confidence score.
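A corresponding sketch of Method 2 (weighted gating), assuming the gate keeps the original score when the detected height lies within N standard deviations of the band mean; names are illustrative:

```python
def gating_weight(h, band_stats, band, N=4):
    """Multiplicative weight factor wf for Method 2 (weighted gating):
    1.0 inside the gate around the band mean, 0.0 otherwise or when the
    band has no estimated height distribution."""
    if band not in band_stats:
        return 0.0
    m, sigma = band_stats[band]
    return 1.0 if abs(h - m) <= N * sigma else 0.0

# new_score = original_score * gating_weight(detected_height, band_stats, band)
```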
In experiments, it was found that the "weighted gating" method is equally effective in improving detection performance in terms of ROC when N > 3, while poorer performance is obtained when N = 1. For example, fig. 8 is a graph illustrating the post-CNN (second-stage) ROCs of the Gaussian and gated detection-score weighting methods compared to the baseline ROC of the unweighted score. As can be seen, both the Gaussian method and the gating method (with N = 4) perform much better, about 5% better than baseline.
(4) Results of the experiment
The embodiment shown in fig. 3 has been implemented and tested for identifying dismounts ("MAN" targets) and their activities in EO and IR videos from fixed and moving ground vehicles.
(4.1) IR video data
As shown in fig. 6A and 6B, a set of IR video sequences was collected from a moving vehicle, and histogram distributions of the height 600 of the ground truth (GT) boxes of the objects 602 were plotted to form a set of collected height histogram plots 604. Since the sensors for the side-facing and front-facing IR sensor sequences have different tilt angles, histogram collection was performed separately for the h1co and h2co sequences. These are shown in fig. 6A and 6B, discussed previously. As can be seen from the histogram plots in fig. 6A and 6B, the GT height 600 monotonically decreases from the bottom of the image (lines 450 to 479) 606 toward the top, ending at lines 30 to 59 (608) with a mean of 23.9 pixels for h1co and at lines 180 to 209 (610) with a mean of 17.2 pixels for h2co. For both h1co and h2co, the greatest concentration of GT MAN objects is located in the 3 to 4 bands below the shortest target band listed above.
The weighting factor wf is calculated as described above and the detection score is modified using wf. The remainder of the pipeline shown in fig. 3 was then executed and performance was evaluated. The experiments show how the weighted Gaussian method (Method 1) improves the effectiveness of the confidence score. In particular, fig. 7 is a graph illustrating a comparison of weighted and unweighted detection scores for 30 IR test sequences in pre-CNN (first stage) and post-CNN (second stage) Receiver Operating Characteristic (ROC) curves. Since the gating method (Method 2) achieves almost the same results, only the Gaussian window method is shown. As can be seen from the plots in fig. 7, we obtained a gain in Pd (probability of correct detection) of 7% (0.07) or more at any given false positives per image (FPPI), or an FPPI reduction of nearly 50% (e.g., at slightly above the 75% Pd level).
(4.2) EO video data
The same experiment and analysis were performed on visible-band (EO) video data. The results are shown in fig. 9A and 9B, respectively. In particular, fig. 9A and 9B are graphs (i.e., a set of height histogram plots 900) illustrating the height distribution of GT "MAN" objects in 88 training sequences for a side-facing color (or EO) sensor (fig. 9A) and a front-facing EO sensor (fig. 9B). The height distributions were collected in 16 horizontal bands of image height across 768 rows and plotted as 25-bin histograms with overlaid Gaussian approximations (from the mean and standard deviation). The histogram of each band is labeled with the image rows it covers. Where there is no plotted histogram, there were not enough GT entries to support the histogram estimation.
These histograms do not look much different from the histograms from the IR sequences presented in the previous section (see fig. 6A and 6B). Although the specific numbers differ, the trend is the same: as the target moves closer to the sensor (the bottom edge of the target box moves closer to the bottom of the image), the target height becomes larger, as shown by the means of the plots. The estimated distributions shown in fig. 9A and 9B were used to weight MAN object detection scores in the same manner as discussed for the IR images. Fig. 10 shows the results of comparing the weighted and unweighted detection-score ROCs for 30 EO test sequences in the pre-CNN (first stage + size filtering) and post-CNN (second stage) results. As can be seen, the detection ROC for the weighted scores (pre-CNN) has a performance advantage of about 4% over the unweighted ROC, but the benefit in the post-CNN ROC is almost eliminated. This is assumed to occur because the performance of the CNN trained on EO is already very good, and improved detection hardly helps to further improve the situation. This is in contrast to the IR case: the CNN in the IR domain is not as good as the CNN in EO because the target objects tend to be smaller and contain less distinguishing texture.
Thus, described above is an object detection and recognition system that adds detection size filtering to improve upon the work described in U.S. application No. 15/883,822. In terms of the system ROC, experiments show a positive improvement in overall detection performance (Pd up by 5% to 7%, FPPI reduced by close to 50%) both before CNN recognition (i.e., with only the first stage) and after CNN recognition (with the first and second stages) compared to the baseline.
(5) Control of a device
The invention described herein allows real-time identification of objects/targets based on EO or IR vision, even on small, low-power, low-cost platforms such as UAVs and UGVs. The method is also suitable for implementation on emerging spiking neuromorphic hardware (e.g., neuromorphic chips). Systems according to embodiments of the present disclosure may be used in intelligence, surveillance, and reconnaissance (ISR) operations, border security, and mission security, such as for UAV-based surveillance, human activity detection, threat detection, and distributed mobile operations. For example, in military applications, the classified object output may be used to alert the driver/crew (via audible, tactile, and/or visual alerts, etc.) that a high-confidence "MAN" target exists and of its location. Then, after the threat is manually confirmed, evasive action or engagement of the target can be taken, for example by causing the vehicle to change its route. A similar alert may also be provided for a remotely operated vehicle.
Additionally, the system may be embedded in autonomous driving robotic vehicles (such as UAVs and UGVs) as well as autonomous driving vehicles. For example, in autonomous vehicles, the system may be used for collision avoidance. In this example, if the system detects an object (target) (e.g., a pedestrian, another vehicle) in its path, an alert is sent to the vehicle operating system to cause the vehicle to perform a braking operation. Alternatively, the alert may signal that the vehicle operating system should perform a turning motion around an object (target), including steering and acceleration operations or any other operation required to provide collision avoidance. Further, the detected object may be a road sign, such as a stop sign. In classifying the stop sign, an alert may be sent to the vehicle operating system to brake or otherwise comply with the message conveyed by the road sign. Thus, as described above, the systems and processes described herein may be used to control various devices, such as to cause the devices to perform operations or physical manipulations.
Fig. 11 is a flowchart illustrating control of the device 1102 based on the classification of an object as a target using the processor 1100. Non-limiting examples of devices 1102 that may be controlled via the processor 1100 based on the classification of target objects include a vehicle or vehicle components, such as a brake, an acceleration/deceleration control, a steering mechanism, suspension, or safety devices (e.g., airbags, seat belt pretensioners, etc.), or any combination thereof. Further, the vehicle may be an Unmanned Aerial Vehicle (UAV), an autonomous ground vehicle, or a human-operated vehicle controlled by a driver or a remote operator. As can be appreciated by one skilled in the art, other device-type controls are possible given the classification of an object as a target and the corresponding situation in which the system is employed.
Finally, while the invention has been described in terms of various embodiments, those of ordinary skill in the art will readily recognize that the invention can have other applications in other environments. It should be noted that many embodiments and implementations are possible. Furthermore, the following claims are in no way intended to limit the scope of the present invention to the specific embodiments described above. Additionally, any recitation of "means for" is intended to evoke a means-plus-function reading of an element and a claim, whereas any element that does not specifically use the recitation "means for" is not intended to be read as a means-plus-function element, even if the claim otherwise includes the word "means." Further, while particular method steps have been recited in a particular order, the method steps may occur in any desired order and fall within the scope of the present invention.

Claims (24)

1. An object recognition system, the object recognition system comprising:
one or more processors and a non-transitory computer-readable medium having executable instructions encoded thereon such that, when the instructions are executed, the one or more processors perform the following operations:
extracting a candidate target region from an input image of a scene surrounding a platform using an Integrated Channel Feature (ICF) detector, wherein the candidate target region has an associated raw confidence score representing a candidate object;
generating a modified confidence score based on the detected position and height of the candidate object;
classifying the candidate target region based on the modified confidence score using a trained Convolutional Neural Network (CNN) classifier, resulting in a classified object;
tracking the classified objects using a multi-target tracker to ultimately classify each classified object as either a target or a non-target; and
if the classified object is a target, controlling a device based on the target.
2. The object recognition system of claim 1, wherein the ICF detector computes channel feature vectors for image frames of a video, and wherein for each image frame an ICF classifier is applied at a plurality of image scales and across the entire image frame.
3. The object recognition system of claim 1, wherein the CNN classifier is implemented as an interactive software module comprising a CNN interface and a CNN server, wherein the CNN interface displays results received from the CNN server.
4. The object recognition system of claim 1, wherein the trained CNN is used for both electro-optical (EO) and Infrared (IR) image classification.
5. The object recognition system of claim 1, wherein the input image is divided into a plurality of horizontal bands, and ground truth objects are divided into the same number of groups based on which band the locations of the ground truth objects in the input image fall within, the objects in each group being used to estimate the mean and standard deviation of the object height distribution in the input image.
6. The object recognition system of claim 1, wherein the modified confidence score is generated using a weighted Gaussian according to the following equations:

wf = exp(−(h − m)² / (2 × (N × σ)²))

and

modified confidence score = original confidence score × wf,

where h denotes the height of the candidate object in the input image, m and σ denote the mean and standard deviation, respectively, of the object height distribution for the corresponding horizontal band of the input image, exp(.) denotes an exponential function, N is a multiplier, and × denotes multiplication.
7. The object recognition system of claim 1, wherein the modified confidence score is generated using a weighted gate according to the following equations:

wf = 1 if |h − m| ≤ N × σ, and wf = 0 otherwise,

and

modified confidence score = original confidence score × wf,

where h denotes the height of the candidate object in the input image, m and σ denote the mean and standard deviation, respectively, of the object height distribution for the corresponding horizontal band of the input image, N is a multiplier, and × denotes multiplication.
8. The object recognition system of claim 1, further comprising the operations of:
classifying the candidate target region based on the modified confidence score using a modified convolutional network (CNN-2) classifier, resulting in a modified classification object; and
fusing the modified classified objects with the classified objects from the trained CNN classifier for processing by the multi-target tracker.
9. A computer program product for object recognition, the computer program product comprising:
a non-transitory computer-readable medium having executable instructions encoded thereon such that, when executed by one or more processors, the one or more processors perform operations comprising:
extracting a candidate target region from an input image of a scene surrounding a platform using an Integrated Channel Feature (ICF) detector, wherein the candidate target region has an associated raw confidence score representing a candidate object;
generating a modified confidence score based on the detected position and height of the candidate object;
classifying the candidate target region based on the modified confidence score using a trained Convolutional Neural Network (CNN) classifier, resulting in a classified object;
tracking the classified objects using a multi-target tracker to ultimately classify each classified object as either a target or a non-target; and
if the classified object is a target, controlling a device based on the target.
10. The computer program product of claim 9, wherein the ICF detector computes channel feature vectors for image frames of a video, and wherein, for each image frame, an ICF classifier is applied at a plurality of image scales and across the entire image frame.
11. The computer program product of claim 9, wherein the CNN classifier is implemented as an interactive software module comprising a CNN interface and a CNN server, wherein the CNN interface displays results received from the CNN server.
12. The computer program product of claim 9, wherein the trained CNN is used for both electro-optical (EO) and Infrared (IR) image classification.
13. The computer program product of claim 9, wherein the input image is divided into a plurality of horizontal bands, and ground truth objects are divided into the same number of groups based on which band the locations of the ground truth objects in the input image fall within, the objects in each group being used to estimate a mean and a standard deviation of an object height distribution in the input image.
14. The computer program product of claim 9, wherein the modified confidence score is generated using a weighted Gaussian according to the following equations:

wf = exp(−(h − m)² / (2 × (N × σ)²))

and

modified confidence score = original confidence score × wf,

where h denotes the height of the candidate object in the input image, m and σ denote the mean and standard deviation, respectively, of the object height distribution for the corresponding horizontal band of the input image, exp(.) denotes an exponential function, N is a multiplier, and × denotes multiplication.
15. The computer program product of claim 9, wherein the modified confidence score is generated using a weighted gate according to the following equations:

wf = 1 if |h − m| ≤ N × σ, and wf = 0 otherwise,

and

modified confidence score = original confidence score × wf,

where h denotes the height of the candidate object in the input image, m and σ denote the mean and standard deviation, respectively, of the object height distribution for the corresponding horizontal band of the input image, N is a multiplier, and × denotes multiplication.
16. The computer program product of claim 9, further comprising operations of:
classifying the candidate target region based on the modified confidence score using a modified convolutional network (CNN-2) classifier, resulting in a modified classification object; and
fusing the modified classified objects with the classified objects from the trained CNN classifier for processing by the multi-target tracker.
17. A computer-implemented method for object recognition, the computer-implemented method comprising acts of:
extracting a candidate target region from an input image of a scene surrounding a platform using an Integrated Channel Feature (ICF) detector, wherein the candidate target region has an associated raw confidence score representing a candidate object;
generating a modified confidence score based on the detected position and height of the candidate object;
classifying the candidate target region based on the modified confidence score using a trained Convolutional Neural Network (CNN) classifier, resulting in a classified object;
tracking the classified objects using a multi-target tracker to ultimately classify each classified object as either a target or a non-target; and
if the classified object is a target, controlling a device based on the target.
18. The computer-implemented method of claim 17, wherein the ICF detector computes channel feature vectors for image frames of a video, and wherein, for each image frame, an ICF classifier is applied at a plurality of image scales and across the entire image frame.
19. The computer-implemented method of claim 17, wherein the CNN classifier is implemented as an interactive software module comprising a CNN interface and a CNN server, wherein the CNN interface displays results received from the CNN server.
20. The computer-implemented method of claim 17, wherein the trained CNN is used for both electro-optical (EO) and Infrared (IR) image classification.
21. The computer-implemented method of claim 17, wherein the input image is divided into a plurality of horizontal bands, and the true objects are divided into a same number of groups according to the band in which the location of each true object in the input image falls, the objects in each group being used to estimate a mean and a standard deviation of an object height distribution in the input image.
22. The computer-implemented method of claim 17, wherein the modified confidence score is generated using a weighted Gaussian according to the following equation:
w_f = N × exp(−(h − m)² / (2σ²)),
and
the modified confidence score = the original confidence score × w_f,
where h denotes the height of the candidate object in the input image, m and σ denote the mean and the standard deviation, respectively, of the object height distribution in the input image, exp(·) denotes an exponential function, N is a multiplier, and × denotes multiplication.
23. The computer-implemented method of claim 17, wherein the modified confidence score is generated using a weighted threshold according to the following equation:
Figure FDA0002662950970000051
and
the modified confidence score = the original confidence score × w_f,
where h denotes the height of the candidate object in the input image, m and σ denote the mean and the standard deviation, respectively, of the object height distribution in the input image, N is a multiplier, and × denotes multiplication.
24. The computer-implemented method of claim 17, the method further comprising the acts of:
classifying the candidate target region based on the modified confidence score using a modified convolutional neural network (CNN-2) classifier, resulting in a modified classified object; and
fusing the modified classified objects with the classified objects from the trained CNN classifier for processing by the multi-target tracker.
CN201980016839.5A 2018-04-17 2019-02-14 System for real-time object detection and recognition using image and size features Pending CN111801689A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US201862659100P 2018-04-17 2018-04-17
US62/659,100 2018-04-17
PCT/US2019/018119 WO2019203921A1 (en) 2018-04-17 2019-02-14 System for real-time object detection and recognition using both image and size features

Publications (1)

Publication Number Publication Date
CN111801689A true CN111801689A (en) 2020-10-20

Family

ID=68239189

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201980016839.5A Pending CN111801689A (en) 2018-04-17 2019-02-14 System for real-time object detection and recognition using image and size features

Country Status (3)

Country Link
EP (1) EP3782075A4 (en)
CN (1) CN111801689A (en)
WO (1) WO2019203921A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111275054B (en) * 2020-01-16 2023-10-31 北京迈格威科技有限公司 Image processing method, device, electronic equipment and storage medium
CN112633323A (en) * 2020-11-26 2021-04-09 成都佳发安泰教育科技股份有限公司 Gesture detection method and system for classroom
CN112560726B (en) * 2020-12-22 2023-08-29 阿波罗智联(北京)科技有限公司 Target detection confidence determining method, road side equipment and cloud control platform

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100021010A1 (en) * 2008-07-25 2010-01-28 Gm Global Technology Operations, Inc. System and Method for detecting pedestrians
KR20150017762A (en) * 2012-06-14 2015-02-17 도요타지도샤가부시키가이샤 Discrimination container generation device and pattern detection device
US20160342837A1 (en) * 2015-05-19 2016-11-24 Toyota Motor Engineering & Manufacturing North America, Inc. Apparatus and method for object tracking
US20160343146A1 (en) * 2015-05-22 2016-11-24 International Business Machines Corporation Real-time object analysis with occlusion handling
CN105760858A (en) * 2016-03-21 2016-07-13 东南大学 Pedestrian detection method and apparatus based on Haar-like intermediate layer filtering features
CN105913003A (en) * 2016-04-07 2016-08-31 国家电网公司 Multi-characteristic multi-model pedestrian detection method
WO2017192629A1 (en) * 2016-05-02 2017-11-09 The Regents Of The University Of California System and method for estimating perfusion parameters using medical imaging
US20170364757A1 (en) * 2016-06-20 2017-12-21 Delphi Technologies, Inc. Image processing system to detect objects of interest
US20180005079A1 (en) * 2016-07-01 2018-01-04 Ricoh Co., Ltd. Active View Planning By Deep Learning
CN107092883A (en) * 2017-04-20 2017-08-25 上海极链网络科技有限公司 Object identification method for tracing
CN107273832A (en) * 2017-06-06 2017-10-20 青海省交通科学研究院 Licence plate recognition method and system based on integrating channel feature and convolutional neural networks
CN107679525A (en) * 2017-11-01 2018-02-09 腾讯科技(深圳)有限公司 Image classification method, device and computer-readable recording medium

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
ANKIT VERMA et al., PEDESTRIAN DETECTION VIA MIXTURE OF CNN EXPERTS AND THRESHOLDED AGGREGATED CHANNEL FEATURES, 31 December 2015 (2015-12-31) *
KEYU LU et al., EFFICIENT DEEP NETWORK FOR VISION-BASED OBJECT DETECTION IN ROBOTIC APPLICATIONS, 1 July 2017 (2017-07-01), pages 31 - 45 *
XIANZHI DU et al., FUSED DNN: A DEEP NEURAL NETWORK FUSION APPROACH TO FAST AND ROBUST PEDESTRIAN DETECTION, 31 December 2017 (2017-12-31) *
ZONGQING LU et al., PEDESTRIAN DETECTION AIDED BY SCALE-DISCRIMINATIVE NETWORK, 6 December 2016 (2016-12-06), pages 1 - 7 *

Also Published As

Publication number Publication date
EP3782075A1 (en) 2021-02-24
EP3782075A4 (en) 2021-12-29
WO2019203921A1 (en) 2019-10-24

Similar Documents

Publication Publication Date Title
US10699139B2 (en) System for real-time object detection and recognition using both image and size features
Wang et al. Pedestrian recognition and tracking using 3D LiDAR for autonomous vehicle
US10289934B2 (en) Landmark localization on objects in images using convolutional neural networks
Azimjonov et al. A real-time vehicle detection and a novel vehicle tracking systems for estimating and monitoring traffic flow on highways
Zangenehpour et al. Automated classification based on video data at intersections with heavy pedestrian and bicycle traffic: Methodology and application
US8948501B1 (en) Three-dimensional (3D) object detection and multi-agent behavior recognition using 3D motion data
Keller et al. The benefits of dense stereo for pedestrian detection
WO2020020472A1 (en) A computer-implemented method and system for detecting small objects on an image using convolutional neural networks
US11055872B1 (en) Real-time object recognition using cascaded features, deep learning and multi-target tracking
US9569531B2 (en) System and method for multi-agent event detection and recognition
US10691972B2 (en) Machine-vision system for discriminant localization of objects
CN111801689A (en) System for real-time object detection and recognition using image and size features
US10332265B1 (en) Robust recognition on degraded imagery by exploiting known image transformation under motion
Liu et al. Trajectory and image-based detection and identification of UAV
Zhang et al. Autonomous long-range drone detection system for critical infrastructure safety
Guo et al. Pedestrian tracking based on camshift with kalman prediction for autonomous vehicles
US20120155711A1 (en) Apparatus and method for analyzing video
Teutsch et al. Detection and classification of moving objects from UAVs with optical sensors
El Jaafari et al. A novel approach for on-road vehicle detection and tracking
KR20210100937A (en) Device for identifying the situaton of object's conduct using sensor fusion
Ramchandani et al. A Comparative Study in Pedestrian Detection for Autonomous Driving Systems
Chien et al. An integrated driver warning system for driver and pedestrian safety
Hata et al. Road geometry classification using ANN
Demars et al. Multispectral detection and tracking of multiple moving targets in cluttered urban environments
Guo et al. Study on pedestrian detection and tracking with monocular vision

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination