WO2023245635A1

WO2023245635A1 - Apparatus and method for object detection

Info

Publication number: WO2023245635A1
Application number: PCT/CN2022/101174
Authority: WO
Inventors: Haoran WEI; Ping Guo; Bing Wang; Peng Wang
Original assignee: Intel Corporation
Priority date: 2022-06-24
Filing date: 2022-06-24
Publication date: 2023-12-28

Abstract

An apparatus, method, device and medium for object detection. The method includes: predicting a pair of pseudo 3D corners including a top-left corner and a bottom-right corner for an object in an image, wherein each of the pair of pseudo 3D corners is represented by height, width, and pseudo depth of the pseudo 3D corner (S610), modeling the pair of pseudo 3D corners for the object by structure affinity (SA) for embedding the height and width and context affinity (CA) for embedding the pseudo depth (S620); and combining the SA and the CA via an interaction function to gain object confidence (S630).

Description

APPARATUS AND METHOD FOR OBJECT DETECTION

Technical Field

Embodiments of the present disclosure generally relate to techniques of convolutional neural networks (CNNs) , and in particular to an apparatus and a method for object detection.

Background Art

Object Detection aims to localize and classify a target of interest in an image. In simple scenarios, existing prevalent detectors, e.g., Fast R-CNN, RetinaNet, CenterNet and FCOS, are robust enough, which utilize center points to model the objects bounding box. However, their performance will be severely degraded in some extreme scenarios, especially the objects occlusion which often occurs in practical application, such as pedestrian detection. The reason why objects occlusion is difficult to solve is that most detectors model an object instance via its center point, yet objects’ centers are easy to overlap.

Corner-guided detectors such as CornerNet and its variants, CenterNet and CentripetalNet, use corners to replace centers to alleviate the problem by transforming object detection to be corner keypoints prediction and grouping without center estimation, in which the corner modeling method can effectively alleviate the centers overlap problem.

Summary

According to an aspect of the disclosure, a method is provided. The method includes: predicting a pair of pseudo 3D corners comprising a top-left corner and a bottom-right corner for an object in an image, wherein each of the pair of pseudo 3D corners is represented by height, width, and pseudo depth; modeling the pair of pseudo 3D corners for the object by structure affinity (SA) for embedding the height and width and context affinity (CA) for embedding the pseudo depth; and combining the SA and the CA via an interaction function to gain object confidence.

According to another aspect of the disclosure, an apparatus is provided. The apparatus includes: interface circuitry; and processor circuitry coupled to the interface circuitry and configured to: predict a pair of pseudo 3D corners comprising a top-left corner and a bottom-right corner for an object in an image, wherein each of the pair of pseudo 3D corners is represented by height, width, and pseudo depth; model the pair of pseudo 3D corners for the object by structure affinity (SA) for embedding the height and width and context affinity (CA) for embedding the pseudo depth; and combine the SA and the CA via an interaction function to gain object confidence.

Another aspect of the disclosure provides a device including means for implementing the method of the disclosure.

Another aspect of the disclosure provides a machine readable storage medium having instructions stored thereon, which when executed by a machine cause the machine to perform the method of the disclosure.

Brief Description of the Drawings

In the drawings, which are not necessarily drawn to scale, like numerals may describe similar components in different views. Like numerals having different letter suffixes may represent different instances of similar components. The drawings illustrate generally, by way of example, but not by way of limitation, various embodiments discussed in the present document.

Fig. 1 illustrates visualization comparisons of different solutions for object detection in an image.

Fig. 2 is an exemplary illustration of pseudo 3D corner representation in accordance with some embodiments of the disclosure.

Fig. 3 is an exemplary illustration of object detection based on the pseudo 3D corner representation in accordance with some embodiments of the disclosure.

Fig. 4 is an exemplary illustration of structure affinity (SA) in accordance with some embodiments of the disclosure.

Fig. 5 illustrates visualization comparisons of the proposed method with classical corner-guided object detectors on Citypersons dataset in accordance with some embodiments of the disclosure.

Fig. 6 illustrates a flow chart illustrating an exemplary method 600 for object detection in accordance with some embodiments of the disclosure.

Fig. 7 is a block diagram illustrating components, according to some example embodiments, able to read instructions from a machine-readable or computer-readable medium and perform any one or more of the methodologies discussed herein.

Fig. 8 is a block diagram of an example processor platform in accordance with some embodiments of the disclosure.

Detailed Description of Embodiments

Various aspects of the illustrative embodiments will be described using terms commonly employed by those skilled in the art to convey the substance of the disclosure to others skilled in the art. However, it will be apparent to those skilled in the art that many alternate embodiments may be practiced using portions of the described aspects. For purposes of explanation, specific numbers, materials, and configurations are set forth in order to provide a thorough understanding of the illustrative embodiments. However, it will be apparent to those skilled in the art that alternate embodiments may be practiced without the specific details. In other instances, well known features may have been omitted or simplified in order to avoid obscuring the illustrative embodiments.

Further, various operations will be described as multiple discrete operations, in turn, in a manner that is most helpful in understanding the illustrative embodiments; however, the order of description should not be construed as to imply that these operations are necessarily order dependent. In particular, these operations need not be performed in the order of presentation.

The phrases “in an embodiment” , “in one embodiment” and “in some embodiments” are used repeatedly herein. The phrase generally does not refer to the same embodiment; however, it may. The terms “comprising, ” “having, ” and “including” are synonymous, unless the context dictates otherwise. The phrases “A or B” and “A/B” mean “ (A) , (B) , or (A and B) . ”

Although corner modeling method has potential to overcome the objects occlusion problem, the corner grouping process is challenging. To our best knowledge, current corner grouping algorithms, no matter CornerNet or its variants, CenterNet and CentripetalNet, cannot deal with dense occlusions.

Existing detection methods such as Faster-RCNN, SSD, use a 2D-box represented by height and width to model the object in an image. However, for at least two partially overlapped objects in an image, their 2D-boxes are overlapped and are difficult to distinguish.

Fig. 1 illustrates visualization comparisons of different solutions for object detection in an image. As shown in Fig. 1, there are 4 people in the image and 3 of them are dense occluded. CornerNet fails to distinguish corners of different objects with similar appearance. Upon CornerNet, CenterNet filters out false positives (FP) via center points yet cannot solve the scenario that a center point of the third object right within the center of an FP. CentripetalNet takes the two dense occluded people as one due to the center overlapping.

To addresses the occluded object detection problem faced by existing solutions, in the present disclosure, a pseudo 3D corner representation is provided to model object (s) in an image, wherein the pseudo 3D may include the dimensions of height, width, and pseudo depth. Through the promotion of dimensions from 2D to 3D, the coincided detection box can be distinguished, i.e., the overlapped detection box in a 2D space can be distinguished in a 3D space.

Back to Fig. 1, the solution based on the pseudo 3D corner representation of the present disclosure, termed as Corner Affinity, can distinguish the 4 people in the image clearly with four boxes with no boxes across two objects.

Fig. 2 is an exemplary illustration of pseudo 3D corner representation in accordance with some embodiments of the disclosure. As shown in Fig. 2, in the pseudo 3D corner representation, an object may be represented by a pair of pseudo 3D corners. In an embodiment, the pair of pseudo 3D corners includes a top-left corner and a bottom-right corner. Each pair of pseudo 3D corners is modeled by: 1) Structure Affinity (SA) , applied to mine preliminary similarity of corner pairs through the corresponding object's shallow construction knowledge. 2) Context Affinity (CA) , running as optimizing corner similarity via deeper semantic features of affiliated instances.

In some embodiments, the height and width may be embedded in an SA module and the pseudo depth may be embedded in a CA module.

Fig. 3 is an exemplary illustration of object detection based on the pseudo 3D corner representation in accordance with some embodiments of the disclosure. As shown in Fig. 3, an object is represented by a pair of pseudo 3D corners: its top-left and bottom-right corners. To measure object confidence, the SA for embedding width and height and the CA for embedding pseudo depth are provided for each pair of corners. The SA and the CA is combined by an interaction function, for example, a corner affinity function. In the following subsections, we will detail the SA, CA, and the interaction function.

Mining the Construction Similarity via Structure Affinity

The structure affinity (SA) aims to mine preliminary construction similarity of corner pairs through the corresponding object's shallow structure knowledge. In an embodiment, the shape and location information of an instance may be defined as the structure knowledge. In some embodiments, the shape (e.g., width and height) information of each instance may be regressed, for example, at a ground-truth corner location.

In some embodiments, a detection network may generate SA regression maps for the top-left and bottom-right corners of the object. Note that regression is utilized to encode SA values and the width and height are not regarded as detection box, which is very different from popular detectors that use regression to obtain bounding boxes. In some embodiments, smoothL1 Loss may be adopted as SA loss to mine the shape knowledge (e.g., width and height) of each instance. In some embodiments, SA loss is only applied at the ground-truth corner location.

After obtaining the SA regression map of each corner, predicted width (w) and height (h) vectors may be decoded at each estimated corner location. Fig. 4 is an exemplary illustration of SA in accordance with some embodiments of the disclosure. As shown in Fig. 4, the regressed w and h can not only form a rough object box but also decode a new vector (vector _tl and vector _br) that point to the opposite corner. Accordingly, in some embodiments, the SA may be designed via coupling Intersection-over-Union (IoU) and corner drifting, as follows:

where the box _tl and box _br represent top-left and bottom-right boxes formed via the corresponding regressed w and h. The d _tl and d _br means the value of corner drifting from the end of decoded vectors to target corners, the decoded vectors are composed of regressed width and height while the target corners are the estimated corners in heatmap. D represents the distance value of the two predicted corners. In an example, D, d _tl, and d _br may be calculated via Euclidean distance. More details are shown in Fig. 4.

As described in Eq. (1) , the SA is composed of the IoU of two formed boxes and a bias named corner drifting. It is intuitive and reasonable that if different identity corners (top-left and bottom-left) belong to the same instance, their formed boxes will overlap significantly. Thus, the IoU may be utilized as the basic distance metric of SA. However, vanilla IoU cannot measure offsets of the decoded vectors directly. Therefore, the corner drifting may be provided as a bias of SA. The corner drifting is the mean of top-left and bottom-right drifting, which may be calculated, for example, via Euclidean distance, as shown in Fig. 4. Based on the above design, the value range of SA may be -1 to 1 and the closer the value is to 1, the higher the possibility that two corners belong to the same instance.

Mining the Semantic Similarity via Context Affinity

As mentioned in the above, the SA only embeds low-level construction information, which is not enough to perform grouping under extreme scenarios, e.g., two objects with similar shapes coincide. To this end, the CA part is introduced to mine high-level distinguishable semantic knowledge for Corner Affinity. The pseudo depth of each corner may be embedded in CA.

In some embodiments, the pseudo depth of each of the top-left and bottom-right corner may be represented by an embedding value for the corner.

In some embodiments, an Associative Embedding method may be used to predict the embedding value for each corner. The embedding value may be predicted based on the feature local response via a self-supervision manner, which is no need for a real ground-truth value.

In some embodiments, CA loss is only applied at the ground-truth corner location. In some embodiments, CA loss is used to determine the embedding values, for example, based on “pull” loss and “push” loss, as follows:

where e _tlk is the embedding vector for the predicted top-left corner, e _brk is the embedding vector for the predicted bottom-right corner, and e _k is the average of e _tlk and e _brk. N is the number of objects. Δis a predefined value manually, as an example, which may be set to 1 default.

As shown in Eq. (2) , the “pull" loss may be used to close the embedding distance of paired corners and the “push" loss may be used to separate the embedding distance of irrelevant corners. It’s not cared what the value of each corner is and it just need minimize embedding distances of corners that belong to the same object and maximize those of different ones. Thus, each embedding without a real ground-truth mine the high-level semantic knowledge of an instance.

The distance of embedding is smoothed to be the context affinity. Suppose a top-left corner with an embedding value e _tl and a bottom-right one with an embedding value e _br, the corresponding CA may be defined as follows:

CA=tanh (|e _tl-e _br|) (3)

where the tanh function is employed to normalize the distance of embedding values. The value range of CA is 0 to 1, yet unlike SA, that the closer the CA value is to 1, the lower the possibility that two corners belong to the same object.

Coupling the SA and CA via the interaction Function

The SA and CA are combined to gain the object confidence, which is termed as CornerAffinity, as follows:

where SA and CA are represented in Eq. (1) and Eq. (3) , respectively. σ is a manually set Gaussian variance, for example, its value may be set to 0.5 empirically.

Upon the above designs, our method encodes not only low-level structural knowledge but also high-level semantic knowledge. Even in extreme situation that two objects with similar shape overlap (the SA value of two corners belong to different instances may be large) , CA will decay SA to make sure that the value of overall Corner Affinity is low, v. v. In an embodiment, the value of overall Corner Affinity may be 0.1. In brief, only when the SA value is large and the CA value is small, the combined value is large so that the two corners can be grouped together. The hyper-surface of interaction function as shown in Fig. 3.

Experimental Results

In order to verify the effectiveness of the solution based on the proposed pseudo 3D corner representation, extensive experiments on different datasets with different object detectors are conducted. Tables 1-2 summarize the detailed result comparisons.

Table 1

Table 1 shows comparisons with different object detectors on COCO database both test-dev and val set. As shown in Table 1, the method proposed in the disclosure brings 5.8%boosting, from 40.5%to 46.3%, on AP for CornerNet baseline on COCO test-dev set. And only updated with the proposed new grouping algorithm, CornerNet surpasses popular detection baseline with a large margin, proving the corner-guided detector exists high ceiling and the proposed method remarkably promotes the development of this detector via optimizing corner grouping. It is worth noting that our pure grouping optimization produces more improvements for vanilla CornerNet than the other two single-stage variants, i.e., CenterNet and CentripetalNet. The single-stage two not only optimize corner grouping upon CornerNet, but also use stronger corner enhanced features yet still gain weaker accuracy than our pure grouping optimization. The experimental results show that the proposed method is the optimal corner grouping strategy at present.

Table 1 further shows comparisons with the newest favorite Transformer-based approaches. As shown in Table 1, employing the proposed method in the disclosure, CornerNet (Hourglass-104) yields an Average Precision (AP) of 45.1%and an AP75 of 48.3%on COCO val set, surmounting recent advanced self-attention-based single-stage detectors (e.g., DETR) , even without any feature enhanced module, e.g., FPN. Higher AP75 means higher-quality detection boxes.

Table 2

Table 2 shows comparisons with different object detectors on Citypersons and UCAS-AOD datasets, where AP _c means the AP generated on Citypersons and AP _u represents the AP obtained on UCAS-AOD. These two datasets are selected to test the generalization performance of the proposed method in the disclosure under two extreme scenarios, i.e., occlusion scene (Citypersons) and symmetrical arrangement of similar objects scene (UCAS-AOD) . As shown in Table 2, the proposed method in the disclosure produces excellent accuracy upon higher-quality corner pairs under all aforementioned challenging situations. Specifically, compared with vanilla CornerNet, the proposed method boosts AP of amazing 35.8%and 17.2%on Citypersons and UCAS-AOD, respectively, firmly proving our design brings more robust grouping capability for corner-guided detectors.

As can be clearly seen from the results shown in Tables 1-2, the proposed method, which is tested in 3 public datasets: COCO, Citypersons, and UCAS-AOD, significantly increases the Mean Average Precision (mAP) on all datasets. Especially on the Citypersons where people are dense occluded (examples are shown in Fig. 5) , the proposed method boosts the mAP by 35.8%.

Fig. 5 illustrates visualization comparisons of the proposed method with classical corner-guided object detectors, i.e., CornerNet and CenterNet on Citypersons on Citypersons dataset to show the robustness of the new devised grouping algorithm. As shown in Fig. 5, for CornerNet, some unusual keypoints in the background are grouped as pairs and form some meaningless boxes. CenterNet filters false positives via a center point of object yet cannot overcome the situation that a center of a third object is right within the center of a false positive. The proposed method in the disclosure can overcome the objects occlusion problem in this pedestrians’ scenarios.

Fig. 6 illustrates a flow chart illustrating an exemplary method 600 for object detection in accordance with some embodiments of the disclosure. The method 600 may include blocks S610-S630.

At block S610, a pair of pseudo 3D corners comprising a top-left corner and a bottom-right corner for an object in an image may be predicted, wherein each of the pair of pseudo 3D corners is represented by height, width, and pseudo depth of the pseudo 3D corner. At block S620, the pair of pseudo 3D corners for the object may be modeled by structure affinity (SA) for embedding the height and width and context affinity (CA) for embedding the pseudo depth. At block S630, the SA and the CA may be combined via an interaction function to gain object confidence.

In some embodiments, the method 600 may include more or less steps. The disclosure is not limited in this aspect. Also, the method 600 may be understood in conjunction with the embodiments described above.

In some embodiments, the method 600 may further includes determining the SA value based on an Intersection-over-Union (IoU) value of a top-left box and a bottom-right box formed via corresponding regressed width and height of the predicted top-left corner and the predicted bottom-right corner respectively, and a corner drifting value which is the mean of top-left drifting and bottom-right drifting in the two boxes.

In some embodiments, the method 600 may further includes calculating the SA value by the IoU value minus a minimum value of the corner drifting value and 1.

In some embodiments, wherein the SA value may calculated as follows:

In some embodiments, wherein a range of the SA value may be -1 to 1 and the closer the SA value is to 1, the higher the possibility that the predicted top-left corner and the predicted bottom-right corner belong to a same object.

In some embodiments, the method 600 may further includes generating SA regression maps for the predicted top-left corner and the predicted bottom-right corner to regress the width and height of each corner at a ground-truth corner location; and obtaining the regressed width and height at each corner location after obtaining the SA regression maps.

In some embodiments, the method 600 may further includes applying SmoothL1 Loss to mine the width and height of each corner.

In some embodiments, the method 600 may further includes calculating the CA value by applying tanh function to normalize a distance of embedding values of the predicted top-left corner and the predicted bottom-right corner.

In some embodiments, the CA value may be calculated as follows:

CA=tanh (|e _tl-e _br|)

In some embodiments, wherein the embedding values of the predicted top-left corner and the predicted bottom-right corner may be predicted using an Associative Embedding method.

In some embodiments, the method 600 may further include: predicting the embedding values for the predicted top-left corner and the predicted bottom-right corner by using Pull Loss to close embedding distance of paired corners and Push Loss to separate embedding distance of irrelevant corners.

In some embodiments, the embedding values for the predicted top-left corner and the predicted bottom-right corner may be predicted as follows:

In some embodiments, wherein a range of CA value is 0 to 1, the closer the CA value is to 1, the lower the possibility that the predicted top-left corner and the predicted bottom-right corner belong to a same object.

In some embodiments, wherein the combining the SA and the CA via an interaction function to gain the object confidence includes: grouping the predicted top-left corner and the predicted bottom-right corner together when a combined value of the SA value and the CA value is large, wherein the combined value is large only when the SA value is large and the CA value is small.

In some embodiments, wherein the interaction function is as follows:

where σ is a manually set Gaussian variance.

The present disclosure provides a novel corner representation method, termed as pseudo 3D corner representation, to address the objects occlusion problem, wherein the pseudo 3D may include the dimensions of height, width, and pseudo depth, the height and width may be embedded in a proposed SA module and pseudo depth may be embedded in a proposed CA module. Through the promotion of dimensions from 2D to 3D, the coincided detection box can be distinguished and dense object occlusion problem can be overcome.

Fig. 7 is a block diagram illustrating components, according to some example embodiments, able to read instructions from a machine-readable or computer-readable medium (e.g., a non-transitory machine-readable storage medium) and perform any one or more of the methodologies discussed herein. Specifically, Fig. 7 shows a diagrammatic representation of hardware resources 700 including one or more processors (or processor cores) 710, one or more memory/storage devices 720, and one or more communication resources 730, each of which may be communicatively coupled via a bus 740. For embodiments where node virtualization (e.g., NFV) is utilized, a hypervisor 702 may be executed to provide an execution environment for one or more network slices/sub-slices to utilize the hardware resources 700.

The processors 710 may include, for example, a processor 712 and a processor 714 which may be, e.g., a central processing unit (CPU) , a reduced instruction set computing (RISC) processor, a complex instruction set computing (CISC) processor, a graphics processing unit (GPU) , a digital signal processor (DSP) such as a baseband processor, an application specific integrated circuit (ASIC) , a radio-frequency integrated circuit (RFIC) , another processor, or any suitable combination thereof.

The memory/storage devices 720 may include main memory, disk storage, or any suitable combination thereof. The memory/storage devices 720 may include, but are not limited to any type of volatile or non-volatile memory such as dynamic random access memory (DRAM) , static random-access memory (SRAM) , erasable programmable read-only memory (EPROM) , electrically erasable programmable read-only memory (EEPROM) , Flash memory, solid-state storage, etc.

The communication resources 730 may include interconnection or network interface components or other suitable devices to communicate with one or more peripheral devices 704 or one or more databases 706 via a network 708. For example, the communication resources 730 may include wired communication components (e.g., for coupling via a Universal Serial Bus (USB) ) , cellular communication components, NFC components,

components (e.g.,

Low Energy) ,

components, and other communication components.

Instructions 750 may comprise software, a program, an application, an applet, an app, or other executable code for causing at least any of the processors 710 to perform any one or more of the methodologies discussed herein. The instructions 750 may reside, completely or partially, within at least one of the processors 710 (e.g., within the processor’s cache memory) , the memory/storage devices 720, or any suitable combination thereof. Furthermore, any portion of the instructions 750 may be transferred to the hardware resources 700 from any combination of the peripheral devices 704 or the databases 706. Accordingly, the memory of processors 710, the memory/storage devices 720, the peripheral devices 704, and the databases 706 are examples of computer-readable and machine-readable media.

Fig. 8 is a block diagram of an example processor platform in accordance with some embodiments of the disclosure. The processor platform 800 can be, for example, a server, a personal computer, a workstation, a self-learning machine (e.g., a neural network) , a mobile device (e.g., a cell phone, a smart phone, a tablet such as an iPad ^TM) , a personal digital assistant (PDA) , an Internet appliance, a DVD player, a CD player, a digital video recorder, a Blu-ray player, a gaming console, a personal video recorder, a set top box, a headset or other wearable device, or any other type of computing device.

The processor platform 800 of the illustrated example includes a processor 812. The processor 812 of the illustrated example is hardware. For example, the processor 812 can be implemented by one or more integrated circuits, logic circuits, microprocessors, GPUs, DSPs, or controllers from any desired family or manufacturer. The hardware processor may be a semiconductor based (e.g., silicon based) device. In some embodiments, the processor implements one or more of the methods or processes described above.

The processor 812 of the illustrated example includes a local memory 813 (e.g., a cache) . The processor 812 of the illustrated example is in communication with a main memory including a volatile memory 814 and a non-volatile memory 816 via a bus 818. The volatile memory 814 may be implemented by Synchronous Dynamic Random Access Memory (SDRAM) , Dynamic Random Access Memory (DRAM) ,

Dynamic Random Access Memory

and/or any other type of random access memory device. The non-volatile memory 816 may be implemented by flash memory and/or any other desired type of memory device. Access to the

main memory

814, 816 is controlled by a memory controller.

The processor platform 800 of the illustrated example also includes interface circuitry 820. The interface circuitry 820 may be implemented by any type of interface standard, such as an Ethernet interface, a universal serial bus (USB) , a

interface, a near field communication (NFC) interface, and/or a PCI express interface.

In the illustrated example, one or more input devices 822 are connected to the interface circuitry 820. The input device (s) 822 permit (s) a user to enter data and/or commands into the processor 812. The input device (s) can be implemented by, for example, an audio sensor, a microphone, a camera (still or video) , a keyboard, a button, a mouse, a touchscreen, a track-pad, a trackball, and/or a voice recognition system.

One or more output devices 824 are also connected to the interface circuitry 820 of the illustrated example. The output devices 824 can be implemented, for example, by display devices (e.g., a light emitting diode (LED) , an organic light emitting diode (OLED) , a liquid crystal display (LCD) , a cathode ray tube display (CRT) , an in-place switching (IPS) display, a touchscreen, etc. ) , a tactile output device, a printer and/or speaker. The interface circuitry 820 of the illustrated example, thus, typically includes a graphics driver card, a graphics driver chip and/or a graphics driver processor.

The interface circuitry 820 of the illustrated example also includes a communication device such as a transmitter, a receiver, a transceiver, a modem, a residential gateway, a wireless access point, and/or a network interface to facilitate exchange of data with external machines (e.g., computing devices of any kind) via a network 826. The communication can be via, for example, an Ethernet connection, a digital subscriber line (DSL) connection, a telephone line connection, a coaxial cable system, a satellite system, a line-of-site wireless system, a cellular telephone system, etc.

For example, the interface circuitry 820 may include a training dataset inputted through the input device (s) 822 or retrieved from the network 826.

The processor platform 800 of the illustrated example also includes one or more mass storage devices 828 for storing software and/or data. Examples of such mass storage devices 828 include floppy disk drives, hard drive disks, compact disk drives, Blu-ray disk drives, redundant array of independent disks (RAID) systems, and digital versatile disk (DVD) drives.

Machine executable instructions 832 may be stored in the mass storage device 828, in the volatile memory 814, in the non-volatile memory 816, and/or on a removable non-transitory computer readable storage medium such as a CD or DVD.

The following paragraphs describe examples of various embodiments.

Example 1 includes a method, comprising: predicting a pair of pseudo 3D corners comprising a top-left corner and a bottom-right corner for an object in an image, wherein each of the pair of pseudo 3D corners is represented by height, width, and pseudo depth of the pseudo 3D corner; modeling the pair of pseudo 3D corners for the object by structure affinity (SA) for embedding the height and width and context affinity (CA) for embedding the pseudo depth; and combining the SA and the CA via an interaction function to gain object confidence.

Example 2 includes the method of Example 1, further comprising: determining a SA value based on an Intersection-over-Union (IoU) value of a top-left box and a bottom-right box formed via corresponding regressed width and height of the predicted top-left corner and the predicted bottom-right corner respectively, and a corner drifting value which is the mean of top-left drifting and bottom-right drifting in the two boxes.

Example 3 includes the method of Example 1 or 2, further comprising: calculating the SA value by the IoU value minus a minimum value of the corner drifting value and 1.

Example 4 includes the method of any of Examples 1-3, wherein a range of the SA value is -1 to 1 and the closer the SA value is to 1, the higher the possibility that the predicted top-left corner and the predicted bottom-right corner belong to a same object.

Example 5 includes the method of any of Examples 1-4, further comprising: generating SA regression maps for the predicted top-left corner and the predicted bottom-right corner to regress the width and height of each corner at a ground-truth corner location; and obtaining the regressed width and height at each corner location after obtaining the SA regression maps.

Example 6 includes the method of any of Examples 1-5, further comprising: applying SmoothL1 Loss to mine the width and height of each corner.

Example 7 includes the method of any of Examples 1-6, further comprising: calculating a CA value by applying tanh function to normalize a distance of embedding values of the predicted top-left corner and the predicted bottom-right corner.

Example 8 includes the method of any of Examples 1-7, wherein the embedding values of the predicted top-left corner and the predicted bottom-right corner are predicted using an Associative Embedding method.

Example 9 includes the method of any of Examples 1-8, further comprising: predicting the embedding values for the predicted top-left corner and the predicted bottom-right corner by using Pull Loss to close embedding distance of paired corners and Push Loss to separate embedding distance of irrelevant corners.

Example 10 includes the method of any of Examples 1-9, wherein a range of the CA value is 0 to 1, the closer the CA value is to 1, the lower the possibility that the predicted top-left corner and the predicted bottom-right corner belong to a same object.

Example 11 includes the method of any of Examples 1-10, wherein the combining the SA and the CA via an interaction function to gain object confidence comprising: grouping the predicted top-left corner and the predicted bottom-right corner together when a combined value of the SA value and the CA value is large, wherein the combined value is large only when the SA value is large and the CA value is small.

Example 12 includes the method of any of Examples 1-11, wherein the object confidence is represented as Corner Affinity, and the interaction function is as follows:

where σ is a manually set Gaussian variance.

Example 13 includes an apparatus, comprising: interface circuitry; and processor circuitry coupled to the interface circuitry and configured to: : predict a pair of pseudo 3D corners comprising a top-left corner and a bottom-right corner for an object in an image, wherein each of the pair of pseudo 3D corners is represented by height, width, and pseudo depth of the pseudo 3D corner; model the pair of pseudo 3D corners for the object by structure affinity (SA) for embedding the height and width and context affinity (CA) for embedding the pseudo depth; and combine the SA and the CA via an interaction function to gain object confidence.

Example 14 includes the apparatus of Example 13, wherein the processing circuitry is further to: determine a SA value based on an Intersection-over-Union (IoU) value of a top-left box and a bottom-right box formed via corresponding regressed width and height of the predicted top-left corner and the predicted bottom-right corner respectively, and a corner drifting value which is the mean of top-left drifting and bottom-right drifting in the two boxes.

Example 15 includes the apparatus of Example 13 or 14, wherein the processing circuitry is further to: calculate the SA value by the IoU value minus a minimum value of the corner drifting value and 1.

Example 16 includes the apparatus of any of Examples 13-15, wherein a range of the SA value is -1 to 1 and the closer the SA value is to 1, the higher the possibility that the predicted top-left corner and the predicted bottom-right corner belong to a same object.

Example 17 includes the apparatus of any of Examples 13-16, wherein the processing circuitry is further to: generate SA regression maps for the predicted top-left corner and the predicted bottom-right corner to regress the width and height of each corner at a ground-truth corner location; and obtain the regressed width and height at each corner location after obtaining the SA regression maps.

Example 18 includes the apparatus of any of Examples 13-17, wherein the processing circuitry is further to: apply SmoothL1 Loss to mine the width and height of each corner.

Example 19 includes the apparatus of any of Examples 13-18, wherein the processing circuitry is further to: calculate a CA value by applying tanh function to normalize a distance of embedding values of the predicted top-left corner and the predicted bottom-right corner.

Example 20 includes the apparatus of any of Examples 13-19, wherein the embedding values of the predicted top-left corner and the predicted bottom-right corner are predicted using an Associative Embedding method.

Example 21 includes the apparatus of any of Examples 13-20, wherein the processing circuitry is further to: predict the embedding values for the predicted top-left corner and the predicted bottom-right corner by using Pull Loss to close embedding distance of paired corners and Push Loss to separate embedding distance of irrelevant corners.

Example 22 includes the apparatus of any of Examples 13-21, wherein a range of the CA value is 0 to 1, the closer the CA value is to 1, the lower the possibility that the predicted top-left corner and the predicted bottom-right corner belong to a same object.

Example 23 includes the apparatus of any of Examples 13-22, wherein the processing circuitry is further to: group the predicted top-left corner and the predicted bottom-right corner together when a combined value of the SA value and the CA value is large, wherein the combined value is large only when the SA value is large and the CA value is small.

Example 24 includes the apparatus of any of Examples 13-23, wherein the object confidence is represented as Corner Affinity, and the interaction function is as follows:

where σ is a manually set Gaussian variance.

Example 25 includes a computer-readable medium having instructions stored thereon, wherein the instructions, when executed by processor circuitry, cause the processor circuitry to perform a method, the method comprising: predicting a pair of pseudo 3D corners comprising a top-left corner and a bottom-right corner for an object in an image, wherein each of the pair of pseudo 3D corners is represented by height, width, and pseudo depth of the pseudo 3D corner; modeling the pair of pseudo 3D corners for the object by structure affinity (SA) for embedding the height and width and context affinity (CA) for embedding the pseudo depth; and combining the SA and the CA via an interaction function to gain object confidence.

Example 26 includes the computer-readable medium of Example 25, the method further comprising: determining a SA value based on an Intersection-over-Union (IoU) value of a top-left box and a bottom-right box formed via corresponding regressed width and height of the predicted top-left corner and the predicted bottom-right corner respectively, and a corner drifting value which is the mean of top-left drifting and bottom-right drifting in the two boxes.

Example 27 includes the computer-readable medium of Example 25 or 26, the method further comprising: calculating the SA value by the IoU value minus a minimum value of the corner drifting value and 1.

Example 28 includes the computer-readable medium of any of Examples 25-27, wherein a range of the SA value is -1 to 1 and the closer the SA value is to 1, the higher the possibility that the predicted top-left corner and the predicted bottom-right corner belong to a same object.

Example 29 includes the computer-readable medium of any of Examples 25-28, the method further comprising: generating SA regression maps for the predicted top-left corner and the predicted bottom-right corner to regress the width and height of each corner at a ground-truth corner location; and obtaining the regressed width and height at each corner location after obtaining the SA regression maps.

Example 30 includes the computer-readable medium of any of Examples 25-29, the method further comprising: applying SmoothL1 Loss to mine the width and height of each corner.

Example 31 includes the computer-readable medium of any of Examples 25-30, the method further comprising: calculating a CA value by applying tanh function to normalize a distance of embedding values of the predicted top-left corner and the predicted bottom-right corner.

Example 32 includes the computer-readable medium of any of Examples 25-31, wherein the embedding values of the predicted top-left corner and the predicted bottom-right corner are predicted using an Associative Embedding method.

Example 33 includes the computer-readable medium of any of Examples 25-32, the method further comprising: predicting the embedding values for the predicted top-left corner and the predicted bottom-right corner by using Pull Loss to close embedding distance of paired corners and Push Loss to separate embedding distance of irrelevant corners.

Example 34 includes the computer-readable medium of any of Examples 25-33, wherein a range of the CA value is 0 to 1, the closer the CA value is to 1, the lower the possibility that the predicted top-left corner and the predicted bottom-right corner belong to a same object.

Example 35 includes the computer-readable medium of any of Examples 25-34, wherein the combining the SA and the CA via an interaction function to gain object confidence comprising: grouping the predicted top-left corner and the predicted bottom-right corner together when a combined value of the SA value and the CA value is large, wherein the combined value is large only when the SA value is large and the CA value is small.

Example 36 includes the computer-readable medium of any of Examples 25-35, wherein the object confidence is represented as Corner Affinity, and the interaction function is as follows:

where σ is a manually set Gaussian variance.

Example 37 includes a device, comprising: means for predicting a pair of pseudo 3D corners comprising a top-left corner and a bottom-right corner for an object in an image, wherein each of the pair of pseudo 3D corners is represented by height, width, and pseudo depth of the pseudo 3D corner; means for modeling the pair of pseudo 3D corners for the object by structure affinity (SA) for embedding the height and width and context affinity (CA) for embedding the pseudo depth; and means for combining the SA and the CA via an interaction function to gain object confidence.

Example 38 includes the device of Example 37, further comprising: means for determining a SA value based on an Intersection-over-Union (IoU) value of a top-left box and a bottom-right box formed via corresponding regressed width and height of the predicted top-left corner and the predicted bottom-right corner respectively, and a corner drifting value which is the mean of top-left drifting and bottom-right drifting in the two boxes.

Example 39 includes the device of Example 37 or 38, further comprising: means for calculating the SA value by the IoU value minus a minimum value of the corner drifting value and 1.

Example 40 includes the device of any of Examples 37-39, wherein a range of the SA value is -1 to 1 and the closer the SA value is to 1, the higher the possibility that the predicted top-left corner and the predicted bottom-right corner belong to a same object.

Example 41 includes the device of any of Examples 37-40, further comprising: means for generating SA regression maps for the predicted top-left corner and the predicted bottom-right corner to regress the width and height of each corner at a ground-truth corner location; and means for obtaining the regressed width and height at each corner location after obtaining the SA regression maps.

Example 42 includes the device of any of Examples 37-41, further comprising: means for applying SmoothL1 Loss to mine the width and height of each corner.

Example 43 includes the device of any of Examples 37-42, further comprising: means for calculating a CA value by applying tanh function to normalize a distance of embedding values of the predicted top-left corner and the predicted bottom-right corner.

Example 44 includes the device of any of Examples 37-43, wherein the embedding values of the predicted top-left corner and the predicted bottom-right corner are predicted using an Associative Embedding method.

Example 45 includes the device of any of Examples 37-44, further comprising: means for predicting the embedding values for the predicted top-left corner and the predicted bottom-right corner by using Pull Loss to close embedding distance of paired corners and Push Loss to separate embedding distance of irrelevant corners.

Example 46 includes the device of any of Examples 37-45, wherein a range of the CA value is 0 to 1, the closer the CA value is to 1, the lower the possibility that the predicted top-left corner and the predicted bottom-right corner belong to a same object.

Example 47 includes the device of any of Examples 37-46, further comprising: means for grouping the predicted top-left corner and the predicted bottom-right corner together when a combined value of the SA value and the CA value is large, wherein the combined value is large only when the SA value is large and the CA value is small.

Example 48 includes the device of any of Examples 37-47, wherein the object confidence is represented as Corner Affinity, and the interaction function is as follows:

where σ is a manually set Gaussian variance.

Example 49 includes an apparatus as shown and described in the description.

Example 50 includes a method performed at an apparatus as shown and described in the description.

The above description is intended to be illustrative, and not restrictive. For example, the above-described examples (or one or more aspects thereof) may be used in combination with each other. Other embodiments may be used, such as by one of ordinary skill in the art upon reviewing the above description. The Abstract is to allow the reader to quickly ascertain the nature of the technical disclosure and is submitted with the understanding that it will not be used to interpret or limit the scope or meaning of the claims. Also, in the above Detailed Description, various features may be grouped together to streamline the disclosure. This should not be interpreted as intending that an unclaimed disclosed feature is essential to any claim. Rather, inventive subject matter may lie in less than all features of a particular disclosed embodiment. Thus, the following claims are hereby incorporated into the Detailed Description, with each claim standing on its own as a separate embodiment. The scope of the embodiments should be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.

Although certain embodiments have been illustrated and described herein for purposes of description, a wide variety of alternate and/or equivalent embodiments or implementations calculated to achieve the same purposes may be substituted for the embodiments shown and described without departing from the scope of the present disclosure. The disclosure is intended to cover any adaptations or variations of the embodiments discussed herein. Therefore, it is manifestly intended that embodiments described herein be limited only by the appended claims and the equivalents thereof.

Claims

A method, comprising:

predicting a pair of pseudo 3D corners comprising a top-left corner and a bottom-right corner for an object in an image, wherein each of the pair of pseudo 3D corners is represented by height, width, and pseudo depth of the pseudo 3D corner;

modeling the pair of pseudo 3D corners for the object by structure affinity (SA) for embedding the height and width and context affinity (CA) for embedding the pseudo depth; and

combining the SA and the CA via an interaction function to gain object confidence.
The method of claim 1, further comprising:

determining a SA value based on an Intersection-over-Union (IoU) value of a top-left box and a bottom-right box formed via corresponding regressed width and height of the predicted top-left corner and the predicted bottom-right corner respectively, and a corner drifting value which is the mean of top-left drifting and bottom-right drifting in the two boxes.
The method of claim 2, further comprising:

calculating the SA value by the IoU value minus a minimum value of the corner drifting value and 1.
The method of claim 3, wherein a range of the SA value is -1 to 1 and the closer the SA value is to 1, the higher the possibility that the predicted top-left corner and the predicted bottom-right corner belong to a same object.
The method of claim 2, further comprising:

generating SA regression maps for the predicted top-left corner and the predicted bottom-right corner to regress the width and height of each corner at a ground-truth corner location; and

obtaining the regressed width and height at each corner location after obtaining the SA regression maps.
The method of claim 5, further comprising:

applying SmoothL1 Loss to mine the width and height of each corner.
The method of claim 2, further comprising:

calculating a CA value by applying tanh function to normalize a distance of embedding values of the predicted top-left corner and the predicted bottom-right corner.
The method of claim 7, wherein the embedding values of the predicted top-left corner and the predicted bottom-right corner are predicted using an Associative Embedding method.
The method of claim 8, further comprising:

predicting the embedding values for the predicted top-left corner and the predicted bottom-right corner by using Pull Loss to close embedding distance of paired corners and Push Loss to separate embedding distance of irrelevant corners.
The method of claim 7, wherein a range of the CA value is 0 to 1, the closer the CA value is to 1, the lower the possibility that the predicted top-left corner and the predicted bottom-right corner belong to a same object.
The method of claim 7, wherein the combining the SA and the CA via an interaction function to gain object confidence comprising:

grouping the predicted top-left corner and the predicted bottom-right corner together when a combined value of the SA value and the CA value is large, wherein the combined value is large only when the SA value is large and the CA value is small.
The method of claim 11, wherein the object confidence is represented as Corner Affinity, and the interaction function is as follows:

where σ is a manually set Gaussian variance.
An apparatus, comprising:

interface circuitry; and

processor circuitry coupled to the interface circuitry and configured to:

predict a pair of pseudo 3D corners comprising a top-left corner and a bottom-right corner for an object in an image, wherein each of the pair of pseudo 3D corners is represented by height, width, and pseudo depth of the pseudo 3D corner;

model the pair of pseudo 3D corners for the object by structure affinity (SA) for embedding the height and width and context affinity (CA) for embedding the pseudo depth; and

combine the SA and the CA via an interaction function to gain object confidence.
The apparatus of claim 13, wherein the processing circuitry is further to:

determine a SA value based on an Intersection-over-Union (IoU) value of a top-left box and a bottom-right box formed via corresponding regressed width and height of the predicted top-left corner and the predicted bottom-right corner respectively, and a corner drifting value which is the mean of top-left drifting and bottom-right drifting in the two boxes.
The apparatus of claim 2, wherein the processing circuitry is further to:

calculate the SA value by the IoU value minus a minimum value of the corner drifting value and 1.
The apparatus of claim 15, wherein a range of the SA value is -1 to 1 and the closer the SA value is to 1, the higher the possibility that the predicted top-left corner and the predicted bottom-right corner belong to a same object.
The apparatus of claim 14, wherein the processing circuitry is further to:

generate SA regression maps for the predicted top-left corner and the predicted bottom-right corner to regress the width and height of each corner at a ground-truth corner location; and

obtain the regressed width and height at each corner location after obtaining the SA regression maps.
The apparatus of claim 17, wherein the processing circuitry is further to:

apply SmoothL1 Loss to mine the width and height of each corner.
The apparatus of claim 14, wherein the processing circuitry is further to:

calculate a CA value by applying tanh function to normalize a distance of embedding values of the predicted top-left corner and the predicted bottom-right corner.
The apparatus of claim 19, wherein the embedding values of the predicted top-left corner and the predicted bottom-right corner are predicted using an Associative Embedding method.
The apparatus of claim 20, wherein the processing circuitry is further to:

predict the embedding values for the predicted top-left corner and the predicted bottom-right corner by using Pull Loss to close embedding distance of paired corners and Push Loss to separate embedding distance of irrelevant corners.
The apparatus of claim 19, wherein a range of the CA value is 0 to 1, the closer the CA value is to 1, the lower the possibility that the predicted top-left corner and the predicted bottom-right corner belong to a same object.
The apparatus of claim 19, wherein the processing circuitry is further to:

group the predicted top-left corner and the predicted bottom-right corner together when a combined value of the SA value and the CA value is large, wherein the combined value is large only when the SA value is large and the CA value is small.
The apparatus of claim 23, wherein the object confidence is represented as Corner Affinity, and the interaction function is as follows:

where σ is a manually set Gaussian variance.
A computer-readable medium having instructions stored thereon, wherein the instructions, when executed by processor circuitry, cause the processor circuitry to perform the method of any of claims 1 to 12.