WO2023245635A1 - Apparatus and method for object detection - Google Patents

Apparatus and method for object detection

Info

Publication number: WO2023245635A1
Authority: WO (WIPO (PCT))
Prior art keywords: corner, value, predicted, embedding, pseudo
Application number: PCT/CN2022/101174
Other languages: French (fr)
Inventors: Haoran WEI, Ping Guo, Bing Wang, Peng Wang
Original Assignee: Intel Corporation
Application filed by Intel Corporation
Priority to PCT/CN2022/101174
Publication of WO2023245635A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V 10/20: Image preprocessing
    • G06V 10/25: Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V 20/00: Scenes; Scene-specific elements
    • G06V 20/60: Type of objects
    • G06V 20/64: Three-dimensional objects
    • G06V 20/647: Three-dimensional objects by matching two-dimensional images to three-dimensional objects

Definitions

  • Table 1 shows comparisons with different object detectors on the COCO dataset, on both the test-dev and val sets.
  • the method proposed in the disclosure brings a 5.8% boost in AP, from 40.5% to 46.3%, over the CornerNet baseline on the COCO test-dev set.
  • the improved CornerNet surpasses popular detection baselines by a large margin, showing that corner-guided detectors have a high performance ceiling and that the proposed method remarkably advances this class of detector by optimizing corner grouping.
  • the pure grouping optimization produces larger improvements for vanilla CornerNet than the other two single-stage variants, i.e., CenterNet and CentripetalNet; these two not only optimize corner grouping upon CornerNet but also use stronger corner-enhanced features, yet still attain lower accuracy than the pure grouping optimization.
  • the experimental results indicate that the proposed method is the strongest corner grouping strategy at present.
  • Table 1 further shows comparisons with recent Transformer-based approaches, such as DETR-style advanced self-attention-based single-stage detectors; a higher AP75 (Average Precision at an IoU threshold of 0.75) indicates higher-quality detection boxes.
  • Table 2 shows comparisons with different object detectors on the Citypersons and UCAS-AOD datasets, where AP_c means the AP obtained on Citypersons and AP_u means the AP obtained on UCAS-AOD.
  • the proposed method, tested on three public datasets (COCO, Citypersons, and UCAS-AOD), significantly increases the mean Average Precision (mAP) on all of them; especially on Citypersons, where people are densely occluded (examples are shown in Fig. 5), the proposed method boosts the mAP by 35.8%.
  • Fig. 5 illustrates visualization comparisons of the proposed method with classical corner-guided object detectors, i.e., CornerNet and CenterNet, on the Citypersons dataset to show the robustness of the newly devised grouping algorithm.
  • with CornerNet, some spurious keypoints in the background are grouped as pairs and form meaningless boxes.
  • CenterNet filters false positives via an object's center point yet cannot overcome the situation where the center of a third object lies right at the center of a false positive.
  • the proposed method in the disclosure can overcome the object occlusion problem in such pedestrian scenarios.
  • Fig. 6 is a flow chart illustrating an exemplary method 600 for object detection in accordance with some embodiments of the disclosure.
  • the method 600 may include blocks S610-S630.
  • at block S610, a pair of pseudo 3D corners comprising a top-left corner and a bottom-right corner for an object in an image may be predicted, wherein each of the pair of pseudo 3D corners is represented by height, width, and pseudo depth of the pseudo 3D corner.
  • at block S620, the pair of pseudo 3D corners for the object may be modeled by structure affinity (SA) for embedding the height and width and context affinity (CA) for embedding the pseudo depth.
  • at block S630, the SA and the CA may be combined via an interaction function to gain object confidence.
  • the method 600 may include more or fewer steps; the disclosure is not limited in this aspect. Also, the method 600 may be understood in conjunction with the embodiments described above.
  • the method 600 may further include determining the SA value based on an Intersection-over-Union (IoU) value of a top-left box and a bottom-right box, formed via the corresponding regressed width and height of the predicted top-left corner and the predicted bottom-right corner respectively, and a corner drifting value, which is the mean of the top-left drifting and the bottom-right drifting of the two boxes.
  • the method 600 may further include calculating the SA value as the IoU value minus the minimum of the corner drifting value and 1.
  • the SA value may be calculated as SA = IoU (box_tl, box_br) − min ((d_tl + d_br) / (2D), 1); see Eq. (1) of the detailed description.
  • a range of the SA value may be -1 to 1, and the closer the SA value is to 1, the higher the possibility that the predicted top-left corner and the predicted bottom-right corner belong to a same object.
  • the method 600 may further include generating SA regression maps for the predicted top-left corner and the predicted bottom-right corner to regress the width and height of each corner at a ground-truth corner location; and obtaining the regressed width and height at each corner location after obtaining the SA regression maps.
  • the method 600 may further include applying SmoothL1 Loss to mine the width and height of each corner.
  • the method 600 may further include calculating the CA value by applying a tanh function to normalize the distance of the embedding values of the predicted top-left corner and the predicted bottom-right corner.
  • the CA value may be calculated as CA = tanh (|e_tl − e_br|); see Eq. (3) of the detailed description.
  • the embedding values of the predicted top-left corner and the predicted bottom-right corner may be predicted using an Associative Embedding method.
  • the method 600 may further include: predicting the embedding values for the predicted top-left corner and the predicted bottom-right corner by using Pull Loss to close embedding distance of paired corners and Push Loss to separate embedding distance of irrelevant corners.
  • the embedding values for the predicted top-left corner and the predicted bottom-right corner may be predicted based on the “pull” and “push” losses; see Eq. (2) of the detailed description.
  • a range of the CA value is 0 to 1, and the closer the CA value is to 1, the lower the possibility that the predicted top-left corner and the predicted bottom-right corner belong to a same object.
  • the combining the SA and the CA via an interaction function to gain the object confidence includes: grouping the predicted top-left corner and the predicted bottom-right corner together when a combined value of the SA value and the CA value is large, wherein the combined value is large only when the SA value is large and the CA value is small.
  • the interaction function is as follows: CornerAffinity = SA · exp (−CA² / (2σ²)), where σ is a manually set Gaussian variance; see Eq. (4) of the detailed description.
  • the present disclosure provides a novel corner representation method, termed pseudo 3D corner representation, to address the object occlusion problem, wherein the pseudo 3D representation may include the dimensions of height, width, and pseudo depth; the height and width may be embedded in a proposed SA module and the pseudo depth may be embedded in a proposed CA module.
  • through the promotion of dimensions from 2D to 3D, overlapping detection boxes can be distinguished and the dense object occlusion problem can be overcome.
  • Fig. 7 is a block diagram illustrating components, according to some example embodiments, able to read instructions from a machine-readable or computer-readable medium (e.g., a non-transitory machine-readable storage medium) and perform any one or more of the methodologies discussed herein.
  • Fig. 7 shows a diagrammatic representation of hardware resources 700 including one or more processors (or processor cores) 710, one or more memory/storage devices 720, and one or more communication resources 730, each of which may be communicatively coupled via a bus 740.
  • for implementations in which node virtualization (e.g., network function virtualization (NFV)) is utilized, a hypervisor 702 may be executed to provide an execution environment for one or more network slices/sub-slices to utilize the hardware resources 700.
  • the processors 710 may include, for example, a processor 712 and a processor 714 which may be, e.g., a central processing unit (CPU) , a reduced instruction set computing (RISC) processor, a complex instruction set computing (CISC) processor, a graphics processing unit (GPU) , a digital signal processor (DSP) such as a baseband processor, an application specific integrated circuit (ASIC) , a radio-frequency integrated circuit (RFIC) , another processor, or any suitable combination thereof.
  • the memory/storage devices 720 may include main memory, disk storage, or any suitable combination thereof.
  • the memory/storage devices 720 may include, but are not limited to, any type of volatile or non-volatile memory such as dynamic random access memory (DRAM), static random-access memory (SRAM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), Flash memory, solid-state storage, etc.
  • the communication resources 730 may include interconnection or network interface components or other suitable devices to communicate with one or more peripheral devices 704 or one or more databases 706 via a network 708.
  • the communication resources 730 may include wired communication components (e.g., for coupling via a Universal Serial Bus (USB)), cellular communication components, NFC components, Bluetooth components (e.g., Bluetooth Low Energy components), Wi-Fi components, and other communication components.
  • Instructions 750 may comprise software, a program, an application, an applet, an app, or other executable code for causing at least any of the processors 710 to perform any one or more of the methodologies discussed herein.
  • the instructions 750 may reside, completely or partially, within at least one of the processors 710 (e.g., within the processor’s cache memory) , the memory/storage devices 720, or any suitable combination thereof.
  • any portion of the instructions 750 may be transferred to the hardware resources 700 from any combination of the peripheral devices 704 or the databases 706.
  • the memory of processors 710, the memory/storage devices 720, the peripheral devices 704, and the databases 706 are examples of computer-readable and machine-readable media.
  • Fig. 8 is a block diagram of an example processor platform in accordance with some embodiments of the disclosure.
  • the processor platform 800 can be, for example, a server, a personal computer, a workstation, a self-learning machine (e.g., a neural network), a mobile device (e.g., a cell phone, a smart phone, a tablet such as an iPad™), a personal digital assistant (PDA), an Internet appliance, a DVD player, a CD player, a digital video recorder, a Blu-ray player, a gaming console, a personal video recorder, a set top box, a headset or other wearable device, or any other type of computing device.
  • the processor platform 800 of the illustrated example includes a processor 812.
  • the processor 812 of the illustrated example is hardware.
  • the processor 812 can be implemented by one or more integrated circuits, logic circuits, microprocessors, GPUs, DSPs, or controllers from any desired family or manufacturer.
  • the hardware processor may be a semiconductor based (e.g., silicon based) device.
  • the processor implements one or more of the methods or processes described above.
  • the processor 812 of the illustrated example includes a local memory 813 (e.g., a cache) .
  • the processor 812 of the illustrated example is in communication with a main memory including a volatile memory 814 and a non-volatile memory 816 via a bus 818.
  • the volatile memory 814 may be implemented by Synchronous Dynamic Random Access Memory (SDRAM), Dynamic Random Access Memory (DRAM), RAMBUS Dynamic Random Access Memory (RDRAM) and/or any other type of random access memory device.
  • the non-volatile memory 816 may be implemented by flash memory and/or any other desired type of memory device. Access to the main memory 814, 816 is controlled by a memory controller.
  • the processor platform 800 of the illustrated example also includes interface circuitry 820.
  • the interface circuitry 820 may be implemented by any type of interface standard, such as an Ethernet interface, a universal serial bus (USB) interface, a Bluetooth interface, a near field communication (NFC) interface, and/or a PCI express interface.
  • one or more input devices 822 are connected to the interface circuitry 820.
  • the input device (s) 822 permit (s) a user to enter data and/or commands into the processor 812.
  • the input device (s) can be implemented by, for example, an audio sensor, a microphone, a camera (still or video) , a keyboard, a button, a mouse, a touchscreen, a track-pad, a trackball, and/or a voice recognition system.
  • One or more output devices 824 are also connected to the interface circuitry 820 of the illustrated example.
  • the output devices 824 can be implemented, for example, by display devices (e.g., a light emitting diode (LED) , an organic light emitting diode (OLED) , a liquid crystal display (LCD) , a cathode ray tube display (CRT) , an in-place switching (IPS) display, a touchscreen, etc. ) , a tactile output device, a printer and/or speaker.
  • the interface circuitry 820 of the illustrated example thus, typically includes a graphics driver card, a graphics driver chip and/or a graphics driver processor.
  • the interface circuitry 820 of the illustrated example also includes a communication device such as a transmitter, a receiver, a transceiver, a modem, a residential gateway, a wireless access point, and/or a network interface to facilitate exchange of data with external machines (e.g., computing devices of any kind) via a network 826.
  • the communication can be via, for example, an Ethernet connection, a digital subscriber line (DSL) connection, a telephone line connection, a coaxial cable system, a satellite system, a line-of-sight wireless system, a cellular telephone system, etc.
  • the interface circuitry 820 may receive a training dataset inputted through the input device(s) 822 or retrieved from the network 826.
  • the processor platform 800 of the illustrated example also includes one or more mass storage devices 828 for storing software and/or data.
  • mass storage devices 828 include floppy disk drives, hard disk drives, compact disk drives, Blu-ray disk drives, redundant array of independent disks (RAID) systems, and digital versatile disk (DVD) drives.
  • Machine executable instructions 832 may be stored in the mass storage device 828, in the volatile memory 814, in the non-volatile memory 816, and/or on a removable non-transitory computer readable storage medium such as a CD or DVD.
  • Example 1 includes a method, comprising: predicting a pair of pseudo 3D corners comprising a top-left corner and a bottom-right corner for an object in an image, wherein each of the pair of pseudo 3D corners is represented by height, width, and pseudo depth of the pseudo 3D corner; modeling the pair of pseudo 3D corners for the object by structure affinity (SA) for embedding the height and width and context affinity (CA) for embedding the pseudo depth; and combining the SA and the CA via an interaction function to gain object confidence.
  • Example 2 includes the method of Example 1, further comprising: determining an SA value based on an Intersection-over-Union (IoU) value of a top-left box and a bottom-right box formed via corresponding regressed width and height of the predicted top-left corner and the predicted bottom-right corner respectively, and a corner drifting value which is the mean of top-left drifting and bottom-right drifting in the two boxes.
  • Example 3 includes the method of Example 1 or 2, further comprising: calculating the SA value as the IoU value minus the minimum of the corner drifting value and 1.
  • Example 4 includes the method of any of Examples 1-3, wherein a range of the SA value is -1 to 1 and the closer the SA value is to 1, the higher the possibility that the predicted top-left corner and the predicted bottom-right corner belong to a same object.
  • Example 5 includes the method of any of Examples 1-4, further comprising: generating SA regression maps for the predicted top-left corner and the predicted bottom-right corner to regress the width and height of each corner at a ground-truth corner location; and obtaining the regressed width and height at each corner location after obtaining the SA regression maps.
  • Example 6 includes the method of any of Examples 1-5, further comprising: applying SmoothL1 Loss to mine the width and height of each corner.
  • Example 7 includes the method of any of Examples 1-6, further comprising: calculating a CA value by applying a tanh function to normalize a distance of embedding values of the predicted top-left corner and the predicted bottom-right corner.
  • Example 8 includes the method of any of Examples 1-7, wherein the embedding values of the predicted top-left corner and the predicted bottom-right corner are predicted using an Associative Embedding method.
  • Example 9 includes the method of any of Examples 1-8, further comprising: predicting the embedding values for the predicted top-left corner and the predicted bottom-right corner by using Pull Loss to close embedding distance of paired corners and Push Loss to separate embedding distance of irrelevant corners.
  • Example 10 includes the method of any of Examples 1-9, wherein a range of the CA value is 0 to 1, and the closer the CA value is to 1, the lower the possibility that the predicted top-left corner and the predicted bottom-right corner belong to a same object.
  • Example 11 includes the method of any of Examples 1-10, wherein the combining the SA and the CA via an interaction function to gain object confidence comprises: grouping the predicted top-left corner and the predicted bottom-right corner together when a combined value of the SA value and the CA value is large, wherein the combined value is large only when the SA value is large and the CA value is small.
  • Example 12 includes the method of any of Examples 1-11, wherein the object confidence is represented as Corner Affinity, and the interaction function is as follows: CornerAffinity = SA · exp (−CA² / (2σ²)), where σ is a manually set Gaussian variance.
  • Example 13 includes an apparatus, comprising: interface circuitry; and processor circuitry coupled to the interface circuitry and configured to: predict a pair of pseudo 3D corners comprising a top-left corner and a bottom-right corner for an object in an image, wherein each of the pair of pseudo 3D corners is represented by height, width, and pseudo depth of the pseudo 3D corner; model the pair of pseudo 3D corners for the object by structure affinity (SA) for embedding the height and width and context affinity (CA) for embedding the pseudo depth; and combine the SA and the CA via an interaction function to gain object confidence.
  • Example 14 includes the apparatus of Example 13, wherein the processor circuitry is further to: determine an SA value based on an Intersection-over-Union (IoU) value of a top-left box and a bottom-right box formed via corresponding regressed width and height of the predicted top-left corner and the predicted bottom-right corner respectively, and a corner drifting value which is the mean of top-left drifting and bottom-right drifting in the two boxes.
  • Example 15 includes the apparatus of Example 13 or 14, wherein the processor circuitry is further to: calculate the SA value as the IoU value minus the minimum of the corner drifting value and 1.
  • Example 16 includes the apparatus of any of Examples 13-15, wherein a range of the SA value is -1 to 1 and the closer the SA value is to 1, the higher the possibility that the predicted top-left corner and the predicted bottom-right corner belong to a same object.
  • Example 17 includes the apparatus of any of Examples 13-16, wherein the processor circuitry is further to: generate SA regression maps for the predicted top-left corner and the predicted bottom-right corner to regress the width and height of each corner at a ground-truth corner location; and obtain the regressed width and height at each corner location after obtaining the SA regression maps.
  • Example 18 includes the apparatus of any of Examples 13-17, wherein the processor circuitry is further to: apply SmoothL1 Loss to mine the width and height of each corner.
  • Example 19 includes the apparatus of any of Examples 13-18, wherein the processor circuitry is further to: calculate a CA value by applying a tanh function to normalize a distance of embedding values of the predicted top-left corner and the predicted bottom-right corner.
  • Example 20 includes the apparatus of any of Examples 13-19, wherein the embedding values of the predicted top-left corner and the predicted bottom-right corner are predicted using an Associative Embedding method.
  • Example 21 includes the apparatus of any of Examples 13-20, wherein the processor circuitry is further to: predict the embedding values for the predicted top-left corner and the predicted bottom-right corner by using Pull Loss to close embedding distance of paired corners and Push Loss to separate embedding distance of irrelevant corners.
  • Example 22 includes the apparatus of any of Examples 13-21, wherein a range of the CA value is 0 to 1, and the closer the CA value is to 1, the lower the possibility that the predicted top-left corner and the predicted bottom-right corner belong to a same object.
  • Example 23 includes the apparatus of any of Examples 13-22, wherein the processor circuitry is further to: group the predicted top-left corner and the predicted bottom-right corner together when a combined value of the SA value and the CA value is large, wherein the combined value is large only when the SA value is large and the CA value is small.
  • Example 24 includes the apparatus of any of Examples 13-23, wherein the object confidence is represented as Corner Affinity, and the interaction function is as follows: CornerAffinity = SA · exp (−CA² / (2σ²)), where σ is a manually set Gaussian variance.
  • Example 25 includes a computer-readable medium having instructions stored thereon, wherein the instructions, when executed by processor circuitry, cause the processor circuitry to perform a method, the method comprising: predicting a pair of pseudo 3D corners comprising a top-left corner and a bottom-right corner for an object in an image, wherein each of the pair of pseudo 3D corners is represented by height, width, and pseudo depth of the pseudo 3D corner; modeling the pair of pseudo 3D corners for the object by structure affinity (SA) for embedding the height and width and context affinity (CA) for embedding the pseudo depth; and combining the SA and the CA via an interaction function to gain object confidence.
  • Example 26 includes the computer-readable medium of Example 25, the method further comprising: determining an SA value based on an Intersection-over-Union (IoU) value of a top-left box and a bottom-right box formed via corresponding regressed width and height of the predicted top-left corner and the predicted bottom-right corner respectively, and a corner drifting value which is the mean of top-left drifting and bottom-right drifting in the two boxes.
  • Example 27 includes the computer-readable medium of Example 25 or 26, the method further comprising: calculating the SA value as the IoU value minus the minimum of the corner drifting value and 1.
  • Example 28 includes the computer-readable medium of any of Examples 25-27, wherein a range of the SA value is -1 to 1 and the closer the SA value is to 1, the higher the possibility that the predicted top-left corner and the predicted bottom-right corner belong to a same object.
  • Example 29 includes the computer-readable medium of any of Examples 25-28, the method further comprising: generating SA regression maps for the predicted top-left corner and the predicted bottom-right corner to regress the width and height of each corner at a ground-truth corner location; and obtaining the regressed width and height at each corner location after obtaining the SA regression maps.
  • Example 30 includes the computer-readable medium of any of Examples 25-29, the method further comprising: applying SmoothL1 Loss to mine the width and height of each corner.
  • Example 31 includes the computer-readable medium of any of Examples 25-30, the method further comprising: calculating a CA value by applying a tanh function to normalize a distance of embedding values of the predicted top-left corner and the predicted bottom-right corner.
  • Example 32 includes the computer-readable medium of any of Examples 25-31, wherein the embedding values of the predicted top-left corner and the predicted bottom-right corner are predicted using an Associative Embedding method.
  • Example 33 includes the computer-readable medium of any of Examples 25-32, the method further comprising: predicting the embedding values for the predicted top-left corner and the predicted bottom-right corner by using Pull Loss to close embedding distance of paired corners and Push Loss to separate embedding distance of irrelevant corners.
  • Example 34 includes the computer-readable medium of any of Examples 25-33, wherein a range of the CA value is 0 to 1, and the closer the CA value is to 1, the lower the possibility that the predicted top-left corner and the predicted bottom-right corner belong to a same object.
  • Example 35 includes the computer-readable medium of any of Examples 25-34, wherein the combining the SA and the CA via an interaction function to gain object confidence comprises: grouping the predicted top-left corner and the predicted bottom-right corner together when a combined value of the SA value and the CA value is large, wherein the combined value is large only when the SA value is large and the CA value is small.
  • Example 36 includes the computer-readable medium of any of Examples 25-35, wherein the object confidence is represented as Corner Affinity, and the interaction function is as follows: CornerAffinity = SA · exp (−CA² / (2σ²)), where σ is a manually set Gaussian variance.
  • Example 37 includes a device, comprising: means for predicting a pair of pseudo 3D corners comprising a top-left corner and a bottom-right corner for an object in an image, wherein each of the pair of pseudo 3D corners is represented by height, width, and pseudo depth of the pseudo 3D corner; means for modeling the pair of pseudo 3D corners for the object by structure affinity (SA) for embedding the height and width and context affinity (CA) for embedding the pseudo depth; and means for combining the SA and the CA via an interaction function to gain object confidence.
  • Example 38 includes the device of Example 37, further comprising: means for determining an SA value based on an Intersection-over-Union (IoU) value of a top-left box and a bottom-right box formed via corresponding regressed width and height of the predicted top-left corner and the predicted bottom-right corner respectively, and a corner drifting value which is the mean of top-left drifting and bottom-right drifting in the two boxes.
  • Example 39 includes the device of Example 37 or 38, further comprising: means for calculating the SA value as the IoU value minus the minimum of the corner drifting value and 1.
  • Example 40 includes the device of any of Examples 37-39, wherein a range of the SA value is -1 to 1 and the closer the SA value is to 1, the higher the possibility that the predicted top-left corner and the predicted bottom-right corner belong to a same object.
  • Example 41 includes the device of any of Examples 37-40, further comprising: means for generating SA regression maps for the predicted top-left corner and the predicted bottom-right corner to regress the width and height of each corner at a ground-truth corner location; and means for obtaining the regressed width and height at each corner location after obtaining the SA regression maps.
  • Example 42 includes the device of any of Examples 37-41, further comprising: means for applying SmoothL1 Loss to mine the width and height of each corner.
  • Example 43 includes the device of any of Examples 37-42, further comprising: means for calculating a CA value by applying a tanh function to normalize a distance of embedding values of the predicted top-left corner and the predicted bottom-right corner.
  • Example 44 includes the device of any of Examples 37-43, wherein the embedding values of the predicted top-left corner and the predicted bottom-right corner are predicted using an Associative Embedding method.
  • Example 45 includes the device of any of Examples 37-44, further comprising: means for predicting the embedding values for the predicted top-left corner and the predicted bottom-right corner by using Pull Loss to close embedding distance of paired corners and Push Loss to separate embedding distance of irrelevant corners.
  • Example 46 includes the device of any of Examples 37-45, wherein a range of the CA value is 0 to 1, and the closer the CA value is to 1, the lower the possibility that the predicted top-left corner and the predicted bottom-right corner belong to a same object.
  • Example 47 includes the device of any of Examples 37-46, further comprising: means for grouping the predicted top-left corner and the predicted bottom-right corner together when a combined value of the SA value and the CA value is large, wherein the combined value is large only when the SA value is large and the CA value is small.
  • Example 48 includes the device of any of Examples 37-47, wherein the object confidence is represented as Corner Affinity, and the interaction function is as follows: CornerAffinity = SA · exp (−CA² / (2σ²)), where σ is a manually set Gaussian variance.
  • Example 49 includes an apparatus as shown and described in the description.
  • Example 50 includes a method performed at an apparatus as shown and described in the description.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

An apparatus, method, device and medium for object detection. The method includes: predicting a pair of pseudo 3D corners including a top-left corner and a bottom-right corner for an object in an image, wherein each of the pair of pseudo 3D corners is represented by height, width, and pseudo depth of the pseudo 3D corner (S610); modeling the pair of pseudo 3D corners for the object by structure affinity (SA) for embedding the height and width and context affinity (CA) for embedding the pseudo depth (S620); and combining the SA and the CA via an interaction function to gain object confidence (S630).

Description

APPARATUS AND METHOD FOR OBJECT DETECTION
Technical Field
Embodiments of the present disclosure generally relate to techniques of convolutional neural networks (CNNs) , and in particular to an apparatus and a method for object detection.
Background Art
Object detection aims to localize and classify a target of interest in an image. In simple scenarios, existing prevalent detectors, e.g., Fast R-CNN, RetinaNet, CenterNet and FCOS, which utilize center points to model an object's bounding box, are robust enough. However, their performance is severely degraded in some extreme scenarios, especially under object occlusion, which often occurs in practical applications such as pedestrian detection. The reason why object occlusion is difficult to solve is that most detectors model an object instance via its center point, yet objects' centers easily overlap.
Corner-guided detectors such as CornerNet and its variants, CenterNet and CentripetalNet, replace centers with corners, transforming object detection into corner keypoint prediction and grouping without center estimation; this corner modeling can effectively alleviate the center overlap problem.
Summary
According to an aspect of the disclosure, a method is provided. The method includes: predicting a pair of pseudo 3D corners comprising a top-left corner and a bottom-right corner for an object in an image, wherein each of the pair of pseudo 3D corners is represented by height, width, and pseudo depth; modeling the pair of pseudo 3D corners for the object by structure affinity (SA) for embedding the height and width and context affinity (CA) for embedding the pseudo depth; and combining the SA and the CA via an interaction function to gain object confidence.
According to another aspect of the disclosure, an apparatus is provided. The apparatus  includes: interface circuitry; and processor circuitry coupled to the interface circuitry and configured to: predict a pair of pseudo 3D corners comprising a top-left corner and a bottom-right corner for an object in an image, wherein each of the pair of pseudo 3D corners is represented by height, width, and pseudo depth; model the pair of pseudo 3D corners for the object by structure affinity (SA) for embedding the height and width and context affinity (CA) for embedding the pseudo depth; and combine the SA and the CA via an interaction function to gain object confidence.
Another aspect of the disclosure provides a device including means for implementing the method of the disclosure.
Another aspect of the disclosure provides a machine readable storage medium having instructions stored thereon, which when executed by a machine cause the machine to perform the method of the disclosure.
Brief Description of the Drawings
In the drawings, which are not necessarily drawn to scale, like numerals may describe similar components in different views. Like numerals having different letter suffixes may represent different instances of similar components. The drawings illustrate generally, by way of example, but not by way of limitation, various embodiments discussed in the present document.
Fig. 1 illustrates visualization comparisons of different solutions for object detection in an image.
Fig. 2 is an exemplary illustration of pseudo 3D corner representation in accordance with some embodiments of the disclosure.
Fig. 3 is an exemplary illustration of object detection based on the pseudo 3D corner representation in accordance with some embodiments of the disclosure.
Fig. 4 is an exemplary illustration of structure affinity (SA) in accordance with some embodiments of the disclosure.
Fig. 5 illustrates visualization comparisons of the proposed method with classical  corner-guided object detectors on Citypersons dataset in accordance with some embodiments of the disclosure.
Fig. 6 illustrates a flow chart illustrating an exemplary method 600 for object detection in accordance with some embodiments of the disclosure.
Fig. 7 is a block diagram illustrating components, according to some example embodiments, able to read instructions from a machine-readable or computer-readable medium and perform any one or more of the methodologies discussed herein.
Fig. 8 is a block diagram of an example processor platform in accordance with some embodiments of the disclosure.
Detailed Description of Embodiments
Various aspects of the illustrative embodiments will be described using terms commonly employed by those skilled in the art to convey the substance of the disclosure to others skilled in the art. However, it will be apparent to those skilled in the art that many alternate embodiments may be practiced using portions of the described aspects. For purposes of explanation, specific numbers, materials, and configurations are set forth in order to provide a thorough understanding of the illustrative embodiments. However, it will be apparent to those skilled in the art that alternate embodiments may be practiced without the specific details. In other instances, well known features may have been omitted or simplified in order to avoid obscuring the illustrative embodiments.
Further, various operations will be described as multiple discrete operations, in turn, in a manner that is most helpful in understanding the illustrative embodiments; however, the order of description should not be construed as to imply that these operations are necessarily order dependent. In particular, these operations need not be performed in the order of presentation.
The phrases “in an embodiment,” “in one embodiment,” and “in some embodiments” are used repeatedly herein. These phrases generally do not refer to the same embodiment; however, they may. The terms “comprising,” “having,” and “including” are synonymous, unless the context dictates otherwise. The phrases “A or B” and “A/B” mean “(A), (B), or (A and B).”
Although the corner modeling method has the potential to overcome the object occlusion problem, the corner grouping process is challenging. To the best of our knowledge, current corner grouping algorithms, whether that of CornerNet or of its variants CenterNet and CentripetalNet, cannot deal with dense occlusions.
Existing detection methods such as Faster R-CNN and SSD use a 2D box represented by height and width to model an object in an image. However, when two or more objects in an image partially overlap, their 2D boxes overlap and are difficult to distinguish.
Fig. 1 illustrates visualization comparisons of different solutions for object detection in an image. As shown in Fig. 1, there are four people in the image and three of them are densely occluded. CornerNet fails to distinguish corners of different objects with similar appearance. Building upon CornerNet, CenterNet filters out false positives (FP) via center points yet cannot handle the scenario where the center point of a third object lies right within the center of an FP. CentripetalNet takes the two densely occluded people as one due to the center overlapping.
To address the occluded object detection problem faced by existing solutions, the present disclosure provides a pseudo 3D corner representation to model objects in an image, wherein the pseudo 3D representation may include the dimensions of height, width, and pseudo depth. Through the promotion of dimensions from 2D to 3D, coinciding detection boxes can be distinguished, i.e., detection boxes that overlap in a 2D space can be distinguished in a 3D space.
Returning to Fig. 1, the solution based on the pseudo 3D corner representation of the present disclosure, termed Corner Affinity, clearly distinguishes the four people in the image with four boxes, none of which spans two objects.
Fig. 2 is an exemplary illustration of pseudo 3D corner representation in accordance with some embodiments of the disclosure. As shown in Fig. 2, in the pseudo 3D corner representation, an object may be represented by a pair of pseudo 3D corners. In an embodiment, the pair of pseudo 3D corners includes a top-left corner and a bottom-right corner. Each pair of pseudo 3D corners is modeled by: 1) Structure Affinity (SA), applied to mine the preliminary similarity of corner pairs through the corresponding object's shallow construction knowledge; and 2) Context Affinity (CA), which refines corner similarity via deeper semantic features of the affiliated instances.
In some embodiments, the height and width may be embedded in an SA module and the pseudo depth may be embedded in a CA module.
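To make the representation concrete, the following Python sketch shows one way a pseudo 3D corner could be carried through such a pipeline; the container and its field names are illustrative assumptions, not taken from the disclosure.

    # Hypothetical container for a pseudo 3D corner (illustrative only).
    from dataclasses import dataclass

    @dataclass
    class Pseudo3DCorner:
        x: float          # corner column in the heatmap
        y: float          # corner row in the heatmap
        w: float          # regressed instance width (SA branch)
        h: float          # regressed instance height (SA branch)
        embedding: float  # pseudo-depth embedding value (CA branch)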
Fig. 3 is an exemplary illustration of object detection based on the pseudo 3D corner representation in accordance with some embodiments of the disclosure. As shown in Fig. 3, an object is represented by a pair of pseudo 3D corners: its top-left and bottom-right corners. To measure object confidence, the SA for embedding width and height and the CA for embedding pseudo depth are provided for each pair of corners. The SA and the CA are combined by an interaction function, for example, a corner affinity function. The following subsections detail the SA, the CA, and the interaction function.
Mining the Construction Similarity via Structure Affinity
The structure affinity (SA) aims to mine preliminary construction similarity of corner pairs through the corresponding object's shallow structure knowledge. In an embodiment, the shape and location information of an instance may be defined as the structure knowledge. In some embodiments, the shape (e.g., width and height) information of each instance may be regressed, for example, at a ground-truth corner location.
In some embodiments, a detection network may generate SA regression maps for the top-left and bottom-right corners of the object. Note that regression is utilized to encode SA values and the width and height are not regarded as a detection box, which is very different from popular detectors that use regression to obtain bounding boxes. In some embodiments, smooth L1 loss may be adopted as the SA loss to mine the shape knowledge (e.g., width and height) of each instance. In some embodiments, the SA loss is only applied at the ground-truth corner locations.
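As an illustration of this training signal, the sketch below applies a smooth L1 penalty to the regressed (w, h) channels only at ground-truth corner locations. The map layout, names, and beta parameter are assumptions for illustration, not the disclosure's reference implementation.

    import numpy as np

    def smooth_l1(x: np.ndarray, beta: float = 1.0) -> np.ndarray:
        # Standard smooth L1: quadratic near zero, linear elsewhere.
        absx = np.abs(x)
        return np.where(absx < beta, 0.5 * absx ** 2 / beta, absx - 0.5 * beta)

    def sa_loss(sa_map: np.ndarray, gt_corners: list) -> float:
        # sa_map: (2, H, W) array holding the regressed (w, h) per location.
        # gt_corners: list of (row, col, w_gt, h_gt) ground-truth corners.
        total = 0.0
        for row, col, w_gt, h_gt in gt_corners:
            pred = sa_map[:, row, col]  # predicted (w, h) at a ground-truth corner
            total += smooth_l1(pred - np.array([w_gt, h_gt])).sum()
        return total / max(len(gt_corners), 1)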
After obtaining the SA regression map of each corner, predicted width (w) and height (h) vectors may be decoded at each estimated corner location. Fig. 4 is an exemplary illustration of SA in accordance with some embodiments of the disclosure. As shown in Fig. 4, the regressed w and h not only form a rough object box but also decode new vectors (vector_tl and vector_br) that point to the opposite corner. Accordingly, in some embodiments, the SA may be designed via coupling Intersection-over-Union (IoU) and corner drifting, as follows:
SA = IoU (box_tl, box_br) − min ((d_tl + d_br) / (2D), 1)        (1)
where box_tl and box_br represent the top-left and bottom-right boxes formed via the corresponding regressed w and h. The terms d_tl and d_br denote the corner drifting from the ends of the decoded vectors to the target corners; the decoded vectors are composed of the regressed width and height, while the target corners are the estimated corners in the heatmap. D represents the distance between the two predicted corners. In an example, D, d_tl, and d_br may be calculated via Euclidean distance. More details are shown in Fig. 4.
As described in Eq. (1), the SA is composed of the IoU of the two formed boxes and a bias named corner drifting. It is intuitive and reasonable that if corners of different identity (top-left and bottom-right) belong to the same instance, their formed boxes will overlap significantly. Thus, the IoU may be utilized as the basic distance metric of SA. However, vanilla IoU cannot measure offsets of the decoded vectors directly. Therefore, the corner drifting may be provided as a bias of SA. The corner drifting is the mean of the top-left and bottom-right drifting, which may be calculated, for example, via Euclidean distance, as shown in Fig. 4. Based on the above design, the value range of SA is -1 to 1, and the closer the value is to 1, the higher the possibility that two corners belong to the same instance.
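Under that reading of Eq. (1), the SA value for one candidate corner pair could be computed as in the sketch below; the helper names and the normalization of the drifting term by the corner distance D are assumptions consistent with the description above, not a verbatim implementation.

```python
import math

def iou(box_a, box_b):
    """Axis-aligned IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def structure_affinity(tl, br, wh_tl, wh_br):
    """SA of a candidate corner pair (illustrative sketch).

    tl, br:       (x, y) corner locations estimated from the heatmaps
    wh_tl, wh_br: (w, h) regressed at each corner
    """
    # Boxes formed by each corner and its regressed width/height.
    box_tl = (tl[0], tl[1], tl[0] + wh_tl[0], tl[1] + wh_tl[1])
    box_br = (br[0] - wh_br[0], br[1] - wh_br[1], br[0], br[1])
    # Corner drifting: distance from the end of each decoded vector
    # to the target (opposite) corner, as in Fig. 4.
    d_tl = math.dist((tl[0] + wh_tl[0], tl[1] + wh_tl[1]), br)
    d_br = math.dist((br[0] - wh_br[0], br[1] - wh_br[1]), tl)
    D = max(math.dist(tl, br), 1e-6)  # distance between the two corners
    drift = (d_tl + d_br) / (2 * D)   # mean drifting, normalized by D
    return iou(box_tl, box_br) - min(drift, 1.0)
```

For two corners of the same instance, the decoded vectors land near the opposite corners, so the drifting term is small and the formed boxes overlap heavily, driving SA toward 1.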
Mining the Semantic Similarity via Context Affinity
As mentioned above, the SA only embeds low-level construction information, which is not enough to perform grouping under extreme scenarios, e.g., when two objects with similar shapes coincide. To this end, the CA part is introduced to mine high-level distinguishable semantic knowledge for Corner Affinity. The pseudo depth of each corner may be embedded in CA.
In some embodiments, the pseudo depth of each of the top-left and bottom-right corners may be represented by an embedding value for the corner.
In some embodiments, an Associative Embedding method may be used to predict the embedding value for each corner. The embedding value may be predicted based on the local feature response in a self-supervised manner, so that no real ground-truth value is needed.
In some embodiments, CA loss is only applied at the ground-truth corner location. In some embodiments, CA loss is used to determine the embedding values, for example, based on “pull” loss and “push” loss, as follows:
L_pull = (1/N) Σ_k [ (e_tlk - e_k)^2 + (e_brk - e_k)^2 ]
L_push = (1/(N (N-1))) Σ_k Σ_{j≠k} max (0, Δ - |e_k - e_j|)         (2)
where e_tlk is the embedding vector for the predicted top-left corner of the k-th object, e_brk is the embedding vector for the predicted bottom-right corner, and e_k is the average of e_tlk and e_brk. N is the number of objects. Δ is a manually predefined margin which, as an example, may be set to 1 by default.
As shown in Eq. (2), the “pull” loss may be used to close the embedding distance of paired corners and the “push” loss may be used to separate the embedding distance of irrelevant corners. What the value of each corner's embedding is does not matter; it is only necessary to minimize the embedding distances of corners that belong to the same object and to maximize those of different objects. Thus, each embedding, without a real ground truth, mines the high-level semantic knowledge of an instance.
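A compact sketch of such a pull/push objective, assuming one-dimensional embeddings as in typical Associative Embedding implementations, is given below; the function name and the Δ default are illustrative.

```python
import torch

def embedding_loss(e_tl, e_br, delta=1.0):
    """Pull/push loss over the N objects of one image (sketch).

    e_tl, e_br: (N,) embedding values taken at ground-truth corner
    locations for the top-left and bottom-right corners.
    """
    e_k = (e_tl + e_br) / 2                             # per-object mean
    pull = ((e_tl - e_k) ** 2 + (e_br - e_k) ** 2).mean()
    n = e_k.shape[0]
    if n < 2:
        return pull
    diff = (e_k[:, None] - e_k[None, :]).abs()          # pairwise |e_k - e_j|
    off_diag = ~torch.eye(n, dtype=torch.bool, device=e_k.device)
    push = torch.clamp(delta - diff[off_diag], min=0).mean()
    return pull + push
```

The pull term shrinks each corner embedding toward its object mean, while the push term enforces at least a margin Δ between the means of different objects.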
The distance of the embeddings is smoothed to be the context affinity. Suppose a top-left corner with an embedding value e_tl and a bottom-right one with an embedding value e_br; the corresponding CA may be defined as follows:
CA = tanh (|e_tl - e_br|)         (3)
where the tanh function is employed to normalize the distance of the embedding values. The value range of CA is 0 to 1; unlike SA, the closer the CA value is to 1, the lower the possibility that the two corners belong to the same object.
Coupling the SA and CA via the Interaction Function
The SA and CA are combined to gain the object confidence, termed Corner Affinity, as follows:
Corner Affinity = SA · exp (-CA^2 / (2σ^2))         (4)
where SA and CA are given in Eq. (1) and Eq. (3), respectively. σ is a manually set Gaussian variance; for example, its value may be set to 0.5 empirically.
With the above designs, our method encodes not only low-level structural knowledge but also high-level semantic knowledge. Even in the extreme situation where two objects with similar shapes overlap (so that the SA value of two corners belonging to different instances may be large), CA will decay SA to ensure that the overall Corner Affinity value is low, and vice versa. In an embodiment, the value of the overall Corner Affinity may be 0.1 in such a case. In brief, only when the SA value is large and the CA value is small is the combined value large, so that the two corners can be grouped together. The hyper-surface of the interaction function is shown in Fig. 3.
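Reading Eq. (4) as a Gaussian decay of SA by CA, a minimal sketch of the interaction and the resulting grouping decision might look as follows; the 0.1 grouping threshold is only an illustrative value taken from the example above.

```python
import math

def corner_affinity(sa, ca, sigma=0.5):
    """Combine SA and CA: CA decays SA through a Gaussian kernel,
    so the result is large only when SA is large and CA is small."""
    return sa * math.exp(-(ca ** 2) / (2 * sigma ** 2))

def should_group(sa, ca, threshold=0.1, sigma=0.5):
    """Group a candidate (top-left, bottom-right) pair when the
    combined Corner Affinity exceeds a threshold (illustrative)."""
    return corner_affinity(sa, ca, sigma) > threshold
```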
Experimental Results
In order to verify the effectiveness of the solution based on the proposed pseudo 3D corner representation, extensive experiments are conducted on different datasets with different object detectors. Tables 1-2 summarize the detailed result comparisons.
Table 1
[Table 1: comparisons of the proposed method with different object detectors on the COCO test-dev and val sets; reproduced as an image in the original publication.]
Table 1 shows comparisons with different object detectors on the COCO dataset, on both the test-dev and val sets. As shown in Table 1, the method proposed in the disclosure brings a 5.8% boost in AP, from 40.5% to 46.3%, for the CornerNet baseline on the COCO test-dev set. Updated only with the proposed grouping algorithm, CornerNet surpasses popular detection baselines by a large margin, showing that corner-guided detectors have a high performance ceiling and that the proposed method remarkably promotes the development of such detectors by optimizing corner grouping. It is worth noting that our pure grouping optimization produces larger improvements for vanilla CornerNet than the other two single-stage variants, i.e., CenterNet and CentripetalNet. These two variants not only optimize corner grouping upon CornerNet but also use stronger corner-enhanced features, yet they still attain weaker accuracy than our pure grouping optimization. The experimental results show that the proposed method is the optimal corner grouping strategy at present.
Table 1 further shows comparisons with recent popular Transformer-based approaches. As shown in Table 1, employing the proposed method, CornerNet (Hourglass-104) yields an Average Precision (AP) of 45.1% and an AP75 of 48.3% on the COCO val set, surpassing recent advanced self-attention-based single-stage detectors (e.g., DETR), even without any feature enhancement module, e.g., FPN. A higher AP75 means higher-quality detection boxes.
Table 2
[Table 2: comparisons of the proposed method with different object detectors on the Citypersons and UCAS-AOD datasets; reproduced as an image in the original publication.]
Table 2 shows comparisons with different object detectors on the Citypersons and UCAS-AOD datasets, where AP_c means the AP generated on Citypersons and AP_u represents the AP obtained on UCAS-AOD. These two datasets are selected to test the generalization performance of the proposed method under two extreme scenarios, i.e., an occlusion scene (Citypersons) and a symmetrical arrangement of similar objects (UCAS-AOD). As shown in Table 2, the proposed method produces excellent accuracy from higher-quality corner pairs under all the aforementioned challenging situations. Specifically, compared with vanilla CornerNet, the proposed method boosts AP by a remarkable 35.8% and 17.2% on Citypersons and UCAS-AOD, respectively, firmly proving that the design brings more robust grouping capability to corner-guided detectors.
As can be clearly seen from the results in Tables 1-2, the proposed method, tested on three public datasets (COCO, Citypersons, and UCAS-AOD), significantly increases the Mean Average Precision (mAP) on all datasets. Especially on Citypersons, where people are densely occluded (examples are shown in Fig. 5), the proposed method boosts the mAP by 35.8%.
Fig. 5 illustrates visualization comparisons of the proposed method with classical corner-guided object detectors, i.e., CornerNet and CenterNet, on the Citypersons dataset, to show the robustness of the newly devised grouping algorithm. As shown in Fig. 5, for CornerNet, some unusual keypoints in the background are grouped as pairs and form meaningless boxes. CenterNet filters false positives via an object's center point, yet it cannot handle the situation in which the center of a third object falls right within the center region of a false positive. The proposed method can overcome the object occlusion problem in such pedestrian scenarios.
Fig. 6 is a flow chart illustrating an exemplary method 600 for object detection in accordance with some embodiments of the disclosure. The method 600 may include blocks S610-S630.
At block S610, a pair of pseudo 3D corners comprising a top-left corner and a bottom-right corner for an object in an image may be predicted, wherein each of the pair of pseudo 3D corners is represented by height, width, and pseudo depth of the pseudo 3D corner. At block S620, the pair of pseudo 3D corners for the object may be modeled by structure affinity (SA) for embedding the height and width and context affinity (CA) for embedding the pseudo depth. At block S630, the SA and the CA may be combined via an interaction function to gain object confidence.
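Tying blocks S610-S630 together, a hypothetical grouping loop could look like the sketch below, reusing the structure_affinity and corner_affinity helpers sketched earlier; the data layout and threshold are assumptions for illustration only.

```python
import math

def detect_objects(tl_corners, br_corners, tl_wh, br_wh,
                   tl_emb, br_emb, threshold=0.1):
    """Score every candidate top-left/bottom-right pair by Corner
    Affinity and keep pairs above a threshold (illustrative sketch).

    tl_corners, br_corners: lists of (x, y) corner locations   (S610)
    tl_wh, br_wh:           regressed (w, h) per corner        (S620, SA)
    tl_emb, br_emb:         embedding value per corner         (S620, CA)
    """
    detections = []
    for tl, wh_tl, e_tl in zip(tl_corners, tl_wh, tl_emb):
        for br, wh_br, e_br in zip(br_corners, br_wh, br_emb):
            sa = structure_affinity(tl, br, wh_tl, wh_br)
            ca = math.tanh(abs(e_tl - e_br))        # Eq. (3)
            score = corner_affinity(sa, ca)         # S630: combine
            if score > threshold:
                detections.append((tl, br, score))
    return detections
```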
In some embodiments, the method 600 may include more or fewer steps. The disclosure is not limited in this aspect. Also, the method 600 may be understood in conjunction with the embodiments described above.
In some embodiments, the method 600 may further include determining the SA value based on an Intersection-over-Union (IoU) value of a top-left box and a bottom-right box formed via corresponding regressed width and height of the predicted top-left corner and the predicted bottom-right corner respectively, and a corner drifting value which is the mean of top-left drifting and bottom-right drifting in the two boxes.
In some embodiments, the method 600 may further include calculating the SA value as the IoU value minus the minimum of the corner drifting value and 1.
In some embodiments, the SA value may be calculated as follows:
SA = IoU (box_tl, box_br) - min ( (d_tl + d_br) / (2D) , 1)
In some embodiments, a range of the SA value may be -1 to 1, and the closer the SA value is to 1, the higher the possibility that the predicted top-left corner and the predicted bottom-right corner belong to a same object.
In some embodiments, the method 600 may further include generating SA regression maps for the predicted top-left corner and the predicted bottom-right corner to regress the width and height of each corner at a ground-truth corner location; and obtaining the regressed width and height at each corner location after obtaining the SA regression maps.
In some embodiments, the method 600 may further include applying SmoothL1 Loss to mine the width and height of each corner.
In some embodiments, the method 600 may further include calculating the CA value by applying a tanh function to normalize a distance of embedding values of the predicted top-left corner and the predicted bottom-right corner.
In some embodiments, the CA value may be calculated as follows:
CA = tanh (|e_tl - e_br|)
In some embodiments, the embedding values of the predicted top-left corner and the predicted bottom-right corner may be predicted using an Associative Embedding method.
In some embodiments, the method 600 may further include: predicting the embedding values for the predicted top-left corner and the predicted bottom-right corner by using Pull Loss to close embedding distance of paired corners and Push Loss to separate embedding distance of irrelevant corners.
In some embodiments, the embedding values for the predicted top-left corner and the predicted bottom-right corner may be predicted as follows:
L_pull = (1/N) Σ_k [ (e_tlk - e_k)^2 + (e_brk - e_k)^2 ]
L_push = (1/(N (N-1))) Σ_k Σ_{j≠k} max (0, Δ - |e_k - e_j|)
In some embodiments, a range of the CA value is 0 to 1; the closer the CA value is to 1, the lower the possibility that the predicted top-left corner and the predicted bottom-right corner belong to a same object.
In some embodiments, the combining of the SA and the CA via an interaction function to gain the object confidence includes: grouping the predicted top-left corner and the predicted bottom-right corner together when a combined value of the SA value and the CA value is large, wherein the combined value is large only when the SA value is large and the CA value is small.
In some embodiments, the interaction function is as follows:
Corner Affinity = SA · exp (-CA^2 / (2σ^2))
where σ is a manually set Gaussian variance.
The present disclosure provides a novel corner representation method, termed pseudo 3D corner representation, to address the object occlusion problem, wherein the pseudo 3D representation may include the dimensions of height, width, and pseudo depth; the height and width may be embedded in the proposed SA module and the pseudo depth may be embedded in the proposed CA module. Through the promotion of dimensions from 2D to 3D, coincided detection boxes can be distinguished and the dense object occlusion problem can be overcome.
Fig. 7 is a block diagram illustrating components, according to some example embodiments, able to read instructions from a machine-readable or computer-readable medium (e.g., a non-transitory machine-readable storage medium) and perform any one or more of the methodologies discussed herein. Specifically, Fig. 7 shows a diagrammatic representation of hardware resources 700 including one or more processors (or processor cores) 710, one or more memory/storage devices 720, and one or more communication resources 730, each of which may be communicatively coupled via a bus 740. For embodiments where node virtualization (e.g., NFV) is utilized, a hypervisor 702 may be executed to provide an execution environment for one or more network slices/sub-slices to utilize the hardware resources 700.
The processors 710 may include, for example, a processor 712 and a processor 714 which may be, e.g., a central processing unit (CPU) , a reduced instruction set computing (RISC) processor, a complex instruction set computing (CISC) processor, a graphics processing unit (GPU) , a digital signal processor (DSP) such as a baseband processor, an application specific integrated circuit (ASIC) , a radio-frequency integrated circuit (RFIC) , another processor, or  any suitable combination thereof.
The memory/storage devices 720 may include main memory, disk storage, or any suitable combination thereof. The memory/storage devices 720 may include, but are not limited to any type of volatile or non-volatile memory such as dynamic random access memory (DRAM) , static random-access memory (SRAM) , erasable programmable read-only memory (EPROM) , electrically erasable programmable read-only memory (EEPROM) , Flash memory, solid-state storage, etc.
The communication resources 730 may include interconnection or network interface components or other suitable devices to communicate with one or more peripheral devices 704 or one or more databases 706 via a network 708. For example, the communication resources 730 may include wired communication components (e.g., for coupling via a Universal Serial Bus (USB)), cellular communication components, NFC components, Bluetooth® components (e.g., Bluetooth® Low Energy), Wi-Fi® components, and other communication components.
Instructions 750 may comprise software, a program, an application, an applet, an app, or other executable code for causing at least any of the processors 710 to perform any one or more of the methodologies discussed herein. The instructions 750 may reside, completely or partially, within at least one of the processors 710 (e.g., within the processor’s cache memory) , the memory/storage devices 720, or any suitable combination thereof. Furthermore, any portion of the instructions 750 may be transferred to the hardware resources 700 from any combination of the peripheral devices 704 or the databases 706. Accordingly, the memory of processors 710, the memory/storage devices 720, the peripheral devices 704, and the databases 706 are examples of computer-readable and machine-readable media.
Fig. 8 is a block diagram of an example processor platform in accordance with some embodiments of the disclosure. The processor platform 800 can be, for example, a server, a personal computer, a workstation, a self-learning machine (e.g., a neural network) , a mobile device (e.g., a cell phone, a smart phone, a tablet such as an iPad TM) , a personal digital assistant (PDA) , an Internet appliance, a DVD player, a CD player, a digital video recorder, a Blu-ray player, a gaming console, a personal video recorder, a set top box, a headset or other wearable  device, or any other type of computing device.
The processor platform 800 of the illustrated example includes a processor 812. The processor 812 of the illustrated example is hardware. For example, the processor 812 can be implemented by one or more integrated circuits, logic circuits, microprocessors, GPUs, DSPs, or controllers from any desired family or manufacturer. The hardware processor may be a semiconductor based (e.g., silicon based) device. In some embodiments, the processor implements one or more of the methods or processes described above.
The processor 812 of the illustrated example includes a local memory 813 (e.g., a cache). The processor 812 of the illustrated example is in communication with a main memory including a volatile memory 814 and a non-volatile memory 816 via a bus 818. The volatile memory 814 may be implemented by Synchronous Dynamic Random Access Memory (SDRAM), Dynamic Random Access Memory (DRAM), RAMBUS® Dynamic Random Access Memory (RDRAM®) and/or any other type of random access memory device. The non-volatile memory 816 may be implemented by flash memory and/or any other desired type of memory device. Access to the main memory 814, 816 is controlled by a memory controller.
The processor platform 800 of the illustrated example also includes interface circuitry 820. The interface circuitry 820 may be implemented by any type of interface standard, such as an Ethernet interface, a universal serial bus (USB), a Bluetooth® interface, a near field communication (NFC) interface, and/or a PCI express interface.
In the illustrated example, one or more input devices 822 are connected to the interface circuitry 820. The input device (s) 822 permit (s) a user to enter data and/or commands into the processor 812. The input device (s) can be implemented by, for example, an audio sensor, a microphone, a camera (still or video) , a keyboard, a button, a mouse, a touchscreen, a track-pad, a trackball, and/or a voice recognition system.
One or more output devices 824 are also connected to the interface circuitry 820 of the illustrated example. The output devices 824 can be implemented, for example, by display devices (e.g., a light emitting diode (LED) , an organic light emitting diode (OLED) , a liquid crystal display (LCD) , a cathode ray tube display (CRT) , an in-place switching (IPS) display,  a touchscreen, etc. ) , a tactile output device, a printer and/or speaker. The interface circuitry 820 of the illustrated example, thus, typically includes a graphics driver card, a graphics driver chip and/or a graphics driver processor.
The interface circuitry 820 of the illustrated example also includes a communication device such as a transmitter, a receiver, a transceiver, a modem, a residential gateway, a wireless access point, and/or a network interface to facilitate exchange of data with external machines (e.g., computing devices of any kind) via a network 826. The communication can be via, for example, an Ethernet connection, a digital subscriber line (DSL) connection, a telephone line connection, a coaxial cable system, a satellite system, a line-of-sight wireless system, a cellular telephone system, etc.
For example, a training dataset may be inputted through the input device (s) 822 or retrieved from the network 826 via the interface circuitry 820.
The processor platform 800 of the illustrated example also includes one or more mass storage devices 828 for storing software and/or data. Examples of such mass storage devices 828 include floppy disk drives, hard drive disks, compact disk drives, Blu-ray disk drives, redundant array of independent disks (RAID) systems, and digital versatile disk (DVD) drives.
Machine executable instructions 832 may be stored in the mass storage device 828, in the volatile memory 814, in the non-volatile memory 816, and/or on a removable non-transitory computer readable storage medium such as a CD or DVD.
The following paragraphs describe examples of various embodiments.
Example 1 includes a method, comprising: predicting a pair of pseudo 3D corners comprising a top-left corner and a bottom-right corner for an object in an image, wherein each of the pair of pseudo 3D corners is represented by height, width, and pseudo depth of the pseudo 3D corner; modeling the pair of pseudo 3D corners for the object by structure affinity (SA) for embedding the height and width and context affinity (CA) for embedding the pseudo depth; and combining the SA and the CA via an interaction function to gain object confidence.
Example 2 includes the method of Example 1, further comprising: determining a SA value based on an Intersection-over-Union (IoU) value of a top-left box and a bottom-right box  formed via corresponding regressed width and height of the predicted top-left corner and the predicted bottom-right corner respectively, and a corner drifting value which is the mean of top-left drifting and bottom-right drifting in the two boxes.
Example 3 includes the method of Example 1 or 2, further comprising: calculating the SA value by the IoU value minus a minimum value of the corner drifting value and 1.
Example 4 includes the method of any of Examples 1-3, wherein a range of the SA value is -1 to 1 and the closer the SA value is to 1, the higher the possibility that the predicted top-left corner and the predicted bottom-right corner belong to a same object.
Example 5 includes the method of any of Examples 1-4, further comprising: generating SA regression maps for the predicted top-left corner and the predicted bottom-right corner to regress the width and height of each corner at a ground-truth corner location; and obtaining the regressed width and height at each corner location after obtaining the SA regression maps.
Example 6 includes the method of any of Examples 1-5, further comprising: applying SmoothL1 Loss to mine the width and height of each corner.
Example 7 includes the method of any of Examples 1-6, further comprising: calculating a CA value by applying tanh function to normalize a distance of embedding values of the predicted top-left corner and the predicted bottom-right corner.
Example 8 includes the method of any of Examples 1-7, wherein the embedding values of the predicted top-left corner and the predicted bottom-right corner are predicted using an Associative Embedding method.
Example 9 includes the method of any of Examples 1-8, further comprising: predicting the embedding values for the predicted top-left corner and the predicted bottom-right corner by using Pull Loss to close embedding distance of paired corners and Push Loss to separate embedding distance of irrelevant corners.
Example 10 includes the method of any of Examples 1-9, wherein a range of the CA value is 0 to 1, the closer the CA value is to 1, the lower the possibility that the predicted top-left corner and the predicted bottom-right corner belong to a same object.
Example 11 includes the method of any of Examples 1-10, wherein the combining the SA and the CA via an interaction function to gain object confidence comprising: grouping the predicted top-left corner and the predicted bottom-right corner together when a combined value of the SA value and the CA value is large, wherein the combined value is large only when the SA value is large and the CA value is small.
Example 12 includes the method of any of Examples 1-11, wherein the object confidence is represented as Corner Affinity, and the interaction function is as follows:
Corner Affinity = SA · exp (-CA^2 / (2σ^2))
where σ is a manually set Gaussian variance.
Example 13 includes an apparatus, comprising: interface circuitry; and processor circuitry coupled to the interface circuitry and configured to: predict a pair of pseudo 3D corners comprising a top-left corner and a bottom-right corner for an object in an image, wherein each of the pair of pseudo 3D corners is represented by height, width, and pseudo depth of the pseudo 3D corner; model the pair of pseudo 3D corners for the object by structure affinity (SA) for embedding the height and width and context affinity (CA) for embedding the pseudo depth; and combine the SA and the CA via an interaction function to gain object confidence.
Example 14 includes the apparatus of Example 13, wherein the processing circuitry is further to: determine a SA value based on an Intersection-over-Union (IoU) value of a top-left box and a bottom-right box formed via corresponding regressed width and height of the predicted top-left corner and the predicted bottom-right corner respectively, and a corner drifting value which is the mean of top-left drifting and bottom-right drifting in the two boxes.
Example 15 includes the apparatus of Example 13 or 14, wherein the processing circuitry is further to: calculate the SA value by the IoU value minus a minimum value of the corner drifting value and 1.
Example 16 includes the apparatus of any of Examples 13-15, wherein a range of the SA value is -1 to 1 and the closer the SA value is to 1, the higher the possibility that the predicted  top-left corner and the predicted bottom-right corner belong to a same object.
Example 17 includes the apparatus of any of Examples 13-16, wherein the processing circuitry is further to: generate SA regression maps for the predicted top-left corner and the predicted bottom-right corner to regress the width and height of each corner at a ground-truth corner location; and obtain the regressed width and height at each corner location after obtaining the SA regression maps.
Example 18 includes the apparatus of any of Examples 13-17, wherein the processing circuitry is further to: apply SmoothL1 Loss to mine the width and height of each corner.
Example 19 includes the apparatus of any of Examples 13-18, wherein the processing circuitry is further to: calculate a CA value by applying tanh function to normalize a distance of embedding values of the predicted top-left corner and the predicted bottom-right corner.
Example 20 includes the apparatus of any of Examples 13-19, wherein the embedding values of the predicted top-left corner and the predicted bottom-right corner are predicted using an Associative Embedding method.
Example 21 includes the apparatus of any of Examples 13-20, wherein the processing circuitry is further to: predict the embedding values for the predicted top-left corner and the predicted bottom-right corner by using Pull Loss to close embedding distance of paired corners and Push Loss to separate embedding distance of irrelevant corners.
Example 22 includes the apparatus of any of Examples 13-21, wherein a range of the CA value is 0 to 1, the closer the CA value is to 1, the lower the possibility that the predicted top-left corner and the predicted bottom-right corner belong to a same object.
Example 23 includes the apparatus of any of Examples 13-22, wherein the processing circuitry is further to: group the predicted top-left corner and the predicted bottom-right corner together when a combined value of the SA value and the CA value is large, wherein the combined value is large only when the SA value is large and the CA value is small.
Example 24 includes the apparatus of any of Examples 13-23, wherein the object confidence is represented as Corner Affinity, and the interaction function is as follows:
Corner Affinity = SA · exp (-CA^2 / (2σ^2))
where σ is a manually set Gaussian variance.
Example 25 includes a computer-readable medium having instructions stored thereon, wherein the instructions, when executed by processor circuitry, cause the processor circuitry to perform a method, the method comprising: predicting a pair of pseudo 3D corners comprising a top-left corner and a bottom-right corner for an object in an image, wherein each of the pair of pseudo 3D corners is represented by height, width, and pseudo depth of the pseudo 3D corner; modeling the pair of pseudo 3D corners for the object by structure affinity (SA) for embedding the height and width and context affinity (CA) for embedding the pseudo depth; and combining the SA and the CA via an interaction function to gain object confidence.
Example 26 includes the computer-readable medium of Example 25, the method further comprising: determining a SA value based on an Intersection-over-Union (IoU) value of a top-left box and a bottom-right box formed via corresponding regressed width and height of the predicted top-left corner and the predicted bottom-right corner respectively, and a corner drifting value which is the mean of top-left drifting and bottom-right drifting in the two boxes.
Example 27 includes the computer-readable medium of Example 25 or 26, the method further comprising: calculating the SA value by the IoU value minus a minimum value of the corner drifting value and 1.
Example 28 includes the computer-readable medium of any of Examples 25-27, wherein a range of the SA value is -1 to 1 and the closer the SA value is to 1, the higher the possibility that the predicted top-left corner and the predicted bottom-right corner belong to a same object.
Example 29 includes the computer-readable medium of any of Examples 25-28, the method further comprising: generating SA regression maps for the predicted top-left corner and the predicted bottom-right corner to regress the width and height of each corner at a ground-truth corner location; and obtaining the regressed width and height at each corner location after obtaining the SA regression maps.
Example 30 includes the computer-readable medium of any of Examples 25-29, the method further comprising: applying SmoothL1 Loss to mine the width and height of each corner.
Example 31 includes the computer-readable medium of any of Examples 25-30, the method further comprising: calculating a CA value by applying tanh function to normalize a distance of embedding values of the predicted top-left corner and the predicted bottom-right corner.
Example 32 includes the computer-readable medium of any of Examples 25-31, wherein the embedding values of the predicted top-left corner and the predicted bottom-right corner are predicted using an Associative Embedding method.
Example 33 includes the computer-readable medium of any of Examples 25-32, the method further comprising: predicting the embedding values for the predicted top-left corner and the predicted bottom-right corner by using Pull Loss to close embedding distance of paired corners and Push Loss to separate embedding distance of irrelevant corners.
Example 34 includes the computer-readable medium of any of Examples 25-33, wherein a range of the CA value is 0 to 1, the closer the CA value is to 1, the lower the possibility that the predicted top-left corner and the predicted bottom-right corner belong to a same object.
Example 35 includes the computer-readable medium of any of Examples 25-34, wherein the combining the SA and the CA via an interaction function to gain object confidence comprising: grouping the predicted top-left corner and the predicted bottom-right corner together when a combined value of the SA value and the CA value is large, wherein the combined value is large only when the SA value is large and the CA value is small.
Example 36 includes the computer-readable medium of any of Examples 25-35, wherein the object confidence is represented as Corner Affinity, and the interaction function is as follows:
Corner Affinity = SA · exp (-CA^2 / (2σ^2))
where σ is a manually set Gaussian variance.
Example 37 includes a device, comprising: means for predicting a pair of pseudo 3D corners comprising a top-left corner and a bottom-right corner for an object in an image, wherein each of the pair of pseudo 3D corners is represented by height, width, and pseudo depth of the pseudo 3D corner; means for modeling the pair of pseudo 3D corners for the object by structure affinity (SA) for embedding the height and width and context affinity (CA) for embedding the pseudo depth; and means for combining the SA and the CA via an interaction function to gain object confidence.
Example 38 includes the device of Example 37, further comprising: means for determining a SA value based on an Intersection-over-Union (IoU) value of a top-left box and a bottom-right box formed via corresponding regressed width and height of the predicted top-left corner and the predicted bottom-right corner respectively, and a corner drifting value which is the mean of top-left drifting and bottom-right drifting in the two boxes.
Example 39 includes the device of Example 37 or 38, further comprising: means for calculating the SA value by the IoU value minus a minimum value of the corner drifting value and 1.
Example 40 includes the device of any of Examples 37-39, wherein a range of the SA value is -1 to 1 and the closer the SA value is to 1, the higher the possibility that the predicted top-left corner and the predicted bottom-right corner belong to a same object.
Example 41 includes the device of any of Examples 37-40, further comprising: means for generating SA regression maps for the predicted top-left corner and the predicted bottom-right corner to regress the width and height of each corner at a ground-truth corner location; and means for obtaining the regressed width and height at each corner location after obtaining the SA regression maps.
Example 42 includes the device of any of Examples 37-41, further comprising: means for applying SmoothL1 Loss to mine the width and height of each corner.
Example 43 includes the device of any of Examples 37-42, further comprising: means for calculating a CA value by applying tanh function to normalize a distance of embedding  values of the predicted top-left corner and the predicted bottom-right corner.
Example 44 includes the device of any of Examples 37-43, wherein the embedding values of the predicted top-left corner and the predicted bottom-right corner are predicted using an Associative Embedding method.
Example 45 includes the device of any of Examples 37-44, further comprising: means for predicting the embedding values for the predicted top-left corner and the predicted bottom-right corner by using Pull Loss to close embedding distance of paired corners and Push Loss to separate embedding distance of irrelevant corners.
Example 46 includes the device of any of Examples 37-45, wherein a range of the CA value is 0 to 1, the closer the CA value is to 1, the lower the possibility that the predicted top-left corner and the predicted bottom-right corner belong to a same object.
Example 47 includes the device of any of Examples 37-46, further comprising: means for grouping the predicted top-left corner and the predicted bottom-right corner together when a combined value of the SA value and the CA value is large, wherein the combined value is large only when the SA value is large and the CA value is small.
Example 48 includes the device of any of Examples 37-47, wherein the object confidence is represented as Corner Affinity, and the interaction function is as follows:
Corner Affinity = SA · exp (-CA^2 / (2σ^2))
where σ is a manually set Gaussian variance.
Example 49 includes an apparatus as shown and described in the description.
Example 50 includes a method performed at an apparatus as shown and described in the description.
The above description is intended to be illustrative, and not restrictive. For example, the above-described examples (or one or more aspects thereof) may be used in combination with each other. Other embodiments may be used, such as by one of ordinary skill in the art upon reviewing the above description. The Abstract is to allow the reader to quickly ascertain the nature of the technical disclosure and is submitted with the understanding that it will not be  used to interpret or limit the scope or meaning of the claims. Also, in the above Detailed Description, various features may be grouped together to streamline the disclosure. This should not be interpreted as intending that an unclaimed disclosed feature is essential to any claim. Rather, inventive subject matter may lie in less than all features of a particular disclosed embodiment. Thus, the following claims are hereby incorporated into the Detailed Description, with each claim standing on its own as a separate embodiment. The scope of the embodiments should be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.
Although certain embodiments have been illustrated and described herein for purposes of description, a wide variety of alternate and/or equivalent embodiments or implementations calculated to achieve the same purposes may be substituted for the embodiments shown and described without departing from the scope of the present disclosure. The disclosure is intended to cover any adaptations or variations of the embodiments discussed herein. Therefore, it is manifestly intended that embodiments described herein be limited only by the appended claims and the equivalents thereof.

Claims (25)

  1. A method, comprising:
    predicting a pair of pseudo 3D corners comprising a top-left corner and a bottom-right corner for an object in an image, wherein each of the pair of pseudo 3D corners is represented by height, width, and pseudo depth of the pseudo 3D corner;
    modeling the pair of pseudo 3D corners for the object by structure affinity (SA) for embedding the height and width and context affinity (CA) for embedding the pseudo depth; and
    combining the SA and the CA via an interaction function to gain object confidence.
  2. The method of claim 1, further comprising:
    determining a SA value based on an Intersection-over-Union (IoU) value of a top-left box and a bottom-right box formed via corresponding regressed width and height of the predicted top-left corner and the predicted bottom-right corner respectively, and a corner drifting value which is the mean of top-left drifting and bottom-right drifting in the two boxes.
  3. The method of claim 2, further comprising:
    calculating the SA value by the IoU value minus a minimum value of the corner drifting value and 1.
  4. The method of claim 3, wherein a range of the SA value is -1 to 1 and the closer the SA value is to 1, the higher the possibility that the predicted top-left corner and the predicted bottom-right corner belong to a same object.
  5. The method of claim 2, further comprising:
    generating SA regression maps for the predicted top-left corner and the predicted bottom-right corner to regress the width and height of each corner at a ground-truth corner location; and
    obtaining the regressed width and height at each corner location after obtaining the SA regression maps.
  6. The method of claim 5, further comprising:
    applying SmoothL1 Loss to mine the width and height of each corner.
  7. The method of claim 2, further comprising:
    calculating a CA value by applying tanh function to normalize a distance of embedding values of the predicted top-left corner and the predicted bottom-right corner.
  8. The method of claim 7, wherein the embedding values of the predicted top-left corner and the predicted bottom-right corner are predicted using an Associative Embedding method.
  9. The method of claim 8, further comprising:
    predicting the embedding values for the predicted top-left corner and the predicted bottom-right corner by using Pull Loss to close embedding distance of paired corners and Push Loss to separate embedding distance of irrelevant corners.
  10. The method of claim 7, wherein a range of the CA value is 0 to 1, the closer the CA value is to 1, the lower the possibility that the predicted top-left corner and the predicted bottom-right corner belong to a same object.
  11. The method of claim 7, wherein the combining the SA and the CA via an interaction function to gain object confidence comprising:
    grouping the predicted top-left corner and the predicted bottom-right corner together when a combined value of the SA value and the CA value is large, wherein the combined value is large only when the SA value is large and the CA value is small.
  12. The method of claim 11, wherein the object confidence is represented as Corner Affinity, and the interaction function is as follows:
    Corner Affinity = SA · exp (-CA^2 / (2σ^2))
    where σ is a manually set Gaussian variance.
  13. An apparatus, comprising:
    interface circuitry; and
    processor circuitry coupled to the interface circuitry and configured to:
    predict a pair of pseudo 3D corners comprising a top-left corner and a bottom-right corner for an object in an image, wherein each of the pair of pseudo 3D corners is represented by height, width, and pseudo depth of the pseudo 3D corner;
    model the pair of pseudo 3D corners for the object by structure affinity (SA) for embedding the height and width and context affinity (CA) for embedding the pseudo depth; and
    combine the SA and the CA via an interaction function to gain object confidence.
  14. The apparatus of claim 13, wherein the processing circuitry is further to:
    determine a SA value based on an Intersection-over-Union (IoU) value of a top-left box and a bottom-right box formed via corresponding regressed width and height of the predicted top-left corner and the predicted bottom-right corner respectively, and a corner drifting value which is the mean of top-left drifting and bottom-right drifting in the two boxes.
  15. The apparatus of claim 14, wherein the processing circuitry is further to:
    calculate the SA value by the IoU value minus a minimum value of the corner drifting value and 1.
  16. The apparatus of claim 15, wherein a range of the SA value is -1 to 1 and the closer the SA value is to 1, the higher the possibility that the predicted top-left corner and the predicted bottom-right corner belong to a same object.
  17. The apparatus of claim 14, wherein the processing circuitry is further to:
    generate SA regression maps for the predicted top-left corner and the predicted bottom-right corner to regress the width and height of each corner at a ground-truth corner location; and
    obtain the regressed width and height at each corner location after obtaining the SA regression maps.
  18. The apparatus of claim 17, wherein the processing circuitry is further to:
    apply SmoothL1 Loss to mine the width and height of each corner.
  19. The apparatus of claim 14, wherein the processing circuitry is further to:
    calculate a CA value by applying tanh function to normalize a distance of embedding values of the predicted top-left corner and the predicted bottom-right corner.
  20. The apparatus of claim 19, wherein the embedding values of the predicted top-left corner and the predicted bottom-right corner are predicted using an Associative Embedding method.
  21. The apparatus of claim 20, wherein the processing circuitry is further to:
    predict the embedding values for the predicted top-left corner and the predicted bottom-right corner by using Pull Loss to close embedding distance of paired corners and Push Loss to separate embedding distance of irrelevant corners.
  22. The apparatus of claim 19, wherein a range of the CA value is 0 to 1, the closer the CA value is to 1, the lower the possibility that the predicted top-left corner and the predicted bottom-right corner belong to a same object.
  23. The apparatus of claim 19, wherein the processing circuitry is further to:
    group the predicted top-left corner and the predicted bottom-right corner together when a combined value of the SA value and the CA value is large, wherein the combined value is large only when the SA value is large and the CA value is small.
  24. The apparatus of claim 23, wherein the object confidence is represented as Corner Affinity, and the interaction function is as follows:
    Corner Affinity = SA · exp (-CA^2 / (2σ^2))
    where σ is a manually set Gaussian variance.
  25. A computer-readable medium having instructions stored thereon, wherein the instructions, when executed by processor circuitry, cause the processor circuitry to perform the method of any of claims 1 to 12.