EP4445331A1 - Object detection device, object detection method, and object detection system
- Publication number
- EP4445331A1 (application EP21967291.2A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- image
- target
- unit
- object detection
- label
- Prior art date
- Legal status (assumed; not a legal conclusion)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/52—Surveillance or monitoring of activities, e.g. for recognising suspicious objects
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
Definitions
- the present disclosure generally relates to an object detection device, object detection method, and object detection system.
- sensor devices are increasingly used to collect data in a variety of fields. This collected data can then be analyzed to detect and measure various physical phenomena and gather valuable insights.
- One aspect of information technology and IoT systems relates to the detection and tracking of objects in images collected by cameras.
- the detection and tracking of objects has applications in the surveillance, healthcare, retail, and transportation domains.
- Objects of interest could include, for instance, humans, animals, vehicles, packages, or the like.
- the object of interest in an image may only be partially visible due to occlusion by other objects. In such cases, it is difficult for existing object detection methods to provide accurate object detection results.
- Patent Document 1 discloses “A method includes the following steps. A video sequence including detection results from one or more detectors is received, the detection results identifying one or more objects. A clustering framework is applied to the detection results to identify one or more clusters associated with the one or more objects. The clustering framework is applied to the video sequence on a frame-by-frame basis. Spatial and temporal information for each of the one or more clusters are determined. The one or more clusters are associated to the detection results based on the spatial and temporal information in consecutive frames of the video sequence to generate tracking information. One or more target tracks are generated based on the tracking information for the one or more clusters. The one or more target tracks are consolidated to generate refined tracks for the one or more objects.”
- KR102090487B1 discloses “Disclosed is an object detection apparatus and method using a fusion sensor.
- a method of detecting an object using a fusion sensor includes collecting lidar data and radar data using a lidar sensor and a radar sensor for a search area to detect an object; Extracting objects present in the search area based on the collected lidar data and radar data, respectively; Setting an area of interest of the lidar sensor using the collected lidar data, and generating an occlusion depth using an object extracted based on the lidar data; Setting a region of interest of the radar sensor using the collected radar data; And determining whether an object existing in the search area is a moving object by using an area of interest or occlusion depth of the lidar sensor and an area of interest of the radar sensor.”
- Patent Document 1 discloses a technique in which, once a target object has been correctly identified in a particular frame, a clustering method is used to track the target object through subsequent frames in which it may be occluded.
- the clustering method of Patent Document 1 requires the collection of a video sequence, and a cluster is associated with the target object based on the spatial and temporal information of the target object in consecutive frames.
- Patent Document 1 depends on first correctly identifying the target object in a frame in which it is not occluded before it can be tracked through subsequent frames in which it may be occluded. Accordingly, the technique disclosed in Patent Document 1 is not suitable in cases in which a video sequence in which the target object is not occluded in one or more frames is not available. Further, as the technique disclosed in Patent Document 1 performs clustering based on the temporal and spatial relationship of the target object in subsequent frames of the video sequence, clustering cannot be accurately performed in situations where the image capture rate is slow, as the distance of the target object between subsequent frames may be too large for cluster association.
- Patent Document 2 discloses a sensor fusion method to determine the depth of partially occluded objects based on both radar and LiDAR measurements.
- Patent Document 2 focuses on determining the depth of partially occluded objects, and is limited in its ability to detect the class of the target object. Accordingly, the technique of Patent Document 2 is not suitable for situations in which accurate class identification of target objects is important, such as automobile or railroad applications in which the risk of a particular object to automobile or train safety may depend on the class of the object (e.g., dynamic objects may pose a greater risk than static objects).
- an object detection device including an image acquisition unit for receiving a target image that at least includes a first target object; an image segmentation unit for processing the target image to identify an occluded area of the target image in which at least the first target object is at least partially occluded by a first occluding object, and assign a first object label to the first occluding object that indicates an object class of the first occluding object; an image conversion unit for generating a first recovery image by converting the occluded area of the target image to a first recovery mask based on the first object label; a generator unit for processing the first recovery image to generate a first recovered image in which the first recovery mask has been replaced with a first predicted image that includes at least a first predicted object associated with a second object label that indicates an object class corresponding to both the first predicted object and the first target object; and an object detection unit for processing the first recovered image to detect the first target object and generating an object detection result that includes at least a location of the first target object.
- FIG. 1 illustrates an example computing architecture for executing the embodiments of the present disclosure.
- FIG. 2 illustrates an example configuration of an object detection system, according to embodiments.
- FIG. 3 illustrates a block diagram of an inference phase process of the object detection system, according to embodiments.
- FIG. 4 illustrates a block diagram of a training phase process of the object detection system, according to embodiments.
- FIG. 5 illustrates a block diagram of an image segmentation training process, according to embodiments.
- FIG. 6 illustrates a block diagram of a set of transformation units, according to embodiments.
- FIG. 7 illustrates an example of a target image, a recovery image, and a recovered image, according to embodiments.
- aspects of the disclosure relate to utilizing semantic segmentation to identify an occluding object that occludes a target object, replacing the area of the occluding object with a recovery mask, and subsequently utilizing a generative adversarial network (GAN) to generate a recovered image in which the recovery mask has been replaced with a predicted image that indicates at least a predicted object that shares an object class with the target object.
- occluded area refers to the area in the image that corresponds to the occluding object identified by semantic segmentation.
- the occluded area identified using the high-resolution semantic segmentation may be used to estimate model weights for the occluded object recovery ability of a patch label classifier, a discriminator and a generator in a GAN. Additionally, in an inference phase, the occluded area can be used to guide a GAN to generate a predicted image to replace the occluded area.
- the occluded area may be represented using a binary recovery mask.
- in addition to the generator, the GAN may include global/patch discriminators and an additional patch label classifier to guide the training of the generator to recover the object class of the target object with greater accuracy. More particularly, these global/patch discriminators may receive the occluded area identified using the semantic segmentation to generalize the model for the occluded area from a set of labels, and guide the generator to generate a realistic predicted image to replace the occluded area.
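The combined guidance described above can be sketched as a weighted generator objective. This is a minimal illustration assuming a standard non-saturating GAN loss; the function names and weight values are hypothetical and not taken from the disclosure.

```python
import math

def generator_loss(d_global, d_patch, label_cls_loss,
                   w_global=1.0, w_patch=1.0, w_cls=0.5):
    """Hypothetical combined generator objective.

    d_global / d_patch: discriminator scores in (0, 1] for the whole
    recovered image and for the predicted patch, respectively.
    label_cls_loss: patch label classifier loss on the predicted object class.
    """
    eps = 1e-8
    # Non-saturating adversarial terms: lower when the discriminators
    # score the generated content as "real" (close to 1).
    adv_global = -math.log(d_global + eps)
    adv_patch = -math.log(d_patch + eps)
    return w_global * adv_global + w_patch * adv_patch + w_cls * label_cls_loss
```

The patch term and the classifier term push the generator toward content that is locally realistic and carries the correct object class, in line with the guidance role described above.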
- aspects of the disclosure relate to utilizing a plurality of transformation units (e.g., additional GANs) to enhance the quality and modify features (e.g., weather conditions, lighting conditions) of input images.
- the outputs from these transformation units may be weighted and combined into a single image using a set of convolution layers.
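The weighting-and-combination step might look like the following per-pixel sketch, shown here for grayscale images represented as nested lists; treating the learned convolution as a normalized per-unit scalar weight is a simplifying assumption.

```python
def combine_outputs(outputs, weights):
    """Blend N candidate images (H x W grayscale, nested lists) into one.

    outputs: list of N images produced by the transformation units.
    weights: list of N learned scalar weights (illustrative values only).
    Weights are normalized so a constant input stays in range.
    """
    total = float(sum(weights))
    norm = [w / total for w in weights]
    height, width = len(outputs[0]), len(outputs[0][0])
    return [[sum(norm[n] * outputs[n][y][x] for n in range(len(outputs)))
             for x in range(width)]
            for y in range(height)]
```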
- FIG. 1 depicts a high-level block diagram of a computer system 100 for implementing various embodiments of the present disclosure, according to embodiments.
- the mechanisms and apparatus of the various embodiments disclosed herein apply equally to any appropriate computing system.
- the major components of the computer system 100 include one or more processors 102, a memory 104, a terminal interface 112, a storage interface 113, an I/O (Input/Output) device interface 114, and a network interface 115, all of which are communicatively coupled, directly or indirectly, for inter-component communication via a memory bus 106, an I/O bus 108, bus interface unit 109, and an I/O bus interface unit 110.
- the computer system 100 may contain one or more general-purpose programmable central processing units (CPUs) 102A and 102B, herein generically referred to as the processor 102.
- the computer system 100 may contain multiple processors; however, in certain embodiments, the computer system 100 may alternatively be a single CPU system.
- Each processor 102 executes instructions stored in the memory 104 and may include one or more levels of on-board cache.
- the memory 104 may include a random-access semiconductor memory, storage device, or storage medium (either volatile or non-volatile) for storing or encoding data and programs.
- the memory 104 represents the entire virtual memory of the computer system 100, and may also include the virtual memory of other computer systems coupled to the computer system 100 or connected via a network.
- the memory 104 can be conceptually viewed as a single monolithic entity, but in other embodiments the memory 104 is a more complex arrangement, such as a hierarchy of caches and other memory devices.
- memory may exist in multiple levels of caches, and these caches may be further divided by function, so that one cache holds instructions while another holds non-instruction data, which is used by the processor or processors.
- Memory may be further distributed and associated with different CPUs or sets of CPUs, as is known in any of various so-called non-uniform memory access (NUMA) computer architectures.
- the memory 104 may store all or a portion of the various programs, modules and data structures for processing data transfers as discussed herein.
- the memory 104 can store an object detection application 150.
- the object detection application 150 may include instructions or statements that execute on the processor 102 or instructions or statements that are interpreted by instructions or statements that execute on the processor 102 to carry out the functions as further described below.
- the object detection application 150 is implemented in hardware via semiconductor devices, chips, logical gates, circuits, circuit cards, and/or other physical hardware devices in lieu of, or in addition to, a processor-based system.
- the object detection application 150 may include data in addition to instructions or statements.
- a camera, sensor, or other data input device may be provided in direct communication with the bus interface unit 109, the processor 102, or other hardware of the computer system 100. In such a configuration, the need for the processor 102 to access the memory 104 and the object detection application 150 may be reduced.
- the computer system 100 may include a bus interface unit 109 to handle communications among the processor 102, the memory 104, a display system 124, and the I/O bus interface unit 110.
- the I/O bus interface unit 110 may be coupled with the I/O bus 108 for transferring data to and from the various I/O units.
- the I/O bus interface unit 110 communicates with multiple I/O interface units 112, 113, 114, and 115, which are also known as I/O processors (IOPs) or I/O adapters (IOAs), through the I/O bus 108.
- the display system 124 may include a display controller, a display memory, or both. The display controller may provide video, audio, or both types of data to a display device 126.
- the computer system 100 may include one or more sensors or other devices configured to collect and provide data to the processor 102.
- the computer system 100 may include biometric sensors (e.g., to collect heart rate data, stress level data), environmental sensors (e.g., to collect humidity data, temperature data, pressure data), motion sensors (e.g., to collect acceleration data, movement data), or the like. Other types of sensors are also possible.
- the display memory may be a dedicated memory for buffering video data.
- the display system 124 may be coupled with a display device 126, such as a standalone display screen, computer monitor, television, or a tablet or handheld device display. In one embodiment, the display device 126 may include one or more speakers for rendering audio.
- one or more speakers for rendering audio may be coupled with an I/O interface unit.
- one or more of the functions provided by the display system 124 may be on board an integrated circuit that also includes the processor 102.
- one or more of the functions provided by the bus interface unit 109 may be on board an integrated circuit that also includes the processor 102.
- the I/O interface units support communication with a variety of storage and I/O devices.
- the terminal interface unit 112 supports the attachment of one or more user I/O devices 116, which may include user output devices (such as a video display device, speaker, and/or television set) and user input devices (such as a keyboard, mouse, keypad, touchpad, trackball, buttons, light pen, or other pointing device).
- a user may manipulate the user input devices using a user interface in order to provide input data and commands to the user I/O device 116 and the computer system 100, and may receive output data via the user output devices.
- a user interface may be presented via the user I/O device 116, such as displayed on a display device, played via a speaker, or printed via a printer.
- the storage interface 113 supports the attachment of one or more disk drives or direct access storage devices 117 (which are typically rotating magnetic disk drive storage devices, although they could alternatively be other storage devices, including arrays of disk drives configured to appear as a single large storage device to a host computer, or solid-state drives, such as flash memory).
- the storage device 117 may be implemented via any type of secondary storage device.
- the contents of the memory 104, or any portion thereof, may be stored to and retrieved from the storage device 117 as needed.
- the I/O device interface 114 provides an interface to any of various other I/O devices or devices of other types, such as printers or fax machines.
- the network interface 115 provides one or more communication paths from the computer system 100 to other digital devices and computer systems; these communication paths may include, for example, one or more networks 130.
- the computer system 100 shown in FIG. 1 illustrates a particular bus structure providing a direct communication path among the processors 102, the memory 104, the bus interface 109, the display system 124, and the I/O bus interface unit 110.
- the computer system 100 may include different buses or communication paths, which may be arranged in any of various forms, such as point-to-point links in hierarchical, star or web configurations, multiple hierarchical buses, parallel and redundant paths, or any other appropriate type of configuration.
- although the I/O bus interface unit 110 and the I/O bus 108 are shown as single respective units, the computer system 100 may, in fact, contain multiple I/O bus interface units 110 and/or multiple I/O buses 108. While multiple I/O interface units are shown which separate the I/O bus 108 from various communications paths running to the various I/O devices, in other embodiments, some or all of the I/O devices are connected directly to one or more system I/O buses.
- the computer system 100 is a multi-user mainframe computer system, a single-user system, or a server computer or similar device that has little or no direct user interface, but receives requests from other computer systems (clients).
- the computer system 100 may be implemented as a desktop computer, portable computer, laptop or notebook computer, tablet computer, pocket computer, telephone, smart phone, or any other suitable type of electronic device.
- FIG. 2 illustrates an example configuration of an object detection system 200, according to the embodiments of the present disclosure.
- the object detection system 200 primarily includes an image capture device 210, a client device 220, and an object detection device 230.
- the image capture device 210, the client device 220, and the object detection device 230 may be communicably connected via a communication network 225 such as a local area network (LAN) or the Internet.
- the image capture device 210 is a device configured to capture image data.
- the image data captured by the image capture device may include still images or videos (that is, a sequence of chronological image frames).
- the image capture device 210 may include a stationary camera (e.g., a surveillance camera) mounted in a predetermined location, a camera mounted on an automobile or a train, a camera included in a mobile computing device (e.g., the camera in a smartphone or tablet) or the like.
- the image capture device may be configured to capture a target image 212 that includes a target object and transmit it to the object detection device 230 via the communication network 225.
- the client device 220 may include a computing device for managing the object detection process performed by the object detection device 230 and viewing and confirming an object detection result 222 transmitted from the object detection device 230 as a result of the object detection process.
- the client device 220 may be used to designate target objects for detection, set parameters of the object detection device 230, or the like.
- the client device 220 may include a smartphone, tablet device, laptop computer, desktop computer, or other suitable computing device.
- the object detection device 230 is a computing device configured to perform the object detection process according to the embodiments of the present disclosure in order to detect a target object that is at least partially occluded by an occluding object.
- the object detection device 230 may include an image acquisition unit 232, an image segmentation unit 234, an image conversion unit 236, a generator unit 238, and an object detection unit 240.
- the object detection device 230 may be implemented using the computer system 100 illustrated in FIG. 1, such that the functions of the image acquisition unit 232, the image segmentation unit 234, the image conversion unit 236, the generator unit 238, the object detection unit 240, and other functions of the object detection device 230 are carried out by means of the object detection application 150.
- although the object detection device 230 is illustrated in FIG. 2 as including the image acquisition unit 232, the image segmentation unit 234, the image conversion unit 236, the generator unit 238, and the object detection unit 240, the present disclosure is not limited thereto, and other functional units (for example, functional units used in the training process) of the object detection device 230 will be described herein.
- the image acquisition unit 232 is a functional unit configured to receive the target image 212 captured by the image capture device 210.
- the target image 212 may include an image in which a target object is at least partially occluded by an occluding object.
- the target image 212 is a 3-channel RGB image, but the present disclosure is not limited thereto.
- in the case of a driverless train, for example, the image capture device 210 may capture a target image 212 of the surroundings of the train that includes a target object for which class identification is desirable.
- the occluding object refers to any object that partially blocks, hides, or obscures the target object in the target image 212.
- in embodiments, the occluding object may block 30% or less of the target object in the target image 212.
- the occluding object may include a crossing device, an electric pole, a tree, a building, or any other object that partially obscures the target object.
- the image segmentation unit 234 is a functional unit configured to process the target image acquired by the image acquisition unit 232 to identify an occluded area of the target image 212 in which the target object is at least partially occluded by an occluding object, and assign an object label to the occluding object that indicates the object class of the occluding object.
- an object class refers to the type classification of an object (e.g., person, dog, tree)
- an object label refers to a metadata tag that indicates an object class.
- the function of the image segmentation unit 234 may be realized using a suitable semantic segmentation technique.
- the image segmentation unit 234 may include a fully convolutional network, DeepLab, Atrous convolution, Spatial Pyramidal Pooling, a global convolutional network, a spatio-temporal fully convolutional network, or the like.
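Whichever segmentation network is used, its per-pixel class scores reduce to a label map, from which the occluded area can then be read off as the pixels belonging to occluding classes. A library-agnostic sketch, with hypothetical class ids:

```python
def scores_to_label_map(scores):
    """Per-pixel argmax over class scores; scores is H x W x num_classes
    (nested lists). Returns an H x W map of class ids."""
    return [[max(range(len(px)), key=px.__getitem__) for px in row]
            for row in scores]

def occluded_area(label_map, occluding_class_ids):
    """Boolean mask: True where the pixel belongs to an occluding object
    (e.g. the class id assigned to "Electric Pole")."""
    ids = set(occluding_class_ids)
    return [[label in ids for label in row] for row in label_map]
```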
- the image conversion unit 236 is a functional unit configured to generate a recovery image by converting the occluded area of the target image 212 to a recovery mask based on the object label of the occluding object.
- the generator unit 238 is a functional unit for processing the recovery image generated by the image conversion unit 236 to generate a recovered image in which the recovery mask has been replaced with a predicted image.
- the predicted image is placed at substantially the same location and occupies substantially the same area as the recovery mask (e.g., the same area as the occluded area).
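This compositing relationship — original pixels outside the recovery mask, generator prediction inside it — can be sketched as follows; treating it as an explicit per-pixel selection is a simplification of what the generator network does internally.

```python
def compose_recovered_image(target_image, predicted_image, recovery_mask):
    """Replace only the masked (occluded) pixels with the predicted content.

    All arguments are H x W nested lists; recovery_mask holds booleans
    that are True on the occluded area.
    """
    return [[predicted_image[y][x] if recovery_mask[y][x] else target_image[y][x]
             for x in range(len(row))]
            for y, row in enumerate(target_image)]
```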
- the generator unit 238 may include the trained generator of a generative adversarial network (GAN).
- the object detection unit 240 is a functional unit configured to process the recovered image generated by the generator unit 238 to detect the target object and generate an object detection result 222.
- the function of the object detection unit 240 may be realized using any suitable object detection technique.
- the object detection unit 240 may include Histogram of Oriented Gradients (HOG), Region-based Convolutional Neural Networks (R-CNN), Fast R-CNN, Faster R-CNN, Region-based Fully Convolutional Network (R-FCN), Single Shot Detector (SSD), Spatial Pyramid Pooling (SPP-net), YOLO (You Only Look Once) or the like.
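The detectors listed above share a common post-processing step, non-maximum suppression (NMS), which keeps only the highest-scoring box among heavily overlapping candidates. A minimal sketch:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, thresh=0.5):
    """Return indices of boxes kept after non-maximum suppression."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a kept box.
        if all(iou(boxes[i], boxes[j]) <= thresh for j in keep):
            keep.append(i)
    return keep
```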
- aspects of the disclosure relate to a training phase process 400 in which the object detection device 230 is trained to achieve a high degree of object detection accuracy and an inference phase process 300 in which the trained object detection device 230 is used to generate object detection results with respect to a target image.
- FIG. 3 illustrates a block diagram of the inference phase process 300 of the object detection system 200, according to embodiments.
- the image segmentation unit 234 receives the target image 212 acquired by the image acquisition unit 232 (not shown in FIG. 3) and processes the target image 212 to identify an occluded area of the target image 212 in which the target object is at least partially occluded by an occluding object, and assigns an object label to the occluding object.
- the occluded area of the target image 212 refers to the entire area of the occluding object.
- the object label refers to a data label that identifies the predicted class of the occluding object. In embodiments, the object label may also designate the location of the occluding object in the image.
- for example, when the occluding object is an electric pole, the image segmentation unit 234 may identify the entire area of the electric pole as the occluded area, and assign a first object label of “Electric Pole” to the electric pole.
- the image conversion unit 236 generates a recovery image 325 by converting the occluded area of the target image 212 to a recovery mask based on the object label assigned at Step S310.
- the recovery mask refers to an image used to indicate and isolate the region of the target image 212 to be processed by the generator unit 238 in Step S330.
- the recovery mask may include a binary image in which those pixels that correspond to the occluded area are indicated by a first pixel value (e.g., “1”) and those pixels that do not correspond to the occluded area are indicated by a second pixel value (e.g., “0”).
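The conversion of Step S320 can be sketched as producing exactly that pair — a binary mask plus the target image with the occluded area blanked out. The fill value used for blanked pixels is an assumption for illustration:

```python
def to_recovery_image(target_image, occluded_mask, fill=0):
    """Blank the occluded area; returns (recovery_image, recovery_mask).

    target_image: H x W nested list of pixel values.
    occluded_mask: H x W booleans, True on the occluded area.
    The returned mask uses 1 for occluded pixels and 0 elsewhere.
    """
    mask = [[1 if m else 0 for m in row] for row in occluded_mask]
    image = [[fill if occluded_mask[y][x] else target_image[y][x]
              for x in range(len(row))]
             for y, row in enumerate(target_image)]
    return image, mask
```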
- the generator unit 238 processes the recovery image 325 generated by the image conversion unit 236 in Step S320 to generate a recovered image 335 in which the recovery mask has been replaced with a predicted image.
- the predicted image refers to an artificially generated image to replace the recovery mask (that is, the occluded area).
- the predicted image includes a predicted object that shares an object class with the target object.
- the predicted image may include a plurality of predicted objects having object classes that respectively correspond to the occluded objects.
- the generator unit 238 may generate a predicted image in which predicted objects of a person, an automobile, the ground, and the sky replace the recovery mask (that is, the occluded area).
- the object detection unit 240 processes the recovered image 335 generated by the generator unit 238 at Step S330 to detect the target object and generate an object detection result 222.
- the object detection result 222 may include a collection of data that indicates at least a target object label that indicates the object class of the target object.
- the object detection result 222 may include additional information regarding the target object, such as the location of the target object in the image, the trajectory (e.g., predicted path of motion) of the target object, or the like. Subsequently, the object detection result 222 may be transmitted to the client device 220 via the communication network 225.
- according to the inference phase process 300 of the object detection system 200 illustrated in FIG. 3, it is possible to provide a method capable of detecting occluded objects with a high degree of accuracy.
- FIG. 4 illustrates a block diagram of a training phase process 400 of the object detection system, according to embodiments.
- the image segmentation unit 234 is used to process a training image 401 to identify occluding objects (e.g., those objects that partially occlude a target object) in the training image 401.
- the image segmentation unit 234 may output a set of object labels 407 that indicate the object class and location information for the occluding objects in the training image 401 (e.g., coordinates indicating the occluding area corresponding to the occluding object).
- the training image 510 may include an image of a scene in which a target object is occluded by an occluding object.
- it is assumed that the image segmentation unit 234 has already been trained to accurately identify occluding objects.
- the image segmentation unit 234 may generate a set of pixel classification data 414 by performing a function of labeling each pixel of the training image as “real” or “fake,” such that each pixel corresponding to the occluding area is labeled as “fake” and each pixel not corresponding to the occluding area is labeled as “real.”
- This set of pixel classification data 414 can be used by the discriminator unit described later to evaluate the recovered image 413 generated by the generator unit 238.
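Under this labeling rule, the pixel classification data is simply the occluded-area mask relabeled, which can be sketched as:

```python
def pixel_classification_data(occluded_mask):
    """Label occluded-area pixels 'fake' and all other pixels 'real'.

    occluded_mask: H x W nested list of booleans, True on the occluded area.
    """
    return [["fake" if m else "real" for m in row] for row in occluded_mask]
```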
- the image segmentation unit 234 may use a set of occluded object labels 418 to replace the object labels corresponding to the occluding objects with the object labels corresponding to the occluded objects, and generate an occluded object labeled image in which the occluded objects have been labeled with their correct object classes. More particularly, the image segmentation unit 234 may generate the occluded object labeled image by replacing the object label of each pixel of the occluded area with the object label of the object that it occluded.
- the set of occluded object labels 418 may be generated based on a set of annotated target object labels created by a user or administrator (e.g., and stored in a database) that indicate the correct object classes for the occluded objects in the training image 401.
- the image conversion unit 236 generates a recovery image 411 by converting the occluded area of the training image 401 to a recovery mask based on the object labels 407.
- the recovery mask may include a binary image in which those pixels that correspond to the occluded area are indicated by a first pixel value (e.g., “1”) and those pixels that do not correspond to the occluded area are indicated by a second pixel value (e.g., “0”).
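The binary recovery-mask convention described above can be sketched as follows. NumPy arrays, a single-channel image, and a placeholder value of 0.0 for masked pixels are simplifying assumptions.

```python
import numpy as np

def to_recovery_image(image, recovery_mask, mask_value=0.0):
    """Convert the occluded area of `image` into a recovery-mask region.

    `recovery_mask` is the binary image described above: 1 for pixels of
    the occluded area, 0 elsewhere.  Masked pixels are overwritten with a
    placeholder value for a generator to fill in later.
    """
    recovered_input = image.astype(float)
    recovered_input[recovery_mask == 1] = mask_value
    return recovered_input

image = np.arange(9, dtype=float).reshape(3, 3)
mask = np.zeros((3, 3), dtype=np.uint8)
mask[1, 1] = 1  # a single occluded pixel
recovery_image = to_recovery_image(image, mask)
```

All non-occluded pixels pass through unchanged, which is what later allows the consistency loss to penalize any loss of visible content.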
- the generator unit 238 processes the recovery image 411 to generate a recovered image 413 (e.g., a second recovered image) in which the recovery mask has been replaced with a predicted image that indicates one or more predicted objects that share object classes with the respective occluded objects of the training image 401. It should be noted that, at this stage, as the generator unit 238 has not been fully trained, the accuracy of the recovered image 413 generated by the generator unit 238 at Step S412 may be low.
- the generator unit 238 can be trained to increase its accuracy and generate accurate recovered images 413 in which the occluded area occluded by an occluding object can be replaced with a predicted image including one or more predicted objects having the same object class as the occluded object in the training image 401.
- a discriminator unit receives the recovered image 413 from the generator unit 238, the pixel classification data 414 from the image segmentation unit 234, and a reference image 415 that indicates the target object in a state in which it is not occluded by an occluding object, and is trained using the received data to identify the area corresponding to the predicted image of the received recovered image 413.
- the discriminator unit is configured to distinguish whether the recovered image 413 is real (e.g., an original image) or fake (e.g., generated by the generator unit).
- the discriminator unit may include a classifier unit of a GAN.
- the discriminator unit may include a global discriminator configured to evaluate the recovered image 413 in its entirety and a patch discriminator configured to evaluate the predicted image within the recovered image 413.
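One way to realize this two-scale evaluation, sketched here as an assumption rather than the patented implementation, is to blend the global discriminator's whole-image score with the mean of the patch discriminator's scores over the patches covering the predicted image:

```python
import numpy as np

def two_scale_realism_score(global_score, patch_scores, patch_weight=0.5):
    """Blend a global discriminator score (whole image) with a patch
    discriminator score (mean over patches covering the predicted image).

    All scores are probabilities in [0, 1] that the input is real;
    `patch_weight` controls the relative influence of the patch scores.
    """
    patch_score = float(np.mean(patch_scores))
    return (1.0 - patch_weight) * float(global_score) + patch_weight * patch_score
```

With this weighting, a recovered image that looks plausible overall but has an unconvincing in-painted region is still penalized.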
- the reference image 415 may be an image that illustrates substantially the same scene as the training image 401 except that the target object is not occluded by the occluding object.
- the discriminator unit can be trained to avoid mis-labeling pixels corresponding to the target object.
- the discriminator unit may be trained to generate a first set of feedback weights 417 based on the image quality level of the recovered image 413.
- the image quality level refers to a quantitative measure of the degree to which the recovered image 413 resembles a real image (e.g., recovered images 413 that are more difficult for the discriminator to distinguish as fake may be considered to have a higher image quality level).
- This first set of feedback weights 417 may be back-propagated to the generator unit 238. Subsequently, the parameters of the generator unit 238 may be adjusted based on this first set of feedback weights 417 in order to facilitate the generation of more accurate recovered images 413.
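The discriminator feedback driving this parameter adjustment can be expressed as a standard adversarial loss whose gradient is back-propagated into the generator. The non-saturating form shown below is a common choice for GAN generators and is an assumption here, not a detail taken from the source:

```python
import numpy as np

def generator_adversarial_loss(d_outputs):
    """Non-saturating GAN generator loss: -log D(G(z)).

    `d_outputs` are the discriminator's probabilities that each recovered
    image in the batch is real; the loss is small when the discriminator
    is fooled (probabilities close to 1) and large when it is not.
    """
    d = np.clip(np.asarray(d_outputs, dtype=float), 1e-7, 1.0 - 1e-7)
    return float(-np.mean(np.log(d)))
```

Minimizing this loss pushes the generator toward recovered images the discriminator classifies as real.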
- a patch label classifier unit receives the recovered image 413 from the generator unit 238, the set of occluded object labels 418 from the image segmentation unit 234 (e.g., the occluded object labeled image), and the reference image 415, and is trained using the received data to classify the object labels of the predicted objects in the received recovered image 413.
- the patch label classifier unit may compare the object labels identified for the predicted objects in the recovered image 413 with the occluded object labels 418 to evaluate the accuracy of the object labels of the predicted objects in the recovered image 413, and generate a second set of feedback weights 420 based on the correlation between the object labels of the predicted objects in the recovered image 413 and the occluded object labels 418.
- This second set of feedback weights 420 may be back-propagated to the generator unit 238. Subsequently, the parameters of the generator unit 238 may be adjusted based on this second set of feedback weights 420 in order to facilitate the generation of recovered images 413 with more accurate predicted objects. Put differently, the generator unit 238 can be trained to generate recovered images 413 having predicted objects that share the same object class as the target objects that were occluded in the training image 401.
- a consistency management unit calculates, based on the recovered image 413 and the training image 401, a consistency loss value that indicates a degree of information loss between the recovered image 413 and the training image 401.
- the consistency loss value is a value that increases with the number of pixels of the non-occluded area of the training image 401 that are not preserved in the recovered image 413. This consistency loss value may be back-propagated to the generator unit 238. Subsequently, the parameters of the generator unit 238 may be adjusted in order to reduce the consistency loss value for future recovered images 413.
- the generator unit 238 can be trained to maximize the amount of visual information maintained between the training image 401 and the recovered image 413 for the non-occluding area (e.g., to avoid generating recovered images 413 in which non-occluded portions of the training image 401 are lost).
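A consistency loss of this kind can be sketched as a reconstruction loss restricted to the non-occluded area. The use of a mean absolute (L1) difference is an assumption; the source specifies only that the value grows with the amount of visible content lost.

```python
import numpy as np

def consistency_loss(training_image, recovered_image, occlusion_mask):
    """Mean absolute difference restricted to the non-occluded area.

    `occlusion_mask` is 1 where the training image was occluded and 0
    elsewhere; only the visible (0) region is compared, so the value
    grows as originally-visible content fails to survive into the
    recovered image.  Changes inside the occluded area are ignored.
    """
    visible = occlusion_mask == 0
    diff = np.abs(training_image.astype(float) - recovered_image.astype(float))
    return float(diff[visible].mean())
```

Because occluded pixels are excluded, the generator remains free to in-paint the masked region while being penalized for altering anything else.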
- the discriminator unit can be trained to distinguish whether the recovered images 413 generated by the generator unit 238 are real or fake, and the generator unit 238 can be trained to generate recovered images 413 that fool the discriminator unit (e.g., images that the discriminator unit classifies as real images, despite being generated by the generator unit 238).
- the generator unit 238 can be trained to generate realistic recovered images in which the area occluded by occluding objects has been replaced with predicted images that illustrate objects corresponding to the occluded objects in the original target image.
- aspects of the disclosure relate to utilizing the image segmentation unit 234 to process the target image acquired by the image acquisition unit 232 to identify an occluding object that at least partially occludes a target object.
- the image segmentation unit 234 may include a fully convolutional network, DeepLab, atrous convolution, spatial pyramid pooling, a global convolutional network, a spatio-temporal fully convolutional network, or the like. Accordingly, in order to accurately identify the occluding objects in the target image, it is desirable to perform an image segmentation training process 500 to train the image segmentation unit 234.
- FIG. 5 illustrates a block diagram of an image segmentation training process 500, according to embodiments.
- an untrained image segmentation unit 234 is used to process a training image 510.
- the training image 510 may include an image of a scene in which a target object is occluded by an occluding object.
- the image segmentation unit 234 processes the training image 510 to attempt to identify the object class of the occluding object together with the occluding area (e.g., that is, the area enclosed by the occluding object).
- the parameters of the image segmentation unit 234 are adjusted (e.g., via back propagation or the like) to reduce the loss value calculated based on the difference between the results of the image segmentation unit 234 and the ground truth data.
- This image segmentation training process 500 may be repeated until the image segmentation accuracy of the image segmentation unit 234 achieves a predetermined accuracy threshold (e.g., 90%, 95%). In this way, the image segmentation unit 234 can be trained to accurately identify the object labels and occluding areas of occluding objects.
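The repeat-until-threshold control flow of the training process 500 can be sketched as follows. The `train_one_epoch` and `evaluate_accuracy` callables are hypothetical placeholders for the segmentation model's update and validation steps, which the source does not specify at this level of detail:

```python
def train_until_threshold(train_one_epoch, evaluate_accuracy,
                          threshold=0.95, max_epochs=100):
    """Repeat a training step until segmentation accuracy reaches the
    predetermined threshold (or until max_epochs is exhausted)."""
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()                 # adjust parameters (e.g., back-propagation)
        accuracy = evaluate_accuracy()    # compare results against ground truth
        if accuracy >= threshold:
            return epoch, accuracy
    return max_epochs, accuracy

# Toy example: accuracy improves by 0.5 per epoch from 0.0.
state = {"acc": 0.0}
def step():
    state["acc"] = min(1.0, state["acc"] + 0.5)

epochs, acc = train_until_threshold(step, lambda: state["acc"], threshold=0.9)
```

In practice the loss-driven update inside `train_one_epoch` would be the back-propagation step described above.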
- aspects of the disclosure relate to the recognition that in some cases, target images may have poor image quality due to weather conditions, lighting conditions, image resolution, or the like. In such situations, the poor image quality may negatively affect the accuracy of object detection performed on these target images. Accordingly, aspects of the disclosure relate to utilizing a set of transformation units configured to perform image transformation operations on a target image to facilitate accurate object detection. Accordingly, FIG. 6 illustrates a set of transformation units 600, according to the embodiments of the present disclosure.
- the set of transformation units 600 may be configured in parallel with the image segmentation unit 234, the image conversion unit 236, and the generator unit 238 described above.
- the set of transformation units 600, the image segmentation unit 234, the image conversion unit 236, and the generator unit 238 may be configured as different layers within a neural network.
- a set of output images 640 generated by the set of transformation units 600 and the generator unit 238 may be aggregated into a single combined output. It should be noted that, as the image segmentation unit 234, the image conversion unit 236, the generator unit 238, and the recovered image 335 have been described above, the descriptions thereof will be omitted here.
- the set of transformation units 600 includes a first transformation unit 610, a second transformation unit 620, and a third transformation unit 630.
- Each of the set of transformation units 600 may be configured using a generator (e.g., the generator of a GAN) trained to perform a different image transformation on the target image 212. More particularly, each of the set of transformation units 600 may be configured to perform an image transformation operation to increase the similarity of the target image 212 to the training images (e.g., an object detection training image) used to train the object detection unit 240. By increasing the similarity of the target image 212 to the training images used to train the object detection unit 240, it is possible to facilitate accurate object detection.
- the third transformation unit 630 may be configured to perform a lighting transformation operation to generate a third transformed target image 632 having a lighting condition different than the target image 212. For instance, in the case that the target image 212 illustrates a nighttime scene, the third transformation unit 630 may perform a lighting transformation operation on the target image 212 to generate a third transformed target image 632 in which the target image 212 has been transformed to a daytime scene in order to facilitate object detection.
- a transformation management unit 645 may receive the target image 212, the recovered image 335, the first transformed target image 612, the second transformed target image 622, and the third transformed target image 632, and assign a first set of weights to the target image, a second set of weights to the recovered image 335, a third set of weights to the first transformed target image 612, a fourth set of weights to the second transformed target image 622, and a fifth set of weights to the third transformed target image 632.
- the first, second, third, fourth, and fifth set of weights are weights that indicate the degree to which the features corresponding to each respective image should be reflected in the final combined transformed target image.
- the transformation management unit 645 may be trained to assign the first, second, third, fourth, and fifth set of weights based on the degree to which the features of each of the target image 212, the recovered image 335, the first transformed target image 612, the second transformed target image 622, and the third transformed target image 632 are represented in the training images used to train the object detection unit 240, such that images that include features (lighting conditions, weather conditions, resolution) that are more similar to the training images used to train the object detection unit 240 are given a greater weight.
- the transformation management unit 645 may be implemented using the different model layers of a GAN.
- the transformation management unit 645 may combine the target image 212, the recovered image 335, the first transformed target image 612, the second transformed target image 622, and the third transformed target image 632 into a combined transformed target image 655 based on the first set of weights, the second set of weights, the third set of weights, the fourth set of weights, and the fifth set of weights (S650).
- a combined transformed target image 655 can be generated that reflects the characteristics of each of the target image 212, the recovered image 335, the first transformed target image 612, the second transformed target image 622, and the third transformed target image 632 in accordance with the first, second, third, fourth, and fifth set of weights.
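The weighted combination described above can be sketched as a normalized weighted sum over the candidate images. Treating the weights as scalars (one per image) rather than full per-pixel weight maps is a simplifying assumption:

```python
import numpy as np

def combine_transformed_images(images, weights):
    """Combine candidate images into a single output as a weighted sum.

    `images` is a list of equally-shaped arrays (e.g., the target image,
    the recovered image, and the transformed variants); `weights` reflects
    how strongly each image's features should appear in the result.
    Weights are normalized so they sum to 1.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    stack = np.stack([np.asarray(im, dtype=float) for im in images])
    return np.tensordot(w, stack, axes=1)  # weighted sum over the image axis
```

A learned transformation management unit would produce the weights; here they are supplied directly for illustration.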
- a combined transformed target image 655 that has been transformed to have greater similarity to the training images used to train the object detection unit 240 can be generated.
- the combined transformed target image 655 can be input to the object detection unit 240.
- the object detection unit 240 may process the combined transformed target image 655 to detect the target object and generate an object detection result 222, as described herein.
- aspects of the disclosure relate to converting a target image 212 in which one or more target objects are at least partially occluded by one or more occluding objects to a recovery image 720 in which the area occluded by the occluding objects is represented as a binary recovery mask, and subsequently generating a recovered image 335 in which the recovery mask has been replaced with a predicted image illustrating predicted objects that correspond to (e.g., share object classes with) the occluded target objects of the target image 212.
- FIG. 7 illustrates an example of a target image 212, a recovery image 720, and a recovered image 335, according to embodiments.
- a first target object, an automobile, is occluded by a railroad crossing device 705
- a second target object, a person 712, is occluded by a railroad crossing device 715.
- by processing the target image 212 using the image segmentation unit 234, the areas occluded by the railroad crossing devices 705, 715 can be identified, and an object label of “railroad crossing device” can be assigned thereto.
- the image segmentation unit 234 can generate the recovery image 720 by converting the occluded area of the target image 212 to a recovery mask 725 based on the object label.
- the generator unit 238 can process the recovery image 720 to generate a recovered image 335 in which the recovery mask 725 has been replaced with a first predicted image 745 that includes one or more predicted objects having the same object classes as the respective target objects. More particularly, as shown in the recovered image 335, the area of the target image 212 occluded by the railroad crossing device 705 has been replaced with a predicted image illustrating an automobile 752, and the area of the target image 212 occluded by the railroad crossing device 715 has been replaced with a predicted image illustrating a person 762, in addition to the appropriate background images of the sky and ground that were occluded in the target image 212.
- the recovered image 335 generated in this way can subsequently be input to an object detection unit (e.g., the object detection unit 240) to facilitate detection of the target objects of the automobile and the person with a high degree of accuracy.
- by training a generator unit (e.g., the generator of a GAN) to generate a recovered image in which the occluded area of a target image has been replaced with a predicted image that indicates one or more predicted objects having the same object class as the respective occluded objects in the target image, and subsequently performing object detection on this recovered image, it is possible to generate an accurate object detection result even in cases where target objects are occluded.
- aspects of the disclosure relate to utilizing a plurality of transformation units (e.g., additional GANs) to enhance the quality and modify features (e.g., weather conditions, lighting conditions) of input images.
- the outputs from these transformation units may be weighted and combined into a single image using a set of convolution layers.
- the present invention may be a system, a method, and/or a computer program product.
- the computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.
- the computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device.
- the computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
- a non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing.
- a computer readable storage medium is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
- These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
- These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
- the computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
- Embodiments according to this disclosure may be provided to end-users through a cloud-computing infrastructure.
- Cloud computing generally refers to the provision of scalable computing resources as a service over a network.
- Cloud computing may be defined as a computing capability that provides an abstraction between the computing resource and its underlying technical architecture (e.g., servers, storage, networks), enabling convenient, on-demand network access to a shared pool of configurable computing resources that can be rapidly provisioned and released with minimal management effort or service provider interaction.
- cloud computing allows a user to access virtual computing resources (e.g., storage, data, applications, and even complete virtualized computing systems) in “the cloud,” without regard for the underlying physical systems (or locations of those systems) used to provide the computing resources.
- each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s).
- the functions noted in the block may occur out of the order noted in the figures.
- two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
- Object detection system; 210…Image capture device; 212…Target image; 220…Client device; 222…Object detection result; 225…Communication network; 230…Object detection device; 232…Image acquisition unit; 234…Image segmentation unit; 236…Image conversion unit; 238…Generator unit; 240…Object detection unit; 300…Inference phase process; 335…Recovered image; 400…Training phase process; 500…Image segmentation training process; 600…Set of transformation units
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Multimedia (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Computing Systems (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Software Systems (AREA)
- Databases & Information Systems (AREA)
- Health & Medical Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Image Analysis (AREA)
- Image Processing (AREA)
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/JP2021/045698 WO2023105800A1 (en) | 2021-12-10 | 2021-12-10 | Object detection device, object detection method, and object detection system |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| EP4445331A1 true EP4445331A1 (de) | 2024-10-16 |
| EP4445331A4 EP4445331A4 (de) | 2025-06-11 |
Family
ID=86729924
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| EP21967291.2A Pending EP4445331A4 (de) | 2021-12-10 | 2021-12-10 | Objekterkennungsvorrichtung, objekterkennungsverfahren und objekterkennungssystem |
Country Status (3)
| Country | Link |
|---|---|
| EP (1) | EP4445331A4 (de) |
| JP (1) | JP7660264B2 (de) |
| WO (1) | WO2023105800A1 (de) |
Families Citing this family (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US12505596B2 (en) | 2022-10-06 | 2025-12-23 | Adobe Inc. | Modifying digital images via perspective-aware object move |
| US20240135561A1 (en) * | 2022-10-06 | 2024-04-25 | Adobe Inc. | Modifying digital images via depth-aware object move |
| CN117611600B (zh) * | 2024-01-22 | 2024-03-29 | 南京信息工程大学 | 一种图像分割方法、系统、存储介质及设备 |
| CN118038172B (zh) * | 2024-03-11 | 2024-08-13 | 广东石油化工学院 | 基于特征增强和深度网络的温控器质量检测方法及系统 |
| CN120431323A (zh) * | 2024-10-23 | 2025-08-05 | 荣耀终端股份有限公司 | 遮挡判断方法、电子设备、存储介质及程序产品 |
| CN120259926B (zh) * | 2025-06-04 | 2025-08-08 | 德阳经开智航科技有限公司 | 针对遮挡目标的无人机智能识别方法及系统 |
Family Cites Families (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US11295439B2 (en) * | 2019-10-16 | 2022-04-05 | International Business Machines Corporation | Image recovery |
| CN112633234B (zh) * | 2020-12-30 | 2024-08-23 | 广州华多网络科技有限公司 | 人脸去眼镜模型训练、应用方法及其装置、设备和介质 |
- 2021
- 2021-12-10 EP EP21967291.2A patent/EP4445331A4/de active Pending
- 2021-12-10 WO PCT/JP2021/045698 patent/WO2023105800A1/en not_active Ceased
- 2021-12-10 JP JP2024533295A patent/JP7660264B2/ja active Active
Also Published As
| Publication number | Publication date |
|---|---|
| EP4445331A4 (de) | 2025-06-11 |
| WO2023105800A1 (en) | 2023-06-15 |
| JP7660264B2 (ja) | 2025-04-10 |
| JP2024545451A (ja) | 2024-12-06 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| WO2023105800A1 (en) | Object detection device, object detection method, and object detection system | |
| Ghahremannezhad et al. | Object detection in traffic videos: A survey | |
| Azfar et al. | Deep learning-based computer vision methods for complex traffic environments perception: A review | |
| Jana et al. | YOLO based Detection and Classification of Objects in video records | |
| Xu et al. | Fast vehicle and pedestrian detection using improved Mask R‐CNN | |
| Gurram et al. | Monocular depth estimation through virtual-world supervision and real-world sfm self-supervision | |
| US20180114071A1 (en) | Method for analysing media content | |
| Sadiq et al. | FD-YOLOv5: A Fuzzy Image Enhancement Based Robust Object Detection Model for Safety Helmet Detection: Mohd. Sadiq et al. | |
| Ling et al. | Optimization of autonomous driving image detection based on RFAConv and triplet attention | |
| CN114117128A (zh) | 视频标注的方法、系统及设备 | |
| Wang et al. | Intrusion detection for high-speed railways based on unsupervised anomaly detection models | |
| Amisse et al. | Fine-tuning deep learning models for pedestrian detection | |
| Bloisi et al. | Parallel multi-modal background modeling | |
| Liu et al. | A cloud infrastructure for target detection and tracking using audio and video fusion | |
| CN115482523A (zh) | 轻量级多尺度注意力机制的小物体目标检测方法及系统 | |
| Zhou et al. | Cascaded multi-task learning of head segmentation and density regression for RGBD crowd counting | |
| Singha et al. | Novel deeper AWRDNet: adverse weather-affected night scene restorator cum detector net for accurate object detection | |
| Sreekumar et al. | TPCAM: Real-time traffic pattern collection and analysis model based on deep learning | |
| JP7733632B2 (ja) | 事前訓練されたオブジェクト分類器を再訓練するためのシステム、方法、及びコンピュータプログラム | |
| Luo et al. | Crowd counting for static images: a survey of methodology | |
| Fleck et al. | Low-power traffic surveillance using multiple rgb and event cameras: A survey | |
| Al-Ghanem et al. | Deep learning based efficient crowd counting system | |
| Mittal et al. | A feature pyramid based multi-stage framework for object detection in low-altitude UAV images | |
| Zhang et al. | MASNet: a novel deep learning approach for enhanced detection of small targets in complex scenarios | |
| Sankaranarayanan et al. | Improved vehicle detection accuracy and processing time for video based ITS applications |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE |
| | PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase | Free format text: ORIGINAL CODE: 0009012 |
| | STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE |
| | 17P | Request for examination filed | Effective date: 20240710 |
| | AK | Designated contracting states | Kind code of ref document: A1. Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
| | DAV | Request for validation of the european patent (deleted) | |
| | DAX | Request for extension of the european patent (deleted) | |
| | A4 | Supplementary search report drawn up and despatched | Effective date: 20250513 |
| | RIC1 | Information provided on ipc code assigned before grant | Ipc: G06V 20/52 20220101ALI20250507BHEP; G06V 10/26 20220101ALI20250507BHEP; G06V 10/82 20220101ALI20250507BHEP; G06V 10/764 20220101ALI20250507BHEP; G06T 1/00 20060101ALI20250507BHEP; G06T 7/194 20170101AFI20250507BHEP |