WO2020107510A1 - AI systems and methods for object detection - Google Patents

AI systems and methods for object detection Download PDF

Info

Publication number
WO2020107510A1
Authority
WO
WIPO (PCT)
Prior art keywords
corners
crop
pooling
cropping
region proposals
Prior art date
Application number
PCT/CN2018/119410
Other languages
French (fr)
Inventor
Yuan Zhao
Ying Xin
Original Assignee
Beijing Didi Infinity Technology And Development Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Didi Infinity Technology And Development Co., Ltd. filed Critical Beijing Didi Infinity Technology And Development Co., Ltd.
Priority to JP2020557230A priority Critical patent/JP7009652B2/en
Publication of WO2020107510A1 publication Critical patent/WO2020107510A1/en

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/24 Aligning, centring, orientation detection or correction of the image
    • G06V 10/243 Aligning, centring, orientation detection or correction of the image by compensating for image skew or non-uniform image deformations
    • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 10/443 Local feature extraction by matching or filtering
    • G06V 10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V 10/451 Biologically inspired filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V 10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using neural networks
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V 20/47 Detecting features for summarising video content
    • G06V 20/48 Matching video sequences
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/172 Classification, e.g. identification

Definitions

  • the present disclosure generally relates to systems and methods for image processing, and in particular, systems and methods for detecting objects in an image.
  • the artificial intelligent object detection techniques can identify and/or classify an object in an image, and locate the object in the image by drawing a bounding box.
  • the bounding box may generally be a rectangular box. For a tilted object, the bounding box (e.g., a rectangular box) may include a large amount of background, and may even include more background than the object itself, so that the object cannot be located accurately.
  • an artificial intelligent image processing system for object detection may include at least one storage device and at least one processor in communication with the at least one storage device.
  • the at least one storage device may include a set of instructions for determining a boundary corresponding to an object in an image.
  • the at least one processor may be directed to obtain an image including a target object, and generate a plurality of feature maps by inputting the image into a convolutional neural network (CNN) .
  • the at least one processor may also be directed to determine a plurality of region proposals based on the plurality of feature maps, and determine a plurality of pooling region proposals based on the plurality of region proposals and the plurality of feature maps.
  • the at least one processor may be further directed to classify the plurality of pooling region proposals into one or more object categories or a background category via a classifier.
  • the one or more object categories may include a category of the target object
  • the plurality of pooling region proposals may include one or more pooling region proposals corresponding to the target object.
  • Each of the one or more pooling region proposals may have a plurality of corners.
  • the at least one processor may be directed to determine a plurality of crop strategies for each corner of the plurality of corners of the pooling region proposal according to a position of the corresponding corner; trim the pooling region proposal by cropping each of the plurality of corners according to one of the plurality of crop strategies; identify a boundary to the trimmed pooling region proposal based on the cropped plurality of corners; and map the boundary to the image to determine a boundary of the target object.
  • the CNN may include one or more convolution layers and one or more pooling layers, without a fully connected layer.
  • the plurality of region proposals may be determined according to a region proposal network (RPN) .
  • the RPN may include at least one regression layer and at least one classification layer.
  • the at least one processor may be directed to slide a sliding window over the plurality of feature maps. At each sliding-window location, the sliding window may coincide with a sub-region of the plurality of feature maps.
  • the at least one processor may be directed to map the sub-region of the plurality of feature maps to a multi-dimensional feature vector, and generate an anchor by mapping a center pixel of the sub-region to a pixel of the image.
  • the anchor may correspond to a set of anchor boxes in the image, and each of the set of anchor boxes may be associated with a scale and an aspect ratio.
  • the at least one processor may also be directed to feed the multi-dimensional feature vector into the at least one regression layer and the at least one classification layer, respectively.
  • the at least one regression layer may be configured to conduct bounding-box regression to determine a set of preliminary region proposals corresponding to the set of anchor boxes, and the output of the at least one regression layer may include four coordinate values of each of the set of preliminary region proposals.
  • the at least one classification layer may be configured to determine a category for each of the set of preliminary region proposals.
  • the category may be a foreground or a background, and the output of the at least one classification layer may include a first score of being foreground and a second score of being background of each of the set of preliminary region proposals.
  • the at least one processor may be further directed to select a portion of the plurality of preliminary region proposals as the plurality of region proposals based on the first score of being foreground, the second score of being background, and the four coordinate values of each of the plurality of preliminary region proposals.
  • the at least one processor may be directed to select the plurality of region proposals using a non-maximum suppression (NMS) .
  • the plurality of pooling region proposals may correspond to a canonical size.
  • the at least one processor may be further directed to map the plurality of region proposals to the plurality of feature maps to determine a plurality of proposal feature maps, and determine the plurality of pooling region proposals by performing pooling on the plurality of proposal feature maps.
  • the plurality of corners may include a top left corner, a top right corner, a bottom left corner, and a bottom right corner.
  • the plurality of crop strategies of the top left corner may include at least one of cropping to right, cropping to bottom, cropping to bottom right, target position, or false.
  • the plurality of crop strategies of the top right corner may include at least one of cropping to left, cropping to bottom, cropping to bottom left, target position, or false.
  • the plurality of crop strategies of the bottom left corner may include at least one of cropping to right, cropping to top, cropping to top right, target position, or false.
  • the plurality of crop strategies of the bottom right corner may include at least one of cropping to left, cropping to top, cropping to top left, target position, or false.
  • the at least one processor may be further directed to stop cropping one of the plurality of corners when the corner corresponds to a crop strategy of target position.
  • the at least one processor may be directed to determine a cropping direction and a cropping length for each of the plurality of corners based on the pooling region proposal.
  • the cropping direction of each of the plurality of corners may be limited to one of the plurality of crop strategies of the corresponding corner.
  • the at least one processor may also be directed to crop each of the plurality of corners based on the cropping direction and the cropping length.
  • the at least one processor may be directed to perform one or more iterations. In each of the one or more iterations, the at least one processor may be directed to determine a crop strategy for each of the plurality of corners based on the pooling region proposal from the plurality of crop strategies; determine whether one of the plurality of corners corresponds to a crop strategy of false; determine whether each of the plurality of corners corresponds to a crop strategy of target position in response to a determination that each of the plurality of corners does not correspond to the crop strategy of false; in response to a determination that at least one of the plurality of corners does not correspond to the crop strategy of target position, crop the at least one of the plurality of corners according to the determined crop strategy of the at least one of the plurality of corners; perform, based on the cropped plurality of corners, a bounding mapping to determine a rectangular box; and resize the rectangular box into a canonical size.
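The iteration just described can be summarized in the following Python control-flow sketch. Every callable is a hypothetical stand-in (the patent does not specify how crop strategies are predicted or how cropping is applied); only the loop structure mirrors the text above.

```python
CORNERS = ("top_left", "top_right", "bottom_left", "bottom_right")

def trim_pooling_region_proposal(proposal, predict_strategies, crop_corner,
                                 bounding_mapping, resize_to_canonical,
                                 max_iters=10):
    """Iteratively crop the four corners of a pooling region proposal.

    predict_strategies(proposal) returns a dict mapping each corner to one of
    its allowed crop strategies (e.g. 'crop_to_right'), 'target_position', or
    'false'. All callables here are hypothetical stand-ins.
    """
    for _ in range(max_iters):
        strategies = predict_strategies(proposal)
        if any(s == "false" for s in strategies.values()):
            return None                     # abandon the pooling region proposal
        if all(s == "target_position" for s in strategies.values()):
            return proposal                 # every corner is in place; stop cropping
        for corner in CORNERS:
            if strategies[corner] != "target_position":
                proposal = crop_corner(proposal, corner, strategies[corner])
        proposal = bounding_mapping(proposal)     # smallest enclosing rectangular box
        proposal = resize_to_canonical(proposal)  # e.g. back to the canonical size
    return proposal
```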
  • the at least one processor may be further directed to abandon the pooling region proposal in response to a determination that at least one of the plurality of corners corresponds to the crop strategy of false.
  • the at least one processor may be further directed to determine one or more boundaries corresponding to the target object; determine an intersection-over-union (IoU) between each of the one or more boundaries and a ground truth; and determine one of the one or more boundaries with the greatest IoU as a target boundary corresponding to the target object.
  • the boundary of the target object may be a quadrilateral box.
  • an artificial intelligent image processing method may be implemented on a computing device.
  • the computing device may have at least one processor, at least one computer-readable storage medium, and a communication platform connected to a network.
  • the method may include obtaining an image including a target object, and generating a plurality of feature maps by inputting the image into a convolutional neural network (CNN) .
  • the method may also include determining a plurality of region proposals based on the plurality of feature maps, and determining a plurality of pooling region proposals based on the plurality of region proposals and the plurality of feature maps.
  • the method may further include classifying the plurality of pooling region proposals into one or more object categories or a background category via a classifier.
  • the one or more object categories may include a category of the target object, and the plurality of pooling region proposals may include one or more pooling region proposals corresponding to the target object.
  • Each of the one or more pooling region proposals may have a plurality of corners.
  • the method may include determining a plurality of crop strategies for each corner of the plurality of corners of the pooling region proposal according to a position of the corresponding corner; trimming the pooling region proposal by cropping each of the plurality of corners according to one of the plurality of crop strategies; identifying a boundary to the trimmed pooling region proposal based on the cropped plurality of corners; and mapping the boundary to the image to determine a boundary of the target object.
  • a non-transitory computer-readable storage medium may include at least one set of instructions for artificial intelligent object detection.
  • the at least one set of instructions may direct the at least one processor to perform acts of obtaining an image including a target object; generating a plurality of feature maps by inputting the image into a convolutional neural network (CNN) ; determining a plurality of region proposals based on the plurality of feature maps; determining a plurality of pooling region proposals based on the plurality of region proposals and the plurality of feature maps; classifying the plurality of pooling region proposals into one or more object categories or a background category via a classifier.
  • the one or more object categories may include a category of the target object, and the plurality of pooling region proposals may include one or more pooling region proposals corresponding to the target object.
  • Each of the one or more pooling region proposals may have a plurality of corners.
  • the at least one set of instructions may also direct the at least one processor to perform acts of determining a plurality of crop strategies for each corner of the plurality of corners of the pooling region proposal according to a position of the corresponding corner; trimming the pooling region proposal by cropping each of the plurality of corners according to one of the plurality of crop strategies; identifying a boundary to the trimmed pooling region proposal based on the cropped plurality of corners; and mapping the boundary to the image to determine a boundary of the target object.
  • an artificial intelligent image processing system for object detection.
  • the artificial intelligent image processing system may include an acquisition module configured to obtain an image including a target object; a feature map determination module configured to generate a plurality of feature maps by inputting the image into a convolutional neural network (CNN) ; a region proposal determination module configured to determine a plurality of region proposals based on the plurality of feature maps; a pooling region proposal determination module configured to determine a plurality of pooling region proposals based on the plurality of region proposals and the plurality of feature maps; a classification module configured to classify the plurality of pooling region proposals into one or more object categories or a background category via a classifier.
  • the one or more object categories may include a category of the target object, and the plurality of pooling region proposals may include one or more pooling region proposals corresponding to the target object.
  • Each of the one or more pooling region proposals may have a plurality of corners.
  • the artificial intelligent image processing system may also include a boundary determination module configured to determine a plurality of crop strategies for each corner of the plurality of corners of the pooling region proposal according to a position of the corresponding corner; the boundary determination module configured to trim the pooling region proposal by cropping each of the plurality of corners according to one of the plurality of crop strategies; the boundary determination module configured to identify a boundary to the trimmed pooling region proposal based on the cropped plurality of corners; and the boundary determination module configured to map the boundary to the image to determine a boundary of the target object.
  • FIG. 1 is a schematic diagram illustrating an exemplary AI image processing system according to some embodiments of the present disclosure
  • FIG. 2 is a schematic diagram illustrating exemplary hardware and/or software components of a computing device according to some embodiments of the present disclosure
  • FIG. 3 is a schematic diagram illustrating exemplary hardware and/or software components of a mobile device according to some embodiments of the present disclosure
  • FIG. 4 is a block diagram illustrating an exemplary AI processing device according to some embodiments of the present disclosure
  • FIG. 5 is a flowchart illustrating an exemplary process for determining a boundary of a target object according to some embodiments of the present disclosure
  • FIG. 6 is a schematic diagram illustrating an exemplary region proposal network according to some embodiments of the present disclosure.
  • FIG. 7 is a flowchart illustrating an exemplary process for determining a boundary of a target object according to some embodiments of the present disclosure
  • FIG. 8 is a schematic diagram illustrating an exemplary process for determining a boundary of a target object according to some embodiments of the present disclosure.
  • FIGs. 9A to 9C are schematic diagrams illustrating an image according to some embodiments of the present disclosure.
  • the flowcharts used in the present disclosure illustrate operations that systems implement according to some embodiments of the present disclosure. It is to be expressly understood that the operations of the flowcharts need not be implemented in order. Conversely, the operations may be implemented in inverted order, or simultaneously. Moreover, one or more other operations may be added to the flowcharts, and one or more operations may be removed from the flowcharts.
  • although the system and method in the present disclosure are described primarily regarding an on-demand transportation service, it should be understood that this is only one exemplary embodiment.
  • the system or method of the present disclosure may be applied to any other kind of on-demand service.
  • the system or method of the present disclosure may be applied to transportation systems of different environments including land, ocean, aerospace, or the like, or any combination thereof.
  • the vehicle of the transportation systems may include a taxi, a private car, a hitch, a bus, a train, a bullet train, a high-speed rail, a subway, a vessel, an aircraft, a spaceship, a hot-air balloon, a driverless vehicle, or the like, or any combination thereof.
  • the transportation system may also include any transportation system for management and/or distribution, for example, a system for sending and/or receiving express deliveries.
  • the application of the system or method of the present disclosure may include a web page, a plug-in of a browser, a client terminal, a custom system, an internal analysis system, an artificial intelligence robot, or the like, or any combination thereof.
  • the terms “passenger,” “requester,” “service requester,” and “customer” in the present disclosure are used interchangeably to refer to an individual, an entity, or a tool that may request or order a service.
  • the terms “driver,” “provider,” “service provider,” and “supplier” in the present disclosure are used interchangeably to refer to an individual, an entity, or a tool that may provide a service or facilitate the providing of the service.
  • the term “user” in the present disclosure may refer to an individual, an entity, or a tool that may request a service, order a service, provide a service, or facilitate the providing of the service.
  • the user may be a passenger, a driver, an operator, or the like, or any combination thereof.
  • “passenger” and “passenger terminal” may be used interchangeably
  • “driver” and “driver terminal” may be used interchangeably.
  • the terms “service request” and “order” in the present disclosure are used interchangeably to refer to a request that may be initiated by a passenger, a requester, a service requester, a customer, a driver, a provider, a service provider, a supplier, or the like, or any combination thereof.
  • the service request may be accepted by any one of a passenger, a requester, a service requester, a customer, a driver, a provider, a service provider, or a supplier.
  • the service request may be chargeable or free.
  • the positioning technology used in the present disclosure may be based on a global positioning system (GPS) , a global navigation satellite system (GLONASS) , a compass navigation system (COMPASS) , a Galileo positioning system, a quasi-zenith satellite system (QZSS) , a wireless fidelity (WiFi) positioning technology, or the like, or any combination thereof.
  • the present disclosure relates to artificial intelligence (AI) systems and methods for object detection in an image.
  • the AI systems and methods may determine a boundary for a target object in the image.
  • the determined boundary of the target object may be a quadrilateral box.
  • the AI systems and methods may input the image into a convolutional neural network (CNN) to generate a plurality of feature maps, and generate a plurality of region proposals based on the plurality of feature maps.
  • the AI systems and methods may determine a plurality of pooling region proposals based on the plurality of region proposals and the plurality of feature maps by performing an ROI pooling operation.
  • the AI systems and methods may further classify the plurality of pooling region proposals into one or more object categories or a background category via a classifier.
  • the plurality of pooling region proposals may include one or more pooling region proposals corresponding to the target object.
  • Each of the one or more pooling region proposals may have a plurality of corners.
  • the AI systems and methods may determine a plurality of crop strategies for each corner of the plurality of corners of the pooling region proposal according to a position of the corresponding corner.
  • the AI systems and methods may also trim the pooling region proposal by cropping each of the plurality of corners according to one of the plurality of crop strategies.
  • the AI systems and methods may crop a corner based on a cropping direction and a cropping length, which may be determined based on the pooling region proposal.
  • the AI systems and methods may identify a boundary to the trimmed pooling region proposal based on the cropped plurality of corners, and map the boundary to the image to determine the boundary of the target object.
  • by using information (e.g., positions of the corners) of the cropped pooling region proposals, the boundary of the target object determined according to the present disclosure may be more suitable for the target object, especially for a tilted target object (e.g., a safety belt, tilted characters), which may improve the accuracy of locating the target object.
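To make the overall flow concrete, here is a minimal Python sketch of the pipeline summarized above. Every callable passed in (cnn, rpn, roi_pooling, classifier, trim_corners, identify_boundary, map_to_image) is a hypothetical stand-in for a component the patent describes abstractly, not the claimed implementation; the corner-trimming step is what yields a quadrilateral rather than a rectangular boundary.

```python
def detect(image, cnn, rpn, roi_pooling, classifier,
           trim_corners, identify_boundary, map_to_image):
    feature_maps = cnn(image)                      # conv + pooling layers, no FC layer
    region_proposals = rpn(feature_maps)           # rectangular candidate regions
    pooled = roi_pooling(feature_maps, region_proposals)  # canonical-size proposals

    boundaries = []
    for proposal in pooled:
        if classifier(proposal) == "background":
            continue                               # background proposals are dropped
        trimmed = trim_corners(proposal)           # per-corner crop strategies
        if trimmed is None:                        # a corner predicted "false"
            continue
        boundary = identify_boundary(trimmed)      # quadrilateral, not a rectangle
        boundaries.append(map_to_image(boundary, image))
    return boundaries
```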
  • FIG. 1 is a schematic diagram illustrating an exemplary AI image processing system 100 according to some embodiments of the present disclosure.
  • the AI image processing system 100 may be configured for object detection. For example, the AI image processing system 100 may determine a boundary corresponding to an object in an image.
  • the AI image processing system 100 may be an online platform providing an Online to Offline (O2O) service.
  • the AI image processing system 100 may include a sensor 110, a network 120, a terminal 130, a server 140, and a storage device 150.
  • the sensor 110 may be configured to capture one or more images.
  • an image may be a still image, a video, a stream video, or a video frame obtained from a video.
  • the image may be a three-dimensional (3D) image or a two-dimensional (2D) image.
  • the sensor 110 may be or include one or more cameras.
  • the sensor 110 may be a digital camera, a video camera, a security camera, a web camera, a smartphone, a tablet, a laptop, a video gaming console equipped with a web camera, a camera with multiple lenses, a camcorder, etc.
  • the network 120 may facilitate the exchange of information and/or data.
  • one or more components of the AI image processing system 100 (e.g., the sensor 110, the terminal 130, the server 140, the storage device 150) may exchange information and/or data with each other via the network 120.
  • the server 140 may process an image obtained from the sensor 110 via the network 120.
  • the server 140 may obtain user instructions from the terminal 130 via the network 120.
  • the network 120 may be any type of wired or wireless network, or combination thereof.
  • the network 120 may include a cable network, a wireline network, an optical fiber network, a telecommunications network, an intranet, the Internet, a local area network (LAN), a wide area network (WAN), a wireless local area network (WLAN), a metropolitan area network (MAN), a public telephone switched network (PSTN), a Bluetooth™ network, a ZigBee™ network, a near field communication (NFC) network, or the like, or any combination thereof.
  • the network 120 may include one or more network access points.
  • the network 120 may include wired or wireless network access points such as base stations and/or internet exchange points 120-1, 120-2, ..., through which one or more components of the AI image processing system 100 may be connected to the network 120 to exchange data and/or information.
  • the terminal 130 may include a mobile device 130-1, a tablet computer 130-2, a laptop computer 130-3, or the like, or any combination thereof.
  • the mobile device 130-1 may include a smart home device, a wearable device, a smart mobile device, a virtual reality device, an augmented reality device, or the like, or any combination thereof.
  • the smart home device may include a smart lighting device, a control device of an intelligent electrical apparatus, a smart monitoring device, a smart television, a smart video camera, an interphone, or the like, or any combination thereof.
  • the wearable device may include a bracelet, footgear, eyeglasses, a helmet, a watch, clothing, a backpack, an accessory, or the like, or any combination thereof.
  • the smart mobile device may include a smartphone, a personal digital assistant (PDA) , a gaming device, a navigation device, a point of sale (POS) device, or the like, or any combination thereof.
  • the virtual reality device and/or the augmented reality device may include a virtual reality helmet, a virtual reality glass, a virtual reality patch, an augmented reality helmet, an augmented reality glass, an augmented reality patch, or the like, or any combination thereof.
  • the virtual reality device and/or the augmented reality device may include a Google Glass, an Oculus Rift, a HoloLens, a Gear VR, etc.
  • the terminal 130 may remotely operate the sensor 110.
  • the terminal 130 may operate the sensor 110 via a wireless connection. In some embodiments, the terminal 130 may receive information and/or instructions inputted by a user, and send the received information and/or instructions to the sensor 110 or to the server 140 via the network 120. In some embodiments, the terminal 130 may receive data and/or information from the server 140. In some embodiments, the terminal 130 may be part of the server 140. In some embodiments, the terminal 130 may be omitted.
  • the server 140 may be a single server or a server group.
  • the server group may be centralized, or distributed (e.g., the server 140 may be a distributed system) .
  • the server 140 may be local or remote.
  • the server 140 may access information and/or data stored in the sensor 110, terminal 130, and/or the storage device 150 via the network 120.
  • the server 140 may be directly connected to the sensor 110, the terminal 130, and/or the storage device 150 to access stored information and/or data.
  • the server 140 may be implemented on a cloud platform.
  • the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an inter-cloud, a multi-cloud, or the like, or any combination thereof.
  • the server 140 may be implemented on a computing device 200 having one or more components illustrated in FIG. 2 in the present disclosure.
  • the server 140 may include an AI processing device 142.
  • the AI processing device 142 may process information and/or data to perform one or more functions described in the present disclosure.
  • the AI processing device 142 may process an image including a target object to determine a boundary of the target object in the image.
  • the AI processing device 142 may include one or more processing devices (e.g., single-core processing device (s) or multi-core processor (s) ) .
  • the AI processing device 142 may include a central processing unit (CPU) , an application-specific integrated circuit (ASIC) , an application-specific instruction-set processor (ASIP) , a graphics processing unit (GPU) , a physics processing unit (PPU) , a digital signal processor (DSP) , a field-programmable gate array (FPGA) , a programmable logic device (PLD) , a controller, a microcontroller unit, a reduced instruction-set computer (RISC) , a microprocessor, or the like, or any combination thereof.
  • the storage device 150 may store data and/or instructions. In some embodiments, the storage device 150 may store data obtained from the terminal 130 and/or the server 140. In some embodiments, the storage device 150 may store data and/or instructions that the server 140 may execute or use to perform exemplary methods described in the present disclosure. In some embodiments, the storage device 150 may include a mass storage, removable storage, a volatile read-and-write memory, a read-only memory (ROM) , or the like, or any combination thereof. Exemplary mass storage may include a magnetic disk, an optical disk, a solid-state drive, etc. Exemplary removable storage may include a flash drive, a floppy disk, an optical disk, a memory card, a zip disk, a magnetic tape, etc.
  • Exemplary volatile read-and-write memory may include a random-access memory (RAM) .
  • RAM may include a dynamic RAM (DRAM), a double data rate synchronous dynamic RAM (DDR SDRAM), a static RAM (SRAM), a thyristor RAM (T-RAM), a zero-capacitor RAM (Z-RAM), etc.
  • Exemplary ROM may include a mask ROM (MROM), a programmable ROM (PROM), an erasable programmable ROM (EPROM), an electrically erasable programmable ROM (EEPROM), a compact disk ROM (CD-ROM), a digital versatile disk ROM, etc.
  • the storage device 150 may be implemented on a cloud platform.
  • the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an inter-cloud, a multi-cloud, or the like, or any combination thereof.
  • the storage device 150 may be connected to the network 120 to communicate with one or more components of the AI image processing system 100 (e.g., the sensor 110, the terminal 130, the server 140) .
  • One or more components in the AI image processing system 100 may access the data or instructions stored in the storage device 150 via the network 120.
  • the storage device 150 may be directly connected to or communicate with one or more components in the AI image processing system 100 (e.g., the sensor 110, the terminal 130, the server 140) .
  • the storage device 150 may be part of the sensor 110.
  • when an element or component of the AI image processing system 100 performs an operation, the element may perform the operation through electrical signals and/or electromagnetic signals.
  • a processor of the terminal 130 may generate an electrical signal encoding the request.
  • the processor of the terminal 130 may then transmit the electrical signal to an output port.
  • the output port may be physically connected to a cable, which further may transmit the electrical signal to an input port of the server 140.
  • the output port of the terminal 130 may be one or more antennas, which convert the electrical signal to an electromagnetic signal.
  • within an electronic device, such as the terminal 130 and/or the server 140, when the processor retrieves or saves data from a storage medium, it may transmit electrical signals to a read/write device of the storage medium, which may read or write structured data in the storage medium.
  • the structured data may be transmitted to the processor in the form of electrical signals via a bus of the electronic device.
  • an electrical signal may refer to one electrical signal, a series of electrical signals, and/or a plurality of discrete electrical signals.
  • FIG. 2 is a schematic diagram illustrating exemplary hardware and software components of a computing device 200 according to some embodiments of the present disclosure.
  • the terminal 130, and/or the server 140 may be implemented on the computing device 200.
  • the AI processing device 142 of the server 140 may be implemented on the computing device 200 and configured to perform functions of the AI processing device 142 disclosed in this disclosure.
  • the computing device 200 may be a special purpose computer, and may be used to implement the AI image processing system 100 of the present disclosure.
  • the computing device 200 may be used to implement any component of the AI image processing system 100 as described herein.
  • the AI processing device 142 may be implemented on the computing device, via its hardware, software program, firmware, or a combination thereof.
  • although only one such computer is shown for convenience, the computer functions relating to the image processing described herein may be implemented in a distributed fashion on several similar platforms to distribute the processing load.
  • the computing device 200 may include a COM port 250 connected to a network to implement data communications.
  • the computing device 200 may also include a processor 220, in the form of one or more processors (or CPUs) , for executing program instructions.
  • the exemplary computing device may include an internal communication bus 210, different types of program storage units and data storage units (e.g., a disk 270, a read only memory (ROM) 230, a random-access memory (RAM) 240) , various data files applicable to computer processing and/or communication.
  • the exemplary computing device may also include program instructions stored in the ROM 230, RAM 240, and/or other type of non-transitory storage medium to be executed by the processor 220.
  • the method and/or process of the present disclosure may be implemented as the program instructions.
  • the computing device 200 also includes an I/O device 260 that may support the input and/or output of data flows between the computing device 200 and other components.
  • the computing device 200 may also receive programs and data via the communication network.
  • the computing device 200 in the present disclosure may also include multiple CPUs and/or processors, thus operations and/or method steps that are performed by one CPU and/or processor as described in the present disclosure may also be jointly or separately performed by the multiple CPUs and/or processors.
  • for example, if in the present disclosure the CPU and/or processor of the computing device 200 executes both step A and step B, step A and step B may also be performed by two different CPUs and/or processors jointly or separately in the computing device 200 (e.g., the first processor executes step A and the second processor executes step B, or the first and second processors jointly execute steps A and B).
  • FIG. 3 is a schematic diagram illustrating exemplary hardware and/or software components of a mobile device 300 according to some embodiments of the present disclosure.
  • in some embodiments, a terminal (e.g., the terminal 130) may be implemented on the mobile device 300.
  • the mobile device 300 may include a communication platform 310, a display 320, a graphic processing unit (GPU) 330, a central processing unit (CPU) 340, an I/O 350, a memory 360, an operating system (OS) 370, and a storage 390.
  • any other suitable component including but not limited to a system bus or a controller (not shown) , may also be included in the mobile device 300.
  • a mobile operating system 370 (e.g., iOS™, Android™, Windows Phone™) and one or more applications 380 may be loaded into the memory 360 from the storage 390 in order to be executed by the CPU 340.
  • the applications 380 may include a browser or any other suitable mobile apps for receiving and rendering information relating to image processing or other information from the AI image processing system 100.
  • User interactions with the information stream may be achieved via the I/O 350 and provided to the storage device 150, the server 140, and/or other components of the AI image processing system 100.
  • computer hardware platforms may be used as the hardware platform (s) for one or more of the elements described herein.
  • a computer with user interface elements may be used to implement a personal computer (PC) or any other type of work station or terminal device.
  • a computer may also act as a system if appropriately programmed.
  • FIG. 4 is a block diagram illustrating an exemplary AI processing device 142 according to some embodiments of the present disclosure.
  • the AI processing device 142 may include an acquisition module 401, a feature map determination module 403, a region proposal determination module 405, a pooling region proposal determination module 407, a classification module 409, and a boundary determination module 411.
  • the modules may be hardware circuits of all or part of the AI processing device 142.
  • the modules may also be implemented as an application or set of instructions read and executed by the AI processing device 142. Further, the modules may be any combination of the hardware circuits and the application/instructions.
  • the modules may be the part of the AI processing device 142 when the AI processing device 142 is executing the application/set of instructions.
  • the acquisition module 401 may be configured to obtain information and/or data related to the AI image processing system 100.
  • the acquisition module 401 may obtain an image including a target object.
  • the image may be a still image or a video captured by the sensor 110.
  • the target object may refer to an object to be identified and/or detected in the image.
  • the target object may be an object tilted relative to the image (e.g., a safety belt, tilted characters) .
  • all objects in the image may need to be identified and/or detected, and each object in the image may be referred to as a target object.
  • the acquisition module 401 may obtain the image from one or more components of the AI image processing system 100, such as the sensor 110, the terminal 130, a storage (e.g., the storage device 150) , or from an external source (e.g., ImageNet) via the network 120.
  • the feature map determination module 403 may be configured to generate a plurality of feature maps by inputting an image (e.g., the image obtained by the acquisition module 401) into a convolutional neural network (CNN) .
  • the CNN may be a trained CNN including one or more convolution layers and one or more pooling layers, without a fully connected layer.
  • the convolution layer (s) may be configured to extract features (or feature maps) of an image.
  • the pooling layers may be configured to reduce the size of the feature maps of the image.
  • the feature maps may include feature information of the image.
  • the region proposal determination module 405 may be configured to determine a plurality of region proposals based on the plurality of feature maps. In some embodiments, the region proposal determination module 405 may determine the plurality of region proposals according to a region proposal network (RPN) . Specifically, the region proposal determination module 405 may slide a sliding window over the plurality of feature maps. With the sliding of the sliding window over the plurality of feature maps, a plurality of sliding-window locations may be determined. At each sliding-window location, a set of preliminary region proposals may be determined. Since there is a plurality of sliding-window locations, a plurality of preliminary region proposals may be determined at the plurality of sliding-window locations.
  • multiple preliminary region proposals may highly overlap with each other, and the region proposal determination module 405 may select a portion of the plurality of preliminary region proposals as the plurality of region proposals.
  • the region proposal determination module 405 may determine the plurality of region proposals using a non-maximum suppression (NMS). Details regarding the determination of the region proposals may be found elsewhere in the present disclosure (e.g., operation 530 of the process 500 and the descriptions thereof).
  • the pooling region proposal determination module 407 may be configured to determine a plurality of pooling region proposals based on the plurality of region proposals and the plurality of feature maps. In some embodiments, the pooling region proposal determination module 407 may map the plurality of region proposals to the plurality of feature maps to determine a plurality of proposal feature maps (also referred to as regions of interest (ROIs) ) . The pooling region proposal determination module 407 may then determine the plurality of pooling region proposals by performing pooling on the plurality of proposal feature maps (or ROIs) .
  • the classification module 409 may be configured to classify the plurality of pooling region proposals into one or more object categories or a background category via the classifier. In some embodiments, the classification module 409 may classify negative samples in the pooling region proposals into the background category. If a pooling region proposal is classified into the background category, the pooling region proposal may be omitted from further processing. In some embodiments, the classification module 409 may classify a pooling region proposal corresponding to a positive sample into one of the one or more object categories.
  • the one or more object categories may be default settings of the image processing system 100, and/or may be adjusted by a user.
  • the one or more object categories may include a category of the target object.
  • the classification module 409 may select one or more pooling region proposals corresponding to the target object from the plurality of pooling region proposals.
  • the boundary determination module 411 may be configured to determine a boundary of the target object in the image based on at least one of the one or more pooling region proposals.
  • the boundary may be a polygonal box, for example, a quadrilateral box.
  • the boundary determination module 411 may determine a plurality of crop strategies for each corner according to a position of the corresponding corner.
  • the boundary determination module 411 may trim the pooling region proposal by cropping each of the plurality of corners according to one of the plurality of crop strategies.
  • the boundary determination module 411 may identify a boundary to the trimmed pooling region proposal based on the cropped plurality of corners, and map the boundary to the image to determine a boundary of the target object. Details regarding the determination of a boundary may be found elsewhere in the present disclosure (e.g., operation 560 of the process 500, process 700, and the descriptions thereof) .
  • the boundary determination module 411 may determine one or more boundaries corresponding to the target object based on the one or more pooling region proposals.
  • the boundary determination module 411 may determine an IoU between each of the one or more boundaries and a ground truth.
  • the ground truth may indicate a labelled boundary box of the target object.
  • the IoU between a boundary and the ground truth may reflect a degree of overlapping of the boundary and the ground truth.
  • the boundary determination module 411 may compare one or more determined IoUs related to the one or more boundaries, and determine a boundary with the greatest IoU as a target boundary corresponding to the target object.
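As a concrete reference, below is a minimal Python sketch of axis-aligned IoU and the greatest-IoU selection described above. Note that this simplifies the patent's setting: the determined boundaries may be general quadrilaterals, for which the intersection computation is more involved; all names are illustrative.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def select_target_boundary(boundaries, ground_truth):
    """Return the candidate boundary with the greatest IoU against the ground truth."""
    return max(boundaries, key=lambda b: iou(b, ground_truth))

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 = 0.142857...
```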
  • the modules in the AI processing device 142 may be connected to or communicate with each other via a wired connection or a wireless connection.
  • the wired connection may include a metal cable, an optical cable, a hybrid cable, or the like, or any combination thereof.
  • the wireless connection may include a Local Area Network (LAN) , a Wide Area Network (WAN) , a Bluetooth, a ZigBee, a Near Field Communication (NFC) , or the like, or any combination thereof.
  • the AI processing device 142 may further include one or more additional modules.
  • the AI processing device 142 may further include a storage module (not shown in FIG. 4) configured to store data generated by the modules of the AI processing device 142.
  • FIG. 5 is a flowchart illustrating an exemplary process 500 for determining a boundary of a target object according to some embodiments of the present disclosure.
  • for illustration purposes, the AI processing device 142 is described as the subject performing the process 500.
  • the process 500 may also be performed by other entities.
  • at least a portion of the process 500 may also be implemented on the computing device 200 as illustrated in FIG. 2 or the mobile device 300 as illustrated in FIG. 3.
  • one or more operations of process 500 may be implemented in the AI image processing system 100 as illustrated in FIG. 1.
  • one or more operations in the process 500 may be stored in the storage device 150 and/or the storage (e.g., the ROM 230, the RAM 240, etc. ) as a form of instructions, and invoked and/or executed by the server 140 (e.g., the AI processing device 142 in the server 140, or the processor 220 of the AI processing device 142 in the server 140) .
  • the instructions may be transmitted in a form of electronic current or electrical signals.
  • the AI processing device 142 may obtain an image including a target object.
  • the image may be an image captured by the sensor 110 (e.g., a camera of a smartphone, a camera in an autonomous vehicle, an intelligent security camera, a traffic camera) .
  • the captured image may be a still image, a video, etc.
  • the image may include multiple objects, such as people, animals (e.g., dog, cat) , vehicles (e.g., bike, car, bus, truck) , plants (e.g., flower, tree) , buildings, scenery, or the like, or any combination thereof.
  • the image may include an object tilted relative to the image, such as a safety belt, tilted characters, etc.
  • the target object may refer to an object to be identified and/or detected in the image.
  • the target object may be an object tilted relative to the image (e.g., the safety belt, the tilted characters) .
  • all objects in the image may need to be identified and/or detected, and each object in the image may be referred to as a target object.
  • the AI processing device 142 may obtain the image from one or more components of the AI image processing system 100, such as the sensor 110, the terminal 130, a storage device (e.g., the storage device 150) .
  • the AI processing device 142 may obtain the image from an external source via the network 120.
  • the AI processing device 142 may obtain the image from ImageNet, etc.
  • the AI processing device 142 may generate a plurality of feature maps by inputting the image into a convolutional neural network (CNN) .
  • the plurality of feature maps may include feature information of the image.
  • the CNN may be generated based on a Zeiler and Fergus model (ZF), VGG-16, ResNet-50, etc.
  • the CNN may be a trained CNN including one or more convolution layers and one or more pooling layers, without a fully connected layer.
  • the convolution layer (s) may be configured to extract features (or feature maps) of an image (e.g., the image obtained in 510) .
  • the pooling layers may be configured to reduce the size of the feature maps of the image.
  • the image may be inputted into the CNN, and a plurality of feature maps may be generated.
  • the CNN may be determined based on a ZF model.
  • An image with the size of 600*1000 may be inputted into the ZF model, and 256 feature maps may be outputted from the ZF model.
  • the size of each of the 256 feature maps may be 40*60.
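The shape arithmetic in the ZF example above can be checked with a toy stand-in backbone. The sketch below is not the actual ZF network, just a minimal PyTorch stack (an assumption of this example) with an overall stride of 16, which is what turns a 600*1000 input into 256 feature maps of roughly 40*60.

```python
import torch
import torch.nn as nn

# Minimal stand-in backbone (not the actual ZF network): convolution and
# pooling layers only, no fully connected layer, overall stride 16.
backbone = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),     # /2
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # /4
    nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # /8
    nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # /16
)

x = torch.randn(1, 3, 600, 1000)  # one 600*1000 RGB image
print(backbone(x).shape)          # torch.Size([1, 256, 37, 62]), i.e. 256 maps of roughly 40*60
```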
  • the CNN may be generated according to transfer learning. Transfer learning may be capable of reducing the training time by using previously obtained knowledge.
  • a base network may be a pre-trained network trained previously based on a plurality of first training samples obtained from a dataset (e.g., ImageNet dataset, PASCAL VOC, COCO, etc. ) .
  • the base network may include one or more layers (e.g., convolution layer(s), pooling layer(s)) and a plurality of pre-trained weights. At least some of the one or more layers and their corresponding pre-trained weights may be transferred to a target network.
  • the base network may be VGG-16, including thirteen convolution layers, four pooling layers, and three fully connected layers.
  • the thirteen convolution layers and the four pooling layers may be transferred to a target network (e.g., the CNN) .
  • the pre-trained weights of the convolution layers and/or the pooling layers may not need to be adjusted, or may be fine-tuned based on a plurality of second training samples obtained from a dataset (e.g., ImageNet dataset, PASCAL VOC, COCO, etc. ) .
  • the target network may further include one or more additional layers other than the transferred layers.
  • the weights in the additional layer (s) may be updated according to a plurality of third training samples obtained from a dataset (e.g., ImageNet dataset, PASCAL VOC, COCO, etc. ) .
  • the CNN may be directly generated by training a preliminary CNN using a plurality of fourth training samples obtained from a dataset (e.g., ImageNet dataset, PASCAL VOC, COCO, etc. ) .
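A sketch of the transfer-learning setup described above, assuming PyTorch/torchvision (not named in the source): reuse VGG-16's pretrained convolution and pooling stack, drop its final pooling layer so that the thirteen convolution layers and four pooling layers described above remain, and discard the fully connected layers. Whether to freeze or fine-tune the transferred weights is the choice discussed above.

```python
import torch.nn as nn
from torchvision import models

# Load VGG-16 pretrained on ImageNet (recent torchvision weights API assumed).
vgg16 = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)

# Dropping the final pooling layer of torchvision's VGG-16 feature stack leaves
# the thirteen convolution layers and four pooling layers to transfer; the three
# fully connected layers (vgg16.classifier) are discarded entirely.
feature_extractor = nn.Sequential(*list(vgg16.features.children())[:-1])

# Either keep the pre-trained weights fixed, as below, or leave requires_grad
# True and fine-tune them on further training samples.
for param in feature_extractor.parameters():
    param.requires_grad = False
```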
  • the AI processing device 142 may determine a plurality of region proposals based on the plurality of feature maps. In some embodiments, the AI processing device 142 may determine the plurality of region proposals according to a region proposal network (RPN) . As shown in FIG. 6, the RPN may include at least one regression layer and at least one classification layer.
  • the AI processing device 142 may slide a sliding window over the plurality of feature maps.
  • the sliding window may also be referred to as a convolution kernel that has a size of, for example, 3*3, 5*5, etc.
  • a plurality of sliding-window locations may be determined.
  • the size of the sliding window may be 3*3, and the size of the plurality of feature maps may be 40*60.
  • roughly 40*60 (i.e., 2400) sliding-window locations may be determined.
  • the sliding window may coincide with a sub-region of the plurality of feature maps.
  • the AI processing device 142 may map the sub-region of the plurality of feature maps to a multi-dimensional feature vector. For example, if there are 256 feature maps, a 256-dimensional feature vector may be generated at the sub-region.
  • the AI processing device 142 may generate an anchor by mapping a center pixel of the sub-region to a pixel of the image obtained in 510.
  • the anchor may correspond to a set of anchor boxes (e.g., including k anchor boxes) in the image. Each of the set of anchor boxes may be a rectangular box.
  • the anchor may be a center point of the set of anchor boxes.
  • Each of the set of anchor boxes may be associated with a scale and an aspect ratio.
  • for example, with 3 scales (e.g., 128, 256, 512) and 3 aspect ratios (e.g., 1:1, 1:2, 2:1), the number of the set of anchor boxes may be 9 (3*3), as illustrated in the sketch below.
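A small Python sketch of how one anchor point expands into k = 9 anchor boxes under the 3 scales and 3 aspect ratios above. The area-preserving width/height formulas are a common convention (as in Faster R-CNN), assumed here rather than quoted from the patent.

```python
def anchor_boxes(cx, cy, scales=(128, 256, 512), ratios=(1.0, 0.5, 2.0)):
    """Return (x1, y1, x2, y2) anchor boxes centered at the anchor point (cx, cy).

    ratio is interpreted as height/width; each box keeps area = scale**2.
    """
    boxes = []
    for scale in scales:
        for ratio in ratios:
            w = scale / ratio ** 0.5
            h = scale * ratio ** 0.5
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes

print(len(anchor_boxes(300, 200)))  # 9 anchor boxes: 3 scales * 3 aspect ratios
```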
  • the AI processing device 142 may feed the multi-dimensional feature vector and/or the set of anchor boxes into the at least one regression layer and the at least one classification layer, respectively.
  • the at least one regression layer may be configured to conduct bounding-box regression to determine a set of preliminary region proposals corresponding to the set of anchor boxes.
  • the output of the at least one regression layer may include four coordinate values of each of the set of preliminary region proposals.
  • the four coordinate values of a preliminary region proposal may include a location of the preliminary region proposal (e.g., coordinates (x, y) of the anchor of the corresponding anchor box) and a size of the preliminary region proposal (e.g., a width w and a height h of the preliminary region proposal) .
  • the at least one classification layer may be configured to determine a category for each of the set of preliminary region proposals.
  • the category may be a foreground or a background.
  • the output of the at least one classification layer may include a first score of being foreground and a second score of being background of each of the set of preliminary region proposals.
  • at each sliding-window location, a set of (e.g., 9) preliminary region proposals may be determined. Since there are a plurality of sliding-window locations (e.g., roughly 2400), a plurality of (e.g., roughly 20000) preliminary region proposals may be determined across the sliding-window locations. In some embodiments, multiple preliminary region proposals may highly overlap with each other.
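For illustration, a minimal PyTorch sketch of such regression and classification layers. Realizing the sliding window as a 3*3 convolution and the two heads as 1*1 convolutions follows common RPN practice and is an assumption of this sketch, not a disclosed implementation.

    import torch
    import torch.nn as nn

    class RPNHead(nn.Module):
        """Sliding 3*3 window over the feature maps, followed by a
        regression layer (4 coordinate values per anchor box) and a
        classification layer (2 scores per anchor box)."""
        def __init__(self, in_channels=256, k=9):
            super().__init__()
            self.sliding = nn.Conv2d(in_channels, 256, kernel_size=3, padding=1)
            self.reg = nn.Conv2d(256, 4 * k, kernel_size=1)  # x, y, w, h
            self.cls = nn.Conv2d(256, 2 * k, kernel_size=1)  # fg/bg scores

        def forward(self, feature_maps):
            x = torch.relu(self.sliding(feature_maps))
            return self.reg(x), self.cls(x)

    # 256 feature maps of size 40*60 -> roughly 2400 * 9 = 21600 proposals.
    reg_out, cls_out = RPNHead()(torch.randn(1, 256, 40, 60))
    print(reg_out.shape, cls_out.shape)
    # torch.Size([1, 36, 40, 60]) torch.Size([1, 18, 40, 60])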
  • the AI processing device 142 may select a portion of the plurality of preliminary region proposals as a plurality of region proposals. In some embodiments, the AI processing device 142 may select the plurality of region proposals using a non-maximum suppression (NMS) .
  • the AI processing device 142 may determine the plurality of region proposals based on the first score of being foreground and the second score of being background of each of the plurality of preliminary region proposals and four coordinate values of each of the plurality of preliminary region proposals.
  • the AI processing device 142 may determine an intersection-over-union (IoU) between each of the plurality of preliminary region proposals and a ground truth.
  • the ground truth may be a labelled boundary box of the target object.
  • the AI processing device 142 may determine preliminary region proposals that have an IoU greater than 0.7 as positive samples, and determine preliminary region proposals that have an IoU less than 0.3 as negative samples.
  • the AI processing device 142 may remove preliminary region proposals other than the positive samples and the negative samples.
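For illustration, a minimal Python sketch of the IoU computation used for the 0.7/0.3 sampling above, assuming axis-aligned (x1, y1, x2, y2) boxes.

    def iou(box_a, box_b):
        """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
        ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / (area_a + area_b - inter)

    # IoU > 0.7 against the ground truth -> positive sample;
    # IoU < 0.3 -> negative sample; the rest are removed.
    print(round(iou((0, 0, 10, 10), (5, 5, 15, 15)), 3))  # 0.143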
  • the AI processing device 142 may select the plurality of region proposals from the positive samples and the negative samples. In some embodiments, the AI processing device 142 may rank the positive samples based on the first score of being foreground of each of the positive samples, and select multiple positive samples based on the ranked positive samples. The AI processing device 142 may rank the negative samples based on the second score of being background of each of the negative samples, and select multiple negative samples based on the ranked negative samples. The selected positive samples and the selected negative samples may constitute the plurality of region proposals. In some embodiments, the AI processing device 142 may select 300 region proposals. The number of the selected positive samples may be the same as or different from that of the selected negative samples. In some embodiments, before selecting the region proposals using the non-maximum suppression (NMS) , the AI processing device 142 may first remove preliminary region proposals that cross boundaries of the image (also referred to as cross-boundary preliminary region proposals) .
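For illustration, a minimal greedy NMS sketch reusing the iou() helper above. The overlap threshold is an assumption of this sketch; the top-300 cap follows the 300 region proposals mentioned above.

    def nms(proposals, scores, iou_threshold=0.7, top_n=300):
        """Greedy non-maximum suppression: keep the highest-scoring
        proposal, drop remaining proposals that overlap it beyond the
        threshold, and repeat; return indices of up to top_n kept."""
        order = sorted(range(len(proposals)),
                       key=lambda i: scores[i], reverse=True)
        keep = []
        while order and len(keep) < top_n:
            best = order.pop(0)
            keep.append(best)
            order = [i for i in order
                     if iou(proposals[best], proposals[i]) < iou_threshold]
        return keep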
  • the AI processing device 142 may determine a plurality of pooling region proposals based on the plurality of region proposals and the plurality of feature maps.
  • the AI processing device 142 may map the plurality of region proposals to the plurality of feature maps to determine a plurality of proposal feature maps (also referred to as regions of interest (ROIs) ) .
  • the plurality of proposal feature maps (or ROIs) may be inputted into a classifier for further processing.
  • the classifier may only accept proposal feature map (s) with a canonical size (e.g., 7*7) .
  • the AI processing device 142 may resize the plurality of proposal feature maps to the canonical size.
  • the AI processing device 142 may determine the plurality of pooling region proposals by performing pooling on the plurality of proposal feature maps (or ROIs) .
  • the pooling may include a max pooling, a mean pooling, or the like.
  • the plurality of pooling region proposals may correspond to the canonical size (e.g., 7*7) and may be inputted into the classifier for further processing.
  • a pooling region proposal may be determined as a fixed-length vector, which will be sent into a full connection layer of the classifier.
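For illustration, a minimal NumPy sketch of max pooling one ROI of a single feature map to the canonical 7*7 size; the grid-partitioning scheme is an assumption of this sketch.

    import numpy as np

    def roi_max_pool(feature_map, roi, size=7):
        """Resize one ROI to a canonical size*size grid by max pooling
        each grid cell. roi = (x1, y1, x2, y2) in feature-map coordinates."""
        x1, y1, x2, y2 = roi
        xs = np.linspace(x1, x2, size + 1).astype(int)
        ys = np.linspace(y1, y2, size + 1).astype(int)
        pooled = np.zeros((size, size), dtype=feature_map.dtype)
        for i in range(size):
            for j in range(size):
                cell = feature_map[ys[i]:max(ys[i + 1], ys[i] + 1),
                                   xs[j]:max(xs[j + 1], xs[j] + 1)]
                pooled[i, j] = cell.max()
        return pooled

    # A 40*60 feature map and one ROI -> a 7*7 pooling region proposal.
    print(roi_max_pool(np.random.rand(40, 60), (5, 5, 45, 30)).shape)  # (7, 7)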
  • the AI processing device 142 may classify the plurality of pooling region proposals into one or more object categories or a background category via the classifier.
  • the classifier may include a support vector machine (SVM) classifier, a Bayes classifier, a decision tree classifier, a softmax classifier, or the like, or any combination thereof.
  • one or more pooling region proposals may be classified into the background category.
  • the region proposals may include multiple positive samples and multiple negative samples.
  • the pooling region proposals may correspond to multiple positive samples and multiple negative samples.
  • the multiple negative samples in the pooling region proposals may be classified into the background category. If a pooling region proposal is classified into the background category, the pooling region proposal may be omitted and not processed further.
  • a pooling region proposal corresponding to a positive sample may be classified into one of the one or more object categories.
  • the one or more object categories may be default settings of the AI image processing system 100, and/or may be adjusted by a user.
  • the one or more object categories may include a category of the target object.
  • the plurality of pooling region proposals may include one or more pooling region proposals corresponding to the target object.
  • the AI processing device 142 may select the one or more pooling region proposals corresponding to the target object from the plurality of pooling region proposals.
  • the AI processing device 142 may determine a target boundary of the target object in the image based on at least one of the one or more pooling region proposals.
  • the target boundary may be a polygonal box, for example, a quadrilateral box.
  • each of the one or more pooling region proposals may have a plurality of corners (e.g., 4 corners, 5 corners, 8 corners, etc. ) .
  • the AI processing device 142 may determine a plurality of crop strategies for each corner of the plurality of corners according to a position of the corresponding corner.
  • the AI processing device 142 may determine five crop strategies for each of the plurality of corners.
  • the AI processing device 142 may determine one of the plurality of (e.g., five) crop strategies as a desired crop strategy of the corner based on the pooling region proposal.
  • the AI processing device 142 may determine a cropping direction and a cropping length for each corner based on the pooling region proposal.
  • the cropping direction of a corner may be limited to one of the plurality of crop strategies of the corner.
  • the AI processing device 142 may trim the pooling region proposal by cropping each of the plurality of corners according to the desired crop strategy, for example, based on the cropping direction and the cropping length.
  • the plurality of crop strategies of each corner may include a crop strategy of false and/or a crop strategy of target position. The crop strategy of false of a corner may indicate that the corner may correspond to a point inside the target object.
  • the crop strategy of target position of a corner may indicate that the corner may correspond to a boundary point of the target object. If the cropping direction of a corner corresponds to a crop strategy of false, the AI processing device 142 may stop cropping the corner and abandon the pooling region proposal. If the cropping direction of a corner corresponds to a crop strategy of target position, the AI processing device 142 may stop cropping the corner. When the cropping direction of each corner corresponds to a crop strategy of target position, the AI processing device 142 may identify a boundary to the trimmed pooling region proposal based on the cropped plurality of corners, and map the boundary to the image to determine a boundary of the target object. Details regarding the determination of a boundary may be found elsewhere in the present disclosure (e.g., FIG. 7 and the descriptions thereof) .
  • the AI processing device 142 may determine one or more boundaries corresponding to the target object based on the one or more pooling region proposals. Each of the one or more boundaries may be determined according to process 700, and the descriptions thereof are not repeated herein.
  • the AI processing device 142 may determine an IoU between each of the one or more boundaries and a ground truth.
  • the ground truth may indicate a labelled boundary box of the target object.
  • the IoU between a boundary and the ground truth may reflect a degree of overlapping of the boundary and the ground truth.
  • the AI processing device 142 may compare one or more determined IoUs related to the one or more boundaries, and determine a boundary with the greatest IoU as a target boundary corresponding to the target object.
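For illustration, a one-function sketch of this selection, reusing the iou() helper above. It assumes axis-aligned boxes for simplicity; since the disclosed boundaries may be quadrilaterals, a full implementation would need a polygon-intersection IoU.

    def select_target_boundary(boundaries, ground_truth):
        """Return the boundary with the greatest IoU against the
        labelled ground-truth box."""
        return max(boundaries, key=lambda b: iou(b, ground_truth))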
  • each corner of the pooling region proposal may be cropped according to one of its crop strategies, which takes into account information related to the corresponding corner.
  • the cropping direction and/or the cropping length of each corner of the pooling region proposal may be determined based on the pooling region proposal, which takes into account features in the pooling region proposal.
  • a boundary of the target object determined according to the process disclosed in the present disclosure may be more suitable for the target object, especially, for a tilted target object, which may improve the accuracy of detecting and/or locating the target object.
  • one or more boundaries may be determined.
  • a boundary with the greatest IoU among the one or more boundaries may be determined as the target boundary. That is, a boundary having the greatest degree of overlapping with the ground truth may be determined as the target boundary, which may further improve the accuracy of detecting and/or locating the target object.
  • alternatively, the AI processing device 142 may not need to determine a plurality of boundaries corresponding to the target object.
  • in that case, once a boundary is determined, the AI processing device 142 may determine the boundary as the target boundary, and operation 560 may be terminated.
  • the boundaries of one or more target objects (e.g., all objects) in the image may be determined simultaneously.
  • process 500 may be repeated to determine boundaries of target objects in a plurality of different images.
  • FIG. 6 is a schematic diagram illustrating an exemplary region proposal network (RPN) according to some embodiments of the present disclosure.
  • the RPN introduces a sliding window.
  • the sliding window is configured to slide over a plurality of feature maps.
  • a sliding window coincides with a sub-region of the plurality of feature maps at a certain sliding-window location.
  • the sliding window has a size of 3*3.
  • the sub-region is mapped to a multi-dimensional feature vector, e.g., a 256-dimensional (256-d) feature vector shown in an intermediate layer.
  • a center pixel O of the sub-region is mapped to a pixel of an image to generate an anchor O’ .
  • a set of anchor boxes (e.g., k anchor boxes) are determined based on the anchor O’ .
  • Each of the set of anchor boxes is a rectangular box, and the anchor O’ is a center point of the set of anchor boxes.
  • the RPN includes a regression layer (denoted as reg layer) and a classification layer (denoted as cls layer) .
  • the regression layer may be configured to conduct bounding-box regression to determine a preliminary region proposal corresponding to an anchor box.
  • the classification layer may be configured to determine a category for the preliminary region proposal.
  • the multi-dimensional feature vector (i.e., the 256-d feature vector) and the set of anchor boxes (i.e., k anchor boxes) are fed into the regression layer and the classification layer, respectively.
  • the output of the regression layer includes four coordinate values (also referred to as four coordinates) of each of the set of preliminary region proposals.
  • the four coordinate values of a preliminary region proposal may include a location of the preliminary region proposal (e.g., coordinates (x, y) of the anchor of the corresponding anchor box) and a size of the preliminary region proposal (e.g., a width w and a height h of the preliminary region proposal) .
  • the output of the classification layer includes two scores of each of the set of preliminary region proposals, including a first score of being foreground and a second score of being background.
  • a set of preliminary region proposals are determined at the certain sliding-window location.
  • a plurality of preliminary region proposals may be determined at a plurality of sliding-window locations.
  • the RPN may select a portion of the plurality of preliminary region proposals as region proposals for further processing. More descriptions regarding the selection of the region proposals may be found elsewhere in the present disclosure (e.g., operation 530 of process 500 and the relevant descriptions thereof) .
  • FIG. 7 is a flowchart illustrating an exemplary process 700 for determining a boundary of a target object based on a pooling region proposal according to some embodiments of the present disclosure.
  • the AI processing device 142 is described below as the subject that performs the process 700.
  • the process 700 may also be performed by other entities.
  • at least a portion of the process 700 may also be implemented on the computing device 200 as illustrated in FIG. 2 or the mobile device 300 as illustrated in FIG. 3.
  • one or more operations of process 700 may be implemented in the AI image processing system 100 as illustrated in FIG. 1.
  • one or more operations in the process 700 may be stored in the storage device 150 and/or the storage (e.g., the ROM 230, the RAM 240, etc. ) as a form of instructions, and invoked and/or executed by the server 140 (e.g., the AI processing device 142 in the server 140, or the processor 220 of the AI processing device 142 in the server 140) .
  • the instructions may be transmitted in a form of electronic current or electrical signals.
  • a portion of operation 560 of the process 500 may be performed according to process 700.
  • a pooling region proposal may have a plurality of corners.
  • the AI processing device 142 (e.g., the boundary determination module 411) may determine a plurality of crop strategies for each of the plurality of corners of the pooling region proposal according to a position of each of the plurality of corners.
  • the position of a corner may refer to a position of the corner relative to positions of the other corners.
  • the pooling region proposal may be a rectangular box and include four corners.
  • the four corners may include a top left corner, a top right corner, a bottom left corner, and a bottom right corner.
  • the AI processing device 142 may determine a plurality of crop strategies for each of the four corners based on the position of each of the four corners. Specifically, the AI processing device 142 may determine a plurality of crop strategies for the top left corner.
  • the plurality of crop strategies of the top left corner may include cropping to right, cropping to bottom, cropping to bottom right, target position, false, or the like, or any combination thereof.
  • the AI processing device 142 may determine a plurality of crop strategies for the top right corner.
  • the plurality of crop strategies of the top right corner may include cropping to left, cropping to bottom, cropping to bottom left, target position, false, or the like, or any combination thereof.
  • the AI processing device 142 may determine a plurality of crop strategies for the bottom left corner.
  • the plurality of crop strategies of the bottom left corner may include cropping to right, cropping to top, cropping to top right, target position, or false.
  • the AI processing device 142 may determine a plurality of crop strategies for the bottom right corner.
  • the plurality of crop strategies of the bottom right corner may include cropping to left, cropping to top, cropping to top left, target position, false, or the like, or any combination thereof.
  • the plurality of crop strategies of each corner may include a crop strategy of false and/or a crop strategy of target position.
  • the crop strategy of false of a corner may indicate that the corner may correspond to a point inside the target object.
  • the crop strategy of target position of a corner may indicate that the corner may correspond to a boundary point of the target object. It should be noted that the crop strategies of each corner and the number of corners are merely provided for illustration purposes, and are not intended to limit the scope of the present disclosure.
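For illustration, one way to encode the per-corner crop strategies above in Python, where "T" stands for target position and "F" for false. The string encoding is an assumption of this sketch.

    # Allowed crop strategies per corner. "T" = target position (the corner
    # lies on the object boundary); "F" = false (the corner lies inside the
    # target object, so the proposal does not encompass the whole object).
    CROP_STRATEGIES = {
        "top_left":     ["right", "bottom", "bottom_right", "T", "F"],
        "top_right":    ["left",  "bottom", "bottom_left",  "T", "F"],
        "bottom_left":  ["right", "top",    "top_right",    "T", "F"],
        "bottom_right": ["left",  "top",    "top_left",     "T", "F"],
    }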
  • the AI processing device 142 may determine a crop strategy for each of the plurality of corners from the plurality of crop strategies based on the pooling region proposal.
  • the AI processing device 142 may determine a cropping direction and a cropping length for each corner based on the pooling region proposal.
  • the AI processing device 142 may analyze features of pixels (e.g., pixels representing target object, pixels representing background) in the pooling region proposal, and determine the cropping direction and/or the cropping length based on the analysis result.
  • the cropping direction may be limited to one of the plurality of crop strategies.
  • the cropping length may be a length of several pixels, for example, 0 to 10 pixels.
  • the AI processing device 142 may determine whether one of the plurality of corners corresponds to a crop strategy of false. In some embodiments, if the determined cropping direction of a corner corresponds to the crop strategy of false, the corner may correspond to a point inside the target object. That is, the pooling region proposal does not encompass the whole target object. If the determined cropping direction of a corner corresponds to the crop strategy of target position, the corner may correspond to a boundary point of the target object. Otherwise, if the determined cropping direction of a corner corresponds to a crop strategy other than the crop strategy of false and the crop strategy of target position, the corner may correspond to a point at some distance from the target object.
  • in response to a determination that at least one of the plurality of corners corresponds to the crop strategy of false, the AI processing device 142 may proceed to operation 740. In response to a determination that none of the plurality of corners corresponds to the crop strategy of false, the AI processing device 142 may proceed to operation 750.
  • the AI processing device 142 may abandon the pooling region proposal. Since the pooling region proposal does not encompass the whole target object, a boundary of the target object cannot be determined based on the pooling region proposal. Accordingly, the AI processing device 142 may abandon the pooling region proposal.
  • the AI processing device 142 may determine whether each of the plurality of corners corresponds to a crop strategy of target position. In response to a determination that at least one of the plurality of corners does not correspond to the crop strategy of target position, the AI processing device 142 may proceed to operation 760.
  • the AI processing device 142 may trim the pooling region proposal by cropping the at least one corner according to the determined crop strategy of the at least one corner. That is, if a corner corresponds to neither the crop strategy of target position nor the crop strategy of false, the AI processing device 142 may crop the corner based on the crop strategy of the corner determined in 720. When the at least one corner is cropped according to its crop strategy, a trimmed pooling region proposal may be determined.
  • the AI processing device 142 may crop the top right corner towards left to update the position of the top right corner.
  • the AI processing device 142 may crop the top left corner towards right to update the position of the top left corner.
  • the AI processing device 142 may crop the bottom left corner towards top right to update the position of the bottom left corner.
  • the AI processing device 142 may perform a bounding mapping based on the cropped plurality of corners to determine a rectangular box.
  • the pooling region proposal may be a rectangular box and include four corners. Due to different crop strategies applied for different corners, the trimmed pooling region proposal may be a quadrilateral box rather than a rectangular box. In some embodiments, the crop strategies described above can be used only when the (trimmed) pooling region proposal is a rectangular box.
  • the AI processing device 142 may perform the bounding mapping on the trimmed pooling region proposal. Specifically, the AI processing device 142 may determine two diagonal lines based on the four corners, and determine the longer diagonal line as a target diagonal line. The AI processing device 142 may determine a rectangular box based on the target diagonal line.
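For illustration, a minimal Python sketch of one plausible reading of this bounding mapping: take the longer diagonal of the quadrilateral as the target diagonal and return the axis-aligned rectangle it spans. The corner ordering and this geometric interpretation are assumptions of the sketch.

    import math

    def bounding_mapping(corners):
        """corners = (top_left, top_right, bottom_right, bottom_left),
        each an (x, y) point. Returns (x1, y1, x2, y2) of the rectangle
        spanned by the longer diagonal."""
        tl, tr, br, bl = corners
        if math.dist(tl, br) >= math.dist(tr, bl):
            p, q = tl, br
        else:
            p, q = tr, bl
        x1, x2 = sorted((p[0], q[0]))
        y1, y2 = sorted((p[1], q[1]))
        return (x1, y1, x2, y2)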
  • the AI processing device 142 may resize the rectangular box into a canonical size to determine an updated pooling region proposal.
  • the AI processing device 142 may resize the rectangular box by performing pooling to determine an updated pooling region proposal.
  • the updated pooling region proposal may have a canonical size and be accepted by the classifier.
  • the AI processing device 142 may proceed to operations 720 through 780 and start the next iteration. Descriptions of the operations 720 through 770 may be found elsewhere in the present disclosure, and are not repeated here.
  • the AI processing device 142 may repeat operations 720 through 780 until each of the plurality of corners corresponds to a crop strategy of target position.
  • the AI processing device 142 may proceed to operation 750.
  • the AI processing device 142 may determine whether each of the plurality of corners corresponds to the crop strategy of target position. In response to a determination that each of the plurality of corners corresponds to the crop strategy of target position, the AI processing device 142 may stop cropping the plurality of corners. The AI processing device 142 may proceed to operation 790.
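For illustration, a minimal Python sketch of the iterative loop of operations 720 through 790. The predict_crop callable is a hypothetical stand-in for the learned step that returns a (strategy, length) pair for one corner; the strategy strings follow the CROP_STRATEGIES encoding sketched earlier, and the bounding mapping/resizing step is abbreviated to a comment.

    # Unit offsets for each cropping direction (an assumption of the sketch).
    DIRECTION_VECTORS = {
        "left": (-1, 0), "right": (1, 0), "top": (0, -1), "bottom": (0, 1),
        "top_left": (-1, -1), "top_right": (1, -1),
        "bottom_left": (-1, 1), "bottom_right": (1, 1),
    }

    def trim_proposal(corners, predict_crop, max_iters=50):
        """corners: dict like {"top_left": (x, y), ...}. Returns the
        trimmed corners, or None if the proposal is abandoned."""
        for _ in range(max_iters):
            crops = {c: predict_crop(corners, c) for c in corners}
            if any(s == "F" for s, _ in crops.values()):
                return None              # a corner lies inside the object
            if all(s == "T" for s, _ in crops.values()):
                return corners           # every corner at target position
            for c, (strategy, length) in crops.items():
                if strategy not in ("T", "F"):
                    dx, dy = DIRECTION_VECTORS[strategy]
                    x, y = corners[c]
                    corners[c] = (x + dx * length, y + dy * length)
            # A full implementation would perform the bounding mapping and
            # resize the rectangle to the canonical size here (operations
            # 770 and 780) before the next iteration.
        return corners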
  • the AI processing device 142 may identify a boundary to the trimmed pooling region proposal based on the cropped plurality of corners.
  • the (trimmed) pooling region proposal may include four corners.
  • the AI processing device 142 may connect the four corners to determine a boundary on the feature map (s) .
  • the AI processing device 142 may map the boundary to the image to determine a boundary of the target object.
  • the boundary of the target object may be a quadrilateral box.
  • for each corner, a plurality of crop strategies may be determined based on a position of the corresponding corner. Furthermore, the cropping direction and/or the cropping length of each corner may be determined based on features of pixels in the pooling region proposal. That is, to crop a corner, the position of the corner and/or features of the pooling region proposal are taken into account.
  • a boundary of the target object determined according to the process disclosed in the present disclosure may be more suitable for the target object, especially, for a tilted target object, which may improve the accuracy of detecting and/or locating the target object.
  • the present disclosure may provide a suitable boundary for the tilted target object, and further improve the accuracy of detecting and/or locating the tilted target object.
  • operations 730 and 750 may be performed simultaneously.
  • operation 750 may be performed before operation 730.
  • the AI processing device 142 may repeat the process 700 to determine one or more boundaries corresponding to the target object.
  • FIG. 8 is a schematic diagram illustrating an exemplary process for determining a boundary of a target object according to some embodiments of the present disclosure.
  • an image is inputted into a convolutional neural network (CNN) .
  • the image may include one or more target objects (e.g., objects to be detected) .
  • the CNN may be generated based on a ZF model, VGG-16, ResNet-50, etc.
  • the CNN may include one or more convolution layers, one or more pooling layers and be without a full connection layer.
  • a plurality of feature maps may be generated.
  • the plurality of feature maps may include feature information of the image. Details regarding the generation of the feature maps may be found elsewhere in the present disclosure (e.g., operations 510 and 520, and the descriptions thereof) .
  • the plurality of feature maps may be inputted into a region proposal network (RPN) .
  • a sliding window may slide over the plurality of feature maps.
  • a plurality of sliding-window locations may be determined.
  • at each sliding-window location, a sub-region of the plurality of feature maps may be mapped to a multi-dimensional feature vector (e.g., a 256-dimensional feature vector) , and an anchor may be generated.
  • the anchor may correspond to a set of anchor boxes, each of which may be associated with a scale and an aspect ratio.
  • the RPN includes at least one regression layer and at least one classification layer.
  • the multi-dimensional feature vector and/or the set of anchor boxes are fed into the at least one regression layer and the at least one classification layer.
  • the output of the at least one regression layer may include four coordinate values of each of the set of preliminary region proposals.
  • the output of the at least one classification layer may include a first score of being foreground and a second score of being background of each of the set of preliminary region proposals.
  • a plurality of preliminary region proposals may be determined.
  • a portion of the plurality of preliminary region proposals may be selected as a plurality of region proposals.
  • the plurality of region proposals may include positive samples (e.g., foreground) and negative samples (e.g., background) .
  • the plurality of region proposals may be further processed. Details regarding the determination of the region proposals may be found elsewhere in the present disclosure (e.g., operation 530 of the process 500) .
  • an ROI pooling operation is performed based on the plurality of feature maps and the plurality of region proposals.
  • the plurality of region proposals may be mapped to the plurality of feature maps to determine a plurality of proposal feature maps (also referred to as ROIs) .
  • the plurality of ROIs may be resized to a canonical size (e.g., 7*7) by performing pooling on the plurality of ROIs. Then a plurality of pooling region proposals may be determined.
  • the plurality of pooling region proposals may be inputted into a classifier for further processing.
  • the plurality of pooling region proposals may be classified into one or more object categories (e.g., K categories) or a background category via the classifier. If a pooling region proposal is determined as the background category, the pooling region proposal may be omitted and/or removed.
  • the plurality of pooling region proposals may include one or more pooling region proposals corresponding to a target object. For a pooling region proposal, a boundary of the target object in the image may be determined based on the pooling region proposal. To determine the boundary of the target object, the pooling region proposal may be trimmed one or more times. In some embodiments, the pooling region proposal may include a plurality of corners, as shown in FIG. 8.
  • the pooling region proposal includes four corners, that is, a top left (TL) corner, a top right (TR) corner, a bottom left (BL) corner, and a bottom right (BR) corner.
  • Each of the four corners includes five crop strategies.
  • the five strategies of the top left corner include cropping to right (→) , cropping to bottom right (↘) , cropping to bottom (↓) , target position (T) and false (F) .
  • the five strategies of the top right corner include cropping to left (←) , cropping to bottom left (↙) , cropping to bottom (↓) , target position (T) and false (F) .
  • the five strategies of the bottom left corner include cropping to right (→) , cropping to top right (↗) , cropping to top (↑) , target position (T) and false (F) .
  • the five strategies of the bottom right corner include cropping to left (←) , cropping to top left (↖) , cropping to top (↑) , target position (T) and false (F) .
  • a crop strategy of each of the four corners may be determined based on features of pixels in the pooling region proposal. Whether any of the four corners corresponds to a crop strategy of false may then be determined. If at least one of the four corners corresponds to the crop strategy of false, it may be determined that the pooling region proposal does not encompass the whole target object, and the pooling region proposal may be abandoned and/or rejected.
  • alternatively, none of the four corners may correspond to the crop strategy of false. Whether each of the four corners corresponds to a crop strategy of target position may then be determined. If a corner does not correspond to the crop strategy of target position, the corner may be cropped based on the determined crop strategy. When each such corner is cropped according to its crop strategy, a trimmed pooling region proposal is determined. A bounding mapping may be performed based on the cropped four corners to determine a rectangular box. The rectangular box is resized into a canonical size to determine an updated pooling region proposal. The updated pooling region proposal may be further trimmed in a next iteration. When each of the four corners corresponds to the crop strategy of target position, no further cropping is performed.
  • a boundary of the (trimmed) pooling region proposal may be identified.
  • the boundary may be mapped to the image to determine a boundary of the target object. More descriptions of the determination of the boundary of the target object may be found elsewhere in the present disclosure (e.g., FIG. 7 and the descriptions thereof) .
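For illustration, the overall flow of FIG. 8 as a short Python sketch. Every argument is a hypothetical stand-in for a component sketched earlier, not a disclosed API; the sketch only fixes the order of the steps.

    def detect_objects(image, cnn, rpn, classify, trim, map_to_image):
        """Run the FIG. 8 pipeline and return boundaries in image space."""
        feature_maps = cnn(image)              # convolution + pooling layers
        boundaries = []
        for proposal in rpn(feature_maps):     # proposals (after NMS + ROI pooling)
            if classify(proposal) == "background":
                continue                       # background proposals omitted
            trimmed = trim(proposal)           # iterative corner cropping
            if trimmed is not None:            # None -> proposal abandoned
                boundaries.append(map_to_image(trimmed, image))
        return boundaries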
  • FIGs. 9A to 9C are schematic diagrams illustrating an image according to some embodiments of the present disclosure.
  • the image may include a target object, i.e., tilted characters “CHANEL” .
  • FIG. 9B shows a boundary 902 (also referred to as bounding box) of “CHANEL” , which is determined according to a Faster-RCNN algorithm.
  • FIG. 9C shows a boundary 904 of “CHANEL” , which is determined according to the process described in the present disclosure.
  • the boundary 902 includes more background than the target object, and thus cannot accurately locate the target object.
  • the boundary 904 includes less background, which enables accurate locating of the target object.
  • the process described in the present disclosure can improve the accuracy of object detection.
  • aspects of the present disclosure may be illustrated and described herein in any of a number of patentable classes or contexts, including any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof. Accordingly, aspects of the present disclosure may be implemented as entirely hardware, entirely software (including firmware, resident software, micro-code, etc.) , or a combination of software and hardware that may all generally be referred to herein as a “unit,” “module,” or “system.” Furthermore, aspects of the present disclosure may take the form of a computer program product embodied in one or more computer readable media having computer readable program code embodied thereon.
  • a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including electro-magnetic, optical, or the like, or any suitable combination thereof.
  • a computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that may communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer readable signal medium may be transmitted using any appropriate medium, including wireless, wireline, optical fiber cable, RF, or the like, or any suitable combination of the foregoing.
  • Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB. NET, Python or the like, conventional procedural programming languages, such as the "C" programming language, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP, ABAP, dynamic programming languages such as Python, Ruby and Groovy, or other programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN) , or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider) or in a cloud computing environment or offered as a service such as a Software as a Service (SaaS) .

Abstract

Systems and methods for object detection are provided. The systems and methods may obtain an image including a target object; generate feature map(s); determine region proposal(s) based on the feature map(s); determine pooling region proposal(s) based on the region proposal(s) and the feature map(s); and classify the pooling region proposal(s) into one or more object categories or a background category via a classifier. Each of the one or more pooling region proposals has a plurality of corners. For each pooling region proposal corresponding to the target object, the systems and methods may determine crop strategies for each corner according to a position of each corner; trim the pooling region proposal by cropping each corner according to one of the crop strategies; identify a boundary to the trimmed pooling region proposal based on the cropped corners; and map the boundary to the image to determine a boundary of the target object.

Description

AI SYSTEMS AND METHODS FOR OBJECT DETECTION
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims priority to Chinese Application No. 201811438175.5, filed on November 27, 2018, the entire contents of which are hereby incorporated by reference.
TECHNICAL FIELD
The present disclosure generally relates to systems and methods for image processing, and in particular, to systems and methods for detecting objects in an image.
BACKGROUND
With the emergence and popularity of artificial intelligent applications (e.g., face recognition, intelligent security cameras) , artificial intelligent object detection techniques, especially deep learning-based object detection techniques, have been rapidly developed. The artificial intelligent object detection techniques can identify and/or classify an object in an image, and locate the object in the image by drawing a bounding box. However, the bounding box may generally be a rectangular box. For an object that is irregular or tilted relative to the image (e.g., a safety belt) , the bounding box (e.g., a rectangular box) may include background. In some cases, the bounding box may include more background than the object, and thus cannot locate the object accurately. Thus, it is desirable to provide artificial intelligent systems and methods for determining a boundary for tilted object(s), which may enable accurate locating of the tilted object(s).
SUMMARY
The present disclosure relates to AI systems and methods for object detection. In one aspect of the present disclosure, an artificial intelligent image processing system for object detection is provided. The artificial intelligent image processing system may include at least one storage device and at least one processor in communication with the at least one storage device. The at least one storage device may include a set of  instructions for determining a boundary corresponding to an object in an image. When executing the set of instructions, the at least one processor may be directed to obtain an image including a target object, and generate a plurality of feature maps by inputting the image into a convolutional neural network (CNN) . The at least one processor may also be directed to determine a plurality of region proposals based on the plurality of feature maps, and determine a plurality of pooling region proposals based on the plurality of region proposals and the plurality of feature maps. The at least one processor may be further directed to classify the plurality of pooling region proposals into one or more object categories or a background category via a classifier. The one or more object categories may include a category of the target object, and the plurality of pooling region proposals may include one or more pooling region proposals corresponding to the target object. Each of the one or more pooling region proposals may have a plurality of corners. For each of the one or more pooling region proposals corresponding to the target object, the at least one processor may be directed to determine a plurality of crop strategies for each corner of the plurality of corners of the pooling region proposal according to a position of the corresponding corner; trim the pooling region proposal by cropping each of the plurality of corners according to one of the plurality of crop strategies; identify a boundary to the trimmed pooling region proposal based on the cropped plurality of corners; and map the boundary to the image to determine a boundary of the target object.
In some embodiments, the CNN may include one or more convolution layers and one or more pooling layers and is without a full connection layer.
In some embodiments, the plurality of region proposals may be determined according to a region proposal network (RPN) .
In some embodiments, the RPN may include at least one regression layer and at least one classification layer. To determine the plurality of region proposals, the at least one processor may be directed to slide a sliding window over the plurality of feature maps. At each sliding-window location, the sliding window may coincide with a sub-region of the plurality of feature maps. The at least one processor may be directed to  map the sub-region of the plurality of feature maps to a multi-dimensional feature vector, and generate an anchor by mapping a center pixel of the sub-region to a pixel of the image. The anchor may correspond to a set of anchor boxes in the image, and each of the set of anchor boxes may be associated with a scale and an aspect ratio. The at least one processor may also be directed to feed the multi-dimensional feature vector into the at least one regression layer and the at least one classification layer, respectively. The at least one regression layer may be configured to conduct bounding-box regression to determine a set of preliminary region proposals corresponding to the set of anchor boxes, and the output of the at least one regression layer may include four coordinate values of each of the set of preliminary region proposals. The at least one classification layer may be configured to determine a category for each of the set of preliminary region proposals. The category may be a foreground or a background, and the output of the at least one classification layer may include a first score of being foreground and a second score of being background of each of the set of preliminary region proposals. The at least one processor may be further directed to select a portion of the plurality of preliminary region proposals as the plurality of region proposals based on the first score of being foreground and the second score of being background of each of a plurality of preliminary region proposals and four coordinate values of each of the plurality of preliminary region proposals.
In some embodiments, to select a portion of the plurality of preliminary region proposals as the plurality of region proposals, the at least one processor may be directed to select the plurality of region proposals using a non-maximum suppression (NMS) .
In some embodiments, the plurality of pooling region proposals may correspond to a canonical size. To determine the plurality of pooling region proposals, the at least one processor may be further directed to map the plurality of region proposals to the plurality of feature maps to determine a plurality of proposal feature maps, and determine the plurality of pooling region proposals by performing pooling on the plurality of proposal feature maps.
In some embodiments, the plurality of corners may include a top left corner, a top right corner, a bottom left corner, and a bottom right corner. The plurality of crop strategies of the top left corner may include at least one of cropping to right, cropping to bottom, cropping to bottom right, target position, or false. The plurality of crop strategies of the top right corner may include at least one of cropping to left, cropping to bottom, cropping to bottom left, target position, or false. The plurality of crop strategies of the bottom left corner may include at least one of cropping to right, cropping to top, cropping to top right, target position, or false. The plurality of crop strategies of the bottom right corner may include at least one of cropping to left, cropping to top, cropping to top left, target position, or false.
In some embodiments, the at least one processor may be further directed to stop cropping one of the plurality of corners when the corner corresponds to a crop strategy of target position.
In some embodiments, to crop each of the plurality of corners, the at least one processor may be directed to determine a cropping direction and a cropping length for each of the plurality of corners based on the pooling region proposal. The cropping direction of each of the plurality of corners may be limited to one of the plurality of crop strategies of the corresponding corner. The at least one processor may also be directed to crop each of the plurality of corners based on the cropping direction and the cropping length.
In some embodiments, to trim the pooling region proposal by cropping each of the plurality of corners, the at least one processor may be directed to perform one or more iterations. In each of the one or more iterations, the at least one processor may be directed to determine a crop strategy for each of the plurality of corners from the plurality of crop strategies based on the pooling region proposal; determine whether one of the plurality of corners corresponds to a crop strategy of false; determine whether each of the plurality of corners corresponds to a crop strategy of target position in response to a determination that none of the plurality of corners corresponds to the crop strategy of false; in response to a determination that at least one of the plurality of corners does not correspond to the crop strategy of target position, crop the at least one of the plurality of corners according to the determined crop strategy of the at least one of the plurality of corners; perform, based on the cropped plurality of corners, a bounding mapping to determine a rectangular box; and resize the rectangular box into a canonical size. The at least one processor may also be directed to stop cropping the plurality of corners in response to a determination that each of the plurality of corners corresponds to the crop strategy of target position.
In some embodiments, the at least one processor may be further directed to abandon the pooling region proposal in response to a determination that at least one of the plurality of corners corresponds to the crop strategy of false.
In some embodiments, the at least one processor may be further directed to determine one or more boundaries corresponding to the target object; determine an intersection-over-union (IoU) between each of the one or more boundaries and a ground truth; and determine one of the one or more boundaries with the greatest IoU as a target boundary corresponding to the target object.
In some embodiments, the boundary of the target object may be a quadrilateral box.
In another aspect of the present disclosure, an artificial intelligent image processing method is provided. The artificial intelligent image processing method may be implemented on a computing device. The computing device may have at least one processor, at least one computer-readable storage medium, and a communication platform connected to a network. The method may include obtaining an image including a target object, and generating a plurality of feature maps by inputting the image into a convolutional neural network (CNN) . The method may also include determining a plurality of region proposals based on the plurality of feature maps, and determining a plurality of pooling region proposals based on the plurality of region proposals and the plurality of feature maps. The method may further include classifying the plurality of  pooling region proposals into one or more object categories or a background category via a classifier. The one or more object categories may include a category of the target object, and the plurality of pooling region proposals may include one or more pooling region proposals corresponding to the target object. Each of the one or more pooling region proposals may have a plurality of corners. For each of the one or more pooling region proposals corresponding to the target object, the method may include determining a plurality of crop strategies for each corner of the plurality of corners of the pooling region proposal according to a position of the corresponding corner; trimming the pooling region proposal by cropping each of the plurality of corners according to one of the plurality of crop strategies; identifying a boundary to the trimmed pooling region proposal based on the cropped plurality of corners; and mapping the boundary to the image to determine a boundary of the target object.
In another aspect of the present disclosure, a non-transitory computer-readable storage medium is provided. The non-transitory computer-readable storage medium may include at least one set of instructions for artificial intelligent object detection. When executed by at least one processor of a computing device, the at least one set of instructions may direct the at least one processor to perform acts of obtaining an image including a target object; generating a plurality of feature maps by inputting the image into a convolutional neural network (CNN) ; determining a plurality of region proposals based on the plurality of feature maps; determining a plurality of pooling region proposals based on the plurality of region proposals and the plurality of feature maps; classifying the plurality of pooling region proposals into one or more object categories or a background category via a classifier. The one or more object categories may include a category of the target object, and the plurality of pooling region proposals may include one or more pooling region proposals corresponding to the target object. Each of the one or more pooling region proposals may have a plurality of corners. For each of the one or more pooling region proposals corresponding to the target object, the at least one set of instructions may also direct the at least one processor to perform acts of determining a plurality of crop  strategies for each corner of the plurality of corners of the pooling region proposal according to a position of the corresponding corner; trimming the pooling region proposal by cropping each of the plurality of corners according to one of the plurality of crop strategies; identifying a boundary to the trimmed pooling region proposal based on the cropped plurality of corners; and mapping the boundary to the image to determine a boundary of the target object.
In another aspect of the present disclosure, an artificial intelligent image processing system for object detection is provided. The artificial intelligent image processing system may include an acquisition module configured to obtain an image including a target object; a feature map determination module configured to generate a plurality of feature maps by inputting the image into a convolutional neural network (CNN) ; a region proposal determination module configured to determine a plurality of region proposals based on the plurality of feature maps; a pooling region proposal determination module configured to determine a plurality of pooling region proposals based on the plurality of region proposals and the plurality of feature maps; a classification module configured to classify the plurality of pooling region proposals into one or more object categories or a background category via a classifier. The one or more object categories may include a category of the target object, and the plurality of pooling region proposals may include one or more pooling region proposals corresponding to the target object. Each of the one or more pooling region proposals may have a plurality of corners. For each of the one or more pooling region proposals corresponding to the target object, the artificial intelligent image processing system may also include a boundary determination module configured to determine a plurality of crop strategies for each corner of the plurality of corners of the pooling region proposal according to a position of the corresponding corner; the boundary determination module configured to trim the pooling region proposal by cropping each of the plurality of corners according to one of the plurality of crop strategies; the boundary determination module configured to identify a boundary to the trimmed pooling region proposal based on the cropped plurality of corners; and the  boundary determination module configured to map the boundary to the image to determine a boundary of the target object.
Additional features will be set forth in part in the description which follows, and in part will become apparent to those skilled in the art upon examination of the following and the accompanying drawings or may be learned by production or operation of the examples. The features of the present disclosure may be realized and attained by practice or use of various aspects of the methodologies, instrumentalities and combinations set forth in the detailed examples discussed below.
BRIEF DESCRIPTION OF THE DRAWINGS
The present disclosure is further described in terms of exemplary embodiments. These exemplary embodiments are described in detail with reference to the drawings. The drawings are not to scale. These embodiments are non-limiting schematic embodiments, in which like reference numerals represent similar structures throughout the several views of the drawings, and wherein:
FIG. 1 is a schematic diagram illustrating an exemplary AI image processing system according to some embodiments of the present disclosure;
FIG. 2 is a schematic diagram illustrating exemplary hardware and/or software components of a computing device according to some embodiments of the present disclosure;
FIG. 3 is a schematic diagram illustrating exemplary hardware and/or software components of a mobile device according to some embodiments of the present disclosure;
FIG. 4 is a block diagram illustrating an exemplary AI processing device according to some embodiments of the present disclosure;
FIG. 5 is a flowchart illustrating an exemplary process for determining a boundary of a target object according to some embodiments of the present disclosure;
FIG. 6 is a schematic diagram illustrating an exemplary region proposal network according to some embodiments of the present disclosure;
FIG. 7 is a flowchart illustrating an exemplary process for determining a boundary of a target object according to some embodiments of the present disclosure;
FIG. 8 is a schematic diagram illustrating an exemplary process for determining a boundary of a target object according to some embodiments of the present disclosure; and
FIGs. 9A to 9C are schematic diagrams illustrating an image according to some embodiments of the present disclosure.
DETAILED DESCRIPTION
The following description is presented to enable any person skilled in the art to make and use the present disclosure, and is provided in the context of a particular application and its requirements. Various modifications to the disclosed embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the present disclosure. Thus, the present disclosure is not limited to the embodiments shown, but is to be accorded the widest scope consistent with the claims.
The terminology used herein is for the purpose of describing particular example embodiments only and is not intended to be limiting. As used herein, the singular forms “a,” “an, ” and “the” may be intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprise, ” “comprises, ” and/or “comprising, ” “include, ” “includes, ” and/or “including, ” when used in this disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
These and other features, and characteristics of the present disclosure, as well as the methods of operations and functions of the related elements of structure and the combination of parts and economies of manufacture, may become more apparent upon consideration of the following description with reference to the accompanying drawings, all  of which form a part of this disclosure. It is to be expressly understood, however, that the drawings are for the purpose of illustration and description only and are not intended to limit the scope of the present disclosure. It is understood that the drawings are not to scale.
The flowcharts used in the present disclosure illustrate operations that systems implement according to some embodiments of the present disclosure. It is to be expressly understood, the operations of the flowcharts may be implemented not in order. Conversely, the operations may be implemented in inverted order, or simultaneously. Moreover, one or more other operations may be added to the flowcharts. One or more operations may be removed from the flowcharts.
Moreover, while the system and method in the present disclosure is described primarily regarding an on-demand transportation service, it should also be understood that this is only one exemplary embodiment. The system or method of the present disclosure may be applied to any other kind of on demand service. For example, the system or method of the present disclosure may be applied to transportation systems of different environments including land, ocean, aerospace, or the like, or any combination thereof. The vehicle of the transportation systems may include a taxi, a private car, a hitch, a bus, a train, a bullet train, a high-speed rail, a subway, a vessel, an aircraft, a spaceship, a hot-air balloon, a driverless vehicle, or the like, or any combination thereof. The transportation system may also include any transportation system for management and/or distribution, for example, a system for sending and/or receiving an express. The application of the system or method of the present disclosure may include a web page, a plug-in of a browser, a client terminal, a custom system, an internal analysis system, an artificial intelligence robot, or the like, or any combination thereof.
The terms “passenger,” “requester,” “service requester,” and “customer” in the present disclosure are used interchangeably to refer to an individual, an entity, or a tool that may request or order a service. Also, the terms “driver,” “provider,” “service provider,” and “supplier” in the present disclosure are used interchangeably to refer to an individual, an entity, or a tool that may provide a service or facilitate the providing of the service. The term “user” in the present disclosure may refer to an individual, an entity, or a tool that may request a service, order a service, provide a service, or facilitate the providing of the service. For example, the user may be a passenger, a driver, an operator, or the like, or any combination thereof. In the present disclosure, “passenger” and “passenger terminal” may be used interchangeably, and “driver” and “driver terminal” may be used interchangeably.
The terms “service request” and “order” in the present disclosure are used interchangeably to refer to a request that may be initiated by a passenger, a requester, a service requester, a customer, a driver, a provider, a service provider, a supplier, or the like, or any combination thereof. The service request may be accepted by any one of a passenger, a requester, a service requester, a customer, a driver, a provider, a service provider, or a supplier. The service request may be chargeable or free.
The positioning technology used in the present disclosure may be based on a global positioning system (GPS), a global navigation satellite system (GLONASS), a compass navigation system (COMPASS), a Galileo positioning system, a quasi-zenith satellite system (QZSS), a wireless fidelity (WiFi) positioning technology, or the like, or any combination thereof. One or more of the above positioning systems may be used interchangeably in the present disclosure.
The present disclosure relates to artificial intelligence (AI) systems and methods for object detection in an image. Specifically, the AI systems and methods may determine a boundary for a target object in the image. The determined boundary of the target object may be a quadrilateral box. To determine the boundary of the target object, the AI systems and methods may input the image into a convolutional neural network (CNN) to generate a plurality of feature maps, and generate a plurality of region proposals based on the plurality of feature maps. The AI systems and methods may determine a plurality of pooling region proposals based on the plurality of region proposals and the plurality of feature maps by performing an ROI pooling operation. The AI systems and methods may further classify the plurality of pooling region proposals into one or more object categories or a background category via a classifier. The plurality of pooling region proposals may include one or more pooling region proposals corresponding to the target object. Each of the one or more pooling region proposals may have a plurality of corners. For a pooling region proposal corresponding to the target object, the AI systems and methods may determine a plurality of crop strategies for each corner of the plurality of corners of the pooling region proposal according to a position of the corresponding corner. The AI systems and methods may also trim the pooling region proposal by cropping each of the plurality of corners according to one of the plurality of crop strategies. In some embodiments, the AI systems and methods may crop a corner based on a cropping direction and a cropping length, which may be determined based on the pooling region proposal. The AI systems and methods may identify a boundary to the trimmed pooling region proposal based on the cropped plurality of corners, and map the boundary to the image to determine the boundary of the target object. In the present disclosure, information (e.g., positions of the corners) related to the corners of the pooling region proposal and features in the pooling region proposal may be considered. Thus, the boundary of the target object determined according to the present disclosure may be more suitable for the target object, especially for a tilted target object (e.g., a safety belt, tilted characters), which may improve the accuracy of locating the target object.
FIG. 1 is a schematic diagram illustrating an exemplary AI image processing system 100 according to some embodiments of the present disclosure. The AI image processing system 100 may be configured for object detection. For example, the AI image processing system 100 may determine a boundary corresponding to an object in an image. In some embodiments, the AI image processing system 100 may be an online platform providing an Online to Offline (O2O) service. The AI image processing system 100 may include a sensor 110, a network 120, a terminal 130, a server 140, and a storage device 150.
The sensor 110 may be configured to capture one or more images. As used in this application, an image may be a still image, a video, a stream video, or a video frame obtained from a video. The image may be a three-dimensional (3D) image or a two-dimensional (2D) image. The sensor 110 may be or include one or more cameras. In some embodiments, the sensor 110 may be a digital camera, a video camera, a security camera, a web camera, a smartphone, a tablet, a laptop, a video gaming console equipped with a web camera, a camera with multiple lenses, a camcorder, etc. In some embodiments, the sensor 110 (e.g., a camera) may capture an image including one or more objects.
The network 120 may facilitate the exchange of information and/or data. In some embodiments, one or more components of the AI image processing system 100 (e.g., the sensor 110, the terminal 130, the server 140, the storage device 150) may send information and/or data to other component(s) in the AI image processing system 100 via the network 120. For example, the server 140 may process an image obtained from the sensor 110 via the network 120. As another example, the server 140 may obtain user instructions from the terminal 130 via the network 120. In some embodiments, the network 120 may be any type of wired or wireless network, or combination thereof. Merely by way of example, the network 120 may include a cable network, a wireline network, an optical fiber network, a telecommunications network, an intranet, the Internet, a local area network (LAN), a wide area network (WAN), a wireless local area network (WLAN), a metropolitan area network (MAN), a public telephone switched network (PSTN), a Bluetooth™ network, a ZigBee™ network, a near field communication (NFC) network, or the like, or any combination thereof. In some embodiments, the network 120 may include one or more network access points. For example, the network 120 may include wired or wireless network access points such as base stations and/or internet exchange points 120-1, 120-2, …, through which one or more components of the AI image processing system 100 may be connected to the network 120 to exchange data and/or information.
The terminal 130 may include a mobile device 130-1, a tablet computer 130-2, a laptop computer 130-3, or the like, or any combination thereof. In some embodiments, the mobile device 130-1 may include a smart home device, a wearable device, a smart mobile device, a virtual reality device, an augmented reality device, or the like, or any combination thereof. In some embodiments, the smart home device may include a smart lighting device, a control device of an intelligent electrical apparatus, a smart monitoring device, a smart television, a smart video camera, an interphone, or the like, or any combination thereof. In some embodiments, the wearable device may include a bracelet, footgear, eyeglasses, a helmet, a watch, clothing, a backpack, an accessory, or the like, or any combination thereof. In some embodiments, the smart mobile device may include a smartphone, a personal digital assistant (PDA), a gaming device, a navigation device, a point of sale (POS) device, or the like, or any combination thereof. In some embodiments, the virtual reality device and/or the augmented reality device may include a virtual reality helmet, virtual reality glasses, a virtual reality patch, an augmented reality helmet, augmented reality glasses, an augmented reality patch, or the like, or any combination thereof. For example, the virtual reality device and/or the augmented reality device may include a Google Glass, an Oculus Rift, a HoloLens, a Gear VR, etc. In some embodiments, the terminal 130 may remotely operate the sensor 110. In some embodiments, the terminal 130 may operate the sensor 110 via a wireless connection. In some embodiments, the terminal 130 may receive information and/or instructions inputted by a user, and send the received information and/or instructions to the sensor 110 or to the server 140 via the network 120. In some embodiments, the terminal 130 may receive data and/or information from the server 140. In some embodiments, the terminal 130 may be part of the server 140. In some embodiments, the terminal 130 may be omitted.
In some embodiments, the server 140 may be a single server or a server group. The server group may be centralized, or distributed (e.g., the server 140 may be a distributed system). In some embodiments, the server 140 may be local or remote. For example, the server 140 may access information and/or data stored in the sensor 110, the terminal 130, and/or the storage device 150 via the network 120. As another example, the server 140 may be directly connected to the sensor 110, the terminal 130, and/or the storage device 150 to access stored information and/or data. In some embodiments, the server 140 may be implemented on a cloud platform. Merely by way of example, the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an inter-cloud, a multi-cloud, or the like, or any combination thereof. In some embodiments, the server 140 may be implemented on a computing device 200 having one or more components illustrated in FIG. 2 in the present disclosure.
In some embodiments, the server 140 may include an AI processing device 142. The AI processing device 142 may process information and/or data to perform one or more functions described in the present disclosure. For example, the AI processing device 142 may process an image including a target object to determine a boundary of the target object in the image. In some embodiments, the AI processing device 142 may include one or more processing devices (e.g., single-core processing device (s) or multi-core processor (s) ) . Merely by way of example, the AI processing device 142 may include a central processing unit (CPU) , an application-specific integrated circuit (ASIC) , an application-specific instruction-set processor (ASIP) , a graphics processing unit (GPU) , a physics processing unit (PPU) , a digital signal processor (DSP) , a field-programmable gate array (FPGA) , a programmable logic device (PLD) , a controller, a microcontroller unit, a reduced instruction-set computer (RISC) , a microprocessor, or the like, or any combination thereof.
The storage device 150 may store data and/or instructions. In some embodiments, the storage device 150 may store data obtained from the terminal 130 and/or the server 140. In some embodiments, the storage device 150 may store data and/or instructions that the server 140 may execute or use to perform exemplary methods described in the present disclosure. In some embodiments, the storage device 150 may include a mass storage, a removable storage, a volatile read-and-write memory, a read-only memory (ROM), or the like, or any combination thereof. Exemplary mass storage may include a magnetic disk, an optical disk, a solid-state drive, etc. Exemplary removable storage may include a flash drive, a floppy disk, an optical disk, a memory card, a zip disk, a magnetic tape, etc. Exemplary volatile read-and-write memory may include a random-access memory (RAM). Exemplary RAM may include a dynamic RAM (DRAM), a double data rate synchronous dynamic RAM (DDR SDRAM), a static RAM (SRAM), a thyristor RAM (T-RAM), a zero-capacitor RAM (Z-RAM), etc. Exemplary ROM may include a mask ROM (MROM), a programmable ROM (PROM), an erasable programmable ROM (EPROM), an electrically erasable programmable ROM (EEPROM), a compact disk ROM (CD-ROM), a digital versatile disk ROM, etc. In some embodiments, the storage device 150 may be implemented on a cloud platform. Merely by way of example, the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an inter-cloud, a multi-cloud, or the like, or any combination thereof.
In some embodiments, the storage device 150 may be connected to the network 120 to communicate with one or more components of the AI image processing system 100 (e.g., the sensor 110, the terminal 130, the server 140) . One or more components in the AI image processing system 100 may access the data or instructions stored in the storage device 150 via the network 120. In some embodiments, the storage device 150 may be directly connected to or communicate with one or more components in the AI image processing system 100 (e.g., the sensor 110, the terminal 130, the server 140) . In some embodiments, the storage device 150 may be part of the sensor 110.
One of ordinary skill in the art would understand that when an element (or component) of the AI image processing system 100 performs an operation, the element may perform the operation through electrical signals and/or electromagnetic signals. For example, when a terminal 130 transmits a request to the server 140, a processor of the terminal 130 may generate an electrical signal encoding the request. The processor of the terminal 130 may then transmit the electrical signal to an output port. If the terminal 130 communicates with the server 140 via a wired network, the output port may be physically connected to a cable, which may further transmit the electrical signal to an input port of the server 140. If the terminal 130 communicates with the server 140 via a wireless network, the output port of the terminal 130 may be one or more antennas, which convert the electrical signal to an electromagnetic signal. Within an electronic device, such as the terminal 130 and/or the server 140, when a processor thereof processes an instruction, transmits an instruction, and/or performs an action, the instruction and/or action is conducted via electrical signals. For example, when the processor retrieves or saves data from a storage medium, it may transmit electrical signals to a read/write device of the storage medium, which may read or write structured data in the storage medium. The structured data may be transmitted to the processor in the form of electrical signals via a bus of the electronic device. Here, an electrical signal may refer to one electrical signal, a series of electrical signals, and/or a plurality of discrete electrical signals.
FIG. 2 is a schematic diagram illustrating exemplary hardware and software components of a computing device 200 according to some embodiments of the present disclosure. In some embodiments, the terminal 130, and/or the server 140 may be implemented on the computing device 200. For example, the AI processing device 142 of the server 140 may be implemented on the computing device 200 and configured to perform functions of the AI processing device 142 disclosed in this disclosure.
The computing device 200 may be a special purpose computer, and may be used to implement an AI image processing system 100 for the present disclosure. The computing device 200 may be used to implement any component of the AI image processing system 100 as described herein. For example, the AI processing device 142 may be implemented on the computing device, via its hardware, software program, firmware, or a combination thereof. Although only one such computer is shown, for convenience, the computer functions relating to the image processing as described herein may be implemented in a distributed fashion on several similar platforms, to distribute the processing load.
The computing device 200, for example, may include a COM port 250 connected with a network that may implement the data communications. The computing device 200 may also include a processor 220, in the form of one or more processors (or CPUs), for executing program instructions. The exemplary computing device may include an internal communication bus 210, different types of program storage units and data storage units (e.g., a disk 270, a read only memory (ROM) 230, a random-access memory (RAM) 240), and various data files applicable to computer processing and/or communication. The exemplary computing device may also include program instructions stored in the ROM 230, the RAM 240, and/or other types of non-transitory storage media to be executed by the processor 220. The method and/or process of the present disclosure may be implemented as the program instructions. The computing device 200 also includes an I/O device 260 that may support the input and/or output of data flows between the computing device 200 and other components. The computing device 200 may also receive programs and data via the communication network.
Merely for illustration, only one CPU and/or processor is described in the computing device 200. However, it should be noted that the computing device 200 in the present disclosure may also include multiple CPUs and/or processors, thus operations and/or method steps that are performed by one CPU and/or processor as described in the present disclosure may also be jointly or separately performed by the multiple CPUs and/or processors. For example, if in the present disclosure the CPU and/or processor of the computing device 200 executes both step A and step B, it should be understood that step A and step B may also be performed by two different CPUs and/or processors jointly or separately in the computing device 200 (e.g., the first processor executes step A and the second processor executes step B, or the first and second processors jointly execute steps A and B) .
FIG. 3 is a schematic diagram illustrating exemplary hardware and/or software components of a mobile device 300 according to some embodiments of the present disclosure. In some embodiments, a terminal (e.g., the terminal 130) may be implemented on the mobile device 300. As illustrated in FIG. 3, the mobile device 300 may include a communication platform 310, a display 320, a graphics processing unit (GPU) 330, a central processing unit (CPU) 340, an I/O 350, a memory 360, an operating system (OS) 370, and a storage 390. In some embodiments, any other suitable component, including but not limited to a system bus or a controller (not shown), may also be included in the mobile device 300.
In some embodiments, a mobile operating system 370 (e.g., iOS™, Android™, Windows Phone™, etc.) and one or more applications 380 may be loaded into the memory 360 from the storage 390 in order to be executed by the CPU 340. The applications 380 may include a browser or any other suitable mobile apps for receiving and rendering information relating to image processing or other information from the AI image processing system 100. User interactions with the information stream may be achieved via the I/O 350 and provided to the storage device 150, the server 140, and/or other components of the AI image processing system 100.
To implement various modules, units, and their functionalities described in the present disclosure, computer hardware platforms may be used as the hardware platform(s) for one or more of the elements described herein. A computer with user interface elements may be used to implement a personal computer (PC) or any other type of work station or terminal device. A computer may also act as a server if appropriately programmed.
FIG. 4 is a block diagram illustrating an exemplary AI processing device 142 according to some embodiments of the present disclosure. The AI processing device 142 may include an acquisition module 401, a feature map determination module 403, a region proposal determination module 405, a pooling region proposal determination module 407, a classification module 409, and a boundary determination module 411. The modules may be hardware circuits of all or part of the AI processing device 142. The modules may also be implemented as an application or set of instructions read and executed by the AI processing device 142. Further, the modules may be any combination of the hardware circuits and the application/instructions. For example, the modules may be the part of the AI processing device 142 when the AI processing device 142 is executing the  application/set of instructions.
The acquisition module 401 may be configured to obtain information and/or data related to the AI image processing system 100. In some embodiments, the acquisition module 401 may obtain an image including a target object. In some embodiments, the image may be a still image or a video captured by the sensor 110. In some embodiments, the target object may refer to an object that is to be identified and/or detected in the image. For example, the target object may be an object tilted relative to the image (e.g., a safety belt, tilted characters). Alternatively, all objects in the image may need to be identified and/or detected, and each object in the image may be referred to as a target object. In some embodiments, the acquisition module 401 may obtain the image from one or more components of the AI image processing system 100, such as the sensor 110, the terminal 130, a storage (e.g., the storage device 150), or from an external source (e.g., ImageNet) via the network 120.
The feature map determination module 403 may be configured to generate a plurality of feature maps by inputting an image (e.g., the image obtained by the acquisition module 401) into a convolutional neural network (CNN). The CNN may be a trained CNN including one or more convolution layers and one or more pooling layers, without a full connection layer. The convolution layer(s) may be configured to extract features (or feature maps) of an image. The pooling layer(s) may be configured to reduce the size of the feature maps of the image. The feature maps may include feature information of the image.
The region proposal determination module 405 may be configured to determine a plurality of region proposals based on the plurality of feature maps. In some embodiments, the region proposal determination module 405 may determine the plurality of region proposals according to a region proposal network (RPN). Specifically, the region proposal determination module 405 may slide a sliding window over the plurality of feature maps. With the sliding of the sliding window over the plurality of feature maps, a plurality of sliding-window locations may be determined. At each sliding-window location, a set of preliminary region proposals may be determined. Since there is a plurality of sliding-window locations, a plurality of preliminary region proposals may be determined at the plurality of sliding-window locations. In some embodiments, multiple preliminary region proposals may highly overlap with each other, and the region proposal determination module 405 may select a portion of the plurality of preliminary region proposals as the plurality of region proposals. Merely by way of example, the region proposal determination module 405 may determine the plurality of region proposals using non-maximum suppression (NMS). Details regarding the determination of the region proposals may be found elsewhere in the present disclosure (e.g., operation 530 of the process 500 and the descriptions thereof).
The pooling region proposal determination module 407 may be configured to determine a plurality of pooling region proposals based on the plurality of region proposals and the plurality of feature maps. In some embodiments, the pooling region proposal determination module 407 may map the plurality of region proposals to the plurality of feature maps to determine a plurality of proposal feature maps (also referred to as regions of interest (ROIs)). The pooling region proposal determination module 407 may then determine the plurality of pooling region proposals by performing pooling on the plurality of proposal feature maps (or ROIs).
The classification module 409 may be configured to classify the plurality of pooling region proposals into one or more object categories or a background category via the classifier. In some embodiments, the classification module 409 may classify negative samples in the pooling region proposals into the background category. If a pooling region proposal is determined as belonging to the background category, the pooling region proposal may be omitted and not further processed. In some embodiments, the classification module 409 may classify a pooling region proposal corresponding to a positive sample into one of the one or more object categories. The one or more object categories may be default settings of the AI image processing system 100, and/or may be adjusted by a user.
The one or more object categories may include a category of the target object. The  classification module 409 may select one or more pooling region proposals corresponding to the target object from the plurality of pooling region proposals.
The boundary determination module 411 may be configured to determine a boundary of the target object in the image based on at least one of the one or more pooling region proposals. In some embodiments, the boundary may be a polygonal box, for example, a quadrilateral box. Merely by way of example, for a pooling region proposal having a plurality of corners (e.g., 4 corners, 5 corners, 8 corners, etc. ) , the boundary determination module 411 may determine a plurality of crop strategies for each corner according to a position of the corresponding corner. The boundary determination module 411 may trim the pooling region proposal by cropping each of the plurality of corners according to one of the plurality of crop strategies. The boundary determination module 411 may identify a boundary to the trimmed pooling region proposal based on the cropped plurality of corners, and map the boundary to the image to determine a boundary of the target object. Details regarding the determination of a boundary may be found elsewhere in the present disclosure (e.g., operation 560 of the process 500, process 700, and the descriptions thereof) .
In some embodiments, the boundary determination module 411 may determine one or more boundaries corresponding to the target object based on the one or more pooling region proposals. The boundary determination module 411 may determine an IoU between each of the one or more boundaries and a ground truth. In some embodiments, the ground truth may indicate a labelled boundary box of the target object. The IoU between a boundary and the ground truth may reflect a degree of overlapping of the boundary and the ground truth. The boundary determination module 411 may compare one or more determined IoUs related to the one or more boundaries, and determine a boundary with the greatest IoU as a target boundary corresponding to the target object.
The modules in the AI processing device 142 may be connected to or communicate with each other via a wired connection or a wireless connection. The wired  connection may include a metal cable, an optical cable, a hybrid cable, or the like, or any combination thereof. The wireless connection may include a Local Area Network (LAN) , a Wide Area Network (WAN) , a Bluetooth, a ZigBee, a Near Field Communication (NFC) , or the like, or any combination thereof.
It should be noted that the above description is merely provided for the purposes of illustration, and not intended to limit the scope of the present disclosure. For persons having ordinary skills in the art, multiple variations and modifications may be made under the teachings of the present disclosure. However, those variations and modifications do not depart from the scope of the present disclosure. In some embodiments, the AI processing device 142 may further include one or more additional modules. For example, the AI processing device 142 may further include a storage module (not shown in FIG. 4) configured to store data generated by the modules of the AI processing device 142.
FIG. 5 is a flowchart illustrating an exemplary process 500 for determining a boundary of a target object according to some embodiments of the present disclosure. For illustration purposes only, the AI processing device 142 may be described as a subject to perform the process 500. However, one of ordinary skill in the art would understand that the process 500 may also be performed by other entities. For example, one of ordinary skill in the art would understand that at least a portion of the process 500 may also be implemented on the computing device 200 as illustrated in FIG. 2 or the mobile device 300 as illustrated in FIG. 3. In some embodiments, one or more operations of process 500 may be implemented in the AI image processing system 100 as illustrated in FIG. 1. In some embodiments, one or more operations in the process 500 may be stored in the storage device 150 and/or the storage (e.g., the ROM 230, the RAM 240, etc.) in the form of instructions, and invoked and/or executed by the server 140 (e.g., the AI processing device 142 in the server 140, or the processor 220 of the AI processing device 142 in the server 140). In some embodiments, the instructions may be transmitted in a form of electronic current or electrical signals.
In 510, the AI processing device 142 (e.g., the acquisition module 401) may obtain an image including a target object. In some embodiments, the image may be an image captured by the sensor 110 (e.g., a camera of a smartphone, a camera in an autonomous vehicle, an intelligent security camera, a traffic camera). The captured image may be a still image, a video, etc. In some embodiments, the image may include multiple objects, such as people, animals (e.g., dog, cat), vehicles (e.g., bike, car, bus, truck), plants (e.g., flower, tree), buildings, scenery, or the like, or any combination thereof. In some embodiments, the image may include an object tilted relative to the image, such as a safety belt, tilted characters, etc. In some embodiments, the target object may refer to an object that is to be identified and/or detected in the image. For example, the target object may be an object tilted relative to the image (e.g., the safety belt, the tilted characters). Alternatively, all objects in the image may need to be identified and/or detected, and each object in the image may be referred to as a target object.
In some embodiments, the AI processing device 142 may obtain the image from one or more components of the AI image processing system 100, such as the sensor 110, the terminal 130, a storage device (e.g., the storage device 150) . Alternatively or additionally, the AI processing device 142 may obtain the image from an external source via the network 120. For example, the AI processing device 142 may obtain the image from ImageNet, etc.
In 520, the AI processing device 142 (e.g., the feature map determination module 403) may generate a plurality of feature maps by inputting the image into a convolutional neural network (CNN). The plurality of feature maps may include feature information of the image. In some embodiments, the CNN may be generated based on a Zeiler and Fergus model (ZF), VGG-16, ResNet-50, etc. In some embodiments, the CNN may be a trained CNN including one or more convolution layers and one or more pooling layers, without a full connection layer. The convolution layer(s) may be configured to extract features (or feature maps) of an image (e.g., the image obtained in 510). The pooling layer(s) may be configured to reduce the size of the feature maps of the image. In some embodiments, the image may be inputted into the CNN, and a plurality of feature maps may be generated. Merely by way of example, the CNN may be determined based on a ZF model. An image with the size of 600*1000 may be inputted into the ZF model, and 256 feature maps may be outputted from the ZF model. The size of each of the 256 feature maps may be 40*60.
In some embodiments, the CNN may be generated according to transfer learning. Transfer learning may reduce the training time by using previously obtained knowledge. Specifically, a base network may be a pre-trained network trained previously based on a plurality of first training samples obtained from a dataset (e.g., ImageNet dataset, PASCAL VOC, COCO, etc.). The base network may include one or more layers (e.g., convolution layer(s), pooling layer(s)) and a plurality of pre-trained weights. At least some of the one or more layers and their corresponding pre-trained weights may be transferred to a target network. For example, the base network may be VGG-16, including thirteen convolution layers, five pooling layers, and three full connection layers. The thirteen convolution layers and the first four pooling layers may be transferred to a target network (e.g., the CNN). In some embodiments, the pre-trained weights of the convolution layers and/or the pooling layers may not need to be adjusted, or may be fine-tuned based on a plurality of second training samples obtained from a dataset (e.g., ImageNet dataset, PASCAL VOC, COCO, etc.). In some embodiments, the target network may further include one or more additional layers other than the transferred layers. The weights in the additional layer(s) may be updated according to a plurality of third training samples obtained from a dataset (e.g., ImageNet dataset, PASCAL VOC, COCO, etc.). It should be noted that, in some embodiments, different from the transfer learning, the CNN may be directly generated by training a preliminary CNN using a plurality of fourth training samples obtained from a dataset (e.g., ImageNet dataset, PASCAL VOC, COCO, etc.).
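Merely for illustration, the following sketch shows one way such a transferred backbone could be assembled. It assumes PyTorch and torchvision (including the torchvision weights-enum API), which are not named in the present disclosure; the truncation point and the 600*1000 input size follow the description above.

```python
import torch
import torchvision

# Truncated VGG-16 backbone: keep the thirteen convolution layers and the
# first four pooling layers, drop the final pooling layer and all full
# connection layers, giving an output stride of 16.
vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1")  # pre-trained weights
backbone = torch.nn.Sequential(*list(vgg.features.children())[:-1])

image = torch.randn(1, 3, 600, 1000)   # one 600*1000 RGB image
feature_maps = backbone(image)
print(feature_maps.shape)              # torch.Size([1, 512, 37, 62])
```

Freezing the transferred weights, or fine-tuning them on the second training samples, corresponds to the two options described above.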
In 530, the AI processing device 142 (e.g., the region proposal determination module 405) may determine a plurality of region proposals based on the plurality of feature maps. In some embodiments, the AI processing device 142 may determine the plurality  of region proposals according to a region proposal network (RPN) . As shown in FIG. 6, the RPN may include at least one regression layer and at least one classification layer.
In some embodiments, the AI processing device 142 may slide a sliding window over the plurality of feature maps. The sliding window may also be referred to as a convolution kernel that has a size of, for example, 3*3, 5*5, etc. With the sliding of the sliding window over the plurality of feature maps, a plurality of sliding-window locations may be determined. Merely by way of example, the size of the sliding window may be 3*3, and the size of the plurality of feature maps may be 40*60. A padding operation (e.g., padding=1) may be performed on the plurality of feature maps. When sliding the sliding window over the plurality of feature maps, approximately 40*60 (2400) sliding-window locations may be determined.
At each sliding-window location, the sliding window may coincide with a sub-region of the plurality of feature maps. In some embodiments, the AI processing device 142 may map the sub-region of the plurality of feature maps to a multi-dimensional feature vector. For example, if there are 256 feature maps, a 256-dimensional feature vector may be generated at the sub-region. The AI processing device 142 may generate an anchor by mapping a center pixel of the sub-region to a pixel of the image obtained in 510. In some embodiments, the anchor may correspond to a set of anchor boxes (e.g., including k anchor boxes) in the image. Each of the set of anchor boxes may be a rectangular box. The anchor may be a center point of the set of anchor boxes. Each of the set of anchor boxes may be associated with a scale and an aspect ratio. Merely by way of example, if 3 scales (e.g., 128, 256, 512, etc.) and 3 aspect ratios (e.g., 1:1, 1:2, 2:1, etc.) are applied, the number of the set of anchor boxes may be 9. In some embodiments, the AI processing device 142 may feed the multi-dimensional feature vector and/or the set of anchor boxes into the at least one regression layer and the at least one classification layer, respectively. In some embodiments, the at least one regression layer may be configured to conduct bounding-box regression to determine a set of preliminary region proposals corresponding to the set of anchor boxes. The output of the at least one regression layer may include four coordinate values of each of the set of preliminary region proposals. In some embodiments, the four coordinate values of a preliminary region proposal may include a location of the preliminary region proposal (e.g., coordinates (x, y) of the anchor of the corresponding anchor box) and a size of the preliminary region proposal (e.g., a width w and a height h of the preliminary region proposal). The at least one classification layer may be configured to determine a category for each of the set of preliminary region proposals. The category may be a foreground or a background. The output of the at least one classification layer may include a first score of being foreground and a second score of being background of each of the set of preliminary region proposals.
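Merely for illustration, the anchor-box generation described above may be sketched as follows. The convention that the aspect ratio is a width-to-height ratio, and that each box keeps an area of roughly the scale squared, are assumptions for illustration.

```python
import itertools
import torch

def make_anchor_boxes(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return the k = len(scales) * len(ratios) anchor boxes centered on the
    anchor point (cx, cy), as (x1, y1, x2, y2) in image coordinates."""
    boxes = []
    for scale, ratio in itertools.product(scales, ratios):
        # Keep the box area near scale**2 while varying the aspect ratio w/h.
        w = scale * ratio ** 0.5
        h = scale / ratio ** 0.5
        boxes.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return torch.tensor(boxes)

anchors = make_anchor_boxes(300.0, 300.0)   # 3 scales x 3 ratios -> 9 boxes
print(anchors.shape)                        # torch.Size([9, 4])
```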
As described above, at each sliding-window location, a set of (e.g., 9) preliminary region proposals may be determined. Since there is a plurality of sliding-window locations (e.g., roughly 40*60), a plurality of (e.g., roughly 20000) preliminary region proposals may be determined at the plurality of sliding-window locations. In some embodiments, multiple preliminary region proposals may highly overlap with each other. The AI processing device 142 may select a portion of the plurality of preliminary region proposals as a plurality of region proposals. In some embodiments, the AI processing device 142 may select the plurality of region proposals using non-maximum suppression (NMS). Specifically, the AI processing device 142 may determine the plurality of region proposals based on the first score of being foreground and the second score of being background of each of the plurality of preliminary region proposals and the four coordinate values of each of the plurality of preliminary region proposals. In some embodiments, the AI processing device 142 may determine an intersection-over-union (IoU) between each of the plurality of preliminary region proposals and a ground truth. The ground truth may be a labelled boundary box of the target object. The AI processing device 142 may determine preliminary region proposals that have an IoU greater than 0.7 as positive samples, and determine preliminary region proposals that have an IoU less than 0.3 as negative samples. The AI processing device 142 may remove preliminary region proposals other than the positive samples and the negative samples. In some embodiments, the AI processing device 142 may select the plurality of region proposals from the positive samples and the negative samples. In some embodiments, the AI processing device 142 may rank the positive samples based on the first score of being foreground of each of the positive samples, and select multiple positive samples based on the ranked positive samples. The AI processing device 142 may rank the negative samples based on the second score of being background of each of the negative samples, and select multiple negative samples based on the ranked negative samples. The selected positive samples and the selected negative samples may constitute the plurality of region proposals. In some embodiments, the AI processing device 142 may select 300 region proposals. The number of the selected positive samples may be the same as or different from that of the selected negative samples. In some embodiments, before selecting the region proposals using non-maximum suppression (NMS), the AI processing device 142 may first remove preliminary region proposals that cross boundaries of the image (also referred to as cross-boundary preliminary region proposals).
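Merely for illustration, this selection may be sketched with torchvision's box_iou and nms operators. The 0.7/0.3 IoU thresholds come from the description above; the NMS threshold, the even positive/negative split, and approximating the background score as one minus the foreground score are assumptions.

```python
import torch
from torchvision.ops import box_iou, nms

def sample_proposals(proposals, fg_scores, gt_boxes, num_keep=300):
    """Label proposals by IoU with the ground truth (positive > 0.7,
    negative < 0.3, others discarded), then keep top-scoring boxes after
    non-maximum suppression. Boxes are (x1, y1, x2, y2) tensors."""
    iou = box_iou(proposals, gt_boxes).max(dim=1).values  # best IoU per proposal
    pos_mask, neg_mask = iou > 0.7, iou < 0.3
    positive, pos_scores = proposals[pos_mask], fg_scores[pos_mask]
    negative = proposals[neg_mask]
    neg_scores = 1.0 - fg_scores[neg_mask]   # stand-in for the background score
    keep_pos = nms(positive, pos_scores, iou_threshold=0.7)[: num_keep // 2]
    keep_neg = nms(negative, neg_scores, iou_threshold=0.7)[: num_keep // 2]
    return torch.cat([positive[keep_pos], negative[keep_neg]])
```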
In 540, the AI processing device 142 (e.g., the pooling region proposal determination module 407) may determine a plurality of pooling region proposals based on the plurality of region proposals and the plurality of feature maps. In some embodiments, the AI processing device 142 may map the plurality of region proposals to the plurality of feature maps to determine a plurality of proposal feature maps (also referred to as regions of interest (ROIs)). In some embodiments, the plurality of proposal feature maps (or ROIs) may be inputted into a classifier for further processing. The classifier may only accept proposal feature map(s) with a canonical size (e.g., 7*7). Thus, the AI processing device 142 may resize the plurality of proposal feature maps to the canonical size. The AI processing device 142 may determine the plurality of pooling region proposals by performing pooling on the plurality of proposal feature maps (or ROIs). In some embodiments, the pooling may include a max pooling, a mean pooling, or the like. In some embodiments, the plurality of pooling region proposals may correspond to the canonical size (e.g., 7*7) and may be inputted into the classifier for further processing. For example, a pooling region proposal may be determined as a fixed-length vector, which may be sent into a full connection layer of the classifier.
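Merely for illustration, the ROI pooling of this operation may be sketched with torchvision's roi_pool; the stride-16 spatial scale and the example boxes are assumptions, while the 7*7 canonical size comes from the description.

```python
import torch
from torchvision.ops import roi_pool

feature_maps = torch.randn(1, 256, 40, 60)   # backbone output for one image
# Region proposals in image coordinates, rows of (batch_index, x1, y1, x2, y2).
rois = torch.tensor([[0, 100.0, 150.0, 400.0, 380.0],
                     [0,  50.0,  60.0, 220.0, 300.0]])
# spatial_scale maps image coordinates onto the 40*60 maps; 1/16 assumes a
# backbone stride of 16. output_size is the canonical 7*7 from the text.
pooled = roi_pool(feature_maps, rois, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)                          # torch.Size([2, 256, 7, 7])
```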
In 550, the AI processing device 142 (e.g., the classification module 409) may classify the plurality of pooling region proposals into one or more object categories or a background category via the classifier. In some embodiments, the classifier may include a support vector machine (SVM) classifier, a Bayes classifier, a decision tree classifier, a softmax classifier, or the like, or any combination thereof.
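Merely for illustration, a softmax classifier over the pooling region proposals might look as follows; the layer widths and the number of object categories are assumptions, not taken from the disclosure.

```python
import torch.nn as nn

class ProposalClassifier(nn.Module):
    """Softmax classifier over C object categories plus one background
    class, fed with flattened 7*7 pooling region proposals."""
    def __init__(self, channels=256, num_object_categories=20):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Flatten(),                                # fixed-length vector
            nn.Linear(channels * 7 * 7, 4096), nn.ReLU(),
            nn.Linear(4096, num_object_categories + 1),  # +1 for background
        )

    def forward(self, pooled):                           # (N, channels, 7, 7)
        return self.fc(pooled).softmax(dim=-1)           # per-category probabilities
```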
In some embodiments, one or more pooling region proposals may be classified into the background category. For example, as described in connection with operation 530, the region proposals may include multiple positive samples and multiple negative samples. Similarly, the pooling region proposals may correspond to multiple positive samples and multiple negative samples. In some embodiments, the multiple negative samples in the pooling region proposals may be classified into the background category. If a pooling region proposal is determined as belonging to the background category, the pooling region proposal may be omitted and not further processed.
In some embodiments, a pooling region proposal corresponding to a positive sample may be classified into one of the one or more object categories. The one or more object categories may be default settings of the AI image processing system 100, and/or may be adjusted by a user. The one or more object categories may include a category of the target object. The plurality of pooling region proposals may include one or more pooling region proposals corresponding to the target object. The AI processing device 142 may select the one or more pooling region proposals corresponding to the target object from the plurality of pooling region proposals.
In 560, the AI processing device 142 (e.g., the boundary determination module 411) may determine a target boundary of the target object in the image based on at least one of the one or more pooling region proposals. In some embodiments, the target boundary may be a polygonal box, for example, a quadrilateral box.
In some embodiments, each of the one or more pooling region proposals may have a plurality of corners (e.g., 4 corners, 5 corners, 8 corners, etc.). For a pooling region proposal, the AI processing device 142 may determine a plurality of crop strategies for each corner of the plurality of corners according to a position of the corresponding corner. Merely by way of example, the AI processing device 142 may determine five crop strategies for each of the plurality of corners. In some embodiments, for each corner, the AI processing device 142 may determine one of the plurality of (e.g., five) crop strategies as a desired crop strategy of the corner based on the pooling region proposal. Merely by way of example, the AI processing device 142 may determine a cropping direction and a cropping length for each corner based on the pooling region proposal. The cropping direction of a corner may be limited to one of the plurality of crop strategies of the corner. In some embodiments, the AI processing device 142 may trim the pooling region proposal by cropping each of the plurality of corners according to the desired crop strategy, for example, based on the cropping direction and the cropping length. In some embodiments, the plurality of crop strategies of each corner may include a crop strategy of false and/or a crop strategy of target position. The crop strategy of false of a corner may indicate that the corner may correspond to a point inside the target object. The crop strategy of target position of a corner may indicate that the corner may correspond to a boundary point of the target object. If the cropping direction of a corner corresponds to the crop strategy of false, the AI processing device 142 may stop cropping the corner and abandon the pooling region proposal. If the cropping direction of a corner corresponds to the crop strategy of target position, the AI processing device 142 may stop cropping the corner. When the cropping direction of each corner corresponds to the crop strategy of target position, the AI processing device 142 may identify a boundary to the trimmed pooling region proposal based on the cropped plurality of corners, and map the boundary to the image to determine a boundary of the target object. Details regarding the determination of a boundary may be found elsewhere in the present disclosure (e.g., FIG. 7 and the descriptions thereof).
In some embodiments, the AI processing device 142 may determine one or more boundaries corresponding to the target object based on the one or more pooling region proposals. Each of the one or more boundaries may be determined according to process 700, and the descriptions thereof are not repeated herein. The AI processing device 142 may determine an IoU between each of the one or more boundaries and a ground truth. In some embodiments, the ground truth may indicate a labelled boundary box of the target object. The IoU between a boundary and the ground truth may reflect a degree of overlapping of the boundary and the ground truth. The AI processing device 142 may compare one or more determined IoUs related to the one or more boundaries, and determine a boundary with the greatest IoU as a target boundary corresponding to the target object.
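Merely for illustration, the IoU comparison used to pick the target boundary may be sketched as below, assuming axis-aligned rectangular boxes in (x1, y1, x2, y2) form; an exact IoU between general quadrilateral boundaries would require polygon intersection instead.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def pick_target_boundary(boundaries, ground_truth):
    # The boundary overlapping the ground truth most becomes the target boundary.
    return max(boundaries, key=lambda b: iou(b, ground_truth))
```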
In the present disclosure, for a pooling region proposal, each corner of the pooling region proposal may be cropped according to one of its crop strategies, which takes into account information related to the corresponding corner. In addition, the cropping direction and/or the cropping length of each corner of the pooling region proposal may be determined based on the pooling region proposal, which takes into account features in the pooling region proposal. Thus, a boundary of the target object determined according to the process disclosed in the present disclosure may be more suitable for the target object, especially for a tilted target object, which may improve the accuracy of detecting and/or locating the target object. As disclosed in the present disclosure, for the target object, one or more boundaries may be determined. A boundary with the greatest IoU among the one or more boundaries may be determined as the target boundary. That is, a boundary having the greatest degree of overlapping with the ground truth may be determined as the target boundary, which may further improve the accuracy of detecting and/or locating the target object.
It should be noted that the above description regarding the process 500 is merely provided for the purposes of illustration, and not intended to limit the scope of the present disclosure. For persons having ordinary skills in the art, multiple variations and  modifications may be made under the teachings of the present disclosure. However, those variations and modifications do not depart from the scope of the present disclosure. In some embodiments, in 560, the AI processing device 142 may not need to determine a plurality of boundaries corresponding to the target object. When the AI processing device 142 determines a boundary of the target object, the AI processing device 142 may determine the boundary as a target boundary, and operation 560 may be terminated. In some embodiments, according to process 500, the boundaries of one or more target objects (e.g., all objects) in the image may be determined simultaneously. In some embodiments, process 500 may be repeated to determine boundaries of target objects in a plurality of different images.
FIG. 6 is a schematic diagram illustrating an exemplary region proposal network (RPN) according to some embodiments of the present disclosure. As shown in FIG. 6, the RPN introduces a sliding window. The sliding window is configured to slide over a plurality of feature maps. As shown in FIG. 6, a sliding window coincides with a sub-region of the plurality of feature maps at a certain sliding-window location. The sliding window has a size of 3*3. The sub-region is mapped to a multi-dimensional feature vector, e.g., a 256-dimensional (256-d) feature vector shown in an intermediate layer. In addition, a center pixel O of the sub-region is mapped to a pixel of an image to generate an anchor O’. A set of anchor boxes (e.g., k anchor boxes) are determined based on the anchor O’. Each of the set of anchor boxes is a rectangular box, and the anchor O’ is a center point of the set of anchor boxes. In some embodiments, there may be 3 scales and 3 aspect ratios, and 9 anchor boxes may be determined on the image.
As shown in FIG. 6, the RPN includes a regression layer (denoted as reg layer) and a classification layer (denoted as cls layer). The regression layer may be configured to conduct bounding-box regression to determine a preliminary region proposal corresponding to an anchor box. The classification layer may be configured to determine a category for the preliminary region proposal. As illustrated, the multi-dimensional feature vector (i.e., the 256-d feature vector) and/or the set of anchor boxes (i.e., k anchor boxes) are fed into the regression layer and the classification layer, respectively. The output of the regression layer includes four coordinate values (also referred to as four coordinates) of each of the set of preliminary region proposals. The four coordinate values of a preliminary region proposal may include a location of the preliminary region proposal (e.g., coordinates (x, y) of the anchor of the corresponding anchor box) and a size of the preliminary region proposal (e.g., a width w and a height h of the preliminary region proposal). The output of the classification layer includes two scores of each of the set of preliminary region proposals, including a first score of being foreground and a second score of being background.
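Merely for illustration, the reg and cls layers of FIG. 6 are commonly realized as 1*1 convolutions on top of a 3*3 intermediate layer, which is how the sketch below implements them; realizing the sliding window this way, and the channel width, are assumptions.

```python
import torch.nn as nn

class RPNHead(nn.Module):
    """Per sliding-window location: 2k foreground/background scores from the
    cls layer and 4k coordinate values from the reg layer, for k anchor
    boxes (k = 9 for 3 scales x 3 aspect ratios)."""
    def __init__(self, in_channels=256, k=9):
        super().__init__()
        # 3*3 sliding window producing the 256-d feature vector per location.
        self.intermediate = nn.Conv2d(in_channels, 256, kernel_size=3, padding=1)
        self.cls_layer = nn.Conv2d(256, 2 * k, kernel_size=1)  # two scores per anchor
        self.reg_layer = nn.Conv2d(256, 4 * k, kernel_size=1)  # (x, y, w, h) per anchor

    def forward(self, feature_maps):         # e.g., (N, 256, 40, 60)
        t = self.intermediate(feature_maps).relu()
        return self.cls_layer(t), self.reg_layer(t)
```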
As described above, a set of preliminary region proposals are determined at the certain sliding-window location. With the sliding of the sliding window over the plurality of feature maps, a plurality of preliminary region proposals may be determined at a plurality of sliding-window locations. In some embodiments, the RPN may select a portion of the plurality of preliminary region proposals as region proposals for further processing. More descriptions regarding the selection of the region proposals may be found elsewhere in the present disclosure (e.g., operation 530 of process 500 and the relevant descriptions thereof).
FIG. 7 is a flowchart illustrating an exemplary process 700 for determining a boundary of a target object based on a pooling region proposal according to some embodiments of the present disclosure. For illustration purposes only, the AI processing device 142 may be described as a subject to perform the process 700. However, one of ordinary skill in the art would understand that the process 700 may also be performed by other entities. For example, one of ordinary skill in the art would understand that at least a portion of the process 700 may also be implemented on the computing device 200 as illustrated in FIG. 2 or the mobile device 300 as illustrated in FIG. 3. In some embodiments, one or more operations of process 700 may be implemented in the AI image processing system 100 as illustrated in FIG. 1. In some embodiments, one or more operations in the process 700 may be stored in the storage device 150 and/or the storage (e.g., the ROM 230, the RAM 240, etc.) in the form of instructions, and invoked and/or executed by the server 140 (e.g., the AI processing device 142 in the server 140, or the processor 220 of the AI processing device 142 in the server 140). In some embodiments, the instructions may be transmitted in a form of electronic current or electrical signals. In some embodiments, a portion of operation 560 of the process 500 may be performed according to process 700.
In some embodiments, a pooling region proposal may have a plurality of corners. In 710, the AI processing device 142 (e.g., the boundary determination module 411) may determine a plurality of crop strategies for each of the plurality of corners of the pooling region proposal according to a position of each of the plurality of corners.
In some embodiments, the position of a corner may refer to a position of the corner relative to positions of the other corners. In certain embodiments, the pooling region proposal may be a rectangular box and include four corners. The four corners may include a top left corner, a top right corner, a bottom left corner, and a bottom right corner. The AI processing device 142 may determine a plurality of crop strategies for each of the four corners based on the position of each of the four corners. Specifically, the AI processing device 142 may determine a plurality of crop strategies for the top left corner. The plurality of crop strategies of the top left corner may include cropping to right, cropping to bottom, cropping to bottom right, target position, false, or the like, or any combination thereof. The AI processing device 142 may determine a plurality of crop strategies for the top right corner. The plurality of crop strategies of the top right corner may include cropping to left, cropping to bottom, cropping to bottom left, target position, false, or the like, or any combination thereof. The AI processing device 142 may determine a plurality of crop strategies for the bottom left corner. The plurality of crop strategies of the bottom left corner may include cropping to right, cropping to top, cropping to top right, target position, false, or the like, or any combination thereof. The AI processing device 142 may determine a plurality of crop strategies for the bottom right corner. The plurality of crop strategies of the bottom right corner may include cropping to left, cropping to top, cropping to top left, target position, false, or the like, or any combination thereof. As can be seen from the above, the plurality of crop strategies of each corner may include a crop strategy of false and/or a crop strategy of target position. The crop strategy of false of a corner may indicate that the corner may correspond to a point inside the target object. The crop strategy of target position of a corner may indicate that the corner may correspond to a boundary point of the target object. It should be noted that the crop strategies of each corner and the number of corners are merely provided for illustration purposes, and are not intended to limit the scope of the present disclosure.
In 720, the AI processing device 142 (e.g., the boundary determination module 411) may determine a crop strategy for each of the plurality of corners from the plurality of crop strategies based on the pooling region proposal. In some embodiments, the AI processing device 142 may determine a cropping direction and a cropping length for each corner based on the pooling region proposal. For example, the AI processing device 142 may analyze features of pixels (e.g., pixels representing the target object, pixels representing the background) in the pooling region proposal, and determine the cropping direction and/or the cropping length based on the analysis result. The cropping direction may be limited to one of the plurality of crop strategies. The cropping length may be a length of several pixels, for example, 0-10 pixels.
In 730, the AI processing device 142 (e.g., the boundary determination module 411) may determine whether one of the plurality of corners corresponds to a crop strategy of false. In some embodiments, if the determined cropping direction of a corner corresponds to the crop strategy of false, the corner may correspond to a point inside the target object. That is, the pooling region proposal does not encompass the whole target object. If the determined cropping direction of a corner corresponds to the crop strategy of target position, the corner may correspond to a boundary point of the target object. Otherwise, if the determined cropping direction of a corner corresponds to a crop strategy other than the crop strategy of false and the crop strategy of target position, the corner may correspond to a point located at a distance from the target object. In response to a determination that at least one of the plurality of corners corresponds to the crop strategy of false, the AI processing device 142 may proceed to operation 740. In response to a determination that none of the plurality of corners corresponds to the crop strategy of false, the AI processing device 142 may proceed to operation 750.
In 740, the AI processing device 142 (e.g., the boundary determination module 411) may abandon the pooling region proposal. Since the pooling region proposal does not encompass the whole target object, a boundary of the target object cannot be determined based on the pooling region proposal. Accordingly, the AI processing device 142 may abandon the pooling region proposal.
In 750, the AI processing device 142 (e.g., the boundary determination module 411) may determine whether each of the plurality of corners corresponds to a crop strategy of target position. In response to a determination that at least one of the plurality of corners does not correspond to the crop strategy of target position, the AI processing device 142 may proceed to operation 760.
In 760, the AI processing device 142 (e.g., the boundary determination module 411) may trim the pooling region proposal by cropping the at least one corner according to the determined crop strategy of the at least one corner. That is, if a corner corresponds to neither the crop strategy of target position nor the crop strategy of false, the AI processing device 142 may crop the corner based on the crop strategy of the corner determined in 720. When the at least one corner is cropped according to its crop strategy, a trimmed pooling region proposal may be determined.
Merely by way of example, for a top right corner of the pooling region proposal, if the determined cropping direction corresponds to a crop strategy of cropping to left, the AI processing device 142 may crop the top right corner toward the left to update the position of the top right corner. As another example, for a top left corner of the pooling region proposal, if the determined cropping direction corresponds to a crop strategy of cropping to right, the AI processing device 142 may crop the top left corner toward the right to update the position of the top left corner. As a further example, for a bottom left corner of the pooling region proposal, if the determined cropping direction corresponds to a crop strategy of cropping to top right, the AI processing device 142 may crop the bottom left corner toward the top right to update the position of the bottom left corner.
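Merely for illustration, one way to realize such a per-corner update is to move the corner along a unit direction vector by the cropping length. The Python sketch below assumes image coordinates in which x grows rightward and y grows downward, and assumes that a diagonal strategy moves the corner by the cropping length along each axis; these conventions are assumptions of the sketch, not requirements of the disclosure.

# Direction vectors (dx, dy) for each crop strategy, in image coordinates
# (x grows rightward, y grows downward). Illustrative only.
DIRECTIONS = {
    "left": (-1, 0), "right": (1, 0), "top": (0, -1), "bottom": (0, 1),
    "top_left": (-1, -1), "top_right": (1, -1),
    "bottom_left": (-1, 1), "bottom_right": (1, 1),
}

def crop_corner(corner_xy, strategy, length):
    # Move a corner `length` pixels along the chosen cropping direction;
    # "target" and "false" leave the corner unchanged.
    if strategy in ("target", "false"):
        return corner_xy
    dx, dy = DIRECTIONS[strategy]
    return (corner_xy[0] + dx * length, corner_xy[1] + dy * length)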
In 770, the AI processing device 142 (e.g., the boundary determination module 411) may perform a bounding mapping based on the cropped plurality of corners to determine a rectangular box. In certain embodiments, as described above, the pooling region proposal may be a rectangular box and include four corners. Due to the different crop strategies applied to different corners, the trimmed pooling region proposal may be a quadrilateral box other than a rectangular box. In some embodiments, the crop strategies described above can be used only when the (trimmed) pooling region proposal is a rectangular box. Thus, the AI processing device 142 may perform a bounding mapping on the trimmed pooling region proposal. Specifically, the AI processing device 142 may determine two diagonal lines based on the four corners, and determine the longer diagonal line as a target diagonal line. The AI processing device 142 may determine a rectangular box based on the target diagonal line.
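The bounding mapping itself reduces to comparing the two diagonals of the cropped quadrilateral. The following NumPy sketch follows the description above (keep the longer diagonal and span an axis-aligned rectangle on it); the function name and the (x_min, y_min, x_max, y_max) return convention are assumptions of the sketch.

import numpy as np

def bounding_mapping(corners):
    # corners: [top_left, top_right, bottom_left, bottom_right] as (x, y).
    tl, tr, bl, br = (np.asarray(c, dtype=float) for c in corners)
    d1 = np.linalg.norm(br - tl)                # diagonal TL -> BR
    d2 = np.linalg.norm(bl - tr)                # diagonal TR -> BL
    p, q = (tl, br) if d1 >= d2 else (tr, bl)   # keep the longer diagonal
    x0, x1 = sorted((p[0], q[0]))
    y0, y1 = sorted((p[1], q[1]))
    return (x0, y0, x1, y1)                     # axis-aligned rectangle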
In 780, the AI processing device 142 (e.g., the boundary determination module 411) may resize the rectangular box into a canonical size to determine an updated pooling region proposal. In some embodiments, the AI processing device 142 may resize the rectangular box by performing pooling to determine an updated pooling region proposal. The updated pooling region proposal may have a canonical size and be accepted by the classifier. After the updated pooling region proposal is determined, the AI processing device 142 may proceed to operations 720 through 780 and start a next iteration. Descriptions of the operations 720 through 770 may be found elsewhere in the present disclosure, and are not repeated here. The AI processing device 142 may repeat operations 720 through 780 until each of the plurality of corners corresponds to a crop strategy of target position.
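Taken together, operations 720 through 780 form an iterative trimming loop. The Python sketch below reuses crop_corner and bounding_mapping from the sketches above and takes two hypothetical callables as inputs: predict_strategies, a stand-in for the learned predictor of operation 720, and resize_to_canonical, a stand-in for the pooling-based resize of operation 780. Only the control flow of FIG. 7 is represented here; the sketch is not the disclosed implementation.

def trim_proposal(corners, predict_strategies, resize_to_canonical,
                  max_iters=50):
    # corners: dict mapping "top_left", "top_right", "bottom_left",
    # "bottom_right" to (x, y) positions on the feature maps.
    # predict_strategies(corners) -> dict mapping each corner name to a
    # (strategy, length) pair (hypothetical; operation 720).
    # resize_to_canonical(rect) -> new corners dict (hypothetical; 780).
    for _ in range(max_iters):
        strategies = predict_strategies(corners)
        if any(s == "false" for s, _ in strategies.values()):
            return None                          # operations 730/740: abandon
        if all(s == "target" for s, _ in strategies.values()):
            return corners                       # operations 750/790: boundary
        corners = {name: crop_corner(xy, *strategies[name])
                   for name, xy in corners.items()}          # operation 760
        rect = bounding_mapping([corners[k] for k in
                                 ("top_left", "top_right",
                                  "bottom_left", "bottom_right")])  # 770
        corners = resize_to_canonical(rect)      # operation 780, next pass
    return corners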
In 730, in response to a determination that none of the plurality of corners corresponds to the crop strategy of false, the AI processing device 142 may proceed to operation 750. In 750, the AI processing device 142 may determine whether each of the plurality of corners corresponds to the crop strategy of target position. In response to a determination that each of the plurality of corners corresponds to the crop strategy of target position, the AI processing device 142 may stop cropping the plurality of corners and proceed to operation 790.
In 790, the AI processing device 142 (e.g., the boundary determination module 411) may identify a boundary to the trimmed pooling region proposal based on the cropped plurality of corners. In certain embodiments, the (trimmed) pooling region proposal may include four corners. The AI processing device 142 may connect the four corners to determine a boundary on the feature map(s).
In 795, the AI processing device 142 (e.g., the boundary determination module 411) may map the boundary to the image to determine a boundary of the target object. In some embodiments, the boundary of the target object may be a quadrilateral box.
In the present disclosure, for each corner of the pooling region proposal, a plurality of crop strategies may be determined based on a position of the corresponding corner. Furthermore, the cropping direction and/or the cropping length of each corner may be determined based on features of pixels in the pooling region proposal. That is, to crop a corner, the position of the corner and/or features of the pooling region proposal are taken into account. Thus, a boundary of the target object determined according to the process disclosed in the present disclosure may fit the target object more closely, especially for a tilted target object, which may improve the accuracy of detecting and/or locating the target object. For example, as shown in FIG. 9C, for a tilted target object (e.g., tilted characters), the present disclosure may provide a suitable boundary for the tilted target object, and further improve the accuracy of detecting and/or locating the tilted target object.
It should be noted that the above description regarding the process 700 is merely provided for the purposes of illustration, and not intended to limit the scope of the present disclosure. For persons having ordinary skills in the art, multiple variations and modifications may be made under the teachings of the present disclosure. However, those variations and modifications do not depart from the scope of the present disclosure. For example, operations 730 and 750 may be performed simultaneously. As another example, operation 750 may be performed before operation 730. In some embodiments, the AI processing device 142 may repeat the process 700 to determine one or more boundaries corresponding to the target object.
FIG. 8 is a schematic diagram illustrating an exemplary process for determining a boundary of a target object according to some embodiments of the present disclosure. As shown in FIG. 8, an image is inputted into a convolutional neural network (CNN). In some embodiments, the image may include one or more target objects (e.g., objects to be detected). In some embodiments, the CNN may be generated based on a ZF model, VGG-16, ResNet-50, etc. In some embodiments, the CNN may include one or more convolution layers and one or more pooling layers, and be without a full connection layer. By inputting the image into the CNN, a plurality of feature maps may be generated. The plurality of feature maps may include feature information of the image. Details regarding the generation of the feature maps may be found elsewhere in the present disclosure (e.g., operations 510 and 520, and the descriptions thereof).
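Merely for illustration, a fully convolutional backbone of this kind can be sketched in a few lines of PyTorch. The layer widths below are arbitrary choices for the sketch and do not correspond to the ZF, VGG-16, or ResNet-50 configurations; the point is that only convolution and pooling layers appear, with no full connection layer, so spatial feature maps are produced for inputs of any size.

import torch.nn as nn

# Convolution and pooling layers only; no full connection layer.
backbone = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.MaxPool2d(2),
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.MaxPool2d(2),
    nn.Conv2d(128, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
)
# feature_maps = backbone(image_tensor)  # shape (N, 256, H/4, W/4)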
As shown in FIG. 8, the plurality of feature maps may be inputted into a region proposal network (RPN). In the RPN, a sliding window may slide over the plurality of feature maps. With the sliding of the sliding window over the plurality of feature maps, a plurality of sliding-window locations may be determined. At each sliding-window location, a multi-dimensional feature vector (e.g., a 256-dimensional feature vector) may be generated and/or an anchor in the image may be determined. The anchor may correspond to a set of anchor boxes, each of which may be associated with a scale and an aspect ratio. As shown in FIG. 8, the RPN includes at least one regression layer and at least one classification layer. The multi-dimensional feature vector and/or the set of anchor boxes are fed into the at least one regression layer and the at least one classification layer. The output of the at least one regression layer may include four coordinate values of each of a set of preliminary region proposals. The output of the at least one classification layer may include a first score of being foreground and a second score of being background of each of the set of preliminary region proposals. Similarly, across the plurality of sliding-window locations, a plurality of preliminary region proposals may be determined. In some embodiments, a portion of the plurality of preliminary region proposals may be selected as a plurality of region proposals. The plurality of region proposals may include positive samples (e.g., foreground) and negative samples (e.g., background). The plurality of region proposals may be further processed. Details regarding the determination of the region proposals may be found elsewhere in the present disclosure (e.g., operation 530 of the process 500).
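The anchor boxes associated with one anchor can be generated directly from the scales and aspect ratios. The NumPy sketch below uses the scale and ratio values common in Faster R-CNN implementations purely as an example; the disclosure does not fix particular values.

import numpy as np

def make_anchor_boxes(cx, cy, scales=(128, 256, 512),
                      ratios=(0.5, 1.0, 2.0)):
    # Each anchor box has area scale**2 and aspect ratio w/h = ratio,
    # centered on the anchor (cx, cy) in image coordinates.
    boxes = []
    for s in scales:
        for r in ratios:
            w = s * np.sqrt(r)
            h = s / np.sqrt(r)
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return np.array(boxes)  # shape: (len(scales) * len(ratios), 4)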
As shown in FIG. 8, an ROI pooling operation is performed based on the plurality of feature maps and the plurality of region proposals. Specifically, the plurality of region proposals may be mapped to the plurality of feature maps to determine a plurality of proposal feature maps (also referred to as ROIs). The plurality of ROIs may be resized to a canonical size (e.g., 7×7) by performing pooling on the plurality of ROIs. Then a plurality of pooling region proposals may be determined. The plurality of pooling region proposals may be fed into a classifier for further processing.
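Resizing an arbitrarily sized ROI to the canonical grid is what makes the pooling region proposals acceptable to the classifier. The NumPy sketch below is a simplified version of ROI max pooling: it divides the ROI into an out_size × out_size grid of cells and keeps the maximum of each cell. It assumes the ROI spans at least out_size pixels in each dimension; production implementations also handle smaller ROIs and fractional coordinates.

import numpy as np

def roi_pool(feature_map, roi, out_size=7):
    # feature_map: array of shape (C, H, W); roi: (x0, y0, x1, y1) in
    # feature-map coordinates.
    x0, y0, x1, y1 = (int(round(v)) for v in roi)
    region = feature_map[:, y0:y1 + 1, x0:x1 + 1]
    c, h, w = region.shape
    ys = np.linspace(0, h, out_size + 1, dtype=int)   # cell row boundaries
    xs = np.linspace(0, w, out_size + 1, dtype=int)   # cell column boundaries
    out = np.zeros((c, out_size, out_size), dtype=feature_map.dtype)
    for i in range(out_size):
        for j in range(out_size):
            cell = region[:, ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
            out[:, i, j] = cell.max(axis=(1, 2))      # max over each cell
    return out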
The plurality of pooling region proposals may be classified into one or more object categories (e.g., K categories) or a background category via the classifier. If a pooling region proposal is classified into the background category, the pooling region proposal may be omitted and/or removed. The plurality of pooling region proposals may include one or more pooling region proposals corresponding to a target object. For a pooling region proposal, a boundary of the target object in the image may be determined based on the pooling region proposal. To determine the boundary of the target object, the pooling region proposal may be trimmed one or more times. In some embodiments, the pooling region proposal may include a plurality of corners. As shown in FIG. 8, the pooling region proposal includes four corners, that is, a top left (TL) corner, a top right (TR) corner, a bottom left (BL) corner, and a bottom right (BR) corner. Each of the four corners corresponds to five crop strategies. Specifically, the five strategies of the top left corner include cropping to right (→), cropping to bottom right (↘), cropping to bottom (↓), target position (T), and false (F). The five strategies of the top right corner include cropping to left (←), cropping to bottom left (↙), cropping to bottom (↓), target position (T), and false (F). The five strategies of the bottom left corner include cropping to right (→), cropping to top right (↗), cropping to top (↑), target position (T), and false (F). The five strategies of the bottom right corner include cropping to left (←), cropping to top left (↖), cropping to top (↑), target position (T), and false (F). A crop strategy of each of the four corners may be determined based on features of pixels in the pooling region proposal. Whether one of the four corners corresponds to a crop strategy of false may then be determined. If at least one of the four corners corresponds to the crop strategy of false, it may be determined that the pooling region proposal does not encompass the whole target object, and the pooling region proposal may be abandoned and/or rejected. If none of the four corners corresponds to the crop strategy of false, whether each of the four corners corresponds to a crop strategy of target position may be determined. If a corner does not correspond to the crop strategy of target position, the corner may be cropped based on the determined crop strategy. When each such corner is cropped according to its crop strategy, a trimmed pooling region proposal is determined. A bounding mapping may be performed based on the cropped four corners to determine a rectangular box. The rectangular box is resized into a canonical size to determine an updated pooling region proposal. The updated pooling region proposal may be further trimmed, and a next iteration may be performed. When each of the four corners corresponds to the crop strategy of target position, the four corners are no longer cropped. A boundary of the (trimmed) pooling region proposal may be identified. The boundary may be mapped to the image to determine a boundary of the target object. More descriptions of the determination of the boundary of the target object may be found elsewhere in the present disclosure (e.g., FIG. 7 and the descriptions thereof).
FIGs. 9A to 9C are schematic diagrams illustrating an image according to some embodiments of the present disclosure. As shown in FIGs. 9A to 9C, the image may include a target object, i.e., the tilted characters “CHANEL”. FIG. 9B shows a boundary 902 (also referred to as a bounding box) of “CHANEL”, which is determined according to a Faster-RCNN algorithm. FIG. 9C shows a boundary 904 of “CHANEL”, which is determined according to the process described in the present disclosure. The boundary 902 includes more background than the target object, and thus cannot accurately locate the target object. The boundary 904 includes less background, which may enable an accurate locating of the target object. Thus, the process described in the present disclosure can improve the accuracy of object detection.
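The comparison in FIGs. 9B and 9C can be quantified with the intersection-over-union (IoU) measure used elsewhere in the present disclosure to select a target boundary against a ground truth. A minimal Python sketch for axis-aligned boxes follows; comparing quadrilateral boundaries such as the boundary 904 would require polygon intersection instead, which this sketch does not implement.

def iou(box_a, box_b):
    # Boxes given as (x0, y0, x1, y1) with x0 < x1 and y0 < y1.
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0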
Having thus described the basic concepts, it may be rather apparent to those skilled in the art after reading this detailed disclosure that the foregoing detailed disclosure is intended to be presented by way of example only and is not limiting. Various alterations, improvements, and modifications may occur to those skilled in the art, though not expressly stated herein. These alterations, improvements, and modifications are intended to be suggested by this disclosure, and are within the spirit and scope of the exemplary embodiments of this disclosure.
Moreover, certain terminology has been used to describe embodiments of the present disclosure. For example, the terms “one embodiment,” “an embodiment,” and/or “some embodiments” mean that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. Therefore, it is emphasized and should be appreciated that two or more references to “an embodiment” or “one embodiment” or “an alternative embodiment” in various portions of this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined as suitable in one or more embodiments of the present disclosure.
Further, it will be appreciated by one skilled in the art that aspects of the present disclosure may be illustrated and described herein in any of a number of patentable classes or contexts, including any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof. Accordingly, aspects of the present disclosure may be implemented entirely in hardware, entirely in software (including firmware, resident software, micro-code, etc.), or in an implementation combining software and hardware that may all generally be referred to herein as a “unit,” “module,” or “system.” Furthermore, aspects of the present disclosure may take the form of a computer program product embodied in one or more computer readable media having computer readable program code embodied thereon.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including electro-magnetic, optical, or the like, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that may communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable signal medium may be transmitted using any appropriate medium, including wireless, wireline, optical fiber cable, RF, or the like, or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET, Python, or the like, conventional procedural programming languages, such as the "C" programming language, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP, ABAP, dynamic programming languages such as Ruby and Groovy, or other programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider), or in a cloud computing environment, or offered as a service such as Software as a Service (SaaS).
Furthermore, the recited order of processing elements or sequences, or the use of numbers, letters, or other designations therefor, is not intended to limit the claimed processes and methods to any order except as may be specified in the claims. Although the above disclosure discusses through various examples what is currently considered to be a variety of useful embodiments of the disclosure, it is to be understood that such detail is solely for that purpose, and that the appended claims are not limited to the disclosed embodiments, but, on the contrary, are intended to cover modifications and equivalent arrangements that are within the spirit and scope of the disclosed embodiments. For example, although the implementation of various components described above may be embodied in a hardware device, it may also be implemented as a software-only solution, e.g., an installation on an existing server or mobile device.
Similarly, it should be appreciated that in the foregoing description of embodiments of the present disclosure, various features are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various embodiments. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed subject matter requires more features than are expressly recited in each claim. Rather, claimed subject matter may lie in less than all features of a single foregoing disclosed embodiment.

Claims (28)

  1. An artificial intelligent image processing system for object detection, comprising:
    at least one storage device including a set of instructions for determining a boundary corresponding to an object in an image;
    at least one processor in communication with the at least one storage device, wherein when executing the set of instructions, the at least one processor is directed to:
    obtain an image including a target object;
    generate a plurality of feature maps by inputting the image into a convolutional neural network (CNN) ;
    determine a plurality of region proposals based on the plurality of feature maps;
    determine a plurality of pooling region proposals based on the plurality of region proposals and the plurality of feature maps;
    classify the plurality of pooling region proposals into one or more object categories or a background category via a classifier, the one or more object categories including a category of the target object, the plurality of pooling region proposals including one or more pooling region proposals corresponding to the target object, each of the one or more pooling region proposals having a plurality of corners; and
    for each of the one or more pooling region proposals corresponding to the target object,
    determine a plurality of crop strategies for each corner of the plurality of corners of the pooling region proposal according to a position of the corresponding corner;
    trim the pooling region proposal by cropping each of the plurality of corners according to one of the plurality of crop strategies;
    identify a boundary to the trimmed pooling region proposal based on the cropped plurality of corners; and
    map the boundary to the image to determine a boundary of the target object.
  2. The artificial intelligent image processing system of claim 1, wherein the CNN includes one or more convolution layers and one or more pooling layers and is without a full connection layer.
  3. The artificial intelligent image processing system of claim 1, wherein the plurality of region proposals is determined according to a region proposal network (RPN) .
  4. The artificial intelligent image processing system of claim 3, wherein the RPN includes at least one regression layer and at least one classification layer, and to determine the plurality of region proposals, the at least one processor is directed to:
    slide a sliding window over the plurality of feature maps;
    at each sliding-window location, the sliding window coinciding with a sub-region of the plurality of feature maps,
    map the sub-region of the plurality of feature maps to a multi-dimensional feature vector;
    generate an anchor by mapping a center pixel of the sub-region to a pixel of the image, the anchor corresponding to a set of anchor boxes in the image, each of the set of anchor boxes being associated with a scale and an aspect ratio;
    feed the multi-dimensional feature vector into the at least one regression layer and the at least one classification layer, respectively, wherein
    the at least one regression layer is configured to conduct bounding-box regression to determine a set of preliminary region proposals corresponding to the set of anchor boxes, the output of the at least one regression layer including four coordinate values of each of the set of preliminary region proposals, and
    the at least one classification layer is configured to determine a category for each of the set of preliminary region proposals, the category being a foreground or a background, the output of the at least one classification layer including a first score of being foreground and a second score of being background of each of the set of preliminary region proposals; and
    select, based on the first score of being foreground and the second score of being background of each of a plurality of preliminary region proposals and four coordinate values of each of the plurality of preliminary region proposals, a portion of the plurality of preliminary region proposals as the plurality of region proposals.
  5. The artificial intelligent image processing system of claim 4, wherein to select a portion of the plurality of preliminary region proposals as the plurality of region proposals, the at least one processor is directed to:
    select the plurality of region proposals using a non-maximum suppression (NMS) .
  6. The artificial intelligent image processing system of any one of claims 1 to 5, wherein the plurality of pooling region proposals corresponds to a canonical size, and to determine the plurality of pooling region proposals, the at least one processor is further directed to:
    map the plurality of region proposals to the plurality of feature maps to determine a plurality of proposal feature maps; and
    determine the plurality of pooling region proposals by performing pooling on the plurality of proposal feature maps.
  7. The artificial intelligent image processing system of claim 1, wherein the plurality of corners includes a top left corner, a top right corner, a bottom left corner, and a bottom right corner, wherein
    the plurality of crop strategies of the top left corner includes at least one of cropping to right, cropping to bottom, cropping to bottom right, target position, or false;
    the plurality of crop strategies of the top right corner includes at least one of cropping to left, cropping to bottom, cropping to bottom left, target position, or false;
    the plurality of crop strategies of the bottom left corner includes at least one of cropping to right, cropping to top, cropping to top right, target position, or false; and
    the plurality of crop strategies of the bottom right corner includes at least one of cropping to left, cropping to top, cropping to top left, target position, or false.
  8. The artificial intelligent image processing system of claim 7, wherein the at least one processor is further directed to:
    stop cropping one of the plurality of corners when the corner corresponds to a crop strategy of target position.
  9. The artificial intelligent image processing system of claim 7, wherein to crop each of the plurality of corners, the at least one processor is directed to:
    determine a cropping direction and a cropping length for each of the plurality of corners based on the pooling region proposal, wherein the cropping direction of each of the plurality of corners is limited to one of the plurality of crop strategies of the corresponding corner; and
    crop each of the plurality of corners based on the cropping direction and the cropping length.
  10. The artificial intelligent image processing system of claim 7, wherein to trim the pooling region proposal by cropping each of the plurality of corners, the at least one processor is directed to:
    perform one or more iterations;
    in each of the one or more iterations,
    determine, from the plurality of crop strategies, a crop strategy for each of the plurality of corners based on the pooling region proposal;
    determine whether one of the plurality of corners corresponds to a crop strategy of false;
    determine whether each of the plurality of corners corresponds to a crop strategy of target position in response to a determination that each of the plurality of corners does not correspond to the crop strategy of false;
    in response to a determination that at least one of the plurality of corners does not correspond to the crop strategy of target position, crop the at least one of the plurality of corners according to the determined crop strategy of the at least one of the plurality of corners;
    perform, based on the cropped plurality of corners, a bounding mapping to determine a rectangular box; and
    resize the rectangular box into a canonical size; and
    stop cropping the plurality of corners in response to a determination that each of the plurality of corners corresponds to the crop strategy of target position.
  11. The artificial intelligent image processing system of claim 10, wherein the at least one processor is further directed to:
    abandon the pooling region proposal in response to a determination that at least one of the plurality of corners corresponds to the crop strategy of false.
  12. The artificial intelligent image processing system of any one of claims 1 to 11, wherein the at least one processor is further directed to:
    determine one or more boundaries corresponding to the target object;
    determine an intersection-over-union (IoU) between each of the one or more boundaries and a ground truth; and
    determine one of the one or more boundaries with the greatest IoU as a target boundary corresponding to the target object.
  13. The artificial intelligent image processing system of any one of claims 1 to 12, wherein the boundary of the target object is a quadrilateral box.
  14. An artificial intelligent image processing method implemented on a computing device having at least one processor, at least one computer-readable storage medium, and a communication platform connected to a network, comprising:
    obtaining an image including a target object;
    generating a plurality of feature maps by inputting the image into a convolutional neural network (CNN) ;
    determining a plurality of region proposals based on the plurality of feature maps;
    determining a plurality of pooling region proposals based on the plurality of region proposals and the plurality of feature maps;
    classifying the plurality of pooling region proposals into one or more object categories or a background category via a classifier, the one or more object categories including a category of the target object, the plurality of pooling region proposals including one or more pooling region proposals corresponding to the target object, each of the one or more pooling region proposals having a plurality of corners; and
    for each of the one or more pooling region proposals corresponding to the target object,
    determining a plurality of crop strategies for each corner of the plurality of corners of the pooling region proposal according to a position of the corresponding corner;
    trimming the pooling region proposal by cropping each of the plurality of corners according to one of the plurality of crop strategies;
    identifying a boundary to the trimmed pooling region proposal based on the cropped plurality of corners; and
    mapping the boundary to the image to determine a boundary of the target object.
  15. The artificial intelligent image processing method of claim 14, wherein the CNN includes one or more convolution layers and one or more pooling layers and is without a full connection layer.
  16. The artificial intelligent image processing method of claim 14, wherein the plurality of  region proposals is determined according to a region proposal network (RPN) .
  17. The artificial intelligent image processing method of claim 16, wherein the RPN includes at least one regression layer and at least one classification layer, and determining the plurality of region proposals further comprises:
    sliding a sliding window over the plurality of feature maps;
    at each sliding-window location, the sliding window coinciding with a sub-region of the plurality of feature maps,
    mapping the sub-region of the plurality of feature maps to a multi-dimensional feature vector;
    generating an anchor by mapping a center pixel of the sub-region to a pixel of the image, the anchor corresponding to a set of anchor boxes in the image, each of the set of anchor boxes being associated with a scale and an aspect ratio;
    feeding the multi-dimensional feature vector into the at least one regression layer and the at least one classification layer, respectively, wherein
    the at least one regression layer is configured to conduct bounding-box regression to determine a set of preliminary region proposals corresponding to the set of anchor boxes, the output of the at least one regression layer including four coordinate values of each of the set of preliminary region proposals, and
    the at least one classification layer is configured to determine a category for each of the set of preliminary region proposals, the category being a foreground or a background, the output of the at least one classification layer including a first score of being foreground and a second score of being background of each of the set of preliminary region proposals; and
    selecting, based on the first score of being foreground and the second score of being background of each of a plurality of preliminary region proposals and four coordinate values of each of the plurality of preliminary region proposals, a portion of the plurality of preliminary region proposals as the plurality of region proposals.
  18. The artificial intelligent image processing method of claim 17, wherein selecting a portion of the plurality of preliminary region proposals as the plurality of region proposals comprises:
    selecting the plurality of region proposals using a non-maximum suppression (NMS) .
  19. The artificial intelligent image processing method of any one of claims 14 to 18, wherein the plurality of pooling region proposals corresponds to a canonical size, and determining the plurality of pooling region proposals further comprises:
    mapping the plurality of region proposals to the plurality of feature maps to determine a plurality of proposal feature maps; and
    determining the plurality of pooling region proposals by performing pooling on the plurality of proposal feature maps.
  20. The artificial intelligent image processing method of claim 14, wherein the plurality of corners includes a top left corner, a top right corner, a bottom left corner, and a bottom right corner, wherein
    the plurality of crop strategies of the top left corner includes at least one of cropping to right, cropping to bottom, cropping to bottom right, target position, or false;
    the plurality of crop strategies of the top right corner includes at least one of cropping to left, cropping to bottom, cropping to bottom left, target position, or false;
    the plurality of crop strategies of the bottom left corner includes at least one of cropping to right, cropping to top, cropping to top right, target position, or false; and
    the plurality of crop strategies of the bottom right corner includes at least one of cropping to left, cropping to top, cropping to top left, target position, or false.
  21. The artificial intelligent image processing method of claim 20, further comprising:
    stopping cropping of one of the plurality of corners when the corner corresponds to a crop strategy of target position.
  22. The artificial intelligent image processing method of claim 20, wherein cropping each of the plurality of corners comprises:
    determining a cropping direction and a cropping length for each of the plurality of corners based on the pooling region proposal, wherein the cropping direction of each of the plurality of corners is limited to one of the plurality of crop strategies of the corresponding corner; and
    cropping each of the plurality of corners based on the cropping direction and the cropping length.
  23. The artificial intelligent image processing method of claim 20, wherein trimming the pooling region proposal by cropping each of the plurality of corners comprises:
    performing one or more iterations;
    in each of the one or more iterations,
    determining, from the plurality of crop strategies, a crop strategy for each of the plurality of corners based on the pooling region proposal;
    determining whether one of the plurality of corners corresponds to a crop strategy of false;
    determining whether each of the plurality of corners corresponds to a crop strategy of target position in response to a determination that each of the plurality of corners does not correspond to the crop strategy of false;
    in response to a determination that at least one of the plurality of corners does not correspond to the crop strategy of target position, cropping the at least one of the plurality of corners according to the determined crop strategy of the at least one of the plurality of corners;
    performing, based on the cropped plurality of corners, a bounding mapping to determine a rectangular box; and
    resizing the rectangular box into a canonical size; and
    stopping cropping of the plurality of corners in response to a determination that each of the plurality of corners corresponds to the crop strategy of target position.
  24. The artificial intelligent image processing method of claim 23, further comprising:
    abandoning the pooling region proposal in response to a determination that at least one of the plurality of corners corresponds to the crop strategy of false.
  25. The artificial intelligent image processing method of any one of claims 14 to 24, further comprising:
    determining one or more boundaries corresponding to the target object;
    determining an intersection-over-union (IoU) between each of the one or more boundaries and a ground truth; and
    determining one of the one or more boundaries with the greatest IoU as a target boundary corresponding to the target object.
  26. The artificial intelligent image processing method of any one of claims 14 to 25, wherein the boundary of the target object is a quadrilateral box.
  27. A non-transitory computer-readable storage medium, comprising at least one set of instructions for artificial intelligent object detection, wherein when executed by at least one processor of a computing device, the at least one set of instructions directs the at least one processor to perform acts of:
    obtaining an image including a target object;
    generating a plurality of feature maps by inputting the image into a convolutional neural network (CNN) ;
    determining a plurality of region proposals based on the plurality of feature maps;
    determining a plurality of pooling region proposals based on the plurality of region proposals and the plurality of feature maps;
    classifying the plurality of pooling region proposals into one or more object categories or a background category via a classifier, the one or more object categories including a category of the target object, the plurality of pooling region proposals including one or more pooling region proposals corresponding to the target object, each of the one or more pooling region proposals having a plurality of corners; and
    for each of the one or more pooling region proposals corresponding to the target object,
    determining a plurality of crop strategies for each corner of the plurality of corners of the pooling region proposal according to a position of the corresponding corner;
    trimming the pooling region proposal by cropping each of the plurality of corners according to one of the plurality of crop strategies;
    identifying a boundary to the trimmed pooling region proposal based on the cropped plurality of corners; and
    mapping the boundary to the image to determine a boundary of the target object.
  28. An artificial intelligent image processing system for object detection, comprising:
    an acquisition module configured to obtain an image including a target object;
    a feature map determination module configured to generate a plurality of feature maps by inputting the image into a convolutional neural network (CNN) ;
    a region proposal determination module configured to determine a plurality of region proposals based on the plurality of feature maps;
    a pooling region proposal determination module configured to determine a plurality of pooling region proposals based on the plurality of region proposals and the plurality of feature maps;
    a classification module configured to classify the plurality of pooling region proposals into one or more object categories or a background category via a classifier, the one or more object categories including a category of the target object, the plurality of pooling region proposals including one or more pooling region proposals corresponding to the target object, each of the one or more pooling region proposals having a plurality of corners; and
    for each of the one or more pooling region proposals corresponding to the target object,
    a boundary determination module configured to determine a plurality of crop strategies for each corner of the plurality of corners of the pooling region proposal according to a position of the corresponding corner;
    the boundary determination module configured to trim the pooling region proposal by cropping each of the plurality of corners according to one of the plurality of crop strategies;
    the boundary determination module configured to identify a boundary to the trimmed pooling region proposal based on the cropped plurality of corners; and
    the boundary determination module configured to map the boundary to the image to determine a boundary of the target object.