WO2020107510A1 - AI systems and methods for object detection - Google Patents

AI systems and methods for object detection Download PDF

Info

Publication number
WO2020107510A1
Authority
WO
WIPO (PCT)
Prior art keywords
corners
crop
pooling
cropping
region proposals
Prior art date
Application number
PCT/CN2018/119410
Other languages
French (fr)
Inventor
Yuan Zhao
Ying Xin
Original Assignee
Beijing Didi Infinity Technology And Development Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Didi Infinity Technology And Development Co., Ltd. filed Critical Beijing Didi Infinity Technology And Development Co., Ltd.
Priority to JP2020557230A priority Critical patent/JP7009652B2/en
Publication of WO2020107510A1 publication Critical patent/WO2020107510A1/en

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/24 Aligning, centring, orientation detection or correction of the image
    • G06V 10/243 Aligning, centring, orientation detection or correction of the image by compensating for image skew or non-uniform image deformations
    • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 10/443 Local feature extraction by matching or filtering
    • G06V 10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V 10/451 Biologically inspired filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V 10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using neural networks
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V 20/47 Detecting features for summarising video content
    • G06V 20/48 Matching video sequences
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/172 Classification, e.g. identification

Definitions

  • the present disclosure generally relates to systems and methods for image processing, and in particular, systems and methods for detecting objects in an image.
  • the artificial intelligent object detection techniques can identify and/or classify an object in an image, and locate the object in the image by drawing a bounding box.
  • the bounding box may generally be a rectangular box. For a tilted object, the bounding box (e.g., a rectangular box) may include a large amount of background, and may even include more background than the object itself, so that the object cannot be located accurately.
  • an artificial intelligent image processing system for object detection may include at least one storage device and at least one processor in communication with the at least one storage device.
  • the at least one storage device may include a set of instructions for determining a boundary corresponding to an object in an image.
  • the at least one processor may be directed to obtain an image including a target object, and generate a plurality of feature maps by inputting the image into a convolutional neural network (CNN) .
  • the at least one processor may also be directed to determine a plurality of region proposals based on the plurality of feature maps, and determine a plurality of pooling region proposals based on the plurality of region proposals and the plurality of feature maps.
  • the at least one processor may be further directed to classify the plurality of pooling region proposals into one or more object categories or a background category via a classifier.
  • the one or more object categories may include a category of the target object
  • the plurality of pooling region proposals may include one or more pooling region proposals corresponding to the target object.
  • Each of the one or more pooling region proposals may have a plurality of corners.
  • the at least one processor may be directed to determine a plurality of crop strategies for each corner of the plurality of corners of the pooling region proposal according to a position of the corresponding corner; trim the pooling region proposal by cropping each of the plurality of corners according to one of the plurality of crop strategies; identify a boundary to the trimmed pooling region proposal based on the cropped plurality of corners; and map the boundary to the image to determine a boundary of the target object.
  • the CNN may include one or more convolution layers and one or more pooling layers, without a fully connected layer.
  • the plurality of region proposals may be determined according to a region proposal network (RPN) .
  • the RPN may include at least one regression layer and at least one classification layer.
  • the at least one processor may be directed to slide a sliding window over the plurality of feature maps. At each sliding-window location, the sliding window may coincide with a sub-region of the plurality of feature maps.
  • the at least one processor may be directed to map the sub-region of the plurality of feature maps to a multi-dimensional feature vector, and generate an anchor by mapping a center pixel of the sub-region to a pixel of the image.
  • the anchor may correspond to a set of anchor boxes in the image, and each of the set of anchor boxes may be associated with a scale and an aspect ratio.
  • the at least one processor may also be directed to feed the multi-dimensional feature vector into the at least one regression layer and the at least one classification layer, respectively.
  • the at least one regression layer may be configured to conduct bounding-box regression to determine a set of preliminary region proposals corresponding to the set of anchor boxes, and the output of the at least one regression layer may include four coordinate values of each of the set of preliminary region proposals.
  • the at least one classification layer may be configured to determine a category for each of the set of preliminary region proposals.
  • the category may be a foreground or a background, and the output of the at least one classification layer may include a first score of being foreground and a second score of being background of each of the set of preliminary region proposals.
  • the at least one processor may be further directed to select a portion of the plurality of preliminary region proposals as the plurality of region proposals based on the first score of being foreground, the second score of being background, and the four coordinate values of each of the plurality of preliminary region proposals.
  • the at least one processor may be directed to select the plurality of region proposals using a non-maximum suppression (NMS) .
  • the plurality of pooling region proposals may correspond to a canonical size.
  • the at least one processor may be further directed to map the plurality of region proposals to the plurality of feature maps to determine a plurality of proposal feature maps, and determine the plurality of pooling region proposals by performing pooling on the plurality of proposal feature maps.
  • the plurality of corners may include a top left corner, a top right corner, a bottom left corner, and a bottom right corner.
  • the plurality of crop strategies of the top left corner may include at least one of cropping to right, cropping to bottom, cropping to bottom right, target position, or false.
  • the plurality of crop strategies of the top right corner may include at least one of cropping to left, cropping to bottom, cropping to bottom left, target position, or false.
  • the plurality of crop strategies of the bottom left corner may include at least one of cropping to right, cropping to top, cropping to top right, target position, or false.
  • the plurality of crop strategies of the bottom right corner may include at least one of cropping to left, cropping to top, cropping to top left, target position, or false.
  • the at least one processor may be further directed to stop cropping one of the plurality of corners when the corner corresponds to a crop strategy of target position.
  • the at least one processor may be directed to determine a cropping direction and a cropping length for each of the plurality of corners based on the pooling region proposal.
  • the cropping direction of each of the plurality of corners may be limited to one of the plurality of crop strategies of the corresponding corner.
  • the at least one processor may also be directed to crop each of the plurality of corners based on the cropping direction and the cropping length.
  • the at least one processor may be directed to perform one or more iterations. In each of the one or more iterations, the at least one processor may be directed to determine a crop strategy for each of the plurality of corners based on the pooling region proposal from the plurality of crop strategies; determine whether one of the plurality of corners corresponds to a crop strategy of false; determine whether each of the plurality of corners corresponds to a crop strategy of target position in response to a determination that each of the plurality of corners does not correspond to the crop strategy of false; in response to a determination that at least one of the plurality of corners does not correspond to the crop strategy of target position, crop the at least one of the plurality of corners according to the determined crop strategy of the at least one of the plurality of corners; perform, based on the cropped plurality of corners, a bounding mapping to determine a rectangular box; and resize the rectangular box into a canonical size.
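The iteration just described can be summarized in the following Python control-flow sketch. Every callable is a hypothetical stand-in (the patent does not specify how crop strategies are predicted or how cropping is applied); only the loop structure mirrors the text above.

```python
CORNERS = ("top_left", "top_right", "bottom_left", "bottom_right")

def trim_pooling_region_proposal(proposal, predict_strategies, crop_corner,
                                 bounding_mapping, resize_to_canonical,
                                 max_iters=10):
    """Iteratively crop the four corners of a pooling region proposal.

    predict_strategies(proposal) returns a dict mapping each corner to one of
    its allowed crop strategies (e.g. 'crop_to_right'), 'target_position', or
    'false'. All callables here are hypothetical stand-ins.
    """
    for _ in range(max_iters):
        strategies = predict_strategies(proposal)
        if any(s == "false" for s in strategies.values()):
            return None                     # abandon the pooling region proposal
        if all(s == "target_position" for s in strategies.values()):
            return proposal                 # every corner is in place; stop cropping
        for corner in CORNERS:
            if strategies[corner] != "target_position":
                proposal = crop_corner(proposal, corner, strategies[corner])
        proposal = bounding_mapping(proposal)     # smallest enclosing rectangular box
        proposal = resize_to_canonical(proposal)  # e.g. back to the canonical size
    return proposal
```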
  • the at least one processor may be further directed to abandon the pooling region proposal in response to a determination that at least one of the plurality of corners corresponds to the crop strategy of false.
  • the at least one processor may be further directed to determine one or more boundaries corresponding to the target object; determine an intersection-over-union (IoU) between each of the one or more boundaries and a ground truth; and determine one of the one or more boundaries with the greatest IoU as a target boundary corresponding to the target object.
  • the boundary of the target object may be a quadrilateral box.
  • an artificial intelligent image processing method may be implemented on a computing device.
  • the computing device may have at least one processor, at least one computer-readable storage medium, and a communication platform connected to a network.
  • the method may include obtaining an image including a target object, and generating a plurality of feature maps by inputting the image into a convolutional neural network (CNN) .
  • the method may also include determining a plurality of region proposals based on the plurality of feature maps, and determining a plurality of pooling region proposals based on the plurality of region proposals and the plurality of feature maps.
  • the method may further include classifying the plurality of pooling region proposals into one or more object categories or a background category via a classifier.
  • the one or more object categories may include a category of the target object, and the plurality of pooling region proposals may include one or more pooling region proposals corresponding to the target object.
  • Each of the one or more pooling region proposals may have a plurality of corners.
  • the method may include determining a plurality of crop strategies for each corner of the plurality of corners of the pooling region proposal according to a position of the corresponding corner; trimming the pooling region proposal by cropping each of the plurality of corners according to one of the plurality of crop strategies; identifying a boundary to the trimmed pooling region proposal based on the cropped plurality of corners; and mapping the boundary to the image to determine a boundary of the target object.
  • a non-transitory computer-readable storage medium may include at least one set of instructions for artificial intelligent object detection.
  • the at least one set of instructions may direct the at least one processor to perform acts of obtaining an image including a target object; generating a plurality of feature maps by inputting the image into a convolutional neural network (CNN) ; determining a plurality of region proposals based on the plurality of feature maps; determining a plurality of pooling region proposals based on the plurality of region proposals and the plurality of feature maps; classifying the plurality of pooling region proposals into one or more object categories or a background category via a classifier.
  • the one or more object categories may include a category of the target object, and the plurality of pooling region proposals may include one or more pooling region proposals corresponding to the target object.
  • Each of the one or more pooling region proposals may have a plurality of corners.
  • the at least one set of instructions may also direct the at least one processor to perform acts of determining a plurality of crop strategies for each corner of the plurality of corners of the pooling region proposal according to a position of the corresponding corner; trimming the pooling region proposal by cropping each of the plurality of corners according to one of the plurality of crop strategies; identifying a boundary to the trimmed pooling region proposal based on the cropped plurality of corners; and mapping the boundary to the image to determine a boundary of the target object.
  • an artificial intelligent image processing system for object detection.
  • the artificial intelligent image processing system may include an acquisition module configured to obtain an image including a target object; a feature map determination module configured to generate a plurality of feature maps by inputting the image into a convolutional neural network (CNN) ; a region proposal determination module configured to determine a plurality of region proposals based on the plurality of feature maps; a pooling region proposal determination module configured to determine a plurality of pooling region proposals based on the plurality of region proposals and the plurality of feature maps; a classification module configured to classify the plurality of pooling region proposals into one or more object categories or a background category via a classifier.
  • the one or more object categories may include a category of the target object, and the plurality of pooling region proposals may include one or more pooling region proposals corresponding to the target object.
  • Each of the one or more pooling region proposals may have a plurality of corners.
  • the artificial intelligent image processing system may also include a boundary determination module configured to determine a plurality of crop strategies for each corner of the plurality of corners of the pooling region proposal according to a position of the corresponding corner; the boundary determination module configured to trim the pooling region proposal by cropping each of the plurality of corners according to one of the plurality of crop strategies; the boundary determination module configured to identify a boundary to the trimmed pooling region proposal based on the cropped plurality of corners; and the boundary determination module configured to map the boundary to the image to determine a boundary of the target object.
  • FIG. 1 is a schematic diagram illustrating an exemplary AI image processing system according to some embodiments of the present disclosure
  • FIG. 2 is a schematic diagram illustrating exemplary hardware and/or software components of a computing device according to some embodiments of the present disclosure
  • FIG. 3 is a schematic diagram illustrating exemplary hardware and/or software components of a mobile device according to some embodiments of the present disclosure
  • FIG. 4 is a block diagram illustrating an exemplary AI processing device according to some embodiments of the present disclosure
  • FIG. 5 is a flowchart illustrating an exemplary process for determining a boundary of a target object according to some embodiments of the present disclosure
  • FIG. 6 is a schematic diagram illustrating an exemplary region proposal network according to some embodiments of the present disclosure.
  • FIG. 7 is a flowchart illustrating an exemplary process for determining a boundary of a target object according to some embodiments of the present disclosure
  • FIG. 8 is a schematic diagram illustrating an exemplary process for determining a boundary of a target object according to some embodiments of the present disclosure.
  • FIGs. 9A to 9C are schematic diagrams illustrating an image according to some embodiments of the present disclosure.
  • the flowcharts used in the present disclosure illustrate operations that systems implement according to some embodiments of the present disclosure. It is to be expressly understood that the operations of the flowcharts need not be implemented in order. Conversely, the operations may be implemented in inverted order, or simultaneously. Moreover, one or more other operations may be added to the flowcharts, and one or more operations may be removed from the flowcharts.
  • although the system and method in the present disclosure are described primarily regarding an on-demand transportation service, it should be understood that this is only one exemplary embodiment.
  • the system or method of the present disclosure may be applied to any other kind of on-demand service.
  • the system or method of the present disclosure may be applied to transportation systems of different environments including land, ocean, aerospace, or the like, or any combination thereof.
  • the vehicle of the transportation systems may include a taxi, a private car, a hitch, a bus, a train, a bullet train, a high-speed rail, a subway, a vessel, an aircraft, a spaceship, a hot-air balloon, a driverless vehicle, or the like, or any combination thereof.
  • the transportation system may also include any transportation system for management and/or distribution, for example, a system for sending and/or receiving express deliveries.
  • the application of the system or method of the present disclosure may include a web page, a plug-in of a browser, a client terminal, a custom system, an internal analysis system, an artificial intelligence robot, or the like, or any combination thereof.
  • the terms “passenger,” “requester,” “service requester,” and “customer” in the present disclosure are used interchangeably to refer to an individual, an entity, or a tool that may request or order a service.
  • the terms “driver,” “provider,” “service provider,” and “supplier” in the present disclosure are used interchangeably to refer to an individual, an entity, or a tool that may provide a service or facilitate the providing of the service.
  • the term “user” in the present disclosure may refer to an individual, an entity, or a tool that may request a service, order a service, provide a service, or facilitate the providing of the service.
  • the user may be a passenger, a driver, an operator, or the like, or any combination thereof.
  • “passenger” and “passenger terminal” may be used interchangeably
  • “driver” and “driver terminal” may be used interchangeably.
  • the terms “service request” and “order” in the present disclosure are used interchangeably to refer to a request that may be initiated by a passenger, a requester, a service requester, a customer, a driver, a provider, a service provider, a supplier, or the like, or any combination thereof.
  • the service request may be accepted by any one of a passenger, a requester, a service requester, a customer, a driver, a provider, a service provider, or a supplier.
  • the service request may be chargeable or free.
  • the positioning technology used in the present disclosure may be based on a global positioning system (GPS) , a global navigation satellite system (GLONASS) , a compass navigation system (COMPASS) , a Galileo positioning system, a quasi-zenith satellite system (QZSS) , a wireless fidelity (WiFi) positioning technology, or the like, or any combination thereof.
  • the present disclosure relates to artificial intelligence (AI) systems and methods for object detection in an image.
  • the AI systems and methods may determine a boundary for a target object in the image.
  • the determined boundary of the target object may be a quadrilateral box.
  • the AI systems and methods may input the image into a convolutional neural network (CNN) to generate a plurality of feature maps, and generate a plurality of region proposals based on the plurality of feature maps.
  • the AI systems and methods may determine a plurality of pooling region proposals based on the plurality of region proposals and the plurality of feature maps by performing an ROI pooling operation.
  • the AI systems and methods may further classify the plurality of pooling region proposals into one or more object categories or a background category via a classifier.
  • the plurality of pooling region proposals may include one or more pooling region proposals corresponding to the target object.
  • Each of the one or more pooling region proposals may have a plurality of corners.
  • the AI systems and methods may determine a plurality of crop strategies for each corner of the plurality of corners of the pooling region proposal according to a position of the corresponding corner.
  • the AI systems and methods may also trim the pooling region proposal by cropping each of the plurality of corners according to one of the plurality of crop strategies.
  • the AI systems and methods may crop a corner based on a cropping direction and a cropping length, which may be determined based on the pooling region proposal.
  • the AI systems and methods may identify a boundary to the trimmed pooling region proposal based on the cropped plurality of corners, and map the boundary to the image to determine the boundary of the target object.
  • by using information (e.g., positions of the corners) of the cropped pooling region proposals, the boundary of the target object determined according to the present disclosure may be more suitable for the target object, especially for a tilted target object (e.g., a safety belt, tilted characters), which may improve the accuracy of locating the target object.
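To make the overall flow concrete, here is a minimal Python sketch of the pipeline summarized above. Every callable passed in (cnn, rpn, roi_pooling, classifier, trim_corners, identify_boundary, map_to_image) is a hypothetical stand-in for a component the patent describes abstractly, not the claimed implementation; the corner-trimming step is what yields a quadrilateral rather than a rectangular boundary.

```python
def detect(image, cnn, rpn, roi_pooling, classifier,
           trim_corners, identify_boundary, map_to_image):
    feature_maps = cnn(image)                      # conv + pooling layers, no FC layer
    region_proposals = rpn(feature_maps)           # rectangular candidate regions
    pooled = roi_pooling(feature_maps, region_proposals)  # canonical-size proposals

    boundaries = []
    for proposal in pooled:
        if classifier(proposal) == "background":
            continue                               # background proposals are dropped
        trimmed = trim_corners(proposal)           # per-corner crop strategies
        if trimmed is None:                        # a corner predicted "false"
            continue
        boundary = identify_boundary(trimmed)      # quadrilateral, not a rectangle
        boundaries.append(map_to_image(boundary, image))
    return boundaries
```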
  • FIG. 1 is a schematic diagram illustrating an exemplary AI image processing system 100 according to some embodiments of the present disclosure.
  • the AI image processing system 100 may be configured for object detection. For example, the AI image processing system 100 may determine a boundary corresponding to an object in an image.
  • the AI image processing system 100 may be an online platform providing an Online to Offline (O2O) service.
  • the AI image processing system 100 may include a sensor 110, a network 120, a terminal 130, a server 140, and a storage device 150.
  • the sensor 110 may be configured to capture one or more images.
  • an image may be a still image, a video, a stream video, or a video frame obtained from a video.
  • the image may be a three-dimensional (3D) image or a two-dimensional (2D) image.
  • the sensor 110 may be or include one or more cameras.
  • the sensor 110 may be a digital camera, a video camera, a security camera, a web camera, a smartphone, a tablet, a laptop, a video gaming console equipped with a web camera, a camera with multiple lenses, a camcorder, etc.
  • the network 120 may facilitate the exchange of information and/or data.
  • one or more components of the AI image processing system 100 (e.g., the sensor 110, the terminal 130, the server 140, the storage device 150) may exchange information and/or data with each other via the network 120.
  • the server 140 may process an image obtained from the sensor 110 via the network 120.
  • the server 140 may obtain user instructions from the terminal 130 via the network 120.
  • the network 120 may be any type of wired or wireless network, or combination thereof.
  • the network 120 may include a cable network, a wireline network, an optical fiber network, a telecommunications network, an intranet, the Internet, a local area network (LAN), a wide area network (WAN), a wireless local area network (WLAN), a metropolitan area network (MAN), a public telephone switched network (PSTN), a Bluetooth™ network, a ZigBee™ network, a near field communication (NFC) network, or the like, or any combination thereof.
  • the network 120 may include one or more network access points.
  • the network 120 may include wired or wireless network access points such as base stations and/or internet exchange points 120-1, 120-2, ..., through which one or more components of the AI image processing system 100 may be connected to the network 120 to exchange data and/or information.
  • the terminal 130 may include a mobile device 130-1, a tablet computer 130-2, a laptop computer 130-3, or the like, or any combination thereof.
  • the mobile device 130-1 may include a smart home device, a wearable device, a smart mobile device, a virtual reality device, an augmented reality device, or the like, or any combination thereof.
  • the smart home device may include a smart lighting device, a control device of an intelligent electrical apparatus, a smart monitoring device, a smart television, a smart video camera, an interphone, or the like, or any combination thereof.
  • the wearable device may include a bracelet, footgear, eyeglasses, a helmet, a watch, clothing, a backpack, an accessory, or the like, or any combination thereof.
  • the smart mobile device may include a smartphone, a personal digital assistant (PDA) , a gaming device, a navigation device, a point of sale (POS) device, or the like, or any combination thereof.
  • the virtual reality device and/or the augmented reality device may include a virtual reality helmet, a virtual reality glass, a virtual reality patch, an augmented reality helmet, an augmented reality glass, an augmented reality patch, or the like, or any combination thereof.
  • the virtual reality device and/or the augmented reality device may include a Google Glass, an Oculus Rift, a HoloLens, a Gear VR, etc.
  • the terminal 130 may remotely operate the sensor 110.
  • the terminal 130 may operate the sensor 110 via a wireless connection. In some embodiments, the terminal 130 may receive information and/or instructions inputted by a user, and send the received information and/or instructions to the sensor 110 or to the server 140 via the network 120. In some embodiments, the terminal 130 may receive data and/or information from the server 140. In some embodiments, the terminal 130 may be part of the server 140. In some embodiments, the terminal 130 may be omitted.
  • the server 140 may be a single server or a server group.
  • the server group may be centralized, or distributed (e.g., the server 140 may be a distributed system) .
  • the server 140 may be local or remote.
  • the server 140 may access information and/or data stored in the sensor 110, terminal 130, and/or the storage device 150 via the network 120.
  • the server 140 may be directly connected to the sensor 110, the terminal 130, and/or the storage device 150 to access stored information and/or data.
  • the server 140 may be implemented on a cloud platform.
  • the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an inter-cloud, a multi-cloud, or the like, or any combination thereof.
  • the server 140 may be implemented on a computing device 200 having one or more components illustrated in FIG. 2 in the present disclosure.
  • the server 140 may include an AI processing device 142.
  • the AI processing device 142 may process information and/or data to perform one or more functions described in the present disclosure.
  • the AI processing device 142 may process an image including a target object to determine a boundary of the target object in the image.
  • the AI processing device 142 may include one or more processing devices (e.g., single-core processing device (s) or multi-core processor (s) ) .
  • the AI processing device 142 may include a central processing unit (CPU) , an application-specific integrated circuit (ASIC) , an application-specific instruction-set processor (ASIP) , a graphics processing unit (GPU) , a physics processing unit (PPU) , a digital signal processor (DSP) , a field-programmable gate array (FPGA) , a programmable logic device (PLD) , a controller, a microcontroller unit, a reduced instruction-set computer (RISC) , a microprocessor, or the like, or any combination thereof.
  • the storage device 150 may store data and/or instructions. In some embodiments, the storage device 150 may store data obtained from the terminal 130 and/or the server 140. In some embodiments, the storage device 150 may store data and/or instructions that the server 140 may execute or use to perform exemplary methods described in the present disclosure. In some embodiments, the storage device 150 may include a mass storage, removable storage, a volatile read-and-write memory, a read-only memory (ROM) , or the like, or any combination thereof. Exemplary mass storage may include a magnetic disk, an optical disk, a solid-state drive, etc. Exemplary removable storage may include a flash drive, a floppy disk, an optical disk, a memory card, a zip disk, a magnetic tape, etc.
  • Exemplary volatile read-and-write memory may include a random-access memory (RAM) .
  • RAM may include a dynamic RAM (DRAM), a double data rate synchronous dynamic RAM (DDR SDRAM), a static RAM (SRAM), a thyristor RAM (T-RAM), a zero-capacitor RAM (Z-RAM), etc.
  • Exemplary ROM may include a mask ROM (MROM), a programmable ROM (PROM), an erasable programmable ROM (EPROM), an electrically erasable programmable ROM (EEPROM), a compact disk ROM (CD-ROM), a digital versatile disk ROM, etc.
  • the storage device 150 may be implemented on a cloud platform.
  • the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an inter-cloud, a multi-cloud, or the like, or any combination thereof.
  • the storage device 150 may be connected to the network 120 to communicate with one or more components of the AI image processing system 100 (e.g., the sensor 110, the terminal 130, the server 140) .
  • One or more components in the AI image processing system 100 may access the data or instructions stored in the storage device 150 via the network 120.
  • the storage device 150 may be directly connected to or communicate with one or more components in the AI image processing system 100 (e.g., the sensor 110, the terminal 130, the server 140) .
  • the storage device 150 may be part of the sensor 110.
  • when an element or component of the AI image processing system 100 performs an operation, the element may perform the operation through electrical signals and/or electromagnetic signals.
  • a processor of the terminal 130 may generate an electrical signal encoding the request.
  • the processor of the terminal 130 may then transmit the electrical signal to an output port.
  • the output port may be physically connected to a cable, which further may transmit the electrical signal to an input port of the server 140.
  • the output port of the terminal 130 may be one or more antennas, which convert the electrical signal to an electromagnetic signal.
  • within an electronic device, such as the terminal 130 and/or the server 140, when the processor retrieves or saves data from a storage medium, it may transmit electrical signals to a read/write device of the storage medium, which may read or write structured data in the storage medium.
  • the structured data may be transmitted to the processor in the form of electrical signals via a bus of the electronic device.
  • an electrical signal may refer to one electrical signal, a series of electrical signals, and/or a plurality of discrete electrical signals.
  • FIG. 2 is a schematic diagram illustrating exemplary hardware and software components of a computing device 200 according to some embodiments of the present disclosure.
  • the terminal 130, and/or the server 140 may be implemented on the computing device 200.
  • the AI processing device 142 of the server 140 may be implemented on the computing device 200 and configured to perform functions of the AI processing device 142 disclosed in this disclosure.
  • the computing device 200 may be a special purpose computer, and may be used to implement the AI image processing system 100 of the present disclosure.
  • the computing device 200 may be used to implement any component of the AI image processing system 100 as described herein.
  • the AI processing device 142 may be implemented on the computing device, via its hardware, software program, firmware, or a combination thereof.
  • although only one such computer is shown for convenience, the computer functions relating to the image processing described herein may be implemented in a distributed fashion on several similar platforms to distribute the processing load.
  • the computing device 200 may include a COM port 250 connected to a network to implement data communications.
  • the computing device 200 may also include a processor 220, in the form of one or more processors (or CPUs) , for executing program instructions.
  • the exemplary computing device may include an internal communication bus 210, different types of program storage units and data storage units (e.g., a disk 270, a read only memory (ROM) 230, a random-access memory (RAM) 240) , various data files applicable to computer processing and/or communication.
  • the exemplary computing device may also include program instructions stored in the ROM 230, RAM 240, and/or other type of non-transitory storage medium to be executed by the processor 220.
  • the method and/or process of the present disclosure may be implemented as the program instructions.
  • the computing device 200 also includes an I/O device 260 that may support the input and/or output of data flows between the computing device 200 and other components.
  • the computing device 200 may also receive programs and data via the communication network.
  • the computing device 200 in the present disclosure may also include multiple CPUs and/or processors, thus operations and/or method steps that are performed by one CPU and/or processor as described in the present disclosure may also be jointly or separately performed by the multiple CPUs and/or processors.
  • for example, if in the present disclosure the CPU and/or processor of the computing device 200 executes both step A and step B, step A and step B may also be performed by two different CPUs and/or processors jointly or separately in the computing device 200 (e.g., the first processor executes step A and the second processor executes step B, or the first and second processors jointly execute steps A and B).
  • FIG. 3 is a schematic diagram illustrating exemplary hardware and/or software components of a mobile device 300 according to some embodiments of the present disclosure.
  • in some embodiments, a terminal (e.g., the terminal 130) may be implemented on the mobile device 300.
  • the mobile device 300 may include a communication platform 310, a display 320, a graphic processing unit (GPU) 330, a central processing unit (CPU) 340, an I/O 350, a memory 360, an operating system (OS) 370, and a storage 390.
  • any other suitable component including but not limited to a system bus or a controller (not shown) , may also be included in the mobile device 300.
  • a mobile operating system 370 (e.g., iOS™, Android™, Windows Phone™) and one or more applications 380 may be loaded into the memory 360 from the storage 390 in order to be executed by the CPU 340.
  • the applications 380 may include a browser or any other suitable mobile apps for receiving and rendering information relating to image processing or other information from the AI image processing system 100.
  • User interactions with the information stream may be achieved via the I/O 350 and provided to the storage device 150, the server 140, and/or other components of the AI image processing system 100.
  • computer hardware platforms may be used as the hardware platform (s) for one or more of the elements described herein.
  • a computer with user interface elements may be used to implement a personal computer (PC) or any other type of work station or terminal device.
  • a computer may also act as a system if appropriately programmed.
  • FIG. 4 is a block diagram illustrating an exemplary AI processing device 142 according to some embodiments of the present disclosure.
  • the AI processing device 142 may include an acquisition module 401, a feature map determination module 403, a region proposal determination module 405, a pooling region proposal determination module 407, a classification module 409, and a boundary determination module 411.
  • the modules may be hardware circuits of all or part of the AI processing device 142.
  • the modules may also be implemented as an application or set of instructions read and executed by the AI processing device 142. Further, the modules may be any combination of the hardware circuits and the application/instructions.
  • the modules may be the part of the AI processing device 142 when the AI processing device 142 is executing the application/set of instructions.
  • the acquisition module 401 may be configured to obtain information and/or data related to the AI image processing system 100.
  • the acquisition module 401 may obtain an image including a target object.
  • the image may be a still image or a video captured by the sensor 110.
  • the target object may refer to an object to be identified and/or detected in the image.
  • the target object may be an object tilted relative to the image (e.g., a safety belt, tilted characters) .
  • all objects in the image may need to be identified and/or detected, and each object in the image may be referred to as a target object.
  • the acquisition module 401 may obtain the image from one or more components of the AI image processing system 100, such as the sensor 110, the terminal 130, a storage (e.g., the storage device 150) , or from an external source (e.g., ImageNet) via the network 120.
  • the feature map determination module 403 may be configured to generate a plurality of feature maps by inputting an image (e.g., the image obtained by the acquisition module 401) into a convolutional neural network (CNN) .
  • the CNN may be a trained CNN including one or more convolution layers and one or more pooling layers, without a fully connected layer.
  • the convolution layer (s) may be configured to extract features (or feature maps) of an image.
  • the pooling layers may be configured to reduce the size of the feature maps of the image.
  • the feature maps may include feature information of the image.
  • the region proposal determination module 405 may be configured to determine a plurality of region proposals based on the plurality of feature maps. In some embodiments, the region proposal determination module 405 may determine the plurality of region proposals according to a region proposal network (RPN) . Specifically, the region proposal determination module 405 may slide a sliding window over the plurality of feature maps. With the sliding of the sliding window over the plurality of feature maps, a plurality of sliding-window locations may be determined. At each sliding-window location, a set of preliminary region proposals may be determined. Since there is a plurality of sliding-window locations, a plurality of preliminary region proposals may be determined at the plurality of sliding-window locations.
  • multiple preliminary region proposals may highly overlap with each other, and the region proposal determination module 405 may select a portion of the plurality of preliminary region proposals as the plurality of region proposals.
  • the region proposal determination module 405 may determine the plurality of region proposals using a non-maximum suppression (NMS). Details regarding the determination of the region proposals may be found elsewhere in the present disclosure (e.g., operation 530 of the process 500 and the descriptions thereof).
  • the pooling region proposal determination module 407 may be configured to determine a plurality of pooling region proposals based on the plurality of region proposals and the plurality of feature maps. In some embodiments, the pooling region proposal determination module 407 may map the plurality of region proposals to the plurality of feature maps to determine a plurality of proposal feature maps (also referred to as regions of interest (ROIs) ) . The pooling region proposal determination module 407 may then determine the plurality of pooling region proposals by performing pooling on the plurality of proposal feature maps (or ROIs) .
  • the classification module 409 may be configured to classify the plurality of pooling region proposals into one or more object categories or a background category via the classifier. In some embodiments, the classification module 409 may classify negative samples in the pooling region proposals into the background category. If a pooling region proposal is classified into the background category, the pooling region proposal may be omitted from further processing. In some embodiments, the classification module 409 may classify a pooling region proposal corresponding to a positive sample into one of the one or more object categories.
  • the one or more object categories may be default settings of the image processing system 100, and/or may be adjusted by a user.
  • the one or more object categories may include a category of the target object.
  • the classification module 409 may select one or more pooling region proposals corresponding to the target object from the plurality of pooling region proposals.
  • the boundary determination module 411 may be configured to determine a boundary of the target object in the image based on at least one of the one or more pooling region proposals.
  • the boundary may be a polygonal box, for example, a quadrilateral box.
  • the boundary determination module 411 may determine a plurality of crop strategies for each corner according to a position of the corresponding corner.
  • the boundary determination module 411 may trim the pooling region proposal by cropping each of the plurality of corners according to one of the plurality of crop strategies.
  • the boundary determination module 411 may identify a boundary to the trimmed pooling region proposal based on the cropped plurality of corners, and map the boundary to the image to determine a boundary of the target object. Details regarding the determination of a boundary may be found elsewhere in the present disclosure (e.g., operation 560 of the process 500, process 700, and the descriptions thereof) .
  • the boundary determination module 411 may determine one or more boundaries corresponding to the target object based on the one or more pooling region proposals.
  • the boundary determination module 411 may determine an IoU between each of the one or more boundaries and a ground truth.
  • the ground truth may indicate a labelled boundary box of the target object.
  • the IoU between a boundary and the ground truth may reflect a degree of overlapping of the boundary and the ground truth.
  • the boundary determination module 411 may compare one or more determined IoUs related to the one or more boundaries, and determine a boundary with the greatest IoU as a target boundary corresponding to the target object.
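As a concrete reference, below is a minimal Python sketch of axis-aligned IoU and the greatest-IoU selection described above. Note that this simplifies the patent's setting: the determined boundaries may be general quadrilaterals, for which the intersection computation is more involved; all names are illustrative.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def select_target_boundary(boundaries, ground_truth):
    """Return the candidate boundary with the greatest IoU against the ground truth."""
    return max(boundaries, key=lambda b: iou(b, ground_truth))

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 = 0.142857...
```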
  • the modules in the AI processing device 142 may be connected to or communicate with each other via a wired connection or a wireless connection.
  • the wired connection may include a metal cable, an optical cable, a hybrid cable, or the like, or any combination thereof.
  • the wireless connection may include a Local Area Network (LAN) , a Wide Area Network (WAN) , a Bluetooth, a ZigBee, a Near Field Communication (NFC) , or the like, or any combination thereof.
  • the AI processing device 142 may further include one or more additional modules.
  • the AI processing device 142 may further include a storage module (not shown in FIG. 4) configured to store data generated by the modules of the AI processing device 142.
  • FIG. 5 is a flowchart illustrating an exemplary process 500 for determining a boundary of a target object according to some embodiments of the present disclosure.
  • for illustration purposes, the AI processing device 142 is described as the subject performing the process 500.
  • the process 500 may also be performed by other entities.
  • at least a portion of the process 500 may also be implemented on the computing device 200 as illustrated in FIG. 2 or the mobile device 300 as illustrated in FIG. 3.
  • one or more operations of process 500 may be implemented in the AI image processing system 100 as illustrated in FIG. 1.
  • one or more operations in the process 500 may be stored in the storage device 150 and/or the storage (e.g., the ROM 230, the RAM 240, etc. ) as a form of instructions, and invoked and/or executed by the server 140 (e.g., the AI processing device 142 in the server 140, or the processor 220 of the AI processing device 142 in the server 140) .
  • the instructions may be transmitted in a form of electronic current or electrical signals.
  • the AI processing device 142 may obtain an image including a target object.
  • the image may be an image captured by the sensor 110 (e.g., a camera of a smartphone, a camera in an autonomous vehicle, an intelligent security camera, a traffic camera) .
  • the captured image may be a still image, a video, etc.
  • the image may include multiple objects, such as people, animals (e.g., dog, cat) , vehicles (e.g., bike, car, bus, truck) , plants (e.g., flower, tree) , buildings, scenery, or the like, or any combination thereof.
  • the image may include an object tilted relative to the image, such as a safety belt, tilted characters, etc.
  • the target object may refer to an object to be identified and/or detected in the image.
  • the target object may be an object tilted relative to the image (e.g., the safety belt, the tilted characters) .
  • all objects in the image may need to be identified and/or detected, and each object in the image may be referred to as a target object.
  • the AI processing device 142 may obtain the image from one or more components of the AI image processing system 100, such as the sensor 110, the terminal 130, a storage device (e.g., the storage device 150) .
  • the AI processing device 142 may obtain the image from an external source via the network 120.
  • the AI processing device 142 may obtain the image from ImageNet, etc.
  • the AI processing device 142 may generate a plurality of feature maps by inputting the image into a convolutional neural network (CNN) .
  • the plurality of feature maps may include feature information of the image.
  • the CNN may be generated based on a Zeiler and Fergus model (ZF), VGG-16, ResNet-50, etc.
  • the CNN may be a trained CNN including one or more convolution layers and one or more pooling layers, without a fully connected layer.
  • the convolution layer (s) may be configured to extract features (or feature maps) of an image (e.g., the image obtained in 510) .
  • the pooling layers may be configured to reduce the size of the feature maps of the image.
  • the image may be inputted into the CNN, and a plurality of feature maps may be generated.
  • the CNN may be determined based on a ZF model.
  • An image with the size of 600*1000 may be inputted into the ZF model, and 256 feature maps may be outputted from the ZF model.
  • the size of each of the 256 feature maps may be 40*60.
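The shape arithmetic in the ZF example above can be checked with a toy stand-in backbone. The sketch below is not the actual ZF network, just a minimal PyTorch stack (an assumption of this example) with an overall stride of 16, which is what turns a 600*1000 input into 256 feature maps of roughly 40*60.

```python
import torch
import torch.nn as nn

# Minimal stand-in backbone (not the actual ZF network): convolution and
# pooling layers only, no fully connected layer, overall stride 16.
backbone = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),     # /2
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # /4
    nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # /8
    nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # /16
)

x = torch.randn(1, 3, 600, 1000)  # one 600*1000 RGB image
print(backbone(x).shape)          # torch.Size([1, 256, 37, 62]), i.e. 256 maps of roughly 40*60
```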
  • the CNN may be generated according to transfer learning. Transfer learning may be capable of reducing the training time by using previously obtained knowledge.
  • a base network may be a pre-trained network trained previously based on a plurality of first training samples obtained from a dataset (e.g., ImageNet dataset, PASCAL VOC, COCO, etc. ) .
  • the base network may include one or more layers (e.g., convolution layer(s), pooling layer(s)) and a plurality of pre-trained weights. At least some of the one or more layers and their corresponding pre-trained weights may be transferred to a target network.
  • the base network may be VGG-16, including thirteen convolution layers, four pooling layers, and three fully connected layers.
  • the thirteen convolution layers and the four pooling layers may be transferred to a target network (e.g., the CNN) .
  • the pre-trained weights of the convolution layers and/or the pooling layers may not need to be adjusted, or may be fine-tuned based on a plurality of second training samples obtained from a dataset (e.g., ImageNet dataset, PASCAL VOC, COCO, etc. ) .
  • the target network may further include one or more additional layers other than the transferred layers.
  • the weights in the additional layer (s) may be updated according to a plurality of third training samples obtained from a dataset (e.g., ImageNet dataset, PASCAL VOC, COCO, etc. ) .
  • the CNN may be directly generated by training a preliminary CNN using a plurality of fourth training samples obtained from a dataset (e.g., ImageNet dataset, PASCAL VOC, COCO, etc. ) .
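A sketch of the transfer-learning setup described above, assuming PyTorch/torchvision (not named in the source): reuse VGG-16's pretrained convolution and pooling stack, drop its final pooling layer so that the thirteen convolution layers and four pooling layers described above remain, and discard the fully connected layers. Whether to freeze or fine-tune the transferred weights is the choice discussed above.

```python
import torch.nn as nn
from torchvision import models

# Load VGG-16 pretrained on ImageNet (recent torchvision weights API assumed).
vgg16 = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)

# Dropping the final pooling layer of torchvision's VGG-16 feature stack leaves
# the thirteen convolution layers and four pooling layers to transfer; the three
# fully connected layers (vgg16.classifier) are discarded entirely.
feature_extractor = nn.Sequential(*list(vgg16.features.children())[:-1])

# Either keep the pre-trained weights fixed, as below, or leave requires_grad
# True and fine-tune them on further training samples.
for param in feature_extractor.parameters():
    param.requires_grad = False
```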
  • the AI processing device 142 may determine a plurality of region proposals based on the plurality of feature maps. In some embodiments, the AI processing device 142 may determine the plurality of region proposals according to a region proposal network (RPN) . As shown in FIG. 6, the RPN may include at least one regression layer and at least one classification layer.
  • the AI processing device 142 may slide a sliding window over the plurality of feature maps.
  • the sliding window may also be referred to as a convolution kernel that has a size of, for example, 3*3, 5*5, etc.
  • a plurality of sliding-window locations may be determined.
  • the size of the sliding window may be 3*3, and the size of the plurality of feature maps may be 40*60.
  • roughly 40*60 (i.e., 2400) sliding-window locations may be determined.
  • the sliding window may coincide with a sub-region of the plurality of feature maps.
  • the AI processing device 142 may map the sub-region of the plurality of feature maps to a multi-dimensional feature vector. For example, if there are 256 feature maps, a 256-dimensional feature vector may be generated at the sub-region.
  • the AI processing device 142 may generate an anchor by mapping a center pixel of the sub-region to a pixel of the image obtained in 510.
  • the anchor may correspond to a set of anchor boxes (e.g., including k anchor boxes) in the image. Each of the set of anchor boxes may be a rectangular box.
  • the anchor may be a center point of the set of anchor boxes.
  • Each of the set of anchor boxes may be associated with a scale and an aspect ratio.
  • for example, with 3 scales (e.g., 128, 256, 512) and 3 aspect ratios (e.g., 1:1, 1:2, 2:1), the number of the set of anchor boxes may be 9 (3*3), as illustrated in the sketch below.
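A small Python sketch of how one anchor point expands into k = 9 anchor boxes under the 3 scales and 3 aspect ratios above. The area-preserving width/height formulas are a common convention (as in Faster R-CNN), assumed here rather than quoted from the patent.

```python
def anchor_boxes(cx, cy, scales=(128, 256, 512), ratios=(1.0, 0.5, 2.0)):
    """Return (x1, y1, x2, y2) anchor boxes centered at the anchor point (cx, cy).

    ratio is interpreted as height/width; each box keeps area = scale**2.
    """
    boxes = []
    for scale in scales:
        for ratio in ratios:
            w = scale / ratio ** 0.5
            h = scale * ratio ** 0.5
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes

print(len(anchor_boxes(300, 200)))  # 9 anchor boxes: 3 scales * 3 aspect ratios
```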
  • the AI processing device 142 may feed the multi-dimensional feature vector and/or the set of anchor boxes into the at least one regression layer and the at least one classification layer, respectively.
  • the at least one regression layer may be configured to conduct bounding-box regression to determine a set of preliminary region proposals corresponding to the set of anchor boxes.
  • the output of the at least one regression layer may include four coordinate values of each of the set of preliminary region proposals.
  • the four coordinate values of a preliminary region proposal may include a location of the preliminary region proposal (e.g., coordinates (x, y) of the anchor of the corresponding anchor box) and a size of the preliminary region proposal (e.g., a width w and a height h of the preliminary region proposal) .
  • the at least one classification layer may be configured to determine a category for each of the set of preliminary region proposals.
  • the category may be a foreground or a background.
  • the output of the at least one classification layer may include a first score of being foreground and a second score of being background of each of the set of preliminary region proposals.
  • at each sliding-window location, a set of (e.g., 9) preliminary region proposals may be determined. Since there are a plurality of sliding-window locations (e.g., roughly 2400), a plurality of (e.g., roughly 20000) preliminary region proposals may be determined across the sliding-window locations. In some embodiments, multiple preliminary region proposals may highly overlap with each other.
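For illustration, a minimal PyTorch sketch of such regression and classification layers. Realizing the sliding window as a 3*3 convolution and the two heads as 1*1 convolutions follows common RPN practice and is an assumption of this sketch, not a disclosed implementation.

    import torch
    import torch.nn as nn

    class RPNHead(nn.Module):
        """Sliding 3*3 window over the feature maps, followed by a
        regression layer (4 coordinate values per anchor box) and a
        classification layer (2 scores per anchor box)."""
        def __init__(self, in_channels=256, k=9):
            super().__init__()
            self.sliding = nn.Conv2d(in_channels, 256, kernel_size=3, padding=1)
            self.reg = nn.Conv2d(256, 4 * k, kernel_size=1)  # x, y, w, h
            self.cls = nn.Conv2d(256, 2 * k, kernel_size=1)  # fg/bg scores

        def forward(self, feature_maps):
            x = torch.relu(self.sliding(feature_maps))
            return self.reg(x), self.cls(x)

    # 256 feature maps of size 40*60 -> roughly 2400 * 9 = 21600 proposals.
    reg_out, cls_out = RPNHead()(torch.randn(1, 256, 40, 60))
    print(reg_out.shape, cls_out.shape)
    # torch.Size([1, 36, 40, 60]) torch.Size([1, 18, 40, 60])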
  • the AI processing device 142 may select a portion of the plurality of preliminary region proposals as a plurality of region proposals. In some embodiments, the AI processing device 142 may select the plurality of region proposals using a non-maximum suppression (NMS) .
  • the AI processing device 142 may determine the plurality of region proposals based on the first score of being foreground and the second score of being background of each of the plurality of preliminary region proposals and four coordinate values of each of the plurality of preliminary region proposals.
  • the AI processing device 142 may determine an intersection-over-union (IoU) between each of the plurality of preliminary region proposals and a ground truth.
  • the ground truth may be a labelled boundary box of the target object.
  • the AI processing device 142 may determine preliminary region proposals that have an IoU greater than 0.7 as positive samples, and determine preliminary region proposals that have an IoU less than 0.3 as negative samples.
  • the AI processing device 142 may remove preliminary region proposals other than the positive samples and the negative samples.
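For illustration, a minimal Python sketch of the IoU computation used for the 0.7/0.3 sampling above, assuming axis-aligned (x1, y1, x2, y2) boxes.

    def iou(box_a, box_b):
        """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
        ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / (area_a + area_b - inter)

    # IoU > 0.7 against the ground truth -> positive sample;
    # IoU < 0.3 -> negative sample; the rest are removed.
    print(round(iou((0, 0, 10, 10), (5, 5, 15, 15)), 3))  # 0.143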
  • the AI processing device 142 may select the plurality of region proposals from the positive samples and the negative samples. In some embodiments, the AI processing device 142 may rank the positive samples based on the first score of being foreground of each of the positive samples, and select multiple positive samples based on the ranked positive samples. The AI processing device 142 may rank the negative samples based on the second score of being background of each of the negative samples, and select multiple negative samples based on the ranked negative samples. The selected positive samples and the selected negative samples may constitute the plurality of region proposals. In some embodiments, the AI processing device 142 may select 300 region proposals. The number of the selected positive samples may be the same as or different from that of the selected negative samples. In some embodiments, before selecting the region proposals using the non-maximum suppression (NMS) , the AI processing device 142 may first remove preliminary region proposals that cross boundaries of the image (also referred to as cross-boundary preliminary region proposals) .
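For illustration, a minimal greedy NMS sketch reusing the iou() helper above. The overlap threshold is an assumption of this sketch; the top-300 cap follows the 300 region proposals mentioned above.

    def nms(proposals, scores, iou_threshold=0.7, top_n=300):
        """Greedy non-maximum suppression: keep the highest-scoring
        proposal, drop remaining proposals that overlap it beyond the
        threshold, and repeat; return indices of up to top_n kept."""
        order = sorted(range(len(proposals)),
                       key=lambda i: scores[i], reverse=True)
        keep = []
        while order and len(keep) < top_n:
            best = order.pop(0)
            keep.append(best)
            order = [i for i in order
                     if iou(proposals[best], proposals[i]) < iou_threshold]
        return keep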
  • the AI processing device 142 may determine a plurality of pooling region proposals based on the plurality of region proposals and the plurality of feature maps.
  • the AI processing device 142 may map the plurality of region proposals to the plurality of feature maps to determine a plurality of proposal feature maps (also referred to as regions of interest (ROIs) ) .
  • the plurality of proposal feature maps (or ROIs) may be inputted into a classifier for further processing.
  • the classifier may only accept proposal feature map (s) with a canonical size (e.g., 7*7) .
  • the AI processing device 142 may resize the plurality of proposal feature maps to the canonical size.
  • the AI processing device 142 may determine the plurality of pooling region proposals by performing pooling on the plurality of proposal feature maps (or ROIs) .
  • the pooling may include a max pooling, a mean pooling, or the like.
  • the plurality of pooling region proposals may correspond to the canonical size (e.g., 7*7) and may be inputted into the classifier for further processing.
  • a pooling region proposal may be determined as a fixed-length vector, which will be sent into a full connection layer of the classifier.
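For illustration, a minimal NumPy sketch of max pooling one ROI of a single feature map to the canonical 7*7 size; the grid-partitioning scheme is an assumption of this sketch.

    import numpy as np

    def roi_max_pool(feature_map, roi, size=7):
        """Resize one ROI to a canonical size*size grid by max pooling
        each grid cell. roi = (x1, y1, x2, y2) in feature-map coordinates."""
        x1, y1, x2, y2 = roi
        xs = np.linspace(x1, x2, size + 1).astype(int)
        ys = np.linspace(y1, y2, size + 1).astype(int)
        pooled = np.zeros((size, size), dtype=feature_map.dtype)
        for i in range(size):
            for j in range(size):
                cell = feature_map[ys[i]:max(ys[i + 1], ys[i] + 1),
                                   xs[j]:max(xs[j + 1], xs[j] + 1)]
                pooled[i, j] = cell.max()
        return pooled

    # A 40*60 feature map and one ROI -> a 7*7 pooling region proposal.
    print(roi_max_pool(np.random.rand(40, 60), (5, 5, 45, 30)).shape)  # (7, 7)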
  • the AI processing device 142 may classify the plurality of pooling region proposals into one or more object categories or a background category via the classifier.
  • the classifier may include a support vector machine (SVM) classifier, a Bayes classifier, a decision tree classifier, a softmax classifier, or the like, or any combination thereof.
  • one or more pooling region proposals may be classified into the background category.
  • the region proposals may include multiple positive samples and multiple negative samples.
  • the pooling region proposals may correspond to multiple positive samples and multiple negative samples.
  • the multiple negative samples in the pooling region proposals may be classified into the background category. If a pooling region proposal is classified into the background category, the pooling region proposal may be omitted and not processed further.
  • a pooling region proposal corresponding to a positive sample may be classified into one of the one or more object categories.
  • the one or more object categories may be default settings of the AI image processing system 100, and/or may be adjusted by a user.
  • the one or more object categories may include a category of the target object.
  • the plurality of pooling region proposals may include one or more pooling region proposals corresponding to the target object.
  • the AI processing device 142 may select the one or more pooling region proposals corresponding to the target object from the plurality of pooling region proposals.
  • the AI processing device 142 may determine a target boundary of the target object in the image based on at least one of the one or more pooling region proposals.
  • the target boundary may be a polygonal box, for example, a quadrilateral box.
  • each of the one or more pooling region proposals may have a plurality of corners (e.g., 4 corners, 5 corners, 8 corners, etc. ) .
  • the AI processing device 142 may determine a plurality of crop strategies for each corner of the plurality of corners according to a position of the corresponding corner.
  • the AI processing device 142 may determine five crop strategies for each of the plurality of corners.
  • the AI processing device 142 may determine one of the plurality of (e.g., five) crop strategies as a desired crop strategy of the corner based on the pooling region proposal.
  • the AI processing device 142 may determine a cropping direction and a cropping length for each corner based on the pooling region proposal.
  • the cropping direction of a corner may be limited to one of the plurality of crop strategies of the corner.
  • the AI processing device 142 may trim the pooling region proposal by cropping each of the plurality of corners according to the desired crop strategy, for example, based on the cropping direction and the cropping length.
  • the plurality of crop strategies of each corner may include a crop strategy of false and/or a crop strategy of target position. The crop strategy of false of a corner may indicate that the corner may correspond to a point inside the target object.
  • the crop strategy of target position of a corner may indicate that the corner may correspond to a boundary point of the target object. If the cropping direction of a corner corresponds to a crop strategy of false, the AI processing device 142 may stop cropping the corner and abandon the pooling region proposal. If the cropping direction of a corner corresponds to a crop strategy of target position, the AI processing device 142 may stop cropping the corner. When the cropping direction of each corner corresponds to a crop strategy of target position, the AI processing device 142 may identify a boundary to the trimmed pooling region proposal based on the cropped plurality of corners, and map the boundary to the image to determine a boundary of the target object. Details regarding the determination of a boundary may be found elsewhere in the present disclosure (e.g., FIG. 7 and the descriptions thereof) .
  • the AI processing device 142 may determine one or more boundaries corresponding to the target object based on the one or more pooling region proposals. Each of the one or more boundaries may be determined according to process 700, and the descriptions thereof are not repeated herein.
  • the AI processing device 142 may determine an IoU between each of the one or more boundaries and a ground truth.
  • the ground truth may indicate a labelled boundary box of the target object.
  • the IoU between a boundary and the ground truth may reflect a degree of overlapping of the boundary and the ground truth.
  • the AI processing device 142 may compare one or more determined IoUs related to the one or more boundaries, and determine a boundary with the greatest IoU as a target boundary corresponding to the target object.
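For illustration, a one-function sketch of this selection, reusing the iou() helper above. It assumes axis-aligned boxes for simplicity; since the disclosed boundaries may be quadrilaterals, a full implementation would need a polygon-intersection IoU.

    def select_target_boundary(boundaries, ground_truth):
        """Return the boundary with the greatest IoU against the
        labelled ground-truth box."""
        return max(boundaries, key=lambda b: iou(b, ground_truth))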
  • each corner of the pooling region proposal may be cropped according to one of its crop strategies, which takes into account information related to the corresponding corner.
  • the cropping direction and/or the cropping length of each corner of the pooling region proposal may be determined based on the pooling region proposal, which takes into account features in the pooling region proposal.
  • a boundary of the target object determined according to the process disclosed in the present disclosure may be more suitable for the target object, especially, for a tilted target object, which may improve the accuracy of detecting and/or locating the target object.
  • one or more boundaries may be determined.
  • a boundary with the greatest IoU among the one or more boundaries may be determined as the target boundary. That is, a boundary having the greatest degree of overlapping with the ground truth may be determined as the target boundary, which may further improve the accuracy of detecting and/or locating the target object.
  • alternatively, the AI processing device 142 may not need to determine a plurality of boundaries corresponding to the target object.
  • in that case, once a boundary is determined, the AI processing device 142 may determine the boundary as the target boundary, and operation 560 may be terminated.
  • the boundaries of one or more target objects (e.g., all objects) in the image may be determined simultaneously.
  • process 500 may be repeated to determine boundaries of target objects in a plurality of different images.
  • FIG. 6 is a schematic diagram illustrating an exemplary region proposal network (RPN) according to some embodiments of the present disclosure.
  • the RPN introduces a sliding window.
  • the sliding window is configured to slide over a plurality of feature maps.
  • a sliding window coincides with a sub-region of the plurality of feature maps at a certain sliding-window location.
  • the sliding window has a size of 3*3.
  • the sub-region is mapped to a multi-dimensional feature vector, e.g., a 256-dimensional (256-d) feature vector shown in an intermediate layer.
  • a center pixel O of the sub-region is mapped to a pixel of an image to generate an anchor O’ .
  • a set of anchor boxes (e.g., k anchor boxes) are determined based on the anchor O’ .
  • Each of the set of anchor boxes is a rectangular box, and the anchor O’ is a center point of the set of anchor boxes.
  • the RPN includes a regression layer (denoted as reg layer) and a classification layer (denoted as cls layer) .
  • the regression layer may be configured to conduct bounding-box regression to determine a preliminary region proposal corresponding to an anchor box.
  • the classification layer may be configured to determine a category for the preliminary region proposal.
  • the multi-dimensional feature vector (i.e., the 256-d feature vector) and the set of anchor boxes (i.e., k anchor boxes) are fed into the regression layer and the classification layer, respectively.
  • the output of the regression layer includes four coordinate values (also referred to as four coordinates) of each of the set of preliminary region proposals.
  • the four coordinate values of a preliminary region proposal may include a location of the preliminary region proposal (e.g., coordinates (x, y) of the anchor of the corresponding anchor box) and a size of the preliminary region proposal (e.g., a width w and a height h of the preliminary region proposal) .
  • the output of the classification layer includes two scores of each of the set of preliminary region proposals, including a first score of being foreground and a second score of being background.
  • a set of preliminary region proposals are determined at the certain sliding-window location.
  • a plurality of preliminary region proposals may be determined at a plurality of sliding-window locations.
  • the RPN may select a portion of the plurality of preliminary region proposals as region proposals for further processing. More descriptions regarding the selection of the region proposals may be found elsewhere in the present disclosure (e.g., operation 530 of process 500 and the relevant descriptions thereof) .
  • FIG. 7 is a flowchart illustrating an exemplary process 700 for determining a boundary of a target object based on a pooling region proposal according to some embodiments of the present disclosure.
  • the AI processing device 142 is described below as the subject that performs the process 700.
  • the process 700 may also be performed by other entities.
  • at least a portion of the process 700 may also be implemented on the computing device 200 as illustrated in FIG. 2 or the mobile device 300 as illustrated in FIG. 3.
  • one or more operations of process 700 may be implemented in the AI image processing system 100 as illustrated in FIG. 1.
  • one or more operations in the process 700 may be stored in the storage device 150 and/or the storage (e.g., the ROM 230, the RAM 240, etc. ) as a form of instructions, and invoked and/or executed by the server 140 (e.g., the AI processing device 142 in the server 140, or the processor 220 of the AI processing device 142 in the server 140) .
  • the instructions may be transmitted in a form of electronic current or electrical signals.
  • a portion of operation 560 of the process 500 may be performed according to process 700.
  • a pooling region proposal may have a plurality of corners.
  • the AI processing device 142 (e.g., the boundary determination module 411) may determine a plurality of crop strategies for each of the plurality of corners of the pooling region proposal according to a position of each of the plurality of corners.
  • the position of a corner may refer to a position of the corner relative to positions of the other corners.
  • the pooling region proposal may be a rectangular box and include four corners.
  • the four corners may include a top left corner, a top right corner, a bottom left corner, and a bottom right corner.
  • the AI processing device 142 may determine a plurality of crop strategies for each of the four corners based on the position of each of the four corners. Specifically, the AI processing device 142 may determine a plurality of crop strategies for the top left corner.
  • the plurality of crop strategies of the top left corner may include cropping to right, cropping to bottom, cropping to bottom right, target position, false, or the like, or any combination thereof.
  • the AI processing device 142 may determine a plurality of crop strategies for the top right corner.
  • the plurality of crop strategies of the top right corner may include cropping to left, cropping to bottom, cropping to bottom left, target position, false, or the like, or any combination thereof.
  • the AI processing device 142 may determine a plurality of crop strategies for the bottom left corner.
  • the plurality of crop strategies of the bottom left corner may include cropping to right, cropping to top, cropping to top right, target position, or false.
  • the AI processing device 142 may determine a plurality of crop strategies for the bottom right corner.
  • the plurality of crop strategies of the bottom right corner may include cropping to left, cropping to top, cropping to top left, target position, false, or the like, or any combination thereof.
  • the plurality of crop strategies of each corner may include a crop strategy of false and/or a crop strategy of target position.
  • the crop strategy of false of a corner may indicate that the corner may correspond to a point inside the target object.
  • the crop strategy of target position of a corner may indicate that the corner may correspond to a boundary point of the target object. It should be noted that the crop strategies of each corner and the number of corners are merely provided for illustration purposes, and are not intended to limit the scope of the present disclosure.
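For illustration, one way to encode the per-corner crop strategies above in Python, where "T" stands for target position and "F" for false. The string encoding is an assumption of this sketch.

    # Allowed crop strategies per corner. "T" = target position (the corner
    # lies on the object boundary); "F" = false (the corner lies inside the
    # target object, so the proposal does not encompass the whole object).
    CROP_STRATEGIES = {
        "top_left":     ["right", "bottom", "bottom_right", "T", "F"],
        "top_right":    ["left",  "bottom", "bottom_left",  "T", "F"],
        "bottom_left":  ["right", "top",    "top_right",    "T", "F"],
        "bottom_right": ["left",  "top",    "top_left",     "T", "F"],
    }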
  • the AI processing device 142 may determine a crop strategy for each of the plurality of corners from the plurality of crop strategies based on the pooling region proposal.
  • the AI processing device 142 may determine a cropping direction and a cropping length for each corner based on the pooling region proposal.
  • the AI processing device 142 may analyze features of pixels (e.g., pixels representing target object, pixels representing background) in the pooling region proposal, and determine the cropping direction and/or the cropping length based on the analysis result.
  • the cropping direction may be limited to one of the plurality of crop strategies.
  • the cropping length may be a length of several pixels, for example, 0 to 10 pixels.
  • the AI processing device 142 may determine whether one of the plurality of corners corresponds to a crop strategy of false. In some embodiments, if the determined cropping direction of a corner corresponds to the crop strategy of false, the corner may correspond to a point inside the target object. That is, the pooling region proposal does not encompass the whole target object. If the determined cropping direction of a corner corresponds to the crop strategy of target position, the corner may correspond to a boundary point of the target object. Otherwise, if the determined cropping direction of a corner corresponds to a crop strategy other than the crop strategy of false and the crop strategy of target position, the corner may correspond to a point at some distance from the target object.
  • in response to a determination that at least one of the plurality of corners corresponds to the crop strategy of false, the AI processing device 142 may proceed to operation 740. In response to a determination that none of the plurality of corners corresponds to the crop strategy of false, the AI processing device 142 may proceed to operation 750.
  • the AI processing device 142 may abandon the pooling region proposal. Since the pooling region proposal does not encompass the whole target object, a boundary of the target object cannot be determined based on the pooling region proposal. Accordingly, the AI processing device 142 may abandon the pooling region proposal.
  • the AI processing device 142 may determine whether each of the plurality of corners corresponds to a crop strategy of target position. In response to a determination that at least one of the plurality of corners does not correspond to the crop strategy of target position, the AI processing device 142 may proceed to operation 760.
  • the AI processing device 142 may trim the pooling region proposal by cropping the at least one corner according to the determined crop strategy of the at least one corner. That is, if a corner corresponds to neither the crop strategy of target position nor the crop strategy of false, the AI processing device 142 may crop the corner based on the crop strategy of the corner determined in 720. When the at least one corner is cropped according to its crop strategy, a trimmed pooling region proposal may be determined.
  • the AI processing device 142 may crop the top right corner towards left to update the position of the top right corner.
  • the AI processing device 142 may crop the top left corner towards right to update the position of the top left corner.
  • the AI processing device 142 may crop the bottom left corner towards top right to update the position of the bottom left corner.
  • the AI processing device 142 may perform a bounding mapping based on the cropped plurality of corners to determine a rectangular box.
  • the pooling region proposal may be a rectangular box and include four corners. Due to different crop strategies applied for different corners, the trimmed pooling region proposal may be a quadrilateral box rather than a rectangular box. In some embodiments, the crop strategies described above can be used only when the (trimmed) pooling region proposal is a rectangular box.
  • the AI processing device 142 may perform the bounding mapping on the trimmed pooling region proposal. Specifically, the AI processing device 142 may determine two diagonal lines based on the four corners, and determine the longer diagonal line as a target diagonal line. The AI processing device 142 may determine a rectangular box based on the target diagonal line.
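For illustration, a minimal Python sketch of one plausible reading of this bounding mapping: take the longer diagonal of the quadrilateral as the target diagonal and return the axis-aligned rectangle it spans. The corner ordering and this geometric interpretation are assumptions of the sketch.

    import math

    def bounding_mapping(corners):
        """corners = (top_left, top_right, bottom_right, bottom_left),
        each an (x, y) point. Returns (x1, y1, x2, y2) of the rectangle
        spanned by the longer diagonal."""
        tl, tr, br, bl = corners
        if math.dist(tl, br) >= math.dist(tr, bl):
            p, q = tl, br
        else:
            p, q = tr, bl
        x1, x2 = sorted((p[0], q[0]))
        y1, y2 = sorted((p[1], q[1]))
        return (x1, y1, x2, y2)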
  • the AI processing device 142 may resize the rectangular box into a canonical size to determine an updated pooling region proposal.
  • the AI processing device 142 may resize the rectangular box by performing pooling to determine an updated pooling region proposal.
  • the updated pooling region proposal may have a canonical size and be accepted by the classifier.
  • the AI processing device 142 may proceed to operations 720 through 780 and start the next iteration. Descriptions of the operations 720 through 770 may be found elsewhere in the present disclosure, and are not repeated here.
  • the AI processing device 142 may repeat operations 720 through 780 until each of the plurality of corners corresponds to a crop strategy of target position.
  • the AI processing device 142 may proceed to operation 750.
  • the AI processing device 142 may determine whether each of the plurality of corners corresponds to the crop strategy of target position. In response to a determination that each of the plurality of corners corresponds to the crop strategy of target position, the AI processing device 142 may stop cropping the plurality of corners. The AI processing device 142 may proceed to operation 790.
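For illustration, a minimal Python sketch of the iterative loop of operations 720 through 790. The predict_crop callable is a hypothetical stand-in for the learned step that returns a (strategy, length) pair for one corner; the strategy strings follow the CROP_STRATEGIES encoding sketched earlier, and the bounding mapping/resizing step is abbreviated to a comment.

    # Unit offsets for each cropping direction (an assumption of the sketch).
    DIRECTION_VECTORS = {
        "left": (-1, 0), "right": (1, 0), "top": (0, -1), "bottom": (0, 1),
        "top_left": (-1, -1), "top_right": (1, -1),
        "bottom_left": (-1, 1), "bottom_right": (1, 1),
    }

    def trim_proposal(corners, predict_crop, max_iters=50):
        """corners: dict like {"top_left": (x, y), ...}. Returns the
        trimmed corners, or None if the proposal is abandoned."""
        for _ in range(max_iters):
            crops = {c: predict_crop(corners, c) for c in corners}
            if any(s == "F" for s, _ in crops.values()):
                return None              # a corner lies inside the object
            if all(s == "T" for s, _ in crops.values()):
                return corners           # every corner at target position
            for c, (strategy, length) in crops.items():
                if strategy not in ("T", "F"):
                    dx, dy = DIRECTION_VECTORS[strategy]
                    x, y = corners[c]
                    corners[c] = (x + dx * length, y + dy * length)
            # A full implementation would perform the bounding mapping and
            # resize the rectangle to the canonical size here (operations
            # 770 and 780) before the next iteration.
        return corners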
  • the AI processing device 142 may identify a boundary to the trimmed pooling region proposal based on the cropped plurality of corners.
  • the (trimmed) pooling region proposal may include four corners.
  • the AI processing device 142 may connect the four corners to determine a boundary on the feature map (s) .
  • the AI processing device 142 may map the boundary to the image to determine a boundary of the target object.
  • the boundary of the target object may be a quadrilateral box.
  • for each corner, a plurality of crop strategies may be determined based on a position of the corresponding corner. Furthermore, the cropping direction and/or the cropping length of each corner may be determined based on features of pixels in the pooling region proposal. That is, to crop a corner, the position of the corner and/or features of the pooling region proposal are taken into account.
  • a boundary of the target object determined according to the process disclosed in the present disclosure may be more suitable for the target object, especially, for a tilted target object, which may improve the accuracy of detecting and/or locating the target object.
  • the present disclosure may provide a suitable boundary for the tilted target object, and further improve the accuracy of detecting and/or locating the tilted target object.
  • operations 730 and 750 may be performed simultaneously.
  • operation 750 may be performed before operation 730.
  • the AI processing device 142 may repeat the process 700 to determine one or more boundaries corresponding to the target object.
  • FIG. 8 is a schematic diagram illustrating an exemplary process for determining a boundary of a target object according to some embodiments of the present disclosure.
  • an image is inputted into a convolutional neural network (CNN) .
  • the image may include one or more target objects (e.g., objects to be detected) .
  • the CNN may be generated based on a ZF model, VGG-16, ResNet-50, etc.
  • the CNN may include one or more convolution layers, one or more pooling layers and be without a full connection layer.
  • a plurality of feature maps may be generated.
  • the plurality of feature maps may include feature information of the image. Details regarding the generation of the feature maps may be found elsewhere in the present disclosure (e.g., operations 510 and 520, and the descriptions thereof) .
  • the plurality of feature maps may be inputted into a region proposal network (RPN) .
  • a sliding window may slide over the plurality of feature maps.
  • a plurality of sliding-window locations may be determined.
  • at each sliding-window location, a sub-region of the plurality of feature maps may be mapped to a multi-dimensional feature vector (e.g., a 256-dimensional feature vector) , and an anchor may be generated.
  • the anchor may correspond to a set of anchor boxes, each of which may be associated with a scale and an aspect ratio.
  • the RPN includes at least one regression layer and at least one classification layer.
  • the multi-dimensional feature vector and/or the set of anchor boxes are fed into the at least one regression layer and the at least one classification layer.
  • the output of the at least one regression layer may include four coordinate values of each of the set of preliminary region proposals.
  • the output of the at least one classification layer may include a first score of being foreground and a second score of being background of each of the set of preliminary region proposals.
  • a plurality of preliminary region proposals may be determined.
  • a portion of the plurality of preliminary region proposals may be selected as a plurality of region proposals.
  • the plurality of region proposals may include positive samples (e.g., foreground) and negative samples (e.g., background) .
  • the plurality of region proposals may be further processed. Details regarding the determination of the region proposals may be found elsewhere in the present disclosure (e.g., operation 530 of the process 500) .
  • an ROI pooling operation is performed based on the plurality of feature maps and the plurality of region proposals.
  • the plurality of region proposals may be mapped to the plurality of feature maps to determine a plurality of proposal feature maps (also referred to as ROIs) .
  • the plurality of ROIs may be resized to a canonical size (e.g., 7*7) by performing pooling on the plurality of ROIs. Then a plurality of pooling region proposals may be determined.
  • the plurality of pooling region proposals may be inputted into a classifier for further processing.
  • the plurality of pooling region proposals may be classified into one or more object categories (e.g., K categories) or a background category via the classifier. If a pooling region proposal is determined as the background category, the pooling region proposal may be omitted and/or removed.
  • the plurality of pooling region proposals may include one or more pooling region proposals corresponding to a target object. For a pooling region proposal, a boundary of the target object in the image may be determined based on the pooling region proposal. To determine the boundary of the target object, the pooling region proposal may be trimmed one or more times. In some embodiments, the pooling region proposal may include a plurality of corners, as shown in FIG. 8.
  • the pooling region proposal includes four corners, that is, a top left (TL) corner, a top right (TR) corner, a bottom left (BL) corner, and a bottom right (BR) corner.
  • Each of the four corners includes five crop strategies.
  • the five strategies of the top left corner include cropping to right (→) , cropping to bottom right (↘) , cropping to bottom (↓) , target position (T) and false (F) .
  • the five strategies of the top right corner include cropping to left (←) , cropping to bottom left (↙) , cropping to bottom (↓) , target position (T) and false (F) .
  • the five strategies of the bottom left corner include cropping to right (→) , cropping to top right (↗) , cropping to top (↑) , target position (T) and false (F) .
  • the five strategies of the bottom right corner include cropping to left (←) , cropping to top left (↖) , cropping to top (↑) , target position (T) and false (F) .
  • a crop strategy of each of the four corners may be determined based on features of pixels in the pooling region proposal. Whether any of the four corners corresponds to a crop strategy of false may then be determined. If at least one of the four corners corresponds to the crop strategy of false, it may be determined that the pooling region proposal does not encompass the whole target object, and the pooling region proposal may be abandoned and/or rejected.
  • alternatively, none of the four corners may correspond to the crop strategy of false. Whether each of the four corners corresponds to a crop strategy of target position may then be determined. If a corner does not correspond to the crop strategy of target position, the corner may be cropped based on the determined crop strategy. When each such corner is cropped according to its crop strategy, a trimmed pooling region proposal is determined. A bounding mapping may be performed based on the cropped four corners to determine a rectangular box. The rectangular box is resized into a canonical size to determine an updated pooling region proposal. The updated pooling region proposal may be further trimmed in a next iteration. When each of the four corners corresponds to the crop strategy of target position, no further cropping is performed.
  • a boundary of the (trimmed) pooling region proposal may be identified.
  • the boundary may be mapped to the image to determine a boundary of the target object. More descriptions of the determination of the boundary of the target object may be found elsewhere in the present disclosure (e.g., FIG. 7 and the descriptions thereof) .
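For illustration, the overall flow of FIG. 8 as a short Python sketch. Every argument is a hypothetical stand-in for a component sketched earlier, not a disclosed API; the sketch only fixes the order of the steps.

    def detect_objects(image, cnn, rpn, classify, trim, map_to_image):
        """Run the FIG. 8 pipeline and return boundaries in image space."""
        feature_maps = cnn(image)              # convolution + pooling layers
        boundaries = []
        for proposal in rpn(feature_maps):     # proposals (after NMS + ROI pooling)
            if classify(proposal) == "background":
                continue                       # background proposals omitted
            trimmed = trim(proposal)           # iterative corner cropping
            if trimmed is not None:            # None -> proposal abandoned
                boundaries.append(map_to_image(trimmed, image))
        return boundaries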
  • FIGs. 9A to 9C are schematic diagrams illustrating an image according to some embodiments of the present disclosure.
  • the image may include a target object, i.e., tilted characters “CHANEL” .
  • FIG. 9B shows a boundary 902 (also referred to as bounding box) of “CHANEL” , which is determined according to a Faster-RCNN algorithm.
  • FIG. 9C shows a boundary 904 of “CHANEL” , which is determined according to the process described in the present disclosure.
  • the boundary 902 includes more background than the target object, and thus cannot accurately locate the target object.
  • the boundary 904 includes less background, which enables accurate locating of the target object.
  • the process described in the present disclosure can improve the accuracy of object detection.
  • aspects of the present disclosure may be illustrated and described herein in any of a number of patentable classes or contexts, including any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof. Accordingly, aspects of the present disclosure may be implemented as entirely hardware, entirely software (including firmware, resident software, micro-code, etc.) , or a combination of software and hardware that may all generally be referred to herein as a “unit,” “module,” or “system.” Furthermore, aspects of the present disclosure may take the form of a computer program product embodied in one or more computer readable media having computer readable program code embodied thereon.
  • a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including electro-magnetic, optical, or the like, or any suitable combination thereof.
  • a computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that may communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer readable signal medium may be transmitted using any appropriate medium, including wireless, wireline, optical fiber cable, RF, or the like, or any suitable combination of the foregoing.
  • Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB. NET, Python or the like, conventional procedural programming languages, such as the "C" programming language, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP, ABAP, dynamic programming languages such as Python, Ruby and Groovy, or other programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN) , or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider) or in a cloud computing environment or offered as a service such as a Software as a Service (SaaS) .

Abstract

Systems and methods for object detection are provided. The systems and methods may obtain an image including a target object; generate feature map(s); determine region proposal(s) based on the feature map(s); determine pooling region proposal(s) based on the region proposal(s) and the feature map(s); and classify the pooling region proposal(s) into one or more object categories or a background category via a classifier. Each of the one or more pooling region proposals has a plurality of corners. For each pooling region proposal corresponding to the target object, the systems and methods may determine crop strategies for each corner according to a position of each corner; trim the pooling region proposal by cropping each corner according to one of the crop strategies; identify a boundary to the trimmed pooling region proposal based on the cropped corners; and map the boundary to the image to determine a boundary of the target object.

Description

AI SYSTEMS AND METHODS FOR OBJECT DETECTION
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims priority to Chinese Application No. 201811438175.5, filed on November 27, 2018, the entire contents of which are hereby incorporated by reference.
TECHNICAL FIELD
The present disclosure generally relates to systems and methods for image processing, and in particular, to systems and methods for detecting objects in an image.
BACKGROUND
With the emergence and popularity of artificial intelligent applications (e.g., face recognition, intelligent security cameras) , artificial intelligent object detection techniques, especially deep learning-based object detection techniques, have been rapidly developed. The artificial intelligent object detection techniques can identify and/or classify an object in an image, and locate the object in the image by drawing a bounding box. However, the bounding box may generally be a rectangular box. For an object that is irregular or tilted relative to the image (e.g., a safety belt) , the bounding box (e.g., a rectangular box) may include background. In some cases, the bounding box may include more background than the object, and thus cannot locate the object accurately. Thus, it is desirable to provide artificial intelligent systems and methods for determining a boundary for tilted object(s), which may enable accurate locating of the tilted object(s).
SUMMARY
The present disclosure relates to AI systems and methods for object detection. In one aspect of the present disclosure, an artificial intelligent image processing system for object detection is provided. The artificial intelligent image processing system may include at least one storage device and at least one processor in communication with the at least one storage device. The at least one storage device may include a set of  instructions for determining a boundary corresponding to an object in an image. When executing the set of instructions, the at least one processor may be directed to obtain an image including a target object, and generate a plurality of feature maps by inputting the image into a convolutional neural network (CNN) . The at least one processor may also be directed to determine a plurality of region proposals based on the plurality of feature maps, and determine a plurality of pooling region proposals based on the plurality of region proposals and the plurality of feature maps. The at least one processor may be further directed to classify the plurality of pooling region proposals into one or more object categories or a background category via a classifier. The one or more object categories may include a category of the target object, and the plurality of pooling region proposals may include one or more pooling region proposals corresponding to the target object. Each of the one or more pooling region proposals may have a plurality of corners. For each of the one or more pooling region proposals corresponding to the target object, the at least one processor may be directed to determine a plurality of crop strategies for each corner of the plurality of corners of the pooling region proposal according to a position of the corresponding corner; trim the pooling region proposal by cropping each of the plurality of corners according to one of the plurality of crop strategies; identify a boundary to the trimmed pooling region proposal based on the cropped plurality of corners; and map the boundary to the image to determine a boundary of the target object.
In some embodiments, the CNN may include one or more convolution layers and one or more pooling layers and is without a full connection layer.
In some embodiments, the plurality of region proposals may be determined according to a region proposal network (RPN) .
In some embodiments, the RPN may include at least one regression layer and at least one classification layer. To determine the plurality of region proposals, the at least one processor may be directed to slide a sliding window over the plurality of feature maps. At each sliding-window location, the sliding window may coincide with a sub-region of the plurality of feature maps. The at least one processor may be directed to  map the sub-region of the plurality of feature maps to a multi-dimensional feature vector, and generate an anchor by mapping a center pixel of the sub-region to a pixel of the image. The anchor may correspond to a set of anchor boxes in the image, and each of the set of anchor boxes may be associated with a scale and an aspect ratio. The at least one processor may also be directed to feed the multi-dimensional feature vector into the at least one regression layer and the at least one classification layer, respectively. The at least one regression layer may be configured to conduct bounding-box regression to determine a set of preliminary region proposals corresponding to the set of anchor boxes, and the output of the at least one regression layer may include four coordinate values of each of the set of preliminary region proposals. The at least one classification layer may be configured to determine a category for each of the set of preliminary region proposals. The category may be a foreground or a background, and the output of the at least one classification layer may include a first score of being foreground and a second score of being background of each of the set of preliminary region proposals. The at least one processor may be further directed to select a portion of the plurality of preliminary region proposals as the plurality of region proposals based on the first score of being foreground and the second score of being background of each of a plurality of preliminary region proposals and four coordinate values of each of the plurality of preliminary region proposals.
In some embodiments, to select a portion of the plurality of preliminary region proposals as the plurality of region proposals, the at least one processor may be directed to select the plurality of region proposals using a non-maximum suppression (NMS) .
In some embodiments, the plurality of pooling region proposals may correspond to a canonical size. To determine the plurality of pooling region proposals, the at least one processor may be further directed to map the plurality of region proposals to the plurality of feature maps to determine a plurality of proposal feature maps, and determine the plurality of pooling region proposals by performing pooling on the plurality of proposal feature maps.
In some embodiments, the plurality of corners may include a top left corner, a top right corner, a bottom left corner, and a bottom right corner. The plurality of crop strategies of the top left corner may include at least one of cropping to right, cropping to bottom, cropping to bottom right, target position, or false. The plurality of crop strategies of the top right corner may include at least one of cropping to left, cropping to bottom, cropping to bottom left, target position, or false. The plurality of crop strategies of the bottom left corner may include at least one of cropping to right, cropping to top, cropping to top right, target position, or false. The plurality of crop strategies of the bottom right corner may include at least one of cropping to left, cropping to top, cropping to top left, target position, or false.
In some embodiments, the at least one processor may be further directed to stop cropping one of the plurality of corners when the corner corresponds to a crop strategy of target position.
In some embodiments, to crop each of the plurality of corners, the at least one processor may be directed to determine a cropping direction and a cropping length for each of the plurality of corners based on the pooling region proposal. The cropping direction of each of the plurality of corners may be limited to one of the plurality of crop strategies of the corresponding corner. The at least one processor may also be directed to crop each of the plurality of corners based on the cropping direction and the cropping length.
In some embodiments, to trim the pooling region proposal by cropping each of the plurality of corners, the at least one processor may be directed to perform one or more iterations. In each of the one or more iterations, the at least one processor may be directed to determine a crop strategy for each of the plurality of corners from the plurality of crop strategies based on the pooling region proposal; determine whether one of the plurality of corners corresponds to a crop strategy of false; determine whether each of the plurality of corners corresponds to a crop strategy of target position in response to a determination that none of the plurality of corners corresponds to the crop strategy of false; in response to a determination that at least one of the plurality of corners does not correspond to the crop strategy of target position, crop the at least one of the plurality of corners according to the determined crop strategy of the at least one of the plurality of corners; perform, based on the cropped plurality of corners, a bounding mapping to determine a rectangular box; and resize the rectangular box into a canonical size. The at least one processor may also be directed to stop cropping the plurality of corners in response to a determination that each of the plurality of corners corresponds to the crop strategy of target position.
In some embodiments, the at least one processor may be further directed to abandon the pooling region proposal in response to a determination that at least one of the plurality of corners corresponds to the crop strategy of false.
In some embodiments, the at least one processor may be further directed to determine one or more boundaries corresponding to the target object; determine an intersection-over-union (IoU) between each of the one or more boundaries and a ground truth; and determine one of the one or more boundaries with the greatest IoU as a target boundary corresponding to the target object.
In some embodiments, the boundary of the target object may be a quadrilateral box.
In another aspect of the present disclosure, an artificial intelligent image processing method is provided. The artificial intelligent image processing method may be implemented on a computing device. The computing device may have at least one processor, at least one computer-readable storage medium, and a communication platform connected to a network. The method may include obtaining an image including a target object, and generating a plurality of feature maps by inputting the image into a convolutional neural network (CNN) . The method may also include determining a plurality of region proposals based on the plurality of feature maps, and determining a plurality of pooling region proposals based on the plurality of region proposals and the plurality of feature maps. The method may further include classifying the plurality of  pooling region proposals into one or more object categories or a background category via a classifier. The one or more object categories may include a category of the target object, and the plurality of pooling region proposals may include one or more pooling region proposals corresponding to the target object. Each of the one or more pooling region proposals may have a plurality of corners. For each of the one or more pooling region proposals corresponding to the target object, the method may include determining a plurality of crop strategies for each corner of the plurality of corners of the pooling region proposal according to a position of the corresponding corner; trimming the pooling region proposal by cropping each of the plurality of corners according to one of the plurality of crop strategies; identifying a boundary to the trimmed pooling region proposal based on the cropped plurality of corners; and mapping the boundary to the image to determine a boundary of the target object.
In another aspect of the present disclosure, a non-transitory computer-readable storage medium is provided. The non-transitory computer-readable storage medium may include at least one set of instructions for artificial intelligent object detection. When executed by at least one processor of a computing device, the at least one set of instructions may direct the at least one processor to perform acts of obtaining an image including a target object; generating a plurality of feature maps by inputting the image into a convolutional neural network (CNN) ; determining a plurality of region proposals based on the plurality of feature maps; determining a plurality of pooling region proposals based on the plurality of region proposals and the plurality of feature maps; classifying the plurality of pooling region proposals into one or more object categories or a background category via a classifier. The one or more object categories may include a category of the target object, and the plurality of pooling region proposals may include one or more pooling region proposals corresponding to the target object. Each of the one or more pooling region proposals may have a plurality of corners. For each of the one or more pooling region proposals corresponding to the target object, the at least one set of instructions may also direct the at least one processor to perform acts of determining a plurality of crop  strategies for each corner of the plurality of corners of the pooling region proposal according to a position of the corresponding corner; trimming the pooling region proposal by cropping each of the plurality of corners according to one of the plurality of crop strategies; identifying a boundary to the trimmed pooling region proposal based on the cropped plurality of corners; and mapping the boundary to the image to determine a boundary of the target object.
In another aspect of the present disclosure, an artificial intelligent image processing system for object detection is provided. The artificial intelligent image processing system may include an acquisition module configured to obtain an image including a target object; a feature map determination module configured to generate a plurality of feature maps by inputting the image into a convolutional neural network (CNN) ; a region proposal determination module configured to determine a plurality of region proposals based on the plurality of feature maps; a pooling region proposal determination module configured to determine a plurality of pooling region proposals based on the plurality of region proposals and the plurality of feature maps; a classification module configured to classify the plurality of pooling region proposals into one or more object categories or a background category via a classifier. The one or more object categories may include a category of the target object, and the plurality of pooling region proposals may include one or more pooling region proposals corresponding to the target object. Each of the one or more pooling region proposals may have a plurality of corners. For each of the one or more pooling region proposals corresponding to the target object, the artificial intelligent image processing system may also include a boundary determination module configured to determine a plurality of crop strategies for each corner of the plurality of corners of the pooling region proposal according to a position of the corresponding corner; the boundary determination module configured to trim the pooling region proposal by cropping each of the plurality of corners according to one of the plurality of crop strategies; the boundary determination module configured to identify a boundary to the trimmed pooling region proposal based on the cropped plurality of corners; and the  boundary determination module configured to map the boundary to the image to determine a boundary of the target object.
Additional features will be set forth in part in the description which follows, and in part will become apparent to those skilled in the art upon examination of the following and the accompanying drawings or may be learned by production or operation of the examples. The features of the present disclosure may be realized and attained by practice or use of various aspects of the methodologies, instrumentalities and combinations set forth in the detailed examples discussed below.
BRIEF DESCRIPTION OF THE DRAWINGS
The present disclosure is further described in terms of exemplary embodiments. These exemplary embodiments are described in detail with reference to the drawings. The drawings are not to scale. These embodiments are non-limiting schematic embodiments, in which like reference numerals represent similar structures throughout the several views of the drawings, and wherein:
FIG. 1 is a schematic diagram illustrating an exemplary AI image processing system according to some embodiments of the present disclosure;
FIG. 2 is a schematic diagram illustrating exemplary hardware and/or software components of a computing device according to some embodiments of the present disclosure;
FIG. 3 is a schematic diagram illustrating exemplary hardware and/or software components of a mobile device according to some embodiments of the present disclosure;
FIG. 4 is a block diagram illustrating an exemplary AI processing device according to some embodiments of the present disclosure;
FIG. 5 is a flowchart illustrating an exemplary process for determining a boundary of a target object according to some embodiments of the present disclosure;
FIG. 6 is a schematic diagram illustrating an exemplary region proposal network according to some embodiments of the present disclosure;
FIG. 7 is a flowchart illustrating an exemplary process for determining a boundary of a target object according to some embodiments of the present disclosure;
FIG. 8 is a schematic diagram illustrating an exemplary process for determining a boundary of a target object according to some embodiments of the present disclosure; and
FIGs. 9A to 9C are schematic diagrams illustrating an image according to some embodiments of the present disclosure.
DETAILED DESCRIPTION
The following description is presented to enable any person skilled in the art to make and use the present disclosure, and is provided in the context of a particular application and its requirements. Various modifications to the disclosed embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the present disclosure. Thus, the present disclosure is not limited to the embodiments shown, but is to be accorded the widest scope consistent with the claims.
The terminology used herein is for the purpose of describing particular example embodiments only and is not intended to be limiting. As used herein, the singular forms “a,” “an, ” and “the” may be intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprise, ” “comprises, ” and/or “comprising, ” “include, ” “includes, ” and/or “including, ” when used in this disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
These and other features, and characteristics of the present disclosure, as well as the methods of operations and functions of the related elements of structure and the combination of parts and economies of manufacture, may become more apparent upon consideration of the following description with reference to the accompanying drawings, all  of which form a part of this disclosure. It is to be expressly understood, however, that the drawings are for the purpose of illustration and description only and are not intended to limit the scope of the present disclosure. It is understood that the drawings are not to scale.
The flowcharts used in the present disclosure illustrate operations that systems implement according to some embodiments of the present disclosure. It is to be expressly understood, the operations of the flowcharts may be implemented not in order. Conversely, the operations may be implemented in inverted order, or simultaneously. Moreover, one or more other operations may be added to the flowcharts. One or more operations may be removed from the flowcharts.
Moreover, while the system and method in the present disclosure is described primarily regarding an on-demand transportation service, it should also be understood that this is only one exemplary embodiment. The system or method of the present disclosure may be applied to any other kind of on demand service. For example, the system or method of the present disclosure may be applied to transportation systems of different environments including land, ocean, aerospace, or the like, or any combination thereof. The vehicle of the transportation systems may include a taxi, a private car, a hitch, a bus, a train, a bullet train, a high-speed rail, a subway, a vessel, an aircraft, a spaceship, a hot-air balloon, a driverless vehicle, or the like, or any combination thereof. The transportation system may also include any transportation system for management and/or distribution, for example, a system for sending and/or receiving an express. The application of the system or method of the present disclosure may include a web page, a plug-in of a browser, a client terminal, a custom system, an internal analysis system, an artificial intelligence robot, or the like, or any combination thereof.
The terms “passenger,” “requester,” “service requester,” and “customer” in the present disclosure are used interchangeably to refer to an individual, an entity, or a tool that may request or order a service. Also, the terms “driver,” “provider,” “service provider,” and “supplier” in the present disclosure are used interchangeably to refer to an individual, an entity, or a tool that may provide a service or facilitate the providing of the service. The term “user” in the present disclosure may refer to an individual, an entity, or a tool that may request a service, order a service, provide a service, or facilitate the providing of the service. For example, the user may be a passenger, a driver, an operator, or the like, or any combination thereof. In the present disclosure, “passenger” and “passenger terminal” may be used interchangeably, and “driver” and “driver terminal” may be used interchangeably.
The terms “service request” and “order” in the present disclosure are used interchangeably to refer to a request that may be initiated by a passenger, a requester, a service requester, a customer, a driver, a provider, a service provider, a supplier, or the like, or any combination thereof. The service request may be accepted by any one of a passenger, a requester, a service requester, a customer, a driver, a provider, a service provider, or a supplier. The service request may be chargeable or free.
The positioning technology used in the present disclosure may be based on a global positioning system (GPS), a global navigation satellite system (GLONASS), a compass navigation system (COMPASS), a Galileo positioning system, a quasi-zenith satellite system (QZSS), a wireless fidelity (WiFi) positioning technology, or the like, or any combination thereof. One or more of the above positioning systems may be used interchangeably in the present disclosure.
The present disclosure relates to artificial intelligence (AI) systems and methods for object detection in an image. Specifically, the AI systems and methods may determine a boundary for a target object in the image. The determined boundary of the target object may be a quadrilateral box. To determine the boundary of the target object, the AI systems and methods may input the image into a convolutional neural network (CNN) to generate a plurality of feature maps, and generate a plurality of region proposals based on the plurality of feature maps. The AI systems and methods may determine a plurality of pooling region proposals based on the plurality of region proposals and the plurality of feature maps by performing an ROI pooling operation. The AI systems and methods may further classify the plurality of pooling region proposals into one or more object categories or a background category via a classifier. The plurality of pooling region proposals may include one or more pooling region proposals corresponding to the target object. Each of the one or more pooling region proposals may have a plurality of corners. For a pooling region proposal corresponding to the target object, the AI systems and methods may determine a plurality of crop strategies for each corner of the plurality of corners of the pooling region proposal according to a position of the corresponding corner. The AI systems and methods may also trim the pooling region proposal by cropping each of the plurality of corners according to one of the plurality of crop strategies. In some embodiments, the AI systems and methods may crop a corner based on a cropping direction and a cropping length, which may be determined based on the pooling region proposal. The AI systems and methods may identify a boundary to the trimmed pooling region proposal based on the cropped plurality of corners, and map the boundary to the image to determine the boundary of the target object. In the present disclosure, information (e.g., positions of the corners) related to the corners of the pooling region proposal and features in the pooling region proposal may be considered. Thus, the boundary of the target object determined according to the present disclosure may be more suitable for the target object, especially for a tilted target object (e.g., a safety belt, tilted characters), which may improve the accuracy of locating the target object.
FIG. 1 is a schematic diagram illustrating an exemplary AI image processing system 100 according to some embodiments of the present disclosure. The AI image processing system 100 may be configured for object detection. For example, the AI image processing system 100 may determine a boundary corresponding to an object in an image. In some embodiments, the AI image processing system 100 may be an online platform providing an Online to Offline (O2O) service. The AI image processing system 100 may include a sensor 110, a network 120, a terminal 130, a server 140, and a storage device 150.
The sensor 110 may be configured to capture one or more images. As used in this application, an image may be a still image, a video, a stream video, or a video frame obtained from a video. The image may be a three-dimensional (3D) image or a two-dimensional (2D) image. The sensor 110 may be or include one or more cameras. In some embodiments, the sensor 110 may be a digital camera, a video camera, a security camera, a web camera, a smartphone, a tablet, a laptop, a video gaming console equipped with a web camera, a camera with multiple lenses, a camcorder, etc. In some embodiments, the sensor 110 (e.g., a camera) may capture an image including one or more objects.
The network 120 may facilitate the exchange of information and/or data. In some embodiments, one or more components of the AI image processing system 100 (e.g., the sensor 110, the terminal 130, the server 140, the storage device 150) may send information and/or data to other component(s) in the AI image processing system 100 via the network 120. For example, the server 140 may process an image obtained from the sensor 110 via the network 120. As another example, the server 140 may obtain user instructions from the terminal 130 via the network 120. In some embodiments, the network 120 may be any type of wired or wireless network, or combination thereof. Merely by way of example, the network 120 may include a cable network, a wireline network, an optical fiber network, a telecommunications network, an intranet, the Internet, a local area network (LAN), a wide area network (WAN), a wireless local area network (WLAN), a metropolitan area network (MAN), a public telephone switched network (PSTN), a Bluetooth™ network, a ZigBee™ network, a near field communication (NFC) network, or the like, or any combination thereof. In some embodiments, the network 120 may include one or more network access points. For example, the network 120 may include wired or wireless network access points such as base stations and/or internet exchange points 120-1, 120-2, …, through which one or more components of the AI image processing system 100 may be connected to the network 120 to exchange data and/or information.
The terminal 130 may include a mobile device 130-1, a tablet computer 130-2, a laptop computer 130-3, or the like, or any combination thereof. In some embodiments, the mobile device 130-1 may include a smart home device, a wearable device, a smart mobile device, a virtual reality device, an augmented reality device, or the like, or any combination thereof. In some embodiments, the smart home device may include a smart lighting device, a control device of an intelligent electrical apparatus, a smart monitoring device, a smart television, a smart video camera, an interphone, or the like, or any combination thereof. In some embodiments, the wearable device may include a bracelet, footgear, eyeglasses, a helmet, a watch, clothing, a backpack, an accessory, or the like, or any combination thereof. In some embodiments, the smart mobile device may include a smartphone, a personal digital assistant (PDA), a gaming device, a navigation device, a point of sale (POS) device, or the like, or any combination thereof. In some embodiments, the virtual reality device and/or the augmented reality device may include a virtual reality helmet, virtual reality glasses, a virtual reality patch, an augmented reality helmet, augmented reality glasses, an augmented reality patch, or the like, or any combination thereof. For example, the virtual reality device and/or the augmented reality device may include a Google Glass, an Oculus Rift, a HoloLens, a Gear VR, etc. In some embodiments, the terminal 130 may remotely operate the sensor 110. In some embodiments, the terminal 130 may operate the sensor 110 via a wireless connection. In some embodiments, the terminal 130 may receive information and/or instructions inputted by a user, and send the received information and/or instructions to the sensor 110 or to the server 140 via the network 120. In some embodiments, the terminal 130 may receive data and/or information from the server 140. In some embodiments, the terminal 130 may be part of the server 140. In some embodiments, the terminal 130 may be omitted.
In some embodiments, the server 140 may be a single server or a server group. The server group may be centralized, or distributed (e.g., the server 140 may be a distributed system). In some embodiments, the server 140 may be local or remote. For example, the server 140 may access information and/or data stored in the sensor 110, the terminal 130, and/or the storage device 150 via the network 120. As another example, the server 140 may be directly connected to the sensor 110, the terminal 130, and/or the storage device 150 to access stored information and/or data. In some embodiments, the server 140 may be implemented on a cloud platform. Merely by way of example, the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an inter-cloud, a multi-cloud, or the like, or any combination thereof. In some embodiments, the server 140 may be implemented on a computing device 200 having one or more components illustrated in FIG. 2 in the present disclosure.
In some embodiments, the server 140 may include an AI processing device 142. The AI processing device 142 may process information and/or data to perform one or more functions described in the present disclosure. For example, the AI processing device 142 may process an image including a target object to determine a boundary of the target object in the image. In some embodiments, the AI processing device 142 may include one or more processing devices (e.g., single-core processing device (s) or multi-core processor (s) ) . Merely by way of example, the AI processing device 142 may include a central processing unit (CPU) , an application-specific integrated circuit (ASIC) , an application-specific instruction-set processor (ASIP) , a graphics processing unit (GPU) , a physics processing unit (PPU) , a digital signal processor (DSP) , a field-programmable gate array (FPGA) , a programmable logic device (PLD) , a controller, a microcontroller unit, a reduced instruction-set computer (RISC) , a microprocessor, or the like, or any combination thereof.
The storage device 150 may store data and/or instructions. In some embodiments, the storage device 150 may store data obtained from the terminal 130 and/or the server 140. In some embodiments, the storage device 150 may store data and/or instructions that the server 140 may execute or use to perform exemplary methods described in the present disclosure. In some embodiments, the storage device 150 may include a mass storage, a removable storage, a volatile read-and-write memory, a read-only memory (ROM), or the like, or any combination thereof. Exemplary mass storage may include a magnetic disk, an optical disk, a solid-state drive, etc. Exemplary removable storage may include a flash drive, a floppy disk, an optical disk, a memory card, a zip disk, a magnetic tape, etc. Exemplary volatile read-and-write memory may include a random-access memory (RAM). Exemplary RAM may include a dynamic RAM (DRAM), a double data rate synchronous dynamic RAM (DDR SDRAM), a static RAM (SRAM), a thyristor RAM (T-RAM), a zero-capacitor RAM (Z-RAM), etc. Exemplary ROM may include a mask ROM (MROM), a programmable ROM (PROM), an erasable programmable ROM (EPROM), an electrically erasable programmable ROM (EEPROM), a compact disk ROM (CD-ROM), a digital versatile disk ROM, etc. In some embodiments, the storage device 150 may be implemented on a cloud platform. Merely by way of example, the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an inter-cloud, a multi-cloud, or the like, or any combination thereof.
In some embodiments, the storage device 150 may be connected to the network 120 to communicate with one or more components of the AI image processing system 100 (e.g., the sensor 110, the terminal 130, the server 140) . One or more components in the AI image processing system 100 may access the data or instructions stored in the storage device 150 via the network 120. In some embodiments, the storage device 150 may be directly connected to or communicate with one or more components in the AI image processing system 100 (e.g., the sensor 110, the terminal 130, the server 140) . In some embodiments, the storage device 150 may be part of the sensor 110.
One of ordinary skill in the art would understand that when an element (or component) of the AI image processing system 100 performs an operation, the element may perform the operation through electrical signals and/or electromagnetic signals. For example, when a terminal 130 transmits a request to the server 140, a processor of the terminal 130 may generate an electrical signal encoding the request. The processor of the terminal 130 may then transmit the electrical signal to an output port. If the terminal 130 communicates with the server 140 via a wired network, the output port may be physically connected to a cable, which may further transmit the electrical signal to an input port of the server 140. If the terminal 130 communicates with the server 140 via a wireless network, the output port of the terminal 130 may be one or more antennas, which convert the electrical signal to an electromagnetic signal. Within an electronic device, such as the terminal 130 and/or the server 140, when a processor thereof processes an instruction, transmits an instruction, and/or performs an action, the instruction and/or action is conducted via electrical signals. For example, when the processor retrieves or saves data from a storage medium, it may transmit electrical signals to a read/write device of the storage medium, which may read or write structured data in the storage medium. The structured data may be transmitted to the processor in the form of electrical signals via a bus of the electronic device. Here, an electrical signal may refer to one electrical signal, a series of electrical signals, and/or a plurality of discrete electrical signals.
FIG. 2 is a schematic diagram illustrating exemplary hardware and software components of a computing device 200 according to some embodiments of the present disclosure. In some embodiments, the terminal 130, and/or the server 140 may be implemented on the computing device 200. For example, the AI processing device 142 of the server 140 may be implemented on the computing device 200 and configured to perform functions of the AI processing device 142 disclosed in this disclosure.
The computing device 200 may be a special purpose computer, and may be used to implement an AI image processing system 100 for the present disclosure. The computing device 200 may be used to implement any component of the AI image processing system 100 as described herein. For example, the AI processing device 142 may be implemented on the computing device, via its hardware, software program, firmware, or a combination thereof. Although only one such computer is shown, for convenience, the computer functions relating to the image processing as described herein may be implemented in a distributed fashion on several similar platforms, to distribute the processing load.
The computing device 200, for example, may include a COM port 250 connected with a network that may implement the data communications. The computing device 200 may also include a processor 220, in the form of one or more processors (or CPUs), for executing program instructions. The exemplary computing device may include an internal communication bus 210, different types of program storage units and data storage units (e.g., a disk 270, a read only memory (ROM) 230, a random-access memory (RAM) 240), and various data files applicable to computer processing and/or communication. The exemplary computing device may also include program instructions stored in the ROM 230, the RAM 240, and/or other types of non-transitory storage media to be executed by the processor 220. The method and/or process of the present disclosure may be implemented as the program instructions. The computing device 200 also includes an I/O device 260 that may support the input and/or output of data flows between the computing device 200 and other components. The computing device 200 may also receive programs and data via the communication network.
Merely for illustration, only one CPU and/or processor is described in the computing device 200. However, it should be noted that the computing device 200 in the present disclosure may also include multiple CPUs and/or processors, thus operations and/or method steps that are performed by one CPU and/or processor as described in the present disclosure may also be jointly or separately performed by the multiple CPUs and/or processors. For example, if in the present disclosure the CPU and/or processor of the computing device 200 executes both step A and step B, it should be understood that step A and step B may also be performed by two different CPUs and/or processors jointly or separately in the computing device 200 (e.g., the first processor executes step A and the second processor executes step B, or the first and second processors jointly execute steps A and B) .
FIG. 3 is a schematic diagram illustrating exemplary hardware and/or software components of a mobile device 300 according to some embodiments of the present disclosure. In some embodiments, a terminal (e.g., the terminal 130) may be implemented on the mobile device 300. As illustrated in FIG. 3, the mobile device 300 may include a communication platform 310, a display 320, a graphics processing unit (GPU) 330, a central processing unit (CPU) 340, an I/O 350, a memory 360, an operating system (OS) 370, and a storage 390. In some embodiments, any other suitable component, including but not limited to a system bus or a controller (not shown), may also be included in the mobile device 300.
In some embodiments, a mobile operating system 370 (e.g., iOS™, Android™, Windows Phone™, etc.) and one or more applications 380 may be loaded into the memory 360 from the storage 390 in order to be executed by the CPU 340. The applications 380 may include a browser or any other suitable mobile apps for receiving and rendering information relating to image processing or other information from the AI image processing system 100. User interactions with the information stream may be achieved via the I/O 350 and provided to the storage device 150, the server 140, and/or other components of the AI image processing system 100.
To implement various modules, units, and their functionalities described in the present disclosure, computer hardware platforms may be used as the hardware platform(s) for one or more of the elements described herein. A computer with user interface elements may be used to implement a personal computer (PC) or any other type of work station or terminal device. A computer may also act as a server if appropriately programmed.
FIG. 4 is a block diagram illustrating an exemplary AI processing device 142 according to some embodiments of the present disclosure. The AI processing device 142 may include an acquisition module 401, a feature map determination module 403, a region proposal determination module 405, a pooling region proposal determination module 407, a classification module 409, and a boundary determination module 411. The modules may be hardware circuits of all or part of the AI processing device 142. The modules may also be implemented as an application or set of instructions read and executed by the AI processing device 142. Further, the modules may be any combination of the hardware circuits and the application/instructions. For example, the modules may be the part of the AI processing device 142 when the AI processing device 142 is executing the  application/set of instructions.
The acquisition module 401 may be configured to obtain information and/or data related to the AI image processing system 100. In some embodiments, the acquisition module 401 may obtain an image including a target object. In some embodiments, the image may be a still image or a video captured by the sensor 110. In some embodiments, the target object may refer to an object that is to be identified and/or detected in the image. For example, the target object may be an object tilted relative to the image (e.g., a safety belt, tilted characters). Alternatively, all objects in the image may need to be identified and/or detected, and each object in the image may be referred to as a target object. In some embodiments, the acquisition module 401 may obtain the image from one or more components of the AI image processing system 100, such as the sensor 110, the terminal 130, a storage (e.g., the storage device 150), or from an external source (e.g., ImageNet) via the network 120.
The feature map determination module 403 may be configured to generate a plurality of feature maps by inputting an image (e.g., the image obtained by the acquisition module 401) into a convolutional neural network (CNN). The CNN may be a trained CNN including one or more convolution layers and one or more pooling layers, without a full connection layer. The convolution layer(s) may be configured to extract features (or feature maps) of an image. The pooling layer(s) may be configured to reduce the size of the feature maps of the image. The feature maps may include feature information of the image.
The region proposal determination module 405 may be configured to determine a plurality of region proposals based on the plurality of feature maps. In some embodiments, the region proposal determination module 405 may determine the plurality of region proposals according to a region proposal network (RPN). Specifically, the region proposal determination module 405 may slide a sliding window over the plurality of feature maps. With the sliding of the sliding window over the plurality of feature maps, a plurality of sliding-window locations may be determined. At each sliding-window location, a set of preliminary region proposals may be determined. Since there is a plurality of sliding-window locations, a plurality of preliminary region proposals may be determined at the plurality of sliding-window locations. In some embodiments, multiple preliminary region proposals may highly overlap with each other, and the region proposal determination module 405 may select a portion of the plurality of preliminary region proposals as the plurality of region proposals. Merely by way of example, the region proposal determination module 405 may determine the plurality of region proposals using non-maximum suppression (NMS). Details regarding the determination of the region proposals may be found elsewhere in the present disclosure (e.g., operation 530 of the process 500 and the descriptions thereof).
The pooling region proposal determination module 407 may be configured to determine a plurality of pooling region proposals based on the plurality of region proposals and the plurality of feature maps. In some embodiments, the pooling region proposal determination module 407 may map the plurality of region proposals to the plurality of feature maps to determine a plurality of proposal feature maps (also referred to as regions of interest (ROIs)). The pooling region proposal determination module 407 may then determine the plurality of pooling region proposals by performing pooling on the plurality of proposal feature maps (or ROIs).
The classification module 409 may be configured to classify the plurality of pooling region proposals into one or more object categories or a background category via the classifier. In some embodiments, the classification module 409 may classify negative samples in the pooling region proposals into the background category. If a pooling region proposal is determined as belonging to the background category, the pooling region proposal may be omitted and not further processed. In some embodiments, the classification module 409 may classify a pooling region proposal corresponding to a positive sample into one of the one or more object categories. The one or more object categories may be default settings of the AI image processing system 100, and/or may be adjusted by a user.
The one or more object categories may include a category of the target object. The  classification module 409 may select one or more pooling region proposals corresponding to the target object from the plurality of pooling region proposals.
The boundary determination module 411 may be configured to determine a boundary of the target object in the image based on at least one of the one or more pooling region proposals. In some embodiments, the boundary may be a polygonal box, for example, a quadrilateral box. Merely by way of example, for a pooling region proposal having a plurality of corners (e.g., 4 corners, 5 corners, 8 corners, etc. ) , the boundary determination module 411 may determine a plurality of crop strategies for each corner according to a position of the corresponding corner. The boundary determination module 411 may trim the pooling region proposal by cropping each of the plurality of corners according to one of the plurality of crop strategies. The boundary determination module 411 may identify a boundary to the trimmed pooling region proposal based on the cropped plurality of corners, and map the boundary to the image to determine a boundary of the target object. Details regarding the determination of a boundary may be found elsewhere in the present disclosure (e.g., operation 560 of the process 500, process 700, and the descriptions thereof) .
In some embodiments, the boundary determination module 411 may determine one or more boundaries corresponding to the target object based on the one or more pooling region proposals. The boundary determination module 411 may determine an IoU between each of the one or more boundaries and a ground truth. In some embodiments, the ground truth may indicate a labelled boundary box of the target object. The IoU between a boundary and the ground truth may reflect a degree of overlapping of the boundary and the ground truth. The boundary determination module 411 may compare one or more determined IoUs related to the one or more boundaries, and determine a boundary with the greatest IoU as a target boundary corresponding to the target object.
The modules in the AI processing device 142 may be connected to or communicate with each other via a wired connection or a wireless connection. The wired  connection may include a metal cable, an optical cable, a hybrid cable, or the like, or any combination thereof. The wireless connection may include a Local Area Network (LAN) , a Wide Area Network (WAN) , a Bluetooth, a ZigBee, a Near Field Communication (NFC) , or the like, or any combination thereof.
It should be noted that the above description is merely provided for the purposes of illustration, and not intended to limit the scope of the present disclosure. For persons having ordinary skills in the art, multiple variations and modifications may be made under the teachings of the present disclosure. However, those variations and modifications do not depart from the scope of the present disclosure. In some embodiments, the AI processing device 142 may further include one or more additional modules. For example, the AI processing device 142 may further include a storage module (not shown in FIG. 4) configured to store data generated by the modules of the AI processing device 142.
FIG. 5 is a flowchart illustrating an exemplary process 500 for determining a boundary of a target object according to some embodiments of the present disclosure. For illustration purposes only, the AI processing device 142 may be described as a subject to perform the process 500. However, one of ordinary skill in the art would understand that the process 500 may also be performed by other entities. For example, one of ordinary skill in the art would understand that at least a portion of the process 500 may also be implemented on the computing device 200 as illustrated in FIG. 2 or the mobile device 300 as illustrated in FIG. 3. In some embodiments, one or more operations of process 500 may be implemented in the AI image processing system 100 as illustrated in FIG. 1. In some embodiments, one or more operations in the process 500 may be stored in the storage device 150 and/or the storage (e.g., the ROM 230, the RAM 240, etc.) in the form of instructions, and invoked and/or executed by the server 140 (e.g., the AI processing device 142 in the server 140, or the processor 220 of the AI processing device 142 in the server 140). In some embodiments, the instructions may be transmitted in a form of electronic current or electrical signals.
In 510, the AI processing device 142 (e.g., the acquisition module 401) may obtain an image including a target object. In some embodiments, the image may be an image captured by the sensor 110 (e.g., a camera of a smartphone, a camera in an autonomous vehicle, an intelligent security camera, a traffic camera). The captured image may be a still image, a video, etc. In some embodiments, the image may include multiple objects, such as people, animals (e.g., dog, cat), vehicles (e.g., bike, car, bus, truck), plants (e.g., flower, tree), buildings, scenery, or the like, or any combination thereof. In some embodiments, the image may include an object tilted relative to the image, such as a safety belt, tilted characters, etc. In some embodiments, the target object may refer to an object that is to be identified and/or detected in the image. For example, the target object may be an object tilted relative to the image (e.g., the safety belt, the tilted characters). Alternatively, all objects in the image may need to be identified and/or detected, and each object in the image may be referred to as a target object.
In some embodiments, the AI processing device 142 may obtain the image from one or more components of the AI image processing system 100, such as the sensor 110, the terminal 130, a storage device (e.g., the storage device 150) . Alternatively or additionally, the AI processing device 142 may obtain the image from an external source via the network 120. For example, the AI processing device 142 may obtain the image from ImageNet, etc.
In 520, the AI processing device 142 (e.g., the feature map determination module 403) may generate a plurality of feature maps by inputting the image into a convolutional neural network (CNN). The plurality of feature maps may include feature information of the image. In some embodiments, the CNN may be generated based on a Zeiler and Fergus model (ZF), VGG-16, ResNet-50, etc. In some embodiments, the CNN may be a trained CNN including one or more convolution layers and one or more pooling layers, without a full connection layer. The convolution layer(s) may be configured to extract features (or feature maps) of an image (e.g., the image obtained in 510). The pooling layer(s) may be configured to reduce the size of the feature maps of the image. In some embodiments, the image may be inputted into the CNN, and a plurality of feature maps may be generated. Merely by way of example, the CNN may be determined based on a ZF model. An image with the size of 600*1000 may be inputted into the ZF model, and 256 feature maps may be outputted from the ZF model. The size of each of the 256 feature maps may be 40*60.
In some embodiments, the CNN may be generated according to transfer learning. Transfer learning may reduce the training time by using previously obtained knowledge. Specifically, a base network may be a pre-trained network trained previously based on a plurality of first training samples obtained from a dataset (e.g., ImageNet dataset, PASCAL VOC, COCO, etc.). The base network may include one or more layers (e.g., convolution layer(s), pooling layer(s)) and a plurality of pre-trained weights. At least some of the one or more layers and their corresponding pre-trained weights may be transferred to a target network. For example, the base network may be VGG-16, including thirteen convolution layers, five pooling layers, and three full connection layers. The thirteen convolution layers and the first four pooling layers may be transferred to a target network (e.g., the CNN). In some embodiments, the pre-trained weights of the convolution layers and/or the pooling layers may not need to be adjusted, or may be fine-tuned based on a plurality of second training samples obtained from a dataset (e.g., ImageNet dataset, PASCAL VOC, COCO, etc.). In some embodiments, the target network may further include one or more additional layers other than the transferred layers. The weights in the additional layer(s) may be updated according to a plurality of third training samples obtained from a dataset (e.g., ImageNet dataset, PASCAL VOC, COCO, etc.). It should be noted that, in some embodiments, different from the transfer learning, the CNN may be directly generated by training a preliminary CNN using a plurality of fourth training samples obtained from a dataset (e.g., ImageNet dataset, PASCAL VOC, COCO, etc.).
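Merely for illustration, the following sketch shows one way such a transferred backbone could be assembled. It assumes PyTorch and torchvision (including the torchvision weights-enum API), which are not named in the present disclosure; the truncation point and the 600*1000 input size follow the description above.

```python
import torch
import torchvision

# Truncated VGG-16 backbone: keep the thirteen convolution layers and the
# first four pooling layers, drop the final pooling layer and all full
# connection layers, giving an output stride of 16.
vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1")  # pre-trained weights
backbone = torch.nn.Sequential(*list(vgg.features.children())[:-1])

image = torch.randn(1, 3, 600, 1000)   # one 600*1000 RGB image
feature_maps = backbone(image)
print(feature_maps.shape)              # torch.Size([1, 512, 37, 62])
```

Freezing the transferred weights, or fine-tuning them on the second training samples, corresponds to the two options described above.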
In 530, the AI processing device 142 (e.g., the region proposal determination module 405) may determine a plurality of region proposals based on the plurality of feature maps. In some embodiments, the AI processing device 142 may determine the plurality  of region proposals according to a region proposal network (RPN) . As shown in FIG. 6, the RPN may include at least one regression layer and at least one classification layer.
In some embodiments, the AI processing device 142 may slide a sliding window over the plurality of feature maps. The sliding window may also be referred to as a convolution kernel that has a size of, for example, 3*3, 5*5, etc. With the sliding of the sliding window over the plurality of feature maps, a plurality of sliding-window locations may be determined. Merely by way of example, the size of the sliding window may be 3*3, and the size of the plurality of feature maps may be 40*60. A padding operation (e.g., padding=1) may be performed on the plurality of feature maps. When sliding the sliding window over the plurality of feature maps, approximately 40*60 (2400) sliding-window locations may be determined.
At each sliding-window location, the sliding window may coincide with a sub-region of the plurality of feature maps. In some embodiments, the AI processing device 142 may map the sub-region of the plurality of feature maps to a multi-dimensional feature vector. For example, if there are 256 feature maps, a 256-dimensional feature vector may be generated at the sub-region. The AI processing device 142 may generate an anchor by mapping a center pixel of the sub-region to a pixel of the image obtained in 510. In some embodiments, the anchor may correspond to a set of anchor boxes (e.g., including k anchor boxes) in the image. Each of the set of anchor boxes may be a rectangular box. The anchor may be a center point of the set of anchor boxes. Each of the set of anchor boxes may be associated with a scale and an aspect ratio. Merely by way of example, if 3 scales (e.g., 128, 256, 512, etc.) and 3 aspect ratios (e.g., 1:1, 1:2, 2:1, etc.) are applied, the number of the set of anchor boxes may be 9. In some embodiments, the AI processing device 142 may feed the multi-dimensional feature vector and/or the set of anchor boxes into the at least one regression layer and the at least one classification layer, respectively. In some embodiments, the at least one regression layer may be configured to conduct bounding-box regression to determine a set of preliminary region proposals corresponding to the set of anchor boxes. The output of the at least one regression layer may include four coordinate values of each of the set of preliminary region proposals. In some embodiments, the four coordinate values of a preliminary region proposal may include a location of the preliminary region proposal (e.g., coordinates (x, y) of the anchor of the corresponding anchor box) and a size of the preliminary region proposal (e.g., a width w and a height h of the preliminary region proposal). The at least one classification layer may be configured to determine a category for each of the set of preliminary region proposals. The category may be a foreground or a background. The output of the at least one classification layer may include a first score of being foreground and a second score of being background of each of the set of preliminary region proposals.
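Merely for illustration, the anchor-box generation described above may be sketched as follows. The convention that the aspect ratio is a width-to-height ratio, and that each box keeps an area of roughly the scale squared, are assumptions for illustration.

```python
import itertools
import torch

def make_anchor_boxes(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return the k = len(scales) * len(ratios) anchor boxes centered on the
    anchor point (cx, cy), as (x1, y1, x2, y2) in image coordinates."""
    boxes = []
    for scale, ratio in itertools.product(scales, ratios):
        # Keep the box area near scale**2 while varying the aspect ratio w/h.
        w = scale * ratio ** 0.5
        h = scale / ratio ** 0.5
        boxes.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return torch.tensor(boxes)

anchors = make_anchor_boxes(300.0, 300.0)   # 3 scales x 3 ratios -> 9 boxes
print(anchors.shape)                        # torch.Size([9, 4])
```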
As described above, at each sliding-window location, a set of (e.g., 9) preliminary region proposals may be determined. Since there is a plurality of sliding-window locations (e.g., roughly 40*60), a plurality of (e.g., roughly 20000) preliminary region proposals may be determined at the plurality of sliding-window locations. In some embodiments, multiple preliminary region proposals may highly overlap with each other. The AI processing device 142 may select a portion of the plurality of preliminary region proposals as a plurality of region proposals. In some embodiments, the AI processing device 142 may select the plurality of region proposals using non-maximum suppression (NMS). Specifically, the AI processing device 142 may determine the plurality of region proposals based on the first score of being foreground and the second score of being background of each of the plurality of preliminary region proposals and the four coordinate values of each of the plurality of preliminary region proposals. In some embodiments, the AI processing device 142 may determine an intersection-over-union (IoU) between each of the plurality of preliminary region proposals and a ground truth. The ground truth may be a labelled boundary box of the target object. The AI processing device 142 may determine preliminary region proposals that have an IoU greater than 0.7 as positive samples, and determine preliminary region proposals that have an IoU less than 0.3 as negative samples. The AI processing device 142 may remove preliminary region proposals other than the positive samples and the negative samples. In some embodiments, the AI processing device 142 may select the plurality of region proposals from the positive samples and the negative samples. In some embodiments, the AI processing device 142 may rank the positive samples based on the first score of being foreground of each of the positive samples, and select multiple positive samples based on the ranked positive samples. The AI processing device 142 may rank the negative samples based on the second score of being background of each of the negative samples, and select multiple negative samples based on the ranked negative samples. The selected positive samples and the selected negative samples may constitute the plurality of region proposals. In some embodiments, the AI processing device 142 may select 300 region proposals. The number of the selected positive samples may be the same as or different from that of the selected negative samples. In some embodiments, before selecting the region proposals using non-maximum suppression (NMS), the AI processing device 142 may first remove preliminary region proposals that cross boundaries of the image (also referred to as cross-boundary preliminary region proposals).
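Merely for illustration, this selection may be sketched with torchvision's box_iou and nms operators. The 0.7/0.3 IoU thresholds come from the description above; the NMS threshold, the even positive/negative split, and approximating the background score as one minus the foreground score are assumptions.

```python
import torch
from torchvision.ops import box_iou, nms

def sample_proposals(proposals, fg_scores, gt_boxes, num_keep=300):
    """Label proposals by IoU with the ground truth (positive > 0.7,
    negative < 0.3, others discarded), then keep top-scoring boxes after
    non-maximum suppression. Boxes are (x1, y1, x2, y2) tensors."""
    iou = box_iou(proposals, gt_boxes).max(dim=1).values  # best IoU per proposal
    pos_mask, neg_mask = iou > 0.7, iou < 0.3
    positive, pos_scores = proposals[pos_mask], fg_scores[pos_mask]
    negative = proposals[neg_mask]
    neg_scores = 1.0 - fg_scores[neg_mask]   # stand-in for the background score
    keep_pos = nms(positive, pos_scores, iou_threshold=0.7)[: num_keep // 2]
    keep_neg = nms(negative, neg_scores, iou_threshold=0.7)[: num_keep // 2]
    return torch.cat([positive[keep_pos], negative[keep_neg]])
```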
In 540, the AI processing device 142 (e.g., the pooling region proposal determination module 407) may determine a plurality of pooling region proposals based on the plurality of region proposals and the plurality of feature maps. In some embodiments, the AI processing device 142 may map the plurality of region proposals to the plurality of feature maps to determine a plurality of proposal feature maps (also referred to as regions of interest (ROIs)). In some embodiments, the plurality of proposal feature maps (or ROIs) may be inputted into a classifier for further processing. The classifier may only accept proposal feature map(s) with a canonical size (e.g., 7*7). Thus, the AI processing device 142 may resize the plurality of proposal feature maps to the canonical size. The AI processing device 142 may determine the plurality of pooling region proposals by performing pooling on the plurality of proposal feature maps (or ROIs). In some embodiments, the pooling may include a max pooling, a mean pooling, or the like. In some embodiments, the plurality of pooling region proposals may correspond to the canonical size (e.g., 7*7) and may be inputted into the classifier for further processing. For example, a pooling region proposal may be determined as a fixed-length vector, which may be sent into a full connection layer of the classifier.
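Merely for illustration, the ROI pooling of this operation may be sketched with torchvision's roi_pool; the stride-16 spatial scale and the example boxes are assumptions, while the 7*7 canonical size comes from the description.

```python
import torch
from torchvision.ops import roi_pool

feature_maps = torch.randn(1, 256, 40, 60)   # backbone output for one image
# Region proposals in image coordinates, rows of (batch_index, x1, y1, x2, y2).
rois = torch.tensor([[0, 100.0, 150.0, 400.0, 380.0],
                     [0,  50.0,  60.0, 220.0, 300.0]])
# spatial_scale maps image coordinates onto the 40*60 maps; 1/16 assumes a
# backbone stride of 16. output_size is the canonical 7*7 from the text.
pooled = roi_pool(feature_maps, rois, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)                          # torch.Size([2, 256, 7, 7])
```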
In 550, the AI processing device 142 (e.g., the classification module 409) may classify the plurality of pooling region proposals into one or more object categories or a background category via the classifier. In some embodiments, the classifier may include a support vector machine (SVM) classifier, a Bayes classifier, a decision tree classifier, a softmax classifier, or the like, or any combination thereof.
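Merely for illustration, a softmax classifier over the pooling region proposals might look as follows; the layer widths and the number of object categories are assumptions, not taken from the disclosure.

```python
import torch.nn as nn

class ProposalClassifier(nn.Module):
    """Softmax classifier over C object categories plus one background
    class, fed with flattened 7*7 pooling region proposals."""
    def __init__(self, channels=256, num_object_categories=20):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Flatten(),                                # fixed-length vector
            nn.Linear(channels * 7 * 7, 4096), nn.ReLU(),
            nn.Linear(4096, num_object_categories + 1),  # +1 for background
        )

    def forward(self, pooled):                           # (N, channels, 7, 7)
        return self.fc(pooled).softmax(dim=-1)           # per-category probabilities
```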
In some embodiments, one or more pooling region proposals may be classified into the background category. For example, as described in connection with operation 530, the region proposals may include multiple positive samples and multiple negative samples. Similarly, the pooling region proposals may correspond to multiple positive samples and multiple negative samples. In some embodiments, the multiple negative samples in the pooling region proposals may be classified into the background category. If a pooling region proposal is determined as belonging to the background category, the pooling region proposal may be omitted and not further processed.
In some embodiments, a pooling region proposal corresponding to a positive sample may be classified into one of the one or more object categories. The one or more object categories may be default settings of the AI image processing system 100, and/or may be adjusted by a user. The one or more object categories may include a category of the target object. The plurality of pooling region proposals may include one or more pooling region proposals corresponding to the target object. The AI processing device 142 may select the one or more pooling region proposals corresponding to the target object from the plurality of pooling region proposals.
In 560, the AI processing device 142 (e.g., the boundary determination module 411) may determine a target boundary of the target object in the image based on at least one of the one or more pooling region proposals. In some embodiments, the target boundary may be a polygonal box, for example, a quadrilateral box.
In some embodiments, each of the one or more pooling region proposals may have a plurality of corners (e.g., 4 corners, 5 corners, 8 corners, etc.). For a pooling region proposal, the AI processing device 142 may determine a plurality of crop strategies for each corner of the plurality of corners according to a position of the corresponding corner. Merely by way of example, the AI processing device 142 may determine five crop strategies for each of the plurality of corners. In some embodiments, for each corner, the AI processing device 142 may determine one of the plurality of (e.g., five) crop strategies as a desired crop strategy of the corner based on the pooling region proposal. Merely by way of example, the AI processing device 142 may determine a cropping direction and a cropping length for each corner based on the pooling region proposal. The cropping direction of a corner may be limited to one of the plurality of crop strategies of the corner. In some embodiments, the AI processing device 142 may trim the pooling region proposal by cropping each of the plurality of corners according to the desired crop strategy, for example, based on the cropping direction and the cropping length. In some embodiments, the plurality of crop strategies of each corner may include a crop strategy of false and/or a crop strategy of target position. The crop strategy of false of a corner may indicate that the corner may correspond to a point inside the target object. The crop strategy of target position of a corner may indicate that the corner may correspond to a boundary point of the target object. If the cropping direction of a corner corresponds to the crop strategy of false, the AI processing device 142 may stop cropping the corner and abandon the pooling region proposal. If the cropping direction of a corner corresponds to the crop strategy of target position, the AI processing device 142 may stop cropping the corner. When the cropping direction of each corner corresponds to the crop strategy of target position, the AI processing device 142 may identify a boundary to the trimmed pooling region proposal based on the cropped plurality of corners, and map the boundary to the image to determine a boundary of the target object. Details regarding the determination of a boundary may be found elsewhere in the present disclosure (e.g., FIG. 7 and the descriptions thereof).
In some embodiments, the AI processing device 142 may determine one or more boundaries corresponding to the target object based on the one or more pooling region proposals. Each of the one or more boundaries may be determined according to process 700, and the descriptions thereof are not repeated herein. The AI processing device 142 may determine an IoU between each of the one or more boundaries and a ground truth. In some embodiments, the ground truth may indicate a labelled boundary box of the target object. The IoU between a boundary and the ground truth may reflect a degree of overlapping of the boundary and the ground truth. The AI processing device 142 may compare one or more determined IoUs related to the one or more boundaries, and determine a boundary with the greatest IoU as a target boundary corresponding to the target object.
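Merely for illustration, the IoU comparison used to pick the target boundary may be sketched as below, assuming axis-aligned rectangular boxes in (x1, y1, x2, y2) form; an exact IoU between general quadrilateral boundaries would require polygon intersection instead.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def pick_target_boundary(boundaries, ground_truth):
    # The boundary overlapping the ground truth most becomes the target boundary.
    return max(boundaries, key=lambda b: iou(b, ground_truth))
```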
In the present disclosure, for a pooling region proposal, each corner of the pooling region proposal may be cropped according to one of its crop strategies, which takes into account information related to the corresponding corner. In addition, the cropping direction and/or the cropping length of each corner of the pooling region proposal may be determined based on the pooling region proposal, which takes into account features in the pooling region proposal. Thus, a boundary of the target object determined according to the process disclosed in the present disclosure may be more suitable for the target object, especially for a tilted target object, which may improve the accuracy of detecting and/or locating the target object. As disclosed in the present disclosure, for the target object, one or more boundaries may be determined. A boundary with the greatest IoU among the one or more boundaries may be determined as the target boundary. That is, a boundary having the greatest degree of overlapping with the ground truth may be determined as the target boundary, which may further improve the accuracy of detecting and/or locating the target object.
It should be noted that the above description regarding the process 500 is merely provided for the purposes of illustration, and not intended to limit the scope of the present disclosure. For persons having ordinary skills in the art, multiple variations and  modifications may be made under the teachings of the present disclosure. However, those variations and modifications do not depart from the scope of the present disclosure. In some embodiments, in 560, the AI processing device 142 may not need to determine a plurality of boundaries corresponding to the target object. When the AI processing device 142 determines a boundary of the target object, the AI processing device 142 may determine the boundary as a target boundary, and operation 560 may be terminated. In some embodiments, according to process 500, the boundaries of one or more target objects (e.g., all objects) in the image may be determined simultaneously. In some embodiments, process 500 may be repeated to determine boundaries of target objects in a plurality of different images.
FIG. 6 is a schematic diagram illustrating an exemplary region proposal network (RPN) according to some embodiments of the present disclosure. As shown in FIG. 6, the RPN introduces a sliding window. The sliding window is configured to slide over a plurality of feature maps. As shown in FIG. 6, a sliding window coincides with a sub-region of the plurality of feature maps at a certain sliding-window location. The sliding window has a size of 3*3. The sub-region is mapped to a multi-dimensional feature vector, e.g., a 256-dimensional (256-d) feature vector shown in an intermediate layer. In addition, a center pixel O of the sub-region is mapped to a pixel of an image to generate an anchor O’. A set of anchor boxes (e.g., k anchor boxes) are determined based on the anchor O’. Each of the set of anchor boxes is a rectangular box, and the anchor O’ is a center point of the set of anchor boxes. In some embodiments, there may be 3 scales and 3 aspect ratios, and 9 anchor boxes may be determined on the image.
As shown in FIG. 6, the RPN includes a regression layer (denoted as reg layer) and a classification layer (denoted as cls layer). The regression layer may be configured to conduct bounding-box regression to determine a preliminary region proposal corresponding to an anchor box. The classification layer may be configured to determine a category for the preliminary region proposal. As illustrated, the multi-dimensional feature vector (i.e., the 256-d feature vector) and/or the set of anchor boxes (i.e., k anchor boxes) are fed into the regression layer and the classification layer, respectively. The output of the regression layer includes four coordinate values (also referred to as four coordinates) of each of the set of preliminary region proposals. The four coordinate values of a preliminary region proposal may include a location of the preliminary region proposal (e.g., coordinates (x, y) of the anchor of the corresponding anchor box) and a size of the preliminary region proposal (e.g., a width w and a height h of the preliminary region proposal). The output of the classification layer includes two scores of each of the set of preliminary region proposals, including a first score of being foreground and a second score of being background.
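Merely for illustration, the reg and cls layers of FIG. 6 are commonly realized as 1*1 convolutions on top of a 3*3 intermediate layer, which is how the sketch below implements them; realizing the sliding window this way, and the channel width, are assumptions.

```python
import torch.nn as nn

class RPNHead(nn.Module):
    """Per sliding-window location: 2k foreground/background scores from the
    cls layer and 4k coordinate values from the reg layer, for k anchor
    boxes (k = 9 for 3 scales x 3 aspect ratios)."""
    def __init__(self, in_channels=256, k=9):
        super().__init__()
        # 3*3 sliding window producing the 256-d feature vector per location.
        self.intermediate = nn.Conv2d(in_channels, 256, kernel_size=3, padding=1)
        self.cls_layer = nn.Conv2d(256, 2 * k, kernel_size=1)  # two scores per anchor
        self.reg_layer = nn.Conv2d(256, 4 * k, kernel_size=1)  # (x, y, w, h) per anchor

    def forward(self, feature_maps):         # e.g., (N, 256, 40, 60)
        t = self.intermediate(feature_maps).relu()
        return self.cls_layer(t), self.reg_layer(t)
```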
As described above, a set of preliminary region proposals are determined at the certain sliding-window location. With the sliding of the sliding window over the plurality of feature maps, a plurality of preliminary region proposals may be determined at a plurality of sliding-window locations. In some embodiments, the RPN may select a portion of the plurality of preliminary region proposals as region proposals for further processing. More descriptions regarding the selection of the region proposals may be found elsewhere in the present disclosure (e.g., operation 530 of process 500 and the relevant descriptions thereof).
FIG. 7 is a flowchart illustrating an exemplary process 700 for determining a boundary of a target object based on a pooling region proposal according to some embodiments of the present disclosure. For illustration purposes only, the AI processing device 142 may be described as a subject to perform the process 700. However, one of ordinary skill in the art would understand that the process 700 may also be performed by other entities. For example, one of ordinary skill in the art would understand that at least a portion of the process 700 may also be implemented on the computing device 200 as illustrated in FIG. 2 or the mobile device 300 as illustrated in FIG. 3. In some embodiments, one or more operations of process 700 may be implemented in the AI image processing system 100 as illustrated in FIG. 1. In some embodiments, one or more operations in the process 700 may be stored in the storage device 150 and/or the storage (e.g., the ROM 230, the RAM 240, etc.) in the form of instructions, and invoked and/or executed by the server 140 (e.g., the AI processing device 142 in the server 140, or the processor 220 of the AI processing device 142 in the server 140). In some embodiments, the instructions may be transmitted in a form of electronic current or electrical signals. In some embodiments, a portion of operation 560 of the process 500 may be performed according to process 700.
In some embodiments, a pooling region proposal may have a plurality of corners. In 710, the AI processing device 142 (e.g., the boundary determination module 411) may determine a plurality of crop strategies for each of the plurality of corners of the pooling region proposal according to a position of each of the plurality of corners.
In some embodiments, the position of a corner may refer to a position of the corner relative to positions of the other corners. In certain embodiments, the pooling region proposal may be a rectangular box and include four corners. The four corners may include a top left corner, a top right corner, a bottom left corner, and a bottom right corner. The AI processing device 142 may determine a plurality of crop strategies for each of the four corners based on the position of each of the four corners. Specifically, the AI processing device 142 may determine a plurality of crop strategies for the top left corner. The plurality of crop strategies of the top left corner may include cropping to right, cropping to bottom, cropping to bottom right, target position, false, or the like, or any combination thereof. The AI processing device 142 may determine a plurality of crop strategies for the top right corner. The plurality of crop strategies of the top right corner may include cropping to left, cropping to bottom, cropping to bottom left, target position, false, or the like, or any combination thereof. The AI processing device 142 may determine a plurality of crop strategies for the bottom left corner. The plurality of crop strategies of the bottom left corner may include cropping to right, cropping to top, cropping to top right, target position, false, or the like, or any combination thereof. The AI processing device 142 may determine a plurality of crop strategies for the bottom right corner. The plurality of crop strategies of the bottom right corner may include cropping to left, cropping to top, cropping to top left, target position, false, or the like, or any combination thereof. As can be seen from the above, the plurality of crop strategies of each corner may include a crop strategy of false and/or a crop strategy of target position. The crop strategy of false of a corner may indicate that the corner may correspond to a point inside the target object. The crop strategy of target position of a corner may indicate that the corner may correspond to a boundary point of the target object. It should be noted that the crop strategies of each corner and the number of corners are merely provided for illustration purposes, and are not intended to limit the scope of the present disclosure.
In 720, the AI processing device 142 (e.g., the boundary determination module 411) may determine a crop strategy for each of the plurality of corners from the plurality of crop strategies based on the pooling region proposal. In some embodiments, the AI processing device 142 may determine a cropping direction and a cropping length for each corner based on the pooling region proposal. For example, the AI processing device 142 may analyze features of pixels (e.g., pixels representing the target object, pixels representing the background) in the pooling region proposal, and determine the cropping direction and/or the cropping length based on the analysis result. The cropping direction may be limited to one of the plurality of crop strategies. The cropping length may be a length of several pixels, for example, 0-10 pixels.
In 730, the AI processing device 142 (e.g., the boundary determination module 411) may determine whether one of the plurality of corners corresponds to a crop strategy of false. In some embodiments, if the determined cropping direction of a corner corresponds to the crop strategy of false, the corner may correspond to a point inside the target object. That is, the pooling region proposal does not encompass the whole target object. If the determined cropping direction of a corner corresponds to the crop strategy of target position, the corner may correspond to a boundary point of the target object. Otherwise, if the determined cropping direction of a corner corresponds to a crop strategy other than the crop strategy of false and the crop strategy of target position, the corner may correspond to a point located at a distance from the target object. In response to a determination that at least one of the plurality of corners corresponds to the crop strategy of false, the AI processing device 142 may proceed to operation 740. In response to a determination that none of the plurality of corners corresponds to the crop strategy of false, the AI processing device 142 may proceed to operation 750.
In 740, the AI processing device 142 (e.g., the boundary determination module 411) may abandon the pooling region proposal. Since the pooling region proposal does not encompass the whole target object, a boundary of the target object cannot be determined based on the pooling region proposal. Accordingly, the AI processing device 142 may abandon the pooling region proposal.
In 750, the AI processing device 142 (e.g., the boundary determination module 411) may determine whether each of the plurality of corners corresponds to a crop strategy of target position. In response to a determination that at least one of the plurality of corners does not correspond to the crop strategy of target position, the AI processing device 142 may proceed to operation 760.
In 760, the AI processing device 142 (e.g., the boundary determination module 411) may trim the pooling region proposal by cropping the at least one corner according to the determined crop strategy of the at least one corner. That is, if a corner corresponds to neither the crop strategy of target position nor the crop strategy of false, the AI processing device 142 may crop the corner based on the crop strategy of the corner determined in 720. When the at least one corner is cropped according to its crop strategy, a trimmed pooling region proposal may be determined.
Merely by way of example, for a top right corner of the pooling region proposal, if the determined cropping direction corresponds to a crop strategy of cropping to left, the AI processing device 142 may crop the top right corner toward the left to update the position of the top right corner. As another example, for a top left corner of the pooling region proposal, if the determined cropping direction corresponds to a crop strategy of cropping to right, the AI processing device 142 may crop the top left corner toward the right to update the position of the top left corner. As a further example, for a bottom left corner of the pooling region proposal, if the determined cropping direction corresponds to a crop strategy of cropping to top right, the AI processing device 142 may crop the bottom left corner toward the top right to update the position of the bottom left corner.
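Merely for illustration, one way to realize such a per-corner update is to move the corner along a unit direction vector by the cropping length. The Python sketch below assumes image coordinates in which x grows rightward and y grows downward, and assumes that a diagonal strategy moves the corner by the cropping length along each axis; these conventions are assumptions of the sketch, not requirements of the disclosure.

# Direction vectors (dx, dy) for each crop strategy, in image coordinates
# (x grows rightward, y grows downward). Illustrative only.
DIRECTIONS = {
    "left": (-1, 0), "right": (1, 0), "top": (0, -1), "bottom": (0, 1),
    "top_left": (-1, -1), "top_right": (1, -1),
    "bottom_left": (-1, 1), "bottom_right": (1, 1),
}

def crop_corner(corner_xy, strategy, length):
    # Move a corner `length` pixels along the chosen cropping direction;
    # "target" and "false" leave the corner unchanged.
    if strategy in ("target", "false"):
        return corner_xy
    dx, dy = DIRECTIONS[strategy]
    return (corner_xy[0] + dx * length, corner_xy[1] + dy * length)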
In 770, the AI processing device 142 (e.g., the boundary determination module 411) may perform a bounding mapping based on the cropped plurality of corners to determine a rectangular box. In certain embodiments, as described above, the pooling region proposal may be a rectangular box and include four corners. Due to the different crop strategies applied to different corners, the trimmed pooling region proposal may be a quadrilateral box other than a rectangular box. In some embodiments, the crop strategies described above can be used only when the (trimmed) pooling region proposal is a rectangular box. Thus, the AI processing device 142 may perform a bounding mapping on the trimmed pooling region proposal. Specifically, the AI processing device 142 may determine two diagonal lines based on the four corners, and determine the longer diagonal line as a target diagonal line. The AI processing device 142 may determine a rectangular box based on the target diagonal line.
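The bounding mapping itself reduces to comparing the two diagonals of the cropped quadrilateral. The following NumPy sketch follows the description above (keep the longer diagonal and span an axis-aligned rectangle on it); the function name and the (x_min, y_min, x_max, y_max) return convention are assumptions of the sketch.

import numpy as np

def bounding_mapping(corners):
    # corners: [top_left, top_right, bottom_left, bottom_right] as (x, y).
    tl, tr, bl, br = (np.asarray(c, dtype=float) for c in corners)
    d1 = np.linalg.norm(br - tl)                # diagonal TL -> BR
    d2 = np.linalg.norm(bl - tr)                # diagonal TR -> BL
    p, q = (tl, br) if d1 >= d2 else (tr, bl)   # keep the longer diagonal
    x0, x1 = sorted((p[0], q[0]))
    y0, y1 = sorted((p[1], q[1]))
    return (x0, y0, x1, y1)                     # axis-aligned rectangle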
In 780, the AI processing device 142 (e.g., the boundary determination module 411) may resize the rectangular box into a canonical size to determine an updated pooling region proposal. In some embodiments, the AI processing device 142 may resize the rectangular box by performing pooling to determine an updated pooling region proposal. The updated pooling region proposal may have a canonical size and be accepted by the classifier. After the updated pooling region proposal is determined, the AI processing device 142 may proceed to operations 720 through 780 and start a next iteration. Descriptions of the operations 720 through 770 may be found elsewhere in the present disclosure, and are not repeated here. The AI processing device 142 may repeat operations 720 through 780 until each of the plurality of corners corresponds to a crop strategy of target position.
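Taken together, operations 720 through 780 form an iterative trimming loop. The Python sketch below reuses crop_corner and bounding_mapping from the sketches above and takes two hypothetical callables as inputs: predict_strategies, a stand-in for the learned predictor of operation 720, and resize_to_canonical, a stand-in for the pooling-based resize of operation 780. Only the control flow of FIG. 7 is represented here; the sketch is not the disclosed implementation.

def trim_proposal(corners, predict_strategies, resize_to_canonical,
                  max_iters=50):
    # corners: dict mapping "top_left", "top_right", "bottom_left",
    # "bottom_right" to (x, y) positions on the feature maps.
    # predict_strategies(corners) -> dict mapping each corner name to a
    # (strategy, length) pair (hypothetical; operation 720).
    # resize_to_canonical(rect) -> new corners dict (hypothetical; 780).
    for _ in range(max_iters):
        strategies = predict_strategies(corners)
        if any(s == "false" for s, _ in strategies.values()):
            return None                          # operations 730/740: abandon
        if all(s == "target" for s, _ in strategies.values()):
            return corners                       # operations 750/790: boundary
        corners = {name: crop_corner(xy, *strategies[name])
                   for name, xy in corners.items()}          # operation 760
        rect = bounding_mapping([corners[k] for k in
                                 ("top_left", "top_right",
                                  "bottom_left", "bottom_right")])  # 770
        corners = resize_to_canonical(rect)      # operation 780, next pass
    return corners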
In 730, in response to a determination that none of the plurality of corners corresponds to the crop strategy of false, the AI processing device 142 may proceed to operation 750. In 750, the AI processing device 142 may determine whether each of the plurality of corners corresponds to the crop strategy of target position. In response to a determination that each of the plurality of corners corresponds to the crop strategy of target position, the AI processing device 142 may stop cropping the plurality of corners and proceed to operation 790.
In 790, the AI processing device 142 (e.g., the boundary determination module 411) may identify a boundary to the trimmed pooling region proposal based on the cropped plurality of corners. In certain embodiments, the (trimmed) pooling region proposal may include four corners. The AI processing device 142 may connect the four corners to determine a boundary on the feature map(s).
In 795, the AI processing device 142 (e.g., the boundary determination module 411) may map the boundary to the image to determine a boundary of the target object. In some embodiments, the boundary of the target object may be a quadrilateral box.
In the present disclosure, for each corner of the pooling region proposal, a plurality of crop strategies may be determined based on a position of the corresponding corner. Furthermore, the cropping direction and/or the cropping length of each corner may be determined based on features of pixels in the pooling region proposal. That is, to crop a corner, the position of the corner and/or features of the pooling region proposal are taken into account. Thus, a boundary of the target object determined according to the process disclosed in the present disclosure may fit the target object more closely, especially for a tilted target object, which may improve the accuracy of detecting and/or locating the target object. For example, as shown in FIG. 9C, for a tilted target object (e.g., tilted characters), the present disclosure may provide a suitable boundary for the tilted target object, and further improve the accuracy of detecting and/or locating the tilted target object.
It should be noted that the above description regarding the process 700 is merely provided for the purposes of illustration, and not intended to limit the scope of the present disclosure. For persons having ordinary skills in the art, multiple variations and modifications may be made under the teachings of the present disclosure. However, those variations and modifications do not depart from the scope of the present disclosure. For example, operations 730 and 750 may be performed simultaneously. As another example, operation 750 may be performed before operation 730. In some embodiments, the AI processing device 142 may repeat the process 700 to determine one or more boundaries corresponding to the target object.
FIG. 8 is a schematic diagram illustrating an exemplary process for determining a boundary of a target object according to some embodiments of the present disclosure. As shown in FIG. 8, an image is inputted into a convolutional neural network (CNN). In some embodiments, the image may include one or more target objects (e.g., objects to be detected). In some embodiments, the CNN may be generated based on a ZF model, VGG-16, ResNet-50, etc. In some embodiments, the CNN may include one or more convolution layers and one or more pooling layers, and be without a full connection layer. By inputting the image into the CNN, a plurality of feature maps may be generated. The plurality of feature maps may include feature information of the image. Details regarding the generation of the feature maps may be found elsewhere in the present disclosure (e.g., operations 510 and 520, and the descriptions thereof).
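Merely for illustration, a fully convolutional backbone of this kind can be sketched in a few lines of PyTorch. The layer widths below are arbitrary choices for the sketch and do not correspond to the ZF, VGG-16, or ResNet-50 configurations; the point is that only convolution and pooling layers appear, with no full connection layer, so spatial feature maps are produced for inputs of any size.

import torch.nn as nn

# Convolution and pooling layers only; no full connection layer.
backbone = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.MaxPool2d(2),
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.MaxPool2d(2),
    nn.Conv2d(128, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
)
# feature_maps = backbone(image_tensor)  # shape (N, 256, H/4, W/4)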
As shown in FIG. 8, the plurality of feature maps may be inputted into a region proposal network (RPN). In the RPN, a sliding window may slide over the plurality of feature maps. With the sliding of the sliding window over the plurality of feature maps, a plurality of sliding-window locations may be determined. At each sliding-window location, a multi-dimensional feature vector (e.g., a 256-dimensional feature vector) may be generated and/or an anchor in the image may be determined. The anchor may correspond to a set of anchor boxes, each of which may be associated with a scale and an aspect ratio. As shown in FIG. 8, the RPN includes at least one regression layer and at least one classification layer. The multi-dimensional feature vector and/or the set of anchor boxes are fed into the at least one regression layer and the at least one classification layer. The output of the at least one regression layer may include four coordinate values of each of a set of preliminary region proposals. The output of the at least one classification layer may include a first score of being foreground and a second score of being background of each of the set of preliminary region proposals. Similarly, across the plurality of sliding-window locations, a plurality of preliminary region proposals may be determined. In some embodiments, a portion of the plurality of preliminary region proposals may be selected as a plurality of region proposals. The plurality of region proposals may include positive samples (e.g., foreground) and negative samples (e.g., background). The plurality of region proposals may be further processed. Details regarding the determination of the region proposals may be found elsewhere in the present disclosure (e.g., operation 530 of the process 500).
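The anchor boxes associated with one anchor can be generated directly from the scales and aspect ratios. The NumPy sketch below uses the scale and ratio values common in Faster R-CNN implementations purely as an example; the disclosure does not fix particular values.

import numpy as np

def make_anchor_boxes(cx, cy, scales=(128, 256, 512),
                      ratios=(0.5, 1.0, 2.0)):
    # Each anchor box has area scale**2 and aspect ratio w/h = ratio,
    # centered on the anchor (cx, cy) in image coordinates.
    boxes = []
    for s in scales:
        for r in ratios:
            w = s * np.sqrt(r)
            h = s / np.sqrt(r)
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return np.array(boxes)  # shape: (len(scales) * len(ratios), 4)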
As shown in FIG. 8, an ROI pooling operation is performed based on the plurality of feature maps and the plurality of region proposals. Specifically, the plurality of region proposals may be mapped to the plurality of feature maps to determine a plurality of proposal feature maps (also referred to as ROIs). The plurality of ROIs may be resized to a canonical size (e.g., 7×7) by performing pooling on the plurality of ROIs. Then a plurality of pooling region proposals may be determined. The plurality of pooling region proposals may be fed into a classifier for further processing.
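Resizing an arbitrarily sized ROI to the canonical grid is what makes the pooling region proposals acceptable to the classifier. The NumPy sketch below is a simplified version of ROI max pooling: it divides the ROI into an out_size × out_size grid of cells and keeps the maximum of each cell. It assumes the ROI spans at least out_size pixels in each dimension; production implementations also handle smaller ROIs and fractional coordinates.

import numpy as np

def roi_pool(feature_map, roi, out_size=7):
    # feature_map: array of shape (C, H, W); roi: (x0, y0, x1, y1) in
    # feature-map coordinates.
    x0, y0, x1, y1 = (int(round(v)) for v in roi)
    region = feature_map[:, y0:y1 + 1, x0:x1 + 1]
    c, h, w = region.shape
    ys = np.linspace(0, h, out_size + 1, dtype=int)   # cell row boundaries
    xs = np.linspace(0, w, out_size + 1, dtype=int)   # cell column boundaries
    out = np.zeros((c, out_size, out_size), dtype=feature_map.dtype)
    for i in range(out_size):
        for j in range(out_size):
            cell = region[:, ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
            out[:, i, j] = cell.max(axis=(1, 2))      # max over each cell
    return out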
The plurality of pooling region proposals may be classified into one or more object categories (e.g., K categories) or a background category via the classifier. If a pooling region proposal is classified into the background category, the pooling region proposal may be omitted and/or removed. The plurality of pooling region proposals may include one or more pooling region proposals corresponding to a target object. For a pooling region proposal, a boundary of the target object in the image may be determined based on the pooling region proposal. To determine the boundary of the target object, the pooling region proposal may be trimmed one or more times. In some embodiments, the pooling region proposal may include a plurality of corners. As shown in FIG. 8, the pooling region proposal includes four corners, that is, a top left (TL) corner, a top right (TR) corner, a bottom left (BL) corner, and a bottom right (BR) corner. Each of the four corners corresponds to five crop strategies. Specifically, the five strategies of the top left corner include cropping to right (→), cropping to bottom right (↘), cropping to bottom (↓), target position (T), and false (F). The five strategies of the top right corner include cropping to left (←), cropping to bottom left (↙), cropping to bottom (↓), target position (T), and false (F). The five strategies of the bottom left corner include cropping to right (→), cropping to top right (↗), cropping to top (↑), target position (T), and false (F). The five strategies of the bottom right corner include cropping to left (←), cropping to top left (↖), cropping to top (↑), target position (T), and false (F). A crop strategy of each of the four corners may be determined based on features of pixels in the pooling region proposal. Whether one of the four corners corresponds to a crop strategy of false may then be determined. If at least one of the four corners corresponds to the crop strategy of false, it may be determined that the pooling region proposal does not encompass the whole target object, and the pooling region proposal may be abandoned and/or rejected. If none of the four corners corresponds to the crop strategy of false, whether each of the four corners corresponds to a crop strategy of target position may be determined. If a corner does not correspond to the crop strategy of target position, the corner may be cropped based on the determined crop strategy. When each such corner is cropped according to its crop strategy, a trimmed pooling region proposal is determined. A bounding mapping may be performed based on the cropped four corners to determine a rectangular box. The rectangular box is resized into a canonical size to determine an updated pooling region proposal. The updated pooling region proposal may be further trimmed, and a next iteration may be performed. When each of the four corners corresponds to the crop strategy of target position, the four corners are no longer cropped. A boundary of the (trimmed) pooling region proposal may be identified. The boundary may be mapped to the image to determine a boundary of the target object. More descriptions of the determination of the boundary of the target object may be found elsewhere in the present disclosure (e.g., FIG. 7 and the descriptions thereof).
FIGs. 9A to 9C are schematic diagrams illustrating an image according to some embodiments of the present disclosure. As shown in FIGs. 9A to 9C, the image may include a target object, i.e., the tilted characters “CHANEL”. FIG. 9B shows a boundary 902 (also referred to as a bounding box) of “CHANEL”, which is determined according to a Faster-RCNN algorithm. FIG. 9C shows a boundary 904 of “CHANEL”, which is determined according to the process described in the present disclosure. The boundary 902 includes more background than the target object, and thus cannot accurately locate the target object. The boundary 904 includes less background, which may enable an accurate locating of the target object. Thus, the process described in the present disclosure can improve the accuracy of object detection.
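The comparison in FIGs. 9B and 9C can be quantified with the intersection-over-union (IoU) measure used elsewhere in the present disclosure to select a target boundary against a ground truth. A minimal Python sketch for axis-aligned boxes follows; comparing quadrilateral boundaries such as the boundary 904 would require polygon intersection instead, which this sketch does not implement.

def iou(box_a, box_b):
    # Boxes given as (x0, y0, x1, y1) with x0 < x1 and y0 < y1.
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0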
Having thus described the basic concepts, it may be rather apparent to those skilled in the art after reading this detailed disclosure that the foregoing detailed disclosure is intended to be presented by way of example only and is not limiting. Various alterations, improvements, and modifications may occur to those skilled in the art, though not expressly stated herein. These alterations, improvements, and modifications are intended to be suggested by this disclosure, and are within the spirit and scope of the exemplary embodiments of this disclosure.
Moreover, certain terminology has been used to describe embodiments of the present disclosure. For example, the terms “one embodiment,” “an embodiment,” and/or “some embodiments” mean that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. Therefore, it is emphasized and should be appreciated that two or more references to “an embodiment” or “one embodiment” or “an alternative embodiment” in various portions of this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined as suitable in one or more embodiments of the present disclosure.
Further, it will be appreciated by one skilled in the art that aspects of the present disclosure may be illustrated and described herein in any of a number of patentable classes or contexts, including any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof. Accordingly, aspects of the present disclosure may be implemented entirely in hardware, entirely in software (including firmware, resident software, micro-code, etc.), or in an implementation combining software and hardware that may all generally be referred to herein as a “unit,” “module,” or “system.” Furthermore, aspects of the present disclosure may take the form of a computer program product embodied in one or more computer readable media having computer readable program code embodied thereon.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including electro-magnetic, optical, or the like, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that may communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable signal medium may be transmitted using any appropriate medium, including wireless, wireline, optical fiber cable, RF, or the like, or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET, Python, or the like, conventional procedural programming languages, such as the "C" programming language, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP, ABAP, dynamic programming languages such as Ruby and Groovy, or other programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider), or in a cloud computing environment, or offered as a service such as Software as a Service (SaaS).
Furthermore, the recited order of processing elements or sequences, or the use of numbers, letters, or other designations therefor, is not intended to limit the claimed processes and methods to any order except as may be specified in the claims. Although the above disclosure discusses through various examples what is currently considered to be a variety of useful embodiments of the disclosure, it is to be understood that such detail is solely for that purpose, and that the appended claims are not limited to the disclosed embodiments, but, on the contrary, are intended to cover modifications and equivalent arrangements that are within the spirit and scope of the disclosed embodiments. For example, although the implementation of various components described above may be embodied in a hardware device, it may also be implemented as a software-only solution, e.g., an installation on an existing server or mobile device.
Similarly, it should be appreciated that in the foregoing description of embodiments of the present disclosure, various features are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various embodiments. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed subject matter requires more features than are expressly recited in each claim. Rather, claimed subject matter may lie in less than all features of a single foregoing disclosed embodiment.

Claims (28)

  1. An artificial intelligent image processing system for object detection, comprising:
    at least one storage device including a set of instructions for determining a boundary corresponding to an object in an image;
    at least one processor in communication with the at least one storage device, wherein when executing the set of instructions, the at least one processor is directed to:
    obtain an image including a target object;
    generate a plurality of feature maps by inputting the image into a convolutional neural network (CNN) ;
    determine a plurality of region proposals based on the plurality of feature maps;
    determine a plurality of pooling region proposals based on the plurality of region proposals and the plurality of feature maps;
    classify the plurality of pooling region proposals into one or more object categories or a background category via a classifier, the one or more object categories including a category of the target object, the plurality of pooling region proposals including one or more pooling region proposals corresponding to the target object, each of the one or more pooling region proposals having a plurality of corners; and
    for each of the one or more pooling region proposals corresponding to the target object,
    determine a plurality of crop strategies for each corner of the plurality of corners of the pooling region proposal according to a position of the corresponding corner;
    trim the pooling region proposal by cropping each of the plurality of corners according to one of the plurality of crop strategies;
    identify a boundary to the trimmed pooling region proposal based on the cropped plurality of corners; and
    map the boundary to the image to determine a boundary of the target object.
  2. The artificial intelligent image processing system of claim 1, wherein the CNN includes one or more convolution layers and one or more pooling layers and is without a full connection layer.
  3. The artificial intelligent image processing system of claim 1, wherein the plurality of region proposals is determined according to a region proposal network (RPN) .
  4. The artificial intelligent image processing system of claim 3, wherein the RPN includes at least one regression layer and at least one classification layer, and to determine the plurality of region proposals, the at least one processor is directed to:
    slide a sliding window over the plurality of feature maps;
    at each sliding-window location, the sliding window coinciding with a sub-region of the plurality of feature maps,
    map the sub-region of the plurality of feature maps to a multi-dimensional feature vector;
    generate an anchor by mapping a center pixel of the sub-region to a pixel of the image, the anchor corresponding to a set of anchor boxes in the image, each of the set of anchor boxes being associated with a scale and an aspect ratio;
    feed the multi-dimensional feature vector into the at least one regression layer and the at least one classification layer, respectively, wherein
    the at least one regression layer is configured to conduct bounding-box regression to determine a set of preliminary region proposals corresponding to the set of anchor boxes, the output of the at least one regression layer including four coordinate values of each of the set of preliminary region proposals, and
    the at least one classification layer is configured to determine a category for each of the set of preliminary region proposals, the category being a foreground or a background, the output of the at least one classification layer including a first score of being foreground and a second score of being background of each of the set of preliminary region proposals; and
    select, based on the first score of being foreground and the second score of being background of each of a plurality of preliminary region proposals and four coordinate values of each of the plurality of preliminary region proposals, a portion of the plurality of preliminary region proposals as the plurality of region proposals.
  5. The artificial intelligent image processing system of claim 4, wherein to select a portion of the plurality of preliminary region proposals as the plurality of region proposals, the at least one processor is directed to:
    select the plurality of region proposals using a non-maximum suppression (NMS) .
  6. The artificial intelligent image processing system of any one of claims 1 to 5, wherein the plurality of pooling region proposals corresponds to a canonical size, and to determine the plurality of pooling region proposals, the at least one processor is further directed to:
    map the plurality of region proposals to the plurality of feature maps to determine a plurality of proposal feature maps; and
    determine the plurality of pooling region proposals by performing pooling on the plurality of proposal feature maps.
  7. The artificial intelligent image processing system of claim 1, wherein the plurality of corners includes a top left corner, a top right corner, a bottom left corner, and a bottom right corner, wherein
    the plurality of crop strategies of the top left corner includes at least one of cropping to right, cropping to bottom, cropping to bottom right, target position, or false;
    the plurality of crop strategies of the top right corner includes at least one of cropping to left, cropping to bottom, cropping to bottom left, target position, or false;
    the plurality of crop strategies of the bottom left corner includes at least one of cropping to right, cropping to top, cropping to top right, target position, or false; and
    the plurality of crop strategies of the bottom right corner includes at least one of cropping to left, cropping to top, cropping to top left, target position, or false.
  8. The artificial intelligent image processing system of claim 7, wherein the at least one processor is further directed to:
    stop cropping one of the plurality of corners when the corner corresponds to a crop strategy of target position.
  9. The artificial intelligent image processing system of claim 7, wherein to crop each of the plurality of corners, the at least one processor is directed to:
    determine a cropping direction and a cropping length for each of the plurality of corners based on the pooling region proposal, wherein the cropping direction of each of the plurality of corners is limited to one of the plurality of crop strategies of the corresponding corner; and
    crop each of the plurality of corners based on the cropping direction and the cropping length.
  10. The artificial intelligent image processing system of claim 7, wherein to trim the pooling region proposal by cropping each of the plurality of corners, the at least one processor is directed to:
    perform one or more iterations;
    in each of the one or more iterations,
    determine, from the plurality of crop strategies, a crop strategy for each of the plurality of corners based on the pooling region proposal;
    determine whether one of the plurality of corners corresponds to a crop strategy of false;
    determine whether each of the plurality of corners corresponds to a crop strategy of target position in response to a determination that each of the plurality of corners does not correspond to the crop strategy of false;
    in response to a determination that at least one of the plurality of corners does not correspond to the crop strategy of target position, crop the at least one of the plurality of corners according to the determined crop strategy of the at least one of the plurality of corners;
    perform, based on the cropped plurality of corners, a bounding mapping to determine a rectangular box; and
    resize the rectangular box into a canonical size; and
    stop cropping the plurality of corners in response to a determination that each of the plurality of corners corresponds to the crop strategy of target position.
  11. The artificial intelligent image processing system of claim 10, wherein the at least one processor is further directed to:
    abandon the pooling region proposal in response to a determination that at least one of the plurality of corners corresponds to the crop strategy of false.
  12. The artificial intelligent image processing system of any one of claims 1 to 11, wherein the at least one processor is further directed to:
    determine one or more boundaries corresponding to the target object;
    determine an intersection-over-union (IoU) between each of the one or more boundaries and a ground truth; and
    determine one of the one or more boundaries with the greatest IoU as a target boundary corresponding to the target object.
  13. The artificial intelligent image processing system of any one of claims 1 to 12, wherein the boundary of the target object is a quadrilateral box.
  14. An artificial intelligent image processing method implemented on a computing device having at least one processor, at least one computer-readable storage medium, and a communication platform connected to a network, comprising:
    obtaining an image including a target object;
    generating a plurality of feature maps by inputting the image into a convolutional neural network (CNN) ;
    determining a plurality of region proposals based on the plurality of feature maps;
    determining a plurality of pooling region proposals based on the plurality of region proposals and the plurality of feature maps;
    classifying the plurality of pooling region proposals into one or more object categories or a background category via a classifier, the one or more object categories including a category of the target object, the plurality of pooling region proposals including one or more pooling region proposals corresponding to the target object, each of the one or more pooling region proposals having a plurality of corners; and
    for each of the one or more pooling region proposals corresponding to the target object,
    determining a plurality of crop strategies for each corner of the plurality of corners of the pooling region proposal according to a position of the corresponding corner;
    trimming the pooling region proposal by cropping each of the plurality of corners according to one of the plurality of crop strategies;
    identifying a boundary to the trimmed pooling region proposal based on the cropped plurality of corners; and
    mapping the boundary to the image to determine a boundary of the target object.
  15. The artificial intelligent image processing method of claim 14, wherein the CNN includes one or more convolution layers and one or more pooling layers and is without a full connection layer.
  16. The artificial intelligent image processing method of claim 14, wherein the plurality of  region proposals is determined according to a region proposal network (RPN) .
  17. The artificial intelligent image processing method of claim 16, wherein the RPN includes at least one regression layer and at least one classification layer, and determining the plurality of region proposals further comprises:
    sliding a sliding window over the plurality of feature maps;
    at each sliding-window location, the sliding window coinciding with a sub-region of the plurality of feature maps,
    mapping the sub-region of the plurality of feature maps to a multi-dimensional feature vector;
    generating an anchor by mapping a center pixel of the sub-region to a pixel of the image, the anchor corresponding to a set of anchor boxes in the image, each of the set of anchor boxes being associated with a scale and an aspect ratio;
    feeding the multi-dimensional feature vector into the at least one regression layer and the at least one classification layer, respectively, wherein
    the at least one regression layer is configured to conduct bounding-box regression to determine a set of preliminary region proposals corresponding to the set of anchor boxes, the output of the at least one regression layer including four coordinate values of each of the set of preliminary region proposals, and
    the at least one classification layer is configured to determine a category for each of the set of preliminary region proposals, the category being a foreground or a background, the output of the at least one classification layer including a first score of being foreground and a second score of being background of each of the set of preliminary region proposals; and
    selecting, based on the first score of being foreground and the second score of being background of each of a plurality of preliminary region proposals and four coordinate values of each of the plurality of preliminary region proposals, a portion of the plurality of preliminary region proposals as the plurality of region proposals.
  18. The artificial intelligent image processing method of claim 17, wherein selecting a portion of the plurality of preliminary region proposals as the plurality of region proposals comprises:
    selecting the plurality of region proposals using a non-maximum suppression (NMS) .
  19. The artificial intelligent image processing method of any one of claims 14 to 18, wherein the plurality of pooling region proposals corresponds to a canonical size, and determining the plurality of pooling region proposals further comprises:
    mapping the plurality of region proposals to the plurality of feature maps to determine a plurality of proposal feature maps; and
    determining the plurality of pooling region proposals by performing pooling on the plurality of proposal feature maps.
  20. The artificial intelligent image processing method of claim 14, wherein the plurality of corners includes a top left corner, a top right corner, a bottom left corner, and a bottom right corner, wherein
    the plurality of crop strategies of the top left corner includes at least one of cropping to right, cropping to bottom, cropping to bottom right, target position, or false;
    the plurality of crop strategies of the top right corner includes at least one of cropping to left, cropping to bottom, cropping to bottom left, target position, or false;
    the plurality of crop strategies of the bottom left corner includes at least one of cropping to right, cropping to top, cropping to top right, target position, or false; and
    the plurality of crop strategies of the bottom right corner includes at least one of cropping to left, cropping to top, cropping to top left, target position, or false.
  21. The artificial intelligent image processing method of claim 20, further comprising:
    stopping cropping of one of the plurality of corners when the corner corresponds to a crop strategy of target position.
  22. The artificial intelligent image processing method of claim 20, wherein cropping each of the plurality of corners comprises:
    determining a cropping direction and a cropping length for each of the plurality of corners based on the pooling region proposal, wherein the cropping direction of each of the plurality of corners is limited to one of the plurality of crop strategies of the corresponding corner; and
    cropping each of the plurality of corners based on the cropping direction and the cropping length.
  23. The artificial intelligent image processing method of claim 20, wherein trimming the pooling region proposal by cropping each of the plurality of corners comprises:
    performing one or more iterations;
    in each of the one or more iterations,
    determining, from the plurality of crop strategies, a crop strategy for each of the plurality of corners based on the pooling region proposal;
    determining whether one of the plurality of corners corresponds to a crop strategy of false;
    determining whether each of the plurality of corners corresponds to a crop strategy of target position in response to a determination that each of the plurality of corners does not correspond to the crop strategy of false;
    in response to a determination that at least one of the plurality of corners does not correspond to the crop strategy of target position, cropping the at least one of the plurality of corners according to the determined crop strategy of the at least one of the plurality of corners;
    performing, based on the cropped plurality of corners, a bounding mapping to determine a rectangular box; and
    resizing the rectangular box into a canonical size; and
    stopping cropping of the plurality of corners in response to a determination that each of the plurality of corners corresponds to the crop strategy of target position.
  24. The artificial intelligent image processing method of claim 23, further comprising:
    abandoning the pooling region proposal in response to a determination that at least one of the plurality of corners corresponds to the crop strategy of false.
  25. The artificial intelligent image processing method of any one of claims 14 to 24, further comprising:
    determining one or more boundaries corresponding to the target object;
    determining an intersection-over-union (IoU) between each of the one or more boundaries and a ground truth; and
    determining one of the one or more boundaries with the greatest IoU as a target boundary corresponding to the target object.
  26. The artificial intelligent image processing method of any one of claims 14 to 25, wherein the boundary of the target object is a quadrilateral box.
  27. A non-transitory computer-readable storage medium, comprising at least one set of instructions for artificial intelligent object detection, wherein when executed by at least one processor of a computing device, the at least one set of instructions directs the at least one processor to perform acts of:
    obtaining an image including a target object;
    generating a plurality of feature maps by inputting the image into a convolutional neural network (CNN) ;
    determining a plurality of region proposals based on the plurality of feature maps;
    determining a plurality of pooling region proposals based on the plurality of region proposals and the plurality of feature maps;
    classifying the plurality of pooling region proposals into one or more object categories or a background category via a classifier, the one or more object categories including a category of the target object, the plurality of pooling region proposals including one or more pooling region proposals corresponding to the target object, each of the one or more pooling region proposals having a plurality of corners; and
    for each of the one or more pooling region proposals corresponding to the target object,
    determining a plurality of crop strategies for each corner of the plurality of corners of the pooling region proposal according to a position of the corresponding corner;
    trimming the pooling region proposal by cropping each of the plurality of corners according to one of the plurality of crop strategies;
    identifying a boundary to the trimmed pooling region proposal based on the cropped plurality of corners; and
    mapping the boundary to the image to determine a boundary of the target object.
  28. An artificial intelligent image processing system for object detection, comprising:
    an acquisition module configured to obtain an image including a target object;
    a feature map determination module configured to generate a plurality of feature maps by inputting the image into a convolutional neural network (CNN) ;
    a region proposal determination module configured to determine a plurality of region proposals based on the plurality of feature maps;
    a pooling region proposal determination module configured to determine a plurality of pooling region proposals based on the plurality of region proposals and the plurality of feature maps;
    a classification module configured to classify the plurality of pooling region proposals into one or more object categories or a background category via a classifier, the one or more object categories including a category of the target object, the plurality of pooling region proposals including one or more pooling region proposals corresponding to the target object, each of the one or more pooling region proposals having a plurality of corners; and
    for each of the one or more pooling region proposals corresponding to the target object,
    a boundary determination module configured to determine a plurality of crop strategies for each corner of the plurality of corners of the pooling region proposal according to a position of the corresponding corner;
    the boundary determination module configured to trim the pooling region proposal by cropping each of the plurality of corners according to one of the plurality of crop strategies;
    the boundary determination module configured to identify a boundary to the trimmed pooling region proposal based on the cropped plurality of corners; and
    the boundary determination module configured to map the boundary to the image to determine a boundary of the target object.