CN112602091A - Object detection using multiple neural networks trained for different image fields - Google Patents

Object detection using multiple neural networks trained for different image fields

Info

Publication number
CN112602091A
CN112602091A (application CN201980055920.4A)
Authority
CN
China
Prior art keywords
field image
far
image segment
segment
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201980055920.4A
Other languages
Chinese (zh)
Inventor
S·D·安丘
王北楠
J·格洛斯纳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Optimum Semiconductor Technologies Inc
Original Assignee
Optimum Semiconductor Technologies Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Optimum Semiconductor Technologies Inc filed Critical Optimum Semiconductor Technologies Inc
Publication of CN112602091A publication Critical patent/CN112602091A/en
Pending legal-status Critical Current

Classifications

    • G06V20/58 Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • B60W60/0027 Planning or execution of driving tasks using trajectory prediction for other traffic participants
    • G06F18/254 Fusion techniques of classification results, e.g. of results related to same input data
    • G06F18/256 Fusion techniques of classification results relating to different input data, e.g. multimodal recognition
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G06T7/194 Segmentation; Edge detection involving foreground-background segmentation
    • G06T7/20 Analysis of motion
    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G06V10/809 Fusion of classification results, e.g. where the classifiers operate on the same input data
    • G06V10/811 Fusion of classification results where the classifiers operate on different input data, e.g. multi-modal recognition
    • G06V10/82 Arrangements for image or video recognition or understanding using neural networks
    • B60W2420/403 Image sensing, e.g. optical camera
    • B60W2420/408 Radar; Laser, e.g. lidar
    • G06F18/2414 Smoothing the distance, e.g. radial basis function networks [RBFN]
    • G06N3/063 Physical realisation of neural networks using electronic means
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G06T2207/10028 Range image; Depth image; 3D point clouds
    • G06T2207/30252 Vehicle exterior; Vicinity of vehicle
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V20/49 Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • G06V20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/588 Recognition of the road, e.g. of lane markings; Recognition of the vehicle driving pattern in relation to the road
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Automation & Control Theory (AREA)
  • Human Computer Interaction (AREA)
  • Transportation (AREA)
  • Mechanical Engineering (AREA)
  • Image Analysis (AREA)
  • Traffic Control Systems (AREA)

Abstract

A system and method relating to object detection may include: receiving an image frame comprising an array of pixels captured by an image sensor associated with a processing device, identifying a near-field image segment and a far-field image segment in the image frame, applying a first neural network trained for the near-field image segment to detect objects present in the near-field image segment, and applying a second neural network trained for the far-field image segment to detect objects present in the far-field image segment.

Description

Object detection using multiple neural networks trained for different image fields
Cross Reference to Related Applications
This application claims priority to U.S. provisional application 62/711,695, filed July 30, 2018, the contents of which are incorporated herein by reference in their entirety.
Technical Field
The present disclosure relates to detecting objects in images, and more particularly, to systems and methods for object detection using multiple neural networks trained for different image fields.
Background
Computer systems programmed to detect objects in an environment have a wide range of industrial applications. For example, autonomous vehicles may be equipped with sensors (e.g., lidar sensors and cameras) to capture sensor data around the vehicle. Further, the autonomous vehicle may be equipped with a computer system that includes a processing device that executes executable code to detect objects around the vehicle based on the sensor data.
Neural networks are used for object detection. The neural network in the present disclosure is an artificial neural network that may be implemented using circuitry to make decisions based on input data. A neural network may include one or more layers of nodes, where each node may be implemented in hardware as a computational circuit element for performing computations. Nodes in the input layer may receive input data into the neural network. A node in an inner layer may receive output data generated by a node in a previous layer. Further, nodes in the layer may perform certain calculations and generate output data for nodes of subsequent layers. The nodes of the output layer may generate output data for the neural network. Thus, a neural network may contain multiple layers of nodes to perform the computation of the forward propagation from the input layer to the output layer.
Drawings
The present disclosure will be understood more fully from the detailed description given below and from the accompanying drawings of various embodiments of the disclosure. The drawings, however, should not be taken to limit the disclosure to the specific embodiments, but are for explanation and understanding only.
FIG. 1 illustrates a system for detecting an object using multiple compact neural networks matched to different image fields, according to an embodiment of the present disclosure.
Fig. 2 illustrates a decomposition of an image frame according to an embodiment of the present disclosure.
Fig. 3 illustrates a decomposition of an image frame into near field image segments and far field image segments according to an embodiment of the present disclosure.
FIG. 4 depicts a flow diagram of a method of using a multi-field object detector according to an embodiment of the present disclosure.
Fig. 5 depicts a block diagram of a computer system operating in accordance with one or more aspects of the present disclosure.
Detailed Description
The neural network may include a plurality of layers of nodes. The layers may include an input layer, an output layer, and hidden layers in between. The computation of the neural network propagates from the input layer through the hidden layers to the output layer. Each layer may include nodes whose values are calculated from the previous layer through edges connecting nodes of the current layer to nodes of the previous layer. An edge may connect a node in one layer to a node in an adjacent layer, and each edge may be associated with a weight value. Accordingly, the value associated with a node of the current layer may be a weighted sum of the node values of the previous layer.
One type of neural network is a convolutional neural network (CNN), where the computation performed at a hidden layer may be a convolution of the node values of the previous layer with the weight values associated with the edges. For example, the processing device may apply a convolution operation to the input layer to generate the node values of a first hidden layer connected to the input layer by edges, apply a convolution operation to the first hidden layer to generate the node values of a second hidden layer, and so on, until the computation reaches the output layer. The processing device may apply a soft combining operation to the output data to generate a detection result. The detection result may include the identity of the detected object and its location.
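For illustration only, the following is a minimal PyTorch sketch of the forward propagation described above: a compact CNN whose hidden layers are convolutions and whose output layer produces class scores and a coarse location estimate. The layer sizes, class count, and the name CompactDetector are assumptions made for this sketch and do not come from the disclosure.

```python
import torch
import torch.nn as nn

# Illustrative compact CNN: convolutional hidden layers followed by a small
# head that outputs class scores plus a bounding-box estimate (x, y, w, h).
class CompactDetector(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, num_classes + 4),  # class scores + box parameters
        )

    def forward(self, x):
        # Forward propagation from the input layer through the hidden layers
        # to the output layer.
        return self.head(self.features(x))

# One RGB image segment (here 80x50 pixels) propagated through the network.
outputs = CompactDetector()(torch.randn(1, 3, 80, 50))
```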
The topology of the neural network and the weight values associated with its edges are determined in a training phase. During the training phase, training input data is fed into the CNN in forward propagation (from the input layer to the output layer). The output of the CNN is compared with the target output data to calculate error data. Based on the error data, the processing device may perform backpropagation, in which the weight values associated with the edges are adjusted according to discriminant analysis. The process of forward and backward propagation may be repeated until the error data meets certain performance requirements during a verification process. The CNN can then be used for object detection. A CNN may be trained for a particular class of objects (e.g., humans) or for multiple classes of objects (e.g., cars, pedestrians, and trees).
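As a rough sketch of the forward/backward cycle described above, the step below computes the error against target labels and backpropagates it with gradient descent. It reuses the hypothetical CompactDetector from the previous sketch; the loss function and learning rate are illustrative choices, not values given in the disclosure.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

def train_step(model, optimizer, images, labels):
    """One forward/backward pass: compute the error and adjust edge weights."""
    optimizer.zero_grad()
    class_scores = model(images)[:, :2]     # keep only the class-score outputs
    loss = criterion(class_scores, labels)  # error between output and target data
    loss.backward()                         # backward propagation of the error
    optimizer.step()                        # adjust weight values along the gradient
    return loss.item()
```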
Autonomous vehicles are often equipped with a computer system for object detection. Instead of relying on a human operator to detect objects in the surrounding environment, an in-vehicle computer system may be programmed to use sensors to capture information of the environment and detect objects based on sensor data. Sensors used by autonomous vehicles may include cameras, lidar, radar, and the like.
In some embodiments, one or more cameras are used to capture images of the surrounding environment. The camera may include an optical lens, an array of photosensitive elements, a digital image processing unit, and a storage device. The optical lens may receive the light beam and focus the light beam on an image plane. Each optical lens may be associated with a focal length, which is the distance between the lens and the image plane. In practice, the video camera may have a fixed focal length, where the focal length may determine the field of view (FOV). The field of view of an optical device (e.g., a video camera) refers to the viewable area through the optical device. A shorter focal length may be associated with a wider field of view; a longer focal length may be associated with a narrower field of view.
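The stated relationship between focal length and field of view can be made concrete with the standard pinhole-camera approximation; this is a general illustration, and the sensor width used below is an assumed value rather than one given in the disclosure.

```python
import math

def horizontal_fov_deg(sensor_width_mm: float, focal_length_mm: float) -> float:
    """Horizontal field of view of a fixed-focal-length camera (pinhole model)."""
    return math.degrees(2 * math.atan(sensor_width_mm / (2 * focal_length_mm)))

# Shorter focal length -> wider field of view (assumed 6.4 mm sensor width).
print(horizontal_fov_deg(6.4, 4.0))   # ~77.3 degrees
print(horizontal_fov_deg(6.4, 12.0))  # ~29.9 degrees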
The array of photosensitive elements may be fabricated in a silicon plane located at a position along the optical axis of the lens to capture the light beam passing through the lens. The image sensing elements may be Charge Coupled Device (CCD) elements, Complementary Metal Oxide Semiconductor (CMOS) elements or any suitable type of light sensitive devices. Each light sensitive element may capture a different color component (red, green, blue) of the light impinging on the light sensitive element. The array of photosensitive elements may comprise a rectangular array of a predetermined number of elements (e.g., M by N, where M and N are integers). The total number of elements in the array may determine the resolution of the camera.
The digital image processing unit is a hardware processor that can be coupled to an array of photosensitive elements to capture the response of these photosensitive elements to light. The digital image processing unit may include an analog-to-digital converter (ADC) to convert the analog signal from the photosensitive element into a digital signal. The digital image processing unit may also perform a filtering operation on the digital signal and encode the digital signal according to a video compression standard.
In one embodiment, the digital image processing unit may be coupled to a timing generator and record images captured by the photosensitive elements at a predetermined rate (e.g., 30 or 60 frames per second). Each recorded image is referred to as an image frame and comprises a rectangular array of pixels. Thus, image frames captured by a fixed-focal-length video camera at a fixed spatial resolution may be stored in a storage device for further processing, such as object detection, where the resolution is defined by the number of pixels per unit area of the image frame.
One technical challenge for autonomous vehicles is detecting human bodies based on images captured by one or more video cameras. A neural network may be trained to recognize human bodies in an image, and the trained neural network may then be deployed in actual operation to detect a human body. If the focal length is much shorter than the distance between the human body and the camera lens, the optical magnification of the video camera can be expressed as G = f/p = i/o, where p is the distance from the object to the center of the lens, f is the focal length, i (measured in number of pixels) is the length of the object projected onto the image frame, and o is the height of the object. As the distance p increases, the number of pixels associated with the object decreases. As a result, fewer pixels are used to capture the height of a human body at a distance. Since fewer pixels provide less information about the human body, a trained neural network may have difficulty detecting a distant human body. For example, assume the focal length f is 0.1 m (meters), the height o of the object is 2 m, the pixel density k is 100 pixels/mm, and the minimum number of pixels for object detection Nmin is 80 pixels. The maximum distance for reliable object detection is then p = f·o/(Nmin/k) = 0.1 × 2/(0.8 × 10⁻³) = 250 m. Therefore, a distance exceeding 250 m is defined as the far field. If i is 40 pixels, p is 500 m. Thus, if the far field covers the range 250-500 m, the resolution used to represent the object needs to be doubled from 40 pixels to 80 pixels.
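The arithmetic in this example can be reproduced in a few lines; the values of f, o, k, and Nmin below are the ones stated in the text, and the script is only a check of that calculation.

```python
# Maximum distance at which an object still spans N_min pixels:
#   p = f * o / (N_min / k), where N_min / k is the projected length on the sensor.
f = 0.1            # focal length, meters
o = 2.0            # object (human body) height, meters
k = 100_000        # pixel density: 100 pixels/mm = 100,000 pixels per meter
n_min = 80         # minimum pixels needed for reliable detection

p_max = f * o / (n_min / k)
print(p_max)             # 250.0 m -> beyond this range is treated as far field

# Halving the pixel requirement doubles the distance: 40 pixels -> 500 m.
print(f * o / (40 / k))  # 500.0
```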
To overcome the above-described and other drawbacks of object detection using neural networks, embodiments of the present disclosure provide systems and methods that may divide a two-dimensional region of an image frame into image segments. Each image segment may be associated with a particular image field including at least one of a far field or a near field. The image segments associated with the far field may have a higher resolution than the image segments associated with the near field. Thus, an image segment associated with the far field may include more pixels than an image segment associated with the near field. Embodiments of the present disclosure may further provide for each image segment a neural network trained specifically for that image segment, wherein the number of neural networks is the same as the number of image segments. Because each image segment is much smaller than the entire image frame, the neural network associated with the image segment is more compact and can provide more accurate detection results.
Embodiments of the present disclosure may also track detected human bodies through different segments associated with different fields (e.g., from far-field to near-field) to further reduce false alarm rates. When a human body moves within range of the lidar sensor, the lidar sensor and the video camera may be paired together to detect the human body.
FIG. 1 illustrates a system 100 for detecting an object using multiple compact neural networks matched to different image fields according to an embodiment of the present disclosure. As shown in FIG. 1, the system 100 may include a processing device 102, an accelerator circuit 104, and a memory device 106. The system 100 may optionally include sensors, such as a lidar sensor 120 and a video camera 122. The system 100 may be a computing system (e.g., on an autonomous vehicle) or a system on a chip (SoC). The processing device 102 may be a hardware processor, such as a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), or a general-purpose processing unit. In one embodiment, the processing device 102 may be programmed to perform certain tasks, including delegating compute-intensive tasks to the accelerator circuit 104.
The accelerator circuit 104 may be communicatively coupled to the processing device 102 to perform computationally intensive tasks using dedicated circuitry therein. The dedicated circuitry may be an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), a network processor, or the like. In one embodiment, the accelerator circuit 104 may include a plurality of compute circuit elements (CCEs), which are circuit elements that may be programmed to perform a certain type of computation. For example, to implement a neural network, CCEs may be programmed under the instructions of the processing device 102 to perform operations such as weighted summation and convolution. Thus, each CCE may be programmed to perform the computation associated with a node of the neural network, and a set of CCEs of the accelerator circuit 104 may be programmed as a layer (visible or hidden) of nodes in the neural network. In one embodiment, in addition to performing the calculations, a CCE may also include local storage (e.g., registers, not shown) to store parameters (e.g., synaptic weights) used in the calculations. Thus, for conciseness of description, each CCE in the present disclosure corresponds to a circuit element that carries out the calculation of parameters associated with a node of the neural network. The processing device 102 may be programmed with instructions to construct the architecture of the neural network and train the neural network for a particular task.
The memory device 106 may include a storage device communicatively coupled to the processing device 102 and the accelerator circuit 104. In one implementation, the memory device 106 may store input data 116 to the multi-field object detector 108 for execution by the processing device 102 and output data 118 generated by the multi-field object detector 108. The input data 116 may be sensor data captured by sensors such as a lidar sensor 120 and a video camera 122. The output data may be the result of object detection by the multi-field object detector 108. The object detection result may be recognition of a human body.
In one embodiment, the processing device 102 may be programmed to execute a multi-field object detector 108 which, when executed, may detect a human body based on the input data 116. Instead of utilizing a single neural network to detect objects in the full-resolution image frames captured by the video camera 122, implementations of the multi-field object detector 108 may employ a combination of several reduced-complexity neural networks to achieve object detection. In one embodiment, the multi-field object detector 108 may decompose a video image captured by the video camera 122 into a near-field image segment and a far-field image segment, where the far-field image segment may have a higher resolution than the near-field image segment. The size of either the far-field image segment or the near-field image segment is smaller than the size of the full-resolution image. The multi-field object detector 108 may apply a convolutional neural network (CNN) 110 trained specifically for near-field image segments to the near-field image segment, and a CNN 112 trained specifically for far-field image segments to the far-field image segment. The multi-field object detector 108 may also track, over time, a human body detected in the far field as it approaches the near field, until the human body reaches the range of the lidar sensor 120. The multi-field object detector 108 may then apply a CNN 114 trained specifically for lidar data to the lidar data. Because the CNNs 110, 112 are trained for the near-field and far-field image segments, respectively, they may be compact CNNs that are smaller than a CNN trained for full-resolution images.
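At a high level, the detector routes each field-specific segment to its own compact network, as the following rough sketch illustrates. The function names are placeholders for this sketch (the CNN arguments stand for trained models that return detection lists, and extract_segments is sketched below, after the discussion of FIG. 3); none of these identifiers come from the disclosure.

```python
# Hypothetical dispatch loop for a multi-field object detector.
def detect_objects(frame, near_cnn, far_cnn, lidar_cnn=None, lidar_data=None):
    near_seg, far_seg = extract_segments(frame)   # see the extraction sketch below
    detections = []
    detections += near_cnn(near_seg)              # CNN trained on near-field segments
    detections += far_cnn(far_seg)                # CNN trained on far-field segments
    if lidar_cnn is not None and lidar_data is not None:
        detections += lidar_cnn(lidar_data)       # optional lidar-trained network
    return detections
```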
The multi-field object detector 108 can decompose the full-resolution image into a near-field image representation (referred to as the "near-field image segment") that captures objects closer to the optical lens and a far-field image representation (referred to as the "far-field image segment") that captures objects farther from the optical lens. Fig. 2 illustrates a decomposition of an image frame according to an embodiment of the present disclosure. As shown in fig. 2, the optical system of the video camera 200 may include a lens 202 and an image plane (e.g., an array of photosensitive elements) 204 at a distance from the lens 202, wherein the image plane is within the depth of field of the video camera. Depth of field refers to the range of distances over which an object captured at the image plane appears acceptably sharp in the image. Objects far from the lens 202 are projected onto a small area of the image plane and therefore require a higher resolution (sharper focus, more pixels) to be identified. In contrast, objects near the lens 202 are projected onto a large area of the image plane and can therefore be identified at a lower resolution (fewer pixels). As shown in fig. 2, the near-field image segment covers a larger area of the image plane than the far-field image segment. In some cases, the near-field image segment may overlap a portion of the far-field image segment in the image plane.
Fig. 3 illustrates the decomposition of an image frame 300 into near field image segments 302 and far field image segments 304 according to an embodiment of the present disclosure. Although the above embodiments are discussed using near field and far field image segments as an example, embodiments of the present disclosure may also include more than two image segments, where each image segment is associated with a specially trained neural network. For example, the image segments may include a near field image segment, a mid field image segment, and a far field image segment. The processing device may apply different neural networks to the near field, mid field, and far field image segments for human detection.
The video camera may record a stream of image frames that includes a pixel array corresponding to the photosensitive elements on the image plane 204. Each image frame may include a plurality of rows of pixels. Thus, as shown in FIG. 2, the area of image frame 300 is proportional to the area of image plane 204. As shown in fig. 3, the near field image segment 302 may cover a larger portion of the image frame than the far field image segment 304 because objects near the optical lens are projected onto a larger area of the image plane. In one embodiment, the near field image segment 302 and the far field image segment 304 may be extracted from an image frame, where the near field image segment 302 is associated with a lower resolution (e.g., a sparse sampling pattern 306) and the far field image segment 304 is associated with a higher resolution (e.g., a dense sampling pattern 308).
In one embodiment, the processing device 102 may execute an image pre-processor to extract the near field image segment 302 and the far field image segment 304. The processing device 102 may first identify the top band 310 and the bottom band 312 of the image frame 300 and discard them. The processing device 102 may identify the top band 310 as a first predetermined number of pixel rows and the bottom band 312 as a second predetermined number of pixel rows. The processing device 102 may discard the top band 310 and the bottom band 312 because they cover the sky and the road directly in front of the camera, respectively, and typically do not contain a human body.
The processing device 102 may further identify a first range of pixel rows for the near field image segment 302 and a second range of pixel rows for the far field image segment 304, where the first range may be larger than the second range. The first range may include a third predetermined number of pixel rows in the middle of the image frame; the second range may include a fourth predetermined number of pixel rows vertically above the centerline of the image frame. The processing device 102 may decimate pixels within the first range of pixel rows using the sparse sub-sampling pattern 306 and decimate pixels within the second range of pixel rows using the dense sub-sampling pattern 308. In one embodiment, the near field image segment 302 is decimated using a large decimation factor (e.g., 8) and the far field image segment 304 is decimated using a small decimation factor (e.g., 2), such that the resolution of the extracted far field image segment 304 is higher than that of the extracted near field image segment 302. In one embodiment, the resolution of the far field image segment 304 may be twice the resolution of the near field image segment 302. In another embodiment, the resolution of the far field image segment 304 may be greater than twice the resolution of the near field image segment 302.
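The following is a minimal NumPy sketch of this extraction step. The decimation factors (8 and 2) follow the example values above, but the band sizes and the far-field row range are assumptions chosen only to make the sketch self-contained; a real pre-processor would derive them from the camera geometry.

```python
import numpy as np

def extract_segments(frame: np.ndarray,
                     top_rows: int = 100, bottom_rows: int = 100,
                     near_decimation: int = 8, far_decimation: int = 2):
    """Split an H x W x 3 frame into a low-resolution near-field segment and a
    higher-resolution far-field segment (band sizes are illustrative)."""
    h = frame.shape[0]
    body = frame[top_rows:h - bottom_rows]      # discard sky and near-road bands

    # Near field: a wide band of middle rows, sparsely sub-sampled.
    near = body[::near_decimation, ::near_decimation]

    # Far field: a narrow band above the frame centerline, densely sub-sampled.
    center = h // 2
    far_band = frame[center - 120:center - 20]  # assumed row range
    far = far_band[::far_decimation, ::far_decimation]
    return near, far
```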
The video camera may capture a stream of image frames at a certain frame rate (e.g., 30 or 60 frames per second). The processing device 102 may execute the image pre-processor to extract the corresponding near field image segment 302 and far field image segment 304 for each image frame in the stream. In one embodiment, a first neural network is trained for human detection based on near-field image segment data and a second neural network is trained for human detection based on far-field image segment data. The number of nodes in the first and second neural networks is small compared to a neural network trained on full-resolution image frames.
FIG. 4 depicts a flow diagram of a method 400 of using a multi-field object detector in accordance with an embodiment of the present disclosure. The method 400 may be performed by a processing device that may comprise hardware (e.g., circuitry, dedicated logic), computer readable instructions (e.g., running on a general purpose computer system or a dedicated machine), or a combination of both. The method 400 and its individual functions, routines, subroutines, or operations may each be performed by one or more processors of a computer device executing the method. In some embodiments, method 400 may be performed by a single processing thread. Alternatively, the method 400 may be performed by two or more processing threads, each thread performing one or more separate functions, routines, subroutines, or operations of the method.
For simplicity of explanation, the methodologies of the present disclosure are depicted and described as a series of acts. However, acts may occur in various orders and/or concurrently, and with other acts not presented and described herein, in accordance with the disclosure. Moreover, not all illustrated acts may be required to implement a methodology in accordance with the disclosed subject matter. In addition, those skilled in the art will understand and appreciate that the methodologies could alternatively be represented as a series of interrelated states via a state diagram or events. Additionally, it should be appreciated that the methodologies disclosed in this specification are capable of being stored on an article of manufacture to facilitate transporting and transferring such methodologies to computing devices. The term "article of manufacture" as used herein is intended to encompass a computer program accessible from any computer-readable device or storage media. In one embodiment, the method 400 may be performed by the processing device 102 executing the multi-field object detector 108 and the accelerator circuit 104 supporting CNN as shown in fig. 1.
Compact neural networks for human detection may require training before being deployed on an autonomous vehicle. During the training process, the weight parameters associated with the edges of the neural network may be adjusted and selected based on certain criteria. The training of the neural network may be done offline using a publicly available database. These publicly available databases may include images of outdoor scenes that include human bodies that have been manually marked. In one embodiment, the images of the training data may be further processed to identify human bodies in the far field and the near field. For example, the far field image may be a 50x80 pixel window cropped from the image. Thus, the training data may include far-field training data and near-field training data. Training can be performed offline by a more powerful computer (referred to as a "training computer system").
The processing device of the training computer system may train a first neural network based on the near-field training data and a second neural network based on the far-field training data. The neural networks may be convolutional neural networks (CNNs), and the training may be based on backpropagation. The trained first and second neural networks are smaller than a neural network trained on full-resolution image frames. After training, the first and second neural networks may be used by the autonomous vehicle to detect objects (e.g., human bodies) on the road.
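A rough offline-training sketch under these assumptions is shown below. It reuses the hypothetical CompactDetector and train_step from the earlier sketches; near_loader and far_loader are placeholders for data loaders over labeled near-field segments and far-field crops (e.g., the 50x80 pixel windows mentioned above), and the epoch count and learning rate are arbitrary.

```python
import torch

# Hypothetical offline training of the two compact CNNs on a training computer.
near_net, far_net = CompactDetector(), CompactDetector()
near_opt = torch.optim.SGD(near_net.parameters(), lr=1e-3)
far_opt = torch.optim.SGD(far_net.parameters(), lr=1e-3)

for _ in range(20):                                    # illustrative epoch count
    for images, labels in near_loader:                 # labeled near-field segments
        train_step(near_net, near_opt, images, labels)
    for images, labels in far_loader:                  # labeled far-field crops
        train_step(far_net, far_opt, images, labels)
```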
Referring to fig. 4, at 402, the processing device 102 (or a different processing device on the autonomous vehicle) may identify a stream of image frames captured by a video camera during operation of the autonomous vehicle. The processing device is to detect human bodies in this stream.
At 404, the processing device 102 may extract near-field image segments and far-field image segments from the image frames of the stream using the method described above in connection with fig. 3. The resolution of the near field image segments may be lower than the resolution of the far field image segments.
At 406, the processing device 102 may apply a first neural network trained based on near-field training data to the near-field image segment to identify a human body in the near-field image segment.
At 408, the processing device 102 may apply a second neural network trained based on far-field training data to the far-field image segment to identify the human body in the far-field image segment.
At 410, in response to detecting a human body in the far-field image segment, the processing device 102 may record the detected human body in a record and track the human body through image frames from the far field to the near field. The processing device 102 may use a polynomial fit and/or a Kalman predictor to predict the location of the detected human body in a subsequent image frame, and apply the second neural network to the far-field image segment extracted from the subsequent image frame to determine whether the human body is located at the predicted location. If the processing device determines that no human body is present at the predicted position, the detection is considered a false positive and the entry corresponding to the human body is deleted from the record.
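As an illustration of the tracking step, the sketch below extrapolates a track with a low-order polynomial fit and prunes it when predictions repeatedly go unconfirmed; a Kalman predictor could be substituted for the polynomial fit. The class name, gating threshold, and miss-counting logic are assumptions made for this sketch.

```python
import numpy as np

class Track:
    """Track a far-field detection across frames by polynomial extrapolation."""
    def __init__(self, first_xy):
        self.history = [first_xy]   # (x, y) image locations over time
        self.misses = 0

    def predict(self):
        pts = np.array(self.history)
        if len(self.history) < 3:
            return pts[-1]
        xs = np.arange(len(self.history))
        # Fit a quadratic to each coordinate and extrapolate one frame ahead.
        fx = np.polyfit(xs, pts[:, 0], 2)
        fy = np.polyfit(xs, pts[:, 1], 2)
        t = len(self.history)
        return np.array([np.polyval(fx, t), np.polyval(fy, t)])

    def update(self, detections, gate=20.0):
        """Associate the nearest detection with the prediction; count misses."""
        pred = self.predict()
        if detections:
            d = min(detections, key=lambda xy: np.linalg.norm(np.array(xy) - pred))
            if np.linalg.norm(np.array(d) - pred) < gate:
                self.history.append(d)
                self.misses = 0
                return True
        self.misses += 1   # repeated misses mark the track as a false positive
        return False
```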
At 412, the processing device 102 may further determine whether the approaching human body is within range of a lidar sensor paired with the video camera on the autonomous vehicle for human detection. The lidar may detect objects at a range shorter than the far field but within the near field. In response to determining that the human body is within range of the lidar sensor (e.g., by detecting an object at a corresponding location in a far-field image segment), the processing device may apply a third neural network trained on lidar sensor data to the lidar sensor data and apply the second neural network to the far-field image segment (or the first neural network to the near-field image segment). In this way, lidar sensor data may be used in conjunction with image data to further improve human detection.
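A very simple fusion rule consistent with this step is sketched below: once the tracked object is within lidar range, a detection is confirmed only when the image-based network and the lidar-trained network agree. The range value and function name are assumptions; the disclosure does not specify a fusion rule or sensor range.

```python
# Hypothetical fusion step: confirm a tracked human body once it enters lidar range.
LIDAR_RANGE_M = 100.0   # assumed sensor range, not specified in the disclosure

def confirm_with_lidar(track_distance_m: float, camera_hit: bool, lidar_hit: bool) -> bool:
    """Require agreement between the image-based CNN and the lidar-trained CNN
    once the tracked object is within lidar range; otherwise rely on the camera."""
    if track_distance_m <= LIDAR_RANGE_M:
        return camera_hit and lidar_hit
    return camera_hit
```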
The processing device 102 may also operate the autonomous vehicle based on the detection of the human body. For example, the processing device 102 may operate the vehicle to stop or avoid a collision with a human body.
Fig. 5 depicts a block diagram of a computer system operating in accordance with one or more aspects of the present disclosure. In various illustrative examples, computer system 500 may correspond to system 100 of fig. 1.
In some embodiments, computer system 500 may be connected (e.g., via a network such as a Local Area Network (LAN), intranet, extranet, or the internet) to other computer systems. The computer system 500 may operate in a client-server environment as a server or client computer, or in a peer-to-peer or distributed network environment as a peer computer. Computer system 500 may be provided by a Personal Computer (PC), a tablet PC, a set-top box (STB), a Personal Digital Assistant (PDA), a cellular telephone, a Web appliance, a server, a network router, switch or bridge, or any device capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that device. Furthermore, the term "computer" shall include any collection of computers that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein.
In another aspect, computer system 500 may include a processing device 502, a volatile memory 504 (e.g., Random Access Memory (RAM)), a non-volatile memory 506 (e.g., read-only memory (ROM) or Electrically Erasable Programmable ROM (EEPROM)), and a data storage device 516, which may communicate with each other via a bus 508.
The processing device 502 may be provided by one or more processors, such as a general-purpose processor (e.g., a Complex Instruction Set Computing (CISC) microprocessor, Reduced Instruction Set Computing (RISC) microprocessor, Very Long Instruction Word (VLIW) microprocessor, microprocessor implementing other types of instruction sets, or microprocessors implementing combinations of various types of instruction sets) or a special-purpose processor (e.g., an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), or a network processor).
The computer system 500 may also include a network interface device 522. The computer system 500 may also include a video display unit 510 (e.g., an LCD), an alphanumeric input device 512 (e.g., a keyboard), a cursor control device 514 (e.g., a mouse), and a signal generation device 520.
Data storage 516 may include a non-transitory computer-readable storage medium 524 on which may be stored instructions 526 encoding any one or more of the methods or functions described herein, including instructions of multi-field object detector 108 of fig. 1, for implementing method 400.
The instructions 526 may also reside, completely or partially, within the volatile memory 504 and/or within the processing device 502 during execution thereof by the computer system 500, such that the volatile memory 504 and the processing device 502 may also constitute machine-readable storage media.
While the computer-readable storage medium 524 is shown in an illustrative example to be a single medium, the term "computer-readable storage medium" should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that store the one or more sets of executable instructions. The term "computer-readable storage medium" shall also be taken to include a tangible medium that is capable of storing or encoding a set of instructions for execution by the computer to cause the computer to perform any one or more of the methodologies described herein. The term "computer readable storage medium" shall include, but not be limited to, solid-state memories, optical media, and magnetic media.
The methods, components and features described herein may be implemented by discrete hardware components or may be integrated into the functionality of other hardware components such as ASICs, FPGAs, DSPs or similar devices. Additionally, the methods, components and features may be implemented by firmware modules or functional circuits within a hardware device. Furthermore, the methods, components and features may be implemented in any combination of hardware devices and computer program components, or in a computer program.
Unless specifically stated otherwise, terms such as "receiving," "associating," "determining," "updating," or the like, refer to the action and processes performed or effected by a computer system that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices. Also, as used herein, the terms "first," "second," "third," "fourth," etc. refer to labels used to distinguish between different elements, and may not have an ordinal meaning according to their numerical designation.
Examples described herein also relate to an apparatus for performing the methods described herein. The apparatus may be specially constructed for carrying out the methods described herein, or it may comprise a general purpose computer system selectively programmed by a computer program stored in the computer system. Such a computer program may be stored in a tangible storage medium readable by a computer.
The methods and illustrative examples described herein are not inherently related to any particular computer or other apparatus. Various general purpose systems may be used with the teachings described herein, or it may prove convenient to construct a more specialized apparatus to perform the method 400 and/or each of its individual functions, routines, subroutines, or operations. Structural examples of various of these systems are set forth in the description above.
The above description is intended to be illustrative, and not restrictive. While the present disclosure has been described with reference to specific illustrative examples and embodiments, it will be recognized that the present disclosure is not limited to the described examples and embodiments. The scope of the disclosure should be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.

Claims (20)

1. A method for detecting an object using a plurality of sensor devices, comprising:
receiving, by a processing device, an image frame comprising an array of pixels captured by an image sensor associated with the processing device;
identifying, by the processing device, near field image segments and far field image segments in the image frame;
applying, by the processing device, a first neural network trained for a near-field image segment to the near-field image segment to detect an object present in the near-field image segment; and
applying, by the processing device, a second neural network trained for a far-field image segment to the far-field image segment to detect objects present in the far-field image segment.
2. The method of claim 1, wherein each of the near field image segment or the far field image segment includes fewer pixels than the image frame.
3. The method according to claim 1 or 2, wherein the near field image segment comprises a first number of pixel rows and the far field image segment comprises a second number of pixel rows, and wherein the first number of pixel rows is smaller than the second number of pixel rows.
4. A method according to claim 1 or 2, wherein the number of pixels of the near field image segment is less than the number of pixels of the far field image segment.
5. A method according to claim 1 or 2, wherein the resolution of the near field image segments is lower than the resolution of the far field image segments.
6. The method of claim 1 or 2, wherein the near field image segments capture a scene at a first distance from an image plane of the image sensor and the far field image segments capture a scene at a second distance from the image plane, and wherein the first distance is less than the second distance.
7. The method of claim 1, further comprising:
operating an autonomous vehicle based on detection of a first object or a second object in response to at least one of identifying the first object in the near field image segment or identifying the second object in the far field image segment.
8. The method of claim 1, further comprising:
in response to detecting a second object in the far field image segment, tracking the second object over time through a plurality of image frames from a range associated with the far field image segment to a range associated with one of the near field image segment or the far field image segment;
determining a range of the second object in a second image frame to reach a lidar sensor based on tracking the second object over time;
receiving lidar sensor data captured by the lidar sensor; and
applying a trained third neural network to the lidar sensor data to detect an object.
9. The method of claim 8, further comprising:
applying the first neural network to the near-field image segments of the second image frame or applying the second neural network to the far-field image segments of the second image frame; and
validating an object detected by at least one of applying the first neural network or applying the second neural network using an object detected by applying the third neural network.
10. A system for detecting an object using a plurality of sensor devices, comprising:
an image sensor;
a storage device for storing instructions; and
a processing device, communicatively coupled to the image sensor and the storage device, to execute the instructions to:
receiving an image frame comprising an array of pixels captured by an image sensor associated with the processing device;
identifying near field image segments and far field image segments in the image frame;
applying a first neural network trained on a near-field image segment to the near-field image segment to detect objects present in the near-field image segment; and
applying a second neural network trained on a far-field image segment to the far-field image segment to detect objects present in the far-field image segment.
11. The system of claim 10, wherein each of the near field image segment or the far field image segment includes fewer pixels than the image frame.
12. The system of claim 10 or 11, wherein the near field image segment comprises a first number of pixel rows and the far field image segment comprises a second number of pixel rows, and wherein the first number of pixel rows is less than the second number of pixel rows.
13. The system according to claim 10 or 11, wherein the number of pixels of the near field image segment is less than the number of pixels of the far field image segment.
14. The system according to claim 10 or 11, wherein the resolution of the near field image segments is lower than the resolution of the far field image segments.
15. The system of claim 10 or 11, wherein the near field image segments capture a scene at a first distance from an image plane of the image sensor and the far field image segments capture a scene at a second distance from the image plane, and wherein the first distance is less than the second distance.
16. The system of claim 10, wherein the processing device is to:
operating an autonomous vehicle based on detection of a first object or a second object in response to at least one of identifying the first object in the near field image segment or identifying the second object in the far field image segment.
17. The system of claim 10, further comprising a lidar sensor, wherein the processing device is to:
in response to detecting a second object in the far field image segment, tracking the second object over time through a plurality of image frames from a range associated with the far field image segment to a range associated with one of the near field image segment or the far field image segment;
determining a range of the second object in a second image frame to reach a lidar sensor based on tracking the second object over time;
receiving lidar sensor data captured by the lidar sensor; and
applying a trained third neural network to the lidar sensor data to detect an object.
18. The system of claim 17, wherein the processing device is to:
applying the first neural network to the near-field image segments of the second image frame or applying the second neural network to the far-field image segments of the second image frame; and
validating an object detected by at least one of applying the first neural network or applying the second neural network using an object detected by applying the third neural network.
19. A non-transitory machine-readable storage medium storing instructions that, when executed, cause a processing device to perform operations for detecting an object using a plurality of sensor devices, the operations comprising:
receiving, by the processing device, an image frame comprising an array of pixels captured by an image sensor associated with the processing device;
identifying, by the processing device, near field image segments and far field image segments in the image frame;
applying, by the processing device, a first neural network trained for a near-field image segment to the near-field image segment to detect an object present in the near-field image segment; and
applying, by the processing device, a second neural network trained for a far-field image segment to the far-field image segment to detect objects present in the far-field image segment.
20. The non-transitory machine-readable storage medium of claim 19, wherein the near field image segment includes a first number of rows of pixels and the far field image segment includes a second number of rows of pixels, and wherein the first number of rows of pixels is less than the second number of rows of pixels.
CN201980055920.4A 2018-07-30 2019-07-24 Object detection using multiple neural networks trained for different image fields Pending CN112602091A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US201862711695P 2018-07-30 2018-07-30
US62/711695 2018-07-30
PCT/US2019/043244 WO2020028116A1 (en) 2018-07-30 2019-07-24 Object detection using multiple neural networks trained for different image fields

Publications (1)

Publication Number Publication Date
CN112602091A (en) 2021-04-02

Family

ID=69232087

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201980055920.4A Pending CN112602091A (en) 2018-07-30 2019-07-24 Object detection using multiple neural networks trained for different image fields

Country Status (5)

Country Link
US (1) US20220114807A1 (en)
EP (1) EP3830751A4 (en)
KR (1) KR20210035269A (en)
CN (1) CN112602091A (en)
WO (1) WO2020028116A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023155433A1 (en) * 2022-02-16 2023-08-24 海信视像科技股份有限公司 Video image analysis apparatus and video analysis method
WO2024044887A1 (en) * 2022-08-29 2024-03-07 Huawei Technologies Co., Ltd. Vision-based perception system

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7115502B2 (en) 2020-03-23 2022-08-09 トヨタ自動車株式会社 Object state identification device, object state identification method, computer program for object state identification, and control device
JP7388971B2 (en) 2020-04-06 2023-11-29 トヨタ自動車株式会社 Vehicle control device, vehicle control method, and vehicle control computer program
JP7359735B2 (en) * 2020-04-06 2023-10-11 トヨタ自動車株式会社 Object state identification device, object state identification method, computer program for object state identification, and control device
US11574100B2 (en) * 2020-06-19 2023-02-07 Micron Technology, Inc. Integrated sensor device with deep learning accelerator and random access memory
US20230004760A1 (en) * 2021-06-28 2023-01-05 Nvidia Corporation Training object detection systems with generated images
KR102485099B1 (en) * 2021-12-21 2023-01-05 주식회사 인피닉 Method for data purification using meta data, and computer program recorded on record-medium for executing method therefor
KR20230095505A (en) * 2021-12-22 2023-06-29 경기대학교 산학협력단 Video visual relation detection system

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102997900A (en) * 2011-09-15 2013-03-27 歌乐株式会社 Vehicle systems, devices, and methods for recognizing external worlds
CN105404844A (en) * 2014-09-12 2016-03-16 广州汽车集团股份有限公司 Road boundary detection method based on multi-line laser radar
US9672446B1 (en) * 2016-05-06 2017-06-06 Uber Technologies, Inc. Object detection for an autonomous vehicle
CN106934426A (en) * 2015-12-29 2017-07-07 三星电子株式会社 Method and apparatus for a neural network based on image signal processing
CN107122770A (en) * 2017-06-13 2017-09-01 驭势(上海)汽车科技有限公司 Multi-view camera system, intelligent driving system, automobile, method and storage medium
CN108229277A (en) * 2017-03-31 2018-06-29 北京市商汤科技开发有限公司 Gesture identification, control and neural network training method, device and electronic equipment
CN108334081A (en) * 2017-01-20 2018-07-27 福特全球技术公司 Deep recurrent convolutional neural network for object detection

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7841533B2 (en) * 2003-11-13 2010-11-30 Metrologic Instruments, Inc. Method of capturing and processing digital images of an object within the field of view (FOV) of a hand-supportable digitial image capture and processing system
US8165407B1 (en) * 2006-10-06 2012-04-24 Hrl Laboratories, Llc Visual attention and object recognition system
WO2008103929A2 (en) * 2007-02-23 2008-08-28 Johnson Controls Technology Company Video processing systems and methods
US9542626B2 (en) * 2013-09-06 2017-01-10 Toyota Jidosha Kabushiki Kaisha Augmenting layer-based object detection with deep convolutional neural networks
US10564714B2 (en) * 2014-05-09 2020-02-18 Google Llc Systems and methods for biomechanically-based eye signals for interacting with real and virtual objects
US20170206426A1 (en) * 2016-01-15 2017-07-20 Ford Global Technologies, Llc Pedestrian Detection With Saliency Maps
US9760806B1 (en) * 2016-05-11 2017-09-12 TCL Research America Inc. Method and system for vision-centric deep-learning-based road situation analysis
US20190340306A1 (en) * 2017-04-27 2019-11-07 Ecosense Lighting Inc. Methods and systems for an automated design, fulfillment, deployment and operation platform for lighting installations
US10236725B1 (en) * 2017-09-05 2019-03-19 Apple Inc. Wireless charging system with image-processing-based foreign object detection
US11567627B2 (en) * 2018-01-30 2023-01-31 Magic Leap, Inc. Eclipse cursor for virtual content in mixed reality displays
US10769399B2 (en) * 2018-12-18 2020-09-08 Zebra Technologies Corporation Method for improper product barcode detection

Also Published As

Publication number Publication date
EP3830751A1 (en) 2021-06-09
KR20210035269A (en) 2021-03-31
US20220114807A1 (en) 2022-04-14
WO2020028116A1 (en) 2020-02-06
EP3830751A4 (en) 2022-05-04

Similar Documents

Publication Publication Date Title
CN112602091A (en) Object detection using multiple neural networks trained for different image fields
CN113065558B (en) Lightweight small target detection method combined with attention mechanism
US11195038B2 (en) Device and a method for extracting dynamic information on a scene using a convolutional neural network
Kopsiaftis et al. Vehicle detection and traffic density monitoring from very high resolution satellite video data
US10929955B2 (en) Scene-based nonuniformity correction using a convolutional recurrent neural network
Pei et al. A fast RetinaNet fusion framework for multi-spectral pedestrian detection
JP6509027B2 (en) Object tracking device, optical apparatus, imaging device, control method of object tracking device, program
JP6998554B2 (en) Image generator and image generation method
CN111402130B (en) Data processing method and data processing device
US11908160B2 (en) Method and apparatus for context-embedding and region-based object detection
CN111222395A (en) Target detection method and device and electronic equipment
CN114556268B (en) Gesture recognition method and device and storage medium
Bu et al. Pedestrian planar LiDAR pose (PPLP) network for oriented pedestrian detection based on planar LiDAR and monocular images
CN112639819A (en) Object detection using multiple sensors and reduced complexity neural networks
US11804026B2 (en) Device and a method for processing data sequences using a convolutional neural network
Lyu et al. Road segmentation using CNN with GRU
CN115239581A (en) Image processing method and related device
Zhang et al. Efficient object detection method based on aerial optical sensors for remote sensing
Zuo et al. Accurate depth estimation from a hybrid event-RGB stereo setup
CN102044079B (en) Apparatus and method for tracking image patch in consideration of scale
Zhang et al. CE-RetinaNet: A channel enhancement method for infrared wildlife detection in UAV images
Umamaheswaran et al. Stereo vision based speed estimation for autonomous driving
CN116433712A (en) Fusion tracking method and device based on pre-fusion of multi-sensor time sequence sensing results
CN114612999A (en) Target behavior classification method, storage medium and terminal
CN114842012B (en) Medical image small target detection method and device based on position awareness U-shaped network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination