CN110506277B - Filter reuse mechanism for constructing robust deep convolutional neural networks - Google Patents

Filter reuse mechanism for constructing robust deep convolutional neural networks

Info

Publication number
CN110506277B
CN110506277B (application CN201780089497.0A)
Authority
CN
China
Prior art keywords
convolutional layer
filter
convolutional
feature map
subsequent
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201780089497.0A
Other languages
Chinese (zh)
Other versions
CN110506277A (en)
Inventor
姜晓恒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nokia Technologies Oy
Original Assignee
Nokia Technologies Oy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nokia Technologies Oy filed Critical Nokia Technologies Oy
Publication of CN110506277A publication Critical patent/CN110506277A/en
Application granted granted Critical
Publication of CN110506277B publication Critical patent/CN110506277B/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133Distances to prototypes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

An apparatus and method, the method comprising: generating a feature map (406) for a first convolutional layer of the convolutional neural network based on the region of the image to be evaluated and a learned filter from the first convolutional layer; generating a feature map (408) for one or more subsequent convolutional layers of the convolutional neural network based on the feature map of the previous convolutional layer, the learned filter for the previous convolutional layer, and the learned filter for the subsequent convolutional layer; and detecting the presence of an object of interest in the region of the image based on the generated feature maps of the first convolution layer and one or more subsequent convolution layers (410).

Description

Filter reuse mechanism for constructing robust deep convolutional neural networks
Technical Field
The present disclosure relates to neural networks, and more particularly, to filtering mechanisms for convolutional neural networks.
Background
Object recognition is an important component in the field of computer vision. Deep Convolutional Neural Networks (CNNs) have been used to facilitate object recognition over the past few years. The strength of deep convolutional neural networks lies in the fact that they are able to learn a hierarchy of features. An example of a CNN architecture is described in G. Huang, Z. Liu, K. Q. Weinberger: Densely Connected Convolutional Networks, CoRR, abs/1608.06993 (2016) (hereinafter "Huang"). In Huang, a CNN architecture is proposed that introduces direct connections between all layers within a block of the neural network. That is, within one block, each layer is directly connected to every other layer in a feed-forward manner. A block typically includes several layers without downsampling operations. For each layer, all feature maps of the previous layers are treated as separate inputs, while its own feature maps are passed on as inputs to all subsequent layers. The core idea is to reuse the feature maps generated in previous layers. However, these reused feature maps do not contribute new information to the neural network.
Disclosure of Invention
Accordingly, the present disclosure provides an apparatus and method to generate a feature map for a first convolutional layer of a convolutional neural network based on an area of an image to be evaluated and a learned filter from the first convolutional layer, generate a feature map for one or more subsequent convolutional layers of the convolutional neural network after the first convolutional layer, and detect a presence of an object of interest in the area of the image based on the generated feature maps of the first and one or more subsequent convolutional layers. Each subsequent convolution layer is generated based on the feature map of the previous convolution layer, the learned filter for the previous convolution layer, and the learned filter for the subsequent convolution layer.
The apparatus and method may be further configured to receive an image captured from the image sensing device and/or initiate an alert if the object is detected. Further, a convolutional neural network may be applied to each region of the image to detect whether an object exists in any of the regions of the image.
The apparatus and method may also be configured to learn a filter for each convolutional layer of the convolutional neural network using one or more training images during a training phase (or period). To learn a filter, the apparatus and method may be configured to initialize the filter for convolutional layers of a convolutional neural network, generate a feature map for each convolutional layer using forward propagation, calculate a loss based on the generated feature map and a score for each class and corresponding label using a loss function, and update the filter for the convolutional layer using backward propagation if the calculated loss has decreased. Each subsequent convolution layer after the first convolution layer is generated based on the feature map of the previous convolution layer, the learned filter for the previous convolution layer, and the filter for the subsequent convolution layer. The apparatus and method may be configured to repeat the operations of computing the feature map, computing the loss, and updating the filter until the convolutional neural network converges when the computed loss is no longer decreasing.
In the apparatus and method, two feature maps may be generated for each of the one or more subsequent convolutional layers. Further, the operations of generating a feature map for the first convolutional layer, generating a feature map for the one or more subsequent convolutional layers, and detecting the presence of an object are performed in a test phase.
To detect the presence of an object in a region of an image, the apparatus and method may be configured to obtain a score for the region from an application of a convolutional neural network, and compare the score for the region to a value of a threshold. If the score for the region is greater than the value of the threshold, then the object is detected in the region.
Drawings
A description of various example embodiments is illustrated in conjunction with the accompanying drawings, in which:
FIG. 1 illustrates a block diagram of an example system for detecting the presence or absence of an object using a Convolutional Neural Network (CNN) with filter reuse (or sharing), in accordance with an embodiment of the present disclosure;
FIG. 2 illustrates a block diagram of an example system for detecting the presence or absence of an object using a Convolutional Neural Network (CNN) with filter reuse (or sharing) in accordance with another embodiment of the present disclosure;
FIG. 3 is an example architecture of a convolutional neural network reusing filters from a previous convolutional layer in a subsequent convolutional layer in accordance with an embodiment of the present disclosure;
FIG. 4 is a flow chart illustrating an example process by which a system such as, for example, the system of FIG. 1 or FIG. 2 is configured to implement training and/or testing phases using a convolutional neural network;
FIG. 5 is a flow chart illustrating an example process by which a system, such as, for example, the system in FIG. 1 or FIG. 2, is configured to implement a training phase for training a convolutional neural network, in accordance with an embodiment of the present disclosure;
FIG. 6 is a flow chart illustrating an example process by which a system, such as, for example, the system in FIG. 1 or FIG. 2, is configured to implement a test phase for evaluating an image or region thereof using a trained convolutional neural network, in accordance with an embodiment of the present disclosure; and
FIG. 7 is a flow chart illustrating an example detection process by which a system, such as, for example, the system in FIG. 1 or FIG. 2, is configured to detect the presence (or absence) of a feature, such as an object, using a convolutional neural network, according to an example embodiment of the present disclosure.
Detailed Description
According to various example embodiments, an apparatus and method are provided that employ a deep Convolutional Neural Network (CNN) with a filter reuse mechanism to analyze an image or region thereof and detect the presence (or absence) of object(s) of interest. The CNN is configured to reuse filters from previous (e.g., preceding or earlier) convolutional layers to compute feature maps in subsequent convolutional layers. In this way, the filters may be fully used or shared such that the capability of the feature representation is significantly enhanced, thereby significantly improving the recognition accuracy of the resulting deep CNN. In contrast to approaches that simply reuse previous feature maps, the present CNN method with filter reuse utilizes information (e.g., filters) obtained from previous convolutional layers to generate new information (e.g., feature maps) in the current convolutional layer. Furthermore, the architecture of such a CNN may reduce the number of parameters, since each current convolutional layer reuses the filters of the previous convolutional layer. Thus, this configuration can mitigate the over-fitting problem caused by using too many parameters.
The apparatus and methods of the present disclosure may be employed in an object recognition system, such as, for example, a video surveillance system employing a camera or other sensor. For example, the camera may capture several multi-view images of the same scene, such as 360 degree images. The task of video surveillance is to detect one or more objects of interest (e.g., pedestrians, animals, or other objects) from a multi-view image and then provide an alert or notification (e.g., an alarm or warning) to the user. Because the camera system may be provided to capture 360 degree images, the video surveillance system may potentially detect all objects of interest present in a scene or environment. In such a monitoring system, each camera (or camera subsystem) may be configured to perform object detection. For example, operation of a video surveillance system using CNNs with filter reuse may involve the following. Each camera of the system captures an image. For each region of the captured image, a CNN with filter reuse may be employed, for example, to classify the region as an object of interest if the response of the CNN is greater than a predefined threshold, and to classify the region as a background (e.g., a non-object) if the response of the CNN is equal to or less than the threshold.
As described herein, the object detection process may involve a training phase and a testing phase. The goal of the training phase is to design or configure the structure of the CNN with filter reuse and learn the parameters of the CNN (i.e., the filters). During the training phase, the CNN is trained to detect the presence (or absence) of particular object(s) using training images as input. For example, backward propagation may be used to learn or configure parameters of the CNN (such as filters) to detect the presence (or absence) of an object. The training images may include example images of the object(s) of interest, example images of background(s), and other aspects that may be present in the images. In the test phase, the trained CNN with filter reuse may be applied to an image to be tested (e.g., an input image or a test image) to detect the presence (or absence) of particular object(s). With the structure and parameters of the trained deep CNN, the goal of the test phase is to classify each region of the image by taking the region as input to the trained CNN. The region is classified as either an object of interest or background. For example, if the classification decision is an object of interest, the system generates, for example, an alarm or notification (e.g., an alarm signal in the form of voice or a message), which may be immediately sent to the user via a network connection (e.g., the internet) or other medium. These operations, which are implemented during object detection, may be performed in each camera or camera subsystem of the monitoring system. An alert may be generated once one of the cameras in the system detects an object of interest. The object detection process may be implemented in or with each camera or each camera subsystem. Examples of the CNN with filter reuse and object detection systems are described in further detail below with reference to the accompanying drawings.
Fig. 1 illustrates a block diagram of example components of a system 100 for detecting the presence (or absence) of an object of interest using a Convolutional Neural Network (CNN) of reuse or shared filters. As shown in fig. 1, system 100 includes one or more processors 110, a plurality of sensors 120, a user interface(s) 130, a memory 140, a communication interface(s) 150, a power supply 160, and an output device(s) 170. The power supply 160 may comprise a battery powered unit, which may be rechargeable, or may be a unit providing a connection to an external power source.
The sensor 120 is configured to sense or monitor activity (e.g., object(s)) in a geographic area or environment, such as around a vehicle, around or within a building, etc. The sensor 120 may include one or more image sensing devices or sensors. For example, the sensor 120 may be a camera with one or more lenses (e.g., a camera, a web camera, a camera system that captures panoramic or 360 degree images, a camera with a wide angle lens or multiple lenses, etc.). The image sensing device is configured to capture an image or image data that may be analyzed using the CNN to detect the presence (or absence) of an object of interest. The captured image or image data may include image frames, video, pictures, and the like. The sensors 120 may also include millimeter wave radar, infrared cameras, lidar (light detection and ranging) sensors, and/or other types of sensors.
The user interface(s) 130 may include a plurality of user input devices through which a user may input information or commands to the system 100. The user interface(s) 130 may include a keypad, touch screen display, microphone, or other user input device through which a user may enter information or commands.
Output device 170 may include a display, speakers, or other device capable of communicating information to a user. The communication interface(s) 150 may include communication circuitry, e.g., a transmitter (TX), a receiver (RX), a transceiver (such as a radio frequency transceiver), etc., for performing wired communication with an external device, such as via a USB or Ethernet cable interface, or for performing wireless communication with an external device, such as, for example, through a wireless personal area network, a wireless local area network, a cellular network, or a wireless wide area network. For example, the communication interface(s) 150 may be used to receive the CNN and its parameters or updates thereof (e.g., learned filters for objects of interest) from an external computing device 180 (e.g., a server, a data center, etc.), transmit alerts or other notifications to the external computing device 180 (e.g., a user's device such as a computer, etc.), and/or interact with the external computing device 180 to implement various operations described herein in a distributed manner, such as the training phase, the testing phase, alert notifications, and/or other operations described herein.
The memory 140 is a data storage device that may store computer-executable code or programs that, when executed by the processor 110, control the operation of the system 100. The memory 140 may also store configuration information for the CNN 142 and its parameters 144 (e.g., learned filters), images 146 (e.g., training images, captured images, etc.), and detection algorithms 148 for implementing various operations described herein, such as training phases, testing phases, alert notifications, and other operations described herein.
The processor 110 is in communication with a memory 140. The processor 110 is a processing system that may include one or more processors, such as CPUs, GPUs, controllers, dedicated circuits, or other processing units that control the operation of the system 100, including the detection operations described in this disclosure (e.g., training phase, testing phase, alert notification, etc.). For example, the processor 110 is configured to train the CNN 142 to detect the presence or absence of an object of interest, e.g., object(s) of interest, background(s), etc., by configuring or learning parameters (e.g., learned filters) using training images, class/label information, and so forth. The processor 110 is also configured to test the captured image(s) or regions thereof using the trained CNN 142 with the learned parameters in order to detect the presence (or absence) of an object in the image or region thereof. The object of interest may comprise a person, such as a pedestrian, an animal, a vehicle, a traffic sign, a road hazard, etc., or other object of interest depending on the intended application. The processor 110 is also configured to initiate an alert or other notification when the presence of an object is detected, such as by outputting the notification using the output device 170 or by notifying the user by transmitting the notification to an external computing device 180 (e.g., the user's device, data center, server, etc.) via the communication interface 150. External computing device 180 may include components similar to those in system 100, such as those shown and described above with reference to fig. 1.
Fig. 2 depicts an example system 200 including a processor(s) 210 and a sensor(s) 220, according to some example embodiments. The system 200 may also include a radio frequency transceiver 250. Further, the system 200 may be installed in a vehicle 20 such as an automobile or truck, but the system may also be used without the vehicle 20. The system 200 may include the same or similar components and functions as provided in the system 100 of fig. 1.
For example, the sensor(s) 220 may include one or more image sensors configured to provide image data such as image frames, video, pictures, and the like. For example, in the case of an advanced driver assistance system/autonomous vehicle, the sensor 220 may include a camera, millimeter wave radar, infrared camera, lidar (light detection and ranging) sensor, and/or other types of sensors.
Processor 210 may include CNN circuitry, which may represent dedicated CNN circuitry configured to implement convolutional neural networks and other operations described herein. Alternatively or additionally, the CNN circuit may be implemented in other ways, such as using at least one memory including program code executed by at least one processing device (e.g., CPU, GPU, controller, etc.).
In some example embodiments, the system 200 may have a training phase. The training phase may configure the CNN circuitry to learn to detect and/or classify one or more objects of interest. The processor 210 may be trained with images including objects such as people, other vehicles, road hazards, and the like. After being trained, when an image includes object(s), the trained CNN, which is implementable via the processor 210, can detect the object(s) and provide an indication of the detection/classification of the object(s). During the training phase, the CNN may learn its configuration (e.g., parameters, weights, etc.). After being trained, the configured CNN may be used during a testing or operational phase to detect and/or classify regions (e.g., tiles or portions) of an unknown input image and thereby determine whether the input image includes an object of interest or includes only background (i.e., does not contain an object of interest).
In some example embodiments, the system 200 may be trained to detect objects such as people, animals, other vehicles, traffic signs, road hazards, and the like. In an Advanced Driver Assistance System (ADAS), when an object such as a vehicle/person is detected, an output such as a warning sound, haptic feedback, an indication of the identified object, or other indication may be generated, for example, to alert or notify the driver. In the case of an autonomous vehicle including system 200, the detected object may signal the control circuitry to take additional actions in the vehicle (e.g., initiate braking, accelerate/decelerate, turn, and/or other actions). Further, the indication may be transmitted to other vehicles, IoT devices or the cloud, a Mobile Edge Computing (MEC) platform, etc., via the radio frequency transceiver 250.
Fig. 3 is an example of a Convolutional Neural Network (CNN) architecture 300 that includes multiple convolutional layers (e.g., layer 1 ... layer L) and a decision layer. The CNN architecture 300 is configured to reuse or share filters from previous convolutional layers in subsequent convolutional layers. For example, in layer 1, the N_1 feature maps C_1 are obtained through the filters W_1. The spatial width and height of C_1 are w_1 and h_1, respectively. In layer 2, the feature maps C_2 are obtained not only through the new filters W_2 but also through the filters W_1 of the previous layer 1. Through the filters W_2, N_21 feature maps are obtained. Through the existing filters W_1, N_22 feature maps are obtained. The N_21 feature maps and the N_22 feature maps are concatenated to form the feature maps C_2 of layer 2. Thus, as shown in fig. 3, the filters W_1 of the previous layer 1 are reused in layer 2. Similarly, the new filters W_3 are used to generate the N_31 feature maps of layer 3, and the filters W_2 obtained in the previous layer 2 are used to generate the N_32 feature maps of layer 3. The N_31 feature maps and the N_32 feature maps are concatenated to form the feature maps C_3 of layer 3. In the same way, the remaining feature maps C_4, C_5, ..., C_L are calculated. The CNN architecture 300 may be employed in a detection process to detect the presence (or absence) of object(s) of interest in a region of an image, or to classify a region of interest of an image. As described herein, the detection process may include a training phase to learn parameters for the CNN using training images, and a testing phase to apply the trained CNN to classify regions of an image and detect the presence (or absence) of an object of interest. Examples of the training phase and the testing phase are described below with reference to the accompanying drawings.
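By way of illustration only, one such filter-reuse layer could be expressed with a PyTorch-style module as sketched below. The class name FilterReuseBlock, the 3×3 kernel size, and the ReLU nonlinearity are illustrative choices rather than details taken from the patent, and the sketch assumes the previous layer's filters expect the same number of input channels as the current layer's input so that they can be applied again here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FilterReuseBlock(nn.Module):
    """Layer l+1 of a filter-reuse CNN (sketch): feature maps are computed
    with this layer's new filters and with the previous layer's filters,
    then concatenated along the channel axis."""

    def __init__(self, in_channels, num_new_filters, prev_conv):
        super().__init__()
        # New filters W_{l+1} introduced by this layer.
        self.new_conv = nn.Conv2d(in_channels, num_new_filters,
                                  kernel_size=3, padding=1)
        # Shared module holding the previous layer's filters W_l (reused here).
        self.prev_conv = prev_conv

    def forward(self, c_prev):
        n_new = F.relu(self.new_conv(c_prev))       # maps from W_{l+1}
        n_reused = F.relu(self.prev_conv(c_prev))   # maps from reused W_l
        return torch.cat([n_new, n_reused], dim=1)  # C_{l+1}
```

Sharing the same prev_conv module between two layers is one natural way to realize the reuse, since both layers then convolve with literally the same learned weights.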
Fig. 4 is a flow chart illustrating an example process 400 by which a system, such as, for example, the system of fig. 1 or 2, is configured to implement a training and/or testing phase using, for example, the convolutional neural network shown in fig. 3. For purposes of illustration, process 400 is discussed below with reference to processor 110 and other components of system 100 in fig. 1, and describes high-level operations performed with respect to a training phase and a testing phase.
At reference numeral 402, the processor 110 is configured to provide a convolutional neural network during a training phase.
At reference numeral 404, the processor 110 is configured to learn parameters, such as filters, for each convolutional layer of the convolutional neural network during a training phase.
At reference numeral 406, the processor 110 is configured to generate, for a first convolutional layer of the convolutional neural network, a feature map based on the region of the image to be evaluated and a learned filter from the first convolutional layer during a test phase.
At reference numeral 408, the processor 110 is configured to generate, during a testing phase, a feature map based on the feature map of a previous convolutional layer, a learned filter for the previous convolutional layer, and a learned filter for the subsequent convolutional layer for one or more subsequent convolutional layers of the convolutional neural network.
At reference numeral 410, the processor 110 is configured to detect the presence (or absence) of an object of interest in a region of an image based on the generated feature maps of the first convolution layer and one or more subsequent convolution layers during a test phase. In the event that an object is detected, the processor may be configured to initiate an alert or other notification to the user or other entity.
Fig. 5 is a flow chart illustrating an example process 500 by which a system, such as, for example, the system in fig. 1 or 2, is configured to implement a training phase for training a CNN with filter reuse (see, e.g., fig. 3). For purposes of illustration, process 500 is discussed below with reference to processor 110 and other components of system 100 in fig. 1, and describes operations performed during the training phase.
At reference numeral 502, a set of training images and their corresponding labels are prepared. For example, if a training image contains an object of interest, its label is set to a number (e.g., 1). If the training image does not contain an object of interest, its label is set to another number (e.g., -1). The set of training images and their corresponding labels is used in the design and configuration of the CNN during the training phase for detecting objects of interest.
At reference numeral 504, the processor 110 implements an initialization operation for parameters of the CNN, such as the filters. For example, the processor 110 initializes the filters (e.g., W_1 ... W_L) for the convolutional layers (e.g., layer 1 ... layer L) of the CNN in fig. 3. The filters may be initialized by using a Gaussian distribution with zero mean and a small variance (e.g., 0.01).
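A minimal sketch of such an initialization, assuming PyTorch-style tensors, is shown below; the helper name, the shape convention, and the reading of 0.01 as the spread of the distribution are assumptions for illustration.

```python
import torch

def init_filters(shapes, scale=0.01):
    """Initialize filters W_1..W_L from a zero-mean Gaussian with a small
    spread (the text above gives 0.01 as an example)."""
    # Each shape is assumed to be (out_channels, in_channels, kernel_h, kernel_w).
    return [torch.randn(*shape) * scale for shape in shapes]
```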
At reference numeral 506, the processor 110 generates (e.g., computes or estimates) feature maps on a convolutional layer-by-layer basis using forward propagation, such as, for example, using training images or regions thereof from the training image set as input. For example, this operation may involve using two filters to compute the feature maps, such as shown in the CNN architecture 300 of fig. 3: one filter from the previous convolutional layer and the other filter from the current convolutional layer. For example, given the filters W_l of layer l and W_{l+1} of layer l+1, the feature map generated in layer l is denoted as N_l. In calculating the feature map of layer l+1, the convolution operation is performed twice. First, the feature map N_{l+1,1} = W_{l+1} ∘ N_l is calculated, where "∘" represents the convolution operation. Next, the feature map N_{l+1,2} = W_l ∘ N_l is calculated. Subsequently, the feature maps N_{l+1,1} and N_{l+1,2} are concatenated to generate the final output N_{l+1} of layer l+1. It should be noted that W_l is also used in layer l to calculate the feature map N_l = W_l ∘ N_{l-1}. Therefore, the filter W_l used in layer l is reused in layer l+1 to generate a new feature map.
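A functional sketch of this layer-by-layer forward pass is given below. It assumes the filters are plain weight tensors whose shapes are chosen so that each W_l can be convolved with both N_{l-1} and N_l (a constraint the sketch does not enforce), and the ReLU nonlinearity and helper name are illustrative additions rather than details from the patent.

```python
import torch
import torch.nn.functional as F

def forward_with_filter_reuse(x, filters):
    """Forward propagation with filter reuse (sketch). `filters` holds
    W_1..W_L; returns the list of feature maps N_1..N_L."""
    n_prev = F.relu(F.conv2d(x, filters[0], padding=1))  # N_1 = W_1 o x
    maps = [n_prev]
    for l in range(1, len(filters)):
        n_new = F.relu(F.conv2d(n_prev, filters[l], padding=1))         # W_{l+1} o N_l
        n_reused = F.relu(F.conv2d(n_prev, filters[l - 1], padding=1))  # W_l o N_l (reuse)
        n_prev = torch.cat([n_new, n_reused], dim=1)                    # concatenate -> N_{l+1}
        maps.append(n_prev)
    return maps
```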
At reference numeral 508, the processor 110 implements a decision layer in which the loss calculation is performed. For example, the processor 110 performs the loss calculation by computing a loss based on the final score for each category and the corresponding label. The loss calculation may be performed using a softmax loss function. An example of a softmax loss function is represented by equation (1) as shown below:

L = -log( exp(y_c) / Σ_j exp(y_j) )    (1)

wherein:
y is a vector representing the scores for all classes, and
y_c is the score of class c.
Instead of a softmax loss function, other functions may also be employed in the decision layer, such as a Support Vector Machine (SVM) loss function or another loss function suitable for use with CNNs. For example, the softmax loss function calculates the cross-entropy loss, while the SVM loss function calculates the hinge loss. For classification tasks, the two functions perform almost identically.
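For illustration, both candidate losses could be computed for a single score vector y and ground-truth class c roughly as follows; this is a sketch assuming PyTorch tensors, not code from the patent.

```python
import torch
import torch.nn.functional as F

def softmax_loss(y, c):
    # Cross-entropy loss of equation (1): -log( exp(y_c) / sum_j exp(y_j) ).
    return -F.log_softmax(y, dim=0)[c]

def svm_hinge_loss(y, c, margin=1.0):
    # Multiclass hinge loss: sum over j != c of max(0, y_j - y_c + margin).
    diffs = torch.clamp(y - y[c] + margin, min=0.0)
    return diffs.sum() - margin  # subtract the j == c term, which equals margin
```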
At reference numeral 510, the processor 110 determines whether the filter of the CNN should be updated based on the calculated loss (e.g., a change in the calculated loss). For example, the processor 110 determines whether the penalty has stopped decreasing or changing, or in other words, whether the CNN is converging. If the loss has stopped decreasing, the processor 110 outputs a filter (e.g., a learned filter) at reference numeral 514 for use in the CNN during the test phase. The output filter may be stored in memory for use with CNNs.
Otherwise, if the loss has not stopped decreasing, the processor 110 updates the filters of the CNN at reference numeral 512. For example, the processor 110 may implement backward propagation (e.g., standard backward propagation or other variants thereof) to update all filters of the CNN. For example, the filters may be updated during backward propagation by the chain rule according to equation (2) as follows:

∂ε/∂W_l = (∂ε/∂N_l) · (∂N_l/∂W_l)    (2)

wherein:
ε represents the loss function, and
∂ε/∂N_l is the gradient propagated from the deeper layers.

Subsequently, the filters are updated as follows:

W_l ← W_l - η · ∂ε/∂W_l

wherein:
η represents an update coefficient (e.g., a learning rate).
Process 500 then continues by repeating the operations in reference numbers 506, 508, and 510 until the calculated loss stops decreasing or changing (or in other words, the CNN converges).
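Putting the pieces together, a training loop following reference numerals 504 through 514 might be sketched as below. It reuses the hypothetical helpers from the earlier sketches and assumes a hypothetical decision_scores() that maps the final feature maps to per-class scores (e.g., via a fully connected decision layer); none of these names come from the patent.

```python
import torch

def train_filters(filters, regions, labels, lr=0.01, tol=1e-6, max_iters=1000):
    for w in filters:
        w.requires_grad_(True)          # enable gradients for the initialized filters (504)
    prev_loss = float("inf")
    for _ in range(max_iters):
        loss = torch.zeros(())
        for x, c in zip(regions, labels):
            feature_maps = forward_with_filter_reuse(x, filters)   # forward propagation (506)
            loss = loss + softmax_loss(decision_scores(feature_maps[-1]), c)  # loss (508)
        if prev_loss - loss.item() < tol:   # loss no longer decreasing: converged (510)
            break
        prev_loss = loss.item()
        loss.backward()                     # backward propagation via the chain rule (512)
        with torch.no_grad():
            for w in filters:
                w -= lr * w.grad            # W_l <- W_l - eta * d(loss)/dW_l
                w.grad.zero_()
    return filters                          # learned filters for the test phase (514)
```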
Fig. 6 is a flow chart illustrating an example process 600 by which a system, such as, for example, the system of fig. 1 or 2, is configured to implement a test phase for evaluating an image or region thereof using a trained CNN with filter reuse (see, e.g., fig. 3). The test phase differs from the training phase in that it does not require updating the filters. Instead, the test phase uses the filters learned in the training phase to classify regions or detect objects. Furthermore, no loss needs to be calculated for the decision layer; the decision layer simply decides which category has the highest score. For purposes of illustration, process 600 is discussed below with reference to processor 110 and other components of system 100 in fig. 1, and describes operations performed during the test phase.
At reference numeral 602, the processor 110 implements a region suggestion operation by determining region(s) of the image that may contain an object of interest (e.g., a target object). For example, one simple method to identify regions of interest for evaluation is to employ a sliding window technique that scans across the input image. Other methods may also be employed.
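A minimal sliding-window enumerator, with illustrative window sizes and stride (none are specified in the patent), might look like the following sketch.

```python
def sliding_windows(image_h, image_w, win_h=128, win_w=64, stride=16):
    """Yield (top, left, height, width) candidate regions over the image."""
    for top in range(0, image_h - win_h + 1, stride):
        for left in range(0, image_w - win_w + 1, stride):
            yield top, left, win_h, win_w
```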
At reference numeral 604, the processor 110 implements feature map generation using the CNN with filter reuse. For example, the processor 110 applies the region of interest of the image to the CNN and generates feature maps on a convolutional layer-by-layer basis using the parameters (e.g., filters) learned during the training phase. The feature map generation process in the test phase may be similar to the process performed in the training phase, such as described above with reference to fig. 5.
At reference numeral 606, the processor 110 implements the decision layer to perform classification of the region or object detection. For example, in the decision layer, the processor 110 may take the score vector y as input and determine which entry (e.g., y_c) has the highest score. This operation outputs the label (e.g., pedestrian) corresponding to the highest score.
As previously discussed, the decision layer may use a softmax loss function or another loss function, such as an SVM loss function. The softmax loss function calculates the cross-entropy loss and the SVM loss function calculates the hinge loss. For classification tasks, the two functions perform almost identically.
Fig. 7 is a flow chart illustrating an example detection process 700 by which a system, such as, for example, the system of fig. 1 or 2, is configured to detect the presence (or absence) of an object of interest using a trained CNN with filter reuse (see, e.g., fig. 3). For purposes of illustration, process 700 is discussed below with reference to processor 110 and other components of system 100 in fig. 1.
At reference numeral 702, the sensor(s) 120 capture image(s). Depending on the application of the detection process 700, images may be captured for different scenes. For example, sensor(s) 120 may be positioned, mounted, or oriented to capture images for a fixed location (e.g., different locations within or around a building or other location) or for a movable location (e.g., the area around a moving vehicle, person, or other system). For example, a camera system, such as a single-lens or multi-lens camera or a camera system that captures panoramic or 360 degree images, may be mounted on a vehicle.
At reference numeral 704, the processor 110 scans each region of the image, such as from the captured image(s).
At reference numeral 706, the processor 110 applies the CNN to each region of the image, such as by implementing a test phase. An example of a test phase is described by process 600 described with reference to fig. 6. As described above, the application of CNN provides a score for a test area of an image.
At reference numeral 708, the processor 110 determines whether the score from the CNN is greater than the value of a threshold.
If the score is not greater than the threshold, at reference numeral 710, the processor 110 does not initiate an alert or notification. Process 700 continues to capture and evaluate images. Otherwise, if the score is greater than the threshold, at reference numeral 712, the processor 110 initiates an alert or notification reflecting the detection of the object of interest or the classification of such object. As previously discussed, examples of objects of interest may include pedestrians, animals, vehicles, traffic signs, road hazards, or other related objects, depending on the intended application for the detection process. The alert or notification may be initiated locally at the system 100 via one of the output devices 170 or transmitted to the external computing device 180. The alert may be provided to the user in the form of a visual or audio notification or other suitable media (e.g., vibration, etc.).
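As an end-to-end illustration, the detection loop of fig. 7 could be sketched as follows, tying together the hypothetical helpers from the earlier sketches; capture_image() and raise_alert() stand in for system-specific camera and notification functionality and are not APIs defined by the patent.

```python
def detect_objects(filters, threshold, object_class=1):
    image = capture_image()                                        # capture image (702)
    for top, left, h, w in sliding_windows(*image.shape[-2:]):     # scan each region (704)
        region = image[..., top:top + h, left:left + w]
        feature_maps = forward_with_filter_reuse(region, filters)  # apply the CNN (706)
        score = decision_scores(feature_maps[-1])[object_class]
        if score > threshold:                                      # compare to threshold (708)
            raise_alert(top, left, h, w)                           # initiate alert (712)
```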
Experimental example
Experimental results on the KITTI dataset show the effectiveness of the present method and system employing a CNN with filter reuse. The KITTI dataset is captured by a pair of cameras. The subset of the KITTI dataset used for pedestrian detection includes 7481 training images and 7518 test images. In the present method and system, the deep CNN may, for example, include L=13 layers. The sizes of the filters W_1, W_2, W_3, W_4, W_5, W_6, W_7, W_8, W_9, W_10, W_11, W_12, and W_13 are 3×3, 3×3×32, 3×3×64, 3×3×64, 3×3×128, 3×3×128, 3×3×128, 3×3×256, 3×3×256, 3×3×256, 3×3×256, 3×3×256, and 3×3×256, respectively. A conventional VGG neural network is compared to an example of the present method and system employing the filter reuse mechanism in the CNN. The average precision (AP) of the present CNN with filter reuse is 60.43%, while the average precision of the conventional VGG neural network is 56.74% (see, e.g., Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)). It can be seen that the present CNN method with filter reuse significantly outperforms the conventional VGG method. That is, the introduction of filter reuse or sharing plays an important role in improving object detection performance. Thus, the present method and system employing a filter reuse mechanism in a CNN can provide significant improvements for the field of object detection and thus video surveillance.
It should be understood that the above-described systems and methods are provided as examples only. Although operations including the training phase, the testing phase, and alert notifications, among other aspects, may be implemented using the system 100 or 200 as described herein, these operations may also be distributed and performed across multiple systems through communication network(s). Furthermore, in addition to standard backward propagation, the training phase may employ other variants of backward propagation, which may be aimed at improving its performance. The training and testing phases may also adopt other suitable loss functions or training strategies. As described herein, the CNN approach that reuses or shares filters may be used for various applications, including but not limited to object detection/recognition in video surveillance systems, autonomous or semi-autonomous vehicles, or ADAS implementations.
It should also be understood that the example embodiments disclosed and taught herein are susceptible to various and alternative forms. Thus, the use of singular terms such as, but not limited to, "a" and the like, is not intended to limit the number of items.
It should be appreciated that the development of a physical, actual commercial application incorporating the disclosed embodiments would require numerous implementation-specific decisions to achieve the developer's ultimate goals for the commercial implementation. Such implementation-specific decisions may include, but are not limited to, compliance with system-related, business-related, government-related, and other constraints, which may vary by specific implementation, location, and over time. While a developer's efforts might be complex and time-consuming in an absolute sense, such efforts would nevertheless be a routine undertaking for those of ordinary skill in the art having the benefit of this disclosure.
Using the description provided herein, example embodiments may be implemented as a machine, process, or article of manufacture by using standard programming and/or engineering techniques to produce programming software, firmware, hardware, or any combination thereof.
Any resulting program(s), having computer-readable program code, may be embodied on one or more computer-usable media such as resident memory devices, smart cards or other removable memory devices or transmission devices, thereby making a computer program product or article of manufacture according to the embodiments. Thus, the terms "article of manufacture" and "computer program product" as used herein are intended to encompass a computer program that exists permanently or temporarily on any computer-usable medium or in any transmitting medium which transmits such a program.
As described above, the memory/storage devices may include, but are not limited to, magnetic disks, solid state drives, optical disks, removable memory devices (such as smart cards, SIMs, WIMs), semiconductor memory (such as RAM, ROM, PROM). Transmission media includes, but is not limited to, transmission via: wireless communication networks, the internet, intranets, telephone/modem-based network communication, hard-wire/cable communication network, satellite communication, and other fixed or mobile network systems/communication links.
While particular embodiments and applications of the present disclosure have been illustrated and described, it is to be understood that the present disclosure is not limited to the precise construction and compositions disclosed herein, and that various modifications, changes, and variations may be apparent from the foregoing descriptions without departing from the invention as defined in the appended claims.

Claims (26)

1. A computer-implemented method, comprising:
configured to generate a feature map for a first convolutional layer of a convolutional neural network based on an area of an image to be evaluated and a learned filter from the first convolutional layer;
configured to generate a feature map for one or more subsequent convolutional layers of the convolutional neural network, each subsequent convolutional layer being generated based on the feature map of a previous convolutional layer, a learned filter for the previous convolutional layer, and a learned filter for the subsequent convolutional layer; and
is configured to detect the presence of an object of interest in the region of the image based on the generated feature maps of the first and the one or more subsequent convolution layers.
2. The method of claim 1, further comprising:
is configured to receive the image captured from the image sensing device.
3. The method according to any one of claims 1 and 2, further comprising:
is configured to learn a filter for each convolutional layer of the convolutional neural network using one or more training images during a training phase.
4. The method of claim 3, wherein the configuring to learn a filter comprises:
configured to initialize a filter for the convolutional layer of the convolutional neural network;
configured to generate a feature map for each convolutional layer using forward propagation, each subsequent convolutional layer after the first convolutional layer being generated based on the feature map of a previous convolutional layer, a learned filter for the previous convolutional layer, and a filter for the subsequent convolutional layer;
configured to calculate a loss using a loss function based on the generated feature map and the score for each category and the corresponding label; and
configured to update the filter for the convolutional layer using backward propagation if the calculated loss has been reduced,
wherein the configuring to calculate a feature map, the configuring to calculate a loss, and the configuring to update the filter are repeated until the convolutional neural network converges when the calculated loss is no longer decreasing.
5. The method of any of claims 1, 2, and 4, wherein two feature maps are generated for each of the one or more subsequent convolutional layers, one of the two feature maps being generated with a learned filter of the previous convolutional layer, the other of the two feature maps being generated with a learned filter of the subsequent convolutional layer.
6. The method of claim 5, further comprising:
the two graph features are concatenated for each of the one or more subsequent convolutional layers.
7. The method of any of claims 1, 2, 4, and 6, wherein the configuring to generate a feature map for a first convolutional layer, the configuring to generate a feature map for one or more subsequent convolutional layers, and the configuring to detect the presence of an object are performed in a test phase.
8. The method of any of claims 1, 2, 4, and 6, wherein the configuring to detect comprises:
configured to obtain a score for the region from an application of the convolutional neural network; and
configured to compare the score for the region to a value of a threshold,
wherein the object is detected in the region if the score for the region is greater than a value of the threshold.
9. The method of any one of claims 1, 2, 4, and 6, further comprising:
configured to initiate an alert if the object is detected.
10. The method of any of claims 1, 2, 4, and 6, wherein the convolutional neural network is applied to each region of the image to detect whether the object is present in any of the regions of the image.
11. A computing device comprising means for performing the method of any one of claims 1 to 10.
12. A computer readable storage medium comprising computer code instructions which, when executed by at least one processor, cause an apparatus to perform at least the method of any one of claims 1 to 10.
13. A computing device, comprising:
memory device
One or more processors configured to:
generating a feature map for a first convolutional layer of the convolutional neural network based on the region of the image to be evaluated and a learned filter from the first convolutional layer;
generating a feature map for one or more subsequent convolutional layers of the convolutional neural network, each subsequent convolutional layer being generated based on the feature map of a previous convolutional layer, a learned filter for the previous convolutional layer, and a learned filter for the subsequent convolutional layer; and
the presence of an object of interest is detected in the region of the image based on the generated feature maps of the first and one or more subsequent convolution layers.
14. The computing apparatus of claim 13, wherein the one or more processors are further configured to receive the image captured from an image sensing device.
15. The computing device of any of claims 13 and 14, wherein the one or more processors are further configured to learn a filter for each convolutional layer of the convolutional neural network using one or more training images during a training phase.
16. The computing device of claim 15, wherein to learn a filter, the one or more processors are configured to:
initializing a filter for the convolutional layer of the convolutional neural network;
generating a feature map for each convolutional layer using forward propagation, each subsequent convolutional layer after the first convolutional layer being generated based on the feature map of a previous convolutional layer, a learned filter for the previous convolutional layer, and a filter for the subsequent convolutional layer;
calculating a loss using a loss function based on the generated feature map and the score for each category and corresponding label; and
using backward propagation to update the filter for the convolutional layer in the event that the calculated loss has been reduced,
wherein the one or more processors are configured to repeat the operations of calculating a feature map, calculating a loss, and updating the filter until the convolutional neural network converges when the calculated loss is no longer reduced.
17. The computing device of any of claims 13, 14, and 16, wherein two feature maps are generated for each of the one or more subsequent convolutional layers, one of the two feature maps being generated with a learned filter of the previous convolutional layer, the other of the two feature maps being generated with a learned filter of the subsequent convolutional layer.
18. The computing device of claim 17, wherein the one or more processors are configured to concatenate the two feature maps for each of the one or more subsequent convolutional layers.
19. The computing device of any of claims 13, 14, and 16, wherein the one or more processors are configured to generate a feature map for a first convolutional layer, generate a feature map for one or more subsequent convolutional layers, and detect the presence of an object in a test phase.
20. The computing device of any of claims 13, 14, and 16, wherein to detect the presence of an object, the one or more processors are configured to:
obtaining a score for the region from an application of the convolutional neural network; and
comparing the score for the region to a threshold value,
wherein the object is detected in the region if the score for the region is greater than a value of the threshold.
21. The computing device of any of claims 13, 14, and 16, wherein the one or more processors are further configured to initiate an alert if the object is detected.
22. The computing device of any of claims 13, 14, and 16, wherein the convolutional neural network is applied to each region of the image to detect whether the object is present in any of the regions of the image.
23. A computer-implemented method, comprising:
configured to initialize a filter for a convolutional layer of a convolutional neural network;
configured to generate a feature map for each convolutional layer using forward propagation, each subsequent convolutional layer after a first convolutional layer being generated based on the feature map of a previous convolutional layer, a learned filter for the previous convolutional layer, and a filter for the subsequent convolutional layer;
configured to calculate a loss using a loss function based on the generated feature map and the score for each category and the corresponding label; and
configured to update the filter for the convolutional layer using backward propagation if the calculated loss has been reduced,
wherein the configuring to calculate a feature map, the configuring to calculate a loss, and the configuring to update the filter are repeated until the convolutional neural network converges when the calculated loss is no longer decreasing.
24. A computing device comprising means for performing the method of claim 23.
25. A computer readable storage medium comprising computer code instructions which, when executed by at least one processor, cause an apparatus to perform at least the method of claim 23.
26. A computing device, comprising:
a memory; and
one or more processors configured to:
initializing a filter for a convolutional layer of a convolutional neural network;
generating a feature map for each convolutional layer using forward propagation, each subsequent convolutional layer after a first convolutional layer being generated based on the feature map of a previous convolutional layer, a learned filter for the previous convolutional layer, and a filter for the subsequent convolutional layer;
calculating a loss using a loss function based on the generated feature map and the score for each category and the corresponding label; and
using backward propagation to update the filter for the convolutional layer in the event that the calculated loss has been reduced,
wherein the one or more processors are configured to repeat the operations of calculating a feature map, calculating a loss, and updating the filter until the convolutional neural network converges when the calculated loss is no longer reduced.
CN201780089497.0A 2017-02-13 2017-02-13 Filter reuse mechanism for constructing robust deep convolutional neural networks Active CN110506277B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/073342 WO2018145308A1 (en) 2017-02-13 2017-02-13 Filter reusing mechanism for constructing robust deep convolutional neural network

Publications (2)

Publication Number Publication Date
CN110506277A CN110506277A (en) 2019-11-26
CN110506277B true CN110506277B (en) 2023-08-08

Family

ID=63107795

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201780089497.0A Active CN110506277B (en) 2017-02-13 2017-02-13 Filter reuse mechanism for constructing robust deep convolutional neural networks

Country Status (2)

Country Link
CN (1) CN110506277B (en)
WO (1) WO2018145308A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10733742B2 (en) 2018-09-26 2020-08-04 International Business Machines Corporation Image labeling
US11176427B2 (en) 2018-09-26 2021-11-16 International Business Machines Corporation Overlapping CNN cache reuse in high resolution and streaming-based deep learning inference engines
US10824947B2 (en) * 2019-01-31 2020-11-03 StradVision, Inc. Learning method for supporting safer autonomous driving without danger of accident by estimating motions of surrounding objects through fusion of information from multiple sources, learning device, testing method and testing device using the same
CN111986199B (en) * 2020-09-11 2024-04-16 征图新视(江苏)科技股份有限公司 Method for detecting surface flaws of wood floor based on unsupervised deep learning
CN113866571A (en) * 2021-08-06 2021-12-31 厦门欧易奇机器人有限公司 Partial discharge source positioning method, device and equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104866900A (en) * 2015-01-29 2015-08-26 北京工业大学 Deconvolution neural network training method
CN105913087A (en) * 2016-04-11 2016-08-31 天津大学 Object identification method based on optimal pooled convolutional neural network
WO2016195496A2 (en) * 2015-06-05 2016-12-08 Universiteit Van Amsterdam Deep receptive field networks

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104616032B (en) * 2015-01-30 2018-02-09 浙江工商大学 Multi-camera system target matching method based on depth convolutional neural networks
CN105718889B (en) * 2016-01-21 2019-07-16 江南大学 Based on GB (2D)2The face personal identification method of PCANet depth convolution model

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104866900A (en) * 2015-01-29 2015-08-26 北京工业大学 Deconvolution neural network training method
WO2016195496A2 (en) * 2015-06-05 2016-12-08 Universiteit Van Amsterdam Deep receptive field networks
CN105913087A (en) * 2016-04-11 2016-08-31 天津大学 Object identification method based on optimal pooled convolutional neural network

Also Published As

Publication number Publication date
CN110506277A (en) 2019-11-26
WO2018145308A1 (en) 2018-08-16

Similar Documents

Publication Publication Date Title
CN110506277B (en) Filter reuse mechanism for constructing robust deep convolutional neural networks
CN107925821B (en) Monitoring
US10970558B2 (en) People flow estimation device, people flow estimation method, and recording medium
US9767570B2 (en) Systems and methods for computer vision background estimation using foreground-aware statistical models
US11379998B2 (en) Detector-tracker architecture
KR102442061B1 (en) Electronic apparatus, alert message providing method of thereof and non-transitory computer readable recording medium
US9811732B2 (en) Systems and methods for object tracking
JP2021036437A (en) Movement situation estimation device, movement situation estimation method and program recording medium
Arceda et al. Fast car crash detection in video
CN110431565B (en) Direct push and/or adaptive maximum boundary zero sample learning method and system
Tung et al. Use of phone sensors to enhance distracted pedestrians’ safety
CN110264495A (en) A kind of method for tracking target and device
CN113168520A (en) Method of tracking objects in a scene
Gite et al. Early anticipation of driver’s maneuver in semiautonomous vehicles using deep learning
CN114495006A (en) Detection method and device for left-behind object and storage medium
JP6448005B2 (en) Visual tracking of objects
CN114758502A (en) Double-vehicle combined track prediction method and device, electronic equipment and automatic driving vehicle
Vorapatratorn et al. Fast obstacle detection system for the blind using depth image and machine learning.
KR101842488B1 (en) Smart monitoring system applied with patten recognition technic based on detection and tracking of long distance-moving object
JP6822571B2 (en) Terminal device, risk prediction method, program
KR101982942B1 (en) Method of tracking object and apparatuses performing the same
US20220114717A1 (en) Distortion-based filtering for image classification
Ramtoula et al. MSL-RAPTOR: A 6DoF Relative Pose Tracker for Onboard Robotic Perception
CN113112525A (en) Target tracking method, network model, and training method, device, and medium thereof
EP2568414A1 (en) Surveillance system and method for detecting behavior of groups of actors

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant