CN113762304B - Image processing method, image processing device and electronic equipment - Google Patents

Image processing method, image processing device and electronic equipment

Info

Publication number
CN113762304B
CN113762304B (application CN202011351182.9A)
Authority
CN
China
Prior art keywords
network
training data
branch
category
incremental learning
Prior art date
Legal status
Active
Application number
CN202011351182.9A
Other languages
Chinese (zh)
Other versions
CN113762304A (en)
Inventor
刘浩
徐卓然
董博
Current Assignee
Beijing Jingdong Qianshi Technology Co Ltd
Original Assignee
Beijing Jingdong Qianshi Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Jingdong Qianshi Technology Co Ltd
Priority to CN202011351182.9A
Publication of CN113762304A
Application granted
Publication of CN113762304B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure provides an image processing method, an image processing apparatus, and an electronic device. The image processing method comprises the following steps: acquiring an input image; and processing the input image with an incremental learning network to determine an image recognition result. The incremental learning network comprises a backbone network and at least two branch networks, wherein each of the at least two branch networks corresponds to a different specified category, and the backbone network together with each branch network forms a classification network for that branch's specified category. The output of the backbone network serves as the input of each of the at least two branch networks, and the branch network is the minimum increment unit of the incremental learning network.

Description

Image processing method, image processing device and electronic equipment
Technical Field
The present disclosure relates to the field of computer technology, and more particularly, to an image processing method, an image processing apparatus, and an electronic device.
Background
Deep learning techniques have made tremendous progress in fields such as object classification, text processing, recommendation engines, image search, face recognition, age recognition, speech recognition, human-machine conversation, and affective computing.
In implementing the concepts of the present disclosure, the inventors found at least the following problem in the prior art: when a network must learn a new class, it is difficult for the related art to balance the plasticity, stability, and performance of the network.
Disclosure of Invention
In view of the above, the present disclosure provides an image processing method, an image processing apparatus, and an electronic device that can balance the plasticity, stability, and performance of a network.
One aspect of the present disclosure provides an image processing method including: acquiring an input image; and processing the input image with an incremental learning network to determine an image recognition result. The incremental learning network comprises a backbone network and at least two branch networks, wherein each of the at least two branch networks corresponds to a different specified category, the backbone network together with each of the at least two branch networks forms a classification network for that branch's specified category, the output of the backbone network serves as the input of each of the at least two branch networks, and the branch network is the minimum increment unit of the incremental learning network.
According to an embodiment of the present disclosure, a backbone network comprises at least one sequentially connected backbone module comprising a convolutional layer and at least one of the following layers: a transform reconstruction layer, an activation function layer, and a pooling layer.
According to an embodiment of the present disclosure, a branch network includes a shallow convolutional layer, a global average pooling layer, and a full convolutional layer connected in sequence.
According to embodiments of the present disclosure, the output of the branch network is a binary (two-class) classification result.
According to an embodiment of the present disclosure, an incremental learning network is trained by: for a branch network of a specified class, training data of the specified class is taken as a positive sample, and training data outside the specified class is taken as a negative sample; and training the incremental learning network with positive and/or negative samples.
Training the incremental learning network with positive and/or negative samples, according to embodiments of the present disclosure, includes: if the category of the training data differs from the existing categories of the historical training data, adding a branch network for the category of the training data, taking at least part of the training data as positive samples and at least part of the historical training data as negative samples, and performing model training on the added branch network; and if the category of the training data belongs to an existing category of the historical training data, taking at least part of the training data together with at least part of the historical training data of the same category as positive samples, taking at least part of the historical training data of categories different from the training data as negative samples, and performing model training on the branch network corresponding to the category of the training data.
Training the incremental learning network with positive and/or negative samples, according to embodiments of the present disclosure, includes: if the category of the training data differs from the existing categories of the historical training data, fine-tuning the existing branch networks in the incremental learning network with at least part of the training data as negative samples; and if the category of the training data belongs to an existing category of the historical training data, extracting positive samples and/or negative samples from the training data and the historical training data, and fine-tuning the existing branch networks in the incremental learning network.
According to an embodiment of the present disclosure, the image processing method further includes: if the backbone network has not been trained with the training data of the specified category, unlocking the network parameters of the backbone network, otherwise locking the network parameters of the backbone network; and/or unlocking the network parameters of the backbone network if the number of branch networks of the backbone network is less than a preset number threshold, otherwise locking the network parameters of the backbone network.
According to an embodiment of the present disclosure, the image processing method further includes: determining representative training data based on at least one of training data, historical training data of the same class as the training data, and historical training data of a different class than the training data; constructing a sample library based on the representative training data; and training the incremental learning network with positive and/or negative samples includes: the incremental learning network is trained using positive and/or negative samples in the sample library.
According to embodiments of the present disclosure, the total amount of data of the sample library is related to the hardware performance of the electronic device used to train the incremental learning model.
According to an embodiment of the present disclosure, processing the input image with the incremental learning network to determine the image recognition result includes: acquiring the confidence of each of the at least two branch networks' processing results for the input image; concatenating the confidences of the processing results in the order of the at least two branch networks; and taking the category of the branch network corresponding to the highest confidence as the output of the incremental learning network.
According to an embodiment of the present disclosure, the input image is an image for an automatic driving task.
Another aspect of the present disclosure provides an image processing apparatus including: an image acquisition module for acquiring an input image; and an image processing module for processing the input image with an incremental learning network to determine an image recognition result. The incremental learning network comprises a backbone network and at least two branch networks, wherein each of the at least two branch networks corresponds to a different specified category, the backbone network together with each of the at least two branch networks forms a classification network for that branch's specified category, the output of the backbone network serves as the input of each of the at least two branch networks, and the branch network is the minimum increment unit of the incremental learning network.
Another aspect of the present disclosure provides an electronic device comprising one or more processors and a storage device for storing executable instructions that, when executed by the processors, implement the method as described above.
Another aspect of the present disclosure provides a computer-readable storage medium storing computer-executable instructions that, when executed, are configured to implement a method as described above.
Another aspect of the present disclosure provides a computer program comprising computer executable instructions which when executed are for implementing a method as described above.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent from the following description of embodiments thereof with reference to the accompanying drawings in which:
FIG. 1 is a schematic diagram of a convolutional neural network and its operation in the related art;
FIG. 2 is a schematic diagram of a tree convolutional neural network in the related art;
fig. 3 is an exemplary system architecture to which an image processing method, an image processing apparatus, and an electronic device may be applied according to an embodiment of the present disclosure.
FIG. 4 is a schematic diagram of the architecture of an incremental learning network according to an embodiment of the present disclosure;
Fig. 5 is a schematic structural diagram of a backbone network according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a branched network according to an embodiment of the present disclosure;
FIG. 7 is a schematic diagram of a branched network according to another embodiment of the present disclosure;
FIG. 8 is a flow chart of a training method for an incremental learning network according to an embodiment of the present disclosure;
FIG. 9 is a schematic diagram of model training a branch network according to an embodiment of the present disclosure;
FIG. 10 is a schematic diagram of model training a branch network according to another embodiment of the present disclosure;
FIG. 11 is a flowchart of an image processing method according to an embodiment of the present disclosure;
FIG. 12 is a schematic diagram of a network derivation process according to an embodiment of the present disclosure;
fig. 13 is a block diagram of an image processing apparatus according to an embodiment of the present disclosure; and
fig. 14 is a block diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is only exemplary and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the present disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and/or the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It should be noted that the terms used herein should be construed to have meanings consistent with the context of the present specification and should not be construed in an idealized or overly formal manner.
Where an expression like "at least one of A, B, and C" is used, it should generally be interpreted according to the meaning commonly understood by those skilled in the art (e.g., "a system having at least one of A, B, and C" shall include, but not be limited to, a system having A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together). Where an expression like "at least one of A, B, or C" is used, it should likewise be interpreted according to that ordinary understanding (e.g., "a system having at least one of A, B, or C" would include, but not be limited to, a system having A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together).
The related art may employ a network including convolutional layers for feature extraction, classification, etc., and in order to facilitate understanding of the embodiments of the present disclosure, a convolutional neural network and its operation will be first described by way of example.
Fig. 1 is a schematic diagram of a convolutional neural network and its operation in the related art.
As shown in the upper graph of fig. 1, after an input image is input into a convolutional neural network through an input layer, a category identifier is output after a plurality of processing procedures are sequentially performed. The main components of the convolutional neural network may include a plurality of convolutional layers, a plurality of downsampling layers, and a fully-connected layer. For example, a complete convolutional neural network may consist of a superposition of these three layers. The convolutional neural network as shown in the upper diagram of fig. 1 includes a first hierarchy, a second hierarchy, a third hierarchy, and so on. For example, each hierarchy may include one convolutional layer and one downsampling layer. Thus, the process of each hierarchy may include: the input image is convolved (convolved) and downsampled (sampled).
The convolutional layer is the core layer of the convolutional neural network. In the convolutional layer of a convolutional neural network, a neuron is connected with only a part of the neurons of the adjacent layer. The convolutional layer may apply several convolution kernels, also known as filters, to the input image to extract various types of features of the input image. Each convolution kernel may extract one type of feature. A convolution kernel is typically initialized as a matrix of small random values and learns reasonable weights during the training of the convolutional neural network. As shown in the lower diagram of fig. 1, the result obtained after applying one convolution kernel to an input image is called a feature map, and the number of feature maps equals the number of convolution kernels. Each feature map is composed of a number of neurons arranged in a rectangular manner, and the neurons of the same feature map share weights; the shared weights are the convolution kernel. The feature map output by one level of convolutional layers may be input to the adjacent next level of convolutional layers and processed again to obtain a new feature map.
For example, the convolutional layer may convolve the data of a certain local receptive field of the input image with different convolution kernels, and the convolution result is input to the activation layer, which computes with a corresponding activation function to obtain the feature information of the input image.
A downsampling layer is provided between adjacent convolutional layers and performs a form of subsampling. On the one hand, the downsampling layer reduces the scale of the input image, simplifies the computation, and reduces overfitting to a certain extent; on the other hand, it compresses features to extract the main features of the input image. The downsampling layer reduces the size of each feature map without changing the number of feature maps. For example, an input image of size 12×12 sampled with a 6×6 window yields a 2×2 output image, meaning that 36 pixels of the input image are merged into 1 pixel of the output image. The last downsampling or convolutional layer may be connected to one or more fully connected layers, which connect all extracted features. The output of the fully connected layer is a one-dimensional matrix, that is, a vector.
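To make the size arithmetic above concrete, the following is a minimal sketch, assuming PyTorch (a framework the disclosure does not name), showing a 12×12 input reduced to 2×2 by a 6×6 downsampling window, and a convolution producing one feature map per kernel:

import torch
import torch.nn as nn

x = torch.randn(1, 1, 12, 12)            # (batch, channels, height, width)

pool = nn.MaxPool2d(kernel_size=6)       # stride defaults to the kernel size
print(pool(x).shape)                     # torch.Size([1, 1, 2, 2]): 36 pixels -> 1

conv = nn.Conv2d(1, 8, kernel_size=3, padding=1)  # 8 kernels -> 8 feature maps
print(conv(x).shape)                     # torch.Size([1, 8, 12, 12])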
Given an input data distribution D1, a deep learning model such as the convolutional neural network shown in fig. 1 aims to learn D1 by adjusting the weights of its neurons. When a new data set (with distribution D2) arrives and the model is expected to recognize both data sets simultaneously, the most intuitive way is to train directly on the union of the two data sets, letting the model learn the joint distribution of D1 and D2. In practice, however, D1 is often no longer available by the time D2 is learned, and the weights learned for D1 are modified while learning D2, resulting in a sharp decline in the recognition of D1. This problem may be referred to as "catastrophic forgetting" and reflects the stability and plasticity dilemma.
In order to implement incremental training under the deep learning framework, academia has made some attempts, such as Incremental Classifier and Representation Learning (iCaRL), which usually keep a small number of historical samples and migrate the knowledge of the old model to the new model by way of knowledge distillation. However, the network structure of these methods is generally fixed, i.e., the network structure does not change as the learned categories increase. This characteristic may cause the model's learning accuracy to decrease significantly as categories increase, in contrast to the learning ability of humans. The reason is that when the model is updated, even though distillation keeps training from being biased toward new data, all neurons participate in the update; the new data inevitably changes the weights that have already been learned, and as this process iterates, the model's recognition capability on old data continuously declines.
For example, in tasks such as autopilot, VGG models (Visual Geometry Group Network), residual networks (Residual Neural Network, ResNet for short), lightweight networks (MobileNet), and the like may be used. However, the learning mechanism of these models is batch learning: if a new class needs to be learned, the whole model must be retrained, and this process consumes a great deal of computing power and time, resulting in a long model iteration period and high development cost. In contrast, incremental learning is intended to give the model the ability to learn continuously, i.e., the model can learn new categories while maintaining the ability to identify existing categories.
Fig. 2 is a schematic diagram of a tree convolutional neural network in the related art.
Some leading scholars have explored the use of dynamic network structures to solve this problem, such as the tree convolutional neural network (TreeCNN). As shown in fig. 2, these methods typically use a tree structure to extend the network, which starts with a super-class as the root node and then deepens the network structure according to specific extension rules. There are two non-negligible problems with this approach.
On the one hand, the input order of the categories cannot be controlled. For example, suppose only one category, C1, exists in the current network; when a new category C2 is input to the network, the network extends a branch for C2 from the branch of C1, and so on. Assume there are five categories in total, {C1, C2, C3, C4, C5}; by the time C1 must be distinguished from C5, the network architecture may have been extended very deep, while in practice the model does not need such a deep architecture to distinguish these two categories, which is clearly a problem caused by the algorithm design. As shown in fig. 2, the newly added leaf nodes correspond to the categories sheep and bird; to distinguish a cat from a bird, the leaf nodes for cat, dog, horse, sheep, and bird need to be passed in sequence, resulting in a waste of computing resources. In an autopilot scenario, where hardware computing power is limited, this approach is difficult to use in practice.
On the other hand, for a leaf node, the limited amount of data it sees may cause unseen classes to be easily misidentified, since the node sees only its own corresponding class and the classes of the few other leaf nodes currently involved in training.
The incremental learning network provided by the embodiments of the present disclosure offers a new dynamic network structure that addresses the drawbacks of the fixed network structures in the related art, so that the network branches corresponding to existing categories are maintained (e.g., not updated, or only fine-tuned) while learning new categories, thereby preserving the recognition capability for existing categories. Meanwhile, the network learns new category data by adding new branches. Unlike the depth-growing manner of the tree convolutional neural network in the related art, the growth of the incremental learning network uses a strategy of width expansion, so that each branch can be computed in parallel during network inference, avoiding the performance loss caused by the depth-growing manner. In addition, the embodiments of the present disclosure further provide a strategy for iteratively updating existing categories, so that the existing branch networks can correctly handle new categories, improving recognition accuracy.
Fig. 3 is an exemplary system architecture to which an image processing method, an image processing apparatus, and an electronic device may be applied according to an embodiment of the present disclosure. It should be noted that fig. 3 is only an example of a system architecture to which embodiments of the present disclosure may be applied to assist those skilled in the art in understanding the technical content of the present disclosure, but does not mean that embodiments of the present disclosure may not be used in other devices, systems, environments, or scenarios.
As shown in fig. 3, a system architecture 300 according to this embodiment may include terminal devices 301, 302, 303, a network 304, and a server 305. The network 304 is used as a medium to provide communication links between the terminal devices 301, 302, 303 and the server 305. The network 304 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
A user may interact with the server 305 via the network 304 using the terminal devices 301, 302, 303 to receive or send messages or the like. Various communication client applications, such as a navigation class application, an image processing class application, a shopping class application, a web browser application, a search class application, an instant messaging tool, a mailbox client, social platform software, etc., may be installed on the terminal devices 301, 302, 303, as examples only.
The terminal devices 301, 302, 303 may be a variety of electronic devices with image processing capabilities including, but not limited to, vehicles, smart phones, tablets, laptop and desktop computers, and the like.
The server 305 may be a server providing various services, such as a background management server (by way of example only) for the incremental learning network requested by the terminal devices 301, 302, 303. The background management server may analyze and process the received data, such as requests, and feed back the processing results (for example, topology information, model parameter information, and image recognition results of the incremental learning network) to the terminal devices.
It should be noted that the incremental learning network provided by the embodiments of the present disclosure may be applied to a terminal device or a server, and the training method and the image processing method provided by the embodiments of the present disclosure may be executed by the terminal device 301, 302, 303 or the server 305. The training method and the image processing method provided by the embodiments of the present disclosure may also be performed by a server or a server cluster that is different from the server 305 and is capable of communicating with the terminal devices 301, 302, 303 and/or the server 305.
It should be understood that the number of terminal devices, networks and servers is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
An aspect of the present disclosure provides an image processing method.
The image processing method may include the following operations.
First, an input image is acquired. The input image is then processed using an incremental learning network to determine an image recognition result. Wherein the incremental learning network may comprise: a backbone network and at least two branch networks.
The incremental learning network and the training method thereof are respectively exemplarily described below with reference to fig. 4 to 10.
Fig. 4 is a schematic diagram of the structure of an incremental learning network according to an embodiment of the present disclosure.
As shown in fig. 4, the incremental learning network may include: a backbone network and at least two branch networks. Each of the at least two branch networks corresponds to a different assigned class, and the backbone network together with each of the at least two branch networks forms a classification network for that branch's assigned class. The output of the backbone network serves as the input of each of the at least two branch networks, and the branch network is the minimum increment unit of the incremental learning network. This network topology balances plasticity, stability, and performance.
As shown in fig. 4, the backbone network is connected to branch networks for the categories "cat", "dog", and "car", respectively. When information of a new category needs to be identified, a new branch network can be set up for the new category. Because the branch networks are peers with no ordering among them, the branch networks corresponding to existing categories can be kept while learning a new category, thereby preserving the recognition capability for existing categories. Meanwhile, the network learns new category data by adding new branches.
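As a hedged illustration of this topology, the following is a minimal sketch assuming PyTorch; the names IncrementalLearningNet and add_branch are hypothetical, not the disclosure's API:

import torch.nn as nn

class IncrementalLearningNet(nn.Module):
    """One shared backbone; one binary branch per learned category."""
    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone
        self.branches = nn.ModuleDict()     # category name -> branch network

    def add_branch(self, category: str, branch: nn.Module):
        # the branch network is the minimum increment unit of the network
        self.branches[category] = branch

    def forward(self, x):
        features = self.backbone(x)         # general features, computed once
        # branches are peers over the same features, so they can run in
        # parallel and network depth does not grow with new categories
        return {name: branch(features) for name, branch in self.branches.items()}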
In one embodiment, the backbone network comprises at least one sequentially connected backbone module comprising a convolutional layer and at least one of the following layers: a transform reconstruction layer, an activation function layer, and a pooling layer.
Fig. 5 is a schematic structural diagram of a backbone network according to an embodiment of the present disclosure.
As shown in fig. 5, the backbone network may include a plurality of backbone modules, each of which may have the same or a different structure. For example, a backbone module may include a convolutional layer, a transform reconstruction layer, an activation layer, and a pooling layer connected in sequence.
The transform reconstruction layer (Batch Normalization, BN layer for short) is a layer in the network just like the convolutional layer, the activation layer, and the fully connected layer; for example, the BN layer may precede the activation function layer. Specifically, a preprocessing step (such as normalizing the input) is added before each layer, and to avoid the normalization operation distorting the learned features, the normalized values are transformed and reconstructed. In this way, the features originally learned by a layer can be restored; that is, by introducing learnable parameters, the network can restore the feature distribution that the original network would have learned.
A fully convolutional network (Fully Convolutional Network, FCN for short) converts the fully connected layers of a traditional convolutional neural network (CNN) into individual convolutional layers and outputs a label map.
In this embodiment, the backbone network is mainly used to extract general features from input data, so as to output the general features to the branch network, so that the branch network performs classification prediction based on the general features, and so on.
It should be noted that the backbone network may directly adopt the shallower layers of an already trained model.
For example, the backbone network may use a current mainstream convolutional neural network, such as VGG, ResNet, depthwise separable convolution (Xception), or MobileNet. The lower and middle layers of these models are truncated to serve as the backbone network (e.g., the part before conv4-3 of VGG, or the part before the middle flow of Xception). The backbone network is responsible for extracting the general features of the image. In engineering problems (such as autopilot scenarios), a model pre-trained on ImageNet can be used directly; the corresponding layers are then locked, and only the branch network part is trained.
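A hedged sketch of this setup, assuming torchvision's VGG16; the slice index used to approximate "the part before conv4-3" is an assumption about torchvision's layer ordering:

import torch.nn as nn
from torchvision import models

# Load an ImageNet pre-trained VGG16 and keep only its shallower layers.
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
backbone = nn.Sequential(*list(vgg.features.children())[:21])  # before conv4-3 (assumed index)

# Lock the backbone so only the branch networks are trained.
for p in backbone.parameters():
    p.requires_grad = False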
In one embodiment, the branch network includes a shallow convolutional layer, a global average pooling layer, and a full convolutional layer connected in sequence. In this embodiment, the branch network is the minimum unit of network growth, corresponding to a particular category in the dataset.
Fig. 6 is a schematic diagram of a branched network according to an embodiment of the present disclosure.
As shown in fig. 6, each branch network may include a plurality of branch modules having different structures. For example, a first branch module may sequentially include a convolutional layer, a transform reconstruction layer, and an activation layer; a second branch module may sequentially include a convolutional layer and a transform reconstruction layer; and a third branch module may include a full convolutional layer. It should be noted that the structure of the branch network must correspond to the structure of the backbone network, so that the backbone network and the branch network can be combined into a complete classification network.
In one embodiment, the output of the branch network is a binary (two-class) classification result.
In order to give a smaller network scale good classification capability, the branch network is implemented as a binary classifier: the current class provides the positive samples, and the remaining learned samples serve as negative samples. The branch network can be iteratively updated as the data categories expand; as the number of branches increases, the variety of negative samples seen by each branch grows correspondingly, improving its ability to recognize positive samples.
Fig. 7 is a schematic structural diagram of a branch network according to another embodiment of the present disclosure.
As shown in fig. 7, when VGG is selected as the backbone network, the adapted branch network may include shallow convolutional layers and full convolutional layers. Fig. 7 includes a plurality of shallow convolutional layers, a global average pooling layer, and a plurality of full convolutional layers.

The structure of each of the plurality of shallow convolutional layers may be the same. For example, each uses a 3×3 convolution kernel, 512 channels, a stride of 1, and "same" padding, and may be combined with a BN layer, an activation layer, and a pooling layer as one unit. The structures of the full convolutional layers may differ. For example, the first full convolutional layer uses a 1×1 kernel, 256 channels, stride 1, and "same" padding, and may be combined with a BN layer and an activation layer as one unit. The second full convolutional layer uses a 1×1 kernel, 64 channels, stride 1, and "same" padding, and may be combined with a BN layer and an activation layer as one unit. The third full convolutional layer uses a 1×1 kernel, 2 channels, stride 1, and "same" padding, and may be combined with a BN layer as one unit.
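Read together with the layer spec above, the following is a minimal PyTorch sketch of such a branch; the number of shallow 3×3 units is an assumption, and the pooling that fig. 7 folds into each shallow unit is omitted here for brevity:

import torch.nn as nn

def unit(in_ch, out_ch, k):
    # convolution + BN + activation combined as one unit, "same" padding
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=k, stride=1, padding=k // 2),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class BranchNet(nn.Module):
    def __init__(self, in_ch=512):
        super().__init__()
        self.convs = nn.Sequential(unit(in_ch, 512, 3), unit(512, 512, 3))
        self.gap = nn.AdaptiveAvgPool2d(1)       # global average pooling
        self.head = nn.Sequential(
            unit(512, 256, 1),                   # first full conv layer
            unit(256, 64, 1),                    # second full conv layer
            nn.Conv2d(64, 2, kernel_size=1),     # third full conv layer: conv + BN
            nn.BatchNorm2d(2),
        )

    def forward(self, x):
        x = self.gap(self.convs(x))
        return self.head(x).flatten(1)           # two-class output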
The incremental learning network provided by the embodiment of the disclosure comprises a plurality of branch networks with parallel connection relations, and new category data can be learned by adding new branch networks, so that network branches corresponding to existing categories are maintained when learning new categories, and the recognition capability of the existing categories is maintained. Different from the depth growth mode, the growth of the incremental learning network uses a strategy of width expansion, so that each branch can be calculated in parallel when the network is deduced, and the performance loss caused by the depth growth mode is reduced.
Another aspect of the present disclosure provides a training method for an incremental learning network as shown above.
Fig. 8 is a flowchart of a training method for an incremental learning network according to an embodiment of the present disclosure.
As shown in fig. 8, the training method may include operations S802 to S804.
In operation S802, training data of a specified category is taken as a positive sample and training data outside the specified category is taken as a negative sample for a branch network of the specified category.
In this embodiment, positive and negative samples can be quickly determined from the existing training data in the above manner, and each branch network can be trained with the training data of its own category, which helps improve the accuracy of the model's output.
In operation S804, the incremental learning network is trained with positive and/or negative samples.
In this embodiment, the incremental learning network may be trained with positive or negative samples to determine model parameters for each branch network based on, for example, a back propagation algorithm.
In one embodiment, training the incremental learning network with positive and/or negative samples may include the following operations.
In one aspect, if the class of training data is different from an existing class of historical training data, a branch network is added for the class of training data, and the added branch network is model trained with at least some of the training data as positive samples and at least some of the historical training data as negative samples. This allows new category data to be learned by adding new branch networks without requiring retraining of existing branch networks.
On the other hand, if the category of the training data belongs to an existing category of the historical training data, model training is performed on the branch network corresponding to that category, with at least part of the training data and at least part of the historical training data of the same category together taken as positive samples, and at least part of the historical training data of different categories taken as negative samples. This ensures that each branch network is exposed to the new training data, improving the accuracy of the model's predictions.
In one embodiment, training the incremental learning network with positive and/or negative samples may include the following operations.
In one aspect, if the category of the training data differs from the existing categories of the historical training data, the existing branch networks in the incremental learning network are fine-tuned with at least part of the training data as negative samples. In this way, training data of the new category can fine-tune the existing branch networks, reducing the risk that they misjudge data of the new category; meanwhile, because the negative samples of the existing branch networks keep growing, no performance loss is caused on their corresponding category data, and the accuracy of network prediction is improved.
On the other hand, if the category of training data belongs to an existing category of historical training data, positive and/or negative samples are extracted from the training data and the historical training data, and fine tuning is performed on an existing branch network in the incremental learning network.
Fig. 9 is a schematic diagram of model training of a branch network according to an embodiment of the present disclosure.
As shown in fig. 9, for training data of a new class, a new branch network may be set for the new class. At least a portion of the new class of training data may be taken as a positive sample of the newly added branch network and at least a portion of the new class of training data may be taken as a negative sample of the existing branch network during training. Of course, the historical training data of the existing category can also be used as a negative sample of the newly added branch network.
Fig. 10 is a schematic diagram of model training a branch network according to another embodiment of the present disclosure.
As shown in fig. 10, the category of the new training data is car, and the existing branch networks already include a branch for the car category, so no branch network needs to be added. To ensure that all branch networks are exposed to the new training data, the new training data may be used as positive samples for the car branch, and at least part of the historical training data of non-car categories may be used as its negative samples. The positive and negative samples of the other branch networks are assigned similarly.
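The sample-assignment rule of figs. 9 and 10 can be sketched as follows; this is a hedged sketch in which history, train_branch, and BranchNet are hypothetical names (BranchNet refers to the sketch given after fig. 7):

def learn_category(net, category, new_data, history):
    """history: dict mapping each learned category to its stored samples."""
    if category not in net.branches:
        # new category (fig. 9): add a branch; the new data are its positives
        net.add_branch(category, BranchNet())
        positives = list(new_data)
    else:
        # existing category (fig. 10): merge new data with stored samples
        # of the same category as positives
        positives = list(new_data) + history.get(category, [])
    # historical data of all other categories serve as negatives
    negatives = [s for c, data in history.items() if c != category for s in data]
    train_branch(net.backbone, net.branches[category], positives, negatives)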
In one embodiment, the training method may further include the following operations.
On the one hand, if the backbone network is not trained by the training data of the specified category, unlocking the network parameters of the backbone network, otherwise locking the network parameters of the backbone network.
On the other hand, if the number of branch networks of the backbone network is less than the preset number threshold, unlocking the network parameters of the backbone network, otherwise locking the network parameters of the backbone network.
For example, since the backbone network is used to extract general features of every class, its network parameters become well suited over time, and the training overhead can be reduced by locking the network parameters of the backbone network.
When few categories have been learned, the backbone network needs to participate in the backpropagation of errors during training; as the number of learned categories increases, the backbone network can be locked during training and no longer participate. In engineering problems (such as autopilot scenarios), a model pre-trained on ImageNet can be used directly; the corresponding layers are then locked, and only the branch network part is trained.
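A hedged sketch of this locking policy follows; how the two conditions combine and the threshold value are assumptions, since the disclosure allows either condition alone:

def set_backbone_lock(backbone, seen_category_data: bool,
                      num_branches: int, branch_threshold: int = 10):
    # unlock while the backbone is still generalizing; lock once it has
    # seen the category's data or carries enough branches
    unlock = (not seen_category_data) or (num_branches < branch_threshold)
    for p in backbone.parameters():
        p.requires_grad = unlock   # unlocked parameters join backpropagation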
In one embodiment, the training method may further include the following operations.
First, representative training data is determined based on at least one of training data, historical training data of the same category as the training data, and historical training data of a different category from the training data.
A sample library is then constructed based on the representative training data.
Accordingly, training the incremental learning network with positive and/or negative samples may include: the incremental learning network is trained using positive and/or negative samples in the sample library.
For example, an approach similar to BiC may be taken, with a portion of the data retained as representative samples (representative exemplars) for each class, thereby forming a sample library. Unlike BiC, this portion of data can be recalled when training a branch network, dynamically combining it with the new class's data into one training set. The upper limit of the total data amount of the sample library is N, where N is a hyperparameter whose value depends on the memory size and computing power of the computing platform's hardware. The total amount of data of the sample library is thus related to the hardware performance of the electronic device used to train the incremental learning model.
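A hedged sketch of such a sample library with total capacity N; selecting representatives by random subsampling is an assumption, as the disclosure leaves the selection rule open:

import random

class SampleLibrary:
    def __init__(self, capacity_n: int):
        self.n = capacity_n        # hyperparameter N, bound by hardware limits
        self.store = {}            # category -> representative samples

    def get(self, category):
        return self.store.get(category, [])

    def get_all_except(self, category):
        return [s for c, data in self.store.items() if c != category for s in data]

    def update(self, category, data):
        # keep a representative subset per category so the total stays <= N
        self.store[category] = list(data)
        per_class = max(self.n // len(self.store), 1)
        for c in self.store:
            if len(self.store[c]) > per_class:
                self.store[c] = random.sample(self.store[c], per_class)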
For example, the categories not corresponding to the current branch may be dynamically sampled as negative samples, which ensures that the existing network branches also see the current new data.
The following provides a training method for the incremental learning network in an actual autopilot scenario, taking VGG as the backbone network.
First, the ImageNet pre-training model of VGG is loaded and all layers before conv4-3 are locked as the backbone network.
Then, a category is selected from the dataset and entered into the network.
If the input dataset is of a new class, a branch is added behind the backbone network (parallel to the other branches, if any). The branch network is trained with the current class's data as positive samples and the representative samples of all other classes the network has learned as negative samples. To balance the effect of the imbalance between the numbers of positive and negative samples, a binary focal loss may be used as the loss function, as shown in Equation (1).
L = -α_t (1 - P_t)^γ · log(P_t)    (1)
where y is the ground-truth class and P is the predicted probability of the positive class; α = 0.5 and γ = 1 may be set in the early training phase, and as the samples in the sample library increase, α = 0.25 and γ = 2 in the later training phase.
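A hedged implementation of Equation (1), assuming the standard focal-loss convention P_t = P for positive samples and P_t = 1 - P for negatives, which the disclosure does not spell out:

import torch

def binary_focal_loss(p, y, alpha=0.5, gamma=1.0, eps=1e-7):
    """p: predicted probability of the positive class; y: ground truth in {0, 1}."""
    y = y.float()
    p_t = y * p + (1.0 - y) * (1.0 - p)               # P_t
    alpha_t = y * alpha + (1.0 - y) * (1.0 - alpha)   # alpha_t
    return (-alpha_t * (1.0 - p_t) ** gamma * torch.log(p_t + eps)).mean()

# early training: alpha=0.5, gamma=1; later, as the sample library grows:
# binary_focal_loss(p, y, alpha=0.25, gamma=2.0)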
If the input dataset is of an existing category, the data of the current category is merged with the previously learned representative samples of that category and taken as positive samples; the data of all other categories the network has learned is taken as negative samples, and the branch network is trained. The loss function is again the binary focal loss of Equation (1). After training, the representative sample library of the category is updated.
Then, fine-tuning is performed on all branches of non-current categories. During training, a representative sample of the current category's data is added to the total sample library, and the categories not corresponding to the current branch are dynamically sampled as negative samples, ensuring that the old branches see the current new data.
It should be noted that the activation function may be a rectified linear unit (ReLU). ReLU sets the output of a portion of neurons to 0, thus creating sparsity in the network, reducing the interdependence of parameters, and alleviating the overfitting problem. ReLU(x) = x if x > 0 and ReLU(x) = 0 if x ≤ 0, which plays a role of single-sided inhibition.
Another aspect of the present disclosure provides an image processing method.
Fig. 11 is a flowchart of an image processing method according to an embodiment of the present disclosure.
As shown in fig. 11, the image processing method may include operation S1102 and operation S1104.
In operation S1102, an input image is acquired.
For example, the input image is an image for an automatic driving task. Of course, the input image may also be an image for other tasks or fields, such as various scenes involving category recognition, etc.
In operation S1104, the input image is processed using the incremental learning network to determine an image recognition result.
Wherein the incremental learning network may be trained based on the training method as described above. The topology and network parameters of the incremental learning network, etc., may be as described above and will not be described in detail herein.
In one embodiment, processing the input image with the incremental learning network to determine the image recognition result may include the following operations.
First, the confidence of each of the at least two branch networks' processing results for the input image is acquired.

Then, the confidences of the processing results are concatenated in the order of the at least two branch networks.

Finally, the category of the branch network corresponding to the highest confidence is taken as the output of the incremental learning network.
Fig. 12 is a schematic diagram of a network derivation process according to an embodiment of the present disclosure.
As shown in fig. 12, in the network derivation process, the confidence scores output by all branch networks are concatenated, so that all branches are packed into one model file; after the input image passes through the backbone network, every branch is computed in parallel, reducing the performance loss caused by the growing number of branches. The order of concatenation corresponds to the category represented by each branch. If there are branch networks for n categories, n concatenated confidences are obtained, where n is a positive integer greater than 0. Finally, the category corresponding to the position with the highest confidence is the output of the network (the category judgment). For example, if the confidence of the car branch (confidence 3 in fig. 12) has the highest score, the output of the incremental learning network is car.
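A hedged sketch of this derivation step, reusing the IncrementalLearningNet sketch given earlier; treating output index 1 as the positive-class score is an assumption that follows the two-class branch sketch:

import torch

@torch.no_grad()
def infer(net, image, ordered_categories):
    features = net.backbone(image)                     # shared forward pass
    # one positive-class confidence per branch, in a fixed branch order
    scores = [torch.softmax(net.branches[c](features), dim=1)[:, 1]
              for c in ordered_categories]
    confidences = torch.cat(scores)                    # n concatenated confidences
    return ordered_categories[int(confidences.argmax())]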
Another aspect of the present disclosure provides an image processing apparatus.
Fig. 13 is a block diagram of an image processing apparatus according to an embodiment of the present disclosure.
As shown in fig. 13, the image processing apparatus 1300 may include an image acquisition module 1310 and an image processing module 1320.
Wherein the image acquisition module 1310 is configured to acquire an input image.
The image processing module 1320 is configured to process the input image using the incremental learning network to determine an image recognition result.
For example, the incremental learning network includes a backbone network and at least two branch networks, wherein each of the at least two branch networks corresponds to a different specified category, the backbone network together with each of the at least two branch networks forms a classification network for that branch's specified category, the output of the backbone network serves as the input of each of the at least two branch networks, and the branch network is the minimum increment unit of the incremental learning network.
For example, the incremental learning network is trained based on the training method as shown above.
Any number of the modules, sub-modules, units, and sub-units according to embodiments of the present disclosure, or at least part of the functionality of any number of them, may be implemented in one module. Any one or more of the modules, sub-modules, units, and sub-units according to embodiments of the present disclosure may be split into multiple modules for implementation. Any one or more of the modules, sub-modules, units, and sub-units according to embodiments of the present disclosure may be implemented at least in part as a hardware circuit, such as a field programmable gate array (FPGA), a programmable logic array (PLA), a system-on-chip, a system-on-substrate, a system-in-package, or an application specific integrated circuit (ASIC), or in hardware or firmware in any other reasonable manner of integrating or packaging a circuit, or in any one of, or a suitable combination of, the three implementations of software, hardware, and firmware. Alternatively, one or more of the modules, sub-modules, units, and sub-units according to embodiments of the present disclosure may be at least partially implemented as computer program modules which, when executed, may perform the corresponding functions.
For example, any number of the image acquisition module 1310 and the image processing module 1320 may be combined and implemented in one module, or any one of them may be split into multiple modules. Alternatively, at least part of the functionality of one or more of these modules may be combined with at least part of the functionality of other modules and implemented in one module. According to embodiments of the present disclosure, at least one of the image acquisition module 1310 and the image processing module 1320 may be implemented at least in part as a hardware circuit, such as a field programmable gate array (FPGA), a programmable logic array (PLA), a system-on-chip, a system-on-substrate, a system-in-package, or an application specific integrated circuit (ASIC), or in hardware or firmware in any other reasonable manner of integrating or packaging a circuit, or in any one of, or a suitable combination of, the three implementations of software, hardware, and firmware. Alternatively, at least one of the image acquisition module 1310 and the image processing module 1320 may be at least partially implemented as a computer program module which, when executed, performs the corresponding functions.
Fig. 14 is a block diagram of an electronic device according to an embodiment of the present disclosure. The electronic device shown in fig. 14 is merely an example, and should not impose any limitations on the functionality and scope of use of embodiments of the present disclosure.
As shown in fig. 14, an electronic device 1400 according to an embodiment of the present disclosure includes a processor 1401 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 1402 or a program loaded from a storage section 1408 into a Random Access Memory (RAM) 1403. The processor 1401 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or an associated chipset and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), or the like. The processor 1401 may also include on-board memory for caching purposes. The processor 1401 may include a single processing unit or a plurality of processing units for performing different actions of the method flows according to embodiments of the present disclosure.
In the RAM 1403, various programs and data necessary for the operation of the system 1400 are stored. The processor 1401, ROM 1402, and RAM 1403 are connected to each other through a bus 1404. The processor 1401 performs various operations of the method flow according to the embodiment of the present disclosure by executing programs in the ROM 1402 and/or the RAM 1403. Note that the program may be stored in one or more memories other than the ROM 1402 and the RAM 1403. The processor 1401 may also perform various operations of the method flow according to embodiments of the present disclosure by executing programs stored in the one or more memories.
According to an embodiment of the present disclosure, the system 1400 may also include an input/output (I/O) interface 1405, which is also connected to the bus 1404. The system 1400 may also include one or more of the following components connected to the I/O interface 1405: an input section 1406 including a keyboard, a mouse, and the like; an output section 1407 including a cathode ray tube (CRT) or liquid crystal display (LCD), a speaker, and the like; a storage section 1408 including a hard disk or the like; and a communication section 1409 including a network interface card such as a LAN card or a modem. The communication section 1409 performs communication processing via a network such as the Internet. A drive 1410 is also connected to the I/O interface 1405 as needed. A removable medium 1411, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 1410 as needed, so that a computer program read therefrom is installed into the storage section 1408 as needed.
According to embodiments of the present disclosure, the method flows described above may be implemented as a computer software program. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer-readable storage medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section 1409 and/or installed from the removable medium 1411. When the computer program is executed by the processor 1401, the above-described functions defined in the system of the embodiments of the present disclosure are performed. According to embodiments of the present disclosure, the systems, devices, apparatuses, modules, units, and the like described above may be implemented by computer program modules.
The present disclosure also provides a computer-readable storage medium that may be embodied in the apparatus/device/system described in the above embodiments; or may exist alone without being assembled into the apparatus/device/system. The computer-readable storage medium carries one or more programs which, when executed, implement methods in accordance with embodiments of the present disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example, but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the present disclosure, the computer-readable storage medium may include ROM 1402 and/or RAM 1403 described above and/or one or more memories other than ROM 1402 and RAM 1403.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that the features recited in the various embodiments of the present disclosure and/or in the claims may be combined and/or integrated in various ways, even if such combinations or integrations are not explicitly recited in the disclosure. In particular, the features recited in the various embodiments and/or claims of the present disclosure may be combined and/or integrated in various ways without departing from the spirit and teachings of the present disclosure. All such combinations and/or integrations fall within the scope of the present disclosure.
The embodiments of the present disclosure have been described above. However, these embodiments are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described above separately, this does not mean that the measures in the respective embodiments cannot be advantageously used in combination. The scope of the present disclosure is defined by the appended claims and their equivalents. Various alternatives and modifications can be made by those skilled in the art without departing from the scope of the present disclosure, and all such alternatives and modifications are intended to fall within the scope of the present disclosure.

Claims (15)

1. An image processing method, comprising:
acquiring an input image; and
processing the input image using an incremental learning network to determine an image recognition result,
wherein dynamic growth of the incremental learning network uses a width-expansion strategy, the incremental learning network comprising: a backbone network and at least two branch networks capable of parallel computation, wherein each of the at least two branch networks corresponds to a different specified category, the backbone network and each of the at least two branch networks form a classification network for the specified category, the output of the backbone network is used as the respective input of each of the at least two branch networks, the branch network is the minimum increment unit of the incremental learning network, a new category is learned by adding a new branch network, and the branch networks corresponding to existing categories are kept when the new category is learned.
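To make claim 1's width-expansion strategy concrete, the following is a minimal sketch of such a network, assuming PyTorch; the class and method names (IncrementalNet, add_branch) are illustrative choices, not taken from the patent.

import torch
import torch.nn as nn

class IncrementalNet(nn.Module):
    # A shared backbone whose output feeds every per-category branch (claim 1).
    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone          # shared feature extractor
        self.branches = nn.ModuleList()   # one branch per learned category

    def add_branch(self, branch: nn.Module) -> None:
        # Width expansion: a new category is learned by appending a new
        # branch; branches for existing categories are left untouched.
        self.branches.append(branch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = self.backbone(x)       # one backbone pass, reused by all branches
        # Branches are independent given the features, so they can run in
        # parallel; assumes at least one branch has been added.
        return torch.stack([branch(features) for branch in self.branches], dim=1)

Each appended branch is the minimum increment unit: growing the network never rewrites an existing branch, which is what protects the previously learned categories.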
2. The method of claim 1, wherein the backbone network comprises at least one sequentially connected backbone module comprising a convolutional layer and at least one of: a transform reconstruction layer, an activation function layer, and a pooling layer.
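A sketch of claim 2's backbone module in the same setting; reading the "transform reconstruction layer" as batch normalization is our assumption, and the kernel and channel sizes are arbitrary.

import torch.nn as nn

def make_backbone_module(in_ch: int, out_ch: int) -> nn.Sequential:
    # A convolutional layer plus the optional layers that claim 2 lists.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),       # assumed reading of "transform reconstruction layer"
        nn.ReLU(inplace=True),        # activation function layer
        nn.MaxPool2d(kernel_size=2),  # pooling layer
    )

# "At least one sequentially connected backbone module":
backbone = nn.Sequential(make_backbone_module(3, 32), make_backbone_module(32, 64))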
3. The method of claim 1, wherein the branch network comprises a latent convolutional layer, a global average pooling layer, and a full convolutional layer connected in sequence.
4. The method of claim 1, wherein the output of the branch network is a binary classification result.
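A matching sketch of the branch of claims 3 and 4; treating the final "full convolutional layer" as a 1x1 convolution acting as the classification head is an assumption on our part, and its two output logits give claim 4's binary result.

import torch
import torch.nn as nn

class Branch(nn.Module):
    def __init__(self, in_ch: int, hidden_ch: int = 128):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, hidden_ch, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(1)                 # global average pooling
        self.head = nn.Conv2d(hidden_ch, 2, kernel_size=1)  # two logits: in / out of category

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.pool(torch.relu(self.conv(x)))
        return self.head(x).flatten(1)                      # shape (batch, 2)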
5. The method of claim 1, wherein the incremental learning network is trained by:
for a branch network of a specified category, taking training data of the specified category as positive samples and training data outside the specified category as negative samples; and
training the incremental learning network using the positive samples and/or the negative samples.
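Claim 5's labeling rule in the same PyTorch setting; make_binary_labels is an illustrative helper name, not the patent's.

import torch
import torch.nn.functional as F

def make_binary_labels(labels: torch.Tensor, category: int) -> torch.Tensor:
    # For the branch of `category`: samples of that category are positives
    # (label 1), all other samples are negatives (label 0).
    return (labels == category).long()

# Usage with two-way branch logits of shape (batch, 2):
# loss = F.cross_entropy(logits, make_binary_labels(labels, category=3))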
6. The method of claim 5, wherein the training the incremental learning network using the positive samples and/or the negative samples comprises:
if the category of the training data is different from the existing categories of the historical training data, adding a branch network for the category of the training data, taking at least part of the training data as positive samples and at least part of the historical training data as negative samples, and performing model training on the added branch network; and
if the category of the training data belongs to the existing categories of the historical training data, taking at least part of the training data together with at least part of the historical training data of the same category as positive samples, taking at least part of the historical training data of categories different from that of the training data as negative samples, and performing model training on the branch network corresponding to the category of the training data.
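One way claim 6's two cases could be wired together, reusing the illustrative IncrementalNet, Branch, and make_binary_labels sketches above; the optimizer, learning rate, channel count, and the mixing of current and historical data into one batch are all simplifications of ours, not the patent's.

import torch
import torch.nn.functional as F

def train_increment(net, images, labels, category, known_categories, feat_ch=64):
    # Case 1 (new category): grow the network by one branch, then train it.
    if category not in known_categories:
        net.add_branch(Branch(feat_ch))
        known_categories.append(category)
    # Case 2 (existing category) falls through to the same update, applied
    # to the branch that already owns this category.
    branch = net.branches[known_categories.index(category)]
    optimizer = torch.optim.SGD(branch.parameters(), lr=1e-3)
    # `images`/`labels` are assumed to mix current training data (positives)
    # with historical data serving as negatives.
    logits = branch(net.backbone(images))
    loss = F.cross_entropy(logits, make_binary_labels(labels, category))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()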
7. The method of claim 5, wherein the training the incremental learning network using the positive samples and/or the negative samples comprises:
if the category of the training data is different from the existing categories of the historical training data, fine-tuning the existing branch networks in the incremental learning network by taking at least part of the training data as negative samples; and
if the category of the training data belongs to the existing categories of the historical training data, extracting positive samples and/or negative samples from the training data and the historical training data, and fine-tuning the existing branch networks in the incremental learning network.
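Claim 7's first case, sketched: images of a category the network has never seen act purely as negatives when fine-tuning every existing branch, nudging each branch to reject the newcomer. Detaching the backbone features and the small learning rate are our illustrative choices.

import torch
import torch.nn.functional as F

def finetune_existing_branches(net, images, labels, known_categories, lr=1e-4):
    features = net.backbone(images).detach()    # backbone left untouched here
    for idx, category in enumerate(known_categories):
        branch = net.branches[idx]
        optimizer = torch.optim.SGD(branch.parameters(), lr=lr)  # small lr: fine-tune
        targets = (labels == category).long()   # all zeros when the batch is a new category
        loss = F.cross_entropy(branch(features), targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()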
8. The method of claim 5, further comprising:
unlocking the network parameters of the backbone network if the backbone network has not been trained with the training data of the specified category, and otherwise locking the network parameters of the backbone network;
and/or
unlocking the network parameters of the backbone network if the number of branch networks of the backbone network is less than a preset number threshold, and otherwise locking the network parameters of the backbone network.
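In the PyTorch setting sketched above, claim 8's locking and unlocking amounts to toggling requires_grad on the backbone parameters; the helper name and the threshold value below are illustrative.

import torch.nn as nn

def set_backbone_locked(net: nn.Module, locked: bool) -> None:
    # Freeze (lock) or unfreeze (unlock) the shared backbone; when to lock
    # is the caller's policy, per either condition of claim 8.
    for param in net.backbone.parameters():
        param.requires_grad = not locked

BRANCH_THRESHOLD = 10  # illustrative preset number threshold
# e.g. lock once enough branches are attached:
# set_backbone_locked(net, locked=len(net.branches) >= BRANCH_THRESHOLD)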
9. The method of claim 5, further comprising:
determining representative training data based on at least one of the training data, historical training data of the same category as the training data, and historical training data of a category different from that of the training data;
constructing a sample library based on the representative training data; and
wherein the training the incremental learning network using the positive samples and/or the negative samples comprises: training the incremental learning network using positive samples and/or negative samples in the sample library.
10. The method of claim 9, wherein the total amount of data in the sample library is related to the hardware performance of the electronic device used to train the incremental learning network.
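Claims 9 and 10 together suggest a bounded store of representative samples. In the sketch below, random subsampling stands in for the "representative" selection rule, which the claims leave open, and the per-category cap is an arbitrary value that a real system would size to its training hardware.

import random

class SampleLibrary:
    # category -> bounded list of retained samples
    def __init__(self, per_category: int = 200):  # cap sized to hardware in practice
        self.per_category = per_category
        self.pool = {}

    def add(self, category, samples):
        kept = self.pool.setdefault(category, [])
        kept.extend(samples)
        if len(kept) > self.per_category:
            # Stand-in for "representative" selection.
            self.pool[category] = random.sample(kept, self.per_category)

    def positives_for(self, category):
        return list(self.pool.get(category, []))

    def negatives_for(self, category):
        # Everything retained for other categories serves as negatives.
        return [s for c, kept in self.pool.items() if c != category for s in kept]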
11. The method of claim 1, wherein the processing the input image using an incremental learning network to determine an image recognition result comprises:
acquiring confidences of the processing results of the at least two branch networks for the input image respectively;
splicing the confidences of the processing results in sequence according to the ordering of the at least two branch networks; and
taking the category of the branch network corresponding to the highest confidence as the output of the incremental learning network.
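Claim 11's inference rule under the same assumptions: collect each branch's positive-class confidence, splice them in branch order, and return the category of the most confident branch.

import torch

@torch.no_grad()
def predict(net, image, known_categories):
    features = net.backbone(image.unsqueeze(0))   # single image -> batch of one
    # One positive-class confidence per branch, spliced in branch order.
    confidences = torch.cat(
        [branch(features).softmax(dim=1)[:, 1] for branch in net.branches]
    )
    return known_categories[int(confidences.argmax())]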
12. The method of claim 1, wherein the input image is an image for an autonomous driving task.
13. An image processing apparatus comprising:
an image acquisition module for acquiring an input image; and
an image processing module for processing the input image using an incremental learning network to determine an image recognition result; wherein dynamic growth of the incremental learning network uses a width-expansion strategy, the incremental learning network comprising: a backbone network and at least two branch networks capable of parallel computation, wherein each of the at least two branch networks corresponds to a different specified category, the backbone network and each branch network form a classification network for the specified category, the output of the backbone network is used as the respective input of each branch network, the branch network is the minimum increment unit of the incremental learning network, a new category is learned by adding a new branch network, and the branch networks corresponding to existing categories are kept when the new category is learned.
14. An electronic device, comprising:
one or more processors;
storage means for storing executable instructions which, when executed by the one or more processors, implement the image processing method according to any one of claims 1 to 12.
15. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, implement the image processing method according to any of claims 1 to 12.
CN202011351182.9A 2020-11-26 2020-11-26 Image processing method, image processing device and electronic equipment Active CN113762304B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011351182.9A CN113762304B (en) 2020-11-26 2020-11-26 Image processing method, image processing device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011351182.9A CN113762304B (en) 2020-11-26 2020-11-26 Image processing method, image processing device and electronic equipment

Publications (2)

Publication Number Publication Date
CN113762304A CN113762304A (en) 2021-12-07
CN113762304B true CN113762304B (en) 2024-02-06

Family

ID=78786085

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011351182.9A Active CN113762304B (en) 2020-11-26 2020-11-26 Image processing method, image processing device and electronic equipment

Country Status (1)

Country Link
CN (1) CN113762304B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114972850B * 2022-05-11 2024-07-12 Tsinghua University Distributed inference method and device for a multi-branch network, electronic device, and storage medium
CN118675221A * 2023-03-14 2024-09-20 Beijing Zitiao Network Technology Co., Ltd. Model construction and object identification methods, devices, equipment, media and products

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11678243B2 (en) * 2019-04-24 2023-06-13 Future Dial, Inc. Enhanced data analytics for actionable improvements based on data collected in wireless and streaming data networks

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107358257A * 2017-07-07 2017-11-17 South China University of Technology Image classification training method capable of incremental learning in big data scenarios
CN109165672A * 2018-07-16 2019-01-08 South China University of Technology Ensemble classification method based on incremental learning
CN109241880A * 2018-08-22 2019-01-18 Beijing Megvii Technology Co., Ltd. Image processing method, image processing apparatus, computer readable storage medium
CN110472545A * 2019-08-06 2019-11-19 North University of China Classification method for aerial images of power components based on knowledge transfer learning
CN111368874A * 2020-01-23 2020-07-03 Tianjin University Image category incremental learning method based on one-class classification technology
CN111340195A * 2020-03-09 2020-06-26 AInnovation (Shanghai) Technology Co., Ltd. Network model training method and device, image processing method and storage medium
CN111488917A * 2020-03-19 2020-08-04 Tianjin University Fine-grained garbage image classification method based on incremental learning
CN111539480A * 2020-04-27 2020-08-14 Shanghai Airdoc Medical Technology Co., Ltd. Multi-class medical image recognition method and equipment
CN111597374A * 2020-07-24 2020-08-28 Tencent Technology (Shenzhen) Co., Ltd. Image classification method and device and electronic equipment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Dual-branch iterative deep incremental image classification method; He Li; Pattern Recognition and Artificial Intelligence; 150-159 *
Multi-cue joint optimization target tracking based on gray-level co-occurrence; Jin Guangzhi; Journal of University of Electronic Science and Technology of China; 252-257 *
A survey of deep learning object tracking algorithms; Li Xi; Zha Yufei; Zhang Tianzhu; Cui Zhen; Zuo Wangmeng; Hou Zhiqiang; Lu Huchuan; Wang Hanzi; Journal of Image and Graphics (12); 5-28 *

Also Published As

Publication number Publication date
CN113762304A (en) 2021-12-07

Similar Documents

Publication Publication Date Title
CN110175671B (en) Neural network construction method, image processing method and device
US12033038B2 (en) Learning data augmentation policies
WO2022001805A1 (en) Neural network distillation method and device
CN110309856A (en) Image classification method, the training method of neural network and device
WO2021244249A1 (en) Classifier training method, system and device, and data processing method, system and device
CN112215332B (en) Searching method, image processing method and device for neural network structure
CN114255361A (en) Neural network model training method, image processing method and device
WO2022228425A1 (en) Model training method and apparatus
WO2021218470A1 (en) Neural network optimization method and device
CN111523640A (en) Training method and device of neural network model
CN113449859A (en) Data processing method and device
CN113762304B (en) Image processing method, image processing device and electronic equipment
WO2023280113A1 (en) Data processing method, training method for neural network model, and apparatus
US20220383036A1 (en) Clustering data using neural networks based on normalized cuts
CN114329029B (en) Object retrieval method, device, equipment and computer storage medium
CN110705600A (en) Cross-correlation entropy based multi-depth learning model fusion method, terminal device and readable storage medium
WO2022156475A1 (en) Neural network model training method and apparatus, and data processing method and apparatus
US20220044109A1 (en) Quantization-aware training of quantized neural networks
WO2023125628A1 (en) Neural network model optimization method and apparatus, and computing device
CN117217280A (en) Neural network model optimization method and device and computing equipment
WO2024175079A1 (en) Model quantization method and related device
CN114723989A (en) Multitask learning method and device and electronic equipment
CN111091198B (en) Data processing method and device
CN116109868A (en) Image classification model construction and small sample image classification method based on lightweight neural network
CN116823586A (en) Remote sensing image processing method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant