CN117036658A - Image processing method and related equipment


Info

Publication number
CN117036658A
Authority
CN
China
Prior art keywords
image
loss function
training
neural network
encoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210468931.9A
Other languages
Chinese (zh)
Inventor
李傲雪
李震国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202210468931.9A priority Critical patent/CN117036658A/en
Priority to PCT/CN2023/086194 priority patent/WO2023207531A1/en
Publication of CN117036658A publication Critical patent/CN117036658A/en
Pending legal-status Critical Current

Classifications

    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/048 Activation functions
    • G06N3/08 Learning methods
    • G06T7/194 Segmentation; Edge detection involving foreground-background segmentation
    • G06T9/00 Image coding
    • G06T9/002 Image coding using neural networks
    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G06V10/762 Image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • G06V10/764 Image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects


Abstract

The embodiment of the application discloses an image processing method which can be applied to object detection/segmentation scenarios. The method comprises the following steps: acquiring a labeled training image, wherein the training image comprises a foreground object and a background; and training a first neural network based on the training image, a first loss function and a second loss function to obtain a second neural network, wherein the second neural network is used for implementing an image detection/segmentation task, the first loss function is a generative loss function, and the second loss function is a detection/segmentation loss function. Because the first loss function is used for reconstructing the foreground image, the encoder can capture more texture and structural features of the image, which improves the localization capability of small-sample object detection and thereby improves the detection/segmentation performance of the second neural network that includes the encoder.

Description

Image processing method and related equipment
Technical Field
The embodiment of the application relates to the field of artificial intelligence, in particular to an image processing method and related equipment.
Background
Artificial intelligence (AI) refers to the theory, methods, techniques and application systems that use a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use that knowledge to obtain optimal results. In recent years, deep learning techniques have enabled computers to achieve excellent performance in image object detection tasks. One very important factor behind this success is big data, particularly large-scale labeled data. However, manual annotation is costly, and for some tasks large-scale data simply cannot be collected; for example, medical data is difficult to collect because it requires labeling by specialized doctors and raises patient privacy concerns. When annotated data is scarce, the performance of deep learning models declines significantly.
Unlike deep networks, humans can learn quickly from a small number of samples. This is because humans accumulate diverse knowledge throughout life and possess innate reasoning abilities: the accumulated knowledge provides rich priors, and the innate reasoning abilities provide strong capabilities for analogy, generalization and computation. Inspired by this fast-learning ability, researchers study the problem of small-sample learning, hoping that computers can, like humans, quickly learn new things from only a small amount of labeled data. In addition, image object detection is one of the important fundamental tasks of computer vision, with important applications in fields such as autonomous driving and industrial vision. For object detection tasks, the cost of acquiring annotated data is also very high.
Currently, researchers have proposed the small-sample object detection task in order to alleviate the dependence of deep learning on annotated data. That is, given a large training set for a base task, the small-sample object detector learns transferable knowledge from this data. For new classes that have never been seen (classes that do not overlap with those of the base task), the detector can then detect targets in previously unseen test images using only a few labeled training examples per class.
However, because only a few samples are available in small-sample object detection, localization accuracy is low.
Disclosure of Invention
The embodiment of the application provides an image processing method and related equipment, which are used to improve the ability of a detection/segmentation model to localize objects.
The first aspect of the embodiment of the application provides an image processing method, which can be applied to object detection/segmentation scenarios. The method may be performed by a training device/image processing apparatus or by a component of the training device/image processing apparatus (e.g., a processor, a chip, or a system-on-chip). The method comprises the following steps: acquiring a labeled training image, wherein the training image comprises a foreground object and a background; and training a first neural network based on the training image, a first loss function and a second loss function to obtain a second neural network, wherein the second neural network is used for implementing an image detection/segmentation task, the first neural network comprises an encoder, a decoder and a generator, and the second neural network comprises the encoder and the decoder. The first loss function represents the difference, during training, between a first foreground image generated by the encoder and the generator of the first neural network and a second foreground image, where the first foreground image contains the foreground object but not the background, and the second foreground image is the training image with the background removed. The second loss function represents the difference, during training, between the label and the detection/segmentation result obtained by the encoder and the decoder of the second neural network.
In the embodiment of the application, the encoder is trained with both the first loss function and the second (detection/segmentation) loss function. Because the first loss function is used for reconstructing the foreground image, the encoder can capture more texture and structural features of the image, which improves the localization capability of small-sample object detection and thereby improves the detection/segmentation performance of the second neural network that includes the encoder.
Optionally, in a possible implementation manner of the first aspect, the first neural network and the second neural network further include a quantization module, where the quantization module is configured to update a feature map output by the encoder, and input the updated feature map to the decoder and the generator, respectively.
In this possible implementation, the quantization module may convert the continuous feature space into a discrete feature space represented by a prototype vector set. A discrete feature space is easier to model than a high-dimensional continuous space.
Optionally, in a possible implementation manner of the first aspect, the first loss function includes a third loss function and a fourth loss function, where the third loss function is used to represent the difference between the first foreground image and the second foreground image, and the fourth loss function is used to represent the difference between the feature map before and after it is updated by the quantization module during training.
In this possible implementation, a loss that minimizes the difference between the pixels of the feature map before and after updating is introduced to train the quantization module, enabling the quantization module to convert the continuous feature space into a discrete feature space represented by a prototype vector set. A discrete feature space is easier to model than a high-dimensional continuous space.
Optionally, in a possible implementation manner of the first aspect, the first neural network and the second neural network further include an assignment module, where the assignment module is configured to update indices of the feature map, and the indices are used by the quantization module to update the feature map.
In this possible implementation, the assignment module aligns the cluster centers of different pixels: when predicting the cluster center of each pixel, not only the current pixel but also the cluster centers of other similar pixels are considered, which improves subsequent inference.
Optionally, in a possible implementation manner of the first aspect, the first loss function further includes a fifth loss function, where the fifth loss function is used to represent the difference between the indices before and after they are updated by the assignment module during training.
In this possible implementation, training with the fifth loss function makes the difference between the indices before and after the assignment module's update smaller and smaller, that is, the recalculated index values remain as consistent as possible with the index values originally obtained by the nearest-neighbor clustering method.
A second aspect of an embodiment of the present application provides an image processing method, which can be applied to object detection/segmentation scenarios. The method may be performed by an image processing apparatus or by a component of an image processing apparatus (e.g., a processor, a chip, or a system-on-chip). The method comprises the following steps: acquiring a first image; extracting a first feature map of the first image based on an encoder; and obtaining a detection/segmentation result of the first image based on the first feature map and a decoder. The encoder and the decoder are trained with a labeled training image, a first loss function and a second loss function, where the training image comprises a foreground object and a background, the first loss function represents the difference, during training, between a first foreground image generated by the encoder and a generator and a second foreground image, the first foreground image contains the foreground object but not the background, the second foreground image is the training image with the background removed, and the second loss function represents the difference, during training, between the label and the detection/segmentation result obtained by the encoder and the decoder.
According to the embodiment of the application, the encoder is trained with the first loss function and the second loss function, so that the encoder can learn more texture and structural features of the image, which improves the localization capability of small-sample object detection and the detection/segmentation performance of the second neural network that includes the encoder.
Optionally, in a possible implementation manner of the second aspect, obtaining the detection/segmentation result of the first image based on the first feature map and the decoder includes: inputting the first feature map into the decoder to obtain the detection/segmentation result.
In this possible implementation, the first feature map is directly input into the decoder; since the decoder is trained with the first loss function and the second loss function, the obtained detection/segmentation result can reflect more texture and structural features.
Optionally, in a possible implementation manner of the second aspect, obtaining the detection/segmentation result of the first image based on the first feature map and the decoder includes: updating the first feature map based on a quantization module to obtain a second feature map, where the quantization module is trained with a fourth loss function that represents the difference, during training, between the feature map of the training image output by the encoder before and after it is updated by the quantization module; and inputting the second feature map into the decoder to obtain the detection/segmentation result.
In this possible implementation, the quantization module converts the continuous feature space into a discrete feature space represented by a prototype vector set. A discrete feature space is easier to model than a high-dimensional continuous space.
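A minimal sketch of this inference path is shown below, assuming hypothetical encoder, quantizer and decoder callables (the application does not prescribe their implementations); the two branches correspond to the implementations with and without the quantization module:

    def run_inference(first_image, encoder, decoder, quantizer=None):
        # Second aspect: extract the first feature map of the first image
        feature_map = encoder(first_image)
        if quantizer is not None:
            # Optional: update the first feature map to obtain the second feature map
            feature_map = quantizer(feature_map)
        # Input the (possibly quantized) feature map into the decoder
        return decoder(feature_map)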
A third aspect of the embodiments of the present application provides an image processing apparatus (which may also be a training device) that can be applied to object detection/segmentation scenarios. The image processing apparatus/training device includes: an acquisition unit, configured to acquire a labeled training image, wherein the training image comprises a foreground object and a background; and a training unit, configured to train a first neural network based on the training image, a first loss function and a second loss function to obtain a second neural network, wherein the second neural network is used for implementing an image detection/segmentation task, the first neural network comprises an encoder, a decoder and a generator, and the second neural network comprises the encoder and the decoder. The first loss function represents the difference, during training, between a first foreground image generated by the encoder and the generator of the first neural network and a second foreground image, where the first foreground image contains the foreground object but not the background, and the second foreground image is the training image with the background removed; the second loss function represents the difference, during training, between the label and the detection/segmentation result obtained by the encoder and the decoder of the second neural network.
Optionally, in a possible implementation manner of the third aspect, the first neural network and the second neural network further include a quantization module, where the quantization module is configured to update a feature map output by the encoder, and input the updated feature map to the decoder and the generator, respectively.
Optionally, in a possible implementation manner of the third aspect, the first loss function includes a third loss function and a fourth loss function, where the third loss function is used to represent the difference between the first foreground image and the second foreground image, and the fourth loss function is used to represent the difference between the feature map before and after it is updated by the quantization module during training.
Optionally, in a possible implementation manner of the third aspect, the first neural network and the second neural network further include an assignment module, where the assignment module is configured to update indices of the feature map, and the indices are used by the quantization module to update the feature map.
Optionally, in a possible implementation manner of the third aspect, the first loss function further includes a fifth loss function, where the fifth loss function is used to represent the difference between the indices before and after they are updated by the assignment module during training.
A fourth aspect of the embodiments of the present application provides an image processing apparatus that can be applied to object detection/segmentation scenarios. The image processing apparatus includes: an acquisition unit, configured to acquire a first image; an extraction unit, configured to extract a first feature map of the first image based on an encoder; and a processing unit, configured to obtain a detection/segmentation result of the first image based on the first feature map and a decoder. The encoder and the decoder are trained with a labeled training image, a first loss function and a second loss function, where the training image comprises a foreground object and a background, the first loss function represents the difference, during training, between a first foreground image generated by the encoder and a generator and a second foreground image, the first foreground image contains the foreground object but not the background, the second foreground image is the training image with the background removed, and the second loss function represents the difference, during training, between the label and the detection/segmentation result obtained by the encoder and the decoder.
Optionally, in a possible implementation manner of the fourth aspect, the processing unit is specifically configured to input the first feature map into the decoder to obtain the detection/segmentation result.
Optionally, in a possible implementation manner of the fourth aspect, the processing unit is specifically configured to update the first feature map based on a quantization module to obtain a second feature map, where the quantization module is trained with a fourth loss function that represents the difference, during training, between the feature map of the training image output by the encoder before and after it is updated by the quantization module; the processing unit is further configured to input the second feature map into the decoder to obtain the detection/segmentation result.
A fifth aspect of the present application provides an image processing apparatus comprising: a processor coupled to a memory, where the memory is configured to store a program or instructions that, when executed by the processor, cause the image processing apparatus to implement the method of the first aspect or any possible implementation of the first aspect, or the method of the second aspect or any possible implementation of the second aspect.
A sixth aspect of the application provides a computer readable medium having stored thereon a computer program or instructions which, when run on a computer, cause the computer to perform the method of the first aspect or any possible implementation of the first aspect or the method of the second aspect or any possible implementation of the second aspect.
A seventh aspect of the application provides a computer program product which, when executed on a computer, causes the computer to perform the method of the first aspect or any of the possible implementations of the first aspect, or the method of the second aspect or any of the possible implementations of the second aspect.
For the technical effects of the third aspect, the fifth aspect, the sixth aspect, the seventh aspect, or any of their possible implementations, refer to the technical effects of the first aspect or of its different possible implementations; they are not described again here.
For the technical effects of the fourth aspect, the fifth aspect, the sixth aspect, the seventh aspect, or any of their possible implementations, refer to the technical effects of the second aspect or of its different possible implementations; they are not described again here.
From the above technical solutions, the embodiments of the present application have the following advantage: the encoder is trained with both the first loss function and the detection/segmentation (second) loss function, and because the first loss function is used for reconstructing the foreground image, the encoder can capture more texture and structural features of the image, thereby improving the localization capability of small-sample object detection and improving the detection/segmentation performance of the second neural network that includes the encoder.
Drawings
FIG. 1 is a schematic diagram of a system architecture according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a chip hardware structure according to an embodiment of the present application;
FIG. 3A is a schematic diagram of an image processing system according to an embodiment of the present application;
FIG. 3B is a schematic diagram of another architecture of an image processing system according to an embodiment of the present application;
FIG. 4 is a schematic flow chart of an image processing method according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a training process of a second neural network according to an embodiment of the present application;
FIG. 6 is a schematic diagram of another training process of a second neural network according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a training process after adding a quantization module according to an embodiment of the present application;
FIG. 8 is a schematic diagram of another training process after adding a quantization module according to an embodiment of the present application;
FIG. 9 is a schematic diagram of another flow chart of an image processing method according to an embodiment of the present application;
FIG. 10 to FIG. 12 are schematic diagrams of several structures of an image processing apparatus according to an embodiment of the present application.
Detailed Description
The embodiment of the application provides an image processing method and related equipment, which are used to improve the ability of a detection/segmentation model to localize objects.
Currently, researchers have proposed the small-sample object detection task in order to alleviate the dependence of deep learning on annotated data. That is, given a large training set for a base task, the small-sample object detector learns transferable knowledge from this data. For new classes that have never been seen (classes that do not overlap with those of the base task), the detector can then detect targets in previously unseen test images using only a few labeled training examples per class.
However, because only a few samples are available in small-sample object detection, localization accuracy is low.
In order to solve the above technical problems, embodiments of the present application provide an image processing method and related apparatus, in which the encoder is trained with a first loss function and a second loss function for detection/segmentation. Since the first loss function is used for reconstructing the foreground image, the encoder can capture more texture and structural features of the image, thereby improving the localization capability of small-sample object detection and improving the detection/segmentation performance of the second neural network that includes the encoder. The image processing method and the related apparatus according to the embodiments of the present application are described in detail below with reference to the accompanying drawings.
For ease of understanding, related terms and concepts primarily related to embodiments of the present application are described below.
1. Neural network
The neural network may be composed of neural units. A neural unit may be an operation unit that takes x_s and an intercept of 1 as inputs, and the output of the operation unit may be:

h_{W,b}(x) = f(W^T x) = f(∑_{s=1}^{n} W_s x_s + b)

where s = 1, 2, ..., n, n is a natural number greater than 1, W_s is the weight of x_s, and b is the bias of the neural unit. f is the activation function of the neural unit, used to introduce a nonlinear characteristic into the neural network so as to convert the input signal of the neural unit into an output signal. The output signal of the activation function may be used as the input of the next convolutional layer. The activation function may be a ReLU function. A neural network is a network formed by connecting many such single neural units together, i.e., the output of one neural unit may be the input of another neural unit. The input of each neural unit may be connected to a local receptive field of the previous layer to extract features of the local receptive field; the local receptive field may be an area composed of several neural units.
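Purely for illustration (not part of the application), the following sketch evaluates the neural-unit formula above in Python with NumPy, using arbitrary example inputs, weights and a ReLU activation:

    import numpy as np

    def neural_unit(x, w, b):
        # f(sum_s W_s * x_s + b) with ReLU as the activation function f
        z = np.dot(w, x) + b
        return np.maximum(z, 0.0)

    x = np.array([0.5, -1.0, 2.0])   # inputs x_1..x_n
    w = np.array([0.3, 0.8, -0.1])   # weights W_1..W_n
    b = 0.05                         # bias b
    print(neural_unit(x, w, b))      # output signal of the neural unit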
The operation of each layer in a neural network can be described by the mathematical expression y = a(Wx + b). At a physical level, the operation of each layer can be understood as completing a transformation from input space to output space (i.e., from the row space to the column space of the matrix) through five operations on the input space (the set of input vectors): 1. dimension raising/reduction; 2. scaling up/down; 3. rotation; 4. translation; 5. "bending". Operations 1, 2 and 3 are completed by Wx, operation 4 is completed by +b, and operation 5 is completed by a(). The word "space" is used here because the objects being classified are not single things but a class of things, and space refers to the collection of all individuals of that class. W is a weight vector, and each value in the vector represents the weight of a neuron in that layer of the neural network. The vector W determines the spatial transformation from input space to output space described above, i.e., the weight W of each layer controls how the space is transformed. The purpose of training a neural network is ultimately to obtain the weight matrices of all layers of the trained network (weight matrices formed by the vectors W of the layers). Therefore, the training process of a neural network is essentially learning how to control the spatial transformation, and more specifically, learning the weight matrices.
2. Convolutional neural network
A convolutional neural network (CNN) is a deep neural network with a convolutional structure. A convolutional neural network contains a feature extractor composed of convolutional layers and sub-sampling layers. The feature extractor can be regarded as a filter, and the convolution process can be regarded as convolving the same trainable filter with an input image or a convolutional feature plane (feature map). A convolutional layer is a layer of neurons in the convolutional neural network that performs convolution processing on the input signal. In a convolutional layer, a neuron may be connected to only some of the neurons in the adjacent layer. A convolutional layer usually contains several feature planes, and each feature plane may be composed of a number of rectangularly arranged neural units. Neural units of the same feature plane share weights, and the shared weights are the convolution kernel. Weight sharing can be understood as meaning that the way image information is extracted is independent of location. The underlying principle is that the statistics of one part of an image are the same as those of other parts, so image information learned in one part can also be used in another part; the same learned image information can be used at all locations on the image. In the same convolutional layer, multiple convolution kernels may be used to extract different image information; in general, the more convolution kernels there are, the richer the image information reflected by the convolution operation.
The convolution kernel can be initialized in the form of a matrix of random size, and reasonable weights can be obtained through learning during the training of the convolutional neural network. In addition, a direct benefit of sharing weights is that the connections between the layers of the convolutional neural network are reduced, while the risk of overfitting is also reduced.
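As an illustration of weight sharing (an assumed toy example, not code from the application), the sketch below slides one 3×3 convolution kernel over an image so that the same trainable weights extract features at every location:

    import numpy as np

    def conv2d_single(image, kernel):
        # valid convolution, stride 1: the same kernel weights are reused at every position
        kh, kw = kernel.shape
        h, w = image.shape
        out = np.zeros((h - kh + 1, w - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
        return out

    image = np.random.rand(8, 8)               # toy single-channel image
    kernel = np.array([[1., 0., -1.],
                       [1., 0., -1.],
                       [1., 0., -1.]])         # one shared convolution kernel
    print(conv2d_single(image, kernel).shape)  # (6, 6) feature plane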
3. Deep learning
Deep learning is a machine learning technique based on deep neural network algorithms. Its main characteristic is the use of multiple nonlinear transformation structures to process and analyze data. It is mainly applied to scenarios such as perception and decision-making in the field of artificial intelligence, for example image and speech recognition, natural language translation, and computer gaming.
4. Loss function
When training a deep neural network, because the output of the deep neural network is expected to be as close as possible to the value that is actually desired, the weight vectors of each layer of the network can be updated according to the difference between the predicted value of the current network and the actually desired target value (of course, there is usually an initialization process before the first update, i.e., parameters are preconfigured for each layer of the deep neural network). For example, if the predicted value of the network is too high, the weight vectors are adjusted to make the prediction lower, and the adjustment continues until the neural network can predict the actually desired target value. Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value"; this is the role of the loss function or objective function, which are important equations for measuring the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, so training the deep neural network becomes a process of reducing this loss as much as possible.
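The following sketch (toy data, with PyTorch assumed as the framework) illustrates the process described above: a loss function measures the difference between the predicted value and the target value, and training repeatedly adjusts the weights to reduce that loss:

    import torch

    x = torch.randn(16, 4)                    # toy inputs
    y = torch.randn(16, 1)                    # toy target values
    model = torch.nn.Linear(4, 1)             # weights to be adjusted
    loss_fn = torch.nn.MSELoss()              # measures predicted-vs-target difference
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    for step in range(100):
        pred = model(x)
        loss = loss_fn(pred, y)               # higher loss means a larger difference
        opt.zero_grad()
        loss.backward()                       # gradients of the loss w.r.t. the weights
        opt.step()                            # update the weights to reduce the loss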
The system architecture provided by the embodiment of the application is described below.
Referring to fig. 1, an embodiment of the present application provides a system architecture 100. As shown in the system architecture 100, the data acquisition device 160 is configured to acquire training data. In the embodiment of the present application, the training data includes data of a plurality of different modalities, where a modality may be text, image, video or audio. For example, the training data may include labeled training images, and the like. The training data is stored in the database 130, and the training device 120 trains the target model/rule 101 based on the training data maintained in the database 130. How the training device 120 obtains the target model/rule 101 based on the training data will be described in more detail below; the target model/rule 101 can be used to implement the computer vision task to which the image processing method provided by the embodiment of the present application is applied. The computer vision task may include a classification task, a segmentation task, a detection task, or an image generation task, among others. In practical applications, the training data maintained in the database 130 is not necessarily acquired by the data acquisition device 160 and may also be received from other devices. It should be noted that the training device 120 does not necessarily train the target model/rule 101 entirely based on the training data maintained in the database 130; it may also obtain training data from the cloud or elsewhere for model training, and the above description should not be taken as a limitation on the embodiments of the present application.
The target model/rule 101 obtained by training according to the training device 120 may be applied to different systems or devices, such as the execution device 110 shown in fig. 1, where the execution device 110 may be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, an augmented reality (augmented reality, AR) device/Virtual Reality (VR) device, an in-vehicle terminal, and so on. Of course, the execution device 110 may also be a server or cloud, etc. In fig. 1, the execution device 110 is configured with an I/O interface 112 for data interaction with an external device, and a user may input data to the I/O interface 112 through the client device 140, where the input data may include in an embodiment of the present application: a first image. In addition, the input data may be input by a user, or may be uploaded by the user through the photographing device, or may be from a database, which is not limited herein.
The preprocessing module 113 is configured to perform preprocessing according to the first image received by the I/O interface 112, and in an embodiment of the present application, the preprocessing module 113 may be configured to perform processing such as flipping, translation, clipping, color conversion, etc. on the first image.
In preprocessing input data by the execution device 110, or in performing processing related to computation or the like by the computation module 111 of the execution device 110, the execution device 110 may call data, codes or the like in the data storage system 150 for corresponding processing, or may store data, instructions or the like obtained by corresponding processing in the data storage system 150.
Finally, the I/O interface 112 returns the processing results, such as the obtained detection/segmentation results described above, to the client device 140 for provision to the user.
It should be noted that the training device 120 may generate, based on different training data, a corresponding target model/rule 101 for different targets or different tasks, where the corresponding target model/rule 101 may be used to achieve the targets or complete the tasks, thereby providing the user with the desired result.
In the case shown in FIG. 1, the user may manually give input data that may be manipulated through an interface provided by the I/O interface 112. In another case, the client device 140 may automatically send the input data to the I/O interface 112, and if the client device 140 is required to automatically send the input data requiring the user's authorization, the user may set the corresponding permissions in the client device 140. The user may view the results output by the execution device 110 at the client device 140, and the specific presentation may be in the form of a display, a sound, an action, or the like. The client device 140 may also be used as a data collection terminal to collect input data of the input I/O interface 112 and output results of the output I/O interface 112 as new sample data as shown in the figure, and store the new sample data in the database 130. Of course, instead of being collected by the client device 140, the I/O interface 112 may directly store the input data input to the I/O interface 112 and the output result output from the I/O interface 112 as new sample data into the database 130.
It should be noted that fig. 1 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship among devices, apparatuses, modules, etc. shown in the drawing is not limited in any way, for example, in fig. 1, the data storage system 150 is an external memory with respect to the execution device 110, and in other cases, the data storage system 150 may be disposed in the execution device 110.
As shown in fig. 1, the target model/rule 101 obtained by training by the training device 120 may specifically be a target neural network in the embodiment of the present application.
The following describes a chip hardware structure provided by the embodiment of the application.
Fig. 2 is a chip hardware structure according to an embodiment of the present application, where the chip includes a neural network processor 20. The chip may be provided in an execution device 110 as shown in fig. 1 for performing the calculation of the calculation module 111. The chip may also be provided in the training device 120 as shown in fig. 1 to complete the training work of the training device 120 and output the target model/rule 101.
The neural network processor 20 may be any processor suitable for large-scale exclusive-OR operation processing, such as a neural-network processing unit (NPU), a tensor processing unit (TPU), or a graphics processing unit (GPU). Taking the NPU as an example: the neural network processor 20 is mounted as a coprocessor on a host central processing unit (CPU), and the host CPU assigns tasks to it. The core part of the NPU is the operation circuit 203; the controller 204 controls the operation circuit 203 to extract data from memory (the weight memory or the input memory) and perform operations.
In some implementations, the arithmetic circuitry 203 includes a plurality of processing units (PEs) internally. In some implementations, the operational circuitry 203 is a two-dimensional systolic array. The arithmetic circuitry 203 may also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition. In some implementations, the operational circuitry 203 is a general purpose matrix processor.
For example, assume that there is an input matrix a, a weight matrix B, and an output matrix C. The arithmetic circuit 203 takes the data corresponding to the matrix B from the weight memory 202 and buffers the data on each PE in the arithmetic circuit. The arithmetic circuit takes the matrix a data from the input memory 201 and performs matrix operation with the matrix B, and the obtained partial result or final result of the matrix is stored in the accumulator 208.
The vector calculation unit 207 may further process the output of the operation circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, magnitude comparison, and the like. For example, vector computation unit 207 may be used for network computation of non-convolutional/non-FC layers in a neural network, such as Pooling (Pooling), batch normalization (Batch Normalization), local response normalization (Local Response Normalization), and the like.
In some implementations, the vector computation unit 207 can store the vector of processed outputs to the unified buffer 206. For example, the vector calculation unit 207 may apply a nonlinear function to an output of the operation circuit 203, such as a vector of accumulated values, to generate an activation value. In some implementations, the vector calculation unit 207 generates normalized values, combined values, or both. In some implementations, the vector of processed outputs can be used as an activation input to the operational circuitry 203, for example for use in subsequent layers in a neural network.
The unified memory 206 is used for storing input data and output data.
The direct memory access controller (DMAC) 205 transfers input data in the external memory to the input memory 201 and/or the unified memory 206, stores the weight data in the external memory into the weight memory 202, and stores the data in the unified memory 206 into the external memory.
A bus interface unit (bus interface unit, BIU) 210 for interfacing between the main CPU, DMAC and the instruction fetch memory 209 via a bus.
An instruction fetch memory (instruction fetch buffer) 209 coupled to the controller 204 is configured to store instructions for use by the controller 204.
And the controller 204 is used for calling the instruction cached in the instruction memory 209 to realize the control of the working process of the operation accelerator.
Typically, the unified memory 206, the input memory 201, the weight memory 202 and the instruction fetch memory 209 are on-chip memories, and the external memory is memory outside the NPU, which may be a double data rate synchronous dynamic random access memory (DDR SDRAM), a high bandwidth memory (HBM), or another readable and writable memory.
Next, several application scenarios of the present application are described.
Fig. 3A is a schematic structural diagram of an image processing system according to an embodiment of the present application, where the image processing system includes a terminal device (only a mobile phone is taken as an example in fig. 3A) and an image processing device. It is understood that, in addition to a mobile phone, the terminal device may be a tablet (pad), a portable game console, a personal digital assistant (PDA), a notebook computer, an ultra-mobile personal computer (UMPC), a handheld computer, a netbook, a vehicle-mounted media playing device, a wearable electronic device, a virtual reality (VR) terminal device, an augmented reality (AR) device, a vehicle-mounted terminal, an aircraft terminal, an intelligent robot, or the like. The terminal device is the initiator of image processing; as the initiator of an image processing request, the user typically initiates the request through the terminal device.
The image processing apparatus may be a device or server having an image processing function, such as a cloud server, a web server, an application server, or a management server. The image processing apparatus receives the image processing request from the terminal device through an interactive interface, and then performs image processing, by means such as machine learning, deep learning, search, inference and decision-making, using a memory that stores data and a processor that runs the image processing. The memory in the image processing apparatus may be a general term that includes a local database storing historical data; the database may reside on the image processing apparatus or on another network server.
In the image processing system shown in fig. 3A, the terminal device may receive an instruction of a user, for example, the terminal device may acquire a plurality of data (for example, an image, text, audio, etc. acquired by the terminal device) input/selected by the user, and then initiate a request to the image processing device, so that the image processing device performs an image processing application (for example, a computer vision task of classification, segmentation, detection, image generation, etc.) on the plurality of data obtained by the terminal device, thereby obtaining a corresponding processing result for the plurality of data. For example, the terminal device may acquire an image input by the user, and then initiate an image detection request to the image processing device, so that the image processing device detects the image, thereby obtaining a detection result of the image, and displays the detection result of the image for viewing and use by the user.
In fig. 3A, the image processing apparatus may perform the image processing method of the embodiment of the present application.
Fig. 3B is another schematic structural diagram of an image processing system according to an embodiment of the present application, in fig. 3B, a terminal device (only a mobile phone is taken as an example of the terminal device in fig. 3B) is directly used as the image processing device, and the terminal device can directly obtain an image and directly process the image by hardware of the terminal device itself, and a specific process is similar to that of fig. 3A, and reference is made to the above description and will not be repeated here.
Alternatively, in the image processing system shown in fig. 3B, the terminal device may receive an instruction from the user, for example, the terminal device may acquire a plurality of images selected by the user in the terminal device, and then perform an image processing application (for example, a computer vision task of classification, segmentation, detection, image generation, etc.) on the images by the terminal device itself, thereby obtaining corresponding processing results for the images, and display the processing results for viewing and use by the user.
Alternatively, in the image processing system shown in fig. 3B, the terminal device may acquire an image in real time or periodically, and then perform an image processing application (e.g., a computer vision task such as classification, segmentation, detection, image generation, etc.) on the image by the terminal device itself, thereby obtaining a corresponding processing result for the image, and implement functions (classification function, segmentation function, detection function, image generation function, etc.) according to the processing result.
In fig. 3B, the terminal device itself may execute the image processing method of the embodiment of the present application.
The terminal device in fig. 3A and fig. 3B may be specifically the client device 140 or the executing device 110 in fig. 1, and the image processing device in fig. 3A may be specifically the executing device 110 in fig. 1, where the data storage system 150 may store data to be processed of the executing device 110, and the data storage system 150 may be integrated on the executing device 110, or may be disposed on a cloud or other network server.
The processor in fig. 3A and 3B may perform data training/machine learning/deep learning through a neural network model or another model (e.g., an attention model, an MLP, etc.), and use the finally trained or learned model to perform image processing applications on the data, thereby obtaining corresponding processing results.
The image processing method provided by the embodiment of the application can be applied to various scenes, and is described below.
First, the autopilot field.
Deep learning-based detection models are good at detecting common categories (e.g., cars, pedestrians), but have difficulty accurately detecting rare instances such as garbage bags on roadsides, dropped tires, or triangular cones placed in the road. False detections and missed detections of these obstacles can lead to serious consequences. The image processing method (also called a small-sample object detection algorithm) provided by the embodiment of the application can improve the detection model's performance on categories with only a small number of labeled samples and improve the model's precision and recall.
And secondly, railway and power grid fault detection.
Freight wagon inspection in the railway industry requires a manpower investment of about 1 billion yuan each year, and there are also fault detection scenarios for large components of passenger trains, high-speed trains and railway lines; power transmission, transformation and distribution inspection for power grids is estimated at more than 24 billion yuan over the next 5 years. Because faults occur with low probability and manual labeling is required, collecting labeled samples is difficult; moreover, imaging varies greatly with changes in the external environment, and differences within a fault class are significant. The image processing method (also called a small-sample object detection algorithm) provided by the embodiment of the application can effectively handle fault detection tasks with only a small number of labeled samples, and the model can be deployed on the cloud to provide efficient services for external customers.
It should be understood that the above two scenarios are merely examples, and in practical applications, the embodiments of the present application may also be applied to other scenarios such as small sample object detection/segmentation, which are not limited herein.
The training method and the image processing method of the neural network according to the embodiments of the present application are described in detail below with reference to the accompanying drawings.
The training method of the neural network according to the embodiment of the present application will be described in detail with reference to fig. 4. The method shown in fig. 4 may be performed by a training apparatus of the neural network; the training apparatus may be a cloud service device, or a terminal device (for example, a computer or a server) with sufficient computing power to perform the training method, or a system composed of a cloud service device and a terminal device. For example, the training method may be performed by the training device 120 in fig. 1 or the neural network processor 20 in fig. 2.
Alternatively, the training method may be processed by a CPU, or may be processed by both the CPU and the GPU, or may not use the GPU, and other suitable processors for neural network computation may be used, which is not limited by the present application.
The training method includes steps 401 and 402. Step 401 and step 402 are described in detail below.
Step 401, a training image with a label is acquired.
In the embodiment of the present application, the training device may acquire the training images in various manners, for example by capturing/shooting them, receiving them from other devices, or selecting them from a database; this is not specifically limited in the present application.
The training image in the embodiment of the application comprises a foreground object and a background, where the foreground object is generally the part that the user specifies the device needs to recognize.
Alternatively, the label of the training image may be obtained manually or by inputting the image into a model, which is not limited here. If applied to a detection scenario, the label may be the class of each object in the image and/or a circumscribed rectangle of the object's edges. If applied to a segmentation scenario, the label may be per-pixel classification labels, which can be understood as the class corresponding to each pixel in the image.
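For illustration only, labels of the two kinds might be organized as follows (the field names and shapes are assumptions, not prescribed by the application):

    import numpy as np

    # Detection-style label: class and circumscribed rectangle per object
    detection_label = [
        {"class": "car",        "bbox": (34, 50, 120, 180)},   # (x_min, y_min, x_max, y_max)
        {"class": "pedestrian", "bbox": (200, 40, 240, 160)},
    ]

    # Segmentation-style label: a class index for every pixel of the training image
    segmentation_label = np.zeros((256, 256), dtype=np.int64)  # 0 = background
    segmentation_label[50:180, 34:120] = 1                     # pixels of a foreground object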
Optionally, if applied to an autopilot scenario, the training device may be a vehicle, and the training image may be data collected by the vehicle in real time, or may be data collected periodically, which is not limited herein.
Step 402, training a first neural network based on the training image, the first loss function and the second loss function to obtain a second neural network, wherein the second neural network is used for realizing the detection/segmentation task of the image.
After the training device acquires the training image, it can train the first neural network based on the training image, the first loss function and the second loss function to obtain the second neural network. The first neural network comprises an encoder, a decoder and a generator, and the second neural network comprises the encoder and the decoder of the first neural network; it can also be understood that the first neural network comprises the second neural network and a generator. The first loss function is used to represent the difference, during training, between a first foreground image generated by the encoder and the generator of the first neural network and a second foreground image, where the first foreground image contains the foreground object but not the background, and the second foreground image is the training image with the background removed. The second loss function is used to represent the difference, during training, between the label and the detection/segmentation result obtained based on the encoder and the decoder of the second neural network.
The first loss function in the embodiment of the application can be understood as a generative loss function, and the second loss function can be understood as a detection/segmentation loss function. By training with these two loss functions, the encoder can learn more texture and structural features of the image, which improves the localization capability of small-sample object detection and thereby improves the detection/segmentation performance of the second neural network that includes the encoder.
An example of the first loss function is illustrated by equation one:
equation one: l (L) rec =||D(Q)-x⊙m||;
Wherein L is rec The first loss function is represented, D represents a decoder, Q represents a feature map, which can be the feature map output by an encoder or the feature map updated by a subsequent quantization module, x represents a training image, and x represents a point-to-point multiplier, m represents a binarized mask, the size is the same as that of the training image, and the pixels of a foreground object are set to 1 and the pixels of a background are set to 0 according to labeling information.
It should be understood that the formula of the first loss function is merely an example, and other formulas may be provided in practical application, as shown in the formula two, and the specific structure of the first loss function is not limited.
Formula II: l (L) rec =||D(Q 0 ,Q 1 ,Q 2 )-x⊙m||;
Wherein Q is 0 、Q 1 、Q 2 The characteristic diagrams of three scales obtained by the training image through the encoder can be represented, the characteristic diagrams of three scales after the subsequent quantization module is updated can also be represented, and description of the rest parameters can refer to a formula I and are not repeated here.
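As an illustration only, a minimal sketch of the single-scale reconstruction loss of formula one is given below, assuming PyTorch tensors and an L1 norm for ||·||; the function and argument names are placeholders and are not part of the embodiment.

```python
import torch

def reconstruction_loss(generated_fg: torch.Tensor,
                        image: torch.Tensor,
                        mask: torch.Tensor) -> torch.Tensor:
    """Masked reconstruction loss of formula one, L_rec = ||D(Q) - x (.) m||.

    generated_fg: first foreground image produced from the feature map, shape (B, C, H, W)
    image:        training image x, shape (B, C, H, W)
    mask:         binarized mask m (foreground pixels 1, background pixels 0), shape (B, 1, H, W)
    """
    second_fg = image * mask                        # x (.) m: training image with background removed
    return (generated_fg - second_fg).abs().mean()  # L1 norm assumed for || . ||
```

The multi-scale variant of formula two would simply feed the generator with Q_0, Q_1 and Q_2 before computing the same masked difference.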
In the embodiment of the present application, the second loss function may be an absolute value loss function, a logarithmic loss function, an exponential loss function, a cross entropy loss function, and the like, and may be set according to actual needs, which is not limited herein.
This step can also be understood as optimizing the localization features of the detection/segmentation model using the generative model as a constraint.
For example, the specific process in which the training device trains the first neural network based on the training image, the first loss function and the second loss function to obtain the second neural network may be as shown in fig. 5. The training image is input into the encoder to obtain a feature map of the training image; the feature map is input into the generator to generate the first foreground image, and the feature map is input into the decoder to obtain the detection/segmentation result. The background in the training image is removed to obtain the second foreground image. The encoder and the generator are trained by using the first loss function, so that the difference between the first foreground image and the second foreground image output based on the encoder and the generator becomes smaller and smaller. The encoder and the decoder are trained by using the second loss function, so that the difference between the detection/segmentation result output based on the encoder and the decoder and the label becomes smaller and smaller.
Optionally, the first neural network and the second neural network may further include a quantization module, which is used for updating the feature map (F_i) output by the encoder, where i indicates the layer number and is an integer greater than or equal to 1, and the updated feature maps are input to the generator and the decoder respectively. The quantization module may update the feature map by using a prototype vector set V_i = {v_1^i, v_2^i, ..., v_n^i}, where n is an integer greater than 1. For example, each pixel f_j^i of F_i is replaced by the prototype vector v_k^i in V_i that is nearest to f_j^i, where j indicates the position within the layer and is an integer greater than or equal to 1, and k is between 1 and n. This replacement process can be regarded as a clustering process in which the prototype vectors are cluster centers and each input pixel is assigned to the cluster center nearest to it. In order to ensure the reliability of the clustering, an effective clustering process is learned by introducing the fourth loss function mentioned later.
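A minimal sketch of the nearest-prototype replacement described above is given below, assuming PyTorch tensors; the prototype set is passed in as a matrix and all names are placeholders.

```python
import torch

def quantize(feature: torch.Tensor, prototypes: torch.Tensor):
    """Replace every pixel of the feature map F_i by its nearest prototype vector in V_i.

    feature:    encoder feature map F_i, shape (B, C, H, W)
    prototypes: prototype vector set V_i, shape (n, C)
    returns:    quantized feature map Q_i, shape (B, C, H, W), and the index map l, shape (B, H, W)
    """
    b, c, h, w = feature.shape
    flat = feature.permute(0, 2, 3, 1).reshape(-1, c)   # (B*H*W, C) pixels f_j
    dist = torch.cdist(flat, prototypes)                # distance of each pixel to each cluster center
    idx = dist.argmin(dim=1)                            # nearest prototype index per pixel
    quant = prototypes[idx].reshape(b, h, w, c).permute(0, 3, 1, 2).contiguous()
    return quant, idx.reshape(b, h, w)
```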
Further, in order to minimize the difference between pixels of the feature map before and after the update, a loss may be introduced to train the quantization module, so that the quantization module converts the continuous feature space into a discrete feature space represented by the prototype vector set; a discrete feature space is easier to model than a high-dimensional continuous space. In other words, the first loss function may include a third loss function and a fourth loss function, where the third loss function is used for representing the difference between the aforementioned first foreground image and the second foreground image, and the fourth loss function is used for representing the difference of the feature map output by the encoder before and after the quantization module update. That is, the generation loss function includes the third loss function and the fourth loss function. It can be understood that the quantization modules may correspond one-to-one to the feature maps output by the encoder, i.e., there may be one quantization module per scale at which the encoder outputs a feature map. In this case, the third loss function may be represented by formula one or formula two, and the fourth loss function may be represented by formula three.
Formula three: L_qt = (1/(H×W)) Σ_{j=1}^{H×W} ||f_j - q_j||;
wherein L_qt represents the fourth loss function, W represents the width of the feature map, H represents the height of the feature map, f_j represents the j-th pixel of the feature map output by the encoder before the quantization module update, and q_j represents the corresponding pixel after the update; for the remaining parameters, refer to the foregoing description, which is not repeated here.
It will be appreciated that the above formula of the fourth loss function is merely exemplary, and in practical application, other formulas are also possible, and the specific structure of the fourth loss function is not limited.
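For illustration, a sketch of the fourth loss function is given below; it follows the per-pixel form of formula three as reconstructed above, which is itself only an exemplary form, and the names are placeholders.

```python
import torch

def quantization_loss(feature: torch.Tensor, quantized: torch.Tensor) -> torch.Tensor:
    """L_qt: mean per-pixel difference between the feature map before and after quantization.

    feature:   feature map output by the encoder, shape (B, C, H, W)
    quantized: feature map after the quantization module update, shape (B, C, H, W)
    """
    # Per-pixel L2 norm over the channel dimension, averaged over the H*W positions (and the batch).
    return (feature - quantized).norm(dim=1).mean()
```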
For example, after the quantization module is introduced into the first neural network and the second neural network, the training process may be as shown in fig. 6: the training image is input into the encoder to obtain a feature map of the training image, and the quantization module updates the feature map; on one hand, the updated feature map is input into the generator to generate the first foreground image, and on the other hand, the updated feature map is input into the decoder to obtain the detection/segmentation result. The background in the training image is removed to obtain the second foreground image. The encoder, the quantization module and the generator are trained by using the third loss function and the fourth loss function, so that the difference between the first foreground image and the second foreground image output based on the encoder, the quantization module and the generator becomes smaller and smaller. The encoder and the decoder are trained by using the second loss function, so that the difference between the detection/segmentation result output based on the encoder and the decoder and the label becomes smaller and smaller. The fourth loss function is further used so that the difference of the feature map before and after the quantization module update becomes smaller and smaller. Taking the case where the encoder outputs three feature maps of different scales (the aforementioned Q_0, Q_1, Q_2) as an example, the training process may be as shown in fig. 7; in order to enlarge the receptive field of each feature map, a stitching operation as shown in fig. 7, which can also be understood as a residual structure, may be introduced.
In addition, the above-described clustering process considers only the relationship between each pixel and the cluster centers and does not consider the association among multiple pixels, which is detrimental to the clustering process and makes the clustering result unreliable. Therefore, when predicting the cluster center of each pixel, not only the current pixel but also the cluster centers of other similar pixels are considered. To this end, the first neural network and the second neural network may further include an assignment module, which is used for updating the index of the feature map, and the updated index is used by the quantization module to update the feature map. In other words, the assignment module can align the cluster centers of different pixels. The training process in this case may be as shown in fig. 8: the training image is input into the encoder to obtain a feature map of the training image, and the quantization module updates the pixels of the feature map to obtain quantized vectors. The indexes of the quantized vectors are input into the assignment module for updating. The fifth loss function is used for training so that the difference between the index before and after the assignment module update becomes smaller and smaller, that is, the recalculated index value is kept as consistent as possible with the index value originally obtained by the nearest-neighbor clustering method. The fifth loss function is proposed to improve the quantization precision and the ability of the generation model (including the encoder and the generator) to reconstruct the foreground image.
Illustratively, the process of updating the index is shown in equation four.
Equation four: l̂_i = A(F, l)_i = Σ_j sim(f_i, f_j) · O(l_j);
wherein l̂_i represents the index of pixel f_i updated by the assignment module, A represents the assignment module, l_i represents the prototype index value of pixel f_i calculated by the nearest-neighbor method, l_j represents the prototype index value of pixel f_j calculated by the nearest-neighbor method, sim represents a similarity calculation function used to calculate the similarity between f_i and f_j, and O represents a one-hot embedding function, which converts an index into a binary vector.
It should be understood that the above formula for updating the index is merely an example, and in practical application, other formulas are also possible, and the specific process for updating the index is not limited.
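Purely as an illustration, the following sketch re-estimates each pixel's index by similarity-weighted aggregation of the one-hot indices of the other pixels, assuming a softmax dot-product similarity; the exact similarity function used by the assignment module is not limited here, and all names are placeholders.

```python
import torch
import torch.nn.functional as F

def reassign_indices(flat_feat: torch.Tensor, idx: torch.Tensor, num_prototypes: int) -> torch.Tensor:
    """Re-estimate each pixel's prototype index from the indices of similar pixels.

    flat_feat: pixels of the feature map, shape (N, C) where N = H*W
    idx:       nearest-neighbour prototype indices l (long tensor), shape (N,)
    returns:   updated soft indices, shape (N, n)
    """
    one_hot = F.one_hot(idx, num_prototypes).float()                               # O(l): binary index vectors
    sim = F.softmax(flat_feat @ flat_feat.t() / flat_feat.shape[1] ** 0.5, dim=1)  # assumed similarity function
    return sim @ one_hot                                                           # aggregate over similar pixels
```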
The fifth loss function in the embodiment of the present application may be as shown in the formula five.
Formula five: L_align = (1/(H×W)) Σ_i ||l̂_i - O(l_i)||;
wherein L_align represents the fifth loss function, W represents the width of the feature map, H represents the height of the feature map, and the remaining parameters may refer to the description of the foregoing equation four, which is not repeated here.
It will be appreciated that the fifth loss function is merely exemplary, and that in practical applications, there may be other forms of the fifth loss function, and the specific formula of the fifth loss function is not limited herein.
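A minimal sketch of the fifth loss function is shown below, assuming an L1 difference between the recomputed soft index and the one-hot nearest-neighbour index; as stated above, the exact form is not limited, and the names are placeholders.

```python
import torch
import torch.nn.functional as F

def alignment_loss(updated_idx: torch.Tensor, nn_idx: torch.Tensor, num_prototypes: int) -> torch.Tensor:
    """L_align: keep the re-computed index consistent with the nearest-neighbour index.

    updated_idx: soft indices from the assignment module, shape (N, n)
    nn_idx:      hard nearest-neighbour indices l (long tensor), shape (N,)
    """
    target = F.one_hot(nn_idx, num_prototypes).float()   # O(l)
    return (updated_idx - target).abs().mean()           # L1 difference assumed for || . ||
```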
The third loss function, the fourth loss function and the fifth loss function can all be understood as generation loss functions, so that, in the process of being trained together with the generator, the encoder can learn more structural and texture features of the image, thereby improving the accuracy of the detection/segmentation task performed by the second neural network that includes the encoder.
The training method of the neural network is described in detail above, and the image processing method provided by the embodiment of the present application is described in detail below. The method may be performed by an image processing apparatus or by a component of an image processing apparatus (e.g., a processor, a chip, or a system-on-chip). The image processing apparatus may be a cloud device (as shown in fig. 3A) or a terminal device (such as the mobile phone shown in fig. 3B). Of course, the method may also be performed by a system formed by the cloud device and the terminal device (as shown in fig. 3A). Alternatively, the method may be processed by a CPU in the image processing apparatus, or jointly by a CPU and a GPU; alternatively, no GPU is used and another processor suitable for neural network computation is used instead, which is not limited by the present application.
The terminal device may be a mobile phone, a tablet (pad), a portable game machine, a palm computer (personal digital assistant, PDA), a notebook computer, an ultra mobile personal computer (ultra mobile personal computer, UMPC), a handheld computer, a netbook, a vehicle-mounted media playing device, a wearable electronic device, a Virtual Reality (VR) terminal device, an augmented reality (augmented reality, AR) terminal device, or other terminal products.
The application scenarios to which the method provided by the embodiment of the present application is applicable may be small sample object detection/segmentation scenarios in the autonomous driving field, railway/power grid fault detection, and the like. Referring to fig. 9, a flowchart of an image processing method according to an embodiment of the present application may include steps 901 to 903. Steps 901 to 903 are described in detail below.
Step 901, a first image is acquired.
In the embodiment of the present application, the image processing apparatus may acquire the first image in various manners, for example, by capturing/shooting it, by receiving it from another device, or by selecting it from a database, which is not limited herein.
Alternatively, if applied to an autopilot scenario, the image processing device may be a vehicle, and the first image may be data acquired by the vehicle in real time or periodically acquired data, which is not limited herein.
Step 902, a first feature map of the first image is extracted based on the encoder.
The encoder and decoder in the embodiment of the present application may be trained by the training method provided in the embodiment shown in fig. 4.
The encoder and the decoder are trained by a training image with a label, a first loss function and a second loss function, wherein the training image comprises a foreground object and a background, the first loss function is used for representing the difference between the first foreground image generated by the encoder and the generator in the training process and the second foreground image, the first foreground image comprises the foreground object and does not comprise the background, the second foreground image is an image obtained by subtracting the background from the training image, and the second loss function is used for representing the difference between the detection/segmentation result obtained by the encoder and the decoder in the training process and the label. For the description of the first loss function, the second loss function, etc., reference may be made to the embodiment shown in fig. 4, which is not repeated here.
In step 903, a detection/segmentation result of the first image is obtained based on the first feature map and the decoder.
After the image processing apparatus acquires the first feature map, a detection/segmentation result of the first image may be obtained based on the first feature map and the decoder.
In the embodiment of the present application, there are various ways to obtain the detection/segmentation result of the first image based on the first feature map and the decoder, which are described below.
In the first way, the first feature map is input into the decoder to obtain the detection/segmentation result.
In the second way, the first feature map is input into the quantization module to obtain a second feature map, that is, the first feature map is updated based on the quantization module to obtain the second feature map; the quantization module is trained based on a fourth loss function, which is used for representing the difference of the feature map of the training image output by the encoder before and after the quantization module update during training. The second feature map is then input into the decoder to obtain the detection/segmentation result.
Further, in the second way, an assignment module may additionally be used to update the index obtained during quantization, so that the quantization module updates the first feature map with the updated index to obtain the second feature map. The assignment module is trained based on a fifth loss function, which is used for representing the difference between the index before and after the assignment module update during training.
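For illustration, the two ways can be sketched as follows, where quantize stands in for the optional quantization (and assignment) stage; the module names are placeholders, not the embodiment's exact interfaces.

```python
import torch

@torch.no_grad()
def infer(image, encoder, decoder, quantize=None):
    """Sketch of steps 902 and 903."""
    feat = encoder(image)        # first feature map (step 902)
    if quantize is not None:     # second way: update the feature map before decoding
        feat = quantize(feat)    # second feature map from the quantization (and assignment) stage
    return decoder(feat)         # detection/segmentation result (step 903)
```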
In this embodiment, descriptions of the loss functions (e.g., the first loss function, the second loss function, the third loss function, the fourth loss function, and the fifth loss function) may refer to the descriptions in the embodiment shown in fig. 4, and are not repeated here.
Alternatively, the overall process in this embodiment may be regarded as inputting the first image into the second neural network trained in the embodiment shown in fig. 4 to perform the detection/segmentation task, thereby obtaining the detection/segmentation result.
In this embodiment, the encoder is trained through the first loss function and the second loss function, so that the encoder can learn more textures and structural features of the image, thereby improving the positioning capability of small sample object detection and improving the detection/segmentation effect of the second neural network including the encoder.
In order to show the beneficial effects of the method according to the embodiment of the present application more intuitively, the model trained by the method according to the embodiment of the present application is described below with the object detection dataset MS-COCO as an example.
First, the MS-COCO dataset is introduced. It includes a total of 80k training samples, 40k validation samples and 20k test samples, covering 80 categories. Among them, 20 classes are set as new task classes and the remaining 60 classes are set as original task classes. Images belonging to the 20 new task categories among 5k images of the 20k test samples are used for model performance evaluation, and the images of the 80k training samples are used for model training.
In small sample object detection, an original task is given, which is a detection task of Ns categories, each category having a large number of labeled samples. Meanwhile, a new task is given, which is a detection task of Nt categories, each category having only K labeled samples. The categories of the original task and the new task do not overlap. The goal of standard small sample object detection is to learn a detector for the new task.
The detector has two cases, respectively described below:
first, the detector is an Nt class detector. The training process is as follows:
1. pre-training of original tasks: the network shown in fig. 5/6 is first fully trained with Ns classes of training data to obtain Ns classes of detectors.
2. Fine tuning of new tasks: the last layer of the network is then modified to output Nt neurons; except for this randomly initialized last layer, the other layers are initialized with the parameters of the Ns-class detector, and the network parameters are fine-tuned with the small amount of data of the new task, as sketched below.
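As a sketch only, the fine-tuning preparation can look as follows, assuming the detector exposes a final classification layer; the attribute name cls_head and the argument names are hypothetical.

```python
import torch.nn as nn

def prepare_for_new_task(detector: nn.Module, feat_dim: int, nt: int) -> nn.Module:
    """Replace only the last layer so it outputs Nt neurons; all other layers keep the
    parameters of the Ns-class detector obtained in the pre-training step."""
    detector.cls_head = nn.Linear(feat_dim, nt)   # randomly initialized new head
    return detector
```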
Small sample object detection is performed using the trained detector (i.e., the second neural network), with test data from the classes of the new task. The measurement index is the average precision (AP) under different intersection-over-union (IoU) thresholds; typically, 10 thresholds are taken at intervals of 0.05 between 0.5 and 0.95, each threshold corresponds to one precision value, and the 10 values are averaged to obtain the average precision. For example, AP_50 refers to the AP value when a predicted box counts as a detection if its intersection-over-union with the target box is greater than 0.5, and AP_75 refers to the AP value when the intersection-over-union threshold is 0.75; AP can also be understood as the mean of the AP values under the different intersection-over-union thresholds. The larger these indices, the better the performance of the detection model.
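For clarity, the averaging of the per-threshold precisions can be sketched as follows; the per-threshold values come from the evaluation itself and are not reproduced here.

```python
def coco_style_ap(ap_per_iou: dict) -> float:
    """Average the per-threshold precisions; ap_per_iou maps IoU thresholds
    (0.50, 0.55, ..., 0.95) to the AP measured at that threshold."""
    return sum(ap_per_iou.values()) / len(ap_per_iou)

# AP_50 and AP_75 are simply ap_per_iou[0.50] and ap_per_iou[0.75].
```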
Models obtained by training with existing small sample object detection algorithms and with the method provided by the embodiment of the present application (also referred to as GLFR) were used to perform experiments on the MS-COCO dataset, respectively; the results are shown in Table 1. The existing small sample object detection algorithms include: a few-sample transfer detector (low-shot transfer detector, LSTD), a meta region-based detector (Meta region-based convolutional neural network, Meta R-CNN), a transformation-invariant small sample detector (transformation-invariant few-shot object detection, TIP), dense relation distillation with context-aware aggregation (DCNet), a query-adaptive small sample object detector (query-adaptive few-shot object detector, QA-FewDet), a two-stage fine-tuning algorithm (two-stage fine-tune algorithm, TFA), small sample object detection based on multi-scale positive sample refinement (multi-scale positive sample refinement for few-shot object detection, MSPR), a class-margin-equilibrium-based small sample object detector (class margin equilibrium for few-shot object detection, CME), small sample object detection based on classification refinement and distractor retreatment (few-shot object detection via classification refinement and distractor retreatment, CRDR), a small sample object detector based on semantic relation reasoning (semantic relation reasoning for few-shot object detection, SRR-FSOD), a universal-prototype-enhanced small sample detector (universal-prototype enhancing for few-shot object detection, FSOD^up), small sample object detection based on contrastive proposal encoding (few-shot object detection via contrastive proposal encoding, FSCE), and the decoupled Faster R-CNN (decoupled Faster R-CNN, DeFRCN).
TABLE 1
Table 1 gives the experimental results on the MS-COCO dataset, where 10 and 30 images (K-shot = 10 or 30) are given as the training set for each new class, respectively. As can be seen from Table 1, under the standard small sample object detection setting, the model trained by the method provided by the embodiment of the present application (i.e., GLFR) significantly outperforms the other small sample object detection algorithms.
Second, the detector is a detector that detects the Ns+Nt classes simultaneously. The training process is as follows:
1. pre-training of original tasks: the network shown in fig. 5/6 is first fully trained with Ns classes of training data to obtain Ns classes of detectors.
2. Fine tuning of new tasks: the last layer of the network is then modified to output Nt+Ns neurons; except for this randomly initialized last layer, the other layers are initialized with the parameters of the Ns-class detector. For the Ns categories of the original task, K samples are randomly sampled from the training data of each category, and for the new task all of its training data are used; the two parts of data are combined into a balanced fine-tuning dataset, and the parameters of the entire network are fine-tuned with this dataset, as sketched below.
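A minimal sketch of assembling the balanced fine-tuning set under these assumptions (samples held in plain Python lists, K per original class; names are placeholders) is:

```python
import random

def build_balanced_finetune_set(base_samples_by_class: dict, new_task_samples: list, k: int) -> list:
    """Combine K randomly sampled images per original (Ns) class with all new-task data."""
    balanced = []
    for _, samples in base_samples_by_class.items():
        balanced.extend(random.sample(samples, min(k, len(samples))))
    balanced.extend(new_task_samples)
    return balanced
```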
This case is also evaluated on the object detection dataset MS-COCO. The setup is similar to the first case described above, except that 5k images in the validation set are used for model performance evaluation. Experiments were performed on the MS-COCO dataset using the existing small sample object detection algorithms and GLFR, respectively, with the experimental results shown in Table 2:
TABLE 2
Table 2 gives the experimental results on the MS-COCO dataset, where 10 and 30 images (K-shot = 10 or 30) are given as the training set for each new class, respectively. As can be seen from Table 2, under this small sample object detection setting, the model trained by the method provided by the embodiment of the present application (i.e., GLFR) also significantly outperforms the other small sample object detection algorithms.
The image processing method in the embodiment of the present application is described above, and the image processing apparatus in the embodiment of the present application is described below with reference to fig. 10, where an embodiment of the image processing apparatus in the embodiment of the present application includes:
an obtaining unit 1001, configured to obtain a training image with a label, where the training image includes a foreground object and a background;
the training unit 1002 is configured to train a first neural network based on a training image, a first loss function, and a second loss function, to obtain a second neural network, where the second neural network is used to implement a task of detecting/segmenting an image, the first neural network includes an encoder, a decoder, and a generator, and the second neural network includes an encoder and a decoder; the first loss function is used for representing the difference between a first foreground image and a second foreground image generated by an encoder and a generator in a first neural network in the training process, the first foreground image comprises a foreground object and does not comprise a background, the second foreground image is an image after the background is subtracted from the training image, and the second loss function is used for representing the difference between a detection/segmentation result obtained by the encoder and the decoder in a second neural network and a label in the training process.
In this embodiment, operations performed by each unit in the image processing apparatus are similar to those described in the embodiments shown in fig. 1 to 8, and are not described here again.
In this embodiment, the training unit 1002 trains the encoder through the first loss function and the second loss function of detection/segmentation, and since the first loss function is used for reconstructing the foreground image, the encoder can capture more texture and structural features of the image, thereby improving the positioning capability of small sample object detection and improving the detection/segmentation effect of the second neural network including the encoder.
Referring to fig. 11, an embodiment of an image processing apparatus according to an embodiment of the present application includes:
an acquisition unit 1101 for acquiring a first image;
an extracting unit 1102 for extracting a first feature map of the first image based on the encoder;
a processing unit 1103, configured to obtain a detection/segmentation result of the first image based on the first feature map and the decoder;
the encoder and the decoder are trained by a training image with a label, a first loss function and a second loss function, the training image comprises a foreground object and a background, the first loss function is used for representing the difference between the first foreground image and the second foreground image generated by the encoder and the generator in the training process, the first foreground image comprises the foreground object and does not comprise the background, the second foreground image is an image obtained by subtracting the background from the training image, and the second loss function is used for representing the difference between the detection/segmentation result obtained by the encoder and the decoder in the training process and the label.
In this embodiment, the operations performed by the units in the image processing apparatus are similar to those described in the embodiment shown in fig. 9, and will not be described here again.
In this embodiment, the encoder is trained through the first loss function and the second loss function, so that the encoder can learn more textures and structural features of the image, thereby improving the positioning capability of small sample object detection and improving the detection/segmentation effect of the second neural network including the encoder.
Referring to fig. 12, another image processing apparatus according to the present application is shown. The image processing device may include a processor 1201, a memory 1202, and a communication port 1203. The processor 1201, memory 1202 and communication port 1203 are interconnected by wires. Wherein program instructions and data are stored in memory 1202.
The memory 1202 stores therein program instructions and data corresponding to the steps executed by the image processing apparatus in the respective embodiments shown in fig. 1 to 9.
A processor 1201 for executing steps executed by the image processing apparatus as shown in any of the embodiments shown in the foregoing fig. 1 to 9.
The communication port 1203 may be used for receiving and transmitting data, and may be used for performing the steps related to acquiring, transmitting, and receiving in any of the embodiments shown in fig. 1-9.
In one implementation, the image processing device may include more or less components relative to FIG. 12, which is only exemplary and not limiting.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be embodied essentially or in part or all of the technical solution or in part in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM, random access memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.

Claims (19)

1. An image processing method, comprising:
acquiring a training image with a label, wherein the training image comprises a foreground object and a background;
training a first neural network based on the training image, a first loss function and a second loss function to obtain a second neural network, wherein the second neural network is used for realizing detection/segmentation tasks of the image, the first neural network comprises an encoder, a decoder and a generator, and the second neural network comprises the encoder and the decoder; the first loss function is used for representing a difference between a first foreground image and a second foreground image generated based on the encoder and the generator in the first neural network in the training process, the first foreground image comprises the foreground object and does not comprise the background, the second foreground image is an image obtained by subtracting the background from the training image, and the second loss function is used for representing a difference between a detection/segmentation result obtained based on the encoder and the decoder in the second neural network and the label in the training process.
2. The method of claim 1, wherein the first and second neural networks further comprise a quantization module for updating the feature map output by the encoder and inputting the updated feature map to the decoder and the generator, respectively.
3. The method of claim 2, wherein the first loss function comprises a third loss function representing a difference between the first foreground image and the second foreground image and a fourth loss function representing a difference between the feature map before and after the quantization module update during training.
4. The method of claim 3, wherein the first neural network and the second neural network further comprise an assignment module for updating an index of the feature map, the index being used by the quantization module to update the feature map.
5. The method of claim 4, wherein the first loss function further comprises a fifth loss function, the fifth loss function being used to represent a difference between the index before and after the assignment module update during training.
6. An image processing method, comprising:
acquiring a first image;
extracting a first feature map of the first image based on an encoder;
obtaining a detection/segmentation result of the first image based on the first feature map and a decoder;
The encoder and the decoder are trained by a training image with a label, a first loss function and a second loss function, the training image comprises a foreground object and a background, the first loss function is used for representing the difference between a first foreground image and a second foreground image generated by the encoder and the generator in the training process, the first foreground image comprises the foreground object and does not comprise the background, the second foreground image is an image obtained by subtracting the background from the training image, and the second loss function is used for representing the difference between a detection/segmentation result obtained by the encoder and the decoder in the training process and the label.
7. The method of claim 6, wherein the obtaining the detection/segmentation result of the first image based on the first feature map and a decoder comprises:
inputting the first feature map into the decoder to obtain the detection/segmentation result.
8. The method of claim 6, wherein the obtaining the detection/segmentation result of the first image based on the first feature map and a decoder comprises:
updating the first feature map based on a quantization module to obtain a second feature map, wherein the quantization module is trained based on a fourth loss function, and the fourth loss function is used for representing the difference between the feature map of the training image output by the encoder before the quantization module update and the feature map after the quantization module update in the training process;
and inputting the second feature map into the decoder to obtain the detection/segmentation result.
9. An image processing apparatus, characterized in that the image processing apparatus comprises:
the acquisition unit is used for acquiring a training image with a label, wherein the training image comprises a foreground object and a background;
the training unit is used for training a first neural network based on the training image, a first loss function and a second loss function to obtain a second neural network, wherein the second neural network is used for realizing the detection/segmentation task of the image, the first neural network comprises an encoder, a decoder and a generator, and the second neural network comprises the encoder and the decoder; the first loss function is used for representing a difference between a first foreground image and a second foreground image generated based on the encoder and the generator in the first neural network in the training process, the first foreground image comprises the foreground object and does not comprise the background, the second foreground image is an image obtained by subtracting the background from the training image, and the second loss function is used for representing a difference between a detection/segmentation result obtained based on the encoder and the decoder in the second neural network and the label in the training process.
10. The image processing apparatus according to claim 9, wherein the first neural network and the second neural network further include a quantization module for updating a feature map output from an encoder and inputting the updated feature map to the decoder and the generator, respectively.
11. The image processing apparatus of claim 10, wherein the first loss function includes a third loss function for representing a difference between the first foreground image and the second foreground image and a fourth loss function for representing a difference between the feature map before and after the quantization module update during training.
12. The image processing device of claim 11, wherein the first neural network and the second neural network further comprise an assignment module for updating an index of the feature map, the index being used by the quantization module to update the feature map.
13. The image processing device of claim 12, wherein the first loss function further comprises a fifth loss function, the fifth loss function being used to represent a difference between the index before and after updating of the assignment module during training.
14. An image processing apparatus, characterized in that the image processing apparatus comprises:
an acquisition unit configured to acquire a first image;
an extraction unit configured to extract a first feature map of the first image based on an encoder;
a processing unit, configured to obtain a detection/segmentation result of the first image based on the first feature map and a decoder;
the encoder and the decoder are trained by a training image with a label, a first loss function and a second loss function, the training image comprises a foreground object and a background, the first loss function is used for representing the difference between a first foreground image and a second foreground image generated by the encoder and the generator in the training process, the first foreground image comprises the foreground object and does not comprise the background, the second foreground image is an image obtained by subtracting the background from the training image, and the second loss function is used for representing the difference between a detection/segmentation result obtained by the encoder and the decoder in the training process and the label.
15. The image processing device according to claim 14, wherein the processing unit is specifically configured to input the first feature map to the decoder to obtain the detection/segmentation result.
16. The image processing device according to claim 14, wherein the processing unit is specifically configured to update the first feature map based on a quantization module to obtain a second feature map, the quantization module being trained based on a fourth loss function, the fourth loss function being used to represent a difference between a feature map of the training image output by the encoder during training, before and after the quantization module is updated;
the processing unit is specifically configured to input the second feature map to the decoder to obtain the detection/segmentation result.
17. An image processing apparatus, characterized by comprising: a processor coupled to a memory for storing a program or instructions that, when executed by the processor, cause the image processing apparatus to perform the method of any of claims 1 to 8.
18. A computer storage medium comprising computer instructions which, when run on a terminal device, cause the terminal device to perform the method of any of claims 1 to 8.
19. A computer program product, characterized in that the computer program product, when run on a computer, causes the computer to perform the method according to any of claims 1 to 8.