WO2023207531A1 - Image processing method and related device - Google Patents

Image processing method and related device

Info

Publication number
WO2023207531A1
Authority
WO
WIPO (PCT)
Prior art keywords
loss function
image
training
neural network
encoder
Prior art date
Application number
PCT/CN2023/086194
Other languages
English (en)
French (fr)
Inventor
李傲雪
李震国
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Publication of WO2023207531A1

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/048 Activation functions
    • G06N3/08 Learning methods
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/194 Segmentation; Edge detection involving foreground-background segmentation
    • G06T9/00 Image coding
    • G06T9/002 Image coding using neural networks
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/762 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects

Definitions

  • Embodiments of the present application relate to the field of artificial intelligence, and in particular, to an image processing method and related equipment.
  • Artificial intelligence (AI) uses digital computers, or machines controlled by digital computers, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • Deep learning technology has enabled computers to achieve excellent performance in image object detection tasks, but this performance relies on big data, especially large-scale labeled data.
  • In practice, the cost of manually obtaining labels is very high, and for certain tasks large-scale data cannot be collected at all.
  • For example, medical data requires professional doctors to label it and involves patient privacy, so it is difficult to collect a large amount of labeled data.
  • When only a small number of labeled samples are available, the performance of deep learning models drops significantly.
  • A few-shot (small-sample) object detector learns transferable knowledge from such data. For new categories that have never been seen before (and do not overlap with the categories of the original task), the detector can detect objects in unseen test images using only a small number of labeled training examples per category.
  • Embodiments of the present application provide an image processing method and related device to improve a detection/segmentation model's ability to localize objects.
  • the first aspect of the embodiment of the present application provides an image processing method, which can be applied to object detection/segmentation scenarios.
  • the method may be executed by the training device/image processing device, or may be executed by components of the training device/image processing device (such as a processor, a chip, or a chip system, etc.).
  • The method includes: obtaining a labeled training image, where the training image includes a foreground object and a background; and training a first neural network based on the training image, a first loss function and a second loss function to obtain a second neural network, where the second neural network is used to implement image detection/segmentation tasks.
  • The first neural network includes an encoder, a decoder and a generator.
  • The second neural network includes the encoder and the decoder.
  • The first loss function is used to represent the difference, during training, between a first foreground image generated based on the encoder and the generator in the first neural network and a second foreground image.
  • The first foreground image includes the foreground object and does not include the background.
  • The second foreground image is the training image with the background subtracted.
  • The second loss function is used to represent the difference between the labels and the detection/segmentation results obtained based on the encoder and the decoder in the second neural network during training.
  • In this way, the encoder is trained with both the first loss function and the second loss function used for detection/segmentation. Because the first loss function is used to reconstruct the foreground image, the encoder can capture more texture and structural features of the image, which improves the localization capability of few-shot object detection and improves the detection/segmentation performance of the second neural network containing the encoder (see the sketch after this item).
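  • As a minimal sketch of this training setup (a hedged illustration only: the module names `encoder`, `decoder`, `generator`, the choice of cross-entropy and L1 losses, and the equal weighting are assumptions, not details stated in this publication), the encoder receives gradients from both the detection/segmentation branch and the foreground-reconstruction branch:

```python
import torch
import torch.nn.functional as F

def training_step(encoder, decoder, generator, image, mask, label, optimizer):
    """One joint training step. `mask` is the binary foreground mask derived from
    the annotation (1 for foreground-object pixels, 0 for background pixels)."""
    feats = encoder(image)                       # feature map(s) of the training image

    # Second loss: detection/segmentation branch (encoder + decoder vs. label).
    prediction = decoder(feats)
    task_loss = F.cross_entropy(prediction, label)    # placeholder task loss

    # First loss: foreground-reconstruction branch (encoder + generator).
    first_foreground = generator(feats)          # generated first foreground image
    second_foreground = image * mask             # training image with background removed
    rec_loss = F.l1_loss(first_foreground, second_foreground)

    loss = task_loss + rec_loss                  # weighting factor omitted for brevity
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```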
  • In a possible implementation, the above-mentioned first neural network and second neural network also include a quantization module; the quantization module is used to update the feature map output by the encoder, and the updated feature map is input to the decoder and the generator respectively.
  • In this way, the quantization module converts the continuous feature space into a discrete feature space represented by a set of prototype vectors. A discrete feature space is easier to model than a high-dimensional continuous space.
  • In a possible implementation, the above-mentioned first loss function includes a third loss function and a fourth loss function.
  • The third loss function is used to represent the difference between the first foreground image and the second foreground image.
  • The fourth loss function is used to represent the difference between the feature map before and after the update by the quantization module during training.
  • In this way, a loss can be introduced to train the quantization module so that it converts the continuous feature space into a discrete feature space represented by the set of prototype vectors. A discrete feature space is easier to model than a high-dimensional continuous space.
  • In a possible implementation, the above-mentioned first neural network and second neural network also include an assignment module; the assignment module is used to update the index of the feature map, and the index is used by the quantization module to update the feature map.
  • In this way, the assignment module can align the clustering centers of different pixels.
  • For the clustering center of each pixel, not only the current pixel but also the clustering centers of other similar pixels are considered, which improves the subsequent inference effect.
  • In a possible implementation, the above-mentioned first loss function also includes a fifth loss function, and the fifth loss function is used to represent the difference between the index before and after the update by the assignment module during training.
  • With the fifth loss function, the difference between the feature map before and after the update by the quantization module becomes smaller and smaller during training, and the difference between the index before and after the update by the assignment module also becomes smaller and smaller; that is, the recalculated index value should be as consistent as possible with the original index value obtained through the nearest-neighbor clustering method.
  • the second aspect of the embodiment of the present application provides an image processing method, which can be applied to object detection/segmentation scenarios.
  • the method may be executed by the image processing device, or may be executed by components of the image processing device (such as a processor, a chip, or a chip system, etc.).
  • The method includes: obtaining a first image; extracting a first feature map of the first image based on an encoder; and obtaining a detection/segmentation result of the first image based on the first feature map and a decoder. The encoder and the decoder are obtained by training with a labeled training image, a first loss function and a second loss function.
  • The training image includes a foreground object and a background.
  • The first loss function is used to represent the difference, during training, between a first foreground image generated based on the encoder and a generator and a second foreground image; the first foreground image includes the foreground object and does not include the background, and the second foreground image is the training image with the background subtracted. The second loss function is used to represent the difference between the detection/segmentation results obtained based on the encoder and the decoder during training and the labels.
  • In this embodiment, the encoder is trained through the first loss function and the second loss function, so that the encoder can learn more texture and structural features of the image, thereby improving the localization ability of few-shot object detection and improving the detection/segmentation effect of the second neural network containing the encoder.
  • In a possible implementation, the above step of obtaining the detection/segmentation result of the first image based on the first feature map and the decoder includes: inputting the first feature map into the decoder to obtain the detection/segmentation result.
  • In this case, the first feature map is directly input into the decoder. Since the decoder is trained through the first loss function and the second loss function, the obtained detection/segmentation result can reflect more texture and structural features.
  • In another possible implementation, the above step of obtaining the detection/segmentation result of the first image based on the first feature map and the decoder includes: updating the first feature map based on a quantization module to obtain a second feature map, where the quantization module is obtained by training based on a fourth loss function, and the fourth loss function is used to represent the difference between the feature map of the training image output by the encoder before and after the update by the quantization module during training; and inputting the second feature map into the decoder to obtain the detection/segmentation result.
  • In this way, the quantization module converts the continuous feature space into a discrete feature space represented by a set of prototype vectors. A discrete feature space is easier to model than a high-dimensional continuous space.
  • the third aspect of the embodiment of the present application provides an image processing device (which can also be a training device), which can be applied to object detection/segmentation scenarios.
  • The image processing device/training device includes: an acquisition unit, used to acquire a labeled training image, where the training image includes a foreground object and a background; and a training unit, used to train a first neural network based on the training image, a first loss function and a second loss function to obtain a second neural network, where the second neural network is used to implement image detection/segmentation tasks.
  • The first neural network includes an encoder, a decoder and a generator.
  • The second neural network includes the encoder and the decoder.
  • The first loss function is used to represent the difference, during training, between a first foreground image generated based on the encoder and the generator in the first neural network and a second foreground image.
  • The first foreground image includes the foreground object and does not include the background.
  • The second foreground image is the training image with the background subtracted, and the second loss function is used to represent the difference between the labels and the detection/segmentation results obtained based on the encoder and the decoder in the second neural network during training.
  • In a possible implementation, the above-mentioned first neural network and second neural network also include a quantization module; the quantization module is used to update the feature map output by the encoder, and the updated feature map is input to the decoder and the generator respectively.
  • In a possible implementation, the above-mentioned first loss function includes a third loss function and a fourth loss function; the third loss function is used to represent the difference between the first foreground image and the second foreground image, and the fourth loss function is used to represent the difference between the feature map before and after the update by the quantization module during training.
  • In a possible implementation, the above-mentioned first neural network and second neural network also include an assignment module; the assignment module is used to update the index of the feature map, and the index is used by the quantization module to update the feature map.
  • In a possible implementation, the above-mentioned first loss function also includes a fifth loss function, and the fifth loss function is used to represent the difference between the index before and after the update by the assignment module during training.
  • the fourth aspect of the embodiment of the present application provides an image processing device, which can be applied to object detection/segmentation scenarios.
  • The image processing device includes: an acquisition unit, used to acquire a first image; an extraction unit, used to extract a first feature map of the first image based on an encoder; and a processing unit, used to obtain a detection/segmentation result of the first image based on the first feature map and a decoder.
  • The encoder and the decoder are trained with labeled training images, a first loss function and a second loss function.
  • The training images include foreground objects and backgrounds, and the first loss function is used to represent the difference, during training, between a first foreground image generated based on the encoder and a generator and a second foreground image.
  • The first foreground image includes the foreground objects and does not include the background.
  • The second foreground image is the training image with the background subtracted.
  • The second loss function is used to represent the difference between the detection/segmentation results obtained based on the encoder and the decoder during training and the labels.
  • the above-mentioned processing unit is specifically configured to input the first feature map into the decoder to obtain the detection/segmentation result.
  • In a possible implementation, the above-mentioned processing unit is specifically configured to update the first feature map based on a quantization module to obtain a second feature map, where the quantization module is obtained by training based on a fourth loss function, and the fourth loss function is used to represent the difference between the feature map of the training image output by the encoder before and after the update by the quantization module during training; the processing unit is specifically configured to input the second feature map into the decoder to obtain the detection/segmentation result.
  • A fifth aspect of the present application provides an image processing device, including: a processor, where the processor is coupled to a memory, and the memory is used to store programs or instructions. When the programs or instructions are executed by the processor, the image processing device implements the method in the above-mentioned first aspect or any possible implementation of the first aspect, or implements the method in the above-mentioned second aspect or any possible implementation of the second aspect.
  • A sixth aspect of the present application provides a computer-readable medium on which a computer program or instructions are stored. When the computer program or instructions are run on a computer, the computer is caused to execute the method in the foregoing first aspect or any possible implementation of the first aspect, or to execute the method in the foregoing second aspect or any possible implementation of the second aspect.
  • A seventh aspect of the present application provides a computer program product. When the computer program product is executed on a computer, the computer is caused to execute the method in the foregoing first aspect or any possible implementation of the first aspect, or to execute the method in the foregoing second aspect or any possible implementation of the second aspect.
  • In the embodiments of the present application, the encoder is trained using both the first loss function and the second loss function for detection/segmentation. Since the first loss function is used to reconstruct the foreground image, the encoder can capture more texture and structural features of the image, thereby improving the localization ability of few-shot object detection and improving the detection/segmentation effect of the second neural network containing the encoder.
  • Figure 1 is a schematic structural diagram of the system architecture provided by the embodiment of the present application.
  • Figure 2 is a schematic diagram of a chip hardware structure provided by an embodiment of the present application.
  • Figure 3A is a schematic structural diagram of an image processing system provided by an embodiment of the present application.
  • Figure 3B is another structural schematic diagram of the image processing system provided by an embodiment of the present application.
  • Figure 4 is a schematic flow chart of the image processing method provided by the embodiment of the present application.
  • Figure 5 is a schematic diagram of a training process of the second neural network provided by the embodiment of the present application.
  • Figure 6 is another schematic diagram of the training process of the second neural network provided by the embodiment of the present application.
  • Figure 7 is a schematic diagram of a training process after adding a quantization module provided by the embodiment of the present application.
  • Figure 8 is a schematic diagram of another training process after adding a quantization module provided by the embodiment of the present application.
  • Figure 9 is another schematic flow chart of the image processing method provided by the embodiment of the present application.
  • Figures 10 to 12 are several structural schematic diagrams of the image processing device provided by embodiments of the present application.
  • Embodiments of the present application provide an image processing method and related device to improve a detection/segmentation model's ability to localize objects.
  • A few-shot (small-sample) object detector learns transferable knowledge from such data. For new categories that have never been seen before (and do not overlap with the categories of the original task), the detector can detect objects in unseen test images using only a small number of labeled training examples per category.
  • To this end, embodiments of the present application provide an image processing method and related device.
  • In the embodiments of the present application, the encoder is trained through a first loss function and a second loss function for detection/segmentation. Since the first loss function is used to reconstruct the foreground image, the encoder can capture more texture and structural features of the image, thereby improving the localization capability of few-shot object detection and improving the detection/segmentation effect of the second neural network containing the encoder.
  • the image processing method and related equipment according to the embodiment of the present application will be introduced in detail below with reference to the accompanying drawings.
  • the neural network can be composed of neural units.
  • A neural unit can refer to an arithmetic unit that takes x_s and an intercept of 1 as input, and the output of the arithmetic unit can be: h_{W,b}(x) = f(W^T x + b) = f(\sum_{s=1}^{n} W_s x_s + b), where s = 1, 2, ..., n, n is a natural number greater than 1, W_s is the weight of x_s, and b is the bias of the neural unit.
  • f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal. The output signal of this activation function can be used as the input of the next convolutional layer.
  • The activation function can be a ReLU function (a small sketch of a single unit follows).
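  • For illustration only, a single neural unit with a ReLU activation can be sketched as follows (not part of the claimed method):

```python
import torch

def neural_unit(x, w, b):
    # Output of one unit: f(sum_s W_s * x_s + b), with f chosen as ReLU.
    return torch.relu(torch.dot(w, x) + b)

print(neural_unit(torch.tensor([1.0, -2.0, 0.5]),
                  torch.tensor([0.3, 0.1, -0.4]),
                  torch.tensor(0.2)))
```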
  • a neural network is a network formed by connecting many of the above-mentioned single neural units together, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected to the local receptive field of the previous layer to extract the features of the local receptive field.
  • the local receptive field can be an area composed of several neural units.
  • W is a weight vector, and each value in the vector represents the weight value of a neuron in the neural network of this layer.
  • This vector W determines the spatial transformation from the input space to the output space described above; that is, the weight W of each layer controls how the space is transformed.
  • the purpose of training a neural network is to finally obtain the weight matrix of all layers of the trained neural network (a weight matrix formed by the vector W of many layers). Therefore, the training process of neural network is essentially to learn how to control spatial transformation, and more specifically, to learn the weight matrix.
  • Convolutional neural network is a deep neural network with a convolutional structure.
  • the convolutional neural network contains a feature extractor composed of convolutional layers and subsampling layers.
  • the feature extractor can be viewed as a filter, and the convolution process can be viewed as convolving the same trainable filter with an input image or feature map.
  • the convolutional layer refers to the neuron layer in the convolutional neural network that convolves the input signal.
  • a neuron can be connected to only some of the neighboring layer neurons.
  • a convolutional layer usually contains several feature planes, and each feature plane can be composed of some rectangularly arranged neural units.
  • Neural units in the same feature plane share weights, and the shared weights here are convolution kernels.
  • Shared weights can be understood as a way to extract image information independent of position. The underlying principle is that the statistical information of one part of the image is the same as that of other parts. This means that the image information learned in one part can also be used in another part. Therefore, the same learned image information can be used for all positions on the image.
  • multiple convolution kernels can be used to extract different image information. Generally, the greater the number of convolution kernels, the richer the image information reflected by the convolution operation.
  • The convolution kernel can be initialized in the form of a matrix of random size. During the training of the convolutional neural network, the convolution kernel can obtain reasonable weights through learning. In addition, a direct benefit of weight sharing is reducing the number of connections between the layers of the convolutional neural network while also reducing the risk of overfitting (a weight-sharing sketch follows).
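  • The weight-sharing property can be illustrated with a small sketch (an illustration only, not the architecture claimed here): a convolutional layer applies the same kernels at every spatial position, so its parameter count is independent of the image size.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
x = torch.randn(1, 3, 224, 224)        # one RGB image
y = conv(x)                            # the same 16 kernels slide over every position
print(y.shape)                         # torch.Size([1, 16, 224, 224])
print(sum(p.numel() for p in conv.parameters()))   # 3*3*3*16 weights + 16 biases = 448
```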
  • Deep learning is a type of machine learning technology based on deep neural network algorithms. Its main feature is the use of multiple nonlinear transformation structures to process and analyze data. It is mainly used in scenarios such as perception and decision-making in the field of artificial intelligence, such as image and speech recognition, natural language translation, computer games, etc.
  • In training a deep neural network, the difference between the network's prediction and the truly desired target value is measured by a loss function (loss function) or an objective function (objective function); training aims to make this difference as small as possible.
  • As shown in Figure 1, an embodiment of the present application provides a system architecture 100.
  • the data collection device 160 is used to collect training data.
  • the training data includes: data in multiple different modalities. Among them, modality can refer to text, image, video and audio.
  • In this embodiment of the present application, the training data can include labeled training images, etc.
  • The training data is stored in the database 130, and the training device 120 trains to obtain the target model/rules 101 based on the training data maintained in the database 130. How the training device 120 obtains the target model/rules 101 based on the training data will be described in more detail below.
  • the target model/rules 101 can be used to implement computer vision tasks applied by the image processing method provided by the embodiment of the present application.
  • the computer vision tasks may include: classification tasks, segmentation tasks, detection tasks or image generation tasks, etc.
  • the training data maintained in the database 130 may not all be collected by the data collection device 160, and may also be received from other devices.
  • The training device 120 does not necessarily train the target model/rules 101 entirely based on the training data maintained by the database 130; it may also obtain training data from the cloud or elsewhere for model training. The above description should not be construed as a limitation on the embodiments of this application.
  • the target model/rules 101 trained according to the training device 120 can be applied to different systems or devices, such as the execution device 110 shown in Figure 1 .
  • The execution device 110 can be a terminal, such as a mobile phone terminal, a tablet computer, a laptop computer, an augmented reality (AR) device, a virtual reality (VR) device, a vehicle-mounted terminal, etc.
  • the execution device 110 can also be a server or a cloud, etc.
  • the execution device 110 is configured with an I/O interface 112 for data interaction with external devices.
  • the user can input data to the I/O interface 112 through the client device 140.
  • In this embodiment of the present application, the input data may include: a first image.
  • the input data can be input by the user, or uploaded by the user through the shooting device, and of course it can also come from the database, which is not limited here.
  • the preprocessing module 113 is used to perform preprocessing according to the first image received by the I/O interface 112.
  • For example, the preprocessing module 113 can be used to perform processing such as flipping, translation, cropping and color transformation on the first image.
  • When the execution device 110 preprocesses the input data, or when the calculation module 111 of the execution device 110 performs calculations and other related processing, the execution device 110 can call data, code, etc. in the data storage system 150 for the corresponding processing, and the data, instructions, etc. obtained by the corresponding processing can also be stored in the data storage system 150.
  • the I/O interface 112 returns the processing results, such as the above-mentioned detection/segmentation results, to the client device 140, thereby providing them to the user.
  • The training device 120 can generate corresponding target models/rules 101 based on different training data for different goals or different tasks, and the corresponding target models/rules 101 can be used to achieve the above goals or complete the above tasks, thereby providing the user with the desired results.
  • the user can manually set the input data, and the manual setting can be operated through the interface provided by the I/O interface 112 .
  • the client device 140 can automatically send input data to the I/O interface 112. If requiring the client device 140 to automatically send input data requires the user's authorization, the user can set corresponding permissions in the client device 140.
  • the user can view the results output by the execution device 110 on the client device 140, and the specific presentation form may be display, sound, action, etc.
  • the client device 140 can also be used as a data collection end to collect the input data of the input I/O interface 112 and the output results of the output I/O interface 112 as new sample data, and store them in the database 130 .
  • Of course, instead of going through the client device 140, the I/O interface 112 can also directly store the input data input to the I/O interface 112 and the output result of the I/O interface 112 in the database 130 as new sample data, as shown in the figure.
  • Figure 1 is only a schematic diagram of a system architecture provided by an embodiment of the present invention.
  • the positional relationship between the devices, devices, modules, etc. shown in the figure does not constitute any limitation.
  • For example, in Figure 1, the data storage system 150 is an external memory relative to the execution device 110; in other cases, the data storage system 150 can also be placed in the execution device 110.
  • A target model/rule 101 is obtained through training by the training device 120.
  • the target model/rule 101 in the embodiment of the present application may specifically be a target neural network.
  • Figure 2 is a chip hardware structure provided by an embodiment of the present invention.
  • the chip includes a neural network processor 20.
  • the chip can be disposed in the execution device 110 as shown in Figure 1 to complete the calculation work of the calculation module 111.
  • the chip can also be provided in the training device 120 as shown in Figure 1 to complete the training work of the training device 120 and output the target model/rules 101.
  • The neural network processor 20 may be a neural-network processing unit (NPU), a tensor processing unit (TPU), a graphics processing unit (GPU), or another processor suitable for large-scale neural network processing.
  • the neural network processor 20 is mounted on the main central processing unit (central processing unit, CPU) (host CPU) as a co-processor.
  • the core part of the NPU is the arithmetic circuit 203.
  • the controller 204 controls the arithmetic circuit 203 to extract data in the memory (weight memory or input memory) and perform operations.
  • the computing circuit 203 internally includes multiple processing units (process engines, PEs).
  • arithmetic circuit 203 is a two-dimensional systolic array.
  • the arithmetic circuit 203 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition.
  • arithmetic circuit 203 is a general-purpose matrix processor.
  • the operation circuit 203 obtains the corresponding data of matrix B from the weight memory 202 and caches it on each PE in the operation circuit.
  • the operation circuit takes the matrix A data from the input memory 201 and performs matrix operation on the matrix B, and the partial result or final result of the obtained matrix is stored in the accumulator 208 .
  • the vector calculation unit 207 can further process the output of the operation circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, etc.
  • the vector calculation unit 207 can be used for network calculations of non-convolutional/non-FC layers in neural networks, such as pooling, batch normalization, local response normalization, etc. .
  • The vector calculation unit 207 can store the processed output vectors to the unified memory 206.
  • the vector calculation unit 207 may apply a nonlinear function to the output of the operation circuit 203, such as a vector of accumulated values, to generate an activation value.
  • vector calculation unit 207 generates normalized values, merged values, or both.
  • the processed output vector can be used as an activation input to the arithmetic circuit 203, such as for use in a subsequent layer in a neural network.
  • the unified memory 206 is used to store input data and output data.
  • The storage unit access controller 205 (direct memory access controller, DMAC) is used to transfer the input data in the external memory to the input memory 201 and/or the unified memory 206, to store the weight data in the external memory into the weight memory 202, and to store the data in the unified memory 206 into the external memory.
  • a bus interface unit (BIU) 210 is used to implement interaction between the main CPU, the DMAC and the fetch memory 209 through the bus.
  • An instruction fetch buffer 209 connected to the controller 204 is used to store instructions used by the controller 204.
  • the controller 204 is used to call instructions cached in the memory 209 to control the working process of the computing accelerator.
  • the unified memory 206, the input memory 201, the weight memory 202 and the instruction memory 209 are all on-chip memories, and the external memory is a memory external to the NPU.
  • The external memory can be a double data rate synchronous dynamic random access memory (DDR SDRAM), a high bandwidth memory (HBM), or another readable and writable memory.
  • FIG. 3A is a schematic structural diagram of an image processing system provided by an embodiment of the present application.
  • The image processing system includes a terminal device (in Figure 3A, only a mobile phone is shown as an example of the terminal device) and an image processing device. It is understandable that in addition to a mobile phone, the terminal device can also be a tablet computer (pad), a portable game console, a personal digital assistant (PDA), a notebook computer, an ultra-mobile personal computer (UMPC), a handheld computer, a netbook, a vehicle-mounted media playback device, a wearable electronic device, a virtual reality (VR) terminal device, an augmented reality (AR) terminal device, a vehicle, a vehicle-mounted terminal, an aircraft terminal, an intelligent robot, or other terminal equipment.
  • the terminal device is the initiator of image processing. As the initiator of the image processing request, the user usually initiates the request through the terminal device.
  • the above-mentioned image processing device may be a cloud server, a network server, an application server, a management server, or other devices or servers with image processing functions.
  • the image processing device receives image processing requests from the terminal device through an interactive interface, and then performs image processing in the form of machine learning, deep learning, search, reasoning, decision-making, etc. through the memory that stores data and the image processing processor.
  • the memory in the image processing device can be a general term, including local storage and a database that stores historical data.
  • the database can be on the image processing device or on other network servers.
  • The terminal device can receive instructions from the user. For example, the terminal device can obtain multiple data items input/selected by the user (for example: images, text, audio, etc. collected by the terminal device), and then initiate a request to the image processing device, causing the image processing device to execute image processing applications (for example, computer vision tasks such as classification, segmentation, detection and image generation) on the multiple data items obtained by the terminal device, thereby obtaining corresponding processing results for the multiple data items.
  • the terminal device can obtain the image input by the user, and then initiate an image detection request to the image processing device, so that the image processing device detects the image, thereby obtaining the detection result of the image, and displays the detection result of the image for the user. Watch and use.
  • the image processing device can execute the image processing method according to the embodiment of the present application.
  • Figure 3B is another schematic structural diagram of an image processing system provided by an embodiment of the present application.
  • In Figure 3B, a terminal device (only a mobile phone is taken as an example of the terminal device in Figure 3B) directly serves as the image processing device; the terminal device can directly obtain the image and process it directly with the hardware of the terminal device itself. The specific process is similar to that of Figure 3A; please refer to the above description, which will not be repeated here.
  • the terminal device can receive instructions from the user.
  • For example, the terminal device can obtain multiple images selected by the user on the terminal device, and the terminal device itself then executes image processing applications (for example, computer vision tasks such as classification, segmentation, detection and image generation) on the images to obtain the corresponding processing results, and displays the processing results for the user to view and use.
  • For another example, the terminal device can collect images in real time or periodically, and the terminal device itself then executes image processing applications (for example, computer vision tasks such as classification, segmentation, detection and image generation) on the images, thereby obtaining the corresponding processing results for the images and implementing functions (a classification function, a segmentation function, a detection function, an image generation function, etc.) based on the processing results.
  • the terminal device itself can execute the image processing method according to the embodiment of the present application.
  • the terminal device in FIG. 3A and FIG. 3B may specifically be the client device 140 or the execution device 110 in FIG. 1
  • the image processing device in FIG. 3A may specifically be the execution device 110 in FIG. 1
  • The data storage system 150 may store the data to be processed by the execution device 110; the data storage system 150 can be integrated on the execution device 110, or can be set up on the cloud or another network server.
  • The processors in Figure 3A and Figure 3B can perform data training/machine learning/deep learning through neural network models or other models (such as attention models, MLPs, etc.), and use the finally trained or learned model to execute image processing applications on the multiple data items, thereby obtaining the corresponding processing results.
  • the image processing method provided by the embodiment of the present application can be applied to a variety of scenarios, which are described below.
  • the first is the field of autonomous driving.
  • Detection models based on deep learning are good at detecting common categories (such as cars and pedestrians), but have difficulty accurately detecting rare examples, such as garbage bags on the roadside, fallen tires, or traffic cones placed on the road. However, false detections and missed detections of these obstacles can lead to serious consequences.
  • With the method provided by the embodiments of the present application, the detection model can be improved on categories that contain only a small number of labeled samples, improving the accuracy and recall of the detection model.
  • the second type is railway and power grid fault detection.
  • the annual manpower investment in truck inspection in the railway industry is about 1 billion yuan.
  • there are large-scale fault detection scenarios such as passenger cars, trains, and lines.
  • the scale of power grid transmission, transformation, and distribution inspections is estimated to be 24 billion yuan in the next five years. Since faults are less likely to occur and require manual labeling, it is difficult to collect labeled samples; moreover, changes in the external environment lead to large changes in imaging, and there are obvious differences within fault categories.
  • The image processing method provided by the embodiments of the present application (also called a few-shot object detection algorithm) can be used to train a detection model for such scenarios, and this model can be deployed on the cloud to provide efficient services to external customers.
  • the training method of the neural network will be introduced in detail with reference to Figure 4.
  • the method shown in Figure 4 can be executed by a neural network training device.
  • the neural network training device can be a cloud service device or a terminal device.
  • For example, the neural network training device can be a computer, a server, or another device with sufficient computing power to perform the training of the neural network; the method may also be executed by a system composed of cloud service equipment and terminal equipment.
  • the training method can be executed by the training device 120 in Figure 1 and the neural network processor 20 in Figure 2 .
  • the training method can be processed by the CPU, or it can be processed by the CPU and GPU together, or it can not use the GPU but use other processors suitable for neural network calculations, which is not limited by this application.
  • the training method includes step 401 and step 402. Step 401 and step 402 will be described in detail below.
  • Step 401 Obtain training images with labels.
  • There are multiple ways for the training device to obtain training images, for example, through collection/photography, through receiving transmissions from other devices, or through selection from a database; this is not specifically limited here.
  • The training image in the embodiment of the present application includes a foreground object and a background.
  • The foreground object is the part that the user specifies the device needs to recognize.
  • the labels of the training images can be obtained manually or by inputting the model, and the details are not limited here.
  • For a detection task, the label can be the category of each object in the image and/or a rectangular box surrounding the edge of the object.
  • For a segmentation task, the label can be the classification label of each pixel, or it can be understood as the category corresponding to each pixel in the image.
  • the training device can be a vehicle, and the training images can be data collected by the vehicle in real time or data collected periodically, and the details are not limited here.
  • Step 402 Train the first neural network based on the training image, the first loss function, and the second loss function to obtain a second neural network.
  • the second neural network is used to implement image detection/segmentation tasks.
  • After the training device obtains the training image, it can train the first neural network based on the training image, the first loss function, and the second loss function to obtain the second neural network.
  • The first neural network includes an encoder, a decoder and a generator, and the second neural network includes the encoder and the decoder of the first neural network. It can also be understood that the first neural network includes the second neural network and the generator.
  • the first loss function is used to represent the difference between the first foreground image and the second foreground image generated based on the encoder and generator in the first neural network during the training process.
  • the first foreground image includes foreground objects and does not include the background.
  • the second foreground image is the image after subtracting the background from the training image.
  • The second loss function is used to represent the difference between the labels and the detection/segmentation results obtained based on the encoder and the decoder in the second neural network during training.
  • The first loss function in this embodiment of the present application can be understood as a generation loss function, and the second loss function can be understood as a detection/segmentation loss function. Training the encoder through these two loss functions enables the encoder to learn more texture and structural features of the image, thereby improving the localization ability of few-shot object detection and improving the detection/segmentation effect of the second neural network containing the encoder.
  • For example, the first loss function can be as shown in Formula 1 or Formula 2. In Formula 1, L_rec represents the first loss function, D represents the decoder, Q represents the feature map (which can be the feature map output by the encoder or the feature map updated by the subsequent quantization module), x represents the training image, and m represents a binary mask of the same size as the training image in which, according to the annotation information, the pixels of the foreground object are set to 1 and the pixels of the background are set to 0.
  • In Formula 2, Q_0, Q_1 and Q_2 can represent the feature maps of the training image obtained by the encoder at three scales, or the three-scale feature maps updated by the subsequent quantization module; for the description of the remaining parameters, refer to Formula 1, which is not repeated here.
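  • Since the exact formulas are not reproduced in this text, the following is only a plausible sketch of a multi-scale foreground-reconstruction loss assembled from the symbols described above (D, Q_0/Q_1/Q_2, x and m); the L1 distance and the averaging over scales are assumptions rather than the patent's stated formula.

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(D, feature_maps, x, m):
    """Hypothetical reading of Formula 2: compare the foreground generated from each
    scale's feature map Q_i with the second foreground image x * m."""
    target = x * m                      # second foreground image: background zeroed out
    loss = 0.0
    for q_i in feature_maps:            # e.g. [Q0, Q1, Q2] at three scales
        loss = loss + F.l1_loss(D(q_i), target)   # first foreground image vs. target
    return loss / len(feature_maps)
```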
  • the second loss function may be an absolute value loss function, a logarithmic loss function, an exponential loss function, a cross-entropy loss function, etc., and may be set according to actual needs, and is not limited here.
  • This step can also be understood as using the generative model as a constraint to optimize the localization features of the detection/segmentation model.
  • The specific process by which the training device trains the first neural network based on the training image, the first loss function and the second loss function to obtain the second neural network can be as shown in Figure 5.
  • Input the training image into the encoder to obtain the feature map of the training image.
  • On the one hand, the feature map is input into the generator to generate the first foreground image.
  • On the other hand, the feature map is input into the decoder to obtain the detection/segmentation result.
  • the background in the training image is removed to obtain the second foreground image.
  • the first loss function is then used to train the encoder and the generator, so that the difference between the first foreground image and the second foreground image based on the output of the encoder and the generator becomes smaller and smaller.
  • Then the second loss function is used to train the encoder and the decoder, so that the difference between the labels and the detection/segmentation results based on the output of the encoder and the decoder becomes smaller and smaller.
  • In a possible implementation, the first neural network and the second neural network may also include a quantization module, which is used to update the feature map (F_i) output by the encoder, where i is used to indicate the layer number and is an integer greater than or equal to 1, and the updated feature maps are input to the generator and the decoder respectively.
  • The quantization module can update the feature map using a set of prototype vectors V_i = {v_{i,1}, ..., v_{i,n}}, where n is an integer greater than 1. For example, each pixel f_{i,j} in F_i is replaced by the prototype vector v_{i,k} in V_i that is nearest to it, where j is used to indicate the position within the i-th layer and is an integer greater than or equal to 1, and k is between 1 and n.
  • This replacement process can be regarded as a clustering process, in which the prototype vectors are the cluster centers and each input pixel is assigned to the cluster center closest to it.
  • In the embodiment of the present application, an effective clustering process is learned by introducing the fourth loss function mentioned later (a sketch of the replacement step follows).
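  • A minimal sketch of this nearest-prototype replacement (assuming the prototype vectors form a learnable n x C matrix; the tensor shapes are illustrative assumptions):

```python
import torch

def quantize(feature_map, prototypes):
    """feature_map: (C, H, W); prototypes: (n, C).
    Replace every pixel feature by its nearest prototype (cluster center)."""
    C, H, W = feature_map.shape
    pixels = feature_map.permute(1, 2, 0).reshape(-1, C)     # (H*W, C)
    index = torch.cdist(pixels, prototypes).argmin(dim=1)    # cluster index per pixel
    quantized = prototypes[index].reshape(H, W, C).permute(2, 0, 1)
    return quantized, index
```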
  • In this case, the first loss function may include a third loss function and a fourth loss function.
  • The third loss function is used to represent the difference between the aforementioned first foreground image and the second foreground image.
  • The fourth loss function is used to represent the difference between the feature map output by the encoder before and after the update by the quantization module. That is, the generation loss function includes the third loss function and the fourth loss function.
  • the number of quantization modules can correspond to the number of feature maps output by the encoder.
  • the number of feature maps output by the encoder can be understood to mean that the encoder can obtain multi-scale feature maps.
  • the third loss function can be as shown in the aforementioned formula 1 or formula 2
  • the fourth loss function can be as shown in formula 3.
  • In Formula 3, L_qt represents the fourth loss function, W represents the width of the feature map, and H represents the height of the feature map.
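  • As Formula 3 itself is not reproduced here, one plausible form of such a quantization loss, sketched under the assumption of a squared per-pixel difference averaged over the W*H positions, is:

```python
def quantization_loss(feature_map, quantized):
    # L_qt ~ mean over the W*H positions of the squared difference between the
    # pixel feature before quantization and after quantization (an assumed form).
    return ((feature_map - quantized) ** 2).mean()
```

In practice, vector-quantization training often also uses a stop-gradient or straight-through trick so that gradients flow back to the encoder, but that detail is not stated in this text.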
  • the training process can be as shown in Figure 6.
  • The training image is input into the encoder to obtain the feature map of the training image, and the quantization module updates the feature map.
  • On the one hand, the updated feature map is input into the generator to generate the first foreground image.
  • On the other hand, the updated feature map is input into the decoder to obtain the detection/segmentation result.
  • the background in the training image is removed to obtain the second foreground image.
  • Then the third loss function and the fourth loss function are used to train the encoder, the quantization module and the generator, so that the difference between the second foreground image and the first foreground image generated based on the output of the encoder, the quantization module and the generator becomes smaller and smaller.
  • In a possible implementation, the first neural network and the second neural network can also include an assignment module.
  • The assignment module is used to update the index of the feature map, and this index is used by the quantization module to update the feature map. In other words, the assignment module can achieve the alignment of the cluster centers of different pixels.
  • In this case, the training process can be as shown in Figure 8. The training image is input into the encoder to obtain the feature map of the training image.
  • The quantization module updates the pixels of the feature map to obtain the quantized vectors, and the indexes of the quantized vectors are then input into the assignment module to be updated.
  • With the fifth loss function, the difference between the feature map before and after the update by the quantization module becomes smaller and smaller during training; using the fifth loss function to train the index, the difference between the index before and after the update by the assignment module also becomes smaller and smaller, that is, the recalculated index value should be as consistent as possible with the original index value obtained through the nearest-neighbor clustering method.
  • In other words, the fifth loss function is proposed to improve the quantization accuracy and thereby improve the generation model.
  • Here, A represents the assignment module, l represents the prototype index value of pixel f_i calculated by the nearest-neighbor method, sim represents the similarity calculation function used to calculate similarity, and O represents the one-hot embedding function, which turns an index into a binary vector.
  • The fifth loss function in the embodiment of the present application can be as shown in Formula 5, where L_align represents the fifth loss function, W represents the width of the feature map, and H represents the height of the feature map.
  • The above fifth loss function is only an example; in practical applications, the fifth loss function can also take other forms, and its specific formula is not limited here (one possible form is sketched below).
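  • One possible reading, sketched here only as an assumption consistent with the symbols above (A, l, sim, O), treats L_align as a cross-entropy between the assignment module's similarity-based assignment for each pixel and the one-hot embedding O(l) of the nearest-neighbor index l, averaged over the W*H positions:

```python
import torch
import torch.nn.functional as F

def alignment_loss(pixel_feats, prototypes, assignment_module):
    """pixel_feats: (H*W, C); prototypes: (n, C). Hypothetical form of L_align."""
    # Original index l: nearest prototype for every pixel (nearest-neighbor clustering).
    l = torch.cdist(pixel_feats, prototypes).argmin(dim=1)        # (H*W,)
    # Recomputed assignment logits from the assignment module, e.g. sim(f_i, V).
    logits = assignment_module(pixel_feats)                       # (H*W, n)
    return F.cross_entropy(logits, l)                             # averaged over W*H
```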
  • The above-mentioned third loss function, fourth loss function and fifth loss function can be understood as generation loss functions, which are used in the process of updating the generator so that the encoder can learn more structural and texture features, thereby improving the accuracy with which the second neural network containing the encoder subsequently performs the detection/segmentation task.
  • the training method of the neural network is described in detail above, and the image processing method provided by the embodiment of the present application is introduced in detail below.
  • the method may be executed by the image processing device, or may be executed by components of the image processing device (such as a processor, a chip, or a chip system, etc.).
  • the image processing device can be a cloud device (as shown in the aforementioned Figure 3A) or a terminal device (such as the mobile phone shown in Figure 3B).
  • this method can also be executed by a system composed of cloud equipment and terminal equipment (as shown in the aforementioned Figure 3A).
  • this method can be processed by the CPU in the image processing device, or it can be processed by both the CPU and the GPU, or it can not use the GPU but use other processors suitable for neural network calculations, which is not limited by this application.
  • The above-mentioned terminal device can be a mobile phone, a tablet computer (pad), a portable game console, a personal digital assistant (PDA), a notebook computer, an ultra-mobile personal computer (UMPC), a handheld computer, a netbook, a vehicle-mounted computer, a virtual reality (VR) terminal device, an augmented reality (AR) terminal device, etc.
  • The applicable application scenarios of the method provided by the embodiments of this application may be few-shot object detection/segmentation scenarios such as the field of autonomous driving and railway/power grid fault detection.
  • FIG. 9 is a schematic flowchart of an image processing method provided by an embodiment of the present application. The method may include steps 901 to 903 . Steps 901 to 903 will be described in detail below.
  • Step 901 Obtain the first image.
  • There are multiple ways for the image processing device to obtain the first image, for example, through collection/photography, by receiving a transmission from another device, or by selection from a database; this is not specifically limited here.
  • the image processing device may be a vehicle, and the first image may be data collected by the vehicle in real time or data collected periodically, which is not limited here.
  • Step 902 Extract the first feature map of the first image based on the encoder.
  • the encoder and decoder in the embodiment of the present application may be trained by the training method provided by the embodiment shown in FIG. 4 .
  • the encoder and decoder are trained by a labeled training image, a first loss function and a second loss function.
  • the training image includes foreground objects and background.
  • The first loss function is used to represent the difference between the first foreground image and the second foreground image generated based on the encoder and generator during the training process. The first foreground image includes the foreground objects and does not include the background, and the second foreground image is the image obtained by removing the background from the training image.
  • The second loss function is used to represent the difference between the detection/segmentation results obtained based on the encoder and decoder during the training process and the labels.
  • For descriptions of the first loss function, the second loss function, and so on, reference may be made to the description in the embodiment shown in FIG. 4, and details are not repeated here.
  • Step 903 Obtain the detection/segmentation result of the first image based on the first feature map and the decoder.
  • After the image processing device obtains the first feature map, it can obtain the detection/segmentation result of the first image based on the first feature map and the decoder.
  • There are multiple ways to obtain the detection/segmentation result of the first image based on the first feature map and the decoder, which are described separately below.
  • The first way is to input the first feature map into the decoder to obtain the detection/segmentation result.
  • The second way is to input the first feature map into the quantization module to obtain the second feature map, which can also be understood as updating the first feature map based on the quantization module to obtain the second feature map. The quantization module is trained based on the fourth loss function, which is used to represent the difference between the feature map of the training image output by the encoder before and after the update by the quantization module during the training process. The second feature map is then input into the decoder to obtain the detection/segmentation result.
  • Further, in the second way, the assignment module can also be used to update the index of the second feature map, so that the quantization module uses the updated index to update the first feature map to obtain the second feature map.
  • The quantization module is trained based on the fifth loss function, which is used to represent the difference between the index before and after the update by the assignment module during the training process.
  • For descriptions of each loss function (such as the first loss function, the second loss function, the third loss function, the fourth loss function and the fifth loss function), reference may be made to the description in the embodiment shown in Figure 4; no further details are given here.
  • the overall process in this embodiment can be viewed as inputting the first image into the second neural network trained in the embodiment shown in Figure 4 to perform the detection/segmentation task and obtain the detection/segmentation result.
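Purely as an illustration of the flow of steps 901 to 903 described above, the sketch below wires the trained modules together in PyTorch style; the class name, the quantizer/assigner interface (nearest_index, lookup) and the handling of the optional branch are placeholders assumed for this example rather than details taken from the disclosure.

```python
import torch.nn as nn

class SecondNeuralNetwork(nn.Module):
    # The real encoder/decoder/quantizer architectures are not specified here;
    # they are passed in as arbitrary modules.
    def __init__(self, encoder, decoder, quantizer=None, assigner=None):
        super().__init__()
        self.encoder, self.decoder = encoder, decoder
        self.quantizer, self.assigner = quantizer, assigner

    def forward(self, first_image):
        # Step 902: extract the first feature map with the encoder.
        feat = self.encoder(first_image)
        # Optional second way: update the feature map with the quantization
        # module, using indices refined by the assignment module when present.
        if self.quantizer is not None:
            idx = self.quantizer.nearest_index(feat)      # assumed interface
            if self.assigner is not None:
                idx = self.assigner(feat, idx)            # assumed interface
            feat = self.quantizer.lookup(idx)             # assumed interface
        # Step 903: the decoder produces the detection/segmentation result.
        return self.decoder(feat)
```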
  • In this embodiment, training the encoder through the first loss function and the second loss function enables the encoder to learn more texture and structural features of the image, thereby improving the localization ability of small sample object detection and improving the detection/segmentation effect of the second neural network containing the encoder.
  • The MS-COCO data set includes a total of 80k training samples, 40k validation samples and 20k test samples, covering 80 categories. Among them, 20 categories are set as new task categories, and the remaining 60 categories are set as original task categories. Among the 5k images of the 20k test samples, the images belonging to the 20 new task categories are used for model performance evaluation, and the 80k training samples are used for model training.
  • This detector has two situations, which are described below:
  • the detector is an Nt class detector.
  • the training process is as follows:
  • Original task pre-training: First, fully train the network shown in Figure 5/Figure 6 with the training data of the Ns categories to obtain a detector for the Ns categories.
  • New task fine-tuning: Then modify the last layer of the network so that its output has Nt neurons. Except for the last layer of the network, which is randomly initialized, the other layers are initialized with the parameters of the Ns-category detector. The network parameters are then fine-tuned with the small amount of data of the new task.
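A minimal sketch of this fine-tuning recipe, assuming a PyTorch detector object whose final classification layer is exposed under an attribute named cls_head; the attribute name, feature dimension, learning rates and parameter grouping are assumptions made for the example, not values taken from this document.

```python
import torch.nn as nn

def build_novel_detector(pretrained, feat_dim=1024, num_novel=20):
    # Keep all layers of the Ns-category detector and attach a new, randomly
    # initialized head with Nt outputs in place of the original last layer
    # (here simply assigned to an assumed attribute name cls_head).
    detector = pretrained
    detector.cls_head = nn.Linear(feat_dim, num_novel)  # e.g. Nt = 20 novel classes
    # Fine-tune with the small amount of new-task data: a larger learning rate
    # for the new head and a small one for the copied layers (an assumption).
    params = [
        {"params": detector.cls_head.parameters(), "lr": 1e-2},
        {"params": [p for n, p in detector.named_parameters()
                    if not n.startswith("cls_head")], "lr": 1e-4},
    ]
    return detector, params
```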
  • The measurement indicators include the average precision (AP) under different intersection-over-union (IoU) thresholds.
  • The above-mentioned different thresholds are the 10 values taken at intervals of 0.05 between 0.5 and 0.95; each value corresponds to a precision, and the average of the 10 values is the average precision.
  • AP50 refers to the AP value when a prediction box is counted as a detection only if its IoU with the target box is greater than 0.5, and AP75 refers to the AP value when that IoU must be greater than 0.75. AP can accordingly be understood as the mean of the AP values under the different IoU thresholds. The larger these indicators are, the better the performance of the detection model.
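As a small clarification of the metric itself (not of the patented method), the helper below averages per-threshold AP values exactly as described above; the evaluation callable is assumed to exist.

```python
import numpy as np

def coco_style_ap(ap_at_threshold):
    """Average the AP computed at each IoU threshold 0.50, 0.55, ..., 0.95.

    `ap_at_threshold` is assumed to be a callable that evaluates the detector
    while counting a prediction as correct only when its IoU with the target
    box exceeds the given threshold (e.g. AP50 corresponds to threshold 0.5).
    """
    thresholds = np.arange(0.50, 0.96, 0.05)  # 10 values
    return float(np.mean([ap_at_threshold(t) for t in thresholds]))
```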
  • Under the standard small-sample object detection setting, the model trained with the method provided by the embodiments of this application (i.e., GLFR) significantly exceeds the other small-sample object detection algorithms on the MS-COCO data set.
  • the detector is a detector that detects Ns+Nt categories simultaneously.
  • the training process is as follows:
  • Original task pre-training: First, fully train the network shown in Figure 5/Figure 6 with the training data of the Ns categories to obtain a detector for the Ns categories.
  • New task fine-tuning: Then modify the last layer of the network so that its output has Nt+Ns neurons. Except for the last layer of the network, which is randomly initialized, the other layers are initialized with the parameters of the Ns-category detector. For the Ns categories of the original task, K samples are randomly sampled from the training data for each category; for the new task, all training data are used. The two parts of the data are combined into a balanced fine-tuning data set, and the parameters of the entire network are fine-tuned with this data set.
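The balanced fine-tuning set can be pictured with the following sketch; the data structures and the shuffling step are assumptions made for the example.

```python
import random

def build_balanced_finetune_set(base_data_by_class, novel_data, k):
    # base_data_by_class: dict mapping each of the Ns original classes to its samples.
    # novel_data: all labelled samples of the Nt new classes.
    # For each original class, randomly sample K examples; keep all new-task data.
    balanced = []
    for cls, samples in base_data_by_class.items():
        balanced.extend(random.sample(samples, min(k, len(samples))))
    balanced.extend(novel_data)
    random.shuffle(balanced)
    return balanced  # used to fine-tune the whole network (all Ns+Nt outputs)
```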
  • Under this setting as well, the model trained with the method provided by the embodiments of this application (i.e., GLFR) significantly exceeds the other small-sample object detection algorithms.
  • An embodiment of the image processing device in the embodiment of the present application includes:
  • the acquisition unit 1001 is used to acquire training images with labels.
  • the training images include foreground objects and backgrounds;
  • The training unit 1002 is used to train the first neural network based on the training image, the first loss function and the second loss function to obtain a second neural network.
  • the second neural network is used to implement image detection/segmentation tasks.
  • the first neural network includes an encoder, a decoder and a generator.
  • the second neural network includes an encoder and a decoder.
  • The first loss function is used to represent the difference between the first foreground image and the second foreground image generated based on the encoder and generator in the first neural network during the training process.
  • The first foreground image includes the foreground objects and does not include the background.
  • The second foreground image is the image obtained by removing the background from the training image, and the second loss function is used to represent the difference between the detection/segmentation results obtained based on the encoder and decoder in the second neural network during the training process and the labels.
  • each unit in the image processing device is similar to those described in the aforementioned embodiments shown in FIGS. 1 to 8 , and will not be described again here.
  • The training unit 1002 trains the encoder through the first loss function together with the second loss function used for detection/segmentation. Since the first loss function is used to reconstruct the foreground image, the encoder can capture more texture and structural features of the image, thereby improving the localization ability of small sample object detection and improving the detection/segmentation effect of the second neural network containing the encoder.
  • An embodiment of the image processing device in the embodiment of the present application includes:
  • The acquisition unit 1101 is used to acquire the first image;
  • The extraction unit 1102 is configured to extract the first feature map of the first image based on the encoder;
  • The processing unit 1103 is used to obtain the detection/segmentation result of the first image based on the first feature map and the decoder;
  • the encoder and decoder are trained by training images with labels, the first loss function and the second loss function.
  • the training images include foreground objects and backgrounds.
  • The first loss function is used to represent the difference between the first foreground image and the second foreground image generated based on the encoder and generator during the training process.
  • The first foreground image includes the foreground objects and does not include the background, and the second foreground image is the image obtained by removing the background from the training image.
  • The second loss function is used to represent the difference between the detection/segmentation results obtained based on the encoder and decoder during the training process and the labels.
  • each unit in the image processing device is similar to those described in the aforementioned embodiment shown in FIG. 9 and will not be described again here.
  • In this embodiment, training the encoder through the first loss function and the second loss function enables the encoder to learn more texture and structural features of the image, thereby improving the localization ability of small sample object detection and improving the detection/segmentation effect of the second neural network containing the encoder.
  • the image processing device may include a processor 1201, a memory 1202, and a communication port 1203.
  • the processor 1201, memory 1202 and communication port 1203 are interconnected through lines.
  • the memory 1202 stores program instructions and data.
  • the memory 1202 stores program instructions and data corresponding to the steps executed by the image processing device in the corresponding embodiments shown in FIGS. 1 to 9 .
  • the processor 1201 is configured to perform the steps performed by the image processing device shown in any of the embodiments shown in FIGS. 1 to 9 .
  • the communication port 1203 can be used to receive and send data, and to perform steps related to obtaining, sending, and receiving in any of the embodiments shown in FIGS. 1 to 9 .
  • In one implementation, the image processing device may include more or fewer components than those shown in Figure 12; this application gives only an illustrative description, which does not constitute a limitation.
  • the disclosed systems, devices and methods can be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division. In actual implementation, there may be other division methods.
  • multiple units or components may be combined or can be integrated into another system, or some features can be ignored, or not implemented.
  • the coupling or direct coupling or communication connection between each other shown or discussed may be through some interfaces, and the indirect coupling or communication connection of the devices or units may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or they may be distributed to multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application can be integrated into one processing unit, each unit can exist physically alone, or two or more units can be integrated into one unit.
  • the above integrated units can be implemented in the form of hardware or software functional units.
  • If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium.
  • Based on this understanding, the technical solution of the present application in essence, or the part that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application.
  • The aforementioned storage media include: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or other media that can store program code.

Abstract

本申请实施例公开了一种图像处理方法,可以应用于物体检测/分割场景。方法包括:获取带有标签的训练图像(401),训练图像包括前景物体与背景;基于训练图像、第一损失函数以及第二损失函数训练第一神经网络,得到第二神经网络,第二神经网络用于实现图像的检测/分割任务(402),第一损失函数为生成损失函数,第二损失函数为检测/分割损失函数。由于第一损失函数用于重建前景图像,可以使得编码器可以捕捉图像更多的纹理和结构特征,进而提高小样本物体检测的定位能力,提升包含编码器的第二神经网络的检测/分割效果。

Description

一种图像处理方法及相关设备
本申请要求于2022年4月29日提交中国专利局、申请号为202210468931.9、发明名称为“一种图像处理方法及相关设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请实施例涉及人工智能领域,尤其涉及一种图像处理方法及相关设备。
背景技术
人工智能(artificial intelligence,AI)是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。近年来,深度学习技术使得计算机在图像物体检测任务取得卓越性能。深度学习之所以能够取得如此巨大的成功,一个非常重要的因素是大数据,特别是大规模带标签的数据。但是人工获取标签的代价很高,甚至某些任务无法收集到大规模数据,比如,医学数据由于需要专业医生标注且涉及患者隐私,难以收集大量标注数据。在缺少标注数据的情形下,深度学习模型性能明显下降。
相比于深度网络,人类具有从少量样本中快速学习的能力。这是因为人类在生活中积累了各种知识以及人类天生的思考能力。知识的积累意味着人类是存在各种知识先验的,天生的思考能力意味着人类拥有强大的类比能力、泛化能力和强大的算力。以人类快速学习能力为启发,人们研究小样本学习这一问题,希望计算机能够像人类一样利用少量的标注知识快速学习新事物。另外,图像物体检测是计算机视觉的重要基本任务之一,在自动驾驶、工业视觉等领域都有重要应用。对于物体检测任务,获取标注数据的成本也非常大。
目前,为了缓解深度学习对标注数据的依赖,研究者提出了小样本物体检测任务。即给定一个大规模的训练集合作为原任务,小样本物体检测器从这些数据中学习可以迁移的知识。对于从未见过的新类别(和原任务的类别不重合),利用每类少量标注的训练样例,检测器可以从未见过的测试图像中检测目标。
然而,由于小样本物体检测中的样本较少,导致定位的精度不高。
发明内容
本申请实施例提供了一种图像处理方法及相关设备,用于增加检测/分割模型对于物体的定位能力。
本申请实施例第一方面提供了一种图像处理方法,可以应用于物体检测/分割场景。该方法可以由训练装置/图像处理设备执行,也可以由训练装置/图像处理设备的部件(例如处理器、芯片、或芯片系统等)执行。该方法包括:获取带有标签的训练图像,训练图像包括前景物体与背景;基于训练图像、第一损失函数以及第二损失函数训练第一神经网络,得到第二神经网络,第二神经网络用于实现图像的检测/分割任务,第一神经网络包括编码器、解码器以及生成器,第二神经网络包括编码器与解码器;第一损失函数用于表示训练过程中基于 第一神经网络中编码器与生成器生成的第一前景图像与第二前景图像之间的差异,第一前景图像包括前景物体,且不包括背景,第二前景图像为训练图像中扣除背景之后的图像,第二损失函数用于表示训练过程中基于第二神经网络中编码器与解码器得到的检测/分割结果与标签之间的差异。
本申请实施例中,通过第一损失函数与检测/分割的第二损失函数一起训练编码器,由于第一损失函数用于重建前景图像,可以使得编码器可以捕捉图像更多的纹理和结构特征,进而提高小样本物体检测的定位能力,提升包含该编码器的第二神经网络的检测/分割效果。
可选地,在第一方面的一种可能的实现方式中,上述的第一神经网络与第二神经网络还包括量子化模块,量子化模块用于将编码器输出的特征图进行更新,并将更新后的特征图分别输入至解码器与生成器。
该种可能的实现方式中,量子化模块可以将连续的特征空间转化为原型向量集合表示的离散特征空间。离散特征空间相较于高维连续空间更容易建模。
可选地,在第一方面的一种可能的实现方式中,上述的第一损失函数包括第三损失函数与第四损失函数,第三损失函数用于表示第一前景图像与第二前景图像之间的差异,第四损失函数用于表示在训练过程中特征图在量子化模块更新前后之间的差异。
该种可能的实现方式中,最小化该特征图的像素在更新前后的差异,可以引入损失对量子化模块进行训练,进而使得量子化模块将连续的特征空间转化为原型向量集合表示的离散特征空间。离散特征空间相较于高维连续空间更容易建模。
可选地,在第一方面的一种可能的实现方式中,上述的第一神经网络与第二神经网络还包括赋值模块,赋值模块用于更新特征图的索引,索引用于量子化模块对特征图进行更新。
该种可能的实现方式中,赋值模块可以实现不同像素的聚类中心的对齐,在预测每个像素的聚类中心的时候,不仅考虑当前像素,还要考虑其他相似像素的聚类中心,提升后续推理效果。
可选地,在第一方面的一种可能的实现方式中,上述的第一损失函数还包括第五损失函数,第五损失函数用于表示在训练过程中索引在赋值模块更新前后之间的差异。
该种可能的实现方式中,使用第五损失函数训练特征图在量子化模块更新前后之间的差异越来越小。使用第五损失函数训练索引在赋值模块更新前后之间的差异越来越小,即重新计算的索引值要尽量和原来通过最近邻聚类方法得到的索引值尽可能一致。
本申请实施例第二方面提供了一种图像处理方法,可以应用于物体检测/分割场景。该方法可以由图像处理设备执行,也可以由图像处理设备的部件(例如处理器、芯片、或芯片系统等)执行。该方法包括:获取第一图像;基于编码器提取第一图像的第一特征图;基于第一特征图与解码器得到第一图像的检测/分割结果;编码器与解码器由带有标签的训练图像、第一损失函数以及第二损失函数训练得到,训练图像包括前景物体与背景,第一损失函数用于表示训练过程中基于编码器与生成器生成的第一前景图像与第二前景图像之间的差异,第一前景图像包括前景物体,且不包括背景,第二前景图像为训练图像扣除背景以外的图像,第二损失函数用于表示训练过程中基于编码器与解码器得到的检测/分割结果与标签之间的差异。
本申请实施例中,通过第一损失函数与第二损失函数训练编码器,可以使得编码器可以 学习到图像更多的纹理和结构特征,进而提高小样本物体检测的定位能力,提升包含该编码器的第二神经网络的检测/分割效果。
可选地,在第二方面的一种可能的实现方式中,上述步骤:基于第一特征图与解码器得到第一图像的检测/分割结果,包括:将第一特征图输入解码器得到检测/分割结果。
该种可能的实现方式中,直接将第一特征图输入解码器,由于解码器是通过第一损失函数与第二损失函数训练得到,得到的检测/分割结果可以具有更多的纹理和结构特征,
可选地,在第二方面的一种可能的实现方式中,上述步骤:基于第一特征图与解码器得到第一图像的检测/分割结果,包括:基于量子化模块更新第一特征图,得到第二特征图,量子化模块基于第四损失函数训练得到,第四损失函数用于表示在训练过程中编码器输出的训练图像的特征图在量子化模块更新前后之间的差异;将第二特征图输入解码器得到检测/分割结果。
该种可能的实现方式中,量子化模块将连续的特征空间转化为原型向量集合表示的离散特征空间。离散特征空间相较于高维连续空间更容易建模。
本申请实施例第三方面提供了一种图像处理设备(也可以是训练装置),可以应用于物体检测/分割场景。该图像处理设备/训练装置包括:获取单元,用于获取带有标签的训练图像,训练图像包括前景物体与背景;训练单元,用于基于训练图像、第一损失函数以及第二损失函数训练第一神经网络,得到第二神经网络,第二神经网络用于实现图像的检测/分割任务,第一神经网络包括编码器、解码器以及生成器,第二神经网络包括编码器与解码器;第一损失函数用于表示训练过程中基于第一神经网络中编码器与生成器生成的第一前景图像与第二前景图像之间的差异,第一前景图像包括前景物体,且不包括背景,第二前景图像为训练图像中扣除背景之后的图像,第二损失函数用于表示训练过程中基于第二神经网络中编码器与解码器得到的检测/分割结果与标签之间的差异。
可选地,在第三方面的一种可能的实现方式中,上述的第一神经网络与第二神经网络还包括量子化模块,量子化模块用于将编码器输出的特征图进行更新,并将更新后的特征图分别输入至解码器与生成器。
可选地,在第三方面的一种可能的实现方式中,上述的第一损失函数包括第三损失函数与第四损失函数,第三损失函数用于表示第一前景图像与第二前景图像之间的差异,第四损失函数用于表示在训练过程中特征图在量子化模块更新前后之间的差异。
可选地,在第三方面的一种可能的实现方式中,上述的第一神经网络与第二神经网络还包括赋值模块,赋值模块用于更新特征图的索引,索引用于量子化模块对特征图进行更新。
可选地,在第三方面的一种可能的实现方式中,上述的第一损失函数还包括第五损失函数,第五损失函数用于表示在训练过程中索引在赋值模块更新前后之间的差异。
本申请实施例第四方面提供了一种图像处理设备,可以应用于物体检测/分割场景。该图像处理设备包括:获取单元,用于获取第一图像;提取单元,用于基于编码器提取第一图像的第一特征图;处理单元,用于基于第一特征图与解码器得到第一图像的检测/分割结果;编码器与解码器由带有标签的训练图像、第一损失函数以及第二损失函数训练得到,训练图像包括前景物体与背景,第一损失函数用于表示训练过程中基于编码器与生成器生成的第一前景图像与第二前景图像之间的差异,第一前景图像包括前景物体,且不包括背景,第二前景 图像为训练图像扣除背景以外的图像,第二损失函数用于表示训练过程中基于编码器与解码器得到的检测/分割结果与标签之间的差异。
可选地,在第四方面的一种可能的实现方式中,上述的处理单元,具体用于将第一特征图输入解码器得到检测/分割结果。
可选地,在第四方面的一种可能的实现方式中,上述的处理单元,具体用于基于量子化模块更新第一特征图,得到第二特征图,量子化模块基于第四损失函数训练得到,第四损失函数用于表示在训练过程中编码器输出的训练图像的特征图在量子化模块更新前后之间的差异;处理单元,具体用于将第二特征图输入解码器得到检测/分割结果。
本申请第五方面提供了一种图像处理设备,包括:处理器,处理器与存储器耦合,存储器用于存储程序或指令,当程序或指令被处理器执行时,使得该图像处理设备实现上述第一方面或第一方面的任意可能的实现方式中的方法,或者使得该图像处理设备实现上述第二方面或第二方面的任意可能的实现方式中的方法。
本申请第六方面提供了一种计算机可读介质,其上存储有计算机程序或指令,当计算机程序或指令在计算机上运行时,使得计算机执行前述第一方面或第一方面的任意可能的实现方式中的方法,或者使得计算机执行前述第二方面或第二方面的任意可能的实现方式中的方法。
本申请第七方面提供了一种计算机程序产品,该计算机程序产品在计算机上执行时,使得计算机执行前述第一方面或第一方面的任意可能的实现方式中的方法,使得计算机执行前述第二方面或第二方面的任意可能的实现方式中的方法。
其中,第三、第五、第六、第七方面或者其中任一种可能实现方式所带来的技术效果可参见第一方面或第一方面不同可能实现方式所带来的技术效果,此处不再赘述。
其中,第四、第五、第六、第七方面或者其中任一种可能实现方式所带来的技术效果可参见第二方面或第二方面不同可能实现方式所带来的技术效果,此处不再赘述。
从以上技术方案可以看出,本申请实施例具有以下优点:通过第一损失函数与检测/分割的第二损失函数一起训练编码器,由于第一损失函数用于重建前景图像,可以使得编码器可以捕捉图像更多的纹理和结构特征,进而提高小样本物体检测的定位能力,提升包含该编码器的第二神经网络的检测/分割效果。
附图说明
图1为本申请实施例提供的系统架构的结构示意图;
图2为本申请实施例提供的一种芯片硬件结构示意图;
图3A为本申请实施例提供的图像处理系统的一个结构示意图;
图3B为本申请实施例提供的图像处理系统的另一结构示意图;
图4为本申请实施例提供的图像处理方法一个流程示意图;
图5为本申请实施例提供的第二神经网络的一个训练流程示意图;
图6为本申请实施例提供的第二神经网络的另一个训练流程示意图;
图7为本申请实施例提供的增加量子化模块后的一个训练流程示意图;
图8为本申请实施例提供的增加量子化模块后的另一个训练流程示意图;
图9为本申请实施例提供的图像处理方法另一个流程示意图;
图10至图12为本申请实施例提供的图像处理设备的几个结构示意图。
具体实施方式
本申请实施例提供了一种图像处理方法及相关设备,用于增加检测/分割模型对于物体的定位能力。
目前,为了缓解深度学习对标注数据的依赖,研究者提出了小样本物体检测任务。即给定一个大规模的训练集合作为原任务,小样本物体检测器从这些数据中学习可以迁移的知识。对于从未见过的新类别(和原任务的类别不重合),利用每类少量标注的训练样例,检测器可以从未见过的测试图像中检测目标。
然而,由于小样本物体检测中的样本较少,导致定位的精度不高。
为了解决上述技术问题,本申请实施例提供一种图像处理方法及相关设备,通过第一损失函数与检测/分割的第二损失函数一起训练编码器,由于第一损失函数用于重建前景图像,可以使得编码器可以捕捉更多图像的纹理和结构特征,进而提高小样本物体检测的定位能力,提升包含该编码器的第二神经网络的检测/分割效果。下面将结合附图对本申请实施例的图像处理方法及相关设备进行详细的介绍。
为了便于理解,下面先对本申请实施例主要涉及的相关术语和概念进行介绍。
1、神经网络
神经网络可以是由神经单元组成的，神经单元可以是指以Xs和截距1为输入的运算单元，该运算单元的输出可以为：
h_{W,b}(x) = f(W^T x) = f(\sum_{s=1}^{n} W_s x_s + b)
其中,s=1、2、……n,n为大于1的自然数,Ws为Xs的权重,b为神经单元的偏置。f为神经单元的激活函数(activation functions),用于将非线性特性引入神经网络中,来将神经单元中的输入信号转换为输出信号。该激活函数的输出信号可以作为下一层卷积层的输入。激活函数可以是Relu函数。神经网络是将许多个上述单一的神经单元联结在一起形成的网络,即一个神经单元的输出可以是另一个神经单元的输入。每个神经单元的输入可以与前一层的局部接受域相连,来提取局部接受域的特征,局部接受域可以是由若干个神经单元组成的区域。
神经网络中的每一层的工作可以用数学表达式y=a(Wx+b)来描述:从物理层面神经网络中的每一层的工作可以理解为通过五种对输入空间(输入向量的集合)的操作,完成输入空间到输出空间的变换(即矩阵的行空间到列空间),这五种操作包括:1、升维/降维;2、放大/缩小;3、旋转;4、平移;5、“弯曲”。其中1、2、3的操作由Wx完成,4的操作由+b完成,5的操作则由a()来实现。这里之所以用“空间”二字来表述是因为被分类的对象并不是单个事物,而是一类事物,空间是指这类事物所有个体的集合。其中,W是权重向量,该向量中的每一个值表示该层神经网络中的一个神经元的权重值。该向量W决定着上文的输 入空间到输出空间的空间变换,即每一层的权重W控制着如何变换空间。训练神经网络的目的,也就是最终得到训练好的神经网络的所有层的权重矩阵(由很多层的向量W形成的权重矩阵)。因此,神经网络的训练过程本质上就是学习控制空间变换的方式,更具体的就是学习权重矩阵。
2、卷积神经网络
卷积神经网络(convolutional neuron network,CNN)是一种带有卷积结构的深度神经网络。卷积神经网络包含了一个由卷积层和子采样层构成的特征抽取器。该特征抽取器可以看作是滤波器,卷积过程可以看作是使同一个可训练的滤波器与一个输入的图像或者卷积特征平面(feature map)做卷积。卷积层是指卷积神经网络中对输入信号进行卷积处理的神经元层。在卷积神经网络的卷积层中,一个神经元可以只与部分邻层神经元连接。一个卷积层中,通常包含若干个特征平面,每个特征平面可以由一些矩形排列的神经单元组成。同一特征平面的神经单元共享权重,这里共享的权重就是卷积核。共享权重可以理解为提取图像信息的方式与位置无关。这其中隐含的原理是:图像的某一部分的统计信息与其他部分是一样的。即意味着在某一部分学习的图像信息也能用在另一部分上。所以对于图像上的所有位置,都能使用同样的学习得到的图像信息。在同一卷积层中,可以使用多个卷积核来提取不同的图像信息,一般地,卷积核数量越多,卷积操作反映的图像信息越丰富。
卷积核可以以随机大小的矩阵的形式初始化,在卷积神经网络的训练过程中卷积核可以通过学习得到合理的权重。另外,共享权重带来的直接好处是减少卷积神经网络各层之间的连接,同时又降低了过拟合的风险。
3、深度学习
深度学习(deep learning)是一类基于深层次神经网络算法的机器学习技术,其主要特征是使用多重非线性变换构对数据进行处理和分析。主要应用于人工智能领域的感知、决策等场景,例如图像和语音识别、自然语言翻译、计算机博弈等。
4、损失函数
在训练深度神经网络的过程中,因为希望深度神经网络的输出尽可能的接近真正想要预测的值,所以可以通过比较当前网络的预测值和真正想要的目标值,再根据两者之间的差异情况来更新每一层神经网络的权重向量(当然,在第一次更新之前通常会有初始化的过程,即为深度神经网络中的各层预先配置参数),比如,如果网络的预测值高了,就调整权重向量让它预测低一些,不断的调整,直到神经网络能够预测出真正想要的目标值。因此,就需要预先定义“如何比较预测值和目标值之间的差异”,这便是损失函数(loss function)或目标函数(objective function),它们是用于衡量预测值和目标值的差异的重要方程。其中,以损失函数举例,损失函数的输出值(loss)越高表示差异越大,那么深度神经网络的训练就变成了尽可能缩小这个loss的过程。
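As a toy illustration of the idea in the preceding paragraph (compare the prediction with the target through a loss function, then adjust the weights so that the loss shrinks), consider the sketch below; the linear model, data values and learning rate are made up for the example.

```python
import torch

# Compare prediction and target via a loss, then nudge the weights against
# the gradient so the loss decreases over successive updates.
w = torch.zeros(3, requires_grad=True)
x, target = torch.tensor([1.0, 2.0, 3.0]), torch.tensor(2.0)
for _ in range(100):
    pred = (w * x).sum()
    loss = (pred - target) ** 2        # squared-error loss
    loss.backward()
    with torch.no_grad():
        w -= 0.01 * w.grad             # gradient step on the weights
        w.grad.zero_()
```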
下面介绍本申请实施例提供的系统架构。
参见附图1,本发明实施例提供了一种系统架构100。如系统架构100所示,数据采集设备160用于采集训练数据,本申请实施例中训练数据包括:多个不同模态的数据。其中,模态可以是指文本、图像、视音频。例如:训练数据可以包括带标签的训练图像等等。并将训 练数据存入数据库130,训练设备120基于数据库130中维护的训练数据训练得到目标模型/规则101。下面将更详细地描述训练设备120如何基于训练数据得到目标模型/规则101,该目标模型/规则101能够用于实现本申请实施例提供的图像处理方法所应用的计算机视觉任务。该计算机视觉任务可以包括:分类任务、分割任务、检测任务或图像生成任务等。需要说明的是,在实际的应用中,数据库130中维护的训练数据不一定都来自于数据采集设备160的采集,也有可能是从其他设备接收得到的。另外需要说明的是,训练设备120也不一定完全基于数据库130维护的训练数据进行目标模型/规则101的训练,也有可能从云端或其他地方获取训练数据进行模型训练,上述描述不应该作为对本申请实施例的限定。
根据训练设备120训练得到的目标模型/规则101可以应用于不同的系统或设备中,如应用于图1所示的执行设备110,执行设备110可以是终端,如手机终端,平板电脑,笔记本电脑,增强现实(augmented reality,AR)设备/虚拟现实(virtual reality,VR)设备,车载终端等。当然,执行设备110还可以是服务器或者云端等。在附图1中,执行设备110配置有I/O接口112,用于与外部设备进行数据交互,用户可以通过客户设备140向I/O接口112输入数据,输入数据在本申请实施例中可以包括:第一图像。另外该输入数据可以是用户输入的,也可以是用户通过拍摄设备上传的,当然还可以来自数据库,具体此处不做限定。
预处理模块113用于根据I/O接口112接收到的第一图像进行预处理,在本申请实施例中,预处理模块113可以用于对第一图像进行翻转、平移、裁剪、颜色变换等处理。
在执行设备110对输入数据进行预处理,或者在执行设备110的计算模块111执行计算等相关的处理过程中,执行设备110可以调用数据存储系统150中的数据、代码等以用于相应的处理,也可以将相应处理得到的数据、指令等存入数据存储系统150中。
最后,I/O接口112将处理结果,如得到的上述检测/分割结果返回给客户设备140,从而提供给用户。
值得说明的是,训练设备120可以针对不同的目标或称不同的任务,基于不同的训练数据生成相应的目标模型/规则101,该相应的目标模型/规则101即可以用于实现上述目标或完成上述任务,从而为用户提供所需的结果。
在附图1中所示情况下,用户可以手动给定输入数据,该手动给定可以通过I/O接口112提供的界面进行操作。另一种情况下,客户设备140可以自动地向I/O接口112发送输入数据,如果要求客户设备140自动发送输入数据需要获得用户的授权,则用户可以在客户设备140中设置相应权限。用户可以在客户设备140查看执行设备110输出的结果,具体的呈现形式可以是显示、声音、动作等具体方式。客户设备140也可以作为数据采集端,采集如图所示输入I/O接口112的输入数据及输出I/O接口112的输出结果作为新的样本数据,并存入数据库130。当然,也可以不经过客户设备140进行采集,而是由I/O接口112直接将如图所示输入I/O接口112的输入数据及输出I/O接口112的输出结果,作为新的样本数据存入数据库130。
值得注意的是,附图1仅是本发明实施例提供的一种系统架构的示意图,图中所示设备、器件、模块等之间的位置关系不构成任何限制,例如,在附图1中,数据存储系统150相对执行设备110是外部存储器,在其它情况下,也可以将数据存储系统150置于执行设备110 中。
如图1所示,根据训练设备120训练得到目标模型/规则101,本申请实施例中的目标模型/规则101具体可以为目标神经网络。
下面介绍本申请实施例提供的一种芯片硬件结构。
图2为本发明实施例提供的一种芯片硬件结构,该芯片包括神经网络处理器20。该芯片可以被设置在如图1所示的执行设备110中,用以完成计算模块111的计算工作。该芯片也可以被设置在如图1所示的训练设备120中,用以完成训练设备120的训练工作并输出目标模型/规则101。
神经网络处理器20可以是神经网络处理器(neural-network processing unit,NPU),张量处理器(tensor processing unit,TPU),或者图形处理器(graphics processing unit,GPU)等一切适合用于大规模异或运算处理的处理器。以NPU为例:神经网络处理器20作为协处理器挂载到主中央处理器(central processing unit,CPU)(host CPU)上,
由主CPU分配任务。NPU的核心部分为运算电路203,控制器204控制运算电路203提取存储器(权重存储器或输入存储器)中的数据并进行运算。
在一些实现中,运算电路203内部包括多个处理单元(process engine,PE)。在一些实现中,运算电路203是二维脉动阵列。运算电路203还可以是一维脉动阵列或者能够执行例如乘法和加法这样的数学运算的其它电子线路。在一些实现中,运算电路203是通用的矩阵处理器。
举例来说,假设有输入矩阵A,权重矩阵B,输出矩阵C。运算电路203从权重存储器202中取矩阵B相应的数据,并缓存在运算电路中每一个PE上。运算电路从输入存储器201中取矩阵A数据与矩阵B进行矩阵运算,得到的矩阵的部分结果或最终结果,保存在累加器208中。
向量计算单元207可以对运算电路的输出做进一步处理,如向量乘,向量加,指数运算,对数运算,大小比较等等。例如,向量计算单元207可以用于神经网络中非卷积/非FC层的网络计算,如池化(Pooling),批归一化(Batch Normalization),局部响应归一化(Local Response Normalization)等。
在一些实现中,向量计算单元能207将经处理的输出的向量存储到统一缓存器206。例如,向量计算单元207可以将非线性函数应用到运算电路203的输出,例如累加值的向量,用以生成激活值。在一些实现中,向量计算单元207生成归一化的值、合并值,或二者均有。在一些实现中,处理过的输出的向量能够用作到运算电路203的激活输入,例如用于在神经网络中的后续层中的使用。
统一存储器206用于存放输入数据以及输出数据。
权重数据直接通过存储单元访问控制器205(direct memory access controller,DMAC)将外部存储器中的输入数据搬运到输入存储器201和/或统一存储器206、将外部存储器中的权重数据存入权重存储器202,以及将统一存储器206中的数据存入外部存储器。
总线接口单元(bus interface unit,BIU)210,用于通过总线实现主CPU、DMAC和取指存储器209之间进行交互。
与控制器204连接的取指存储器(instruction fetch buffer)209,用于存储控制器204使用的指令。
控制器204,用于调用指存储器209中缓存的指令,实现控制该运算加速器的工作过程。
一般地,统一存储器206,输入存储器201,权重存储器202以及取指存储器209均为片上(On-Chip)存储器,外部存储器为该NPU外部的存储器,该外部存储器可以为双倍数据率同步动态随机存储器(double data rate synchronous dynamic random access memory,简称DDR SDRAM)、高带宽存储器(high bandwidth memory,HBM)或其他可读可写的存储器。
接下来介绍几种本申请的应用场景。
图3A为本申请实施例提供的图像处理系统的一个结构示意图,该图像处理系统包括终端设备(图3A中仅以终端设备是手机为例)以及图像处理设备。可以理解的是,终端设备除了可以是手机之外,还可以是平板电脑(pad)、便携式游戏机、掌上电脑(personal digital assistant,PDA)、笔记本电脑、超级移动个人计算机(ultra mobile personal computer,UMPC)、手持计算机、上网本、车载媒体播放设备、可穿戴电子设备、虚拟现实(virtual reality,VR)终端设备、增强现实(augmented reality,AR)、车辆、车载终端、飞机终端、智能机器人等终端设备。终端设备为图像处理的发起端,作为图像处理请求的发起方,通常由用户通过终端设备发起请求。
上述图像处理设备可以是云服务器、网络服务器、应用服务器以及管理服务器等具有图像处理功能的设备或服务器。图像处理设备通过交互接口接收来自终端设备的图像处理请求,再通过存储数据的存储器以及图像处理的处理器环节进行机器学习,深度学习,搜索,推理,决策等方式的图像处理。图像处理设备中的存储器可以是一个统称,包括本地存储以及存储历史数据的数据库,数据库可以在图像处理设备上,也可以在其它网络服务器上。
在图3A所示的图像处理系统中,终端设备可以接收用户的指令,例如终端设备可以获取用户输入/选择的多个数据(例如:终端设备通过终端设备采集的图像、文本、音频等),然后向图像处理设备发起请求,使得图像处理设备针对终端设备得到的该多个数据执行图像处理应用(例如,分类、分割、检测、图像生成等的计算机视觉任务),从而得到针对多个数据的对应的处理结果。示例性的,终端设备可以获取用户输入的图像,然后向图像处理设备发起图像检测请求,使得图像处理设备对该图像进行检测,从而得到图像的检测结果,并显示图像的检测结果,以供用户观看和使用。
在图3A中,图像处理设备可以执行本申请实施例的图像处理方法。
图3B为本申请实施例提供的图像处理系统的另一结构示意图,在图3B中,终端设备(图3B中仅以终端设备是手机为例)直接作为图像处理设备,该终端设备能够直接获取图像,并直接由终端设备本身的硬件进行处理,具体过程与图3A相似,可参考上面的描述,在此不再赘述。
可选地,在图3B所示的图像处理系统中,终端设备可以接收用户的指令,例如终端设备可以获取用户在终端设备中所选择的多张图像,然后再由终端设备自身针对该图像执行图像处理应用(例如,分类、分割、检测、图像生成等的计算机视觉任务),从而得到针对该图像的对应的处理结果,并显示处理结果,以供用户观看和使用。
可选地,在图3B所示的图像处理系统中,终端设备可以实时或周期性的采集图像,然后再由终端设备自身针对该图像执行图像处理应用(例如,分类、分割、检测、图像生成等计算机视觉任务),从而得到针对该图像的对应的处理结果,并根据处理结果实现功能(分类功能、分割功能、检测功能、图像生成功能等等)。
在图3B中,终端设备自身就可以执行本申请实施例的图像处理方法。
上述图3A和图3B中的终端设备具体可以是图1中的客户设备140或执行设备110,图3A中的图像处理设备具体可以是图1中的执行设备110,其中,数据存储系统150可以存储执行设备110的待处理数据,数据存储系统150可以集成在执行设备110上,也可以设置在云上或其它网络服务器上。
图3A和图3B中的处理器可以通过神经网络模型或者其它模型(例如注意力模型、MLP等)进行数据训练/机器学习/深度学习,并利用数据最终训练或者学习得到的模型针对多个数据执行图像处理应用,从而得到相应的处理结果。
本申请实施例提供的图像处理方法可以应用于多种场景,下面分别进行描述。
第一种,自动驾驶领域。
基于深度学习的检测模型擅于检测常见类别(比如汽车、行人),但是难以准确检测罕见的样例,比如路边的垃圾袋、掉落的轮胎、摆放在路中的三角锥等。但是这些障碍物的误检和漏检可能导致严重后果。通过本申请实施例提出的图像处理方法(也可以称为小样本物体检测算法),可以改善检测模型在包含少量标注样本的类别上的检出,提高检测模型的精度和召回率。
第二种,铁路、电网故障检测。
铁路行业货车检测每年人力投入约10亿元,此外还有客车、动车、线路等大颗粒故障检测场景;电网输、变、配电巡检预估未来5年规模240+亿。由于故障发生几率少且需要人为的标注,收集标注样本困难;而且外部环境变化导致成像变化大,故障类内差别明显。本申请实施例提出的图像处理方法(也可以称为小样本物体检测算法)可以有效处理具有少量标注样本的故障检测任务,这一模型可以部署到云上,给外部的客户提供高效的服务。
可以理解的是,上述两种场景只是举例,在实际应用中,本申请实施例还可以应用于其他的小样本物体检测/分割等场景,具体此处不做限定。
下面结合附图对本申请实施例的神经网络的训练方法和图像处理方法进行详细的介绍。
先结合图4对本申请实施例的神经网络的训练方法进行详细介绍。图4所示的方法可以由神经网络的训练装置来执行,该神经网络的训练装置可以是云服务设备,也可以是终端设备,例如,电脑、服务器等运算能力足以用来执行推荐网络的训练方法的装置,也可以是由云服务设备和终端设备构成的系统。示例性地,该训练方法可以由图1中的训练设备120、图2中的神经网络处理器20执行。
可选地,该训练方法可以由CPU处理,也可以由CPU和GPU共同处理,也可以不用GPU,而使用其他适合用于神经网络计算的处理器,本申请不做限制。
该训练方法包括步骤401与步骤402。下面对步骤401与步骤402进行详细说明。
步骤401,获取带有标签的训练图像。
本申请实施例中,训练装置获取训练图像的方式有多种,可以是通过采集/拍摄的方式,也可以是通过接收其他设备发送的方式,还可以是从数据库中选取的方式等,具体此处不做限定。
本申请实施例中的训练图像包括前景图像与背景,一般情况下,该前景图像是用户指定设备需要识别的部分。
可选地,训练图像的标签可以通过人工或输入模型等方式获取,具体此处不做限定。若应用于检测场景,该标签可以是图像中各对象的类别和/或对象边缘的外接长方形框。若应用于分割场景,该标签可以是像素的分类标签,或者理解为图像中每个像素对应的类别。
可选地,若应用于自动驾驶场景,训练装置可以是车辆,训练图像可以是车辆实时采集的数据,也可以是周期性采集的数据,具体此处不做限定。
步骤402,基于训练图像、第一损失函数以及第二损失函数训练第一神经网络,得到第二神经网络,第二神经网络用于实现图像的检测/分割任务。
训练装置获取训练图像之后,可以基于训练图像、第一损失函数、第二损失函数训练第一神经网络,得到第二神经网络。其中,第一神经网络包括编码器、解码器以及生成器,第二神经网络包括第一神经网络中的编码器与解码器。也可以理解为,第一神经网络包括第二神经网络与生成器。第一损失函数用于表示训练过程中基于第一神经网络中编码器与生成器生成的第一前景图像与第二前景图像之间的差异,该第一前景图像包括前景物体,不包括背景,第二前景图像为训练图像中扣除背景之后的图像。第二损失函数用于表示训练过程中基于第二神经网络中编码器与解码器得到的检测/分类结果与标签之间的差异。
本申请实施例中的第一损失函数可以理解为是生成损失函数,第二损失函数可以理解为是检测/分割损失函数,通过这两部分损失函数训练编码器,可以使得编码器可以学习到图像更多的纹理和结构特征,进而提高小样本物体检测的定位能力,提升包含该编码器的第二神经网络的检测/分割效果。
示例性的,第一损失函数的一种示例如公式一所示:
公式一：Lrec=||D(Q)-x⊙m||;
其中，Lrec表示第一损失函数，D表示解码器，Q表示特征图，可以是编码器输出的特征图，也可以是后续量子化模块更新后的特征图，x表示训练图像，⊙表示点对点的乘法算子，m表示一个二值化的掩膜，大小与训练图像相同，可以根据标注信息，前景物体的像素设为1，背景的像素设为0。
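For readers who prefer code, the following PyTorch-style sketch evaluates Formula 1 as defined above, where `decoded` stands for D(Q), `image` for the training image x, and `mask` for the binary mask m; treating the norm as an L1 mean is an assumption made only for this illustration.

```python
import torch

def reconstruction_loss(decoded, image, mask):
    # Formula 1: L_rec = || D(Q) - x (.) m ||, with m the binary foreground mask
    # (foreground pixels set to 1, background pixels set to 0 per the labels).
    return torch.mean(torch.abs(decoded - image * mask))
```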
可以理解的是,上述第一损失函数的公式只是举例,在实际应用中,还可以有其它形式的公式,如公式二所示,具体对于第一损失函数的具体结构不做限定。
公式二：Lrec=||D(Q0,Q1,Q2)-x⊙m||;
其中,Q0、Q1、Q2可以表示训练图像经过编码器得到的三个尺度的特征图,也可以表示后续量子化模块更新后的三个尺度特征图,其余参数的描述可参考公式一,此处不再 赘述。
本申请实施例中,第二损失函数可以是绝对值损失函数、对数损失函数、指数损失函数、交叉熵损失函数等等,可以根据实际需要设置,具体此处不做限定。
本步骤也可以理解为是,利用生成模型作为约束,优化检测/分割模型的定位特征。
示例性的,训练装置基于训练图像、第一损失函数以及第二损失函数训练第一神经网络,得到第二神经网络的具体过程可以如图5所示。将训练图像输入编码器得到该训练图像的特征图,一方面将该特征图输入生成器生成第一前景图像,另一方面将该特征图输入解码器得到检测/分割结果。将训练图像中的背景去除得到第二前景图像。再使用第一损失函数训练编码器与生成器,使得基于编码器与生成器输出的第一前景图像与第二前景图像之间的差异越来越小。再使用第二损失函数训练编码器与解码器,使得基于编码器与解码器输出的检测/分类结果与标签的差异越来越小。
可选地,第一神经网络与第二神经网络还可以包括量子化模块,该量子化模块用于将编码器输出的特征图(Fi)进行更新,i用于指示层数,为大于或等于1的整数,并将更新后的特征图分别输入至生成器与解码器。该量子化模块可以使用原型向量集合对特征图进行更新,n为大于1的整数。例如:将Fi中的每个像素被替换成Vi中和最邻近的原型向量j用于指示层数中的位置,为大于或等于1的整数,k在1至n之间。该替换过程可以看作是一种聚类的过程,其中,原型向量是聚类中心,每个输入的像素被指定为距离该像素最近的聚类中心。为了保证聚类的可靠性,通过引入后续所提的第四损失函数学习有效的聚类过程。
进一步的,为了最小化该特征图的像素在更新前后的差异,可以引入损失对量子化模块进行训练,进而使得量子化模块将连续的特征空间转化为原型向量集合表示的离散特征空间。离散特征空间相较于高维连续空间更容易建模。换句话说,第一损失函数可以包括第三损失函数与第四损失函数,该第三损失函数用于表示前述第一前景图像与第二前景图像之间的差异,第四损失函数用于表示编码器输出的特征图在量子化模块更新前后之间的差异。即生成损失函数包括第三损失函数与第四损失函数。可以理解的是,量子化模块的数量可以与编码器输出特征图的数量一一对应,编码器输出特征图的数量可以理解为编码器可以得到多尺度的特征图。该种情况下,第三损失函数可以如前述公式一或公式二所示,第四损失函数可以如公式三所示。
公式三:
其中,Lqt表示第四损失函数,W表示特征图的宽度,H表示特征图的高度,其余参数可参考之前的描述,此处不再赘述。
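The nearest-prototype replacement and the fourth loss described around Formula 3 can be sketched as follows; since the body of Formula 3 is not reproduced above, the L2 nearest-neighbour search and the L1 averaging used here are assumptions made for the illustration.

```python
import torch

def quantize_and_loss(feat, prototypes):
    # feat: (H*W, C) pixels of the feature map F_i; prototypes: (n, C) vector set V_i.
    # Each pixel is replaced by its nearest prototype (a clustering step), and the
    # fourth loss measures the pre/post-quantization difference averaged over the
    # W*H positions.
    dist = torch.cdist(feat, prototypes)           # pairwise pixel-prototype distances
    idx = dist.argmin(dim=1)                       # nearest-prototype index per pixel
    quantized = prototypes[idx]                    # updated (quantized) feature map
    l_qt = torch.mean(torch.abs(feat - quantized)) # fourth loss (assumed L1 mean)
    return quantized, idx, l_qt
```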
可以理解的是,上述第四损失函数的公式只是举例,在实际应用中,还可以有其它形式的公式,具体对于第四损失函数的具体结构不做限定。
示例性的,第一神经网络与第二神经网络引入量子化模块后,训练过程可以如图6所示,将训练图像输入编码器得到该训练图像的特征图,量子化模块对该特征图进行更新,一方面 将更新后的特征图输入生成器生成第一前景图像,另一方面将更新后的特征图输入解码器得到检测/分割结果。将训练图像中的背景去除得到第二前景图像。再使用第三损失函数与第四损失函数训练编码器、量子化模块以及生成器,使得基于编码器、量子化模块以及输出的第一前景图像与第二前景图像之间的差异越来越小。再使用第二损失函数训练编码器与解码器,使得基于编码器与解码器输出的检测/分类结果与标签的差异越来越小。再使用第四损失函数训练特征图在量子化模块更新前后之间的差异越来越小。例如,以编码器输出三个尺度不同的特征图(如前述所提的Q0、Q1、Q2)为例,训练过程可以如图7所示,为了提升各个特征图的感受野,可以引入如图7中所示的拼接操作,也可以理解为是残差结构。
另外,上述聚类过程仅考虑每个像素和聚类中心的关系,并未考虑多个像素之间的关联,这对聚类过程是有害的,使得聚类的结果不可靠。因此,在预测每个像素的聚类中心的时候,不仅考虑当前像素,还要考虑其他相似像素的聚类中心,因此,第一神经网络与第二神经网络还可以包括赋值模块,该赋值模块用于更新特征图的索引,该索引用于量子化模块对特征图进行更新。换句话说,赋值模块可以实现不同像素的聚类中心的对齐。该情况下的训练过程可以如图8所示,将训练图像输入编码器得到该训练图像的特征图,量子化模块对该特征图的像素进行更新,得到量化向量。再将量化向量的索引输入赋值模块进行更新。使用第五损失函数训练特征图在量子化模块更新前后之间的差异越来越小。使用第五损失函数训练索引在赋值模块更新前后之间的差异越来越小,即重新计算的索引值要尽量和原来通过最近邻聚类方法得到的索引值尽可能一致。提出第五损失函数可以提升量化精度,提升生成模型
(包括编码器、生成器)对于前景图像的重建能力。
示例性的,更新索引的过程如公式四所示。
公式四:
其中,表示赋值模块更新后的索引,A表示赋值模块,l表示像素fi通过最近邻方法计算的原型索引值;表示通过最近邻方法计算的索引值,sim表示相似度计算函数,用于计算的相似度,O表示one-hot嵌入函数,可以将索引变成二值向量。
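One plausible reading of the assignment-module update described around Formula 4 is sketched below: when predicting the cluster centre of a pixel, the centres of other similar pixels are also taken into account. The dot-product/softmax similarity, the temperature and the argmax step are assumptions for the sketch and should not be read as the exact Formula 4.

```python
import torch
import torch.nn.functional as F

def reassign_indices(feat, prototypes, nn_idx, temperature=1.0):
    # feat: (N, C) pixels; prototypes: (K, C); nn_idx: (N,) nearest-neighbour indices l.
    sim = F.softmax(feat @ feat.t() / temperature, dim=-1)    # pixel-to-pixel similarity
    one_hot = F.one_hot(nn_idx, prototypes.shape[0]).float()  # O(l): index -> binary vector
    agg = sim @ one_hot                                       # aggregate neighbours' indices
    new_idx = agg.argmax(dim=-1)                              # re-computed prototype index
    return new_idx
```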
可以理解的是,上述更新索引的公式只是举例,在实际应用中,还可以有其它形式的公式,对于更新索引的具体过程不做限定。
本申请实施例中的第五损失函数可以如公式五所示。
公式五:
其中,Lalign表示第五损失函数,W表示特征图的宽度,H表示特征图的高度,其余参数可以参考前述公式四中的描述,此处不再赘述。
可以理解的是,上述第五损失函数只是举例,在实际应用中,还可以有其它形式的第五损失函数,此处对于第五损失函数的具体公式不做限定。
上述的第三损失函数、第四损失函数以及第五损失函数可以理解为是生成损失函数,用于再更新生成器的过程中使得编码器可以学习到更多的结果纹理特征,进而提升后续包括编码器的第二神经网络进行检测/分割任务的精度。
上面对神经网络的训练方法进行了详细描述,下面对本申请实施例提供的图像处理方法进行详细的介绍。该方法可以由图像处理设备执行,也可以由图像处理设备的部件(例如处理器、芯片、或芯片系统等)执行。该图像处理设备可以是云端设备(如前述图3A所示),也可以是终端设备(例如图3B所示的手机)。当然,该方法也可以是由云端设备和终端设备构成的系统执行(如前述图3A所示)。可选地,该方法可以由图像处理设备中的CPU处理,也可以由CPU和GPU共同处理,也可以不用GPU,而使用其他适合用于神经网络计算的处理器,本申请不做限制。
上述的终端设备可以是手机、平板电脑(pad)、便携式游戏机、掌上电脑(personal digital assistant,PDA)、笔记本电脑、超级移动个人计算机(ultra mobile personal computer,UMPC)、手持计算机、上网本、车载媒体播放设备、可穿戴电子设备、虚拟现实(virtual reality,VR)终端设备、增强现实(augmented reality,AR)终端设备等终端产品。
本申请实施例提供的方法所适用的应用场景可以是自动驾驶领域、铁路/电网故障检测等小样本物体检测/分割场景。请参阅图9,本申请实施例提供的图像处理方法的一个流程示意图,该方法可以包括步骤901至步骤903。下面对步骤901至步骤903进行详细说明。
步骤901,获取第一图像。
本申请实施例中,图像处理设备获取第一图像的方式有多种,可以是通过采集/拍摄的方式,也可以是通过接收其他设备发送的方式,还可以是从数据库中选取的方式等,具体此处不做限定。
可选地,若应用于自动驾驶场景,图像处理设备可以是车辆,第一图像可以是车辆实时采集的数据,也可以是周期性采集的数据,具体此处不做限定。
步骤902,基于编码器提取第一图像的第一特征图。
本申请实施例中的编码器、解码器可以是由前述图4所示实施例提供的训练方法训练所得到的。
该编码器与解码器由带有标签的训练图像、第一损失函数以及第二损失函数训练得到,该训练图像包括前景物体与背景,第一损失函数用于表示训练过程中基于编码器与生成器生成的第一前景图像与第二前景图像之间的差异,第一前景图像包括前景物体,且不包括背景,第二前景图像为训练图像扣除背景以外的图像,第二损失函数用于表示训练过程中基于编码器与解码器得到的检测/分割结果与标签之间的差异。其中,对于第一损失函数、第二损失函数等的描述可以参考前述图4所示实施例中的描述,此处不再赘述。
步骤903,基于第一特征图与解码器得到第一图像的检测/分割结果。
图像处理设备获取第一特征图之后,可以基于第一特征图与解码器得到第一图像的检测/分割结果。
本申请实施例中,基于第一特征图与解码器得到第一图像的检测/分割结果有多种方式,下面分别描述:
第一种,将第一特征图输入解码器得到检测/分割结果。
第二种,将第一特征图输入量子化模块得到第二特征图,也可以理解为是基于量子化模块更新第一特征图得到第二特征图。该量子化模块基于第四损失函数训练得到,第四损失函数用于表示在训练过程中编码器输出的训练图像的特征图在量子化模块更新前后之间的差异,再将第二特征图输入解码器得到检测/分割结果。
进一步的,上述第二种情况中,还可以使用赋值模块对第二特征图的索引进行更新,以便于量子化模块使用更新后的索引更新第一特征图得到第二特征图。该量子化模块基于第五损失函数训练得到,第五损失函数用于表示在训练过程中索引在赋值模块更新前后之间的差异。
本实施例中,各损失函数(例如第一损失函数、第二损失函数、第三损失函数、第四损失函数、第五损失函数)的描述可以参考前述图4所示实施例中的描述,此处不再赘述。
可选地,本实施例中的整体过程可以看做是将第一图像输入如前述图4所示实施例中训练好的第二神经网络中执行检测/分割任务,得到检测/分割结果。
本实施例中,通过第一损失函数与第二损失函数训练编码器,可以使得编码器可以学习到图像更多的纹理和结构特征,进而提高小样本物体检测的定位能力,提升包含该编码器的第二神经网络的检测/分割效果。
为了更直观的看出本申请实施例方法带来的有益效果,下面结合在物体检测数据集MS-COCO为例描述使用本申请实施例的方法训练出来的模型的有益效果。
首先,介绍下MS-COCO数据集,该MS-COCO数据集一共包括80k个训练样本、40k个验证样本和20k个测试样本,覆盖80个类别。其中,20个类设置为新任务类别,其余的60个类设置为原任务类别。20k个测试样本中的5k张图像中属于20个新任务类别的图像用于模型性能评价,80k个训练样本的图像用于模型训练。
在小样本物体检测中,给定一个原任务,是Ns个类别的检测任务,每个类别都具有大量标注样本。同时,还有一新任务,是Nt个类别的检测任务,每个类只有K个标注样本。其中,原任务和新任务的类别没有重合。标准小样本物体检测的目标是为新任务学习一个检测器。
该检测器有两种情况,下面分别描述:
第一种,检测器为一个Nt类的检测器。训练过程如下:
1.原任务预训练:首先用Ns个类别的训练数据充分训练图5/图6所示的网络,得到一个Ns个类别的检测器。
2.新任务微调:然后修改网络的最后一层,使其输出为Nt个神经元,除了网络最后一层随机初始化之外,其他层均利用Ns个类别的检测器的参数进行初始化。利用新任务的少量数据微调网络参数。
使用训练好的检测器(即第二神经网络)进行小样本物体检测,测试数据来自新任务的类别。衡量指标包括不同交并比下的平均准确率(average precision,AP),一般情况下,上述的不同交并比为0.5-0.95之间以0.05间隔所取的10个值,每个值对应一个准确率,10个值取平均为平均准确率。例如:AP50指预测框和目标框的交并比大于0.5才作为检出时AP的值,AP75指预测框和目标框的交并比大于0.75才作为检出时AP的值,AP也可以理解为是 指不同交并比阈值下AP的均值。这些指标越大说明检测模型性能越好。
分别使用现有小样本物体检测算法与本申请实施例提供的方法训练得到的模型:(也可以称为GLFR)在MS-COCO数据集上进行实验,实验结果如表1所示。其中,现有小样本物体检测算法包括:少样本迁移检测器(low-shot transfer detector,LSTD)、元区域检测器(meta region-based convolutional neural network,Meta RCNN)、变换不变的小样本检测器(Transformation invariant Few-shot object detection,TIP)、基于上下文融合的密集关联蒸馏(Dense relation distillation with context-aware aggregation,DCNet)、查询自适应的小样本物体检测器(Query adaptive few-shot object detector,QA-FewDet)、两阶段微调算法(two-stage finetune algorithm,TFA)、基于多尺度正例精化的小样本物体检测(Multi-scale positive sample refinement for few-shot object detection,MSPR)、基于类别间隔均衡的小样本物体检测器(Class margin equilibrium for few-shot object detection,CME)、基于分类细化和干扰器再处理的小样本目标检测(Few-shot object detection via classification refinement and distractor retreatment,CRDR)、基于语义关系推理的小样本物体检测器(Semantic relation reasoning for few-shot object detection,SRR-FSOD)、通用原型增强的小样本检测器(Universal-prototype enhancing for few-shot object detection,FSODup)、基于对比候选编码的小样本物体检测(few-shot object detection via contrastive proposal encoding,FSCE)、解耦的快速区域神经网络(Decoupled faster R-CNN,DeFRCN)。
表1

其中,表1给出了MS-COCO数据集上的实验结果,分别在每个新类给出10张和30张图像(K-shot=10或者30)作为训练集。从表1可以看出,在标准小样本物体检测的设置下,本申请实施例提供的方法训练得到的模型(即GLFR)显著超过了其他小样本物体检测算法。
第二种,检测器为一个Ns+Nt个类别的同时检测类的检测器。训练过程如下:
1.原任务预训练:首先用Ns个类别的训练数据充分训练图5/图6所示的网络,得到一个Ns个类别的检测器。
2.新任务微调:然后修改网络的最后一层,使其输出为Nt+Ns个神经元,除了网络最后一层随机初始化之外,其他层均利用Ns个类别的检测器的参数进行初始化。对于原任务的Ns类别,每个类别从训练数据中随机采样K个样本,对于新任务,则利用所有训练数据,组合这两部分数据构成一个均衡的微调数据集,利用该数据集微调整个网络的参数。
该种情况仍在物体检测数据集MS-COCO进行评测。设置和前述第一种情况类似,区别在于验证集中的5k图像中均用于模型性能评价。分别使用现有小样本物体检测算法与GLFR在MS-COCO数据集上进行实验,实验结果如表2所示:
表2
其中,表2给出了MS-COCO数据集上的实验结果,分别在每个新类给出10张和30张图像(K-shot=10或者30)作为训练集。从表2可以看出,在标准小样本物体检测的设置下,本申请实施例提供的方法训练得到的模型(即GLFR)显著超过了其他小样本物体检测算法。
上面对本申请实施例中的图像处理方法进行了描述,下面对本申请实施例中的图像处理设备进行描述,请参阅图10,本申请实施例中图像处理设备的一个实施例包括:
获取单元1001,用于获取带有标签的训练图像,训练图像包括前景物体与背景;
训练单元1002,用于基于训练图像、第一损失函数以及第二损失函数训练第一神经网络, 得到第二神经网络,第二神经网络用于实现图像的检测/分割任务,第一神经网络包括编码器、解码器以及生成器,第二神经网络包括编码器与解码器;第一损失函数用于表示训练过程中基于第一神经网络中编码器与生成器生成的第一前景图像与第二前景图像之间的差异,第一前景图像包括前景物体,且不包括背景,第二前景图像为训练图像中扣除背景之后的图像,第二损失函数用于表示训练过程中基于第二神经网络中编码器与解码器得到的检测/分割结果与标签之间的差异。
本实施例中,图像处理设备中各单元所执行的操作与前述图1至图8所示实施例中描述的类似,此处不再赘述。
本实施例中,训练单元1002通过第一损失函数与检测/分割的第二损失函数一起训练编码器,由于第一损失函数用于重建前景图像,可以使得编码器可以捕捉图像更多的纹理和结构特征,进而提高小样本物体检测的定位能力,提升包含该编码器的第二神经网络的检测/分割效果。
请参阅图11,本申请实施例中图像处理设备的一个实施例包括:
获取单元1101,用于获取第一图像;
提取单元1102,用于基于编码器提取第一图像的第一特征图;
处理单元1103,用于基于第一特征图与解码器得到第一图像的检测/分割结果;
编码器与解码器由带有标签的训练图像、第一损失函数以及第二损失函数训练得到,训练图像包括前景物体与背景,第一损失函数用于表示训练过程中基于编码器与生成器生成的第一前景图像与第二前景图像之间的差异,第一前景图像包括前景物体,且不包括背景,第二前景图像为训练图像扣除背景以外的图像,第二损失函数用于表示训练过程中基于编码器与解码器得到的检测/分割结果与标签之间的差异。
本实施例中,图像处理设备中各单元所执行的操作与前述图9所示实施例中描述的类似,此处不再赘述。
本实施例中,通过第一损失函数与第二损失函数训练编码器,可以使得编码器可以学习到图像更多的纹理和结构特征,进而提高小样本物体检测的定位能力,提升包含该编码器的第二神经网络的检测/分割效果。
参阅图12,本申请提供的另一种图像处理设备的结构示意图。该图像处理设备可以包括处理器1201、存储器1202和通信端口1203。该处理器1201、存储器1202和通信端口1203通过线路互联。其中,存储器1202中存储有程序指令和数据。
存储器1202中存储了前述图1至图9所示对应的实施方式中,由图像处理设备执行的步骤对应的程序指令以及数据。
处理器1201,用于执行前述图1至图9所示实施例中任一实施例所示的由图像处理设备执行的步骤。
通信端口1203可以用于进行数据的接收和发送,用于执行前述图1至图9所示实施例中任一实施例中与获取、发送、接收相关的步骤。
一种实现方式中,图像处理设备可以包括相对于图12更多或更少的部件,本申请对此仅 仅是示例性说明,并不作限定。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统,装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统,装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(ROM,read-only memory)、随机存取存储器(RAM,random access memory)、磁碟或者光盘等各种可以存储程序代码的介质。

Claims (19)

  1. 一种图像处理方法,其特征在于,包括:
    获取带有标签的训练图像,所述训练图像包括前景物体与背景;
    基于所述训练图像、第一损失函数以及第二损失函数训练第一神经网络,得到第二神经网络,所述第二神经网络用于实现图像的检测/分割任务,所述第一神经网络包括编码器、解码器以及生成器,所述第二神经网络包括所述编码器与所述解码器;所述第一损失函数用于表示训练过程中基于所述第一神经网络中所述编码器与所述生成器生成的第一前景图像与第二前景图像之间的差异,所述第一前景图像包括所述前景物体,且不包括所述背景,所述第二前景图像为所述训练图像中扣除所述背景之后的图像,所述第二损失函数用于表示训练过程中基于所述第二神经网络中所述编码器与所述解码器得到的检测/分割结果与所述标签之间的差异。
  2. 根据权利要求1所述的方法,其特征在于,所述第一神经网络与所述第二神经网络还包括量子化模块,所述量子化模块用于将编码器输出的特征图进行更新,并将更新后的特征图分别输入至所述解码器与所述生成器。
  3. 根据权利要求2所述的方法,其特征在于,所述第一损失函数包括第三损失函数与第四损失函数,所述第三损失函数用于表示所述第一前景图像与所述第二前景图像之间的差异,所述第四损失函数用于表示在训练过程中所述特征图在所述量子化模块更新前后之间的差异。
  4. 根据权利要求3所述的方法,其特征在于,所述第一神经网络与所述第二神经网络还包括赋值模块,所述赋值模块用于更新所述特征图的索引,所述索引用于所述量子化模块对所述特征图进行更新。
  5. 根据权利要求4所述的方法,其特征在于,所述第一损失函数还包括第五损失函数,所述第五损失函数用于表示在训练过程中所述索引在所述赋值模块更新前后之间的差异。
  6. 一种图像处理方法,其特征在于,包括:
    获取第一图像;
    基于编码器提取所述第一图像的第一特征图;
    基于所述第一特征图与解码器得到所述第一图像的检测/分割结果;
    所述编码器与所述解码器由带有标签的训练图像、第一损失函数以及第二损失函数训练得到,所述训练图像包括前景物体与背景,所述第一损失函数用于表示训练过程中基于所述编码器与生成器生成的第一前景图像与第二前景图像之间的差异,所述第一前景图像包括所述前景物体,且不包括所述背景,所述第二前景图像为所述训练图像扣除所述背景以外的图像,所述第二损失函数用于表示训练过程中基于所述编码器与所述解码器得到的检测/分割结果与所述标签之间的差异。
  7. 根据权利要求6所述的方法,其特征在于,所述基于所述第一特征图与解码器得到所述第一图像的检测/分割结果,包括:
    将所述第一特征图输入所述解码器得到所述检测/分割结果。
  8. 根据权利要求6所述的方法,其特征在于,所述基于所述第一特征图与解码器得到所述第一图像的检测/分割结果,包括:
    基于量子化模块更新第一特征图,得到第二特征图,所述量子化模块基于第四损失函数 训练得到,所述第四损失函数用于表示在训练过程中所述编码器输出的所述训练图像的特征图在所述量子化模块更新前后之间的差异;
    将所述第二特征图输入所述解码器得到所述检测/分割结果。
  9. 一种图像处理设备,其特征在于,所述图像处理设备包括:
    获取单元,用于获取带有标签的训练图像,所述训练图像包括前景物体与背景;
    训练单元,用于基于所述训练图像、第一损失函数以及第二损失函数训练第一神经网络,得到第二神经网络,所述第二神经网络用于实现图像的检测/分割任务,所述第一神经网络包括编码器、解码器以及生成器,所述第二神经网络包括所述编码器与所述解码器;所述第一损失函数用于表示训练过程中基于所述第一神经网络中所述编码器与所述生成器生成的第一前景图像与第二前景图像之间的差异,所述第一前景图像包括所述前景物体,且不包括所述背景,所述第二前景图像为所述训练图像中扣除所述背景之后的图像,所述第二损失函数用于表示训练过程中基于所述第二神经网络中所述编码器与所述解码器得到的检测/分割结果与所述标签之间的差异。
  10. 根据权利要求9所述的图像处理设备,其特征在于,所述第一神经网络与所述第二神经网络还包括量子化模块,所述量子化模块用于将编码器输出的特征图进行更新,并将更新后的特征图分别输入至所述解码器与所述生成器。
  11. 根据权利要求10所述的图像处理设备,其特征在于,所述第一损失函数包括第三损失函数与第四损失函数,所述第三损失函数用于表示所述第一前景图像与所述第二前景图像之间的差异,所述第四损失函数用于表示在训练过程中所述特征图在所述量子化模块更新前后之间的差异。
  12. 根据权利要求11所述的图像处理设备,其特征在于,所述第一神经网络与所述第二神经网络还包括赋值模块,所述赋值模块用于更新所述特征图的索引,所述索引用于所述量子化模块对所述特征图进行更新。
  13. 根据权利要求12所述的图像处理设备,其特征在于,所述第一损失函数还包括第五损失函数,所述第五损失函数用于表示在训练过程中所述索引在所述赋值模块更新前后之间的差异。
  14. 一种图像处理设备,其特征在于,所述图像处理设备包括:
    获取单元,用于获取第一图像;
    提取单元,用于基于编码器提取所述第一图像的第一特征图;
    处理单元,用于基于所述第一特征图与解码器得到所述第一图像的检测/分割结果;
    所述编码器与所述解码器由带有标签的训练图像、第一损失函数以及第二损失函数训练得到,所述训练图像包括前景物体与背景,所述第一损失函数用于表示训练过程中基于所述编码器与生成器生成的第一前景图像与第二前景图像之间的差异,所述第一前景图像包括所述前景物体,且不包括所述背景,所述第二前景图像为所述训练图像扣除所述背景以外的图像,所述第二损失函数用于表示训练过程中基于所述编码器与所述解码器得到的检测/分割结果与所述标签之间的差异。
  15. 根据权利要求14所述的图像处理设备,其特征在于,所述处理单元,具体用于将所述第一特征图输入所述解码器得到所述检测/分割结果。
  16. 根据权利要求14所述的图像处理设备,其特征在于,所述处理单元,具体用于基于量子化模块更新第一特征图,得到第二特征图,所述量子化模块基于第四损失函数训练得到,所述第四损失函数用于表示在训练过程中所述编码器输出的所述训练图像的特征图在所述量子化模块更新前后之间的差异;
    所述处理单元,具体用于将所述第二特征图输入所述解码器得到所述检测/分割结果。
  17. 一种图像处理设备,其特征在于,包括:处理器,所述处理器与存储器耦合,所述存储器用于存储程序或指令,当所述程序或指令被所述处理器执行时,使得所述图像处理设备执行如权利要求1至8中任一项所述的方法。
  18. 一种计算机存储介质,其特征在于,包括计算机指令,当所述计算机指令在终端设备上运行时,使得所述终端设备执行如权利要求1至8中任一项所述的方法。
  19. 一种计算机程序产品,其特征在于,当所述计算机程序产品在计算机上运行时,使得所述计算机执行如权利要求1至8中任一项所述的方法。
PCT/CN2023/086194 2022-04-29 2023-04-04 一种图像处理方法及相关设备 WO2023207531A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210468931.9A CN117036658A (zh) 2022-04-29 2022-04-29 一种图像处理方法及相关设备
CN202210468931.9 2022-04-29

Publications (1)

Publication Number Publication Date
WO2023207531A1 true WO2023207531A1 (zh) 2023-11-02

Family

ID=88517412

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/086194 WO2023207531A1 (zh) 2022-04-29 2023-04-04 一种图像处理方法及相关设备

Country Status (2)

Country Link
CN (1) CN117036658A (zh)
WO (1) WO2023207531A1 (zh)


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110188760A (zh) * 2019-04-01 2019-08-30 上海卫莎网络科技有限公司 一种图像处理模型训练方法、图像处理方法及电子设备
WO2021063476A1 (en) * 2019-09-30 2021-04-08 Toyota Motor Europe Method for training a generative adversarial network, modified image generation module and system for detecting features in an image
US20210398334A1 (en) * 2020-06-22 2021-12-23 Beijing Baidu Netcom Science And Technology Co., Ltd. Method for creating image editing model, and electronic device and storage medium thereof
CN112508991A (zh) * 2020-11-23 2021-03-16 电子科技大学 一种前后景分离的熊猫照片卡通化方法
CN112734881A (zh) * 2020-12-01 2021-04-30 北京交通大学 基于显著性场景图分析的文本合成图像方法及系统
CN112990211A (zh) * 2021-01-29 2021-06-18 华为技术有限公司 一种神经网络的训练方法、图像处理方法以及装置
CN113221757A (zh) * 2021-05-14 2021-08-06 上海交通大学 一种改善行人属性识别准确率的方法、终端及介质
CN113627421A (zh) * 2021-06-30 2021-11-09 华为技术有限公司 一种图像处理方法、模型的训练方法以及相关设备

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117437494A (zh) * 2023-12-20 2024-01-23 量子科技长三角产业创新中心 一种图像分类方法、系统、电子设备及存储介质
CN117437494B (zh) * 2023-12-20 2024-04-16 量子科技长三角产业创新中心 一种图像分类方法、系统、电子设备及存储介质

Also Published As

Publication number Publication date
CN117036658A (zh) 2023-11-10

Similar Documents

Publication Publication Date Title
US20220092351A1 (en) Image classification method, neural network training method, and apparatus
EP3968179A1 (en) Place recognition method and apparatus, model training method and apparatus for place recognition, and electronic device
WO2021155792A1 (zh) 一种处理装置、方法及存储介质
JP2022505775A (ja) 画像分類モデルの訓練方法、画像処理方法及びその装置、並びにコンピュータプログラム
WO2022042713A1 (zh) 一种用于计算设备的深度学习训练方法和装置
WO2021147325A1 (zh) 一种物体检测方法、装置以及存储介质
EP4163831A1 (en) Neural network distillation method and device
WO2021164750A1 (zh) 一种卷积层量化方法及其装置
EP4006776A1 (en) Image classification method and apparatus
US20220148291A1 (en) Image classification method and apparatus, and image classification model training method and apparatus
Ayachi et al. Pedestrian detection based on light-weighted separable convolution for advanced driver assistance systems
CN110222718B (zh) 图像处理的方法及装置
EP4322056A1 (en) Model training method and apparatus
WO2021238548A1 (zh) 区域识别方法、装置、设备及可读存储介质
CN113191489B (zh) 二值神经网络模型的训练方法、图像处理方法和装置
WO2023165361A1 (zh) 一种数据处理方法及相关设备
WO2021227787A1 (zh) 训练神经网络预测器的方法、图像处理方法及装置
WO2021190433A1 (zh) 更新物体识别模型的方法和装置
EP4318313A1 (en) Data processing method, training method for neural network model, and apparatus
WO2023083030A1 (zh) 一种姿态识别方法及其相关设备
DE102022100360A1 (de) Framework für maschinelles lernen angewandt bei einer halbüberwachten einstellung, um instanzenverfolgung in einer sequenz von bildframes durchzuführen
WO2023207531A1 (zh) 一种图像处理方法及相关设备
Yuan et al. Low-res MobileNet: An efficient lightweight network for low-resolution image classification in resource-constrained scenarios
WO2022217434A1 (zh) 感知网络、感知网络的训练方法、物体识别方法及装置
WO2023174256A1 (zh) 一种数据压缩方法以及相关设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23794965

Country of ref document: EP

Kind code of ref document: A1