CN114299304A - Image processing method and related equipment


Info

Publication number
CN114299304A
Authority
CN
China
Prior art keywords
training sample
image
training
neural network
processed
Prior art date
Legal status
Granted
Application number
CN202111538907.XA
Other languages
Chinese (zh)
Other versions
CN114299304B (en)
Inventor
孙镜涵
魏东
李悦翔
卢东焕
宁慕楠
何楠君
马锴
王连生
郑冶枫
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202111538907.XA priority Critical patent/CN114299304B/en
Publication of CN114299304A publication Critical patent/CN114299304A/en
Application granted granted Critical
Publication of CN114299304B publication Critical patent/CN114299304B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Analysis (AREA)

Abstract

The embodiments of the application disclose an image processing method and related equipment, applicable to fields and scenarios such as cloud technology, artificial intelligence, blockchain, Internet of Vehicles, intelligent transportation, smart home, and health management. The method includes: acquiring an image to be processed; calling an image processing model to perform feature extraction on the image to be processed to obtain feature information of the image to be processed, wherein the image processing model is obtained by jointly training a first neural network and a second neural network with a training sample set, the training sample set includes a first training sample, a second training sample and annotation data, the second training sample is generated based on the first training sample, and the joint training includes supervised training and contrast training; and determining a processing result of the image to be processed based on the feature information of the image to be processed and a target task, wherein the target task includes one or more of a classification task, a segmentation task and a detection task. The method and the device can comprehensively and effectively extract the useful information of the image.

Description

Image processing method and related equipment
Technical Field
The present invention relates to the field of computer technologies, and in particular, to an image processing method and a related device.
Background
In order to extract more valuable feature information from an image, some current extraction methods adopt contrast learning. However, some existing contrast learning methods omit the supervised pre-training used to extract high-order feature information of the image, or mainly learn low-level and mid-level characteristics of the image, so the high-level semantic information of the image is lost. Conversely, although a supervised pre-training model can extract high-order feature information of the image, it usually focuses only on a limited image region related to the target task and ignores valuable information in the remaining regions of the image. Therefore, how to enable a model to comprehensively and effectively extract the useful information of an image is a problem to be solved urgently.
Disclosure of Invention
The embodiment of the invention provides an image processing method and related equipment, which can comprehensively and effectively extract useful information of an image, thereby improving the accuracy of task processing.
In one aspect, an embodiment of the present application discloses an image processing method, including:
acquiring an image to be processed;
calling an image processing model to perform feature extraction on the image to be processed to obtain feature information of the image to be processed; the image processing model is obtained by performing joint training on a first neural network and a second neural network by using a training sample set, wherein the training sample set comprises a first training sample, a second training sample and marking data, the second training sample is generated based on the first training sample, and the joint training comprises supervised training and contrast training;
determining a processing result of the image to be processed based on the feature information of the image to be processed and a target task, wherein the target task comprises one or more of a classification task, a segmentation task and a detection task.
On the other hand, an embodiment of the present application discloses an image processing apparatus, including:
the acquisition unit is used for acquiring an image to be processed;
the processing unit is used for calling an image processing model to perform feature extraction on the image to be processed to obtain feature information of the image to be processed; the image processing model is obtained by performing joint training on a first neural network and a second neural network by using a training sample set, wherein the training sample set comprises a first training sample, a second training sample and marking data, the second training sample is generated based on the first training sample, and the joint training comprises supervised training and contrast training;
the determining unit is used for determining a processing result of the image to be processed based on the characteristic information of the image to be processed and a target task, wherein the target task comprises one or more of a classification task, a segmentation task and a detection task.
The embodiment of the present application also discloses an image processing apparatus, including a memory and a processor, wherein the memory stores an image processing program, and the processor executes the image processing program to perform the steps of the image processing method described above.
The embodiment of the application also discloses a computer-readable storage medium, which stores a computer program; the computer program, when executed by a processor, performs the steps of the image processing method described above.
Accordingly, the present application also discloses a computer program product or a computer program, which includes computer instructions, which are stored in a computer readable storage medium. The processor of the image processing apparatus reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the image processing apparatus performs the image processing method described above.
The embodiments of the invention have the following beneficial effects: an image to be processed is acquired, and an image processing model is called to perform feature extraction on it to obtain its feature information, wherein the image processing model is obtained by jointly training a first neural network and a second neural network with a training sample set, the training sample set includes a first training sample, a second training sample and annotation data, and the second training sample is generated based on the first training sample; a processing result of the image to be processed is then determined based on its feature information and a target task, wherein the target task includes one or more of a classification task, a segmentation task and a detection task. Because the second training sample, which is obtained by suppressing the high-response region of the image most relevant to the task, is considered during the joint training, the model can extract more valuable image information from the other image regions; the useful information of the image can therefore be comprehensively and effectively extracted, and the accuracy of task processing is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a block diagram of an image processing system according to an embodiment of the present disclosure;
FIG. 2 is a schematic flow chart diagram of an image processing method disclosed in an embodiment of the present application;
FIG. 3a is a schematic diagram of a training process of a first stage of an image processing model according to an embodiment of the present application;
FIG. 3b is a schematic diagram of a second stage training process of an image processing model according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a processing result for determining an image to be processed according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a classification task of an image to be processed according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a task of segmenting an image to be processed according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a task of detecting a to-be-processed image according to an embodiment of the present disclosure;
FIG. 8 is a schematic flow chart illustrating a process for performing a target task training using the obtained image processing model according to an embodiment of the present disclosure;
FIG. 9 is a schematic flow chart diagram illustrating a method for training an image processing model according to an embodiment of the present disclosure;
fig. 10 is an overall framework structure diagram for processing an image to be processed according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of an image processing apparatus disclosed in an embodiment of the present application;
fig. 12 is a schematic structural diagram of an image processing apparatus disclosed in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments of the present invention. It is obvious that the described embodiments are only a part, not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
The terms referred to in the present application have the following meanings:
and (3) comparison and learning: a task-independent unsupervised representation learning method. Features of the related data set are learned without image tags by learning similarities and differences between images.
Pre-training: a model is trained by performing a specific task (typically image classification) on a large amount of data before performing the target task.
Low/medium/high characterization: low-level features are typically some small detail in the image, such as edges, colors, etc.; high-level features are features with semantic information; intermediate level features are in between.
And (3) downstream tasks: supervised learning tasks utilizing pre-trained models or components.
High response region/low response region: the high-response area refers to an image area of the high-level features extracted by the network, which has a larger influence on the final classification. Otherwise, the low response region is obtained.
Multilayer perceptron: an artificial neural network of a forward structure is composed of a plurality of fully connected layers.
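For illustration of the last term only, a minimal PyTorch sketch of such a multilayer perceptron used as a projection head follows; the layer sizes and names are assumptions for illustration, not values fixed by this application.

```python
# Hypothetical sketch of a multilayer perceptron projection head in PyTorch;
# layer sizes are illustrative assumptions, not values fixed by this application.
import torch.nn as nn

class MLPHead(nn.Module):
    def __init__(self, in_dim=2048, hidden_dim=512, out_dim=128):
        super().__init__()
        # Fully connected layers with a non-linearity in between, matching the
        # "forward structure composed of fully connected layers" described above.
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x):
        return self.net(x)
```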
Referring to fig. 1, fig. 1 is a schematic diagram of an image processing system according to an embodiment of the present disclosure, which may include a client 101 and a server 102. The server 102 may obtain the image to be processed from the client 101, and further, the server 102 may execute the image processing method described in the embodiment of the present application, and call the image processing model to perform feature extraction on the image to be processed, so that more valuable image information may be extracted from other image areas, and further, useful information of the image may be comprehensively and effectively extracted, thereby improving the accuracy of task processing.
It should be noted that the client may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, a smart car, or the like, but is not limited thereto; the server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud service, a cloud database, cloud computing, a cloud function, cloud storage, network service, cloud communication, middleware service, domain name service, security service, CDN, and a big data and artificial intelligence platform.
In an alternative embodiment, the server 102 may obtain the image to be processed, for example, the client 101 may send the image to be processed to the server 102; the server 102 may call an image processing model to perform feature extraction on the image to be processed, so as to obtain feature information of the image to be processed; the image processing model is obtained by performing joint training on a first neural network and a second neural network by using a training sample set, wherein the training sample set comprises a first training sample, a second training sample and marking data, the second training sample is generated based on the first training sample, and the joint training comprises supervised training and contrast training; and determining a processing result of the image to be processed based on the characteristic information of the image to be processed and a target task, wherein the target task comprises one or more of a classification task, a segmentation task and a detection task.
Optionally, the client 101 may also obtain an image to be processed, and call an image processing model to perform feature extraction on the image to be processed, so as to obtain feature information of the image to be processed; determining a processing result of the image to be processed based on the feature information of the image to be processed and the target task, wherein the image processing model is provided to the client 101 by the server 102 after performing joint training on the first neural network and the second neural network by using the training sample set to obtain the image processing model.
According to the method and the device, the high-response region most relevant to the task in the image is suppressed during the joint training, so that the image processing model can extract more valuable image information from other image regions; the useful information of the image can thus be comprehensively and effectively extracted, improving the accuracy of task processing.
The embodiments of the present application will be described in detail below with reference to the accompanying drawings.
Referring to fig. 2, fig. 2 is a schematic flowchart of an image processing method according to an embodiment of the present disclosure, where the image processing method shown in fig. 2 is described from the perspective of a client or a server, and the method may include, but is not limited to, the following steps:
s201, acquiring an image to be processed.
In the embodiment of the application, the image to be processed may be a group of images extracted from a video, including a group of images extracted from a live video; the image to be processed may also be a group of photos taken by a shooting device, for example, a user photographs a real object with a shooting device and the photos taken are used as the image to be processed; the image to be processed may also be sampled from public data sets (e.g., miniImageNet, tieredImageNet, CIFAR-FS, etc.).
S202, calling an image processing model to perform feature extraction on the image to be processed to obtain feature information of the image to be processed.
In the embodiment of the present application, the feature information of the image to be processed may include the feature information of the high-response region and the feature information of the low-response region in the image to be processed. The high-response region may refer to the image region having the largest influence on the final task; the other regions are low-response regions.
In an alternative implementation, the training process of the image processing model may refer to fig. 3a and fig. 3b. Fig. 3a is a schematic diagram of the first-stage training process of the image processing model provided in an embodiment of the present application; as shown in fig. 3a, in the first stage the first neural network 302 is supervised-trained with the first training sample and its corresponding annotation data 301 to obtain the trained first neural network 303. Fig. 3b is a schematic diagram of the second-stage training process of the image processing model provided in an embodiment of the present application; as shown in fig. 3b, in the second stage the trained first neural network 303 and the second neural network 306 are jointly trained with a training sample set. The training sample set comprises the first training sample and its annotation data 301 and the second training sample and its annotation data 304; the second training sample 305 may be generated based on the first training sample 301, and the joint training comprises supervised training 309 and contrast training 311.
For example, the second training sample may be obtained by suppressing the high-response region in the first training sample; that is, the second training sample mainly contains the image information of the low-response region.
For another example, the joint training may include: performing supervised training 310 on the first neural network 302 with the first training sample and its corresponding annotation data 301 to obtain the trained first neural network 303; performing supervised training 310 on the trained first neural network 303 with the first training sample 301, the second training sample 304 and the corresponding annotation data to obtain a first feature projection 307, and, while this supervised training 310 is performed, training the second neural network 306 with the second training sample 305 to obtain a second feature projection 308; and performing unsupervised contrast training with the first feature projection and the second feature projection.
The first neural network 302 and the second neural network 306 may each be a convolutional neural network, a recurrent neural network, a deep neural network, etc.; the annotation data may be obtained by a one-shot method or by manual annotation based on the information of the image to be processed.
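For orientation, the two training stages described above can be sketched roughly as follows, assuming PyTorch; the split of the first neural network into a backbone that outputs flat feature vectors plus a classifier head, and the helpers make_suppressed_sample and contrastive_loss, are placeholder assumptions standing in for the steps detailed later in this description.

```python
# Rough two-stage training skeleton; a sketch under the stated assumptions, not the
# exact training procedure of this application.
import torch.nn.functional as F

def train_image_processing_model(backbone_theta, classifier, backbone_xi, predictor_g,
                                 loader, make_suppressed_sample, contrastive_loss,
                                 optimizer, epochs_stage1=10, epochs_stage2=10):
    # Stage 1: supervised training of the first neural network (backbone_theta + classifier).
    for _ in range(epochs_stage1):
        for x, y in loader:
            loss = F.cross_entropy(classifier(backbone_theta(x)), y)
            optimizer.zero_grad(); loss.backward(); optimizer.step()

    # Stage 2: joint supervised + contrast training with the twin network backbone_xi.
    for _ in range(epochs_stage2):
        for x, y in loader:
            x_sup = make_suppressed_sample(x)                  # second training sample x'
            feat_x = backbone_theta(x)                         # first feature projection z
            feat_xs = backbone_theta(x_sup)
            sup_loss = (F.cross_entropy(classifier(feat_x), y)
                        + F.cross_entropy(classifier(feat_xs), y))
            z_prime = predictor_g(backbone_xi(x_sup))          # second feature projection z'
            loss = sup_loss + contrastive_loss(feat_x, z_prime)
            optimizer.zero_grad(); loss.backward(); optimizer.step()
```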
S203, determining a processing result of the image to be processed based on the feature information of the image to be processed and a target task, wherein the target task comprises one or more of a classification task, a segmentation task and a detection task.
In the embodiment of the present application, the processing result may be one or more of a classification result, a segmentation result, and a detection result for the image to be processed. Referring to fig. 4, fig. 4 is a schematic diagram illustrating a processing result of determining an image to be processed according to an embodiment of the present application, and as shown in fig. 4, an image processing model is invoked to perform feature extraction on the image to be processed, and the extracted features are imported into corresponding task processing modules according to target tasks, so as to determine the processing result of the image to be processed.
For example, if the target task is a classification task, refer to fig. 5; fig. 5 is a schematic diagram of a classification task for an image to be processed according to an embodiment of the present application. As shown in fig. 5, an image 501 of a "dog" to be processed is fed into the image processing model 502; the image processing model 502 is called to perform feature extraction on the image 501 to obtain the feature information of the image; the extracted feature information is fed into the classifier 503 following the image processing model, and the classifier 503 then outputs the prediction result of whether the image is an image of a "dog", a "cat" or a "monkey".
For example, if the target task is a segmentation task, refer to fig. 6; fig. 6 is a schematic diagram of a segmentation task for an image to be processed according to an embodiment of the present application. As shown in fig. 6, the image 601 to be processed is fed into the image processing model 602 to obtain the feature information of the image; the extracted feature information is fed into the segmenter 603, which outputs the classification result 604 of each pixel in the image to be processed, giving the final segmentation result.
For example, if the target task is a detection task, refer to fig. 7; fig. 7 is a schematic diagram of a detection task for an image to be processed according to an embodiment of the present application. As shown in fig. 7, the image 701 of a "cat" to be processed is fed into the CNN feature extraction module 703 in the image processing model 702; the CNN feature extraction module 703 is called to perform CNN feature extraction on the image 701 to obtain the feature information of the image; the extracted feature information is processed together with the candidate-box image regions of the image, and the detection box 704 of the "cat" is output, giving the detection result for the image of the "cat" to be processed.
In an optional implementation manner, the image to be processed is carried in an image processing request of the client, where the image processing request further includes a task identifier of a target task; the determining the processing result of the image to be processed based on the feature information of the image to be processed and the target task comprises the following steps: according to the task identification of the target task, determining a task processing module corresponding to the target task from one or more task processing modules included in the image processing model; calling a task processing module corresponding to the target task to process the characteristic information of the image to be processed to obtain a corresponding processing result; and sending the processing result of the image to be processed to the client.
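A minimal sketch of this dispatch by task identifier is given below, assuming PyTorch; the head names and the task identifiers are illustrative assumptions rather than values fixed by this application.

```python
# Minimal sketch of dispatching extracted feature information to a task-specific
# processing module selected by a task identifier.
import torch.nn as nn

class ImageProcessingModel(nn.Module):
    def __init__(self, backbone, classifier, segmenter, detector):
        super().__init__()
        self.backbone = backbone
        self.heads = nn.ModuleDict({
            "classification": classifier,
            "segmentation": segmenter,
            "detection": detector,
        })

    def forward(self, image, task_id):
        features = self.backbone(image)        # feature information of the image to be processed
        return self.heads[task_id](features)   # processing result for the requested target task
```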
In an optional implementation, referring to fig. 8, fig. 8 is a schematic flowchart of performing target-task training with the obtained image processing model according to an embodiment of the present disclosure. As shown in fig. 8, a front end A receives target-task training data and labels and uploads them to the back end; the back end performs the target-task training with the image processing model obtained in the embodiment of the present application, tests the target model, and finally outputs the model for the target task to a front end B, where front end A and front end B may or may not be the same terminal.
In the embodiment of the application, in the process of jointly training the image processing model, the high-response region most relevant to the task in the image is suppressed, so that the model can extract more valuable image information from other image regions, the useful information of the image can be comprehensively and effectively extracted, the obtained model is used as an initialization model of other downstream tasks (classification task, segmentation task, detection task and the like), and the accuracy of the downstream tasks is improved.
The above describes the process of processing the image to be processed by using the image processing method, and the following describes the training mode of the image processing model.
Fig. 9 is a schematic flowchart of a training method of an image processing model according to an embodiment of the present disclosure. The training method includes, but is not limited to, the following steps:
and S901, obtaining a first training sample and corresponding marking data.
In this embodiment of the application, the first training sample may be a set of images extracted from a video, including a set of images extracted from a live video; the first training sample may also be a set of photos taken by a shooting device, for example, a user photographs a real object with a shooting device and the photos taken are used as the first training sample; the first training sample may also be sampled from public data sets (e.g., miniImageNet, tieredImageNet and CIFAR-FS). The annotation data corresponding to the first training sample may be data generated by processing the first training sample with a one-shot method.
S902, training the first neural network based on the first training sample and the corresponding marking data to obtain the trained first neural network.
In this embodiment of the application, the trained first neural network may be obtained by supervised training of the first neural network, minimizing a cross-entropy loss function, based on the first training sample and the corresponding annotation data.
S903, generating a second training sample based on the trained first neural network and the first training sample.
In this embodiment of the application, generating a second training sample based on the trained first neural network and the first training sample may include: processing the first training sample with the trained first neural network to obtain a class activation map (class activation heat map) of the first training sample; and generating the second training sample based on the first training sample and the class activation map.
In an alternative embodiment, processing the first training sample with the trained first neural network to obtain the class activation map of the first training sample may include: after the trained first neural network is obtained, feeding the first training sample into the trained first neural network for feature extraction, and obtaining the class activation map of the first training sample based on the extracted features.
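The following is a minimal sketch, under the standard class-activation-map assumption of a backbone ending in global average pooling followed by a single fully connected classifier, of how such a map could be computed from the extracted features; it is an illustration, not the exact computation prescribed by this application.

```python
# Minimal class activation map (CAM) sketch, assuming PyTorch and the standard CAM setting.
import torch

def class_activation_map(feature_maps, fc_weight, class_index):
    # feature_maps: (C, H, W) output of the last convolutional layer for one sample
    # fc_weight:    (num_classes, C) weight of the final fully connected layer
    # Weighted sum of the channels with the weights of the target class gives the map.
    weights = fc_weight[class_index]                        # (C,)
    cam = torch.einsum("c,chw->hw", weights, feature_maps)  # (H, W)
    return torch.relu(cam)
```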
In an alternative embodiment, generating the second training sample based on the first training sample and the class activation map may include: scaling the class activation map based on the image size of the first training sample to obtain a scaled class activation map; normalizing the scaled class activation map; and performing element-wise matrix multiplication of the normalized class activation map with the first training sample to obtain the second training sample.
In an alternative embodiment, after the class activation map of the first training sample is obtained, its aspect ratio is checked; if the initial aspect ratio of the class activation map is not that of the first training sample, the class activation map is scaled to a preset size so that its size is consistent with the size of the first training sample.
For example, assuming that the image size of the first training sample is M × N, the class activation map may be scaled to M × N, resulting in a class activation map of size M × N.
In an alternative embodiment, after the M × N class activation map is obtained, its pixel values may further be normalized to [0, 1], giving a class activation map with pixel values in [0, 1].
In an optional embodiment, after the class activation map with pixel values in [0, 1] is obtained, it may further be combined element-wise with the first training sample to obtain the second training sample. The second training sample is obtained after suppressing the high-response region in the first training sample; that is, the second training sample mainly contains the image information of the low-response region.
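A minimal sketch of this generation step, assuming PyTorch, follows; multiplying by (1 − map) is taken here because the stated goal is to suppress the high-response region, and the exact form is otherwise an assumption.

```python
# Sketch: scale the class activation map to the image size, normalize to [0, 1],
# and suppress the high-response region element-wise.
import torch
import torch.nn.functional as F

def generate_second_sample(x, cam):
    # x:   (B, C, H, W) first training sample
    # cam: (B, h, w) class activation map from the trained first neural network
    cam = F.interpolate(cam.unsqueeze(1), size=x.shape[-2:], mode="bilinear",
                        align_corners=False)                       # scale to M x N
    cam_min = cam.amin(dim=(2, 3), keepdim=True)
    cam_max = cam.amax(dim=(2, 3), keepdim=True)
    cam = (cam - cam_min) / (cam_max - cam_min + 1e-8)             # normalize to [0, 1]
    return x * (1.0 - cam)                                         # suppress high-response region
```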
S904, jointly training a second neural network and the trained first neural network based on the first training sample, the second training sample and corresponding annotation data to obtain an image processing model, wherein the second neural network is a twin network of the first neural network.
In the embodiment of the application, twin networks refer to networks with the same structure but different network parameters; performing joint training on a second neural network and the trained first neural network based on the first training sample, the second training sample and corresponding labeling data to obtain an image processing model, which may include: calling the trained first neural network to respectively process the first training sample and the second training sample to obtain a prediction result of the first training sample, a first feature projection and a prediction result of the second training sample, wherein the first feature projection comprises features of a high-response image area in the first training sample; calling a second neural network and a corresponding prediction function to process the second training sample to obtain a second feature projection, wherein the second feature projection comprises the features of the low-response image area in the second training sample; determining target loss based on the prediction result of the first training sample, the prediction result of the second training sample, corresponding annotation data, the first feature projection and the second feature projection; and adjusting the network parameters of the second neural network and the trained network parameters of the first neural network according to the target loss to obtain an image processing model, wherein the image processing model comprises the trained first neural network after network parameter adjustment.
The trained first neural network is called to process the first training sample and the second training sample respectively while, at the same time, the second neural network and the corresponding prediction function are called to process the second training sample.
In an optional implementation, calling the trained first neural network to process the first training sample and the second training sample respectively to obtain the prediction result of the first training sample, the first feature projection and the prediction result of the second training sample may include: calling the trained first neural network to perform feature extraction on the first training sample and the second training sample respectively, and obtaining the prediction result of the first training sample, the first feature projection and the prediction result of the second training sample based on the extracted features. The first feature projection may be obtained by the trained first neural network projecting the first training sample of the high-dimensional space, or its extracted features, to a low-dimensional space and then taking the feature vector of that low-dimensional space.
The first feature projection includes the features of the high-response image region in the first training sample; the prediction result may be one or more of a classification result, a detection result and a segmentation result.
In an optional implementation, calling the second neural network and the corresponding prediction function to process the second training sample to obtain the second feature projection, where the second feature projection includes the features of the low-response image region in the second training sample, may include: calling the second neural network to perform feature extraction on the second training sample, and mapping the extracted features with the corresponding prediction function to obtain the second feature projection. The second feature projection may be obtained by the second neural network projecting the second training sample of the high-dimensional space, or its extracted features, to a low-dimensional space and then taking the feature vector of that low-dimensional space. The second feature projection includes the features of the low-response image region in the second training sample.
In an alternative embodiment, determining the target loss based on the predicted result of the first training sample, the predicted result of the second training sample, the corresponding annotation data, the first feature projection, and the second feature projection may include: determining a first loss based on the prediction result of the first training sample and the corresponding annotation data; determining a second loss based on the prediction result of the second training sample and the corresponding annotation data; determining a contrast loss based on the first feature projection and the second feature projection; a target loss is determined based on the first loss, the second loss, and the contrast loss.
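A minimal sketch of composing the target loss from these three parts is given below, assuming PyTorch and cross-entropy for the two supervised losses; contrastive_loss stands in for either of the contrast losses discussed later.

```python
# Sketch of the target loss: first loss + second loss + contrast loss.
import torch.nn.functional as F

def target_loss(pred_x, pred_x_prime, labels, z, z_prime, contrastive_loss):
    first_loss = F.cross_entropy(pred_x, labels)          # prediction of x vs. annotation data
    second_loss = F.cross_entropy(pred_x_prime, labels)   # prediction of x' vs. annotation data
    contrast = contrastive_loss(z, z_prime)               # first vs. second feature projection
    return first_loss + second_loss + contrast
```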
For example, referring to fig. 10, fig. 10 is a diagram of the overall framework for processing an image to be processed according to an embodiment of the present application; as shown in fig. 10, the framework includes four steps (a), (b), (c) and (d):
(a): Based on the first training sample x and the corresponding annotation data y, train the supervised model $f_\theta$ by minimizing the cross-entropy loss $L_{cls}(f_\theta(x))$.
(b): Generate the class activation map $M_c$ of the first training sample x, scale $M_c$ to the size of x, and normalize it to $[0,1]$ to obtain $\hat{M}_c$. Then suppress the high-response region in the image:
$$x' = (1 - \hat{M}_c) \odot x$$
Take x' as the second training sample, and take y' as the annotation data corresponding to x'.
(c): Feed the suppressed image into the supervised model $f_\theta$ and train it by minimizing the cross-entropy loss $L_{cls}(f_\theta(x'))$, so that the model focuses on the low-response regions.
(d): While the supervised training is performed, input the first training sample x into $f_\theta$ and the second training sample x' into $g(f_\xi(\cdot))$ to obtain two feature projections: $z = f_\theta(x)$ and $z' = g(f_\xi(x'))$. Optimize the image processing model with the contrast loss $L_{self}(z, z')$ computed from z and z'.
The trained first neural network obtained in step S902 is $f_\theta$, and the second neural network is $f_\xi$. The relationship between the second training sample x' and the first training sample x is shown in equation (1):
$$x' = (1 - \hat{M}_c) \odot x \tag{1}$$
where $\hat{M}_c$ is the normalized class activation map of x and $\odot$ denotes element-wise multiplication of corresponding matrix elements.
The first loss function is shown in equation (2):
$$L_{cls}(f_\theta(x)) = -\frac{1}{N}\sum_{i=1}^{N} y_i \log f_\theta(x_i) \tag{2}$$
where N is the total number of first training samples.
the second loss function is shown in equation 3:
Figure BDA0003413737330000115
n is the total number of the second training samples;
Input x into $f_\theta$ and x' into $g(f_\xi(\cdot))$ to obtain two feature projections, $z = f_\theta(x)$ and $z' = g(f_\xi(x'))$; the contrast loss function $L_{self}(z, z')$ is then computed from z and z', where g is a prediction function.
Illustratively, for MoCo_v2 (a momentum contrast method), g is the identity map (i.e., a constant factor of 1); $f_\theta$ and $f_\xi$ are identical in structure but have different parameters: the parameters of $f_\theta$ are updated by gradient descent, while the parameters of $f_\xi$ are updated by an exponential moving average, $\xi \leftarrow m\xi + (1-m)\theta$, where $m \in [0,1)$. The contrast loss function is shown in equation (4):
$$L_{self}(z, z') = -\log \frac{\exp(z \cdot z' / \tau)}{\exp(z \cdot z' / \tau) + \sum_{j=1}^{K} \exp(z \cdot z_j^{-} / \tau)} \tag{4}$$
where $\tau$ is the temperature coefficient, K is the number of negative samples, and $z_j^{-}$ denotes the projection of the j-th negative sample.
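The momentum update and an InfoNCE-style contrast loss of this kind could be sketched as follows, assuming PyTorch; the queue of negative projections and the tensor shapes are assumptions for illustration.

```python
# Sketch of the MoCo_v2-style branch: f_xi tracks f_theta by exponential moving average,
# and the contrast loss is an InfoNCE loss with temperature tau over K negative samples.
import torch
import torch.nn.functional as F

@torch.no_grad()
def momentum_update(f_theta, f_xi, m=0.999):
    # xi <- m * xi + (1 - m) * theta
    for p_t, p_x in zip(f_theta.parameters(), f_xi.parameters()):
        p_x.data.mul_(m).add_(p_t.data, alpha=1.0 - m)

def info_nce_loss(z, z_prime, negatives, tau=0.2):
    # z, z_prime: (B, D) positive pair of feature projections; negatives: (K, D) queue
    z = F.normalize(z, dim=1)
    z_prime = F.normalize(z_prime, dim=1)
    negatives = F.normalize(negatives, dim=1)
    l_pos = (z * z_prime).sum(dim=1, keepdim=True)    # (B, 1) positive similarities
    l_neg = z @ negatives.t()                         # (B, K) negative similarities
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(z.size(0), dtype=torch.long, device=z.device)  # positive is index 0
    return F.cross_entropy(logits, labels)
```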
For SimSiam (a network with two branches whose inputs differ but whose structure and network parameters are shared), g is a multilayer perceptron (MLP) and $f_\theta$ and $f_\xi$ share parameters, i.e. $f_\theta = f_\xi$; the contrast loss function is then shown in equation (5):
$$L_{self}(z, z') = -\frac{z \cdot z'}{\lVert z \rVert_2 \,\lVert z' \rVert_2} \tag{5}$$
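A corresponding minimal sketch of such a negative-cosine-similarity contrast loss is given below, assuming PyTorch; the stop-gradient on the target projection follows the usual SimSiam practice and is an assumption here.

```python
# Sketch of a SimSiam-style contrast loss: negative cosine similarity between the
# predictor output and the (gradient-stopped) projection of the other branch.
import torch.nn.functional as F

def simsiam_loss(p, z):
    # p: predictor output g(.); z: target projection, detached to stop gradients
    return -F.cosine_similarity(p, z.detach(), dim=1).mean()
```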
Finally, the target loss is shown in equation (6):
$$L(x) = L_{cls}(f_\theta(x)) + L_{cls}(f_\theta(x')) + L_{self}\big(f_\theta(x),\, g(f_\xi(x'))\big) \tag{6}$$
in an optional implementation manner, the network parameters of the second neural network and the trained network parameters of the first neural network are adjusted according to the target loss to obtain an image processing model, wherein the image processing model includes the trained first neural network after network parameter adjustment.
For example, the parameters θ of the trained first neural network and the parameters ξ of the second neural network may be adjusted according to the target loss to obtain the image processing model, which includes the trained first neural network $f_\theta$ after its network parameters are adjusted.
In an alternative implementation, the present embodiment is also compared, on the standard small-sample recognition task, with supervised pre-training (Supervised), an Exemplar-based contrast learning method, and three different contrast learning methods (BYOL, MoCo_v2, SimSiam). As can be seen from Table 1, the classification accuracy of the supervised method (Supervised) on both data sets surpasses the unsupervised methods (BYOL, MoCo_v2, SimSiam). In addition, the embodiment of the application, which combines the supervised and unsupervised methods, outperforms the prior art on both the 1-shot and the 5-shot settings of the two data sets, which verifies the effectiveness of the invention.
Table 1: 5-way classification accuracy (%) on standard small-sample recognition of the embodiment of the present application and prior-art methods, on miniImageNet and tieredImageNet.
In an optional implementation, in order to test the transfer performance of the scheme of the application in a more realistic scenario, a cross-domain small-sample recognition experiment between the public databases tieredImageNet and CIFAR-FS is designed in the embodiment of the application. The model is pre-trained on the base classes of tieredImageNet or CIFAR-FS and tested on 600 tasks composed of the novel classes. As shown in Table 2, the embodiment of the application improves on other pre-training methods by 3-16%. This task is more challenging than the standard small-sample recognition task, but the embodiment of the application still achieves the best transfer performance.
Table 2: cross-domain small-sample recognition 5-way accuracy (%). Tiered → CIFAR means pre-training on the base classes of tieredImageNet and testing on the novel classes of CIFAR-FS.
In an alternative implementation, in order to explore the transfer performance of the contrast method in the embodiment of the application on downstream tasks other than classification, experiments are performed on two tasks: PASCAL VOC object detection and Cityscapes semantic segmentation, with pre-training on the base-class data of tieredImageNet. For PASCAL VOC object detection, this example uses Fast R-CNN as the detector, and the input image is scaled to 800 × 800. The optimizer is Adam [21], trained for 100 epochs with a weight decay of 5 × 10⁻⁴ and a batch size of 4; the learning rate is set to 1 × 10⁻⁴ and multiplied by 0.95 each epoch. In the embodiment of the application, the pre-trained model is fine-tuned on the training sets of the VOC2007 and VOC2012 data sets, and the final average precision (AP) is obtained on the test set of VOC2007. For Cityscapes semantic segmentation, the embodiment of the application segments based on DeepLab v3+ [16]. The input image is scaled to 768 × 768; the optimizer is stochastic gradient descent, trained for 30,000 iterations with a momentum of 0.9, a learning rate of 0.01, a weight decay of 1 × 10⁻⁴, and a batch size of 8. Both tasks use ResNet-50 [17] as the backbone network, and every layer of the network is fine-tuned end to end. The image augmentation during training includes random scaling, random cropping and horizontal flipping. Finally, segmentation on the Cityscapes validation set yields the mean intersection over union (mIoU).
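As a rough illustration of the detection fine-tuning schedule stated above (Adam, learning rate 1 × 10⁻⁴ multiplied by 0.95 each epoch, weight decay 5 × 10⁻⁴), a PyTorch sketch follows; the model is a placeholder, and only the optimizer settings come from the text.

```python
# Sketch of the fine-tuning optimizer and schedule stated above; the detector itself
# is a placeholder argument.
import torch

def make_detection_optimizer(model):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=5e-4)
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)  # x0.95 per epoch
    return optimizer, scheduler
```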
As can be seen from Table 3, the performance of the example (based on MoCo_v2) on VOC object detection and Cityscapes semantic segmentation exceeds that of the other comparison methods, including Supervised, MoCo_v2, SimSiam and Exemplar, by 1-2%. This demonstrates that the contrast learning method of the embodiment of the application can also improve transfer performance on other downstream tasks.
Table 3: performance on PASCAL VOC object detection and Cityscapes semantic segmentation.
In summary, in the embodiment of the application, in the process of jointly training the image processing model, by suppressing the high-response region most relevant to the task in the image, the model can extract more valuable image information from other image regions, so that the useful information of the image can be comprehensively and effectively extracted, and the accuracy of task processing is improved.
Based on the above method embodiments, the embodiment of the present application further provides a schematic structural diagram of an image processing apparatus. Fig. 11 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present disclosure. The image processing apparatus 1000 shown in fig. 11 may operate the following units: an acquisition unit 1001, a processing unit 1002, a training unit 1003 and a determining unit 1004;
an acquisition unit 1001 configured to acquire an image to be processed;
the processing unit 1002 is configured to invoke an image processing model to perform feature extraction on the image to be processed, so as to obtain feature information of the image to be processed; the image processing model is obtained by performing joint training on a first neural network and a second neural network by using a training sample set, wherein the training sample set comprises a first training sample, a second training sample and marking data, the second training sample is generated based on the first training sample, and the joint training comprises supervised training and contrast training;
a determining unit 1003, configured to determine a processing result of the image to be processed based on the feature information of the image to be processed and a target task, where the target task includes one or more of a classification task, a segmentation task, and a detection task.
In an alternative embodiment, the obtaining unit 1001 is configured to obtain an image to be processed.
In an optional implementation manner, the processing unit 1002 is configured to invoke an image processing model to perform feature extraction on the image to be processed, so as to obtain feature information of the image to be processed; the image processing model is obtained by performing joint training on a first neural network and a second neural network by using a training sample set, wherein the training sample set comprises a first training sample, a second training sample and labeling data, the second training sample is generated based on the first training sample, and the joint training comprises supervised training and contrast training.
In an optional implementation manner, the determining unit 1003 is configured to determine a processing result of the image to be processed based on the feature information of the image to be processed and a target task, where the target task includes one or more of a classification task, a segmentation task, and a detection task.
In an optional implementation manner, the obtaining unit 1001 is configured to, when obtaining a first training sample and corresponding labeled data, train a first neural network based on the first training sample and the corresponding labeled data, so as to obtain a trained first neural network;
generating a second training sample based on the trained first neural network and the first training sample;
and performing joint training on a second neural network and the trained first neural network based on the first training sample, the second training sample and corresponding labeling data to obtain an image processing model, wherein the second neural network is a twin network of the first neural network.
In an optional implementation manner, the processing unit 1002 is configured to perform joint training on a second neural network and the trained first neural network based on the first training sample, the second training sample and corresponding labeled data to obtain an image processing model, and call the trained first neural network to respectively process the first training sample and the second training sample to obtain a prediction result of the first training sample, a first feature projection and a prediction result of the second training sample, where the first feature projection includes a feature of a high-response image region in the first training sample;
calling a second neural network and a corresponding prediction function to process the second training sample to obtain a second feature projection, wherein the second feature projection comprises features of a low-response image area in the second training sample;
determining target loss based on the prediction result of the first training sample, the prediction result of the second training sample, corresponding annotation data, the first feature projection and the second feature projection;
and adjusting the network parameters of the second neural network and the trained network parameters of the first neural network according to the target loss to obtain an image processing model, wherein the image processing model comprises the trained first neural network after network parameter adjustment.
In an optional embodiment, the determining unit 1003 is configured to, when determining the target loss based on the prediction result of the first training sample, the prediction result of the second training sample, the corresponding annotation data, the first feature projection and the second feature projection, determine a first loss based on the prediction result of the first training sample and the corresponding annotation data;
determining a second loss based on the prediction result of the second training sample and the corresponding annotation data;
determining a contrast loss based on the first feature projection and the second feature projection;
a target loss is determined based on the first loss, the second loss, and the contrast loss.
In an optional embodiment, the processing unit 1002 is configured to, when generating a second training sample based on the trained first neural network and the first training sample, process the first training sample with the trained first neural network to obtain a class activation map of the first training sample;
and generate the second training sample based on the first training sample and the class activation map.
In an optional embodiment, the processing unit 1002 is configured to, when generating the second training sample based on the first training sample and the class activation map, scale the class activation map based on the image size of the first training sample to obtain a scaled class activation map;
normalize the scaled class activation map;
and perform element-wise matrix multiplication of the normalized class activation map with the first training sample to obtain the second training sample.
In an optional implementation manner, the determining unit 1003 is configured to determine, when determining a processing result of the image to be processed based on the feature information of the image to be processed and a target task, a task processing module corresponding to the target task from one or more task processing modules included in the image processing model according to a task identifier of the target task;
calling a task processing module corresponding to the target task to process the characteristic information of the image to be processed to obtain a corresponding processing result;
and sending the processing result of the image to be processed to the client.
According to an embodiment of the present application, the steps involved in the image processing method shown in fig. 2 and the training method of the image processing model shown in fig. 9 may be performed by units in the image processing apparatus shown in fig. 11. For example, step S201 in the image processing method shown in fig. 2 may be performed by the acquisition unit 1001 in the image processing apparatus shown in fig. 11, and step S202 may be performed by the processing unit 1002 in the image processing apparatus shown in fig. 11; steps S901 to S903 in the training method of the image processing model shown in fig. 9 may be performed by the training unit 1003 in the image processing apparatus shown in fig. 11, and step S904 may be performed by the determining unit 1004 in the image processing apparatus shown in fig. 11.
According to the embodiment of the present application, the units in the image processing apparatus shown in fig. 11 may be respectively or entirely combined into one or several other units to form the image processing apparatus, or some unit(s) may be further split into multiple functionally smaller units to form the image processing apparatus, which may achieve the same operation without affecting the achievement of the technical effects of the embodiment of the present application. The units are divided based on logic functions, and in practical application, the functions of one unit can be realized by a plurality of units, or the functions of a plurality of units can be realized by one unit. In other embodiments of the present application, the image processing apparatus may also include other units, and in practical applications, these functions may also be implemented by being assisted by other units, and may be implemented by cooperation of a plurality of units.
According to the embodiment of the present application, the image processing apparatus shown in fig. 11 may be constructed by running a computer program (including program code) capable of executing the steps of the methods shown in fig. 2 and fig. 9 on a general-purpose computing device, such as a computer comprising processing elements and storage elements such as a central processing unit (CPU), a random access memory (RAM) and a read-only memory (ROM), thereby implementing the image processing method of the embodiment of the present application. The computer program may be recorded on a computer-readable storage medium, loaded into the above computing device via the computer-readable storage medium, and executed there.
In this embodiment of the application, the processing unit 1002 processes the acquired image to be processed by using an image processing model to obtain feature information of the image to be processed, where the image processing model is obtained by performing joint training on a first neural network and a second neural network by using a training sample set, the training sample set includes a first training sample, a second training sample and labeled data, and the second training sample is generated based on the first training sample.
Based on the method and device embodiments, the embodiment of the application provides an image processing device. Referring to fig. 12, a schematic structural diagram of an image processing apparatus according to an embodiment of the present application is shown. The image processing apparatus 1100 shown in fig. 12 includes at least a processor 1101, an input interface 1102, an output interface 1103, a computer storage medium 1104, and a memory 1105. The processor 1101, the input interface 1102, the output interface 1103, the computer storage medium 1104, and the memory 1105 may be connected by a bus or other means.
A computer storage medium 1104 may be stored in the memory 1105 of the image processing apparatus 1100, the computer storage medium 1104 being for storing a computer program comprising program instructions, the processor 1101 being for executing the program instructions stored by the computer storage medium 1104. The processor 1101 (or CPU) is a computing core and a control core of the image Processing apparatus 1100, which is adapted to implement one or more instructions, and in particular to load and execute one or more computer instructions to implement corresponding method flows or corresponding functions.
An embodiment of the present application also provides a computer storage medium (Memory) that is a Memory device in the image processing apparatus 1100 and stores programs and data. It is to be understood that the computer storage medium herein may include a built-in storage medium in the image processing apparatus 1100, and may also include an extended storage medium supported by the image processing apparatus 1100. The computer storage medium provides a storage space that stores an operating system of the image processing apparatus 1100. Also stored in this memory space are one or more instructions, which may be one or more computer programs (including program code), suitable for loading and execution by the processor 1101. The computer storage medium may be a high-speed RAM memory, or may be a non-volatile memory (non-volatile memory), such as at least one disk memory; and optionally at least one computer storage medium located remotely from the processor.
In one embodiment, the computer storage medium may be loaded with one or more instructions by processor 1101 and executed to implement the corresponding steps described above with respect to the image processing method shown in FIG. 2 and the training method of the image processing model shown in FIG. 9. In particular implementations, one or more instructions in the computer storage medium are loaded by processor 1101 and perform the following steps:
acquiring an image to be processed; calling an image processing model to perform feature extraction on the image to be processed to obtain feature information of the image to be processed; the image processing model is obtained by jointly training a first neural network and a second neural network by using a training sample set, wherein the training sample set comprises a first training sample, a second training sample and annotation data, and the second training sample is generated based on the first training sample; and determining a processing result of the image to be processed based on the feature information of the image to be processed and a target task, wherein the target task comprises one or more of a classification task, a segmentation task and a detection task.
In an optional embodiment, the processor 1101 is further configured to acquire a first training sample and corresponding annotation data;
train a first neural network based on the first training sample and the corresponding annotation data to obtain a trained first neural network; generate a second training sample based on the trained first neural network and the first training sample; and perform joint training on a second neural network and the trained first neural network based on the first training sample, the second training sample and the corresponding annotation data to obtain an image processing model, wherein the second neural network is a twin network of the first neural network.
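Purely for illustration, the training flow described above can be sketched in PyTorch-style Python as follows. The staged structure follows the description, but the network interface (the backbone and head attributes), the optimizer settings, and the helper functions generate_second_samples, joint_forward and target_loss (sketched further below) are assumptions rather than part of the embodiment:

    import copy
    import torch

    def build_image_processing_model(first_network, train_loader, stage1_epochs=10, stage2_epochs=10):
        # Stage 1: supervised training of the first neural network on the first
        # training samples and their annotation data (a classification head is
        # assumed here purely for illustration).
        optimizer = torch.optim.Adam(first_network.parameters(), lr=1e-4)
        for _ in range(stage1_epochs):
            for first_images, labels in train_loader:
                logits = first_network.head(first_network.backbone(first_images))
                loss = torch.nn.functional.cross_entropy(logits, labels)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

        # Stage 2: the second neural network is a twin of the trained first
        # neural network (same architecture, its own copy of the parameters),
        # plus a small prediction function (dimensions are illustrative).
        second_network = copy.deepcopy(first_network)
        predictor = torch.nn.Linear(128, 128)

        # Stage 3: joint training on the first samples, the generated second
        # samples and the annotation data.
        params = (list(first_network.parameters()) + list(second_network.parameters())
                  + list(predictor.parameters()))
        joint_optimizer = torch.optim.Adam(params, lr=1e-4)
        for _ in range(stage2_epochs):
            for first_images, labels in train_loader:
                second_images = generate_second_samples(first_network, first_images)
                outputs = joint_forward(first_network, second_network, predictor,
                                        first_images, second_images)
                loss = target_loss(*outputs, labels)
                joint_optimizer.zero_grad()
                loss.backward()
                joint_optimizer.step()
        # The resulting image processing model comprises the parameter-adjusted first network.
        return first_network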
In an optional embodiment, when performing joint training on a second neural network and the trained first neural network based on the first training sample, the second training sample and the corresponding annotation data to obtain an image processing model, the processor 1101 is further configured to: call the trained first neural network to respectively process the first training sample and the second training sample to obtain a prediction result of the first training sample, a first feature projection and a prediction result of the second training sample, wherein the first feature projection comprises features of a high-response image region in the first training sample; call a second neural network and a corresponding prediction function to process the second training sample to obtain a second feature projection, wherein the second feature projection comprises features of a low-response image region in the second training sample; determine a target loss based on the prediction result of the first training sample, the prediction result of the second training sample, the corresponding annotation data, the first feature projection and the second feature projection; and adjust the network parameters of the second neural network and of the trained first neural network according to the target loss to obtain an image processing model, wherein the image processing model comprises the trained first neural network after the network parameter adjustment.
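The two forward passes of this joint training step might look as follows; the attribute names backbone, head and projector are illustrative assumptions about how the twin networks expose their feature extractor, task head and projection head:

    def joint_forward(first_network, second_network, predictor, first_images, second_images):
        # Trained first network: prediction result of the first training sample plus
        # the first feature projection (features of the high-response image region).
        feat_first = first_network.backbone(first_images)
        pred_first = first_network.head(feat_first)
        proj_first = first_network.projector(feat_first)

        # The trained first network also predicts on the second training sample.
        feat_suppressed = first_network.backbone(second_images)
        pred_second = first_network.head(feat_suppressed)

        # Second (twin) network plus the prediction function processes the second
        # training sample to obtain the second feature projection
        # (features of the low-response image region).
        feat_twin = second_network.backbone(second_images)
        proj_second = predictor(second_network.projector(feat_twin))

        return pred_first, pred_second, proj_first, proj_second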
In an optional embodiment, when determining the target loss based on the prediction result of the first training sample, the prediction result of the second training sample, the corresponding annotation data, the first feature projection and the second feature projection, the processor 1101 is further configured to: determine a first loss based on the prediction result of the first training sample and the corresponding annotation data; determine a second loss based on the prediction result of the second training sample and the corresponding annotation data; determine a contrast loss based on the first feature projection and the second feature projection; and determine the target loss based on the first loss, the second loss and the contrast loss.
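A minimal sketch of this target loss, assuming a classification task and a simple negative-cosine-similarity form for the contrast loss; the concrete loss functions and the weights w1, w2 and w3 are assumptions, not prescribed by the embodiment:

    import torch.nn.functional as F

    def contrastive_loss(proj_first, proj_second):
        # Negative cosine similarity between the two feature projections,
        # one common choice for twin-network contrastive learning.
        p1 = F.normalize(proj_first, dim=1)
        p2 = F.normalize(proj_second, dim=1)
        return -(p1 * p2).sum(dim=1).mean()

    def target_loss(pred_first, pred_second, proj_first, proj_second, labels,
                    w1=1.0, w2=1.0, w3=1.0):
        # First loss: prediction result of the first training sample vs. its annotation data.
        loss_first = F.cross_entropy(pred_first, labels)
        # Second loss: prediction result of the second training sample vs. the same annotation data.
        loss_second = F.cross_entropy(pred_second, labels)
        # Contrast loss between the first and second feature projections.
        loss_contrast = contrastive_loss(proj_first, proj_second)
        # Target loss: weighted combination of the three terms.
        return w1 * loss_first + w2 * loss_second + w3 * loss_contrast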
In an optional embodiment, when generating the second training sample based on the trained first neural network and the first training sample, the processor 1101 is further configured to: process the first training sample based on the trained first neural network to obtain a class activation heat map of the first training sample; and generate the second training sample based on the first training sample and the class activation heat map.
In an optional embodiment, when generating the second training sample based on the first training sample and the class activation heat map, the processor 1101 is further configured to: scale the class activation heat map based on the image size of the first training sample to obtain a scaled class activation heat map; perform normalization processing on the scaled class activation heat map; and perform a matrix multiplication operation on the normalized class activation heat map and the first training sample to obtain the second training sample.
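A possible implementation sketch of this step is given below. The class activation map computation (compute_class_activation_map) is a hypothetical Grad-CAM-style helper, the multiplication is interpreted here as an element-wise (Hadamard) product broadcast over channels, and the optional inversion of the normalized map, which would suppress the high-response regions as described, is an assumption noted in the comments:

    import torch
    import torch.nn.functional as F

    def generate_second_samples(first_network, first_images):
        # Class activation heat map of the first training sample, obtained from the
        # trained first neural network (hypothetical helper, output shape (N, h, w)).
        cam = compute_class_activation_map(first_network, first_images)

        # Scale the map to the image size of the first training sample.
        cam = F.interpolate(cam.unsqueeze(1), size=first_images.shape[-2:],
                            mode='bilinear', align_corners=False)  # (N, 1, H, W)

        # Normalize the scaled map to [0, 1]; to suppress the image regions most
        # relevant to the task, the normalized map may additionally be inverted,
        # e.g. cam = 1.0 - cam (an assumption consistent with the description).
        cam_min = cam.amin(dim=(2, 3), keepdim=True)
        cam_max = cam.amax(dim=(2, 3), keepdim=True)
        cam = (cam - cam_min) / (cam_max - cam_min + 1e-8)

        # Multiply the normalized map with the first training sample (broadcast over
        # channels) to obtain the second training sample; detach so the mask is
        # treated as a fixed input during joint training.
        return (first_images * cam).detach()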
In an optional implementation manner, the image to be processed is carried in an image processing request of a client, and the image processing request further includes a task identifier of the target task. When determining the processing result of the image to be processed based on the feature information of the image to be processed and the target task, the processor 1101 is further configured to: determine, according to the task identifier of the target task, a task processing module corresponding to the target task from one or more task processing modules included in the image processing model; call the task processing module corresponding to the target task to process the feature information of the image to be processed to obtain a corresponding processing result; and send the processing result of the image to be processed to the client.
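This request handling can be sketched as a simple dispatch on the task identifier; the request fields, module names and model interface below are illustrative assumptions only:

    def handle_image_processing_request(image_processing_model, request):
        # The image processing request carries the image to be processed and the
        # task identifier of the target task (field names are illustrative).
        feature_information = image_processing_model.extract_features(request["image"])

        # One task processing module per supported target task.
        task_modules = {
            "classification": image_processing_model.classification_module,
            "segmentation": image_processing_model.segmentation_module,
            "detection": image_processing_model.detection_module,
        }
        task_module = task_modules[request["task_id"]]

        # The selected module processes the feature information; the processing
        # result is then returned (sent) to the client.
        return task_module(feature_information)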
In the embodiment of the present application, the processor 1101 calls the image processing model to perform feature extraction on the image to be processed to obtain the feature information of the image to be processed, wherein the image processing model is obtained by performing joint training on a first neural network and a second neural network by using a training sample set, the training sample set comprises a first training sample, a second training sample and annotation data, and the second training sample is generated based on the first training sample. Because the image regions most relevant to the task are suppressed during training, the model can extract more image information from the remaining regions; and by jointly training the unsupervised contrastive learning method with the supervised algorithm based on the class activation heat map, more effective feature information in the image can be obtained.
Embodiments of the present application further provide a computer program product or a computer program, the computer program product or the computer program comprising computer instructions stored in a computer-readable storage medium. The processor 1101 reads the computer instructions from the computer-readable storage medium and executes them, so that the image processing apparatus 1100 performs the image processing method shown in fig. 2 and the training method of the image processing model shown in fig. 9.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative; the division into the above-described modules is merely a logical division, and other divisions may be adopted in practice, for example, a plurality of modules or components may be combined or integrated into another system, or some features may be omitted or not executed.

The above description covers only specific embodiments of the present application, and the scope of protection of the present application is not limited thereto. Any change or substitution that a person skilled in the art can readily conceive of within the technical scope disclosed in the present application shall be covered by the scope of protection of the present application. Therefore, the scope of protection of the present application shall be subject to the scope of protection of the claims.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above disclosure is only a preferred embodiment of the present invention and certainly cannot be taken to limit the scope of rights of the present invention; therefore, equivalent variations made in accordance with the claims of the present invention still fall within the scope covered by the present invention.

Claims (10)

1. An image processing method, characterized in that the method comprises:
acquiring an image to be processed;
calling an image processing model to perform feature extraction on the image to be processed to obtain feature information of the image to be processed; the image processing model is obtained by performing joint training on a first neural network and a second neural network by using a training sample set, wherein the training sample set comprises a first training sample, a second training sample and annotation data, the second training sample is generated based on the first training sample, and the joint training comprises supervised training and contrast training;
determining a processing result of the image to be processed based on the feature information of the image to be processed and a target task, wherein the target task comprises one or more of a classification task, a segmentation task and a detection task.
2. The method of claim 1, further comprising:
acquiring a first training sample and corresponding annotation data;
training a first neural network based on the first training sample and the corresponding annotation data to obtain a trained first neural network;
generating a second training sample based on the trained first neural network and the first training sample;
and performing joint training on a second neural network and the trained first neural network based on the first training sample, the second training sample and corresponding annotation data to obtain an image processing model, wherein the second neural network is a twin network of the first neural network.
3. The method of claim 2, wherein the jointly training a second neural network and the trained first neural network based on the first training sample, the second training sample and corresponding annotation data to obtain an image processing model comprises:
calling the trained first neural network to respectively process the first training sample and the second training sample to obtain a prediction result and a first feature projection of the first training sample and a prediction result of the second training sample, wherein the first feature projection comprises features of a high-response image area in the first training sample;
calling a second neural network and a corresponding prediction function to process the second training sample to obtain a second feature projection, wherein the second feature projection comprises features of a low-response image area in the second training sample;
determining target loss based on the prediction result of the first training sample, the prediction result of the second training sample, corresponding annotation data, the first feature projection and the second feature projection;
and adjusting the network parameters of the second neural network and the trained network parameters of the first neural network according to the target loss to obtain an image processing model, wherein the image processing model comprises the trained first neural network after network parameter adjustment.
4. The method of claim 3, wherein determining a target loss based on the prediction of the first training sample, the prediction of the second training sample, the corresponding annotation data, the first feature projection, and the second feature projection comprises:
determining a first loss based on the prediction result of the first training sample and the corresponding annotation data;
determining a second loss based on the prediction result of the second training sample and the corresponding annotation data;
determining a contrast loss based on the first feature projection and the second feature projection;
and determining the target loss based on the first loss, the second loss and the contrast loss.
5. The method according to any one of claims 2 to 4, wherein the generating a second training sample based on the trained first neural network and the first training sample comprises:
processing the first training sample based on the trained first neural network to obtain a class activation heat map of the first training sample;
generating a second training sample based on the first training sample and the class activation heat map.
6. The method of claim 5, wherein the generating a second training sample based on the first training sample and the class activation heat map comprises:
scaling the class activation heat map based on the image size of the first training sample to obtain a scaled class activation heat map;
performing normalization processing on the scaled class activation heat map;
and performing a matrix multiplication operation on the normalized class activation heat map and the first training sample to obtain the second training sample.
7. The method according to claim 1, wherein the image to be processed is carried in an image processing request of a client, the image processing request further comprising a task identifier of a target task; the determining the processing result of the image to be processed based on the feature information of the image to be processed and the target task comprises:
according to the task identification of the target task, determining a task processing module corresponding to the target task from one or more task processing modules included in the image processing model;
calling the task processing module corresponding to the target task to process the feature information of the image to be processed to obtain a corresponding processing result;
and sending the processing result of the image to be processed to the client.
8. An image processing apparatus characterized by comprising:
an acquisition unit, configured to acquire an image to be processed;
a processing unit, configured to call an image processing model to perform feature extraction on the image to be processed to obtain feature information of the image to be processed; the image processing model is obtained by performing joint training on a first neural network and a second neural network by using a training sample set, wherein the training sample set comprises a first training sample, a second training sample and annotation data, the second training sample is generated based on the first training sample, and the joint training comprises supervised training and contrast training;
a determining unit, configured to determine a processing result of the image to be processed based on the feature information of the image to be processed and a target task, wherein the target task comprises one or more of a classification task, a segmentation task and a detection task.
9. An image processing apparatus characterized by comprising:
a memory and a processor, wherein the memory has stored thereon an image processing program which, when executed by the processor, implements the steps of the image processing method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, implements the steps of the image processing method according to any one of claims 1 to 7.
CN202111538907.XA 2021-12-15 2021-12-15 Image processing method and related equipment Active CN114299304B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111538907.XA CN114299304B (en) 2021-12-15 2021-12-15 Image processing method and related equipment

Publications (2)

Publication Number Publication Date
CN114299304A true CN114299304A (en) 2022-04-08
CN114299304B (en) 2024-04-12

Family

ID=80968444

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111538907.XA Active CN114299304B (en) 2021-12-15 2021-12-15 Image processing method and related equipment

Country Status (1)

Country Link
CN (1) CN114299304B (en)

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109871791A (en) * 2019-01-31 2019-06-11 北京字节跳动网络技术有限公司 Image processing method and device
CN112784978A (en) * 2019-11-08 2021-05-11 佳能株式会社 Method, device and system for training neural network and storage medium for storing instructions
CN112329762A (en) * 2019-12-12 2021-02-05 北京沃东天骏信息技术有限公司 Image processing method, model training method, device, computer device and medium
CN113408554A (en) * 2020-03-16 2021-09-17 阿里巴巴集团控股有限公司 Data processing method, model training method, device and equipment
US20210374553A1 (en) * 2020-06-02 2021-12-02 Salesforce.Com, Inc. Systems and methods for noise-robust contrastive learning
CN111783870A (en) * 2020-06-29 2020-10-16 北京百度网讯科技有限公司 Human body attribute identification method, device, equipment and storage medium
CN112102237A (en) * 2020-08-10 2020-12-18 清华大学 Brain tumor recognition model training method and device based on semi-supervised learning
CN112183718A (en) * 2020-08-31 2021-01-05 华为技术有限公司 Deep learning training method and device for computing equipment
CN112381116A (en) * 2020-10-21 2021-02-19 福州大学 Self-supervision image classification method based on contrast learning
CN113704531A (en) * 2021-03-10 2021-11-26 腾讯科技(深圳)有限公司 Image processing method, image processing device, electronic equipment and computer readable storage medium
CN112906873A (en) * 2021-03-26 2021-06-04 北京邮电大学 Graph neural network training method and device, electronic equipment and storage medium
CN113283368A (en) * 2021-06-08 2021-08-20 电子科技大学中山学院 Model training method, face attribute analysis method, device and medium
CN113420801A (en) * 2021-06-15 2021-09-21 深圳市朗驰欣创科技股份有限公司 Network model generation method, device, terminal and storage medium
CN113673242A (en) * 2021-08-20 2021-11-19 之江实验室 Text classification method based on K-neighborhood node algorithm and comparative learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
GUOLEI SUN et al.: "Fine-grained recognition: accounting for subtle differences between similar classes", arXiv.org *
NANXUAN ZHAO et al.: "What makes instance discrimination good for transfer learning?", arXiv.org *
SHI Zhigang; SHAO Donghua; GU Qinping: "Fast face recognition algorithm based on improved local collaborative representation", Computer Engineering and Design, No. 09 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114913232A (en) * 2022-06-10 2022-08-16 嘉洋智慧安全生产科技发展(北京)有限公司 Image processing method, apparatus, device, medium, and product
CN114913232B (en) * 2022-06-10 2023-08-08 嘉洋智慧安全科技(北京)股份有限公司 Image processing method, device, equipment, medium and product
CN116228623A (en) * 2022-09-08 2023-06-06 上海贝特威自动化科技有限公司 Metal surface defect detection method, equipment and storage medium based on isomorphism regularization self-supervision attention network
CN116228623B (en) * 2022-09-08 2024-05-03 上海贝特威自动化科技有限公司 Metal surface defect detection method, equipment and storage medium based on isomorphism regularization self-supervision attention network

Also Published As

Publication number Publication date
CN114299304B (en) 2024-04-12

Similar Documents

Publication Publication Date Title
WO2021042828A1 (en) Neural network model compression method and apparatus, and storage medium and chip
CN110431560B (en) Target person searching method, device, equipment and medium
CN109117781B (en) Multi-attribute identification model establishing method and device and multi-attribute identification method
CN111582409B (en) Training method of image tag classification network, image tag classification method and device
WO2019089578A1 (en) Font identification from imagery
CN108280451B (en) Semantic segmentation and network training method and device, equipment and medium
CN111898703B (en) Multi-label video classification method, model training method, device and medium
US20240062426A1 (en) Processing images using self-attention based neural networks
CN113505797B (en) Model training method and device, computer equipment and storage medium
CN113434716B (en) Cross-modal information retrieval method and device
CN114549913B (en) Semantic segmentation method and device, computer equipment and storage medium
CN112597984B (en) Image data processing method, image data processing device, computer equipment and storage medium
JP6107531B2 (en) Feature extraction program and information processing apparatus
CN112395979A (en) Image-based health state identification method, device, equipment and storage medium
CN114266897A (en) Method and device for predicting pox types, electronic equipment and storage medium
CN112749737A (en) Image classification method and device, electronic equipment and storage medium
CN114529750A (en) Image classification method, device, equipment and storage medium
CN114299304A (en) Image processing method and related equipment
CN115909336A (en) Text recognition method and device, computer equipment and computer-readable storage medium
CN111898544A (en) Character and image matching method, device and equipment and computer storage medium
CN116758379A (en) Image processing method, device, equipment and storage medium
CN115359296A (en) Image recognition method and device, electronic equipment and storage medium
CN110929118B (en) Network data processing method, device, apparatus and medium
CN114155388A (en) Image recognition method and device, computer equipment and storage medium
CN111784787B (en) Image generation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant