CN115115910A - Training method, using method, device, equipment and medium of image processing model - Google Patents

Training method, using method, device, equipment and medium of image processing model

Info

Publication number
CN115115910A
Authority
CN
China
Prior art keywords
feature
image processing
prediction
image
processing model
Prior art date
Legal status
Pending
Application number
CN202210826932.6A
Other languages
Chinese (zh)
Inventor
朱明丽
陈思宏
吴保元
朱梓豪
陈宸
Current Assignee
Shenzhen Tencent Computer Systems Co Ltd
Chinese University of Hong Kong Shenzhen
Original Assignee
Shenzhen Tencent Computer Systems Co Ltd
Chinese University of Hong Kong Shenzhen
Priority date
Filing date
Publication date
Application filed by Shenzhen Tencent Computer Systems Co Ltd, Chinese University of Hong Kong Shenzhen filed Critical Shenzhen Tencent Computer Systems Co Ltd
Priority to CN202210826932.6A priority Critical patent/CN115115910A/en
Publication of CN115115910A publication Critical patent/CN115115910A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/762 using clustering, e.g. of similar faces in social networks
    • G06V 10/764 using classification, e.g. of video objects
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion of extracted features
    • G06V 10/82 using neural networks
    • G06V 2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 Target detection
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a training method, a using method, a device, equipment and a medium of an image processing model, and belongs to the field of artificial intelligence. The image processing model is an integrated model supporting n inputs and n outputs, where n is an integer greater than 1, and the method comprises the following steps: acquiring a basic training set, wherein the basic training set comprises n input images and n label information corresponding to the n input images; inputting the n input images into the image processing model to obtain n pieces of prediction information and a first mixed feature representation; inputting n augmented images into the image processing model to obtain a second mixed feature representation; determining a prediction error loss based on the n prediction information and the n label information; determining a self-supervised learning loss based on the first mixed feature representation and the second mixed feature representation; and training the model parameters of the image processing model based on the prediction error loss and the self-supervised learning loss. The scheme can enhance the feature extraction capability of the image processing model.

Description

Training method, using method, device, equipment and medium of image processing model
Technical Field
The embodiments of the present application relate to the technical field of artificial intelligence, and in particular to a training method, a using method, a device, a computer device, a computer-readable storage medium and a computer program product of an image processing model.
Background
With the continuous and vigorous development of computer technology in recent years, various computer vision tasks such as image classification, object detection and image segmentation have emerged, and image processing technology has therefore gradually attracted attention.
In the related art, an image processing model is trained in advance, feature information of an image is extracted through the image processing model, and an image processing result is analyzed and determined according to the extracted feature information.
However, the feature extraction capability of the image processing model in the related art is insufficient, the feature information of the image cannot be sufficiently extracted, and how to improve the feature extraction capability of the image processing model is an urgent problem to be solved.
Disclosure of Invention
The application provides a training method, a using method, a device, computer equipment, a computer readable storage medium and a computer program product of an image processing model, which can improve the feature extraction capability of the image processing model. The technical scheme is as follows:
in one aspect, a method for training an image processing model is provided, where the image processing model is an integrated model supporting n inputs and n outputs, where n is an integer greater than 1, and the method includes:
acquiring a basic training set, wherein the basic training set comprises n input images and n label information corresponding to the n input images;
inputting the n input images into the image processing model to obtain n pieces of prediction information and a first mixed feature representation; inputting the n augmented images into the image processing model to obtain a second mixed feature representation; the n augmented images are images obtained by augmenting the n input images;
determining a prediction error loss based on the n prediction information and the n label information; and determining a self-supervised learning loss based on the first mixed feature representation and the second mixed feature representation;
training model parameters of the image processing model based on the prediction error loss and the self-supervised learning loss.
In another aspect, a method for using an image processing model is provided, where the image processing model is trained by the above training method of the image processing model, and the method includes:
acquiring an input image to be processed;
inputting n identical input images constructed based on the input images into the image processing model to obtain n pieces of prediction information;
determining a prediction result of the input image based on the n prediction information.
In another aspect, an apparatus for training an image processing model is provided, where the image processing model is an integrated model supporting n inputs and n outputs, where n is an integer greater than 1, and the apparatus includes:
the acquisition module is used for acquiring a basic training set, wherein the basic training set comprises n input images and n label information corresponding to the n input images;
the coding module is used for inputting the n input images into the image processing model to obtain n pieces of prediction information and a first mixed feature representation; inputting the n augmented images into the image processing model to obtain a second mixed feature representation; the n augmented images are images obtained by augmenting the n input images;
a determination module to determine a prediction error loss based on the n prediction information and the n label information; and determine a self-supervised learning loss based on the first mixed feature representation and the second mixed feature representation;
and the training module is used for training the model parameters of the image processing model based on the prediction error loss and the self-supervised learning loss.
In another aspect, an apparatus for using an image processing model is provided, the apparatus comprising:
the acquisition module is used for acquiring an input image to be processed;
the prediction module is used for inputting n same input images constructed based on the input images into the image processing model to obtain n pieces of prediction information;
a determining module for determining a prediction result of the input image based on the n prediction information.
In another aspect, a computer device is provided, the computer device comprising: a processor and a memory, the memory storing a computer program that is loaded and executed by the processor to implement a method of training and/or a method of using an image processing model as described above.
In another aspect, a computer-readable storage medium is provided, which stores a computer program that is loaded and executed by a processor to implement the training method and/or the using method of the image processing model as described above.
In another aspect, a computer program product is provided that includes computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the training method and/or the using method of the image processing model provided by the above aspects.
The beneficial effects that technical scheme that this application embodiment brought include at least:
the training efficiency of the image processing model can be improved by using a lightweight integrated model supporting n inputs and n outputs as the image processing model; by inputting the n input images into the image processing model to obtain n pieces of prediction information and a first mixed feature representation, and inputting the n augmented images into the image processing model to obtain a second mixed feature representation, the image processing model can fully learn the feature information of the input images in the basic training set, and its performance on the basic training set is significantly improved. By training the model parameters of the image processing model based on the prediction error loss and the self-supervised learning loss, self-supervised learning can be integrated into the training process of the image processing model, the image processing model is prevented from over-fitting on the input images of the basic training set, the feature extraction capability of the image processing model can be effectively improved, and learning of downstream tasks is further facilitated.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 illustrates a block diagram of a computer system provided by an exemplary embodiment;
FIG. 2 is a schematic diagram illustrating a method of training an image processing model provided by an exemplary embodiment;
FIG. 3 illustrates a flow chart of a method for training an image processing model provided by an exemplary embodiment;
FIG. 4 illustrates a flow chart of a method of training an image processing model provided by an exemplary embodiment;
FIG. 5 illustrates a flow chart of a method for training an image processing model provided by an exemplary embodiment;
FIG. 6 illustrates a flow chart of a method for training an image processing model provided by an exemplary embodiment;
FIG. 7 illustrates a flow chart of a method for training an image processing model provided by an exemplary embodiment;
FIG. 8 is a diagram illustrating a method of training an image processing model provided by an exemplary embodiment;
FIG. 9 illustrates a flow chart of a method for training an image processing model provided by an exemplary embodiment;
FIG. 10 illustrates a flow chart of a method for training an image processing model provided by an exemplary embodiment;
FIG. 11 illustrates a flow chart of a method of using an image processing model provided by an exemplary embodiment;
FIG. 12 illustrates a flow chart of a method for training an image processing model provided by an exemplary embodiment;
FIG. 13 is a diagram illustrating a method of training an image processing model provided by an exemplary embodiment;
FIG. 14 is a diagram illustrating a method of using an image processing model provided by an exemplary embodiment;
FIG. 15 is a scene diagram illustrating a method of training an image processing model provided by an exemplary embodiment;
FIG. 16 is a block diagram illustrating an exemplary embodiment of an apparatus for training an image processing model;
FIG. 17 is a block diagram illustrating an apparatus for using an image processing model according to an exemplary embodiment;
FIG. 18 illustrates a block diagram of a computing device, provided in an exemplary embodiment.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, displayed data, etc.) referred to in the present application are information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data need to comply with the relevant laws and regulations and standards of the relevant country and region. For example, the input images referred to in this application are obtained with the authorization of the user or with sufficient authorization of the parties.
It will be understood that, although the terms first, second, etc. may be used herein to describe various information, this information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first parameter may also be referred to as a second parameter, and similarly, a second parameter may also be referred to as a first parameter, without departing from the scope of the present application. The word "if" as used herein may be interpreted as "upon" or "when" or "in response to determining", depending on the context.
FIG. 1 shows a block diagram of a computer system provided in an exemplary embodiment of the present application. The computer system can serve as a system architecture for the training method and/or the using method of the image processing model. The computer system may include: a terminal 100 and a server 200.
The terminal 100 may be an electronic device such as a mobile phone, a tablet computer, a vehicle-mounted terminal (in-vehicle head unit), a wearable device, a PC (Personal Computer), an unmanned terminal, and the like. A client running a target application may be installed in the terminal 100; the target application may be an application for training and/or using an image processing model, or another application provided with a function of training and/or using an image processing model, which is not limited in the present application. The form of the target application is not limited in the present application either, and may include, but is not limited to, an App (Application program) installed in the terminal 100, an applet, and the like, and may also be in the form of a web page.
The server 200 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, a cloud database, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a Content Delivery Network (CDN), and big data and artificial intelligence platforms. The server 200 may be a background server of the target application program, and is configured to provide a background service for a client of the target application program.
According to the training method and/or the using method of the image processing model provided by the embodiment of the application, the execution subject of each step can be computer equipment, and the computer equipment refers to electronic equipment with data calculation, processing and storage capabilities. Taking the embodiment implementation environment shown in fig. 1 as an example, the terminal 100 may perform the training method and/or the using method of the image processing model, for example, a client installed in the terminal 100 and running a target application program performs the training method and/or the using method of the image processing model, the server 200 may also perform the training method and/or the using method of the image processing model, or the terminal 100 and the server 200 cooperate with each other to perform the training method and/or the using method of the image processing model, which is not limited in this application.
In addition, the technical scheme of the application can be combined with a block chain technology. For example, some of the data involved in the training method and/or the use method of the image processing model disclosed herein may be saved on the blockchain. The terminal 100 and the server 200 may communicate with each other through a network, such as a wired or wireless network.
The embodiment of the application relates to the technical field of artificial intelligence and computer vision technology.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, involving both hardware-level technologies and software-level technologies. The basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems and mechatronics. Artificial intelligence software technologies mainly include computer vision technology, speech processing technology, natural language processing technology and machine learning/deep learning.
Computer Vision technology (CV) is a science that studies how to make machines "see"; more specifically, it refers to using cameras and computers instead of human eyes to perform machine vision tasks such as recognition, detection and measurement on a target, and further performing graphics processing, so that the computer produces an image more suitable for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can obtain information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, Optical Character Recognition (OCR), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, and also include common biometric technologies such as face recognition and fingerprint recognition.
In the conventional technology, image processing is mainly performed by training an image processing model in advance, extracting feature information of an image through the image processing model, and analyzing and determining an image processing result according to the extracted feature information.
However, the feature extraction capability of the image processing model in the conventional technology is insufficient, the feature information of the image cannot be fully extracted, and how to improve the feature extraction capability of the image processing model is an urgent problem to be solved.
According to the method, an integrated model supporting n input and n output is used as an image processing model, and self-supervision learning is combined in the training process of the image processing model, so that the feature extraction capability of the image processing model is effectively improved, and the learning of downstream tasks is facilitated.
Next, an image processing model in the present application will be described. Fig. 2 is a schematic diagram illustrating a training process of an image processing model according to an embodiment of the present application.
The image processing model 30 is a lightweight integrated model supporting n inputs and n outputs, where n is an integer greater than 1. The image processing model 30 includes: n feature extraction layers, a shared network layer, n prediction layers and a feature mapping layer. The output ends of the n feature extraction layers are respectively connected with the input ends of the shared network layer, the output ends of the shared network layer are respectively connected with the input ends of the n prediction layers, and the output end of the shared network layer is also connected with the input end of the feature mapping layer.
The image processing model 30 processes the n input images as follows:
the n input images are images input to the image processing model 30, the n input images correspond to the n label information, the n input images correspond to the n feature extraction layers one by one, and the n input images are input to the n feature extraction layers to obtain n first feature maps; for example, the input image 300-1 is input to the feature extraction layer 1 to obtain the first feature map 302-1, the input image 300-2 is input to the feature extraction layer 2 to obtain the first feature map 302-2, and the input image 300-n is input to the feature extraction layer n to obtain the first feature map 302-n.
Mixing the n first characteristic graphs to obtain a first mixed characteristic graph; for example, the first feature map 302-1, the first feature map 302-2, and the first feature map 302-n are mixed to obtain a first mixed feature map 304.
Inputting the first mixed feature map into a sharing network layer to obtain a first sharing feature map; for example, the first mixed signature graph 304 is input into the shared network layer to obtain a first shared signature graph 306.
Inputting the first shared characteristic diagram into n prediction layers to obtain n prediction information; for example, the first shared feature map 306 is input to the prediction layer 1 to obtain the 1 st prediction information, the first shared feature map 306 is input to the prediction layer 2 to obtain the 2 nd prediction information, and the first shared feature map is input to the prediction layer n to obtain the nth prediction information.
Inputting the first shared feature map into a feature mapping layer to obtain a first mixed feature representation; for example, the first shared feature map 306 is input to the feature mapping layer to obtain a first mixed feature representation 308.
The image processing model 30 processes the n augmented images as follows:
the n augmented images are obtained by augmenting n input images, and the n augmented images correspond to the n feature extraction layers one by one. Inputting the n augmented images into the n feature extraction layers to obtain n second feature maps; for example, the augmented image 310-2 is input to the feature extraction layer 1 to obtain the second feature map 312-1, the augmented image 310-2 is input to the feature extraction layer 2 to obtain the second feature map 312-2, and the augmented image 310-n is input to the feature extraction layer n to obtain the second feature map 312-n.
Mixing the n second feature maps to obtain a second mixed feature map; for example, the second feature map 312-1, the second feature map 312-2, and the second feature map 312-n are mixed to obtain a second mixed feature map 314.
Inputting the second mixed feature map into a sharing network layer to obtain a second sharing feature map; for example, the second mixed signature graph 314 is input into the shared network layer to obtain a second shared signature graph 316.
Inputting the second shared feature map into the feature mapping layer to obtain a second mixed feature representation; for example, the second shared feature map 316 is input to the feature mapping layer to obtain a second mixed feature representation 318.
Then, a prediction error loss is determined based on the n prediction information and the n label information; a self-supervised learning loss is determined based on the first mixed feature representation and the second mixed feature representation; and the model parameters of the image processing model are trained based on the prediction error loss and the self-supervised learning loss.
The n augmented images may be obtained before the n input images are input to the feature extraction layer, may be obtained simultaneously by inputting the n input images to the feature extraction layer, or may be obtained after the n input images are input to the feature extraction layer. The n input images and the n augmented images can be simultaneously input into the feature extraction layer, or can be sequentially and respectively input into the feature extraction layer, the prediction error loss and the self-supervision learning loss can be determined simultaneously, or can be sequentially and respectively determined, and the time sequence is not limited.
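For illustration only, the following PyTorch-style sketch shows one possible way to organize the n feature extraction layers, the shared network layer, the n prediction layers and the feature mapping layer shown in fig. 2. The layer widths, the simple convolutional blocks and the average-based mixing are assumptions made for brevity, not the exact design described in this application (the later embodiments mix feature maps by proportion or by splicing).

    # Illustrative sketch only: module sizes, encoders and the simple averaging
    # used as the mixing step are assumptions, not the patent's exact design.
    import torch
    import torch.nn as nn

    class IntegratedModel(nn.Module):
        def __init__(self, n=2, num_classes=10, feat_dim=128, proj_dim=64):
            super().__init__()
            self.n = n
            # n feature extraction layers, one per input image
            self.extractors = nn.ModuleList([
                nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                              nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())
                for _ in range(n)])
            # shared network layer encoding the mixed feature map
            self.shared = nn.Sequential(
                nn.Conv2d(32, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten())
            # n prediction layers, each outputting one piece of prediction information
            self.heads = nn.ModuleList([nn.Linear(feat_dim, num_classes) for _ in range(n)])
            # feature mapping layer producing the mixed feature representation
            self.projector = nn.Sequential(nn.Linear(feat_dim, proj_dim), nn.ReLU(),
                                           nn.Linear(proj_dim, proj_dim))

        def forward(self, images):
            # images: list of n tensors, each of shape (batch, 3, H, W)
            feats = [f(x) for f, x in zip(self.extractors, images)]   # n first feature maps
            mixed = torch.stack(feats, dim=0).mean(dim=0)             # mixing (here: simple average)
            shared = self.shared(mixed)                               # first shared feature map
            preds = [head(shared) for head in self.heads]             # n pieces of prediction information
            rep = self.projector(shared)                              # first mixed feature representation
            return preds, rep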
In order to improve the feature extraction capability of the image processing model, the image processing model needs to be trained, and then, the training method of the image processing model will be described by the following embodiments.
FIG. 3 is a flowchart illustrating a training method of an image processing model according to an exemplary embodiment of the present application. The method may be performed by a computer device. The method comprises the following steps:
step 420, a basic training set is obtained, where the basic training set includes n input images and n label information corresponding to the n input images.
The basic training set is a training set used by the image processing model to perform a learning task, and includes a large number of sample images, which may also be referred to as input images. The input images correspond to label information, which may be different depending on the learning task. For example, in the image classification task, the type of the input image may be determined, and then the tag information corresponding to the input image is classification tag information, and in the object detection task, the size or the position of the object in the input image may be determined, and then the tag information corresponding to the input image may be rectangular detection frame tag information, or the like.
Illustratively, a basic training set is obtained, which includes n input images and n label information corresponding to the n input images. The input image may be represented as x_i, where 1 ≤ i ≤ n and i is an integer. The label information corresponding to the input image may be represented as y_i. For example, the label information corresponding to the input image x_1 is y_1.
It should be noted that, in the training process of the image processing model, n input images of the image processing model are different input images, and it is understood that, in the case where the training effect of the image processing model can be ensured, some of the n input images may be the same.
Step 440, inputting n input images into the image processing model to obtain n prediction information and a first mixed feature representation; inputting the n augmented images into an image processing model to obtain a second mixed feature representation; the n augmented images are images obtained by augmenting n input images.
The prediction information is label information of each input image predicted and determined by the image processing model, and the first mixed feature representation is feature representation obtained by mixing feature information of n input images and performing corresponding processing by the image processing model.
Illustratively, the n input images are input into the image processing model to obtain n pieces of prediction information and a first mixed feature representation. The prediction information of the input image x_i may be expressed as ŷ_i, and the first mixed feature representation may be denoted as P. For example, the prediction information of the input image x_1 is ŷ_1, and the first mixed feature representation is denoted as P.
Augmentation, which is mainly amplification, can be used to increase the number of input images. The n augmented images are images obtained by augmenting n input images. The second mixed feature representation is a feature representation obtained by mixing feature information of the n augmented images and performing corresponding processing on the mixed feature information by the image processing model.
Illustratively, n input images may be subjected to the same geometric transformation, resulting in n augmented images. Wherein the geometric transformation comprises at least one of image rotation and image inversion. For example, n input images are each rotated 90 ° clockwise, resulting in n augmented images. The augmented image may be a positive sample of the input image, i.e. the input image and the augmented image of the input image may be referred to as a pair of positive samples.
Illustratively, the augmented image of an input image x_i may be represented as x_i'. The n augmented images are input into the image processing model to obtain a second mixed feature representation, which may be denoted as P'. For example, the augmented image of the input image x_1 is x_1', and the second mixed feature representation is denoted as P'.
Illustratively, the expression form of the first mixed feature representation and the second mixed feature representation includes, but is not limited to, at least one of a feature vector, a feature matrix, a feature value, or bit information.
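As a minimal sketch of the augmentation described above, assuming the geometric transformation is a 90° clockwise rotation applied identically to all n input images (the tensor shapes are illustrative):

    import torch

    def augment_batch(images, k=1):
        # Rotate every input image by k*90 degrees clockwise (the same geometric
        # transformation for all n inputs), producing the n augmented images x_i'.
        # images: list of n tensors of shape (batch, C, H, W)
        return [torch.rot90(x, k=-k, dims=(-2, -1)) for x in images]

    # Example: two 4x4 single-channel input images and their positive samples
    inputs = [torch.arange(16.).reshape(1, 1, 4, 4) for _ in range(2)]
    augmented = augment_batch(inputs)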
Step 460, determining a prediction error loss based on the n prediction information and the n label information; and determining a self-supervised learning loss based on the first mixed feature representation and the second mixed feature representation.
The prediction error loss is an error loss between the n pieces of prediction information corresponding to the n input images and the n pieces of label information. The self-supervised learning loss refers to a training loss determined based on the first mixed feature representation and the second mixed feature representation.
Illustratively, the prediction error loss may be expressed as L_Ens, and the self-supervised learning loss may be expressed as L_SSL.
And step 480, training model parameters of the image processing model based on the prediction error loss and the self-supervision learning loss.
Illustratively, the training loss of the image processing model includes a prediction error loss and an unsupervised learning loss, and based on the prediction error loss and the unsupervised learning loss, the model parameters of the image processing model may be trained with a training target that reduces or minimizes the training loss of the image processing model.
In summary, in the method provided by the embodiment of the present application, using a lightweight integrated model supporting n inputs and n outputs as the image processing model can improve the training efficiency of the image processing model. By inputting the n input images into the image processing model to obtain n pieces of prediction information and a first mixed feature representation, and inputting the n augmented images into the image processing model to obtain a second mixed feature representation, the image processing model can fully learn the feature information of the input images in the basic training set, and its performance on the basic training set is significantly improved. By training the model parameters of the image processing model based on the prediction error loss and the self-supervised learning loss, self-supervised learning can be integrated into the training process of the image processing model, the image processing model is prevented from over-fitting on the input images of the basic training set, the feature extraction capability of the image processing model can be effectively improved, and learning of downstream tasks is further facilitated.
In one example of the present application, step 460 determines a prediction error loss based on the n prediction information and the n label information, which may be implemented as follows: determining n sub-losses based on the n prediction information and the n label information; and determining the prediction error loss based on a weighted sum of the n sub-losses; where the ith sub-loss of the n sub-losses is determined based on the ith prediction information and the ith label information, and i is an integer not greater than n.
The sub-loss is an error loss between the prediction information and the label information corresponding to each input image. Each sub-loss has a corresponding sub-loss weight that can be continuously adjusted during the training of the image processing model. The weighted sum refers to that each sub-loss is multiplied by the corresponding sub-loss weight and then summed.
Illustratively, n sub-losses are determined based on the n prediction information and the n label information, where the ith sub-loss of the n sub-losses is determined based on the ith prediction information and the ith label information, and i is an integer not greater than n.
For example, the ith sub-loss of the n sub-losses may be expressed as ℓ_i, and the corresponding sub-loss weight may be expressed as ω_ri(κ), where r is a hyper-parameter and κ is a ratio value, which may be determined by the area ratio of the feature information of different input images when the feature information of the n input images is mixed by the image processing model. For example, the sub-loss determined based on the prediction information and the label information of the input image x_1 is ℓ_1, and its sub-loss weight is ω_r1(κ).
For example, the prediction error loss determined based on the weighted sum of the n sub-losses may be expressed as:
L_Ens = Σ_{i=1}^{n} ω_ri(κ) · ℓ_i
in this embodiment, the importance of a plurality of sub-losses can be balanced by the sub-loss weights, and the prediction error loss is determined by the weighted sum based on n sub-losses, so that the effective learning rate, the flow mode of the gradient in the network, and the mode of representing the mixed information in the features are adjusted by the weights, which is beneficial to better performing feature extraction by the image processing model.
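A brief sketch of the weighted sub-loss combination, assuming each sub-loss is a cross-entropy between the ith prediction information and the ith label information and that the weights ω_ri(κ) are supplied externally (both of these choices are assumptions for illustration):

    import torch
    import torch.nn.functional as F

    def prediction_error_loss(preds, labels, weights):
        # preds:   list of n tensors (batch, num_classes) -- the n pieces of prediction information
        # labels:  list of n tensors (batch,)             -- the n pieces of label information
        # weights: list of n floats, the sub-loss weights omega_ri(kappa)
        sub_losses = [F.cross_entropy(p, y) for p, y in zip(preds, labels)]  # n sub-losses
        return sum(w * l for w, l in zip(weights, sub_losses))               # weighted sum L_Ens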
In one example, step 460 determines a self-supervised learning loss based on the first mixed feature representation and the second mixed feature representation, including: calculating the self-supervised learning loss with the goal of improving the similarity between the first mixed feature representation and the second mixed feature representation.
Illustratively, the first mixed feature representation is determined based on the n input images and the second mixed feature representation is determined based on the n augmented images; the self-supervised learning loss is calculated with the goal of improving the similarity between the two representations, so that differences between the feature representations of an input image and of its augmented image extracted through the same input channel can be eliminated to a certain extent, and the dependency between different mixed feature representations can be reduced, thereby improving the feature extraction capability of the image processing model.
FIG. 4 is a flowchart illustrating a method for training an image processing model according to an exemplary embodiment of the present application. The method may be performed by a computer device. The step 460 may include:
step 461, calculating a first variance regularization term corresponding to the first mixed feature representation according to the first dimension feature representation composed of feature values at the predetermined dimension of the first mixed feature representation; and calculating a second variance regularization term corresponding to the second mixed feature representation according to the second dimension feature representation composed of the feature values at the preset dimension of the second mixed feature representation.
The first-dimension feature representation is composed of the feature values at a predetermined dimension of the first mixed feature representation, where the predetermined dimension is less than or equal to the feature dimension of the first mixed feature representation. The second-dimension feature representation is composed of the feature values at the predetermined dimension of the second mixed feature representation. The self-supervised learning loss includes a variance regularization term, where a first variance regularization term may be determined based on the first mixed feature representation and a second variance regularization term may be determined based on the second mixed feature representation.
Illustratively, in the training process of the image processing model, the input images in the basic training set are processed in batches. Assuming that the basic training set is denoted as D, given a sample image i sampled from the basic training set, two transformations t and t' are sampled from a distribution T to generate an input image x and an augmented image x', where x = t(i) and x' = t'(i). After the input image x and the augmented image x' are processed by the image processing model, assume that the generated first mixed feature representation is P = [p_1, ..., p_n] and the second mixed feature representation is P' = [p'_1, ..., p'_n]; the first mixed feature representation and the second mixed feature representation each consist of n vectors of dimension d, and p^j denotes the vector composed of the feature values at the predetermined dimension j of all vectors in P. The variance regularization term v is defined as a hinge function on the standard deviation of the embeddings along the batch dimension; the first variance regularization term v(P) and the second variance regularization term v(P') are determined in the same way, expressed as:
v(P) = (1/d) · Σ_{j=1}^{d} max(0, γ − S(p^j, ε))
where d denotes the dimension of the first mixed feature representation, p^j is the first-dimension feature representation, and S denotes the regularized standard deviation, defined as:
S(x, ε) = √(Var(x) + ε)
where γ is a constant target value of the regularized standard deviation, which may be 1 in this embodiment, ε is a small scalar used to prevent numerical instability, and Var(x) denotes the variance of x.
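A direct sketch of the variance regularization term above (PyTorch-style; the row-per-vector tensor layout and the default ε value are assumptions):

    import torch

    def variance_term(P, gamma=1.0, eps=1e-4):
        # P: (n, d) matrix whose rows are the n vectors of a mixed feature representation.
        # Hinge on the regularized standard deviation of each of the d dimensions.
        std = torch.sqrt(P.var(dim=0) + eps)      # S(p^j, eps) for every dimension j
        return torch.relu(gamma - std).mean()     # (1/d) * sum_j max(0, gamma - S)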
Step 462, based on the first sub-mixture feature representation under each feature dimension of the first mixture feature representation and the average value of each first sub-mixture feature representation, determining a first covariance regularization term corresponding to the first mixture feature representation; and determining a second covariance regularization term corresponding to the second mixed feature representation based on the second sub-mixed feature representation under each feature dimension of the second mixed feature representation and an average value of the second sub-mixed feature representations.
The first sub-mixed feature representation refers to the feature representation in each feature dimension of the first mixed feature representation, and the second sub-mixed feature representation refers to the feature representation in each feature dimension of the second mixed feature representation. The self-supervised learning loss includes a covariance regularization term, where the first covariance regularization term is determined based on the first sub-mixed feature representations and the average of the first sub-mixed feature representations, and the second covariance regularization term is determined based on the second sub-mixed feature representations and the average of the second sub-mixed feature representations.
Illustratively, the covariance regularization matrix is denoted as C and the covariance regularization term is denoted as c; the first covariance regularization matrix C(P) is determined in the same way as the second covariance regularization matrix C(P'), and the first covariance regularization term c(P) is determined in the same way as the second covariance regularization term c(P').
Taking the first covariance regularization matrix C(P) as an example, it is defined as:
C(P) = (1/(n−1)) · Σ_{i=1}^{n} (p_i − p̄)(p_i − p̄)^T
where p_i is the first sub-mixed feature representation and p̄ is the average of the first sub-mixed feature representations, expressed as:
p̄ = (1/n) · Σ_{i=1}^{n} p_i
The first covariance regularization term c(P) is defined as the sum of the squares of the off-diagonal elements of the first covariance regularization matrix C(P), expressed as:
c(P) = (1/d) · Σ_{i≠j} [C(P)]_{i,j}²
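A corresponding sketch of the covariance regularization matrix and term (same assumed (n, d) layout as the variance sketch above):

    import torch

    def covariance_term(P):
        # P: (n, d) matrix; penalize the squared off-diagonal entries of the covariance matrix.
        n, d = P.shape
        P_centered = P - P.mean(dim=0, keepdim=True)
        cov = (P_centered.T @ P_centered) / (n - 1)       # C(P), shape (d, d)
        off_diag = cov - torch.diag(torch.diag(cov))
        return (off_diag ** 2).sum() / d                  # c(P)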
step 463, a mean square euclidean distance calculation is performed based on the first sub-mixture feature representation in each feature dimension of the first mixture feature representation and the second sub-mixture feature representation in the same feature dimension of the second mixture feature representation, and an invariance term is determined.
Illustratively, the invariance criterion between the first mixed feature representation and the second mixed feature representation is defined as the mean squared Euclidean distance between each pair of vectors, a pair of vectors referring to the first sub-mixed feature representation in a feature dimension of the first mixed feature representation and the second sub-mixed feature representation in the same feature dimension of the second mixed feature representation. The invariance term s(P, P') is expressed as:
s(P, P') = (1/n) · Σ_{i=1}^{n} ||p_i − p'_i||²
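And a sketch of the invariance term under the same assumed layout:

    import torch

    def invariance_term(P, P_prime):
        # Mean squared Euclidean distance between paired vectors of the two mixed
        # feature representations: s(P, P') = (1/n) * sum_i ||p_i - p'_i||^2
        return ((P - P_prime) ** 2).sum(dim=1).mean()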
Step 464, performing a weighted average calculation according to the first variance regularization term, the second variance regularization term, the first covariance regularization term, the second covariance regularization term and the invariance term to obtain the self-supervised learning loss.
Illustratively, the unsupervised learning loss of the input image x is a weighted average of the first variance regularization term, the second variance regularization term, the first covariance regularization term, the second covariance regularization term, and the invariance term, and is expressed as:
l(P, P') = λ·s(P, P') + μ·[v(P) + v(P')] + ν·[c(P) + c(P')]
where λ, μ and ν are hyper-parameters that control the importance of each term in the self-supervised learning loss. In this embodiment, ν = 1 and λ = μ > 1 may be set.
Assuming that the basic training set is denoted as D and two transformations t and t' are sampled from the distribution T, the overall self-supervised learning loss over all input images is expressed as:
L_SSL = Σ_{I∈D} Σ_{t,t'∼T} l(P^I, P'^I)
where P^I and P'^I refer to the first mixed feature representation and the second mixed feature representation determined based on a batch of input images I input into the image processing model.
In this embodiment, the self-supervised learning loss of the image processing model is regularized using an invariance term that learns invariance to data transformations, a variance regularization term that prevents norm collapse, and a covariance regularization term that decorrelates the different dimensions of the feature vectors to prevent informational collapse, so that better training stability and performance improvement can be obtained on downstream tasks, the feature extraction capability of the image processing model is effectively improved, and learning of downstream tasks is facilitated.
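Putting the three terms together gives the per-batch self-supervised learning loss. This sketch reuses the helper functions sketched above; the numeric defaults are merely example values satisfying ν = 1 and λ = μ > 1 and are not taken from this application:

    def self_supervised_loss(P, P_prime, lam=25.0, mu=25.0, nu=1.0):
        # l(P, P') = lam * s(P, P') + mu * [v(P) + v(P')] + nu * [c(P) + c(P')]
        return (lam * invariance_term(P, P_prime)
                + mu * (variance_term(P) + variance_term(P_prime))
                + nu * (covariance_term(P) + covariance_term(P_prime)))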
In one example of the present application, step 480 trains model parameters of the image processing model based on the prediction error loss and the self-supervised learning loss, which can be implemented as:
calculating the product of the self-supervised learning loss and an adjustment weight; determining the sum of the product and the prediction error loss as the training loss; and training the model parameters of the image processing model by using an error back-propagation algorithm with the aim of reducing the training loss.
The adjustment weight refers to the weight of the self-supervised learning loss and can be determined in the training process of the image processing model. The training objective of the image processing model is to reduce the training loss; specifically, the model parameters of the image processing model can be trained through an error back-propagation algorithm.
Illustratively, the adjustment weight is denoted as γ, and the training loss of the image processing model is expressed as:
L = L_Ens + γ·L_SSL
In this embodiment, the product of the adjustment weight and the self-supervised learning loss is determined, and the training loss of the image processing model is determined by combining it with the prediction error loss, so that self-supervised learning can be fused into the training process of the image processing model, the feature extraction capability of the image processing model is improved, the image processing model can learn general feature expressions for various downstream tasks, and learning of downstream tasks is further facilitated.
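A sketch of one training step that combines the two losses with the adjustment weight γ and back-propagates the total loss. It reuses the model and loss sketches above; the optimizer handling and the default value of γ are assumptions for illustration:

    def train_step(model, optimizer, images, aug_images, labels, weights, gamma=0.1):
        # Forward pass on the n input images and on the n augmented images.
        preds, P = model(images)          # n prediction information + first mixed feature representation
        _, P_prime = model(aug_images)    # second mixed feature representation
        loss_ens = prediction_error_loss(preds, labels, weights)
        loss_ssl = self_supervised_loss(P, P_prime)
        loss = loss_ens + gamma * loss_ssl            # L = L_Ens + gamma * L_SSL
        optimizer.zero_grad()
        loss.backward()                               # error back-propagation
        optimizer.step()
        return loss.item()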
FIG. 5 is a flowchart illustrating a method for training an image processing model according to an exemplary embodiment of the present application. The method may be performed by a computer device. The image processing model comprises n feature extraction layers, a shared network layer and n prediction layers; step 440 may include:
step 441, inputting n input images into n feature extraction layers to obtain n first feature maps; the n input images correspond to the n feature extraction layers one to one.
The feature extraction layer refers to a network structure layer capable of extracting feature information of the input image, and the feature extraction layer may be a separate network structure layer, such as a convolutional layer, or an independent neural network, such as at least one of a residual error network, a convolutional neural network, a cyclic neural network, and a long-term and short-term memory network.
Illustratively, n input images correspond one-to-one to n feature extraction layers. Inputting n input images into n feature extraction layers to obtain n first feature maps. For example, the 1 st input image is input to the corresponding 1 st feature extraction layer to obtain the 1 st first feature map, and the nth input image is input to the corresponding nth feature extraction layer to obtain the nth first feature map.
Illustratively, the first feature map corresponding to the input image x_i is denoted as l_i.
Step 442, mixing the n first feature maps to obtain a first mixed feature map.
The blending process is to blend the n first feature maps, and the blended first feature map is referred to as a first blended feature map.
For example, the mixing process may be at least one of the following: mixing the n first feature maps in proportion (mixup); randomly filling partial image areas of the n first feature maps with 0 pixel values (cutout); and splicing and mixing different image areas of the n first feature maps (cutmix).
For example, after the n first feature maps are subjected to the mixing process, the first mixed feature map is represented as M.
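As an illustration of the first option, a minimal sketch of proportional (mixup-style) mixing of two first feature maps; drawing the ratio κ from a Beta distribution is an assumption, not a requirement of this application:

    import torch

    def mixup_feature_maps(l1, l2, alpha=2.0):
        # Proportionally mix two first feature maps; kappa is the mixing ratio.
        kappa = torch.distributions.Beta(alpha, alpha).sample()
        return kappa * l1 + (1.0 - kappa) * l2, kappa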
Step 443, inputting the first mixed feature map into the shared network layer to obtain a first shared feature map.
The shared network layer refers to a network structure layer for encoding the hybrid feature map, and in this embodiment, the hybrid feature map refers to the first hybrid feature map. The shared network layer may be a separate network structure layer, such as a convolutional layer, or may be an independent neural network, such as at least one of a residual network, a convolutional neural network, a cyclic neural network, and a long-short term memory network.
Illustratively, the first mixed feature map is input into the shared network layer to obtain a first shared feature map. The first shared feature map is denoted as z.
Step 444, inputting the first shared feature map to n prediction layers to obtain n prediction information.
The prediction layer is a network structure layer for predicting the label information corresponding to the first shared characteristic diagram. The prediction layer may be a separate network structure layer, such as a fully-connected layer, or may be an independent neural network, such as at least one of a residual network, a convolutional neural network, a cyclic neural network, and a long-short term memory network.
Illustratively, the image processing model includes n feature extraction layers, and correspondingly includes n prediction layers. And respectively inputting the first shared characteristic graph into n prediction layers, wherein each prediction layer can output one piece of prediction information to obtain n pieces of prediction information.
In this embodiment, the n first feature maps are mixed to obtain the first mixed feature map, so that feature enhancement can be achieved, input images used for training are more diversified, and subsequent processing and prediction are performed based on the first mixed feature map, so that the learning capability of the image processing model can be improved, and the feature extraction capability of the image processing model can be enhanced.
FIG. 6 is a flowchart illustrating a method for training an image processing model according to an exemplary embodiment of the present application. The method may be performed by a computer device. The image processing model comprises n feature extraction layers, a shared network layer and n prediction layers; step 440 may include:
step 441, inputting n input images into n feature extraction layers to obtain n first feature maps; the n input images correspond to the n feature extraction layers one to one.
Step 442, mixing the n first feature maps to obtain a first mixed feature map.
Step 443, inputting the first mixed feature map into the shared network layer to obtain a first shared feature map.
For example, in the above steps 441 to 443, the manner of inputting n input images into n feature extraction layers to obtain the first shared feature map is consistent with the content of the foregoing exemplary embodiments, and is not repeated herein.
Step 445, inputting the first shared feature map into the feature mapping layer to obtain a first mixed feature representation.
The feature mapping layer is a network structure layer for mapping the first shared feature map. The feature mapping layer may be a separate network structure layer, such as a convolutional layer, or may be an independent neural network, such as at least one of a Multi-Layer Perceptron (MLP), a convolutional neural network, a recurrent neural network, and a long short-term memory network.
Illustratively, the first shared feature map is input to a feature mapping layer for mapping processing, so as to obtain a first mixed feature representation, which is denoted as P.
In this embodiment, by obtaining the first mixed feature representation, a self-supervised learning loss can subsequently be constructed in combination with the second mixed feature representation, so as to integrate self-supervised learning into the training process of the image processing model and improve the feature extraction capability of the image processing model.
In an example of the present application, the implementation manner of the mixing process of step 442 may be:
determining a two-dimensional mask image which is used in a mixing way at the time; and splicing and mixing different image areas in the n first characteristic diagrams through the two-dimensional mask diagram to obtain a first mixed characteristic diagram.
For example, in this embodiment, when the n first feature maps are spliced and mixed, the mixing manner used is as follows: and (4) carrying out splicing and mixing (cutmix) on different image areas of the n first feature maps.
Taking 2 input images as an example, which are a cat image and a dog image respectively, the first feature map corresponding to the cat image and the first feature map corresponding to the dog image are mixed; for example, the body region of the cat in the first feature map corresponding to the cat image and the head region of the dog in the first feature map corresponding to the dog image are mixed, so that a first mixed feature map composed of the feature maps of the head of the dog and the body of the cat can be obtained.
The two-dimensional mask map, also called mask, may be used to mask a part or all of the image area of the first feature map, so as to control the image processing area or the image processing process.
For example, for the n first feature maps, the position and size of the region taken from each first feature map are different, and there may be one two-dimensional mask map for each first feature map. The two-dimensional mask map may be a binary two-dimensional mask map composed of 0 and 1; the two-dimensional mask map corresponding to the i-th first feature map may be denoted as m_i.
The size of the two-dimensional mask map is identical to the size of the first feature map.
Exemplarily, the two-dimensional mask maps used in the present mixing are determined; different image areas in the n first feature maps are spliced and mixed through the two-dimensional mask maps to obtain a first mixed feature map, which can be expressed as:

M(l_1, …, l_n) = m_1 ⊙ l_1 + … + m_n ⊙ l_n

where ⊙ denotes element-wise multiplication. Illustratively, taking 2 input images as an example, with the first feature map l_1 of the 1st input image and the first feature map l_2 of the 2nd input image, the two-dimensional mask map corresponding to the first feature map l_1 is m_1, and the two-dimensional mask map corresponding to the first feature map l_2 is m_2. Different image areas in the 2 first feature maps are spliced and mixed through the two-dimensional mask maps to obtain a first mixed feature map M(l_1, l_2), which can be expressed as:

M(l_1, l_2) = m_1 ⊙ l_1 + m_2 ⊙ l_2
In this embodiment, through the above mixing processing manner, no pixels carrying no image feature information (i.e., 0-filled pixels) appear in the first mixed feature map, which can improve the training efficiency of the image processing model. The n first feature maps are mixed, and the obtained first mixed feature map is subsequently used for training the image processing model, so that the image processing model is required to learn from a local view of an input image, which further enhances the positioning capability and learning capability of the image processing model and improves the feature extraction capability.
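Illustratively, the splicing and mixing with complementary binary masks described above may be sketched as follows. The rectangular cut region, tensor shapes and helper names are assumptions for illustration.

```python
import torch

def make_complementary_masks(height: int, width: int, cut_ratio: float = 0.5):
    """Build two complementary binary 2D mask maps (an assumed rectangular cut)."""
    m1 = torch.ones(height, width)
    ch, cw = int(height * cut_ratio), int(width * cut_ratio)
    top = torch.randint(0, height - ch + 1, (1,)).item()
    left = torch.randint(0, width - cw + 1, (1,)).item()
    m1[top:top + ch, left:left + cw] = 0.0   # region taken from the second feature map
    m2 = 1.0 - m1                            # complementary mask: m1 + m2 is all ones
    return m1, m2

def cutmix_feature_maps(l1: torch.Tensor, l2: torch.Tensor, m1, m2):
    """Splice-and-mix two first feature maps; no 0-valued 'empty' pixels are introduced."""
    return m1 * l1 + m2 * l2

# usage sketch with assumed shapes
l1 = torch.randn(1, 64, 32, 32)   # first feature map of input image 1
l2 = torch.randn(1, 64, 32, 32)   # first feature map of input image 2
m1, m2 = make_complementary_masks(32, 32)
mixed = cutmix_feature_maps(l1, l2, m1, m2)
```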
FIG. 7 is a flowchart illustrating a method for training an image processing model according to an exemplary embodiment of the present application. The method may be performed by a computer device. The image processing model comprises n feature extraction layers, a shared network layer and a feature mapping layer; step 440 may include:
step 446, inputting the n augmented images into the n feature extraction layers to obtain n second feature maps; the n augmented images correspond to the n feature extraction layers one to one.
Illustratively, an input image and an augmented image of the input image may be considered as a set of images. The n input images correspond to the n feature extraction layers one to one, that is, the n augmented images correspond to the n feature extraction layers one to one.
And inputting the n augmented images into the n feature extraction layers to obtain n second feature maps. For example, the 1 st augmented image is input to the corresponding 1 st feature extraction layer to obtain the 1 st second feature map, and the nth augmented image is input to the corresponding nth feature extraction layer to obtain the nth second feature map.
Illustratively, the second feature map corresponding to an augmented image x_i′ is denoted as l_i′.
And step 447, mixing the n second feature maps to obtain a second mixed feature map.
Illustratively, after the n second feature maps are subjected to the blending process, the second blended feature map is represented as M'.
Step 448, the second mixed feature map is input into the shared network layer to obtain a second shared feature map.
Illustratively, the second mixed feature map is input into the shared network layer to obtain a second shared feature map. The second shared feature map is denoted as z′.
Step 449, inputting the second shared feature map into the feature mapping layer to obtain a second mixed feature representation.
Exemplarily, the second shared feature map is input to the feature mapping layer for mapping processing, so as to obtain a second mixed feature representation, which is denoted as P'.
In this embodiment, by obtaining the second mixed feature representation, the self-supervised learning loss can be constructed by subsequently combining the first mixed feature representation, so as to fuse the self-supervised learning to the training process of the image processing model, and improve the feature extraction capability of the image processing model.
In an example of the present application, an implementation manner of the mixing process of step 447 above may be consistent with an implementation manner of the mixing process of step 442, and specifically may be:
determining a two-dimensional mask image used in a mixing way at the time; and splicing and mixing different image areas in the n second feature maps through the two-dimensional mask map to obtain a second mixed feature map.
For example, in this embodiment, when the n second feature maps are spliced and mixed, the mixing manner used is as follows: and performing splicing and mixing (cutmix) on different image areas of the n second feature maps. The two-dimensional mask patterns used by the first feature pattern and the second feature pattern are the same for the same set of n input images and augmented images.
Taking 2 input images as an example, the input images are a cat image and a dog image respectively, when the first mixed feature map is generated, the body area of the cat in the first feature map corresponding to the cat image and the head area of the dog in the first feature map corresponding to the dog image are mixed, and the first mixed feature map composed of the feature maps of the head of the dog and the body of the cat can be obtained. When the second mixed feature map is generated, the body area of the cat in the second feature map corresponding to the augmented image of the cat image and the head area of the dog in the second feature map corresponding to the augmented image of the dog image are mixed, and a second mixed feature map composed of the head of the dog and the feature map of the body of the cat can be obtained.
In this embodiment, by a mixing processing manner consistent with the obtaining of the first mixed feature map, the obtained first mixed feature map and the second mixed feature map are still a pair of positive samples, which is beneficial to incorporating self-supervision learning in the training process of the image processing model, so as to further enhance the feature extraction capability of the image processing model.
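Illustratively, reusing the same pair of two-dimensional mask maps for the feature maps of the input images and of the augmented images, as described above, may be sketched as follows; the left/right split mask and tensor shapes are assumptions for illustration.

```python
import torch

# Reuse one pair of complementary masks for both views so that
# M(l1, l2) and M(l1', l2') remain a positive pair (illustrative sketch).
m1 = torch.zeros(32, 32); m1[:, :16] = 1.0   # assumed left/right split mask
m2 = 1.0 - m1

l1, l2   = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)   # input-image feature maps
l1a, l2a = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)   # augmented-image feature maps

first_mixed  = m1 * l1 + m2 * l2     # first mixed feature map M
second_mixed = m1 * l1a + m2 * l2a   # second mixed feature map M', same masks as above
```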
In an actual image processing application scenario, as the image processing task is updated, new types of images to be processed continue to appear over time. A new type of image here refers to an image whose type is new relative to the existing type sample images in the basic training set used when training the image processing model, so the image processing model needs to be trained continuously.
For example, model parameter training is performed on the image processing model by using a basic training set, where the basic training set only includes images of cats, dogs, and people, the trained image processing model can recognize images of cats, dogs, and people, and after the image processing task is updated, the trained image processing model needs to recognize images of tables and chairs, so that the image processing model needs to be trained continuously by using the new image, which may also be referred to as small sample incremental learning.
Existing small sample incremental learning can be divided into two categories. One class of methods keeps the old image knowledge of the learned base training set as much as possible in the new learning task through various regularization, sample replay, or model distillation methods. Another method is to train a good feature extractor, and when facing a new type of sample image, fix the feature extractor and insert the new type of sample image into the original space correctly and quickly.
For example, referring to fig. 8, the conventional method employs three training phases, and a feature extractor is obtained by training in the first training phase. The second training phase is a pseudo-incremental learning phase, and a graph attention network is trained, wherein the graph attention network is based on an attention mechanism and can absorb a new type of sample image and enable the new type of sample image to be rapidly inserted into an original space. The second training phase also utilizes the training data of the first training phase. The third training stage is an incremental learning stage, the feature extractor is fixed, and the graph attention network can calculate the attention of the new type sample image and the existing graph network and help the new type sample image to be inserted into the original space.
However, the above method only calculates attention according to the average features of the new type of sample images, and does not fully utilize the new type of sample images to train the model parameters. As such, in small sample learning scenarios, this approach tends to result in over-fitting or under-fitting of the feature extractor to new types of sample images. Moreover, when the difference between the new type sample images and the existing type sample images is very large, the calculated attention becomes even more inaccurate.
In contrast, the image processing model adopted in the embodiments of the present application, combined with self-supervised learning, can obtain a differentiated image processing model that focuses on the sample image data, thereby avoiding overfitting to the input images in the basic training set and facilitating subsequent incremental learning tasks. In subsequent incremental learning tasks, a mixed training mode in which the new type sample images and the existing type sample images in the basic training set enhance each other is adopted, which can prevent overfitting and consolidate the performance of the image processing model on the old type tasks in the basic training set.
Next, a training method of incremental learning of an image processing model will be described by the following embodiments. FIG. 9 is a flowchart illustrating a method for training an image processing model according to an exemplary embodiment of the present application. The method may be performed by a computer device. The method further comprises the following steps:
Step 520, obtaining an incremental training set, wherein the incremental training set comprises: a new type sample image and label information of the new type sample image, the new type sample image being a sample image of a newly added image type.
The incremental training set is a training set used for performing incremental training for adjusting model parameters after training the model parameters of the image processing model using the basic training set.
Illustratively, the incremental training set is not simply the basic training set with additional sample images added. The incremental training set includes a new type sample image and label information of the new type sample image, the new type sample image being a sample image of a newly added image type. For example, assuming that the basic training set includes only 3 types of images of cats, dogs, and people and their corresponding label information, the incremental training set may include 2 additional types of images of tables and chairs and their corresponding label information.
And 540, acquiring the existing type sample image and the label information of the existing type sample image in the basic training set.
The sample images in the base training set may be referred to as existing type images relative to the incremental training set.
Illustratively, the existing type sample image and the label information of the existing type sample image are obtained in the basic training set. The acquired sample images of the existing type are part of the sample images in the base training set.
And step 560, mixing the new type sample image and the existing type sample image to obtain n input images, and mixing the label information of the new type sample image and the label information of the existing type sample image to obtain the label information of the n input images.
In the incremental training process of the image processing model, a mixed training set obtained by mixing the new type sample image and the existing type sample image is used for realizing mixed training of the image processing model by using the new type sample image and the existing type sample image.
Illustratively, n input images are blended using the new-type sample image and the existing-type sample image, and label information of the n input images is blended using label information of the new-type sample image and label information of the existing-type sample image.
Illustratively, an input image may be represented as x_i, and its corresponding label information is represented as y_i.
Step 580, performing incremental training on the image processing model using the n input images and the label information of the n input images.
Illustratively, the image processing model is incrementally trained using the n input images and the label information of the n input images. The incremental training may also use an error back-propagation algorithm.
In this embodiment, n input images are obtained by mixing the new type sample image and the existing type sample image, and the image processing model is subjected to mixed training by using the new type sample image and the existing type sample image, so that the performance of the image processing model on the old type task in the basic training set can be consolidated, and catastrophic forgetting is avoided.
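Illustratively, assembling the n input images from the new type samples and the existing type samples may be sketched as follows; the sampling strategy, the helper name build_mixed_batch and the placeholder data are assumptions for illustration.

```python
import random

def build_mixed_batch(new_samples, old_samples, n: int):
    """Mix new-type and existing-type (image, label) pairs into n training inputs.

    A minimal sketch; the shuffling strategy and mixing ratio are assumptions.
    """
    pool = list(new_samples) + list(old_samples)
    random.shuffle(pool)
    batch = pool[:n]
    images = [img for img, _ in batch]
    labels = [lbl for _, lbl in batch]
    return images, labels

# usage sketch with hypothetical placeholder data
new_samples = [("table_img_1", "table"), ("chair_img_1", "chair")]
old_samples = [("cat_img_7", "cat"), ("dog_img_3", "dog")]
images, labels = build_mixed_batch(new_samples, old_samples, n=2)
```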
FIG. 10 is a flowchart illustrating a method for training an image processing model according to an exemplary embodiment of the present application. The method may be performed by a computer device. The step 540 can be implemented as:
step 541, clustering the input images in the basic training set to obtain at least one cluster type.
Clustering refers to grouping input images with higher similarity in the basic training set into the same class, which may also be called a cluster category. The similarity between input images in the same cluster category is high, and the similarity between input images in different cluster categories is low.
Illustratively, the input images in the basic training set are clustered according to the image types of the input images in the basic training set to obtain at least one cluster category.
Illustratively, the clustering algorithm employed for clustering may be at least one of K-Means (K-Means) clustering, hierarchical clustering, and density clustering.
Step 542, sampling a part of input images from each cluster category in at least one cluster category, and determining the input images as existing type sample images; and determining label information of a part of the input image as label information of the existing type sample image.
Illustratively, a preset number of input images are sampled from each of the at least one cluster category and determined as the existing type sample images. The preset sampling number of each cluster category may be the same or different, and the preset sampling number required for each cluster category may be determined according to the actual application scenario of the image processing model. After the part of the input images is extracted, the label information of the part of the input images may be determined as the label information of the existing type sample images.
In the embodiment, the clustering categories are determined by clustering the input images of the basic training set, and a part of images are extracted from each clustering category to serve as the existing type sample images, so that the existing type sample images are subsequently used as the incremental training set, thereby effectively avoiding storing and using all sample images in the basic training set and improving the training efficiency of the incremental training of the image processing model. And all types of sample images in the basic training set can be covered as much as possible so as to reduce the forgetting degree of the image processing model on the learned sample images of the existing types.
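Illustratively, clustering the basic training set and sampling a part of the images from each cluster category may be sketched as follows, here with K-Means over per-image feature vectors. The feature representation, helper name and sampling sizes are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def sample_existing_exemplars(features: np.ndarray, labels: np.ndarray,
                              num_clusters: int = 5, per_cluster: int = 2):
    """Cluster base-training-set images and sample a few from each cluster category.

    `features` are per-image feature vectors (an assumed clusterable representation),
    `labels` hold the corresponding label information.
    """
    cluster_ids = KMeans(n_clusters=num_clusters, n_init=10).fit_predict(features)
    chosen = []
    for c in range(num_clusters):
        members = np.flatnonzero(cluster_ids == c)
        if len(members) == 0:
            continue
        picked = np.random.choice(members, size=min(per_cluster, len(members)),
                                  replace=False)
        chosen.extend(picked.tolist())
    return features[chosen], labels[chosen]

# usage sketch with stand-in data
feats = np.random.randn(100, 64)            # per-image feature vectors (placeholder)
labels = np.random.randint(0, 3, size=100)  # label information (placeholder)
exemplar_feats, exemplar_labels = sample_existing_exemplars(feats, labels)
```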
FIG. 11 is a flowchart illustrating a method for training an image processing model according to an exemplary embodiment of the present application. The method may be performed by a computer device. The above step 580 can be implemented as:
step 581, inputting n input images into the image processing model to obtain n prediction information.
For example, the manner of obtaining n pieces of prediction information in step 581 is consistent with the manner of obtaining n pieces of prediction information in step 440, and is not described herein again.
At step 582, a prediction error loss is determined based on the n prediction information and the n label information.
For example, the determination of the prediction error loss in step 582 is consistent with the determination of the prediction error loss in step 460, and will not be described herein again.
And 583, performing incremental training on model parameters of the image processing model based on the prediction error loss.
Illustratively, no self-supervised learning is used during the incremental training of the image processing model; the training loss of the image processing model includes only the prediction error loss. Based on the prediction error loss, the model parameters of the image processing model are incrementally trained by an error back-propagation algorithm with the aim of reducing the prediction error loss.
In the embodiment, in the incremental training process of the image processing model, because the number of the new type sample images is small, the image processing model is not subjected to self-supervision learning during the incremental training, and overfitting of the image processing model to the new type sample images can be effectively avoided.
In one example, the image processing model includes a shared network layer and n prediction layers, each of the n prediction layers including m prediction dimensions, the method further including:
expanding m prediction dimensions of each prediction layer into m + k prediction dimensions, wherein k is the type number of the new type sample image; and initializing network parameters of k newly added prediction dimensions in each prediction layer by adopting the average value of a third shared characteristic graph output by the shared network layer to the new type sample image.
After the model parameters of the image processing model are trained by using the basic training set, before the incremental training of the image processing model is performed by using the incremental training set, network parameters of a prediction layer of the image processing model need to be initialized, so that the prediction layer can adapt to a new type of sample image.
Illustratively, after training the model parameters of the image processing model using the base training set, each of the n prediction layers includes m prediction dimensions. And when the number of the types of the new type sample images of the incremental training set is k, expanding the m prediction dimensions of each prediction layer into m + k prediction dimensions. And initializing the network parameters of k newly added prediction dimensions in each prediction layer by using the average value of a third shared characteristic graph output by the shared network layer to the new type sample image.
For example, if there are 5 types of existing type sample images in the basic training set, the prediction dimension of each prediction layer is 512 × 5, and when 2 types of new type sample images are added, the prediction dimension of each prediction layer needs to be expanded to 512 × 7, and the initialized network parameters of the prediction layer in the added 2 dimensions are the average value of the third shared feature map output by the shared network layer on the new type sample images.
In this embodiment, by initializing the network parameters of the k newly added prediction dimensions in each prediction layer, the prediction layers can better adapt to the incremental training set, so that the prediction information obtained by the prediction layers during incremental training is more accurate, which improves the training effect of the image processing model.
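Illustratively, expanding a prediction layer from m to m + k prediction dimensions and initializing the k new dimensions with the average of the third shared feature maps of the new type sample images may be sketched as follows; the Linear-layer form and the simplified bias handling are assumptions for illustration.

```python
import torch
import torch.nn as nn

def expand_prediction_layer(old_head: nn.Linear, new_class_feats: torch.Tensor) -> nn.Linear:
    """Expand an m-way prediction layer to m + k prediction dimensions.

    `new_class_feats` holds, per new class, the average of the third shared feature
    maps output by the shared network layer for that class (assumed shape [k, feat_dim]).
    """
    m, feat_dim = old_head.out_features, old_head.in_features
    k = new_class_feats.shape[0]
    new_head = nn.Linear(feat_dim, m + k, bias=False)
    with torch.no_grad():
        new_head.weight[:m] = old_head.weight   # keep the existing m prediction dimensions
        new_head.weight[m:] = new_class_feats   # initialize the k new dimensions with mean features
    return new_head

# usage sketch: 5 existing classes, 2 new classes, 512-dimensional shared features
old_head = nn.Linear(512, 5, bias=False)
mean_feats = torch.randn(2, 512)   # stand-in for per-class mean shared feature maps
new_head = expand_prediction_layer(old_head, mean_feats)
```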
FIG. 12 illustrates a flow chart of a method for using an image processing model provided by an exemplary embodiment of the present application. The method may be performed by a computer device. The method comprises the following steps:
step 620, an input image to be processed is obtained.
Illustratively, the input image may be an image acquired in any manner, and the image acquired in any manner may be an image newly captured and input by a user, an image acquired from a public data set, or the like.
Step 640, inputting n identical input images constructed based on the input images into the image processing model to obtain n pieces of prediction information.
Illustratively, during use of the image processing model, the n input images of the image processing model are the same input image. Wherein the input image may be copied n-1 times resulting in n-1 copies of the input image for constructing n identical input images based on the input image.
Then, n identical input images constructed based on the input images are input into the image processing model to obtain n pieces of prediction information.
Step 660, determining a prediction result of the input image based on the n prediction information.
For example, after obtaining n prediction information, the prediction result of the input image may be determined comprehensively based on the n prediction information.
In summary, in the method provided in this embodiment, the image processing model with a strong feature extraction capability is obtained through pre-training, and when the image processing model is used to process n identical input images, the accuracy of the obtained n pieces of prediction information is better ensured. By combining the n pieces of prediction information and determining the prediction result of the input image, the accuracy of the prediction result can be improved, and the image processing efficiency of the input image can be effectively improved.
In one example of the present application, when the image processing model is applied in an image classification task, the image processing model is an image classification model having m prediction classifications. In the image classification task, the prediction information may be a probability distribution of m prediction classifications of the input image. At this time, step 660 may include:
superposing the prediction probabilities belonging to the same prediction classification in the probability distribution of the n pieces of prediction information to obtain m superposed prediction probabilities; and determining the prediction classification corresponding to the maximum value of the m overlapped prediction probabilities as the image classification of the input image.
Illustratively, the number of the n prediction information containing prediction classifications is m, and the prediction probabilities belonging to the same prediction classification in the probability distribution of the n prediction information can be superimposed to obtain m superimposed prediction probabilities. Then, the maximum value of the m superimposed prediction probabilities may be determined, and the prediction classification corresponding to the maximum value of the m superimposed prediction probabilities may be determined as the image classification of the input image.
For example, in the case where an input image corresponds to 2 prediction information items and the prediction information includes 3 prediction classifications, the probability distribution of the 1st prediction information indicates that the input image is a cat with a prediction probability of 80%, a dog with 15%, and a tiger with 5%, and the probability distribution of the 2nd prediction information indicates that the input image is a cat with a prediction probability of 70%, a dog with 20%, and a tiger with 10%. Since the superimposed prediction probability of identifying the input image as a cat is 150%, that of a dog is 35%, and that of a tiger is 15%, the image classification of the input image is determined to be cat.
In this embodiment, in the image classification task, the image classification of the input image is finally determined by superimposing the prediction probabilities belonging to the same prediction classification, and the accuracy of the image classification result can be improved.
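Illustratively, superimposing the prediction probabilities of the same prediction classification across the n outputs and taking the maximum may be sketched as follows; the array shapes and example values mirror the cat/dog/tiger example above and are illustrative only.

```python
import numpy as np

def classify_with_ensemble(prob_distributions):
    """Sum the prediction probabilities of the same class across the n outputs
    and return the class with the largest superimposed probability (sketch)."""
    stacked = np.stack(prob_distributions)   # shape [n, m]
    superimposed = stacked.sum(axis=0)       # m superimposed prediction probabilities
    return int(np.argmax(superimposed)), superimposed

# usage sketch matching the example above: classes [cat, dog, tiger]
p1 = np.array([0.80, 0.15, 0.05])
p2 = np.array([0.70, 0.20, 0.10])
best_class, scores = classify_with_ensemble([p1, p2])   # best_class -> 0 (cat)
```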
In another example of the present application, when the image processing model is applied in the object detection task, the image processing model is an object detection model having m prediction detection frames. In the target detection task, the prediction information may be m prediction detection frames of the input image. At this time, step 660 may include:
determining confidence degrees corresponding to the m prediction detection frames; superposing the confidence coefficients belonging to the same prediction detection frame in the n prediction information to obtain m superposed confidence coefficients; and determining the prediction detection frame corresponding to the maximum value of the m superposition confidence coefficients as a target detection frame of the input image.
In this embodiment, in the target detection task, the target detection frame of the input image is finally determined by superimposing the confidence degrees belonging to the same prediction detection frame, so that the accuracy of the target detection result can be improved.
Taking the image processing model applied to the image classification task as an example, the image processing model is an image classification model. In the following embodiments, a model structure of an image classification model, a training method of the image classification model, and a using method of the image classification model according to the present application are respectively described with reference to a specific image classification model.
1. Model structure of image classification model
Referring to fig. 13, the image classification model shown in fig. 13 is an integrated model supporting 2-input-2-output. The image classification model comprises: 2 convolutional layers (convolutional layer 1 and convolutional layer 2), a shared network, 2 fully-connected layers (fully-connected layer 1 and fully-connected layer 2), and an MLP network.
The output ends of the convolutional layers 1 and 2 are respectively connected with the input end of a shared network, the output end of the shared network is respectively connected with the input ends of the full connection layer 1 and the full connection layer 2, and the output end of the shared network is also connected with the input end of the MLP network. Illustratively, the shared network may be a residual network.
2. Training process of image classification model
2.1 training of image classification models Using the basic training set
With continued reference to fig. 13, assume that the basic training set includes 2 input images, i.e., input image 1 and input image 2. The input image 1 and the input image 2 are different types of input images.
The input image 1 is represented as x_1 and the input image 2 as x_2. The label information corresponding to the input image 1 is represented as y_1, and the label information corresponding to the input image 2 as y_2. An augmented image 1 is obtained by rotating the input image 1 by 90° clockwise, and an augmented image 2 is obtained by rotating the input image 2 by 90° clockwise; the augmented image 1 is represented as x_1′ and the augmented image 2 as x_2′.
The input image 1 is input into the convolution layer 1 to obtain a first feature map l_1; the input image 2 is input into the convolution layer 2 to obtain a first feature map l_2. The first feature map l_1 and the first feature map l_2 are mixed to obtain a first mixed feature map M(l_1, l_2), expressed as:

M(l_1, l_2) = m_1 ⊙ l_1 + m_2 ⊙ l_2

where m_1 is the two-dimensional mask map corresponding to the first feature map l_1 and m_2 is the two-dimensional mask map corresponding to the first feature map l_2. Since no 0 pixel value is filled into the first mixed feature map M(l_1, l_2), the two mask maps are complementary, which can be expressed as:

m_1 + m_2 = 1

where 1 denotes an all-ones matrix of the same size as the feature maps.

The first mixed feature map M(l_1, l_2) is input into the shared network to obtain a first shared feature map z.
The first shared feature map z is input into the fully-connected layer 1 to obtain the prediction information ŷ_1, and into the fully-connected layer 2 to obtain the prediction information ŷ_2.
And inputting the first shared characteristic diagram z into an MLP network to obtain a first mixed characteristic representation P.
The augmented image 1 is input into the convolution layer 1 to obtain a second feature map l_1′; the augmented image 2 is input into the convolution layer 2 to obtain a second feature map l_2′. Using the same mixing manner as for the first feature maps, the second feature map l_1′ and the second feature map l_2′ are mixed to obtain a second mixed feature map M(l_1′, l_2′).
The second mixed feature map M(l_1′, l_2′) is input into the shared network to obtain a second shared feature map z′.
And inputting the second shared characteristic diagram z 'into the MLP network to obtain a second mixed characteristic representation P'.
Based on the prediction information ŷ_1 and the label information y_1, and on the prediction information ŷ_2 and the label information y_2, a prediction error loss is determined, expressed as:

L_Ens = ω_r1(κ)·ℓ(ŷ_1, y_1) + ω_r2(κ)·ℓ(ŷ_2, y_2)

where ℓ(ŷ_1, y_1) represents the sub-loss determined based on the prediction information ŷ_1 and the label information y_1, ω_r1(κ) is the corresponding sub-loss weight, ℓ(ŷ_2, y_2) represents the sub-loss determined based on the prediction information ŷ_2 and the label information y_2, and ω_r2(κ) is the corresponding sub-loss weight.
In the case where neither the first mixed feature map M(l_1, l_2) nor the second mixed feature map M(l_1′, l_2′) is filled with any 0 pixel value, the two sub-loss weights can be written in terms of a single weighting function ω_r(κ), where r is a hyper-parameter and κ is a proportional value that may be determined based on the area proportion occupied by the feature information of the different input images to be mixed in the first mixed feature map and the second mixed feature map.
A self-supervised learning loss is determined based on the first mixed feature representation P and the second mixed feature representation P′, expressed as:

L_SSL = λ·s(P, P′) + μ·[v(P) + v(P′)] + ν·[c(P) + c(P′)]

where λ, μ and ν are hyper-parameters that control the importance of each term in the self-supervised learning loss, s(P, P′) is an invariance term, c(P) is a first covariance regularization term, c(P′) is a second covariance regularization term, v(P) is a first variance regularization term, and v(P′) is a second variance regularization term.
Based on the prediction error loss and the self-supervised learning loss, the training loss function of the image classification model is determined as:

L = L_Ens + γ·L_SSL

where L_Ens is the prediction error loss, L_SSL is the self-supervised learning loss, and γ represents the adjustment weight of the self-supervised learning loss.
And training the model parameters of the image classification model by adopting an error back propagation algorithm with the aim of reducing the training loss.
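Illustratively, the training loss L = L_Ens + γ·L_SSL may be sketched as follows. The use of cross-entropy for the sub-losses, the explicit forms of the variance and covariance regularization terms, and the hyper-parameter values are assumptions made for illustration where the exact forms are not reproduced above.

```python
import torch
import torch.nn.functional as F

def prediction_error_loss(pred1, pred2, y1, y2, w1: float, w2: float):
    """Weighted sum of the two sub-losses (cross-entropy is an assumed choice)."""
    return w1 * F.cross_entropy(pred1, y1) + w2 * F.cross_entropy(pred2, y2)

def self_supervised_loss(p, p_prime, lam=25.0, mu=25.0, nu=1.0, eps=1e-4):
    """L_SSL = lam*s(P,P') + mu*[v(P)+v(P')] + nu*[c(P)+c(P')] (sketch, batch size > 1 assumed)."""
    def variance(x):
        # variance regularization term: push per-dimension std above a margin of 1
        std = torch.sqrt(x.var(dim=0) + eps)
        return torch.mean(F.relu(1.0 - std))

    def covariance(x):
        # covariance regularization term: penalize off-diagonal covariance entries
        x = x - x.mean(dim=0)
        n, d = x.shape
        cov = (x.T @ x) / (n - 1)
        off_diag = cov - torch.diag(torch.diag(cov))
        return (off_diag ** 2).sum() / d

    s = F.mse_loss(p, p_prime)   # invariance term: mean-squared Euclidean distance
    return lam * s + mu * (variance(p) + variance(p_prime)) \
           + nu * (covariance(p) + covariance(p_prime))

def total_loss(pred1, pred2, y1, y2, p, p_prime, w1, w2, gamma: float = 1.0):
    """L = L_Ens + gamma * L_SSL, matching the training loss above."""
    return prediction_error_loss(pred1, pred2, y1, y2, w1, w2) \
           + gamma * self_supervised_loss(p, p_prime)
```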
2.2 incremental training of image classification models Using the basic training set and the incremental training set
Assume that the incremental training set includes 1 new type sample image x_3, and the label information of the new type sample image is represented as y_3.
Clustering the input images in the basic training set to obtain at least one clustering category; sampling a part of input images from each of at least one cluster category to determine the input images as existing type sample images; and determining label information of a part of the input image as label information of the existing type sample image. Namely, the input image 1 and the corresponding label information thereof, and the input image 2 and the corresponding label information thereof are acquired.
The new type sample image and the existing type sample image are mixed to obtain 3 input images, and the label information of the new type sample image and the label information of the existing type sample image are mixed to obtain the label information of the 3 input images.
And expanding the m prediction dimensions of the full connection layer 1 and the full connection layer 2 of the image classification model into m +1 prediction dimensions, wherein the value 1 is the type number of the new type sample image.
And initializing network parameters of 1 newly added prediction dimensionality in the full connection layer 1 and the full connection layer 2 by adopting the average value of a third shared characteristic graph output by the shared network to the new type sample image.
Then, the image classification model is incrementally trained using the 3 input images and the label information of the 3 input images.
In the incremental training process of the image classification model, the training loss of the image classification model is expressed as:
L = L_Ens

where L_Ens is the prediction error loss determined based on the input images and the label information of the input images.
3. Use of image classification models
Referring to fig. 14, the image classification model further includes a normalization layer 1 connected to an output terminal of the fully-connected layer 1, and a normalization layer 2 connected to an output terminal of the fully-connected layer 2, where the normalization layer 1 and the normalization layer 2 can be normalization functions (Softmax functions).
Acquiring an input image to be processed; 2 identical input images are constructed based on the input image.
Inputting 2 same input images into an image classification model to obtain 2 pieces of prediction information; the prediction information is the probability distribution of 3 prediction classes of the input image.
Superposing the prediction probabilities belonging to the same prediction classification in the probability distribution of the 2 prediction information to obtain 3 superposed prediction probabilities; and determining the prediction classification corresponding to the maximum value of the 3 superposition prediction probabilities as the image classification of the input image.
For example, in the case where the prediction information includes 3 prediction classifications, the probability distribution of the prediction information 1 indicates that the input image is a tiger with a prediction probability of 75%, a dog with 15%, and a person with 10%, and the probability distribution of the prediction information 2 indicates that the input image is a tiger with a prediction probability of 70%, a dog with 20%, and a person with 10%. Since the superimposed prediction probability of identifying the input image as a tiger is 145%, that of a dog is 35%, and that of a person is 20%, the image classification of the input image is determined to be tiger.
For example, fig. 15 is a diagram illustrating a scene of a training method of an image processing model according to an exemplary embodiment, where the scene is an example of an intelligent photo album. The smart album may recognize the type of an input image input to the smart album, may automatically label the type of the input image, or may recommend a type tag to a user.
The intelligent photo album can be regarded as a trained image classification model. For example, the smart album has been used to classify types such as buildings and landscapes, which may be referred to as existing types. When a user newly takes an image, defines new album names such as people and animals in the intelligent album, and manually adds a few input images, the image classification model faces a new incremental learning task, and this incremental learning task only has a small number of samples, namely the manually added images of people and animals. The image classification model therefore needs to be incrementally trained, that is, the image classification model is incrementally trained using the new images of people and animals together with part of the existing images of buildings, landscapes and the like. The trained image classification model can classify new types such as people and animals, and can continue to classify existing types such as buildings and landscapes.
Fig. 16 is a block diagram illustrating a structure of an apparatus for training an image processing model according to an exemplary embodiment of the present application, where the image processing model is an integrated model supporting n inputs and n outputs, where n is an integer greater than 1, and the apparatus includes:
an obtaining module 810, configured to obtain a basic training set, where the basic training set includes n input images and n label information corresponding to the n input images.
An encoding module 820, configured to input the n input images into the image processing model, so as to obtain n prediction information and a first mixed feature representation; inputting the n augmented images into the image processing model to obtain a second mixed feature representation; the n augmented images are images obtained by augmenting the n input images.
A determining module 830, configured to determine a prediction error loss based on the n prediction information and the n label information; and determine a self-supervised learning loss based on the first mixed feature representation and the second mixed feature representation.
A training module 840, configured to train model parameters of the image processing model based on the prediction error loss and the self-supervised learning loss.
In an example of this embodiment, the determining module 830 is further configured to:
determining n sub-losses based on the n prediction information and the n label information; determining the prediction error loss based on a weighted sum of the n sub-losses; wherein an i-th sub-loss of the n sub-losses is determined based on the i-th prediction information and the i-th label information, and i is an integer not greater than n;
calculating the self-supervised learning loss with a goal of improving a similarity between the first mixed feature representation and the second mixed feature representation.
In an example of this embodiment, the determining module 830 is further configured to:
calculating a first variance regularization term corresponding to the first mixed feature representation according to a first dimension feature representation composed of feature values at a predetermined dimension of the first mixed feature representation; calculating a second variance regularization term corresponding to the second mixed feature representation according to a second dimension feature representation composed of feature values at a preset dimension of the second mixed feature representation;
determining a first covariance regularization term corresponding to the first mixed feature representation based on a first sub-mixed feature representation under each feature dimension of the first mixed feature representation and an average value of the first sub-mixed feature representations; determining a second covariance regularization term corresponding to the second mixed feature representation based on a second sub-mixed feature representation under each feature dimension of the second mixed feature representation and an average value of the second sub-mixed feature representations;
performing mean square Euclidean distance calculation according to a first sub-mixture feature representation under each feature dimension of the first mixture feature representation and a second sub-mixture feature representation under the same feature dimension of the second mixture feature representation, and determining an invariance term;
and performing weighted average calculation according to the first variance regularization term, the second variance regularization term, the first covariance regularization term, the second covariance regularization term and the invariance term to obtain the self-supervision learning loss.
In an example of this embodiment, the image processing model includes n feature extraction layers, a shared network layer, and n prediction layers, and the encoding module 820 is further configured to:
inputting the n input images into the n feature extraction layers to obtain n first feature maps; the n input images correspond to the n feature extraction layers one by one;
mixing the n first feature maps to obtain a first mixed feature map;
inputting the first mixed feature map into the shared network layer to obtain a first shared feature map;
and inputting the first shared characteristic diagram into the n prediction layers to obtain n prediction information.
In one example of the present application, the image processing model includes n feature extraction layers, a shared network layer, and a feature mapping layer; the encoding module 820 is further configured to:
inputting the n input images into the n feature extraction layers to obtain n first feature maps; the n input images correspond to the n feature extraction layers one by one;
mixing the n first feature maps to obtain a first mixed feature map;
inputting the first mixed feature map into the shared network layer to obtain a first shared feature map;
and inputting the first shared feature map into the feature mapping layer to obtain the first mixed feature representation.
In an example of the present application, the encoding module 820 is further configured to:
determining a two-dimensional mask image used in a mixing way at the time;
and splicing and mixing different image areas in the n first feature maps through the two-dimensional mask map to obtain the first mixed feature map.
In an example of this embodiment, the image processing model includes n feature extraction layers, a shared network layer, and a feature mapping layer, and the encoding module 820 is further configured to:
inputting the n augmented images into the n feature extraction layers to obtain n second feature maps; the n augmented images correspond to the n feature extraction layers one by one;
mixing the n second feature maps to obtain a second mixed feature map;
inputting the second mixed feature map into the shared network layer to obtain a second shared feature map;
and inputting the second shared feature map into the feature mapping layer to obtain the second mixed feature representation.
In an example of the present application, the encoding module 820 is further configured to:
determining a two-dimensional mask image used in a mixing way at the time;
and splicing and mixing different image areas in the n second feature maps through the two-dimensional mask map to obtain the second mixed feature map.
In an example of the present application, the obtaining module 810 is further configured to:
obtaining an incremental training set, the incremental training set comprising: the method comprises the steps of obtaining a new type sample image and label information of the new type sample image, wherein the new type sample image is a sample image of a newly added image type;
acquiring an existing type sample image and label information of the existing type sample image in the basic training set;
mixing the new type sample image and the existing type sample image to obtain n input images, and mixing the label information of the new type sample image and the label information of the existing type sample image to obtain the label information of the n input images;
the training module 840 is further configured to perform incremental training on the image processing model using the n input images and the label information of the n input images.
In an example of the present application, the obtaining module 810 is further configured to:
clustering the input images in the basic training set to obtain at least one clustering category;
sampling a part of input images from each of the at least one cluster category to determine the part of input images as the existing type sample images; and determining label information of the part of the input image as label information of the existing type sample image.
In an example of the present application, the encoding module 820 is further configured to input the n input images into the image processing model, so as to obtain n prediction information;
the determining module 830, further configured to determine the prediction error loss based on the n prediction information and the n label information;
the training module 840 is further configured to perform incremental training on the model parameters of the image processing model based on the prediction error loss.
In one example of the present application, the image processing model includes a shared network layer and n prediction layers, each of the n prediction layers including m prediction dimensions, the apparatus further includes an initialization module to:
expanding the m prediction dimensions of each prediction layer into m + k prediction dimensions, wherein k is the type number of the new type sample image;
and initializing the network parameters of the newly added k prediction dimensions in each prediction layer by adopting the average value of a third shared characteristic graph output by the shared network layer to the new type sample image.
Fig. 17 is a block diagram illustrating a structure of an apparatus for using an image processing model according to an exemplary embodiment of the present application, where the apparatus includes:
an obtaining module 910, configured to obtain an input image to be processed.
A prediction module 920, configured to input n identical input images constructed based on the input image into the image processing model, so as to obtain n pieces of prediction information.
A determining module 930 configured to determine a prediction result of the input image based on the n prediction information.
In an example of the present application, the image processing model is an image classification model having m prediction classifications, and the determining module 930 is further configured to:
superposing the prediction probabilities belonging to the same prediction classification in the probability distribution of the n pieces of prediction information to obtain m superposed prediction probabilities;
and determining the prediction classification corresponding to the maximum value of the m overlapped prediction probabilities as the image classification of the input image.
It should be noted that, when the apparatus provided in the foregoing embodiments implements its functions, only the division of the above functional modules is illustrated as an example; in practical applications, the above functions may be allocated to different functional modules according to actual needs, that is, the internal structure of the device is divided into different functional modules to complete all or part of the functions described above.
With regard to the apparatus in the above-described embodiment, the specific manner in which the respective modules perform operations has been described in detail in the embodiment related to the method; the technical effects achieved by the operations performed by the respective modules are the same as those in the embodiments related to the method, and will not be described in detail here.
An embodiment of the present application further provides a computer device, where the computer device includes: a processor and a memory, the memory having stored therein a computer program; the processor is configured to execute the computer program in the memory to implement the method and/or the using method of the image processing model provided by the above method embodiments.
Optionally, the computer device is a server. Illustratively, fig. 18 is a block diagram of a server according to an exemplary embodiment of the present application.
In general, the server 1000 includes: a processor 1001 and a memory 1002.
Processor 1001 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so forth. The processor 1001 may be implemented in at least one hardware form of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The processor 1001 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 1001 may be integrated with a Graphics Processing Unit (GPU) which is responsible for rendering and drawing the content required to be displayed on the display screen. In some embodiments, processor 1001 may also include an Artificial Intelligence (AI) processor for processing computing operations related to machine learning.
Memory 1002 may include one or more computer-readable storage media, which may be non-transitory. The memory 1002 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 1002 is used to store at least one instruction for execution by processor 1001 to implement a method and/or method of use of an image processing model provided by method embodiments herein.
In some embodiments, the server 1000 may further include: an input interface 1003 and an output interface 1004. The processor 1001, the memory 1002, the input interface 1003 and the output interface 1004 may be connected by a bus or a signal line. Each peripheral device may be connected to the input interface 1003 and the output interface 1004 through a bus, a signal line, or a circuit board. The Input interface 1003 and the Output interface 1004 may be used to connect at least one peripheral device related to Input/Output (I/O) to the processor 1001 and the memory 1002. In some embodiments, the processor 1001, the memory 1002, and the input interface 1003 and the output interface 1004 are integrated on the same chip or circuit board; in some other embodiments, the processor 1001, the memory 1002, and any one or both of the input interface 1003 and the output interface 1004 may be implemented on a single chip or circuit board, which is not limited in this embodiment.
Those skilled in the art will appreciate that the above-described illustrated architecture is not meant to be limiting with respect to server 1000, and that server 1000 may include more or fewer components than those illustrated, or may combine certain components, or may employ a different arrangement of components.
In an exemplary embodiment, a chip is also provided, which comprises programmable logic circuits and/or program instructions for implementing the method for training and/or the method for using the image processing model according to the above aspects when the chip is run on a computer device.
In an exemplary embodiment, a computer program product is also provided that includes computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor reads the computer instructions from the computer readable storage medium and executes the computer instructions to implement the training method and/or the using method of the image processing model provided by the above method embodiments.
In an exemplary embodiment, a computer-readable storage medium is further provided, in which a computer program is stored, and the computer program is loaded and executed by a processor to implement the training method and/or the using method of the image processing model provided by the above method embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in the embodiments of the present application may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over a computer-readable medium as one or more instructions or code. Computer-readable media include both computer storage media and communication media, the latter including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer.
The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (20)

1. A method for training an image processing model, wherein the image processing model is an integrated model supporting n inputs and n outputs, where n is an integer greater than 1, the method comprising:
acquiring a basic training set, wherein the basic training set comprises n input images and n label information corresponding to the n input images;
inputting the n input images into the image processing model to obtain n prediction information and a first mixed feature representation; inputting the n augmented images into the image processing model to obtain a second mixed feature representation; the n augmented images are images obtained by augmenting the n input images;
determining a prediction error loss based on the n prediction information and the n label information; and determining a self-supervised learning loss based on the first mixed feature representation and the second mixed feature representation;
training model parameters of the image processing model based on the prediction error loss and the self-supervised learning loss.
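The following Python sketch (PyTorch-style, purely illustrative) shows one way the training step recited in claim 1 could be organised; the names model, augment, pred_loss_fn, ssl_loss_fn and the loss weight alpha are assumptions, not terms from this application. The model is assumed to return the n pieces of prediction information together with a mixed feature representation when given a list of n images.

    def train_step(model, optimizer, images, labels, augment, pred_loss_fn, ssl_loss_fn, alpha=1.0):
        # images: list of n input image tensors; labels: the n corresponding label tensors
        preds, first_mixed = model(images)                 # n pieces of prediction information + first mixed feature representation
        aug_images = [augment(x) for x in images]          # n augmented images obtained from the n input images
        _, second_mixed = model(aug_images)                # second mixed feature representation
        pred_loss = sum(pred_loss_fn(p, y) for p, y in zip(preds, labels)) / len(preds)  # prediction error loss
        ssl = ssl_loss_fn(first_mixed, second_mixed)       # self-supervised learning loss
        total = pred_loss + alpha * ssl                    # combine the two losses before back-propagation
        optimizer.zero_grad()
        total.backward()
        optimizer.step()
        return total.item()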
2. The method of claim 1, wherein determining a prediction error loss based on the n prediction information and the n label information comprises:
determining n sub-losses based on the n prediction information and the n label information; determining the prediction error loss based on a weighted sum of the n sub-losses; an ith sub-loss of the n sub-losses is determined based on the ith prediction information and the ith label information, i being an integer no greater than n;
the determining a self-supervised learning loss based on the first mixed feature representation and the second mixed feature representation comprises:
calculating the self-supervised learning loss with a goal of improving the similarity between the first mixed feature representation and the second mixed feature representation.
3. The method of claim 2, wherein the calculating the self-supervised learning loss with a goal of improving the similarity between the first mixed feature representation and the second mixed feature representation comprises:
calculating a first variance regularization term corresponding to the first mixed feature representation according to a first dimension feature representation composed of feature values at a preset dimension of the first mixed feature representation; calculating a second variance regularization term corresponding to the second mixed feature representation according to a second dimension feature representation composed of feature values at a preset dimension of the second mixed feature representation;
determining a first covariance regularization term corresponding to the first mixed feature representation based on a first sub-mixed feature representation under each feature dimension of the first mixed feature representation and an average value of the first sub-mixed feature representations; determining a second covariance regularization term corresponding to the second mixed feature representation based on a second sub-mixed feature representation under each feature dimension of the second mixed feature representation and an average value of the second sub-mixed feature representations;
performing mean squared Euclidean distance calculation according to a first sub-mixed feature representation under each feature dimension of the first mixed feature representation and a second sub-mixed feature representation under the same feature dimension of the second mixed feature representation, and determining an invariance term;
and performing weighted average calculation according to the first variance regularization term, the second variance regularization term, the first covariance regularization term, the second covariance regularization term and the invariance term to obtain the self-supervised learning loss.
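The variance, covariance and invariance terms of claims 2 and 3 are structurally similar to a VICReg-style self-supervised objective. The sketch below is one plausible realisation under that reading, not the patented formula; the weights w_inv, w_var, w_cov and the epsilon are placeholder values.

    import torch
    import torch.nn.functional as F

    def self_supervised_loss(z1, z2, w_inv=25.0, w_var=25.0, w_cov=1.0, eps=1e-4):
        # z1, z2: (batch, dim) first / second mixed feature representations
        inv_term = F.mse_loss(z1, z2)                        # invariance term: mean squared Euclidean distance

        std1 = torch.sqrt(z1.var(dim=0) + eps)               # spread of feature values at each preset dimension
        std2 = torch.sqrt(z2.var(dim=0) + eps)
        var1 = torch.relu(1.0 - std1).mean()                 # first variance regularization term
        var2 = torch.relu(1.0 - std2).mean()                 # second variance regularization term

        n, d = z1.shape
        z1c = z1 - z1.mean(dim=0)                            # centre each feature dimension around its average value
        z2c = z2 - z2.mean(dim=0)
        cov1 = (z1c.T @ z1c) / (n - 1)
        cov2 = (z2c.T @ z2c) / (n - 1)
        cov_term1 = (cov1 - torch.diag(torch.diagonal(cov1))).pow(2).sum() / d   # first covariance regularization term
        cov_term2 = (cov2 - torch.diag(torch.diagonal(cov2))).pow(2).sum() / d   # second covariance regularization term

        # weighted combination of the five terms
        return w_inv * inv_term + w_var * (var1 + var2) + w_cov * (cov_term1 + cov_term2)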
4. The method according to any one of claims 1 to 3, wherein the image processing model comprises n feature extraction layers, a shared network layer and n prediction layers;
the inputting the n input images into the image processing model to obtain n prediction information includes:
inputting the n input images into the n feature extraction layers to obtain n first feature maps; the n input images correspond to the n feature extraction layers one by one;
mixing the n first feature maps to obtain a first mixed feature map;
inputting the first mixed feature map into the shared network layer to obtain a first shared feature map;
and inputting the first shared characteristic diagram into the n prediction layers to obtain n prediction information.
5. The method according to any one of claims 1 to 3, wherein the image processing model comprises n feature extraction layers, a shared network layer and a feature mapping layer;
the inputting the n input images into the image processing model to obtain the first mixed feature representation includes:
inputting the n input images into the n feature extraction layers to obtain n first feature maps; the n input images correspond to the n feature extraction layers one by one;
mixing the n first feature maps to obtain a first mixed feature map;
inputting the first mixed feature map into the shared network layer to obtain a first shared feature map;
and inputting the first shared feature map into the feature mapping layer to obtain the first mixed feature representation.
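A hypothetical skeleton of the model structure of claims 4 and 5 (all constructor arguments are assumptions): n feature extraction layers feed a mask-based mixing step, the mixed feature map passes through the shared network layer, and the shared feature map is routed both to the n prediction layers and to the feature mapping layer.

    import torch.nn as nn

    class MixedEnsembleModel(nn.Module):
        def __init__(self, n, make_extractor, shared_layer, make_head, projector, mix_fn):
            super().__init__()
            self.extractors = nn.ModuleList(make_extractor() for _ in range(n))  # n feature extraction layers
            self.shared = shared_layer                                           # shared network layer
            self.heads = nn.ModuleList(make_head() for _ in range(n))            # n prediction layers
            self.projector = projector                                           # feature mapping layer
            self.mix_fn = mix_fn                                                 # mask-based feature mixing (see the sketch after claim 6)

        def forward(self, images, masks):
            feats = [ext(x) for ext, x in zip(self.extractors, images)]          # n first feature maps, one per input image
            mixed = self.mix_fn(feats, masks)                                    # first mixed feature map
            shared = self.shared(mixed)                                          # first shared feature map
            preds = [head(shared) for head in self.heads]                        # n pieces of prediction information
            rep = self.projector(shared.flatten(start_dim=1))                    # first mixed feature representation
            return preds, rep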
6. The method according to claim 4 or 5, wherein the mixing the n first feature maps to obtain a first mixed feature map comprises:
determining a two-dimensional mask map used in the current mixing;
and splicing and mixing different image areas in the n first feature maps through the two-dimensional mask map to obtain the first mixed feature map.
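The splicing and mixing of claim 6 can be pictured as below, under the assumption that the two-dimensional mask assigns every spatial position to exactly one of the n feature maps; the vertical-stripe mask helper is purely illustrative.

    import torch

    def mix_with_mask(feature_maps, masks):
        # feature_maps: list of n tensors, each (batch, channels, H, W)
        # masks: tensor (n, 1, H, W) with 0/1 entries that sum to 1 at every pixel
        mixed = torch.zeros_like(feature_maps[0])
        for fmap, mask in zip(feature_maps, masks):
            mixed = mixed + fmap * mask           # splice in the image area selected by this mask
        return mixed

    def make_stripe_masks(n, h, w):
        # Hypothetical mask: split the spatial grid into n vertical stripes, one stripe per input.
        masks = torch.zeros(n, 1, h, w)
        step = max(w // n, 1)
        for i in range(n):
            end = w if i == n - 1 else (i + 1) * step
            masks[i, :, :, i * step:end] = 1.0
        return masks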
7. The method according to any one of claims 1 to 3, wherein the image processing model comprises n feature extraction layers, a shared network layer and a feature mapping layer;
inputting the n augmented images into the image processing model to obtain a second mixed feature representation, including:
inputting the n augmented images into the n feature extraction layers to obtain n second feature maps; the n augmented images correspond to the n feature extraction layers one by one;
mixing the n second feature maps to obtain a second mixed feature map;
inputting the second mixed feature map into the shared network layer to obtain a second shared feature map;
and inputting the second shared feature map into the feature mapping layer to obtain the second mixed feature representation.
8. The method according to claim 7, wherein the mixing the n second feature maps to obtain a second mixed feature map comprises:
determining a two-dimensional mask map used in the current mixing;
and splicing and mixing different image areas in the n second feature maps through the two-dimensional mask map to obtain the second mixed feature map.
9. The method of any of claims 1 to 3, further comprising:
carrying out the same geometric transformation on the n input images to obtain n augmented images;
wherein the geometric transformation comprises at least one of image rotation and image flipping.
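One possible realisation of claim 9, assuming the inputs are (batch, C, H, W) tensors: draw a single random rotation/flip and apply it identically to all n input images.

    import torch

    def augment_same_geometry(images):
        # Apply the SAME geometric transformation to every one of the n input images.
        k = int(torch.randint(0, 4, (1,)))            # number of 90-degree rotations
        flip = bool(torch.randint(0, 2, (1,)))        # whether to flip horizontally
        augmented = []
        for x in images:                              # x: (batch, C, H, W)
            y = torch.rot90(x, k, dims=(-2, -1))      # image rotation
            if flip:
                y = torch.flip(y, dims=(-1,))         # image flipping
            augmented.append(y)
        return augmented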
10. The method of any of claims 1 to 3, further comprising:
obtaining an incremental training set, the incremental training set comprising: a new type sample image and label information of the new type sample image, wherein the new type sample image is a sample image of a newly added image type;
acquiring an existing type sample image and label information of the existing type sample image in the basic training set;
mixing the new type sample image and the existing type sample image to obtain n input images, and mixing the label information of the new type sample image and the label information of the existing type sample image to obtain the label information of the n input images;
incrementally training the image processing model using the n input images and label information for the n input images.
11. The method of claim 10, wherein obtaining existing type sample images and label information of the existing type sample images in the base training set comprises:
clustering the input images in the basic training set to obtain at least one clustering category;
sampling a part of the input images from each of the at least one clustering category, and determining the sampled input images as the existing type sample images; and determining label information of the sampled input images as the label information of the existing type sample images.
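In the spirit of claim 11, existing-type exemplars can be chosen by clustering followed by per-cluster sampling. The sketch below uses k-means on pre-computed feature vectors; the feature source, cluster count and per-cluster sample size are assumptions rather than details from this application.

    import numpy as np
    from sklearn.cluster import KMeans

    def sample_existing_exemplars(features, images, labels, n_clusters=10, per_cluster=5, seed=0):
        # features: (num_samples, dim) array used only to cluster the existing-type images
        kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(features)
        rng = np.random.default_rng(seed)
        picked = []
        for c in range(n_clusters):
            members = np.flatnonzero(kmeans.labels_ == c)                    # indices in this clustering category
            take = min(per_cluster, len(members))
            picked.extend(rng.choice(members, size=take, replace=False))     # sample a part of the input images
        exemplar_images = [images[i] for i in picked]
        exemplar_labels = [labels[i] for i in picked]                        # their label information
        return exemplar_images, exemplar_labels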
12. The method of claim 10, wherein the incrementally training the image processing model using the n input images and label information for the n input images comprises:
inputting the n input images into the image processing model to obtain n prediction information;
determining the prediction error loss based on the n prediction information and the n label information;
and performing incremental training on model parameters of the image processing model based on the prediction error loss.
13. The method of claim 10, wherein the image processing model comprises a shared network layer and n prediction layers, each of the n prediction layers comprising m prediction dimensions, the method further comprising:
expanding the m prediction dimensions of each prediction layer into m + k prediction dimensions, wherein k is the type number of the new type sample image;
and initializing network parameters of the newly added k prediction dimensions in each prediction layer by using the average value of a third shared feature map output by the shared network layer for the new type sample image.
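A hedged sketch of the dimension expansion in claim 13, assuming each prediction layer is a single linear layer: the original m rows are copied, and the k new rows are initialised from the average shared-layer output for the new-type sample images (one common mean is used here for all k new dimensions; a per-class mean would be an equally plausible reading of the claim).

    import torch
    import torch.nn as nn

    def expand_prediction_layer(head, k, new_type_shared_features):
        # head: nn.Linear with m output dimensions; new_type_shared_features: (num_new_samples, in_features)
        m, in_features = head.weight.shape
        new_head = nn.Linear(in_features, m + k, bias=head.bias is not None)
        with torch.no_grad():
            new_head.weight[:m] = head.weight                                 # keep the original m prediction dimensions
            mean_feature = new_type_shared_features.mean(dim=0)               # average of the shared feature map outputs
            new_head.weight[m:] = mean_feature.unsqueeze(0).expand(k, -1)     # initialise the newly added k dimensions
            if head.bias is not None:
                new_head.bias[:m] = head.bias
                new_head.bias[m:] = 0.0
        return new_head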
14. A method of using an image processing model, wherein the image processing model is trained by the method of any one of claims 1 to 13, the method comprising:
acquiring an input image to be processed;
inputting n identical input images constructed based on the input image into the image processing model to obtain n pieces of prediction information;
determining a prediction result of the input image based on the n prediction information.
15. The method of claim 14, wherein the image processing model is an image classification model having m prediction classifications, and wherein determining the prediction result for the input image based on the n prediction information comprises:
superposing the prediction probabilities belonging to the same prediction classification in the probability distribution of the n pieces of prediction information to obtain m superposed prediction probabilities;
and determining the prediction classification corresponding to the maximum value of the m superposed prediction probabilities as the image classification of the input image.
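An illustrative inference routine for claims 14 and 15; the model signature follows the earlier hypothetical skeleton, not necessarily the patented one. The single input is duplicated n times, the n probability distributions are superposed per classification, and the arg-max classification is returned.

    import torch

    @torch.no_grad()
    def classify(model, image, n, masks):
        # image: (batch, C, H, W); the n copies are identical, as in claim 14
        preds, _ = model([image] * n, masks)                        # n pieces of prediction information (logits)
        probs = torch.stack([p.softmax(dim=-1) for p in preds])     # n probability distributions
        superposed = probs.sum(dim=0)                               # m superposed prediction probabilities
        return superposed.argmax(dim=-1)                            # prediction classification with the maximum value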
16. An apparatus for training an image processing model, wherein the image processing model is an integrated model supporting n inputs and n outputs, where n is an integer greater than 1, the apparatus comprising:
the acquisition module is used for acquiring a basic training set, wherein the basic training set comprises n input images and n label information corresponding to the n input images;
the coding module is used for inputting the n input images into the image processing model to obtain n pieces of prediction information and a first mixed feature representation; inputting the n augmented images into the image processing model to obtain a second mixed feature representation; the n augmented images are images obtained by augmenting the n input images;
a determination module, used for determining a prediction error loss based on the n prediction information and the n label information; and determining a self-supervised learning loss based on the first mixed feature representation and the second mixed feature representation;
and the training module is used for training the model parameters of the image processing model based on the prediction error loss and the self-supervised learning loss.
17. An apparatus for using an image processing model, the apparatus comprising:
the acquisition module is used for acquiring an input image to be processed;
the prediction module is used for inputting n identical input images constructed based on the input image into the image processing model to obtain n pieces of prediction information;
a determining module for determining a prediction result of the input image based on the n prediction information.
18. A computer device, characterized in that the computer device comprises: a processor and a memory, the memory storing a computer program that is loaded and executed by the processor to implement a method of training an image processing model according to any one of claims 1 to 13 or a method of using an image processing model according to claim 14 or 15.
19. A computer-readable storage medium, characterized in that it stores a computer program which is loaded and executed by a processor to implement the method of training an image processing model according to any one of claims 1 to 13 or the method of using an image processing model according to claim 14 or 15.
20. A computer program product, characterized in that it comprises computer instructions stored in a computer-readable storage medium, wherein a processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them to implement the method of training an image processing model according to any one of claims 1 to 13, or the method of using an image processing model according to claim 14 or 15.
CN202210826932.6A 2022-07-13 2022-07-13 Training method, using method, device, equipment and medium of image processing model Pending CN115115910A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210826932.6A CN115115910A (en) 2022-07-13 2022-07-13 Training method, using method, device, equipment and medium of image processing model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210826932.6A CN115115910A (en) 2022-07-13 2022-07-13 Training method, using method, device, equipment and medium of image processing model

Publications (1)

Publication Number Publication Date
CN115115910A true CN115115910A (en) 2022-09-27

Family

ID=83332255

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210826932.6A Pending CN115115910A (en) 2022-07-13 2022-07-13 Training method, using method, device, equipment and medium of image processing model

Country Status (1)

Country Link
CN (1) CN115115910A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115879004A (en) * 2022-12-21 2023-03-31 北京百度网讯科技有限公司 Target model training method, apparatus, electronic device, medium, and program product


Similar Documents

Publication Publication Date Title
CN114418030B (en) Image classification method, training method and device for image classification model
CN114298122B (en) Data classification method, apparatus, device, storage medium and computer program product
CN114332578A (en) Image anomaly detection model training method, image anomaly detection method and device
CN111553267A (en) Image processing method, image processing model training method and device
CN115565238B (en) Face-changing model training method, face-changing model training device, face-changing model training apparatus, storage medium, and program product
CN116580257A (en) Feature fusion model training and sample retrieval method and device and computer equipment
CN115050064A (en) Face living body detection method, device, equipment and medium
CN114419351A (en) Image-text pre-training model training method and device and image-text prediction model training method and device
CN114612902A (en) Image semantic segmentation method, device, equipment, storage medium and program product
CN114611672A (en) Model training method, face recognition method and device
CN115577768A (en) Semi-supervised model training method and device
CN114972016A (en) Image processing method, image processing apparatus, computer device, storage medium, and program product
CN116980541B (en) Video editing method, device, electronic equipment and storage medium
CN115115910A (en) Training method, using method, device, equipment and medium of image processing model
CN113011320A (en) Video processing method and device, electronic equipment and storage medium
CN113569081A (en) Image recognition method, device, equipment and storage medium
CN116975347A (en) Image generation model training method and related device
CN117011416A (en) Image processing method, device, equipment, medium and program product
CN117036658A (en) Image processing method and related equipment
CN116958615A (en) Picture identification method, device, equipment and medium
CN114298961A (en) Image processing method, device, equipment and storage medium
CN114692715A (en) Sample labeling method and device
Ilo Weather Image Generation using a Generative Adversarial Network
CN117216534A (en) Model training method, device, equipment, storage medium and product
CN113392865A (en) Picture processing method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination