CN116258939A - Model training method and device for image processing model


Info

Publication number
CN116258939A
Authority
CN
China
Prior art keywords
image
self
attention
image processing
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310200257.0A
Other languages
Chinese (zh)
Inventor
刘政岐
罗浩
桂杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Damo Institute Hangzhou Technology Co Ltd
Original Assignee
Alibaba Damo Institute Hangzhou Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Damo Institute Hangzhou Technology Co Ltd filed Critical Alibaba Damo Institute Hangzhou Technology Co Ltd
Priority to CN202310200257.0A priority Critical patent/CN116258939A/en
Publication of CN116258939A publication Critical patent/CN116258939A/en
Pending legal-status Critical Current


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G06N 3/088 - Non-supervised learning, e.g. competitive learning
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 - Road transport of goods or passengers
    • Y02T 10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the specification provides a model training method and device for an image processing model, wherein the model training method for the image processing model comprises the following steps: acquiring an initial sample image and preset training times; training an image processing model based on the initial sample image and the preset training times to obtain a self-attention value; generating a self-attention map from the self-attention values and determining a target masking image in the initial sample image from the self-attention map; and continuing training the image processing model according to the target masking image and the initial sample image.

Description

Model training method and device for image processing model
Technical Field
The embodiment of the specification relates to the technical field of computers, in particular to a model training method of an image processing model.
Background
With the continuous development of computer technology, image processing technology is also continuously perfected.
In order to improve image recognition efficiency, a model can be trained with labeled images so as to obtain a model capable of recognizing images; however, in practical applications, labeled images are difficult and costly to obtain, so unlabeled images are currently used as sample data for self-supervised training of the model, thereby achieving the effect of model training.
However, self-supervised training is often performed by randomly masking part of the content in a sample image and training the model based on the masked image and the original image, which results in low image recognition accuracy of the trained model.
Disclosure of Invention
In view of this, the present embodiment provides a model training method of an image processing model. One or more embodiments of the present specification relate to a model training apparatus for an image processing model, a computing device, a computer-readable storage medium, and a computer program to solve the technical drawbacks of the related art.
According to a first aspect of embodiments of the present specification, there is provided a model training method of an image processing model, including:
acquiring an initial sample image and preset training times;
training an image processing model based on the initial sample image and the preset training times to obtain a self-attention value;
generating a self-attention map from the self-attention values and determining a target masking image in the initial sample image from the self-attention map;
and continuing training the image processing model according to the target masking image and the initial sample image.
According to a second aspect of embodiments of the present specification, there is provided a model training apparatus of an image processing model, comprising:
the acquisition module is configured to acquire an initial sample image and preset training times;
the training module is configured to train the image processing model based on the initial sample image and the preset training times to obtain a self-attention value;
a generation module configured to generate a self-attention map from the self-attention value and to determine a target mask image in the initial sample image from the self-attention map;
a continuation training module configured to continue training the image processing model based on the target masking image and the initial sample image.
According to a third aspect of embodiments of the present specification, there is provided a model training system of an image processing model, the model training system including an end-side device and a cloud-side device, wherein:
the end-side device is used for determining an initial sample image according to a model training request and sending the initial sample image to the cloud-side device;
the cloud-side device is used for acquiring the initial sample image and preset training times; training an image processing model based on the initial sample image and the preset training times to obtain a self-attention value; generating a self-attention map from the self-attention values and determining a target masking image in the initial sample image from the self-attention map; and continuing training the image processing model according to the target masking image and the initial sample image.
According to a fourth aspect of embodiments of the present specification, there is provided a computing device comprising:
a memory and a processor;
the memory is configured to store computer-executable instructions that, when executed by the processor, perform the steps of the model training method for an image processing model described above.
According to a fifth aspect of embodiments of the present specification, there is provided a computer readable storage medium storing computer executable instructions which, when executed by a processor, implement the steps of the model training method of the image processing model described above.
According to a sixth aspect of embodiments of the present specification, there is provided a computer program, wherein the computer program, when executed in a computer, causes the computer to perform the steps of the model training method of the image processing model described above.
One embodiment of the present disclosure achieves obtaining an initial sample image and a preset number of training times; training an image processing model based on the initial sample image and the preset training times to obtain a self-attention value; generating a self-attention map from the self-attention values and determining a target masking image in the initial sample image from the self-attention map; and continuing training the image processing model according to the target masking image and the initial sample image.
According to the model training method of the image processing model, the image processing model is trained with an initial sample image for a preset number of training times, so that a self-attention value is obtained; a self-attention map is generated based on the self-attention value, masking of the initial sample image is guided based on the self-attention map to obtain a target masking image, and the image processing model is trained based on the target masking image, so that the accuracy of the image processing model in image processing is improved, and the accuracy of subsequent image processing results obtained based on the image processing model is improved.
Drawings
FIG. 1 is a process diagram of a model training method for an image processing model according to one embodiment of the present disclosure;
FIG. 2 is a flow chart of a model training method for an image processing model according to one embodiment of the present disclosure;
FIG. 3 is a flow chart of an image processing method provided in one embodiment of the present disclosure;
FIG. 4 is a process flow diagram of a model training method for an image processing model according to one embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a model training system for an image processing model according to one embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a model training apparatus for an image processing model according to an embodiment of the present disclosure;
FIG. 7 is a block diagram of a computing device provided in one embodiment of the present description.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present description. This description may, however, be embodied in many forms other than those described herein, and those skilled in the art can make similar generalizations without departing from the spirit of this description; therefore, this description is not limited by the specific implementations disclosed below.
The terminology used in the one or more embodiments of the specification is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the specification. As used in this specification, one or more embodiments and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present specification refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that, although the terms first, second, etc. may be used in one or more embodiments of this specification to describe various information, these information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first may also be referred to as a second, and similarly, a second may also be referred to as a first, without departing from the scope of one or more embodiments of the present description. The word "if" as used herein may be interpreted as "when" or "upon" or "in response to determining", depending on the context.
First, terms related to one or more embodiments of the present specification will be explained.
MIM: masked Image Modeling, image mask modeling.
ViT: vision Transformer combines CV and NLP domain knowledge, blocks the original picture, flattens it into a sequence, inputs it into the Encoder section of the original transform model, and finally connects to a full connection layer to classify the picture.
In the present specification, a model training method of an image processing model is provided, and the present specification relates to a model training apparatus of an image processing model, a computing device, and a computer-readable storage medium, which are described in detail one by one in the following embodiments.
Referring to fig. 1, fig. 1 shows a process schematic diagram of a model training method of an image processing model according to an embodiment of the present disclosure, which specifically includes the following steps.
An initial sample image is acquired and cut into four image blocks; the image blocks corresponding to the initial sample image are masked according to a preset masking proportion of 25%, so as to obtain a random masking image; the random masking image and the sample image are input into an image processing model, and the image processing model performs self-supervised training based on the random masking image and the sample image until the preset number of training times, 40, is reached; the self-attention values output by the decoder of the image processing model after the preset number of training times are acquired; by upsampling the self-attention values, a self-attention map consistent with the sample image size is obtained.
Further, the image processing model is subsequently trained once every 10 iterations on the basis of the self-attention map, that is, training based on the self-attention map is carried out at the 11th, 21st and subsequent such iterations; specifically, the self-attention map and the sample image are cut in the same way, so as to obtain a sample sub-image set and a self-attention sub-image set; the sum of pixel points over each sample sub-image and its corresponding self-attention sub-image is calculated to obtain the sampling weight corresponding to each sample sub-image; an initial masking image is generated based on the sampling weights, and image blocks in the initial masking image are masked or discarded based on the corresponding self-attention values to obtain an image masked under the guidance of the self-attention map, namely the target masking image; the target masking image is input to the image processing model, so that the image processing model performs self-supervised training based on the target masking image.
After the 11th training, the 12th to 20th training iterations are still based on the random masking image corresponding to the initial sample image; at the 21st training, the image processing model is trained based on the self-attention map output by the image processing model after the 20th training; this cycle is repeated until the training stop condition is reached, yielding the trained image processing model. A minimal sketch of this alternating schedule is given below.
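The following is a minimal, self-contained sketch of the alternating schedule described in this walkthrough; the function name, parameter names and default values are illustrative assumptions rather than code from the patent.

```python
# A minimal sketch of the alternating training schedule described above.
# `warmup` is the preset number of random-masking iterations and `interval`
# is how often a self-attention-map-guided iteration occurs afterwards;
# both names and default values are illustrative assumptions.
def masking_mode(step: int, warmup: int = 10, interval: int = 10) -> str:
    """Return which masking strategy to use at the 1-indexed training step."""
    if step <= warmup:
        return "random"
    # the first attention-guided step is warmup + 1, then one every `interval`
    return "attention" if (step - warmup - 1) % interval == 0 else "random"

# With warmup=10 and interval=10, steps 11, 21, 31, ... use the
# self-attention-map-guided mask, matching the walkthrough above.
assert masking_mode(11) == "attention"
assert masking_mode(15) == "random"
assert masking_mode(21) == "attention"
```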
One embodiment of the present disclosure achieves obtaining an initial sample image and a preset number of training times; training an image processing model based on the initial sample image and the preset training times to obtain a self-attention value; generating a self-attention map from the self-attention values and determining a target masking image in the initial sample image from the self-attention map; and continuing training the image processing model according to the target masking image and the initial sample image.
According to the model training method of the image processing model, the image processing model is processed through an initial sample image and preset training times, so that a self-attention value is obtained; generating a self-attention force diagram based on the self-attention value, guiding masking of an initial sample image based on the self-attention force diagram to obtain a target masking image, and training an image processing model based on the target masking image, so that the accuracy of the image processing model on image processing is improved, and the accuracy of a subsequent image processing result obtained based on the image processing model is improved.
Referring to fig. 2, fig. 2 shows a flowchart of a model training method of an image processing model according to an embodiment of the present disclosure, which specifically includes the following steps.
Step 202: and acquiring an initial sample image and preset training times.
In practical applications, in order to implement self-supervision training on an image processing model, it is generally necessary to determine a sample image for training and the number of times that training is required; the sample image may be a pre-selected image for training the image processing model; or an image for which there is a need for image processing.
The initial sample image is an image used for self-supervised training of the image processing model; the initial sample image may be selected by a user, set in advance for the image processing model, or the like, and the source of the initial sample image is not specifically limited in this specification. The preset training times refer to the number of times the image processing model is self-supervised-trained based on the initial sample image, for example, 100 times. It should be noted that if the preset training times were equal to the total number of training times for the image processing model, the training would be consistent with the current method of training the image processing model based on random masking alone; in the scheme of this specification, the preset training times are smaller than the total number of training times corresponding to the image processing model. In practical applications, the image processing model needs to be trained with a sample image set; for ease of subsequent description, a sample image arbitrarily selected from the sample image set is taken as the initial sample image in this specification, and the following description uses this initial sample image as an example.
Specifically, a model training request for the image processing model is received by a terminal, wherein the terminal may be a computer, a client, a server, a virtual server, or the like; the sample image set corresponding to the image processing model and the preset training times are acquired based on the model training request; and a sample image is selected from the sample image set as the initial sample image.
In one embodiment of the present disclosure, a model training request for an image processing model is received; a sample image set D and preset training times of 100 are determined based on the model training request, wherein the sample image set D comprises 100 sample images; sample image 1 in the sample image set D is selected as the initial sample image.
The initial sample image and the preset training times are acquired so that the image processing model can subsequently be trained based on the initial sample image and the preset training times.
Step 204: and training an image processing model based on the initial sample image and the preset training times to obtain a self-attention value.
In practical application, after determining the sample image and the training times, the sample image can be randomly masked in each round of training to obtain a masking image corresponding to each round; and in each round of training, calculating a loss value based on the masking image and the sample image of the round, and adjusting parameters of the image processing model based on the loss value obtained in each round, thereby realizing self-supervision training of the image processing model.
Specifically, the image processing model may perform at least one of image processing tasks such as an image segmentation task, an image classification task, and an image object detection task. Because image processing tasks such as image classification and image object detection place higher accuracy requirements on image feature extraction, the image processing model can be self-supervised-trained based on the model training method of this specification in order to further improve the accuracy of image feature extraction; the model training method of this specification can complete self-supervised training without depending on the architecture of the image processing model; self-supervised learning under the Vision Transformer and MIM paradigms can be achieved using the model training method of this specification.
After determining the image processing model to be trained, training the image processing model based on the initial sample image corresponding to the image processing model and the preset training times, and obtaining the self-attention value may include:
inputting the initial sample image into an image processing model to obtain a predicted image output by the image processing model;
calculating a loss value based on the initial sample image and the predicted image;
Adjusting model parameters of the image processing model according to the loss value;
continuing to input the initial sample image into the image processing model until the preset training times are reached;
and acquiring, after the image processing model has been trained for the preset training times, the self-attention value output by the decoder of the image processing model.
Wherein, the predicted image refers to an image output by the image processing model based on the initial sample image; the loss value refers to a value calculated based on the predicted image, the initial sample image, and the loss function; specifically, the loss value is determined by calculating the similarity between the predicted image and the initial sample image, that is, determining the gap between the predicted image and the initial sample image, so that the image processing model is continuously adjusted, and the image processing model can output a processing result which is more similar to the real result.
Specifically, the initial sample image is converted into a one-dimensional vector, and the one-dimensional vector is cut according to a preset cutting rule to obtain an image block set.
For example, the training samples are denoted x_i ∈ R^(H×W×C), i = 1, …, n, wherein H and W represent the width and height of the sample respectively, C represents the number of channels of the sample, R represents the set of real numbers, x_i represents the i-th sample, and n represents the total number of samples. The training sample is reshaped into x_p ∈ R^(N×(P²·C)) so that the two-dimensional image becomes a one-dimensional sequence, wherein (P, P) is the size of each image block and N = HW/P² is the number of image blocks. Each image block is flattened and linearly transformed by a projection E ∈ R^((P²·C)×D) so as to be mapped to dimension D, and a [CLS] mark x_cls is appended. A position code E_pos ∈ R^((N+1)×D) is added to construct the model input vector z = [x_cls; x_p¹E; …; x_p^N E] + E_pos, wherein x_p^m represents the m-th row of x_p and z ∈ R^((N+1)×D). The model used here is ViT.
Image blocks are selected from the image block set for masking according to a preset proportion, for example, 75% of the image blocks are randomly selected for masking, and the input vector fed to the image processing model is generated based on the remaining image blocks in the image block set; a predicted image is obtained by the image processing model based on the input vector.
For example, the remaining image block vectors are encoded by an encoder, the encoded vectors are then restored by a decoder, and the distance between the decoded masked image blocks and the original image blocks is calculated using the mean square error as the loss function; this is iterated continuously while updating the model parameters of the image processing model.
After training the image processing model a preset number of training times based on the initial sample image, a self-attention value output by a decoder of the image processing model after the last training is obtained.
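The random-masking step described above (mask a proportion of the blocks, encode the visible blocks, decode, and compute a mean-square-error loss against the original blocks) could be sketched as follows. This is a simplified, MAE-style reading of the description; the single-layer encoder/decoder stand-ins, the learnable mask token, and all names are illustrative assumptions rather than the patent's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def masked_training_step(x_p, encoder, decoder, mask_token, mask_ratio=0.75):
    """One self-supervised step: randomly mask blocks, reconstruct them, MSE loss."""
    B, N, E = x_p.shape
    n_keep = int(N * (1 - mask_ratio))
    shuffle = torch.rand(B, N).argsort(dim=1)            # random order per sample
    keep_idx, mask_idx = shuffle[:, :n_keep], shuffle[:, n_keep:]
    visible = torch.gather(x_p, 1, keep_idx.unsqueeze(-1).expand(-1, -1, E))
    latent = encoder(visible)                            # encode the visible blocks
    # rebuild the full sequence: encoded visible blocks + learnable mask tokens
    full = torch.scatter(mask_token.expand(B, N, E), 1,
                         keep_idx.unsqueeze(-1).expand(-1, -1, E), latent)
    pred = decoder(full)                                 # decoder restores every block
    target = torch.gather(x_p, 1, mask_idx.unsqueeze(-1).expand(-1, -1, E))
    recon = torch.gather(pred, 1, mask_idx.unsqueeze(-1).expand(-1, -1, E))
    return F.mse_loss(recon, target)                     # distance on masked blocks

# usage with single-layer stand-ins for the ViT encoder/decoder (illustrative)
dim = 768
enc, dec = nn.Linear(dim, dim), nn.Linear(dim, dim)
mask_tok = nn.Parameter(torch.zeros(1, 1, dim))
loss = masked_training_step(torch.randn(2, 196, dim), enc, dec, mask_tok)
loss.backward()
```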
For example, the multi-head self-attention for each sample is calculated by the multi-head self-attention mechanism in ViT as A_i = softmax(q_i·k_i^T / √D_h), wherein D_h = D/N_h, and q_i and k_i are obtained by splitting the vector z by columns into N_h parts (the two splits being identical in value); the resulting attention matrix A_i is an (N+1)×(N+1) matrix. The attention of the N_h attention heads in the last layer of the encoder is averaged as a_w = (1/N_h)·Σ_{i=1}^{N_h} A_i^(1), wherein A_i^(1) represents the first row of A_i, a_w ∈ R^N, and each element corresponds to the attention value of one image block. Each attention value is normalized and output as a_ws^(i) = a_w^(i) / a_wmax, wherein a_wmax is the maximum value of a_w and a_w^(i) represents the i-th element of a_w.
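As an illustration of the attention-weight computation just described (averaging the [CLS] row of the last-layer heads and normalizing by the maximum), the following sketch assumes the last-layer attention matrices are available as a single tensor; the tensor layout is an assumption for illustration, not an API from the patent.

```python
import torch

def block_attention_weights(attn_last_layer: torch.Tensor) -> torch.Tensor:
    """attn_last_layer: (N_h, N+1, N+1) attention matrices A_i of the last encoder layer."""
    cls_row = attn_last_layer[:, 0, 1:]   # first row: attention of [CLS] to the N blocks
    a_w = cls_row.mean(dim=0)             # average over the N_h heads -> a_w in R^N
    return a_w / a_w.max()                # normalize by the maximum value -> a_ws

# toy usage: 12 heads, N = 196 image blocks
a_ws = block_attention_weights(torch.softmax(torch.randn(12, 197, 197), dim=-1))
print(a_ws.shape)                          # torch.Size([196])
```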
In one embodiment of the present disclosure, the total number of training times of the image processing model is determined to be 200, the preset training times are 100, and the sample image is A; an input vector a is generated based on the sample image A and input to the image processing model; the predicted image output by the image processing model is obtained; a loss value is calculated based on the predicted image and the sample image A, and the model parameters are adjusted based on the loss value; the steps of generating an input vector a1 based on the sample image A and calculating a loss value based on the predicted image obtained from the input vector a1 and the sample image A continue to be executed until the preset 100 training times are reached; after the 100th training of the image processing model, the self-attention value calculated by the decoder is obtained.
Self-attention values are obtained by self-supervised training of the image processing model for the preset training times, for subsequent generation of a self-attention map based on the self-attention values.
Step 206: a self-attention map is generated from the self-attention values and a target masking image is determined in the initial sample image from the self-attention map.
Wherein the self-attention map refers to an image generated based on the self-attention values and having a size consistent with the initial sample image; the target masking image refers to an image obtained by masking the initial sample image based on the self-attention map.
Specifically, the self-attention values can represent the association relationships between image contents, so using the self-attention values to generate a self-attention map and masking the initial sample image based on the self-attention map can improve the masking accuracy for the initial sample image and further improve the processing accuracy of the image processing model.
In practical applications, the method for generating a self-attention map according to the self-attention value may include:
determining image size information corresponding to the initial sample image;
and upsampling the self-attention value based on the image size information to obtain a self-attention map.
The image size information refers to size information of an initial sample image, for example, the size information of the sample image a is 144px long, 40px wide, and the like; the self-attention value is up-sampled, for example, by bilinear interpolation to supplement the missing pixels, resulting in a self-attention map.
In one embodiment of the present disclosure, the N_h self-attention heads in the last layer of the encoder of the image processing model yield the attention weight a_w ∈ R^N, wherein each element corresponds to the attention value of one image block, and the sampling weight a_ws is then obtained. The a_ws obtained in step S4 is upsampled and reshaped to the picture size H×W to obtain an attention map T, wherein each pixel represents the attention value of the corresponding pixel in the picture.
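A sketch of the upsampling just described: the per-block sampling weights are laid out on the block grid and bilinearly interpolated (as mentioned above) to the image size H×W, giving one attention value per pixel. The grid and image sizes below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def attention_map(a_ws: torch.Tensor, grid: int, H: int, W: int) -> torch.Tensor:
    """Upsample per-block weights a_ws (grid*grid values) to an (H, W) attention map T."""
    t = a_ws.reshape(1, 1, grid, grid)                            # block grid
    t = F.interpolate(t, size=(H, W), mode="bilinear", align_corners=False)
    return t[0, 0]                                                # (H, W) map T

T = attention_map(torch.rand(196), grid=14, H=224, W=224)
print(T.shape)                                                    # torch.Size([224, 224])
```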
After obtaining the self-attention map, a method of determining a target masking image in the initial sample image from the self-attention map may comprise:
cutting the initial sample image according to a preset cutting mode to obtain an initial sample sub-image set;
cutting the self-attention map according to a preset cutting mode to obtain a self-attention sub-image set;
a target masking image is determined based on the initial sample sub-image set and the self-attention sub-image set.
The preset cutting mode refers to a mode of dividing an image into image blocks; the initial sample sub-image set contains the sample sub-images obtained by cropping the initial sample image; the self-attention sub-image set contains the self-attention sub-images obtained by cropping the self-attention map.
Specifically, when the initial sample image is cut in each training iteration, the self-attention map is cut correspondingly, ensuring a one-to-one correspondence between the pixels of the cut initial sample image and the pixels of the self-attention map.
In a practical application, the method of determining the target masking image based on the initial sample sub-image set and the self-attention sub-image set may comprise:
generating an initial masking image from the initial sample sub-image set and the self-attention sub-image set;
a target mask image is determined in the initial mask image.
Specifically, a target sample sub-image is determined in the initial sample sub-image set and a target self-attention sub-image is determined in the self-attention sub-image set, wherein the target sample sub-image corresponds to the target self-attention sub-image; the pixel points of the target sample sub-image and the target self-attention sub-image are summed to obtain the target sampling weight corresponding to the target sample sub-image, wherein the target sampling weight comprises the self-attention values corresponding to the pixel points in the target sample sub-image; the sampling weight corresponding to each sample sub-image in the initial sample sub-image set is calculated based on the above steps, and the sampling weight sequence corresponding to the initial sample sub-image set, namely the initial masking image, is obtained.
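One reading of the sampling-weight computation above is that each sample sub-image's weight is the sum of the self-attention values over the pixels of its corresponding self-attention sub-image; the following sketch implements that reading for square blocks of side P, which is an assumption for illustration.

```python
import torch

def sampling_weights(attn_map: torch.Tensor, P: int) -> torch.Tensor:
    """attn_map: (H, W) self-attention map; P: side length of each image block."""
    H, W = attn_map.shape
    blocks = attn_map.reshape(H // P, P, W // P, P)   # split into (H/P)*(W/P) blocks
    return blocks.sum(dim=(1, 3)).flatten()           # one summed weight per block

w = sampling_weights(torch.rand(224, 224), P=16)      # sampling-weight sequence
print(w.shape)                                        # torch.Size([196])
```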
Further, the method of determining a target mask image in the initial mask image may include:
acquiring a masking image block and discarding the image block in the initial masking image;
masking the masking image block, deleting the discarded image block, and obtaining a target masking image.
The masking image block refers to an image block, among the image blocks corresponding to the initial masking image, on which a masking operation needs to be performed; the discarded image block refers to an image block, among the image blocks corresponding to the initial masking image, that needs to be deleted; the initial masking image also comprises display image blocks, and after the masking image blocks and the discarded image blocks are processed, the remaining image blocks are the display image blocks; the target masking image is composed of the masking image blocks and the display image blocks.
In practical application, the method for acquiring the masking image block and discarding the image block in the initial masking image may include:
determining a self-attention value corresponding to each image block in the initial masking image;
selecting an image block with the self-attention value larger than a first threshold value as a masking image block;
image blocks with self-attention values greater than the second threshold and less than the first threshold are selected as discarded image blocks.
Wherein the first threshold is a threshold used to determine masking image blocks; for example, if the first threshold is 30 and the self-attention value of image block a is 40, image block a is classified as a masking image block. The second threshold is a threshold used to determine discarded image blocks; for example, if the first threshold is 30, the second threshold is 10, and the self-attention value of image block b is 27, image block b may be classified as a discarded image block.
It should be noted that the above manner of determining the masking image blocks and the discarded image blocks is one implementable manner; alternatively, image blocks with a self-attention value greater than the first threshold may be selected as discarded image blocks, and image blocks with a self-attention value greater than the second threshold and less than the first threshold may be selected as masking image blocks. A sketch of the threshold-based selection is given below.
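The following sketch illustrates the threshold-based selection; the boundary handling and the threshold values (taken from the example above) are illustrative assumptions.

```python
import torch

def split_blocks(attn_values: torch.Tensor, first: float = 30.0, second: float = 10.0):
    """Split blocks into masking / discarded / display sets by self-attention value."""
    mask_idx = (attn_values > first).nonzero(as_tuple=True)[0]                 # > first
    drop_idx = ((attn_values > second) & (attn_values <= first)).nonzero(as_tuple=True)[0]
    show_idx = (attn_values <= second).nonzero(as_tuple=True)[0]               # the rest
    return mask_idx, drop_idx, show_idx

m, d, s = split_blocks(torch.tensor([40.0, 27.0, 5.0, 33.0]))
print(m.tolist(), d.tolist(), s.tolist())   # [0, 3] [1] [2]
```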
By generating the target masking image, part of the image blocks are discarded, so that the amount of data to be processed is reduced and subsequent processing efficiency is improved; this further improves the efficiency of training the image processing model based on the target masking image and the initial sample image.
Step 208: and continuing training the image processing model according to the target masking image and the initial sample image.
Specifically, inputting the target masking image into the image processing model to obtain a predicted image output by the image processing model; calculating a loss value based on the initial sample image and the predicted image; and adjusting the parameters of the image processing model according to the loss value, wherein the image processing model here is the model that has already been trained for the preset training times.
Further, after continuing to train the image processing model according to the target masking image and the initial sample image, the method further includes:
and continuing to execute the steps of acquiring the initial sample image and the preset training times until the model training stopping condition is reached.
Specifically, after the image processing model has been trained for the preset training times based on the initial sample image, the image processing model may be trained once in the self-attention-map-guided manner every preset interval of training times until the model training stop condition is reached, for example, the image processing model has been trained for the preset total number of training times, an instruction to pause training is received, and the like.
In one embodiment of the present disclosure, the image processing model is trained 100 times based on the sample image d to obtain a self-attention map; the total number of training times of the image processing model is 1000. After the image processing model has been trained 100 times, it is trained once every 100 iterations based on the self-attention map, that is, the 101st training is based on the self-attention map while the 102nd to 200th trainings are still based on the initial sample image; further, the 201st training is again based on the self-attention map; this continues until the number of training times reaches 1000 and training ends, so that the trained image processing model is obtained.
One embodiment of the present disclosure achieves obtaining an initial sample image and a preset number of training times; training an image processing model based on the initial sample image and the preset training times to obtain a self-attention value; generating a self-attention map from the self-attention values and determining a target masking image in the initial sample image from the self-attention map; and continuing training the image processing model according to the target masking image and the initial sample image.
According to the model training method of the image processing model, the image processing model is trained with an initial sample image for a preset number of training times, so that a self-attention value is obtained; a self-attention map is generated based on the self-attention value, masking of the initial sample image is guided based on the self-attention map to obtain a target masking image, and the image processing model is trained based on the target masking image, so that the accuracy of the image processing model in image processing is improved, and the accuracy of subsequent image processing results obtained based on the image processing model is improved.
Referring to fig. 3, fig. 3 shows an image processing method according to an embodiment of the present disclosure, which specifically includes the following steps:
Step 302: determining an image to be processed, and inputting the image to be processed into an image processing model, wherein the image processing model is obtained by training based on the model training method of the image processing model.
The image to be processed refers to an image for which an image processing result needs to be obtained based on the image processing model; the image processing model is a model trained based on the model training method of the image processing model described above.
In one embodiment of the present disclosure, the image processing model may perform an image classification task; an image classification request is received, and the image G to be classified is determined based on the image classification request; the image G to be classified is input into the image processing model.
Step 304: and obtaining an image processing result output by the image processing model.
The image processing result refers to an image processing result corresponding to the image to be processed, which is output by the image processing model.
In a specific embodiment of the present disclosure, following the above example, the image type of the image G to be classified, as output by the image processing model, is acquired as K.
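A minimal sketch of the inference flow described in this embodiment: the image to be processed is fed to the trained image processing model and the classification result is read out. The model and preprocessing here are illustrative placeholders, not APIs from the patent.

```python
import torch

def classify(trained_model: torch.nn.Module, image: torch.Tensor) -> int:
    """Return the predicted image category for a single (C, H, W) image tensor."""
    trained_model.eval()
    with torch.no_grad():
        logits = trained_model(image.unsqueeze(0))     # add a batch dimension
    return int(logits.argmax(dim=-1).item())

# toy usage with a stand-in classifier (illustrative)
toy = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 224 * 224, 10))
print(classify(toy, torch.randn(3, 224, 224)))
```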
According to the image processing method, the image to be processed is input into the image processing model obtained based on the model training method for processing, so that the accuracy of the image processing result corresponding to the image to be processed is improved.
The model training method of the image processing model will be further described below with reference to fig. 4, taking the application of the method provided in the present specification to the sample image L as an example. Fig. 4 is a flowchart of a process of a model training method of an image processing model according to an embodiment of the present disclosure, which specifically includes the following steps.
Step 402: based on the image processing model training request, a sample image L and a preset training frequency are determined.
Specifically, determining the image processing model corresponding to the request, wherein the image processing model is used for identifying objects in an image; acquiring the sample image set corresponding to the image processing model, and determining the sample image L in the sample image set; and acquiring the total number of training times corresponding to the image processing model, namely 100, and determining the preset training times, namely 40, based on the total number of training times.
Step 404: and cutting the sample image L based on a preset cutting mode to obtain an image block set.
Specifically, the sample image L is divided into 9 rectangles of equal size, each rectangle is an image block, and an image block set is generated from the 9 rectangles.
Step 406: and carrying out random masking on the image blocks in the image block set to obtain a masking image.
Specifically, the image block is randomly masked according to a preset masking proportion.
Step 408: the sample image L and the mask image are input to the image processing model, and a predicted image output by the image processing model is obtained.
Step 410: the image processing model is referred to based on the predicted image and the sample image L.
Step 412: repeating the steps until the training times reach the preset training times, and obtaining the self-attention value output by the image processing model.
Step 414: a self-attention map is generated based on the self-attention values, and the self-attention map is clipped corresponding to the sample image L.
Step 416: and calculating the sum of pixel points for the self-attention sub-image corresponding to the self-attention force diagram and the sample sub-image corresponding to the sample image L to obtain an initial masking image.
Step 418: masking the masking image block in the initial masking image, deleting the discarded image block in the initial masking image, and obtaining the target masking image.
Step 420: and inputting the target masking image into the image processing model to obtain a predicted image output by the image processing model.
Step 422: model parameters of the image processing model are continuously adjusted based on the predicted image and the sample image L.
Specifically, the image processing model is trained based on the self-attention map once every preset number of training times, and training ends when the number of training times reaches the total of 100.
One embodiment of the present disclosure achieves obtaining an initial sample image and a preset number of training times; training an image processing model based on the initial sample image and the preset training times to obtain a self-attention value; generating a self-attention map from the self-attention values and determining a target masking image in the initial sample image from the self-attention map; and continuing training the image processing model according to the target masking image and the initial sample image.
According to the model training method of the image processing model, the image processing model is processed through an initial sample image and preset training times, so that a self-attention value is obtained; generating a self-attention force diagram based on the self-attention value, guiding masking of an initial sample image based on the self-attention force diagram to obtain a target masking image, and training an image processing model based on the target masking image, so that the accuracy of the image processing model on image processing is improved, and the accuracy of a subsequent image processing result obtained based on the image processing model is improved.
Corresponding to the above method embodiments, the present disclosure further provides a model training system embodiment of the image processing model, and fig. 5 shows a schematic structural diagram of a model training system of an image processing model according to one embodiment of the present disclosure. As shown in fig. 5, the model training system includes an end-side device and a cloud-side device, wherein:
the end-side device 502 is configured to determine an initial sample image according to a model training request, and send the initial sample image to the cloud-side device;
the cloud-side device 504 is configured to acquire the initial sample image and preset training times; train an image processing model based on the initial sample image and the preset training times to obtain a self-attention value; generate a self-attention map from the self-attention values and determine a target masking image in the initial sample image from the self-attention map; and continue training the image processing model according to the target masking image and the initial sample image.
Optionally, the cloud-side device 504 is configured to input the initial sample image into the image processing model to obtain a predicted image output by the image processing model;
calculating a loss value based on the initial sample image and the predicted image;
adjusting model parameters of the image processing model according to the loss value;
continuing to input the initial sample image into the image processing model until the preset training times are reached;
and acquiring, after the image processing model has been trained for the preset training times, the self-attention value output by the decoder of the image processing model.
Optionally, the cloud-side device 504 is configured to determine image size information corresponding to the initial sample image;
and upsampling the self-attention value based on the image size information to obtain a self-attention map.
Optionally, the cloud-side device 504 is configured to cut the initial sample image according to a preset cutting mode to obtain an initial sample sub-image set;
cutting the self-attention map according to a preset cutting mode to obtain a self-attention sub-image set;
a target masking image is determined based on the initial sample sub-image set and the self-attention sub-image set.
Optionally, the cloud-side device 504 is configured to generate an initial masking image from the initial sample sub-image set and the self-attention sub-image set;
a target mask image is determined in the initial mask image.
Optionally, the cloud-side device 504 is configured to acquire masking image blocks and discarded image blocks in the initial masking image;
masking the masking image block, deleting the discarded image block, and obtaining a target masking image.
Optionally, the cloud-side device 504 is configured to determine the self-attention value corresponding to each image block in the initial masking image;
Selecting an image block with the self-attention value larger than a first threshold value as a masking image block;
image blocks with self-attention values greater than the second threshold and less than the first threshold are selected as discarded image blocks.
According to the model training system of the image processing model, the image processing model is trained with an initial sample image for a preset number of training times, so that a self-attention value is obtained; a self-attention map is generated based on the self-attention value, masking of the initial sample image is guided based on the self-attention map to obtain a target masking image, and the image processing model is trained based on the target masking image, so that the accuracy of the image processing model in image processing is improved, and the accuracy of subsequent image processing results obtained based on the image processing model is improved.
Corresponding to the above method embodiment, the present disclosure further provides an embodiment of a model training device for an image processing model, and fig. 6 shows a schematic structural diagram of a model training device for an image processing model according to one embodiment of the present disclosure. As shown in fig. 6, the apparatus includes:
an acquisition module 602 configured to acquire an initial sample image and a preset number of training times;
A training module 604 configured to train an image processing model based on the initial sample image and the preset training times to obtain a self-attention value;
a generation module 606 configured to generate a self-attention map from the self-attention values and to determine a target mask image in the initial sample image from the self-attention map;
a continuation training module 608 is configured to continue training the image processing model based on the target mask image and the initial sample image.
Optionally, the apparatus further comprises:
and continuing to execute the steps of acquiring the initial sample image and the preset training times until the model training stopping condition is reached.
Optionally, the training module 604 is further configured to:
inputting the initial sample image into an image processing model to obtain a predicted image output by the image processing model;
calculating a loss value based on the initial sample image and the predicted image;
adjusting model parameters of the image processing model according to the loss value;
continuing to input the initial sample image into the image processing model until the preset training times are reached;
and acquiring, after the image processing model has been trained for the preset training times, the self-attention value output by the decoder of the image processing model.
Optionally, the generating module 606 is further configured to:
determining image size information corresponding to the initial sample image;
and upsampling the self-attention value based on the image size information to obtain a self-attention map.
Optionally, the generating module 606 is further configured to:
cutting the initial sample image according to a preset cutting mode to obtain an initial sample sub-image set;
cutting the self-attention map according to a preset cutting mode to obtain a self-attention sub-image set;
a target masking image is determined based on the initial sample sub-image set and the self-attention sub-image set.
Optionally, the generating module 606 is further configured to:
generating an initial masking image from the initial sample sub-image set and the self-attention sub-image set;
a target mask image is determined in the initial mask image.
Optionally, the generating module 606 is further configured to:
acquiring a masking image block and discarding the image block in the initial masking image;
Masking the masking image block, deleting the discarded image block, and obtaining a target masking image.
Optionally, the generating module 606 is further configured to:
determining a self-attention value corresponding to each image block in the initial masking image;
selecting an image block with the self-attention value larger than a first threshold value as a masking image block;
image blocks with self-attention values greater than the second threshold and less than the first threshold are selected as discarded image blocks.
Optionally, the image processing model may perform at least one of an image segmentation task, an image classification task, and an image object detection task.
The model training device of the image processing model of the specification acquires an initial sample image and preset training times; training an image processing model based on the initial sample image and the preset training times to obtain a self-attention value; generating a self-attention map from the self-attention values and determining a target masking image in the initial sample image from the self-attention map; and continuing training the image processing model according to the target masking image and the initial sample image.
The model training device of the image processing model in this specification trains the image processing model with an initial sample image for a preset number of training times, so as to obtain a self-attention value; a self-attention map is generated based on the self-attention value, masking of the initial sample image is guided based on the self-attention map to obtain a target masking image, and the image processing model is trained based on the target masking image, so that the accuracy of the image processing model in image processing is improved, and the accuracy of subsequent image processing results obtained based on the image processing model is improved.
The above is a schematic scheme of a model training apparatus of an image processing model of the present embodiment. It should be noted that, the technical solution of the model training device of the image processing model and the technical solution of the model training method of the image processing model belong to the same concept, and details of the technical solution of the model training device of the image processing model, which are not described in detail, can be referred to the description of the technical solution of the model training method of the image processing model.
Fig. 7 illustrates a block diagram of a computing device 700 provided in accordance with one embodiment of the present description. The components of computing device 700 include, but are not limited to, memory 710 and processor 720. Processor 720 is coupled to memory 710 via bus 730, and database 750 is used to store data.
Computing device 700 also includes access device 740, access device 740 enabling computing device 700 to communicate via one or more networks 760. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the internet. The access device 740 may include one or more of any type of network interface, wired or wireless (e.g., a Network Interface Card (NIC)), such as an IEEE802.11 Wireless Local Area Network (WLAN) wireless interface, a worldwide interoperability for microwave access (Wi-MAX) interface, an ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a bluetooth interface, a Near Field Communication (NFC) interface, and so forth.
In one embodiment of the present description, the above-described components of computing device 700, as well as other components not shown in FIG. 7, may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device illustrated in FIG. 7 is for exemplary purposes only and is not intended to limit the scope of the present description. Those skilled in the art may add or replace other components as desired.
Computing device 700 may be any type of stationary or mobile computing device including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), mobile phone (e.g., smart phone), wearable computing device (e.g., smart watch, smart glasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 700 may also be a mobile or stationary server.
Wherein the processor 720 is configured to execute computer-executable instructions that, when executed by the processor, perform the steps of the model training method for an image processing model described above. The foregoing is a schematic illustration of a computing device of this embodiment. It should be noted that, the technical solution of the computing device and the technical solution of the model training method of the image processing model belong to the same concept, and details of the technical solution of the computing device, which are not described in detail, can be referred to the description of the technical solution of the model training method of the image processing model.
An embodiment of the present disclosure also provides a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of the model training method of the image processing model described above.
The above is an exemplary version of a computer-readable storage medium of the present embodiment. It should be noted that, the technical solution of the storage medium and the technical solution of the model training method of the image processing model belong to the same concept, and details of the technical solution of the storage medium which are not described in detail can be referred to the description of the technical solution of the model training method of the image processing model.
An embodiment of the present disclosure further provides a computer program, where the computer program, when executed in a computer, causes the computer to perform the steps of the model training method of the image processing model described above.
The above is an exemplary version of a computer program of the present embodiment. It should be noted that, the technical solution of the computer program and the technical solution of the model training method of the image processing model belong to the same concept, and details of the technical solution of the computer program, which are not described in detail, can be referred to the description of the technical solution of the model training method of the image processing model.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The computer instructions include computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content contained in the computer-readable medium may be added to or removed as appropriate according to the requirements of legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunications signals.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of combinations of actions, but those skilled in the art should understand that the embodiments are not limited by the order of actions described, since some steps may be performed in another order or simultaneously according to the embodiments of the present disclosure. Further, those skilled in the art will appreciate that the embodiments described in the specification are all preferred embodiments, and that the actions and modules involved are not necessarily required by every embodiment.
In the foregoing embodiments, each embodiment is described with its own emphasis; for parts of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.
The preferred embodiments of the present specification disclosed above are intended only to help explain the present specification. They are not intended to be exhaustive or to limit the invention to the precise forms disclosed; obviously, many modifications and variations are possible in light of the teaching of the embodiments. The embodiments were chosen and described in order to best explain the principles of the embodiments and their practical application, thereby enabling others skilled in the art to understand and use the invention. This specification is to be limited only by the claims and their full scope and equivalents.

Claims (14)

1. A model training method of an image processing model, comprising:
acquiring an initial sample image and a preset number of training times;
training an image processing model based on the initial sample image and the preset number of training times to obtain a self-attention value;
generating a self-attention map from the self-attention value, and determining a target mask image in the initial sample image from the self-attention map;
and continuing to train the image processing model according to the target mask image and the initial sample image.
2. The method of claim 1, further comprising, after continuing to train the image processing model based on the target mask image and the initial sample image:
returning to the step of acquiring an initial sample image and a preset number of training times, until a model training stop condition is reached.
3. The method of claim 1, wherein training an image processing model based on the initial sample image and the preset number of training times to obtain a self-attention value comprises:
inputting the initial sample image into the image processing model to obtain a predicted image output by the image processing model;
calculating a loss value based on the initial sample image and the predicted image;
adjusting model parameters of the image processing model according to the loss value;
continuing to input the initial sample image into the image processing model until the preset number of training times is reached;
and, after the image processing model has been trained the preset number of times, obtaining the self-attention value output by a decoder of the image processing model.
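For orientation only, a minimal sketch of the warm-up procedure recited in claim 3 follows. It assumes a toy reconstruction model whose attention layer exposes per-patch weights, a mean-squared-error loss between the predicted and the initial sample image, and an AdamW optimizer; none of these choices, nor the names ToyMaskedAutoencoder and warmup_train, is prescribed by this specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for the image processing model (hypothetical; not the model of this
# specification). Its attention layer exposes per-patch weights so that one
# "self-attention value" per image block can be read out after warm-up training.
class ToyMaskedAutoencoder(nn.Module):
    def __init__(self, patch=16, dim=64):
        super().__init__()
        self.encode = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.decode = nn.ConvTranspose2d(dim, 3, kernel_size=patch, stride=patch)

    def forward(self, x):
        tokens = self.encode(x)                        # (B, dim, H/p, W/p)
        b, c, h, w = tokens.shape
        seq = tokens.flatten(2).transpose(1, 2)        # (B, N, dim), N = h*w
        out, attn_weights = self.attn(seq, seq, seq)   # attn_weights: (B, N, N)
        predicted = self.decode(out.transpose(1, 2).reshape(b, c, h, w))
        # One self-attention value per patch: mean attention the patch receives.
        self_attention_value = attn_weights.mean(dim=1).reshape(b, h, w)
        return predicted, self_attention_value

def warmup_train(model, initial_sample_image, preset_training_times, lr=1e-4):
    """Input the sample image, compute a loss against the predicted image, adjust the
    model parameters, and repeat for the preset number of times; then return the
    self-attention value produced in the final pass."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    self_attention_value = None
    for _ in range(preset_training_times):
        predicted_image, self_attention_value = model(initial_sample_image)
        loss = F.mse_loss(predicted_image, initial_sample_image)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return self_attention_value.detach()
```

Under these assumptions, warmup_train(ToyMaskedAutoencoder(), image, preset_training_times=100) yields one attention score per 16x16 block of a 3-channel image whose height and width are multiples of 16.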
4. The method of claim 1, wherein generating a self-attention map from the self-attention value comprises:
determining image size information corresponding to the initial sample image;
and upsampling the self-attention value based on the image size information to obtain a self-attention map.
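Claim 4 amounts to resizing the per-patch attention scores to the resolution of the initial sample image. A minimal sketch follows, assuming the self-attention value is a (B, h, w) tensor of per-patch scores and the image size information is an (H, W) tuple; bilinear interpolation is one possible choice, not a requirement of the claim, and build_self_attention_map is a hypothetical name.

```python
import torch.nn.functional as F

def build_self_attention_map(self_attention_value, image_size):
    """Upsample per-patch self-attention values to the spatial size of the initial
    sample image, yielding a pixel-level self-attention map of shape (B, H, W)."""
    attn = self_attention_value.unsqueeze(1).float()   # (B, 1, h, w)
    attn_map = F.interpolate(attn, size=image_size, mode="bilinear", align_corners=False)
    return attn_map.squeeze(1)                         # (B, H, W)
```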
5. The method of claim 1, wherein determining a target mask image in the initial sample image from the self-attention map comprises:
cropping the initial sample image according to a preset cropping manner to obtain an initial sample sub-image set;
cropping the self-attention map according to the preset cropping manner to obtain a self-attention sub-image set;
and determining a target mask image based on the initial sample sub-image set and the self-attention sub-image set.
6. The method of claim 5, wherein determining a target mask image based on the initial sample sub-image set and the self-attention sub-image set comprises:
generating an initial mask image from the initial sample sub-image set and the self-attention sub-image set;
and determining a target mask image in the initial mask image.
7. The method of claim 6, wherein determining a target mask image in the initial mask image comprises:
acquiring masking image blocks and discarded image blocks in the initial mask image;
and masking the masking image blocks and deleting the discarded image blocks to obtain a target mask image.
8. The method of claim 7, wherein acquiring masking image blocks and discarded image blocks in the initial mask image comprises:
determining a self-attention value corresponding to each image block in the initial mask image;
selecting image blocks whose self-attention value is greater than a first threshold as masking image blocks;
and selecting image blocks whose self-attention value is greater than a second threshold and less than the first threshold as discarded image blocks.
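The block selection of claims 5 to 8 can be sketched as follows under assumed conventions: the image and the self-attention map are cut into non-overlapping patch-sized blocks, blocks whose mean attention exceeds a first threshold become masking image blocks (zeroed here), and blocks between the two thresholds become discarded image blocks, which this sketch only flags because how they are removed depends on the model's input format. The patch size, threshold values, and the function name are illustrative only.

```python
import torch

def select_mask_and_discard_blocks(initial_sample_image, self_attention_map,
                                   patch=16, first_threshold=0.6, second_threshold=0.2):
    """Cut the image and its self-attention map into blocks, then pick masking image
    blocks (attention > first threshold) and discarded image blocks (attention between
    the two thresholds), returning a target mask image with the masked blocks zeroed."""
    B, C, H, W = initial_sample_image.shape
    h, w = H // patch, W // patch

    # Mean self-attention value per block, shape (B, h, w).
    block_attn = self_attention_map.reshape(B, h, patch, w, patch).mean(dim=(2, 4))

    mask_blocks = block_attn > first_threshold
    discard_blocks = (block_attn > second_threshold) & (block_attn < first_threshold)

    # Zero out the masked blocks to obtain the target mask image; discarded blocks
    # are returned as a flag map rather than physically removed in this sketch.
    keep = (~mask_blocks).float()                                    # 1 = keep, 0 = mask
    pixel_keep = keep.repeat_interleave(patch, dim=1).repeat_interleave(patch, dim=2)
    target_mask_image = initial_sample_image * pixel_keep.unsqueeze(1)
    return target_mask_image, mask_blocks, discard_blocks
```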
9. The method of claim 1, wherein the image processing model is operable to perform at least one of an image segmentation task, an image classification task, and an image object detection task.
10. An image processing method, comprising:
determining an image to be processed, and inputting the image to be processed into an image processing model, wherein the image processing model is trained based on the model training method according to any one of claims 1 to 9;
and obtaining an image processing result output by the image processing model.
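As a sketch of the inference step in claim 10, the trained model is assumed here to map the image to be processed directly to its processing result (a segmentation, classification, or detection output, per claim 9); the function name is hypothetical.

```python
import torch

@torch.no_grad()
def process_image(trained_model, image_to_process):
    """Feed the image to be processed into the trained image processing model and
    return the image processing result it outputs."""
    trained_model.eval()
    return trained_model(image_to_process)
```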
11. A model training system of an image processing model, comprising an end-side device and a cloud-side device, wherein:
the end-side device is configured to determine an initial sample image according to a model training request and send the initial sample image to the cloud-side device;
and the cloud-side device is configured to acquire the initial sample image and a preset number of training times; train an image processing model based on the initial sample image and the preset number of training times to obtain a self-attention value; generate a self-attention map from the self-attention value and determine a target mask image in the initial sample image from the self-attention map; and continue to train the image processing model according to the target mask image and the initial sample image.
12. A model training apparatus of an image processing model, comprising:
an acquisition module configured to acquire an initial sample image and a preset number of training times;
a training module configured to train an image processing model based on the initial sample image and the preset number of training times to obtain a self-attention value;
a generation module configured to generate a self-attention map from the self-attention value and to determine a target mask image in the initial sample image from the self-attention map;
and a continued-training module configured to continue to train the image processing model based on the target mask image and the initial sample image.
13. A computing device, comprising:
a memory and a processor;
the memory is configured to store computer-executable instructions, and the processor is configured to execute the computer-executable instructions which, when executed by the processor, implement the steps of the method of any one of claims 1 to 10.
14. A computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the steps of the method of any one of claims 1 to 10.
CN202310200257.0A 2023-02-27 2023-02-27 Model training method and device for image processing model Pending CN116258939A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310200257.0A CN116258939A (en) 2023-02-27 2023-02-27 Model training method and device for image processing model

Publications (1)

Publication Number Publication Date
CN116258939A true CN116258939A (en) 2023-06-13

Family

ID=86680670

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310200257.0A Pending CN116258939A (en) 2023-02-27 2023-02-27 Model training method and device for image processing model

Country Status (1)

Country Link
CN (1) CN116258939A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114926338A (en) * 2022-05-25 2022-08-19 上海商汤智能科技有限公司 Model training method and device, electronic equipment and storage medium
CN115496919A (en) * 2022-10-24 2022-12-20 西安交通大学 Hybrid convolution-transformer framework based on window mask strategy and self-supervision method
CN115512005A (en) * 2022-08-22 2022-12-23 华为技术有限公司 Data processing method and device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
YOUCANS_: "图像的掩模加法" ("Image mask addition"), 《HTTPS://BLOG.CSDN.NET/YOUCANS/ARTICLE/DETAILS/121244290》, pages 1-5 *
ZHENGQI LIU et al.: "Good helper is around you: Attention-driven Masked Image Modeling", 《ARXIV.ORG》, page 1 *
ZHENGQI LIU et al.: "Good helper is around you: Attention-driven Masked Image Modeling", 《ARXIV》, page 1 *

Similar Documents

Publication Publication Date Title
US10902245B2 (en) Method and apparatus for facial recognition
CN108073910B (en) Method and device for generating human face features
CN108280451B (en) Semantic segmentation and network training method and device, equipment and medium
CN113971751A (en) Training feature extraction model, and method and device for detecting similar images
CN111696038A (en) Image super-resolution method, device, equipment and computer-readable storage medium
CN114022887B (en) Text recognition model training and text recognition method and device, and electronic equipment
CN115641485A (en) Generative model training method and device
CN113392791A (en) Skin prediction processing method, device, equipment and storage medium
CN114140831B (en) Human body posture estimation method and device, electronic equipment and storage medium
CN117372782A (en) Small sample image classification method based on frequency domain analysis
CN114005019B (en) Method for identifying flip image and related equipment thereof
CN114898266A (en) Training method, image processing method, device, electronic device and storage medium
CN113837965A (en) Image definition recognition method and device, electronic equipment and storage medium
CN116468919A (en) Image local feature matching method and system
CN116258939A (en) Model training method and device for image processing model
CN116168394A (en) Image text recognition method and device
CN116416131A (en) Target object prediction method and device
CN115376137A (en) Optical character recognition processing and text recognition model training method and device
CN115810215A (en) Face image generation method, device, equipment and storage medium
CN114445632A (en) Picture processing method and device
CN115861684B (en) Training method of image classification model, image classification method and device
CN117058686A (en) Feature generation method
CN114550236B (en) Training method, device, equipment and storage medium for image recognition and model thereof
CN113705430B (en) Form detection method, device, equipment and storage medium based on detection model
CN108230413B (en) Image description method and device, electronic equipment and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination