CN116363429A - Training method of image recognition model, image recognition method, device and equipment - Google Patents

Training method of image recognition model, image recognition method, device and equipment

Info

Publication number
CN116363429A
CN116363429A
Authority
CN
China
Prior art keywords
auxiliary
target
image
model
encoder
Prior art date
Legal status
Pending
Application number
CN202310317047.XA
Other languages
Chinese (zh)
Inventor
李兴建
张泽人
熊昊一
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202310317047.XA
Publication of CN116363429A

Classifications

    • G06V 10/764 — Image or video recognition using pattern recognition or machine learning; classification, e.g. of video objects
    • G06N 3/0455 — Neural networks; auto-encoder networks, encoder-decoder networks
    • G06N 3/048 — Neural networks; activation functions
    • G06N 3/096 — Neural network learning methods; transfer learning
    • G06V 10/28 — Image preprocessing; quantising the image, e.g. histogram thresholding for discrimination between background and foreground patterns
    • G06V 10/40 — Extraction of image or video features
    • G06V 10/774 — Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 10/82 — Image or video recognition using neural networks

Abstract

The disclosure provides a training method of an image recognition model, an image recognition method, a device, and equipment, relating to the technical field of artificial intelligence, and in particular to the technical fields of deep learning, image processing, and computer vision. The method comprises the following steps: graying a first sample image of an auxiliary field to obtain a first auxiliary image; inputting the first auxiliary image into an auxiliary model, and performing color recovery on the first auxiliary image through an auxiliary encoder and an auxiliary decoder in the auxiliary model to obtain a second auxiliary image; pre-training the auxiliary encoder and the auxiliary decoder based on the first auxiliary image and the second auxiliary image; and fine-tuning a target model with a second sample image of the target field, taking the fine-tuned target model as the image recognition model of the target field, where the target encoder in the target model is initialized with the pre-trained auxiliary encoder. Through this technical scheme, the image recognition effect can be improved.

Description

Training method of image recognition model, image recognition method, device and equipment
Technical Field
The present disclosure relates to the field of artificial intelligence, in particular to the technical fields of deep learning, image processing, and computer vision, and more particularly to a training method of an image recognition model, an image recognition method, a device, and equipment.
Background
With the rapid development of deep learning (DL) technology, deep learning has been widely applied in fields such as computer vision, speech recognition, natural language processing, and big data processing.
The masked autoencoder (Mask Auto Encoder, MAE) model has achieved great success in many natural image recognition tasks; however, when the MAE model is migrated to image recognition tasks in some fields, it is affected by the small amount of available data, so the image recognition effect is not ideal.
Disclosure of Invention
The disclosure provides a training method of an image recognition model, an image recognition method, an image recognition device and equipment.
According to an aspect of the present disclosure, there is provided a training method of an image recognition model, including:
graying the first sample image in the auxiliary field to obtain a first auxiliary image;
inputting the first auxiliary image into an auxiliary model, and performing color recovery on the first auxiliary image through an auxiliary encoder and an auxiliary decoder in the auxiliary model to obtain a second auxiliary image;
pre-training the auxiliary encoder and the auxiliary decoder based on the first auxiliary image and the second auxiliary image;
fine-tuning the target model by adopting a second sample image of the target field, and taking the fine-tuned target model as an image recognition model of the target field; the target encoder in the target model is initialized with a pre-trained auxiliary encoder.
According to another aspect of the present disclosure, there is provided an image recognition method including:
acquiring a target image to be identified in the target field;
inputting the target image into an image recognition model in the target field to obtain a recognition result of the target image;
the image recognition model in the target field is obtained by training the image recognition model training method disclosed by any embodiment of the disclosure.
According to still another aspect of the present disclosure, there is provided a training apparatus of an image recognition model, including:
the graying module is used for graying the first sample image in the auxiliary field to obtain a first auxiliary image;
the color recovery module is used for inputting the first auxiliary image into an auxiliary model, and performing color recovery on the first auxiliary image through an auxiliary encoder and an auxiliary decoder in the auxiliary model to obtain a second auxiliary image;
a pre-training module for pre-training the auxiliary encoder and the auxiliary decoder according to the first auxiliary image and the second auxiliary image;
the model fine-tuning module is used for fine-tuning the target model by adopting a second sample image of the target field, and taking the fine-tuned target model as an image recognition model of the target field; the target encoder in the target model is initialized with a pre-trained auxiliary encoder.
According to still another aspect of the present disclosure, there is provided an image recognition apparatus including:
the target image module is used for acquiring a target image to be identified in the target field;
the image recognition module is used for inputting the target image into an image recognition model in the target field to obtain a recognition result of the target image;
the image recognition model in the target field is obtained by training the training device of the image recognition model disclosed in any embodiment of the disclosure.
According to another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the methods provided by any of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method provided by any of the embodiments of the present disclosure.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1a is a flow chart of a training method for an image recognition model provided in accordance with an embodiment of the present disclosure;
FIG. 1b is a schematic diagram of training principles of an image recognition model provided in accordance with an embodiment of the present disclosure;
FIG. 2 is a flow chart of another method of training an image recognition model provided in accordance with an embodiment of the present disclosure;
FIG. 3a is a flow chart of a training method for yet another image recognition model provided in accordance with an embodiment of the present disclosure;
FIG. 3b is a schematic diagram of a training process for an auxiliary model and a target model provided in accordance with an embodiment of the present disclosure;
FIG. 4 is a flow chart of an image recognition method provided in accordance with an embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of a training device for an image recognition model according to an embodiment of the present disclosure;
fig. 6 is a schematic structural view of an image recognition apparatus according to an embodiment of the present disclosure;
fig. 7 is a block diagram of an electronic device for implementing a training method or image recognition method of an image recognition model of an embodiment of the present disclosure.
Detailed Description
Fig. 1a is a flowchart of a training method of an image recognition model according to an embodiment of the present disclosure. The method is suitable for training an image recognition model in the field of targets. The method may be performed by a training device of the image recognition model, which may be implemented in software and/or hardware and may be integrated in an electronic device. As shown in fig. 1a, the training method of the image recognition model of the present embodiment may include:
S101, graying a first sample image in the auxiliary field to obtain a first auxiliary image;
S102, inputting the first auxiliary image into an auxiliary model, and performing color recovery on the first auxiliary image through an auxiliary encoder and an auxiliary decoder in the auxiliary model to obtain a second auxiliary image;
S103, pre-training the auxiliary encoder and the auxiliary decoder according to the first auxiliary image and the second auxiliary image;
S104, fine-tuning the target model by adopting a second sample image of the target field, and taking the fine-tuned target model as an image recognition model of the target field.
In the embodiment of the disclosure, the target field is a field in which image recognition is required, but the number of labeled samples in the target field is very limited; if the labeled samples of the target field were used directly for training, overfitting would easily occur, and the generalization of the trained model would be poor. The auxiliary field is a field other than the target field, i.e., the auxiliary field is different from the target field. The auxiliary field has abundant labeled samples, and an image dataset of natural scenes, such as the visual dataset ImageNet, can be adopted. The ImageNet dataset has 1,281,167 labeled training images across 1,000 classes (about 1,300 images per class), a validation set of 50,000 images (50 per class), and a test set of 100,000 images (100 per class), so it can provide sufficient first sample images.
The auxiliary model is a deep learning model of the auxiliary field, and the target model is a deep learning model of the target field; the auxiliary encoder in the auxiliary model and the target encoder in the target model can be built based on a masked autoencoder (Mask Auto Encoder, MAE) structure. MAE models have achieved great success in many natural image recognition tasks, but the effect of directly migrating them to image recognition tasks in the target field is not ideal. The embodiment of the disclosure provides a brand-new MAE pre-training task: through gray-scale picture color recovery (colorization), the encoder learns features better suited to images of the target field by exploiting existing large-scale natural scene datasets, thereby improving the image recognition effect in the target field.
FIG. 1b is a schematic diagram of training an image recognition model according to an embodiment of the present disclosure. Referring to FIG. 1b, the auxiliary model is constructed based on the MAE structure and includes an auxiliary encoder and an auxiliary decoder, where the auxiliary encoder may employ the encoder structure in ViT (Vision Transformer) and the auxiliary decoder may employ the MAE decoder structure; the target model includes a target encoder and a target output layer, and the target encoder is identical in structure to the auxiliary encoder. The structure of the target output layer is determined by the type of the target recognition task, which may be an image classification task or an image segmentation task. The first sample image is an RGB three-channel image of the auxiliary field, and the second sample image may be a gray-scale image of the target field, such as a CT image in the medical field; the number of first sample images is greater than the number of second sample images.
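The disclosure describes these components only at the block-diagram level. As a concrete illustration, the following is a minimal PyTorch sketch of a ViT-style auxiliary encoder and a lightweight pixel-reconstruction decoder. All class names, the 768-dimension / 12-layer / 12-head configuration, and the 16 × 16 patch size are illustrative assumptions, not values given in the patent.

```python
# Minimal sketch of the MAE-style auxiliary model (illustrative assumptions).
import torch
import torch.nn as nn

class ViTEncoder(nn.Module):
    """Simplified stand-in for a ViT encoder over 16x16 patches."""
    def __init__(self, img_size=224, patch=16, dim=768, depth=12, heads=12):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        num_patches = (img_size // patch) ** 2
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, x):
        x = self.patch_embed(x).flatten(2).transpose(1, 2)   # (B, N, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        return self.blocks(x)                                # (B, N+1, dim)

class MAEDecoder(nn.Module):
    """Lightweight decoder mapping patch tokens back to RGB pixels."""
    def __init__(self, dim=768, patch=16):
        super().__init__()
        self.proj = nn.Linear(dim, patch * patch * 3)
        self.patch = patch

    def forward(self, tokens, img_size=224):
        b = tokens.size(0)
        side = img_size // self.patch
        pix = self.proj(tokens[:, 1:])                       # drop CLS token
        pix = pix.view(b, side, side, self.patch, self.patch, 3)
        return pix.permute(0, 5, 1, 3, 2, 4).reshape(b, 3, img_size, img_size)

aux_encoder, aux_decoder = ViTEncoder(), MAEDecoder()        # auxiliary model pair
```

Unlike the original MAE, which feeds only visible patches to the encoder, this sketch encodes the full (occluded) image for simplicity; either design is compatible with the training flow described here.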
Referring to fig. 1b, in the pre-training stage, a first sample image of the auxiliary field is grayed to obtain a first auxiliary image; the first auxiliary image is input into the auxiliary model, and color recovery is performed on it through the auxiliary encoder and the auxiliary decoder to obtain a second auxiliary image; the auxiliary encoder and the auxiliary decoder are then pre-trained according to the first auxiliary image and the second auxiliary image. In the pre-training stage, the auxiliary encoder fully learns the image semantic features of the auxiliary field through the color recovery task, thereby learning a representation with good generalization performance; when migrated to the target field, the target encoder therefore also has good generalization performance and can learn the image features of the target field with high quality, improving the image recognition performance of the target field.
In the fine tuning stage, the pre-trained auxiliary encoder is adopted as an initial target encoder, namely, network parameters in the pre-trained auxiliary encoder are adopted to initialize the network parameters in the target encoder, a second sample image in the target field is input into a target model, image recognition is carried out on the second sample image through the target encoder and a target output layer in the target model, fine tuning is carried out on the target encoder and the target output layer according to the image recognition result, and the fine-tuned target model is adopted as an image recognition model in the target field and is used for recognizing the image in the target field.
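Continuing the sketch above, the initialization described here amounts to a plain weight copy from the pre-trained auxiliary encoder into a structurally identical target encoder; the exact mechanics are an assumption the patent leaves implicit.

```python
# Hypothetical fine-tuning setup: the target encoder starts from the
# pre-trained auxiliary encoder's parameters (a direct state-dict copy).
target_encoder = ViTEncoder()
target_encoder.load_state_dict(aux_encoder.state_dict())
```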
The auxiliary encoder can learn image semantic features better in the pre-training stage through the color recovery task based on the MAE model structure, and the target encoder in the target model is initialized through the pre-trained auxiliary encoder, so that the target encoder can learn image features of the target field with high quality, and the image recognition performance of the target field is improved.
According to the technical scheme provided by the embodiment of the disclosure, the first auxiliary image is subjected to color recovery through the auxiliary model to obtain the second auxiliary image, and the target encoder in the target model is initialized by adopting the pre-trained auxiliary encoder, so that the target encoder can learn the image characteristics more suitable for the target field by utilizing the data set in the auxiliary field, and the image recognition performance of the target field can be improved.
In an alternative embodiment, the target area is a medical area; the object model is used for processing images in the medical field.
Under the condition that the target field is a medical field, the image of the target field can be a gray scale image such as a CT image, and the model can be more easily moved to the target field through the gray scale image color recovery processing of the pre-training stage, so that a better recognition effect is obtained.
Fig. 2 is a flowchart of another method of training an image recognition model provided in accordance with an embodiment of the present disclosure. Referring to fig. 2, the training method of the image recognition model of the present embodiment may include:
S201, graying a first sample image in the auxiliary field to obtain a first auxiliary image;
S202, carrying out mask shielding on the first auxiliary image to obtain a shielded first auxiliary image;
S203, inputting the blocked first auxiliary image into an auxiliary model, and performing color recovery and blocking position reconstruction on the first auxiliary image through an auxiliary encoder and an auxiliary decoder in the auxiliary model to obtain a second auxiliary image;
S204, pre-training the auxiliary encoder and the auxiliary decoder according to the first auxiliary image and the second auxiliary image;
S205, fine-tuning the target model by adopting a second sample image of the target field, and taking the fine-tuned target model as an image recognition model of the target field; the target encoder in the target model is initialized with a pre-trained auxiliary encoder.
In the pre-training stage, the first sample image of the auxiliary field is not only grayed to obtain a first auxiliary image; the first auxiliary image is also mask-occluded to obtain an occluded first auxiliary image, which is input into the auxiliary model. The auxiliary model performs color recovery on the first auxiliary image and reconstructs the occluded positions from the unoccluded positions to obtain a second auxiliary image. The auxiliary model thus performs not only the color recovery task but also the MIM (Masked Image Modeling, masked image reconstruction) task, which can further improve the image feature extraction capability of the encoder.
The embodiment of the disclosure does not specifically limit the graying manner of the first sample image; for example, the gray-scale transformation Y = 0.2126 × R + 0.7152 × G + 0.0722 × B may be adopted, where R, G, and B represent the values of the red, green, and blue color channels of the first sample image and Y is the single-channel gray value. The first sample image may also be randomly scaled and cropped to 224 × 224 resolution before being grayed. The second auxiliary image and the first sample image are both RGB three-channel color images.
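For illustration, the quoted luma transform could be implemented as below; replicating the single gray channel back to three channels is an assumption made here so the encoder sketch keeps its RGB input shape.

```python
def grayscale(rgb):
    """rgb: (B, 3, H, W) tensor in [0, 1]; returns luma replicated to 3 channels."""
    r, g, b = rgb[:, 0], rgb[:, 1], rgb[:, 2]
    y = 0.2126 * r + 0.7152 * g + 0.0722 * b    # Y = 0.2126R + 0.7152G + 0.0722B
    return y.unsqueeze(1).repeat(1, 3, 1, 1)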
Specifically, in the pre-training stage, the first sample image can be scaled, cropped, and otherwise processed; the processed first sample image is grayed with the gray-scale transformation Y = 0.2126 × R + 0.7152 × G + 0.0722 × B to obtain a first auxiliary image, and a partial region of the first auxiliary image is occluded with a mask to obtain an occluded first auxiliary image. The occluded first auxiliary image is input into the auxiliary model, and the color recovery task and the MIM task are performed through the auxiliary encoder and the auxiliary decoder in the auxiliary model to obtain a second auxiliary image. By jointly considering the color recovery task and the MIM task, the auxiliary encoder can better learn image features, further improving the image recognition performance when transferred to the target field.
In an optional embodiment, the masking the first auxiliary image to obtain an occluded first auxiliary image includes: masking and shielding the first auxiliary image by adopting a preset shielding proportion to obtain a shielded first auxiliary image; the occlusion ratio is less than an occlusion ratio threshold.
The preset occlusion ratio and the occlusion ratio threshold can be empirical values; for example, the occlusion ratio threshold can be 75%, and the preset occlusion ratio can be 15%. Since the color recovery task amounts to a 256 × 256-category classification for each pixel, the computational complexity is high. By keeping the occlusion ratio of the MIM task low, the auxiliary model converges better than with a high occlusion ratio at the occlusion ratio threshold.
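A sketch of the low-ratio patch masking follows, with the 15% figure above as the default; zero-filling the occluded patches is an assumption, since the disclosure does not state the mask fill value.

```python
def mask_patches(imgs, patch=16, ratio=0.15):
    """Randomly occlude `ratio` of the (patch x patch) blocks of each image."""
    b, _, h, w = imgs.shape
    nh, nw = h // patch, w // patch
    num_masked = int(nh * nw * ratio)
    out = imgs.clone()
    for i in range(b):
        for j in torch.randperm(nh * nw)[:num_masked].tolist():
            r, c = divmod(j, nw)
            out[i, :, r*patch:(r+1)*patch, c*patch:(c+1)*patch] = 0.0
    return out
```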
In an alternative embodiment, the pre-training the auxiliary encoder and the auxiliary decoder according to the first auxiliary image and the second auxiliary image includes: determining a pre-training loss function according to the pixel value of each pixel point in the first auxiliary image and the pixel value of the corresponding pixel point in the second auxiliary image; and pre-training the auxiliary encoder and the auxiliary decoder by adopting the pre-training loss function.
The loss function of the pre-training stage may take a pixel-level mean squared error (MSE) loss, also known as L2 loss. Specifically, for each pixel point I1 in the first auxiliary image, the corresponding pixel point I2 in the second auxiliary image may be determined; a pre-training loss value is obtained from the Euclidean distance between the pixel value of I1 and the pixel value of I2, and the network parameters in the auxiliary encoder and the auxiliary decoder are updated with this loss value. In the pre-training stage, updating the network parameters in the auxiliary encoder and auxiliary decoder through the pixel-level L2 loss function gives the auxiliary encoder good image feature extraction capability. It should be noted that the trained auxiliary model may also be used as a color recovery (colorization) function for black-and-white pictures.
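Putting the pieces together, one pre-training step could look like the sketch below. The text computes the loss between the first and second auxiliary images; since the recovered second auxiliary image is an RGB color image, this sketch regresses against the original color first sample image, which is one plausible reading. The AdamW optimizer and learning rate are assumptions.

```python
import torch.nn.functional as F

opt = torch.optim.AdamW(
    list(aux_encoder.parameters()) + list(aux_decoder.parameters()), lr=1.5e-4)

def pretrain_step(first_sample):
    first_aux = mask_patches(grayscale(first_sample))   # grayed + occluded input
    second_aux = aux_decoder(aux_encoder(first_aux))    # color-recovered output
    loss = F.mse_loss(second_aux, first_sample)         # pixel-level L2 loss
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```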
According to the technical scheme provided by the embodiment of the disclosure, in the pre-training stage, the auxiliary model is further led into the MIM task after the color recovery task, and the mask shielding proportion in the MIM task is controlled, so that the image feature extraction capability can be further improved, the auxiliary model can be better converged, and the image recognition performance of the auxiliary model in the target field is further improved.
Fig. 3a is a flowchart of a training method of yet another image recognition model provided in accordance with an embodiment of the present disclosure. This embodiment is an alternative to the embodiments described above. Referring to fig. 3a, the training method of the image recognition model of the present embodiment may include:
S301, graying a first sample image in the auxiliary field to obtain a first auxiliary image;
S302, inputting the first auxiliary image into an auxiliary model, and performing color recovery on the first auxiliary image through an auxiliary encoder and an auxiliary decoder in the auxiliary model to obtain a second auxiliary image;
S303, pre-training the auxiliary encoder and the auxiliary decoder according to the first auxiliary image and the second auxiliary image;
S304, inputting a second sample image of the target field into a target encoder in a target model to perform feature extraction to obtain a second target feature;
S305, inputting the second target feature into a target output layer in a target model to obtain prediction information of a second sample image;
S306, fine-tuning the target encoder and the target output layer by adopting the labeling information and the prediction information of the second sample image, and taking the fine-tuned target model as the image recognition model of the target field.
The target model includes a target encoder and a target output layer, the target encoder being initialized with the pre-trained auxiliary encoder. The second sample image has labeling information (ground truth), and the type of the labeling information is determined by the image recognition task of the target field: when the image recognition task is an image segmentation task, the labeling information of the second sample image is the labeled image segmentation result; when the image recognition task is an image classification task, the labeling information of the second sample image is the labeled image classification result.
In the fine-tuning stage, the second sample image of the target field is input into the target encoder, and feature extraction is performed on it through the target encoder to obtain a second target feature; the second target feature is input into the target output layer for image recognition to obtain the prediction information of the second sample image. The network parameters in the target encoder and the target output layer are then updated according to the difference between the prediction information and the labeling information of the second sample image, for example according to a loss computed from the logits of the prediction information against the labeling information, and the fine-tuned target model is used as the image recognition model of the target field.
And in the fine adjustment stage, performing image recognition on the second sample image through the target model to obtain the prediction information of the second sample image, and performing fine adjustment on the target model according to the labeling information and the prediction information of the second sample image to obtain the image recognition model of the target field. Because the target encoder in the target model is initialized by adopting the pre-trained auxiliary encoder, the first sample image learning image feature extraction capability of the auxiliary field can be borrowed, and the dependence on the labeling data of the target field is reduced.
In an alternative embodiment, the fine tuning the target encoder and the target output layer using the labeling information and the prediction information of the second sample image includes: determining cross entropy loss according to the difference value between the labeling information of the second sample image and the prediction information; determining auxiliary coding parameters corresponding to the target coding parameters in the auxiliary encoder aiming at the target coding parameters in the target encoder, and determining regularization loss according to the target coding parameters and the auxiliary coding parameters; and fine tuning the target encoder and the target output layer according to the cross entropy loss and the regularization loss.
In the fine-tuning stage, not only is the cross entropy loss considered; a regularization loss between the target coding parameters and the auxiliary coding parameters can also be introduced. By updating the target encoder with the auxiliary encoder's parameters as a reference, the regularization loss helps the target encoder reduce overfitting during fine-tuning.
Specifically, for target coding parameters in a target encoder, determining auxiliary coding parameters corresponding to the target coding parameters in the auxiliary encoder according to an initialized corresponding relation between the target encoder and the auxiliary encoder, and determining regularization loss according to the target coding parameters and the corresponding auxiliary coding parameters; network parameters in the target encoder and the target output layer are fine-tuned in combination with cross entropy loss and regularization loss.
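As a sketch, the combined fine-tuning objective could be written as below. The disclosure does not specify the exact form of the regularizer or its weight; an L2 penalty toward the pre-trained parameters and the coefficient `lam` are assumptions.

```python
def finetune_loss(logits, labels, target_enc, aux_enc, lam=1e-3):
    ce = F.cross_entropy(logits, labels)                  # task cross entropy
    reg = sum((pt - pa.detach()).pow(2).sum()             # L2 pull toward the
              for pt, pa in zip(target_enc.parameters(),  # pre-trained auxiliary
                                aux_enc.parameters()))    # encoder's parameters
    return ce + lam * reg
```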
In an alternative embodiment, the target encoder employs an encoder in a visual transformer; under the condition that the target model is an image segmentation task, the target output layer is a decoder for semantic segmentation; in the case that the target model is an image classification task, the target output layer is a linear classifier.
In the embodiment of the disclosure, the target encoder and the auxiliary encoder are identical in structure, and the ViT encoder in the MAE can be adopted. The lightweight MAE decoder network is discarded and replaced by the target output layer. Referring to fig. 3b, in the case where the target model performs an image segmentation task, the target output layer may be a semantic segmentation decoder, such as an UPerNet decoder for 2D image segmentation or a UNETR decoder for 3D image segmentation.
Referring to fig. 3b, in the case where the target model performs an image classification task, the target output layer may be a linear layer: the output corresponding to the learnable class token (CLS) in the target encoder is used as the representation of the whole image, and an external linear layer maps this representation into logits with the same number of classes. Because much data in medical image classification is multi-label, i.e., one image has multiple class labels, the logits can be activated with a Sigmoid function, and the network of the whole target model is optimized with a corresponding loss function. By adapting the network structure of the target output layer to the image recognition task, image recognition in the target field can be realized.
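A sketch of this classification head follows, assuming the 768-dimensional CLS output from the earlier encoder sketch; the class count is illustrative, and binary-cross-entropy-with-logits is used as a standard way to realize the Sigmoid-activated multi-label objective described above.

```python
num_classes = 14                      # illustrative; set by the target task
head = nn.Linear(768, num_classes)

def classification_loss(second_sample, multi_hot_labels):
    tokens = target_encoder(second_sample)   # (B, N+1, 768)
    logits = head(tokens[:, 0])              # CLS-token image representation
    # multi-label case: Sigmoid activation folded into BCE-with-logits
    return F.binary_cross_entropy_with_logits(logits, multi_hot_labels.float())
```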
According to the technical scheme provided by the embodiment of the disclosure, in the fine tuning stage, the pre-trained auxiliary encoder is adopted to initialize the target encoder, so that the first sample image learning image feature extraction capability in the auxiliary field can be borrowed; by fine tuning network parameters in the target encoder and the target output layer according to the cross entropy loss and the regularization loss, the target encoder can avoid overfitting in the fine tuning stage; and by providing an adaptive network structure of a target output layer for the target recognition task, image recognition of the target field can be realized.
Fig. 4 is a flowchart of an image recognition method provided according to an embodiment of the present disclosure. The method is suitable for executing the image recognition task in the target field. The method may be performed by an image recognition device, which may be implemented in software and/or hardware, and may be integrated in an electronic device. As shown in fig. 4, the image recognition method of the present embodiment may include:
S401, acquiring a target image to be identified in the target field;
S402, inputting the target image into an image recognition model in the target field to obtain a recognition result of the target image;
the image recognition model in the target field is obtained by training the image recognition model training method disclosed by any embodiment of the disclosure.
The target field is a field in which image recognition is required, but the number of labeled samples in the target field is very limited. The target field may be the medical image field, which has high labeling cost and generally limited dataset sizes. In the case that the target image is a gray-scale image, for example a CT image, the target image may be directly input into the image recognition model of the target field to perform image recognition, obtaining the recognition result of the target image. When the target image is a color image, the color image may first be grayed and then input into the image recognition model.
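Inference then reduces to the short sketch below; function and module names continue the earlier sketches, and treating the model as a multi-label classifier is an assumption tied to the classification-head example above.

```python
@torch.no_grad()
def recognize(target_image, is_color=False):
    x = grayscale(target_image) if is_color else target_image
    logits = head(target_encoder(x)[:, 0])
    return torch.sigmoid(logits)             # per-class recognition scores
```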
The image recognition model in the target field provided by the embodiment of the disclosure is constructed by adopting the training method of the image recognition model provided by any embodiment of the disclosure, so that the image recognition model has good image recognition performance.
According to the technical scheme provided by the embodiment of the disclosure, the MAE model with color recovery is added for pre-training, and the pre-trained auxiliary encoder is adopted for initializing the encoder in the target field, so that the target encoder can better learn the image characteristics of the target field, and therefore the target field has good image recognition performance.
Fig. 5 is a schematic structural diagram of a training device for an image recognition model according to an embodiment of the present disclosure. The method and the device are suitable for training the image recognition model in the target field. The apparatus may be implemented in software and/or hardware. As shown in fig. 5, the training apparatus 500 of the image recognition model of the present embodiment may include:
the graying module 510 is configured to gray the first sample image in the auxiliary field to obtain a first auxiliary image;
the color recovery module 520 is configured to input the first auxiliary image into an auxiliary model, and perform color recovery on the first auxiliary image through an auxiliary encoder and an auxiliary decoder in the auxiliary model to obtain a second auxiliary image;
a pre-training module 530 for pre-training the auxiliary encoder and the auxiliary decoder based on the first auxiliary image and the second auxiliary image;
a model fine tuning module 540, configured to fine tune the target model using the second sample image of the target domain, and use the fine-tuned target model as an image recognition model of the target domain; the target encoder in the target model is initialized with a pre-trained auxiliary encoder.
In an alternative embodiment, the color recovery module 520 includes:
a masking unit, configured to mask and block the first auxiliary image to obtain a blocked first auxiliary image;
and the auxiliary model unit is used for inputting the blocked first auxiliary image into an auxiliary model, and carrying out color recovery and blocking position reconstruction on the first auxiliary image through an auxiliary encoder and an auxiliary decoder in the auxiliary model to obtain a second auxiliary image.
In an alternative embodiment, the masking unit is specifically configured to:
masking and shielding the first auxiliary image by adopting a preset shielding proportion to obtain a shielded first auxiliary image; the occlusion ratio is less than an occlusion ratio threshold.
In an alternative embodiment, the pre-training module 530 includes:
the training loss unit is used for determining a pre-training loss function according to the pixel value of each pixel point in the first auxiliary image and the pixel value of the corresponding pixel point in the second auxiliary image;
and the pre-training unit is used for pre-training the auxiliary encoder and the auxiliary decoder by adopting the pre-training loss function.
In an alternative embodiment, the model fine tuning module 540 includes:
the feature extraction unit is used for inputting a second sample image in the target field into a target encoder in the target model to perform feature extraction to obtain a second target feature;
the prediction information unit is used for inputting the second target characteristics into a target output layer in a target model to obtain the prediction information of a second sample image;
and the model fine tuning unit is used for fine tuning the target encoder and the target output layer by adopting the marking information and the prediction information of the second sample image.
In an alternative embodiment, the model fine tuning unit comprises:
a cross entropy loss subunit, configured to determine a cross entropy loss according to a difference between the labeling information of the second sample image and the prediction information;
a regularization loss subunit, configured to determine, for a target coding parameter in the target encoder, an auxiliary coding parameter corresponding to the target coding parameter in the auxiliary encoder, and determine regularization loss according to the target coding parameter and the auxiliary coding parameter;
and the model fine tuning subunit is used for carrying out fine tuning on the target encoder and the target output layer according to the cross entropy loss and the regularization loss.
In an alternative embodiment, the target encoder employs an encoder in a visual transformer;
under the condition that the target model is an image segmentation task, the target output layer is a decoder for semantic segmentation; in the case that the target model is an image classification task, the target output layer is a linear classifier.
In an alternative embodiment, the target area is a medical area; the object model is used for processing images in the medical field.
According to the technical scheme, the first auxiliary image is subjected to color recovery through the auxiliary model to obtain the second auxiliary image, so that an auxiliary encoder in the auxiliary model can learn image semantic features with high quality; and initializing a target encoder in the target model by adopting the pre-trained auxiliary encoder, and fine-tuning the target model by adopting a second sample image of the target field, so that the image recognition performance of the target field can be improved.
Fig. 6 is a schematic structural view of an image recognition apparatus according to an embodiment of the present disclosure. The embodiment is suitable for executing the image recognition task in the target field. The apparatus may be implemented in software and/or hardware. As shown in fig. 6, the image recognition apparatus 600 of the present embodiment may include:
a target image module 610, configured to acquire a target image to be identified in a target field;
the image recognition module 620 is configured to input the target image into an image recognition model in the target field, so as to obtain a recognition result of the target image;
the image recognition model in the target field is obtained by training the training device of the image recognition model provided by any embodiment of the disclosure.
According to the technical scheme, the MAE model with color recovery is added for pre-training, and the pre-trained auxiliary encoder is adopted for initializing the encoder in the target field, so that the target encoder can better learn the image characteristics of the target field, and therefore the target field has good image recognition performance.
In the technical scheme of the disclosure, the acquisition, storage, and application of the user personal information involved all conform to the provisions of relevant laws and regulations and do not violate public order and good customs.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 7 is a block diagram of an electronic device for implementing a training method or image recognition method of an image recognition model of an embodiment of the present disclosure.
Fig. 7 illustrates a schematic block diagram of an example electronic device 700 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the electronic device 700 includes a computing unit 701 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the electronic device 700 may also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
Various components in the electronic device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, etc.; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, an optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the electronic device 700 to exchange information/data with other devices through a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 701 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The calculation unit 701 performs the respective methods and processes described above, for example, a training method of an image recognition model or an image recognition method. For example, in some embodiments, the training method of the image recognition model or the image recognition method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the training method of the image recognition model or the image recognition method described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the training method of the image recognition model or the image recognition method by any other suitable means (e.g. by means of firmware).
Various implementations of the systems and techniques described above can be implemented in digital electronic circuitry, integrated circuit systems, Field-Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs, the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
Artificial intelligence is the discipline of studying the process of making a computer mimic certain mental processes and intelligent behaviors (e.g., learning, reasoning, thinking, planning, etc.) of a person, both hardware-level and software-level techniques. Artificial intelligence hardware technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, and the like; the artificial intelligent software technology mainly comprises a computer vision technology, a voice recognition technology, a natural language processing technology, a machine learning/deep learning technology, a big data processing technology, a knowledge graph technology and the like.
Cloud computing refers to a technical system in which an elastically extensible pool of shared physical or virtual resources is accessed through a network; the resources can include servers, operating systems, networks, software, applications, storage devices, and the like, and can be deployed and managed in an on-demand, self-service manner. Through cloud computing technology, efficient and powerful data processing capability can be provided for technical applications such as artificial intelligence and blockchain, and for model training.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (21)

1. A training method of an image recognition model, comprising:
graying the first sample image in the auxiliary field to obtain a first auxiliary image;
inputting the first auxiliary image into an auxiliary model, and performing color recovery on the first auxiliary image through an auxiliary encoder and an auxiliary decoder in the auxiliary model to obtain a second auxiliary image;
pre-training the auxiliary encoder and the auxiliary decoder based on the first auxiliary image and the second auxiliary image;
fine-tuning the target model by adopting a second sample image of the target field, and taking the fine-tuned target model as an image recognition model of the target field; the target encoder in the target model is initialized with a pre-trained auxiliary encoder.
2. The method of claim 1, wherein the inputting the first auxiliary image into an auxiliary model, color recovering the first auxiliary image by an auxiliary encoder and an auxiliary decoder in the auxiliary model, and obtaining a second auxiliary image, comprises:
masking and shielding the first auxiliary image to obtain a shielded first auxiliary image;
and inputting the blocked first auxiliary image into an auxiliary model, and performing color recovery and blocking position reconstruction on the first auxiliary image through an auxiliary encoder and an auxiliary decoder in the auxiliary model to obtain a second auxiliary image.
3. The method of claim 2, wherein the masking the first auxiliary image results in an occluded first auxiliary image, comprising:
masking and shielding the first auxiliary image by adopting a preset shielding proportion to obtain a shielded first auxiliary image; the occlusion ratio is less than an occlusion ratio threshold.
4. The method of claim 1, wherein the pre-training the auxiliary encoder and the auxiliary decoder from the first auxiliary image and the second auxiliary image comprises:
determining a pre-training loss function according to the pixel value of each pixel point in the first auxiliary image and the pixel value of the corresponding pixel point in the second auxiliary image;
and pre-training the auxiliary encoder and the auxiliary decoder by adopting the pre-training loss function.
5. The method of claim 1, wherein the fine-tuning the target model with the second sample image of the target area comprises:
inputting a second sample image of the target field into a target encoder in a target model to perform feature extraction to obtain a second target feature;
inputting the second target characteristics into a target output layer in a target model to obtain prediction information of a second sample image;
and fine tuning the target encoder and the target output layer by adopting the marking information and the prediction information of the second sample image.
6. The method of claim 5, wherein said employing the labeling information and the prediction information of the second sample image to fine tune the target encoder and the target output layer comprises:
determining cross entropy loss according to the difference value between the labeling information of the second sample image and the prediction information;
determining auxiliary coding parameters corresponding to the target coding parameters in the auxiliary encoder aiming at the target coding parameters in the target encoder, and determining regularization loss according to the target coding parameters and the auxiliary coding parameters;
and fine tuning the target encoder and the target output layer according to the cross entropy loss and the regularization loss.
7. The method of claim 5, wherein the target encoder employs an encoder in a visual transformer;
under the condition that the target model is an image segmentation task, the target output layer is a decoder for semantic segmentation; in the case that the target model is an image classification task, the target output layer is a linear classifier.
8. The method of any one of claims 1-7, wherein the target area is a medical area; the object model is used for processing images in the medical field.
9. An image recognition method, comprising:
acquiring a target image to be identified in the target field;
inputting the target image into an image recognition model in the target field to obtain a recognition result of the target image;
wherein the image recognition model of the target field is trained by the method of any one of claims 1-8.
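A sketch of this recognition step, assuming a classification-style model:

```python
import torch

@torch.no_grad()
def recognize(model: torch.nn.Module, target_image: torch.Tensor) -> int:
    model.eval()
    logits = model(target_image.unsqueeze(0))   # add a batch dimension
    return int(logits.argmax(dim=-1))           # recognition result as a class index
```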
10. A training device for an image recognition model, comprising:
the graying module is used for graying the first sample image in the auxiliary field to obtain a first auxiliary image;
the color recovery module is used for inputting the first auxiliary image into an auxiliary model, and performing color recovery on the first auxiliary image through an auxiliary encoder and an auxiliary decoder in the auxiliary model to obtain a second auxiliary image;
a pre-training module for pre-training the auxiliary encoder and the auxiliary decoder according to the first auxiliary image and the second auxiliary image;
the model fine-tuning module is used for fine-tuning the target model using a second sample image of the target field, and taking the fine-tuned target model as an image recognition model of the target field; wherein a target encoder in the target model is initialized with the pre-trained auxiliary encoder.
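The graying module's conversion, sketched with standard ITU-R BT.601 luminance weights; the exact conversion is an assumption, as the patent does not specify it:

```python
import torch

def to_grayscale(rgb: torch.Tensor) -> torch.Tensor:
    # rgb: (B, 3, H, W) in [0, 1] -> (B, 1, H, W) first auxiliary image.
    weights = torch.tensor([0.299, 0.587, 0.114], device=rgb.device).view(1, 3, 1, 1)
    return (rgb * weights).sum(dim=1, keepdim=True)
```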
11. The apparatus of claim 10, wherein the color recovery module comprises:
a masking unit, configured to mask the first auxiliary image to obtain an occluded first auxiliary image;
and an auxiliary model unit, configured to input the occluded first auxiliary image into the auxiliary model, and perform color recovery and occluded-position reconstruction on the first auxiliary image through the auxiliary encoder and the auxiliary decoder in the auxiliary model to obtain the second auxiliary image.
12. The apparatus of claim 11, wherein the masking unit is specifically configured to:
masking the first auxiliary image using a preset occlusion ratio to obtain an occluded first auxiliary image; wherein the occlusion ratio is less than an occlusion ratio threshold.
13. The apparatus of claim 10, wherein the pre-training module comprises:
the training loss unit is used for determining a pre-training loss function according to the pixel value of each pixel point in the first auxiliary image and the pixel value of the corresponding pixel point in the second auxiliary image;
and the pre-training unit is used for pre-training the auxiliary encoder and the auxiliary decoder by adopting the pre-training loss function.
14. The apparatus of claim 10, wherein the model fine tuning module comprises:
the feature extraction unit is used for inputting the second sample image of the target field into the target encoder in the target model for feature extraction to obtain second target features;
the prediction information unit is used for inputting the second target features into a target output layer in the target model to obtain prediction information of the second sample image;
and the model fine-tuning unit is used for fine-tuning the target encoder and the target output layer using labeling information and the prediction information of the second sample image.
15. The apparatus of claim 14, wherein the model fine tuning unit comprises:
a cross-entropy loss subunit, configured to determine a cross-entropy loss according to the difference between the labeling information of the second sample image and the prediction information;
a regularization loss subunit, configured to determine, for a target coding parameter in the target encoder, an auxiliary coding parameter corresponding to the target coding parameter in the auxiliary encoder, and determine a regularization loss according to the target coding parameter and the auxiliary coding parameter;
and a model fine-tuning subunit, configured to fine-tune the target encoder and the target output layer according to the cross-entropy loss and the regularization loss.
16. The apparatus of claim 14, wherein the target encoder employs an encoder of a vision transformer;
in the case that the target model is used for an image segmentation task, the target output layer is a decoder for semantic segmentation; in the case that the target model is used for an image classification task, the target output layer is a linear classifier.
17. The apparatus of any one of claims 10-16, wherein the target field is the medical field; the target model is used for processing images in the medical field.
18. An image recognition apparatus comprising:
the target image module is used for acquiring a target image to be identified in the target field;
the image recognition module is used for inputting the target image into an image recognition model in the target field to obtain a recognition result of the target image;
wherein the image recognition model of the target field is trained by the apparatus of any one of claims 10-17.
19. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9.
20. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-9.
21. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any of claims 1-9.
CN202310317047.XA 2023-03-28 2023-03-28 Training method of image recognition model, image recognition method, device and equipment Pending CN116363429A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310317047.XA CN116363429A (en) 2023-03-28 2023-03-28 Training method of image recognition model, image recognition method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310317047.XA CN116363429A (en) 2023-03-28 2023-03-28 Training method of image recognition model, image recognition method, device and equipment

Publications (1)

Publication Number Publication Date
CN116363429A true CN116363429A (en) 2023-06-30

Family

ID=86914246

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310317047.XA Pending CN116363429A (en) 2023-03-28 2023-03-28 Training method of image recognition model, image recognition method, device and equipment

Country Status (1)

Country Link
CN (1) CN116363429A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117373016A (en) * 2023-10-20 2024-01-09 农芯(南京)智慧农业研究院有限公司 Tobacco leaf baking state judging method, device, equipment and storage medium
CN117373016B (en) * 2023-10-20 2024-04-30 农芯(南京)智慧农业研究院有限公司 Tobacco leaf baking state judging method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination