CN116468970A - Model training method, image processing method, device, equipment and medium


Info

Publication number
CN116468970A
Authority
CN
China
Prior art keywords
visual processing
training
model
processing task
template
Prior art date
Legal status
Pending
Application number
CN202310440035.6A
Other languages
Chinese (zh)
Inventor
刁文辉
毛秀华
杨怡冉
白建东
冯瑛超
李俊希
路晓男
尹文昕
Current Assignee
Aerospace Information Research Institute of CAS
Original Assignee
Aerospace Information Research Institute of CAS
Application filed by Aerospace Information Research Institute of CAS filed Critical Aerospace Information Research Institute of CAS
Priority to CN202310440035.6A
Publication of CN116468970A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Abstract

The disclosure provides a model training method, an image processing method, a device, equipment and a medium, which can be applied to the technical field of artificial intelligence. The method includes: in response to a first visual processing task for training a pre-training model, acquiring a first visual processing task template; inputting the first visual processing task template and a sample image into the pre-training model, and processing the sample image and the first visual processing task template by using a first adapter to obtain a first visual processing result; and training the pre-training model by using the first visual processing result and a first visual processing label corresponding to the sample image to obtain a first visual processing task model, wherein model parameters of a backbone network remain unchanged during the training of the pre-training model. The first visual processing task template includes a template image and marking information about the template image, wherein the template image is used to construct the first visual processing task template, and the marking information is used to guide the pre-training model to output the first visual processing result.

Description

Model training method, image processing method, device, equipment and medium
Technical Field
The disclosure relates to the technical field of artificial intelligence, and in particular to a model training method, an image processing method, a device, equipment and a medium.
Background
For a new processing task, the model parameters of an existing model generally need to be globally re-adjusted. This consumes more resources, wastes more training time, and causes the model parameters obtained through training to occupy more storage space.
Disclosure of Invention
In view of the foregoing, the present disclosure provides a model training method, an image processing method, an apparatus, a device, and a medium.
According to a first aspect of the present disclosure, there is provided a model training method including: in response to a first visual processing task for training a pre-training model, acquiring a first visual processing task template, wherein the pre-training model includes a backbone network and a first adapter corresponding to the first visual processing task, and model parameters of the backbone network are obtained through pre-training; inputting the first visual processing task template and a sample image into the pre-training model, and processing the sample image and the first visual processing task template by using the first adapter to obtain a first visual processing result; and training the pre-training model by using the first visual processing result and a first visual processing label corresponding to the sample image to obtain a first visual processing task model, wherein model parameters of the backbone network are kept unchanged during the training of the pre-training model. The first visual processing task template includes a template image and marking information about the template image, wherein the template image is used to construct the first visual processing task template, the marking information is used to guide the pre-training model to output the first visual processing result, and the first visual processing task model is used to process the first visual processing task.
According to an embodiment of the present disclosure, before the first visual processing task template is acquired in response to the first visual processing task for training the pre-training model, the model training method further includes: acquiring a template image based on a first visual processing task; based on the template image, a first visual processing task template and a first visual processing tag are obtained.
According to an embodiment of the present disclosure, training a pre-training model using a first visual processing result and a first visual processing label corresponding to a sample image to obtain a first visual processing task model, comprising: obtaining loss information according to the first visual processing result and the first visual processing label; and carrying out inverse gradient optimization on the parameter information of the first adapter based on the loss information until the loss information meets the preset condition, so as to obtain a first visual processing task model.
According to an embodiment of the present disclosure, a backbone network of a pre-training model includes a target vision processing module and at least two target network blocks, the at least two target network blocks including a first target network block and a second target network block; the first adapter is arranged between at least two target network blocks; inputting the first visual processing task template and the sample image into a pre-training model, processing the sample image and the first visual processing task template by using a first adapter to obtain a first visual processing result, wherein the method comprises the following steps of: inputting the sample image and the first visual processing task template into a first target network block, and outputting a first feature map and a first template feature map; inputting the first feature map and the first template feature map into a first adapter, and outputting a second feature map and a second template feature map; inputting the second feature map and the second template feature map into a second target network block, and outputting a third feature map and a third template feature map; and inputting the third feature map and the third template feature map into a target visual processing module, and outputting a first visual processing result.
According to an embodiment of the present disclosure, the above model training method further includes: and exchanging the first adapter in the first visual processing task model by using a trained second adapter corresponding to the second visual processing task to obtain a second visual processing task model, wherein the second visual processing task model is used for processing a second visual processing task of an image, and the trained second adapter is obtained by training a pre-training model based on the second visual processing task.
According to an embodiment of the present disclosure, before inputting the first visual processing task template and the sample image into the pre-training model, processing the sample image and the first visual processing task template with the first adapter to obtain a first visual processing result, the model training method further includes: determining a sample image from a sample database in response to receiving a sample acquisition instruction from the electronic device; and calling a sample transmission interface to acquire a sample image from a sample database.
A second aspect of the present disclosure provides an image processing method, including: acquiring an image to be processed; obtaining a first visual processing task model according to the model training method; and inputting the image to be processed into a first visual processing task model, and outputting a target visual processing result.
A third aspect of the present disclosure provides a model training apparatus comprising: the first acquisition module is used for responding to a first visual processing task used for training a pre-training model, and acquiring a first visual processing task template, wherein the pre-training model comprises a backbone network and a first adapter corresponding to the first visual processing task, and model parameters of the backbone network are obtained through pre-training; the first input module is used for inputting a first visual processing task template and a sample image into the pre-training model, and processing the sample image and the first visual processing task template by using the first adapter to obtain a first visual processing result; the first training module is used for training the pre-training model by utilizing the first visual processing result and the first visual processing label corresponding to the sample image to obtain a first visual processing task model, wherein model parameters of the backbone network are kept unchanged in the training process of the pre-training model; the first visual processing task template comprises a template image and marking information about the template image, wherein the template image is used for constructing the first visual processing task template, the marking information is used for guiding the pre-training model to output a first visual processing result, and the first visual processing task model is used for processing the first visual processing task.
A fourth aspect of the present disclosure provides an electronic device, including: one or more processors; and a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method described above.
A fifth aspect of the present disclosure provides a computer-readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to perform the method described above.
According to the model training method, the image processing method, the device, the equipment and the medium of the present disclosure, the model parameters of the backbone network are kept unchanged during the training of the pre-training model, and only the parameters of the first adapter are adjusted. The adapter thus enhances the fitting capacity of the pre-training model for the first visual processing task while avoiding global adjustment of the pre-training model's parameters, which reduces the memory occupied by stored model parameters, makes full use of the resources of the pre-training model, and saves the sample images required during training. The first visual processing result is guided by the first visual processing task template, so combining the first adapter with the first visual processing task template enhances the generalization capacity of the pre-training model during training. A training effect meeting the requirements can therefore be achieved with only a few sample images, yielding a first visual processing task model that meets the requirements, saving sample image resources and improving training efficiency.
Drawings
The foregoing and other objects, features and advantages of the disclosure will be more apparent from the following description of embodiments of the disclosure with reference to the accompanying drawings, in which:
FIG. 1 schematically illustrates an application scenario diagram of a model training method or an image processing method according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates a flow chart of a model training method according to an embodiment of the disclosure;
FIG. 3 schematically illustrates a schematic diagram of deriving a first visual processing task model in accordance with an embodiment of the present disclosure;
FIG. 4 schematically illustrates a schematic diagram of acquiring loss information according to an embodiment of the disclosure;
FIG. 5 schematically illustrates a schematic diagram of a first visual processing result acquisition method according to an embodiment of the present disclosure;
FIG. 6 schematically illustrates a schematic diagram of a set of output feature maps according to an embodiment of the disclosure;
FIG. 7 schematically illustrates a schematic diagram of a training process according to an embodiment of the present disclosure;
FIG. 8 schematically illustrates a schematic diagram of exchanging a first adapter according to an embodiment of the disclosure;
fig. 9 schematically illustrates a flowchart of an image processing method according to an embodiment of the present disclosure;
FIG. 10 schematically illustrates a block diagram of a model training apparatus according to an embodiment of the present disclosure;
Fig. 11 schematically shows a block diagram of the structure of an image processing apparatus according to an embodiment of the present disclosure;
fig. 12 schematically illustrates a block diagram of an electronic device adapted to implement a model training method or an image processing method according to an embodiment of the disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is only exemplary and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the present disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and/or the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It should be noted that the terms used herein should be construed to have meanings consistent with the context of the present specification and should not be construed in an idealized or overly formal manner.
Where expressions like "at least one of A, B and C" are used, they should generally be interpreted in accordance with the meaning commonly understood by those skilled in the art (e.g., "a system having at least one of A, B and C" includes, but is not limited to, systems having A alone, B alone, C alone, A and B together, A and C together, B and C together, or A, B and C together).
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure and application of the data involved (including, but not limited to, personal information of users) comply with the provisions of the relevant laws and regulations, necessary security measures are taken, and public order and good morals are not violated.
According to embodiments of the present disclosure, model training is generally completed through a supervised algorithm for a specific processing task. However, because training samples are obtained by labeling data, which consumes considerable human resources, a model may instead be pre-trained through unsupervised training, and the model parameters of the resulting pre-trained model may then be globally adjusted so that the pre-trained model can handle a new processing task.
However, as the amount of data and the size of models gradually increase, globally adjusting the model parameters of a pre-trained model consumes more resources and complicates model migration. For some special processing tasks, such as the processing of remote sensing images, the sample images used for training are difficult to acquire, so the training effect of the model is difficult to bring up to the requirements.
With ever-increasing computing resources, pre-training models built mainly on the Transformer architecture have entered a new era. As more and larger models are proposed, solutions become available for different tasks, but many new problems arise as well: these larger models, trained on massive data, contain far more parameters than typical deep models.
Thus, storing the parameters of such a larger model takes up more storage space, and retraining such a model for different processing tasks occupies more resources and consumes more time. Based on this, the inventors found that by using lightweight and highly extensible adapters, less training and less storage space are needed, while the trained model can still meet the requirements.
In the process of implementing the inventive concept of the present disclosure, the inventor finds that, for some processing tasks, such as remote sensing image processing tasks, sample images are difficult to obtain, so that the number of sample images is small, and even if an adapter is used, a model obtained by training is difficult to meet the requirement.
Based on this, the inventors found that by using the prompt learning method, new processing tasks can be readjusted to a form similar to the pre-training tasks, and by using the prompt learning, model parameters in the pre-training model can be fully utilized, thereby facilitating training of the pre-training model.
In view of this, the inventors have found that by combining prompt learning with an adapter, the number of sample images required for training can be reduced, and the model obtained by training can meet the demands.
Specifically, embodiments of the present disclosure provide a model training method, including: responding to a first visual processing task for training a pre-training model, and acquiring a first visual processing task template, wherein the pre-training model comprises a backbone network and a first adapter corresponding to the first visual processing task, and model parameters of the backbone network are obtained through pre-training; inputting a first visual processing task template and a sample image into a pre-training model, and processing the sample image and the first visual processing task template by using a first adapter to obtain a first visual processing result; training a pre-training model by using a first visual processing result and a first visual processing label corresponding to the sample image to obtain a first visual processing task model, wherein model parameters of a backbone network are kept unchanged in the training process of the pre-training model; the first visual processing task template comprises a template image and marking information about the template image, wherein the template image is used for constructing the first visual processing task template, the marking information is used for guiding the pre-training model to output a first visual processing result, and the first visual processing task model is used for processing the first visual processing task.
Fig. 1 schematically illustrates an application scenario diagram of a model training method or an image processing method according to an embodiment of the present disclosure.
As shown in fig. 1, an application scenario 100 according to this embodiment may include a first terminal device 101, a second terminal device 102, a third terminal device 103, a network 104, and a server 105. The network 104 is a medium used to provide a communication link between the first terminal device 101, the second terminal device 102, the third terminal device 103, and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 through the network 104 using at least one of the first terminal device 101, the second terminal device 102 and the third terminal device 103, to receive or send messages, etc. Various communication client applications, such as shopping applications, web browser applications, search applications, instant messaging tools, mailbox clients and social platform software (by way of example only), may be installed on the first terminal device 101, the second terminal device 102 and the third terminal device 103.
The first terminal device 101, the second terminal device 102, the third terminal device 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 105 may be a server providing various services, such as a background management server (by way of example only) providing support for websites browsed by the user using the first terminal device 101, the second terminal device 102, and the third terminal device 103. The background management server may analyze and process the received data such as the user request, and feed back the processing result (e.g., the web page, information, or data obtained or generated according to the user request) to the terminal device.
It should be noted that, the model training method or the image processing method provided by the embodiments of the present disclosure may be generally performed by the server 105. Accordingly, the model training apparatus or the image processing apparatus provided by the embodiments of the present disclosure may be generally provided in the server 105. The model training method or the image processing method provided by the embodiments of the present disclosure may also be performed by a server or a server cluster that is different from the server 105 and is capable of communicating with the first terminal device 101, the second terminal device 102, the third terminal device 103, and/or the server 105. Accordingly, the model training apparatus or the image processing apparatus provided by the embodiments of the present disclosure may also be provided in a server or a server cluster that is different from the server 105 and is capable of communicating with the first terminal device 101, the second terminal device 102, the third terminal device 103, and/or the server 105.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
The model training method of the disclosed embodiment will be described in detail below with reference to fig. 2 to 9 based on the scenario described in fig. 1.
Fig. 2 schematically illustrates a flow chart of a model training method according to an embodiment of the present disclosure.
As shown in fig. 2, the model training method of this embodiment includes operations S210 to S230.
In operation S210, a first visual processing task template is acquired in response to a first visual processing task for training a pre-training model, wherein the pre-training model includes a backbone network and a first adapter corresponding to the first visual processing task, and model parameters of the backbone network are obtained through pre-training.
According to an embodiment of the present disclosure, the pre-training model may be a model that has been pre-trained, for example, but not limited to, a pre-trained CNN (Convolutional Neural Network). For example, the pre-training model may be trained by a self-supervised algorithm, but is not limited thereto.
The pre-training model may be used to process other processing tasks, for example, may be used to process image object classification tasks, and the like, but is not limited thereto. For example, the image object classification task may be to classify according to the shape of the object, without specifically determining what class the object belongs to.
According to an embodiment of the present disclosure, the first visual processing task may be a task different from the task processed by the pre-training model, for example, may be an image object detection task, but is not limited thereto. The image target detection task may be, but is not limited to, category information, size information, and the like of the target object in the detection image, as long as it is different from the visual processing task corresponding to the pre-training model.
According to embodiments of the present disclosure, the first adapter may be untrained, may be partially trained corresponding to the first visual processing task, or may have a priori knowledge corresponding to the first visual processing task. For example, in the case where the first adapter is partially trained in correspondence with the first vision processing task or has a priori knowledge in correspondence with the first vision processing task, the training speed of training the first adapter may be increased.
The first adapter may be disposed in the pre-training model, e.g., between at least two network blocks included in the backbone network; it may also be provided at an output of the backbone network, or at an input of the backbone network. Likewise, a plurality of first adapters may be disposed in the pre-training model at such locations, which will not be repeated here. A concrete placement is sketched below.
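The following is a minimal sketch, assuming a PyTorch-style implementation, of placing a lightweight bottleneck adapter between two frozen backbone blocks. The class names, the bottleneck width and the residual design are illustrative assumptions, not details taken from this disclosure.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Lightweight bottleneck adapter: down-project, nonlinearity, up-project."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the pretrained features intact.
        return x + self.up(self.act(self.down(x)))

class AdaptedBackbone(nn.Module):
    """An adapter inserted between two backbone blocks, with the blocks frozen."""
    def __init__(self, block1: nn.Module, block2: nn.Module, dim: int):
        super().__init__()
        self.block1, self.block2 = block1, block2
        self.adapter = Adapter(dim)
        # Freeze the pretrained backbone; only the adapter will be trained.
        for p in self.block1.parameters():
            p.requires_grad = False
        for p in self.block2.parameters():
            p.requires_grad = False

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.block2(self.adapter(self.block1(feats)))
```

Here `block1` and `block2` stand in for the first and second target network blocks, and the frozen parameters mirror the requirement that the backbone's model parameters stay unchanged during training.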
In operation S220, the first vision processing task template and the sample image are input into the pre-training model, and the sample image and the first vision processing task template are processed using the first adapter, resulting in a first vision processing result.
According to embodiments of the present disclosure, the sample image may correspond to the first visual processing task, for example, in the case where the first visual processing task is a remote sensing image target detection task, the sample image may correspond to the remote sensing image.
According to an embodiment of the present disclosure, the first visual processing task template includes a template image for constructing the first visual processing task template and marking information about the template image for guiding the pre-training model to output the first visual processing result, and the first visual processing task model is used for processing the first visual processing task. The first visual processing result may be a result output by the pre-training model during training, and may correspond to the first visual processing task.
For example: in the case where the first visual processing task is an image target detection task, the first visual processing result may correspond to an image target detection result. The detection result may include category information, size information, position information, and the like of the target object, but is not limited thereto.
The first visual processing task template can be a prompt learning template. The prompt learning method can improve the training speed of the pre-training model, and can also make the training of the model achieve an effect meeting the requirements even when sample images are few. Thus, the first visual processing result guided by the first visual processing task template can correspond to the answer space of the prompt learning method.
According to the embodiment of the disclosure, the target object may be included in the template image, and the target object in the template image may be detected to obtain a detection result. For example, the template image may comprise a remote sensing image. The marking information may be generated by marking the target object in the template image based on the detection result.
For example, the marking operation may be performed on the target object in the template image, while the marking operation is not performed on other objects. The marking operation may be marking the position information of the target object in the template image, marking the size information of the target object in the template image, or marking the category information of the target object in the template image, but is not limited thereto.
Thus, in the case of training the model, the pre-training model may be caused to process the sample image based on the marker information and the target object in the template image corresponding to the marker information, and output the first visual processing result instead of outputting the first visual processing result based on all the objects in the sample image. Based on the method, the first visual processing result can be guided through the marking information, so that the requirement of the model training process on the number of sample images can be reduced, and the training effect meeting the requirement can be realized under the condition that only fewer sample images are used.
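As a hedged illustration of how such a template might be assembled from a template image plus marking information, the sketch below uses assumed field names and a remote-sensing detection example; none of these identifiers come from the disclosure itself.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class TaskTemplate:
    """A first visual processing task template: a template image plus markings."""
    template_image_path: str
    # Marking information: one entry per marked target object, here assumed to
    # be (class_name, (x, y, w, h)); objects that are not targets stay unmarked.
    markings: List[Tuple[str, Tuple[int, int, int, int]]] = field(default_factory=list)

def build_detection_template(image_path: str,
                             detections: List[Tuple[str, Tuple[int, int, int, int]]]
                             ) -> TaskTemplate:
    """Mark only the target objects; unmarked objects do not guide the model."""
    return TaskTemplate(template_image_path=image_path, markings=list(detections))

# Example: a remote-sensing detection template marking two aircraft.
template = build_detection_template(
    "scene_001.png",
    [("aircraft", (120, 84, 32, 18)), ("aircraft", (300, 212, 30, 17))],
)
```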
In operation S230, training the pre-training model using the first visual processing result and the first visual processing label corresponding to the sample image to obtain a first visual processing task model, wherein model parameters of the backbone network remain unchanged during the training process of the pre-training model.
According to the embodiment of the disclosure, the model parameters of the backbone network remain unchanged during the training process of the pre-training model, so that only the parameter information of the first adapter can be trained during the training process of the pre-training model.
According to an embodiment of the present disclosure, the first visual processing label corresponds to the sample image. For example, where the first visual processing task is a remote sensing image target detection task, the sample image may correspond to a remote sensing image, and the first visual processing label corresponds to the real information of the target object in the sample image; the real information may include the real category of the target object, the real size of the target object, and the like.
According to an embodiment of the present disclosure, the pre-training model is trained using the first visual processing result and the first visual processing label corresponding to the sample image. For example, loss information may be determined by comparing the first visual processing result with the first visual processing label, and the parameter information of the first adapter in the pre-training model may be adjusted according to the loss information to obtain the first visual processing task model.
According to the embodiment of the present disclosure, the pre-training model is trained to obtain the first visual processing task model, rather than training a model from scratch. Compared with an untrained model, a training effect meeting the requirements can therefore be achieved using fewer sample images, saving resources and improving training efficiency.
Moreover, by providing the first adapter, the fitting capacity of the pre-training model for the first visual processing task can be enhanced, improving the performance of the trained first visual processing task model so that it can meet the requirements.
FIG. 3 schematically illustrates a schematic diagram of deriving a first visual processing task model in accordance with an embodiment of the present disclosure.
As shown in fig. 3, the first visual processing task template 310 and the sample image 320 are input into the pre-training model 330 including the first adapter, the first visual processing result 340 is output, and the pre-training model 330 including the first adapter is trained according to the first visual processing result 340 and the first visual processing label 350 to obtain the first visual processing task model 370.
According to the embodiment of the disclosure, the model parameters of the backbone network are kept unchanged during the training of the pre-training model, and only the parameters of the first adapter are adjusted. The adapter thereby enhances the fitting capacity of the pre-training model for the first visual processing task while avoiding global adjustment of the pre-training model's parameters, which reduces the memory occupied by stored model parameters, makes full use of the resources of the pre-training model, and saves the sample images required during training. Because the first visual processing result is guided by the first visual processing task template, combining the first adapter with the first visual processing task template enhances the generalization capacity of the pre-training model during training. A training effect meeting the requirements can therefore be achieved with only a few sample images, yielding a first visual processing task model that meets the requirements, saving sample image resources and improving training efficiency.
According to an embodiment of the present disclosure, before the first visual processing task template is acquired in response to the first visual processing task for training the pre-training model, the model training method further includes: acquiring a template image based on a first visual processing task; based on the template image, a first visual processing task template and a first visual processing tag are obtained.
According to the embodiment of the present disclosure, the template image may be acquired from other electronic devices or may be stored in the electronic device in advance, but is not limited thereto.
In the case where the first visual processing task is an image object detection task, the template image may be a corresponding detected image.
The marking information may be obtained based on the detected image and its detection result, and the first visual processing task template may be obtained based on the marking information and the template image.
According to an embodiment of the present disclosure, in the case where the first visual processing task is an image target detection task, the first visual processing tag may further include category information, size information, and the like of the target object in the detected template image.
According to the embodiment of the disclosure, the template image is acquired based on the first visual processing task, and the first visual processing task template and the first visual processing label are acquired based on the template image, so that the first visual processing task template and the first visual processing label can be applied to the training process of the pre-training model, the training efficiency of the model can be improved, and training resources can be saved.
According to an embodiment of the present disclosure, training a pre-training model using a first visual processing result and a first visual processing label corresponding to a sample image to obtain a first visual processing task model, comprising: obtaining loss information according to the first visual processing result and the first visual processing label; and carrying out inverse gradient optimization on the parameter information of the first adapter based on the loss information until the loss information meets the preset condition, so as to obtain a first visual processing task model.
According to embodiments of the present disclosure, the loss information may be used to adjust the parameter information of the first adapter. The loss information may include a loss value obtained by processing the first visual processing result and the first visual processing label with a loss function. For example, the loss value may be obtained with a cross-entropy loss function, but is not limited thereto; a loss function such as a cosine similarity loss may also be used, which will not be described herein.
According to an embodiment of the disclosure, the preset condition may be a preset loss value threshold. The parameter information of the first adapter may be adjusted until the loss value is smaller than or equal to the preset loss value threshold, and the first visual processing task model may be obtained from the parameter information corresponding to that point. The preset loss value threshold may be, for example, 10%, 15%, etc., which is not limited by the present disclosure.
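The loop below is a minimal sketch of this procedure, assuming a PyTorch-style setup: the loss between the first visual processing result and the label drives reverse gradient optimization of the adapter parameters alone, and training stops once the loss meets an assumed preset threshold. The model signature, data loader, learning rate and threshold are all illustrative.

```python
import torch

def train_adapter(model, loader, template, loss_fn,
                  threshold: float = 0.10, lr: float = 1e-3, max_epochs: int = 50):
    """Optimize only the adapter; stop once the loss meets the preset threshold."""
    # Backbone parameters were frozen, so only adapter parameters require grad.
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=lr)
    for _ in range(max_epochs):
        for sample, label in loader:
            result = model(sample, template)   # first visual processing result
            loss = loss_fn(result, label)      # e.g. cross-entropy against the label
            optimizer.zero_grad()
            loss.backward()                    # reverse gradient optimization
            optimizer.step()
            if loss.item() <= threshold:       # preset condition satisfied
                return model
    return model
```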
Fig. 4 schematically illustrates a schematic diagram of acquiring loss information according to an embodiment of the present disclosure.
As shown in fig. 4, the sample image 410 and the first visual processing task template 420 may be input into the pre-training model 430, the first visual processing result 440 is output, and the loss information 460 is obtained according to the first visual processing result 440 and the first visual processing tag 450. Parameters of the first adapter in the pre-training model 430 may be adjusted based on the loss information 460.
According to the embodiment of the present disclosure, loss information is obtained from the first visual processing label and the first visual processing result guided by the first visual processing task template, and reverse gradient optimization is then performed on the parameter information of the first adapter until the loss information meets the preset condition. This improves training efficiency and saves sample images, so that a first visual processing task model meeting the requirements can be obtained with fewer sample images.
According to an embodiment of the present disclosure, a backbone network of a pre-training model includes a target vision processing module and at least two target network blocks, the at least two target network blocks including a first target network block and a second target network block; the first adapter is arranged between at least two target network blocks; inputting the first visual processing task template and the sample image into a pre-training model, processing the sample image and the first visual processing task template by using a first adapter to obtain a first visual processing result, wherein the method comprises the following steps of: inputting the sample image and the first visual processing task template into a first target network block, and outputting a first feature map and a first template feature map; inputting the first feature map and the first template feature map into a first adapter, and outputting a second feature map and a second template feature map; inputting the second feature map and the second template feature map into a second target network block, and outputting a third feature map and a third template feature map; and inputting the third feature map and the third template feature map into a target visual processing module, and outputting a first visual processing result.
According to an embodiment of the present disclosure, the target network block may perform convolution processing on the input image, but is not limited thereto; the first adapter may also perform convolution processing on the input image, but is not limited thereto. The manner in which the target network block and the first adapter process the input image may correspond to the first visual processing task.
According to an embodiment of the disclosure, the processing manner of the input image by the target visual processing module may also correspond to the first visual processing task. For example, in the case where the first visual processing task is an image object detection task, the object visual processing module may be correspondingly configured to detect an input image.
According to an embodiment of the present disclosure, a plurality of first adapters may be disposed between at least two target network blocks, which will not be described herein.
Fig. 5 schematically illustrates a schematic diagram of a first visual processing result acquisition method according to an embodiment of the present disclosure.
As shown in fig. 5, the sample image 510 and the first visual processing task template 520 may be input into a first target network block 530, outputting a first feature map and a first template feature map; inputting the first feature map and the first template feature map into the first adapter 540, and outputting the second feature map and the second template feature map; inputting the second feature map and the second template feature map into the second target network block 550, and outputting a third feature map and a third template feature map; the third feature map and the third template feature map are input to the target visual processing module 560, and the first visual processing result 570 is output.
According to the embodiment of the present disclosure, the sample image and the first visual processing task template are input into the first target network block, which outputs the first feature map and the first template feature map. The first feature map and the first template feature map are input into the first adapter, which outputs the second feature map and the second template feature map. The second feature map and the second template feature map are input into the second target network block to obtain the third feature map and the third template feature map, which are then input into the target visual processing module to obtain the first visual processing result. The first visual processing result is thus guided by the first visual processing task template, and the pre-training model containing the first adapter can be trained using the first visual processing result and the first visual processing label, improving model training efficiency and saving the sample images required to train the model.
According to embodiments of the present disclosure, a sample image may be sliced before being input into the pre-training model, and the sliced images may then be input into the pre-training model to reduce the memory occupied during training of the model.
Fig. 6 schematically illustrates a schematic diagram of an output feature map set according to an embodiment of the disclosure.
As shown in fig. 6, the complete sample image may be sliced into slice images I_1, ..., I_m. The slice images I_1, ..., I_m are then input into the backbone network of the pre-training model. The backbone network may include an encoder (embedding), whereby the feature map set E_0 corresponding to the slice images I_1, ..., I_m may be output.
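A hedged sketch of such slicing follows, assuming square non-overlapping tiles; the tile size and the handling of image borders are assumptions, not specified in the disclosure.

```python
import torch

def slice_image(image: torch.Tensor, tile: int = 256):
    """Split a (C, H, W) image into (C, tile, tile) slices I_1, ..., I_m."""
    c, h, w = image.shape
    slices = []
    # Non-overlapping tiles; any remainder at the borders is dropped here.
    for top in range(0, h - tile + 1, tile):
        for left in range(0, w - tile + 1, tile):
            slices.append(image[:, top:top + tile, left:left + tile])
    return slices  # each slice is then embedded into the feature map set E_0
```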
Fig. 7 schematically illustrates a schematic diagram of a training process according to an embodiment of the present disclosure.
As shown in FIG. 7, the feature map set E_0 and the first visual processing task template P_0 may be input into the Transformer encoder layers L_1 to L_N included in the network blocks of the backbone network. The third feature map and the third template feature map can thus be obtained and input into the detector to obtain the first visual processing result.
According to an embodiment of the present disclosure, the above model training method further includes: and exchanging the first adapter in the first visual processing task model by using a trained second adapter corresponding to the second visual processing task to obtain a second visual processing task model, wherein the second visual processing task model is used for processing a second visual processing task of an image, and the trained second adapter is obtained by training a pre-training model based on the second visual processing task.
According to an embodiment of the present disclosure, the training method of the second adapter may include: setting a second adapter on a pre-training model which does not contain the first adapter, and inputting a second visual processing task template and a sample image corresponding to a second visual processing task into the pre-training model which contains the second adapter to obtain a second visual processing result; training the pre-training model according to the second visual processing result and the second visual processing label corresponding to the sample image to obtain a second visual processing task model comprising a second adapter.
According to embodiments of the present disclosure, the location where the second adapter is disposed in the second visual processing task model may be the same as the location where the first adapter is disposed in the first visual processing task model.
According to the embodiment of the disclosure, the first adapter in the first visual processing task model can be taken out, and the second adapter is arranged at the same position, so that the exchange of the first adapter and the second adapter can be completed, and the first visual processing task model is converted into the second visual processing task model, so that the model can be used for processing the second visual processing task.
Fig. 8 schematically illustrates a schematic diagram of exchanging a first adapter according to an embodiment of the disclosure.
As shown in fig. 8, the first visual processing task model 810 may include a first target network block 811, a first adapter 812, and a second target network block 813, and the second visual processing task model 820 may include a first target network block 811, a second adapter 821, and a second target network block 813.
According to the embodiment of the present disclosure, since only the parameters of the first adapter are adjusted when the pre-training model is trained with the model training method of the present disclosure, the model parameters of the first visual processing task model can be stored simply by saving the trained parameters of the first adapter together with the model parameters of the backbone network of the pre-training model. Similarly, a second visual processing task model containing a trained second adapter may be trained with a similar model training method, for which reference may be made to the model training method of the present disclosure. Therefore, the first visual processing task model and the second visual processing task model can be switched by exchanging the trained first adapter and the trained second adapter. This avoids directly storing the model parameters of both task models: the same effect is achieved by storing only the parameters of the two trained adapters, reducing the memory occupied by stored model parameters.
According to the embodiment of the disclosure, the first adapter is exchanged by using the trained second adapter, so that the conversion between the first visual processing task model and the second visual processing task model can be realized without retraining, and training resources are saved.
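The sketch below shows one way this exchange might look in a PyTorch-style setup, assuming the adapter is exposed as a single submodule; the attribute path and file names are illustrative assumptions.

```python
import torch

def save_adapter(model, path: str):
    """Store only the adapter weights; the frozen backbone is shared by tasks."""
    torch.save(model.adapter.state_dict(), path)

def swap_adapter(model, path: str):
    """Load a trained second adapter in place of the first one."""
    model.adapter.load_state_dict(torch.load(path))
    return model  # the model now processes the second visual processing task

# Usage (hypothetical file names):
# save_adapter(task1_model, "adapter_task1.pt")
# task2_model = swap_adapter(task1_model, "adapter_task2.pt")
```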
According to an embodiment of the present disclosure, before inputting the first visual processing task template and the sample image into the pre-training model, processing the sample image and the first visual processing task template with the first adapter to obtain a first visual processing result, the model training method further includes: determining a sample image from a sample database in response to receiving a sample acquisition instruction from the electronic device; and calling a sample transmission interface to acquire a sample image from a sample database.
According to embodiments of the present disclosure, the sample acquisition instruction may come from another electronic device or may be pre-stored by the electronic device in an instruction database. The sample acquisition instruction may include identification information of the samples to be acquired, so that the required sample images can be efficiently determined from a sample database storing a large number of sample images.
According to an embodiment of the present disclosure, the sample transmission interface may be a data interface preset on the electronic device, for example, may be a serial communication interface or the like.
According to embodiments of the present disclosure, the sample database may be pre-stored on the electronic device; but may also be stored on other electronic devices. The sample image can be acquired from other electronic devices through the sample transmission interface.
According to the embodiment of the present disclosure, the electronic device calls the sample transmission interface according to the sample acquisition instruction and obtains the sample image from the sample database, so that the sample image is acquired efficiently and the acquisition efficiency can meet the requirements.
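As a loose illustration of this flow, the sketch below assumes an instruction carrying a sample identifier, a database object with a lookup method, and a transmission-interface object with a fetch method; all of these are hypothetical stand-ins, since the disclosure does not name concrete APIs.

```python
def acquire_sample(instruction: dict, database, interface):
    """Determine the sample from the instruction, then fetch it via the interface."""
    sample_id = instruction["sample_id"]   # identification information (assumed key)
    record = database.find(sample_id)      # determine the sample image in the database
    return interface.fetch(record)         # call the sample transmission interface
```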
Fig. 9 schematically shows a flowchart of an image processing method according to an embodiment of the present disclosure.
As shown in fig. 9, the image processing method of this embodiment includes operations S910 to S930.
In operation S910, an image to be processed is acquired.
In operation S920, a first vision processing task model is obtained according to the model training method of the present disclosure.
In operation S930, the image to be processed is input into the first visual processing task model, and the target visual processing result is output.
According to an embodiment of the present disclosure, the image to be processed may be the same type of image as the sample image, for example, in the case where the sample image is a remote sensing image, the image to be processed may also correspond to the remote sensing image.
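A minimal inference sketch for operations S910 to S930 follows, assuming a PyTorch-style model; whether the task template is also supplied at inference time is an assumption here, not something the disclosure specifies.

```python
import torch

@torch.no_grad()
def process_image(model, image: torch.Tensor, template) -> torch.Tensor:
    """Run the trained first visual processing task model on an image to be processed."""
    model.eval()
    # Add a batch dimension; the template is assumed to guide the model as in training.
    return model(image.unsqueeze(0), template)  # target visual processing result
```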
According to the embodiment of the disclosure, the model parameters of the backbone network are kept unchanged during the training of the pre-training model, and only the parameters of the first adapter are adjusted. The adapter thereby enhances the fitting capacity of the pre-training model for the first visual processing task while avoiding global adjustment of the pre-training model's parameters, which reduces the memory occupied by stored model parameters, makes full use of the resources of the pre-training model, and saves the sample images required during training. Because the first visual processing result is guided by the first visual processing task template, combining the first adapter with the first visual processing task template enhances the generalization capacity of the pre-training model during training. A training effect meeting the requirements can therefore be achieved with only a few sample images, yielding a first visual processing task model that meets the requirements, saving sample image resources and improving training efficiency.
Based on the model training method, the disclosure also provides a model training device. The device will be described in detail below in connection with fig. 10.
Fig. 10 schematically shows a block diagram of a model training apparatus according to an embodiment of the present disclosure.
As shown in fig. 10, the model training apparatus 1000 of this embodiment includes a first acquisition module 1010, a first input module 1020, and a first training module 1030.
The first obtaining module 1010 is configured to obtain a first visual processing task template in response to a first visual processing task for training a pre-training model, where the pre-training model includes a backbone network and a first adapter corresponding to the first visual processing task, and model parameters of the backbone network are obtained through pre-training. In an embodiment, the first obtaining module 1010 may be configured to perform the operation S210 described above, which is not described herein.
The first input module 1020 is configured to input a first visual processing task template and a sample image into the pre-training model, and process the sample image and the first visual processing task template using the first adapter to obtain a first visual processing result. In an embodiment, the first input module 1020 may be used to perform the operation S220 described above, which is not described herein.
The first training module 1030 is configured to train the pre-training model using the first visual processing result and a first visual processing label corresponding to the sample image to obtain a first visual processing task model, where model parameters of the backbone network remain unchanged during the training process of the pre-training model; the first visual processing task template comprises a template image and marking information about the template image, wherein the template image is used for constructing the first visual processing task template, the marking information is used for guiding the pre-training model to output a first visual processing result, and the first visual processing task model is used for processing the first visual processing task. In an embodiment, the first training module 1030 may be used to perform the operation S230 described above, which is not described herein.
It should be noted that, the model training apparatus corresponds to the model training method, and the model training apparatus may include modules, units, sub-units, etc. for implementing all functions of the model training method related in the flowchart, and for brevity of description, reference may be made to the description of the model training method for details of description.
According to embodiments of the present disclosure, any plurality of the first acquisition module 1010, the first input module 1020, and the first training module 1030 may be combined and implemented in one module, or any one of them may be split into multiple modules. Alternatively, at least part of the functionality of one or more of these modules may be combined with at least part of the functionality of other modules and implemented in one module. According to embodiments of the present disclosure, at least one of the first acquisition module 1010, the first input module 1020, and the first training module 1030 may be implemented at least in part as hardware circuitry, such as a field-programmable gate array (FPGA), a programmable logic array (PLA), a system on chip, a system on substrate, a system in package, or an application-specific integrated circuit (ASIC), or may be implemented in hardware or firmware in any other reasonable manner of integrating or packaging circuitry, or in any one of, or a suitable combination of, software, hardware, and firmware. Alternatively, at least one of the first acquisition module 1010, the first input module 1020, and the first training module 1030 may be implemented at least in part as a computer program module that, when executed, performs the corresponding functions.
Based on the image processing method, the disclosure also provides an image processing apparatus. The apparatus will be described in detail below with reference to Fig. 11.
Fig. 11 schematically shows a block diagram of the image processing apparatus according to an embodiment of the present disclosure.
As shown in fig. 11, the image processing apparatus 1100 of this embodiment includes a second acquisition module 1110, a second training module 1120, and a second input module 1130.
The second acquisition module 1110 is configured to acquire an image to be processed. In an embodiment, the second acquisition module 1110 may be configured to perform operation S910 described above; the details are not repeated here.
The second training module 1120 is configured to obtain the first visual processing task model according to the model training method of the present disclosure. In an embodiment, the second training module 1120 may be configured to perform operation S920 described above; the details are not repeated here.
The second input module 1130 is configured to input the image to be processed into the first visual processing task model and to output a target visual processing result. In an embodiment, the second input module 1130 may be configured to perform operation S930 described above; the details are not repeated here.
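A brief usage sketch of this acquire-train-infer flow is given below. The function and parameter names are hypothetical, and whether the task template must also be supplied at inference time depends on the embodiment; this is a sketch under those assumptions, not an API defined by the disclosure.

```python
import torch

def run_inference(model, image_to_be_processed, task_template=None):
    # Input the image to be processed into the (already trained) first visual
    # processing task model and output the target visual processing result.
    model.eval()
    with torch.no_grad():
        if task_template is not None:
            return model(image_to_be_processed, task_template)
        return model(image_to_be_processed)
```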
It should be noted that the image processing apparatus corresponds to the image processing method. The apparatus may include the modules, units, sub-units, and the like that implement all the functions of the image processing method described with reference to the flowcharts; for brevity, reference may be made to the description of the image processing method for the details.
According to embodiments of the present disclosure, any plurality of the second acquisition module 1110, the second training module 1120, and the second input module 1130 may be combined and implemented in one module, or any one of them may be split into multiple modules. Alternatively, at least part of the functionality of one or more of these modules may be combined with at least part of the functionality of other modules and implemented in one module. According to embodiments of the present disclosure, at least one of the second acquisition module 1110, the second training module 1120, and the second input module 1130 may be implemented at least in part as hardware circuitry, such as a field-programmable gate array (FPGA), a programmable logic array (PLA), a system on chip, a system on substrate, a system in package, or an application-specific integrated circuit (ASIC), or may be implemented in hardware or firmware in any other reasonable manner of integrating or packaging circuitry, or in any one of, or a suitable combination of, software, hardware, and firmware. Alternatively, at least one of the second acquisition module 1110, the second training module 1120, and the second input module 1130 may be implemented at least in part as a computer program module that, when executed, performs the corresponding functions.
Fig. 12 schematically illustrates a block diagram of an electronic device adapted to implement a model training method or an image processing method according to an embodiment of the disclosure.
As shown in fig. 12, an electronic device 1200 according to an embodiment of the present disclosure includes a processor 1201, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 1202 or a program loaded from a storage section 1208 into a Random Access Memory (RAM) 1203. The processor 1201 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or an associated chipset and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), or the like. Processor 1201 may also include on-board memory for caching purposes. The processor 1201 may include a single processing unit or multiple processing units for performing the different actions of the method flows according to embodiments of the disclosure.
The RAM 1203 stores various programs and data required for the operation of the electronic device 1200. The processor 1201, the ROM 1202, and the RAM 1203 are connected to each other through a bus 1204. The processor 1201 performs various operations of the method flows according to embodiments of the present disclosure by executing programs in the ROM 1202 and/or the RAM 1203. Note that the programs may also be stored in one or more memories other than the ROM 1202 and the RAM 1203; the processor 1201 may likewise perform various operations of the method flows according to embodiments of the present disclosure by executing programs stored in those one or more memories.
According to an embodiment of the disclosure, the electronic device 1200 may also include an input/output (I/O) interface 1205, which is likewise connected to the bus 1204. The electronic device 1200 may also include one or more of the following components connected to the I/O interface 1205: an input section 1206 including a keyboard, a mouse, and the like; an output section 1207 including a cathode ray tube (CRT) or liquid crystal display (LCD), a speaker, and the like; a storage section 1208 including a hard disk or the like; and a communication section 1209 including a network interface card such as a LAN card or a modem. The communication section 1209 performs communication processing via a network such as the Internet. A drive 1210 is also connected to the I/O interface 1205 as needed. A removable medium 1211, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 1210 as needed, so that a computer program read therefrom can be installed into the storage section 1208 as needed.
The present disclosure also provides a computer-readable storage medium that may be embodied in the apparatus/device/system described in the above embodiments; or may exist alone without being assembled into the apparatus/device/system. The computer-readable storage medium carries one or more programs which, when executed, implement methods in accordance with embodiments of the present disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example, but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the present disclosure, the computer-readable storage medium may include the ROM 1202 and/or the RAM 1203 and/or one or more memories other than the ROM 1202 and the RAM 1203 described above.
Embodiments of the present disclosure also include a computer program product comprising a computer program containing program code for performing the methods shown in the flowcharts. The program code, when executed in a computer system, causes the computer system to implement the model training method or the image processing method provided by the embodiments of the present disclosure.
The above-described functions defined in the system/apparatus of the embodiments of the present disclosure are performed when the computer program is executed by the processor 1201. The systems, apparatus, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the disclosure.
In one embodiment, the computer program may be embodied on a tangible storage medium such as an optical storage device or a magnetic storage device. In another embodiment, the computer program may also be transmitted and distributed over a network medium in the form of a signal, and downloaded and installed via the communication section 1209 and/or from the removable medium 1211. The computer program may include program code that may be transmitted using any appropriate network medium, including but not limited to wireless and wired media, or any suitable combination of the foregoing.
According to embodiments of the present disclosure, the program code for carrying out the computer programs provided by embodiments of the present disclosure may be written in any combination of one or more programming languages; in particular, such computer programs may be implemented in high-level procedural and/or object-oriented programming languages, and/or in assembly/machine languages. Programming languages include, but are not limited to, Java, C++, Python, "C", or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, via the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that the features recited in the various embodiments of the present disclosure and/or in the claims may be combined in a variety of ways, even if such combinations are not explicitly recited in the present disclosure. In particular, the features recited in the various embodiments of the present disclosure and/or in the claims may be combined in various ways without departing from the spirit and teachings of the present disclosure. All such combinations fall within the scope of the present disclosure.
The embodiments of the present disclosure are described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described above separately, this does not mean that the measures in the embodiments cannot be used advantageously in combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be made by those skilled in the art without departing from the scope of the disclosure, and such alternatives and modifications are intended to fall within the scope of the disclosure.

Claims (10)

1. A model training method, comprising:
responding to a first visual processing task for training a pre-training model, and acquiring a first visual processing task template, wherein the pre-training model comprises a backbone network and a first adapter corresponding to the first visual processing task, and model parameters of the backbone network are obtained through pre-training;
inputting the first visual processing task template and a sample image into the pre-training model, and processing the sample image and the first visual processing task template by using the first adapter to obtain a first visual processing result;
training the pre-training model by using the first visual processing result and a first visual processing label corresponding to the sample image to obtain a first visual processing task model, wherein model parameters of the backbone network remain unchanged in the training process of the pre-training model;
the first visual processing task template comprises a template image and marking information related to the template image, the template image is used for constructing the first visual processing task template, the marking information is used for guiding the pre-training model to output the first visual processing result, and the first visual processing task model is used for processing the first visual processing task.
2. The method of claim 1, wherein, prior to acquiring the first visual processing task template in response to the first visual processing task for training the pre-training model, the method further comprises:
acquiring the template image based on the first visual processing task;
and obtaining the first visual processing task template and the first visual processing label based on the template image.
3. The method of claim 1, wherein the training of the pre-training model using the first visual processing result and the first visual processing label corresponding to the sample image to obtain the first visual processing task model comprises:
obtaining loss information according to the first visual processing result and the first visual processing label;
and performing reverse-gradient (back-propagation) optimization on the parameter information of the first adapter based on the loss information, until the loss information meets a preset condition, to obtain the first visual processing task model.
4. The method of claim 1, wherein the backbone network of the pre-training model comprises a target vision processing module and at least two target network blocks, the at least two target network blocks comprising a first target network block and a second target network block; the first adapter is arranged between the at least two target network blocks;
wherein the inputting of the first visual processing task template and the sample image into the pre-training model and the processing of the sample image and the first visual processing task template by using the first adapter to obtain the first visual processing result comprise the following steps:
inputting the sample image and the first visual processing task template into the first target network block, and outputting a first feature map and a first template feature map;
inputting the first feature map and the first template feature map into the first adapter, and outputting a second feature map and a second template feature map;
inputting the second feature map and the second template feature map into the second target network block, and outputting a third feature map and a third template feature map;
and inputting the third feature map and the third template feature map into the target visual processing module, and outputting the first visual processing result.
5. The method of claim 1, further comprising:
and replacing the first adapter in the first visual processing task model with a trained second adapter corresponding to a second visual processing task to obtain a second visual processing task model, wherein the second visual processing task model is used for processing the second visual processing task on an image, and the trained second adapter is obtained by training the pre-training model based on the second visual processing task.
6. The method of claim 1, wherein, prior to the inputting of the first visual processing task template and the sample image into the pre-training model and the processing of the sample image and the first visual processing task template with the first adapter to obtain the first visual processing result, the method further comprises:
determining the sample image from a sample database in response to receiving a sample acquisition instruction from the electronic device;
and calling a sample transmission interface to acquire the sample image from the sample database.
7. An image processing method, comprising:
acquiring an image to be processed;
obtaining the first visual processing task model according to the method of any one of claims 1 to 6;
and inputting the image to be processed into the first visual processing task model, and outputting a target visual processing result.
8. A model training apparatus comprising:
the first acquisition module is used for responding to a first visual processing task used for training a pre-training model, and acquiring a first visual processing task template, wherein the pre-training model comprises a backbone network and a first adapter corresponding to the first visual processing task, and model parameters of the backbone network are obtained through pre-training;
the first input module is used for inputting the first visual processing task template and the sample image into the pre-training model, and processing the sample image and the first visual processing task template by using the first adapter to obtain a first visual processing result;
the first training module is used for training the pre-training model by using the first visual processing result and a first visual processing label corresponding to the sample image to obtain a first visual processing task model, wherein the model parameters of the backbone network are kept unchanged in the training process of the pre-training model;
the first visual processing task template comprises a template image and marking information related to the template image, the template image is used for constructing the first visual processing task template, the marking information is used for guiding the pre-training model to output the first visual processing result, and the first visual processing task model is used for processing the first visual processing task.
9. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of any of claims 1-7.
10. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the method according to any of claims 1-7.
CN202310440035.6A 2023-04-21 2023-04-21 Model training method, image processing method, device, equipment and medium Pending CN116468970A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310440035.6A CN116468970A (en) 2023-04-21 2023-04-21 Model training method, image processing method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN116468970A true CN116468970A (en) 2023-07-21

Family

ID=87180361

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310440035.6A Pending CN116468970A (en) 2023-04-21 2023-04-21 Model training method, image processing method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN116468970A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117221391A (en) * 2023-11-09 2023-12-12 天津华来科技股份有限公司 Intelligent camera pushing method, device and equipment based on visual semantic big model
CN117221391B (en) * 2023-11-09 2024-02-23 天津华来科技股份有限公司 Intelligent camera pushing method, device and equipment based on visual semantic big model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination