CN116385836A - Image recognition method, model training method and device

Image recognition method, model training method and device

Info

Publication number: CN116385836A
Authority: CN (China)
Prior art keywords: convolution, target, bypass, image, model
Legal status: Pending (assumed; not a legal conclusion)
Application number: CN202310422862.2A
Other languages: Chinese (zh)
Inventors: 节世博, 邓志鸿
Current Assignee: Peking University
Original Assignee: Peking University
Application filed by Peking University; priority to CN202310422862.2A; published as CN116385836A. Legal status: Pending.

Classifications

    • G06V10/774: Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06N3/0464: Convolutional networks [CNN, ConvNet]
    • G06V10/764: Image or video recognition or understanding using classification, e.g. of video objects
    • G06V10/82: Image or video recognition or understanding using neural networks
    • Y02T10/40: Engine management systems


Abstract

The application relates to an image recognition method, a model training method and a device. The method comprises the following steps: acquiring an image to be identified and an identification task corresponding to the image to be identified; determining a target convolution bypass and a target additional network unit corresponding to the identification task, wherein the target convolution bypass and the target additional network unit are generated after training according to an image sample set corresponding to the identification task; loading the target convolution bypass and the target additional network unit into a pre-training model to generate an image recognition model corresponding to the recognition task; and inputting the image to be identified into an image identification model, and acquiring an identification result output by the image identification model. According to the method, for different recognition tasks, image recognition corresponding to the recognition tasks can be performed only by loading different convolution bypasses and additional network units on the pre-training model, and compared with the method for storing all model parameters corresponding to the recognition tasks, the method effectively reduces storage overhead.

Description

Image recognition method, model training method and device
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, and in particular, to an image recognition method, a model training method, and an apparatus.
Background
Currently, large-scale pre-training models are widely used in the field of computer vision. For downstream tasks such as image classification, target detection, semantic segmentation and the like, the model parameters of the pre-training model can be respectively fine-tuned by using the labeling data corresponding to different downstream tasks, so that the pre-training model can better perform image recognition based on different downstream tasks after fine tuning.
In the related art, when fine tuning parameters of a pre-trained model, it is generally necessary to adjust all model parameters in the pre-trained model. Accordingly, different trimmed model parameters need to be stored for different downstream tasks.
However, as the size of the pre-training model is continuously increased, model parameters of the pre-training model are also continuously increased, which results in excessive storage overhead of the model parameters of the pre-training model after fine tuning.
Disclosure of Invention
In view of the foregoing, it is desirable to provide an image recognition method, a model training method, and a device that can effectively reduce the storage overhead of model parameters of a pre-training model.
In a first aspect, the present application provides an image recognition method. The method comprises the following steps:
Acquiring an image to be identified and an identification task corresponding to the image to be identified;
determining a target convolution bypass and a target additional network unit corresponding to the identification task, wherein the target convolution bypass and the target additional network unit are generated after training according to an image sample set corresponding to the identification task;
loading the target convolution bypass and the target additional network unit into a pre-training model to generate an image recognition model corresponding to the recognition task;
inputting the image to be identified into the image identification model, and acquiring an identification result output by the image identification model.
In one embodiment, the pre-training model comprises at least one self-attention unit and at least one feed-forward neural network unit;
the loading the target convolution bypass and the target add-on unit into a pre-training model comprises:
adding a target convolution bypass to each self-attention unit and each feedforward neural network unit in the pre-training model;
adding the target additional network element at a target location in the pre-training model.
In one embodiment, the input of the target convolution bypass is the input of a self-attention unit or a feedforward neural network unit corresponding to the target convolution bypass;
The output of the target convolution bypass is used for being added to the output of a self-attention unit or a feedforward neural network unit corresponding to the target convolution bypass.
In one embodiment, the target convolution bypass comprises a first convolution subunit and a second convolution subunit, the convolution kernel of the first convolution subunit being smaller than the convolution kernel of the second convolution subunit.
In one embodiment, if the input of the pre-training model is a serialized input, the target convolution bypass further comprises a reconstruction subunit;
the reconstruction subunit is disposed before the second convolution subunit;
the reconstruction subunit is configured to reconstruct a two-dimensional spatial structure of the graphic token in the input data to generate a reconstructed picture token, and input the reconstructed picture token and the identification token in the input data into the second convolution subunit.
In one embodiment, before the determining the target convolution bypass and the target additional network element corresponding to the identification task, the method further includes:
acquiring an image sample set corresponding to the identification task;
determining an original additional network element according to the identification task;
Adding an original convolution bypass and the original additional network element to the pre-training model;
training an original convolution bypass and an original additional network element in the pre-training model by using the image sample set under the condition of locking model parameters of the pre-training model to obtain a trained target convolution bypass and a trained target additional network element;
and storing the target convolution bypass and the target additional network element according to the identification task.
In a second aspect, the present application provides a model training method. The method comprises the following steps:
acquiring an image sample set corresponding to an identification task;
determining an original additional network element according to the identification task;
adding the original convolution bypass and the original additional network element to a pre-training model;
training an original convolution bypass and an original additional network unit in the pre-training model by using the image sample set under the condition of locking model parameters of the pre-training model, to obtain a trained target convolution bypass and a trained target additional network unit, wherein the target convolution bypass and the target additional network unit are used for generating an image recognition model corresponding to the recognition task;
And saving the target convolution bypass and the target additional network element.
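The steps of the second aspect can be sketched as follows. This is a minimal, framework-agnostic sketch of the training setup: the pre-trained parameters are locked, and only the bypass and additional-network-unit parameters are updated and saved per task. All parameter names and the `pretrained.`/`bypass.`/`head.` prefixes are illustrative assumptions, not from the patent.

```python
def split_trainable(params: dict) -> tuple[dict, dict]:
    """Partition parameters into frozen (pre-trained) and trainable (bypass/head)."""
    frozen = {k: v for k, v in params.items() if k.startswith("pretrained.")}
    trainable = {k: v for k, v in params.items() if k.startswith(("bypass.", "head."))}
    return frozen, trainable

params = {
    "pretrained.attn.w": [0.1], "pretrained.ffn.w": [0.2],
    "bypass.conv1x1.w": [0.0], "head.cls.w": [0.0],
}
frozen, trainable = split_trainable(params)
# Only `trainable` is passed to the optimizer and stored per task;
# `frozen` is shared unchanged by every recognition task.
```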
In a third aspect, the present application provides an image recognition apparatus. The apparatus comprises:
the first acquisition module is used for acquiring an image to be identified and an identification task corresponding to the image to be identified; determining a target convolution bypass and a target additional network unit corresponding to the identification task, wherein the target convolution bypass and the target additional network unit are generated after training according to an image sample set corresponding to the identification task;
the loading module is used for loading the target convolution bypass and the target additional network unit into a pre-training model and generating an image recognition model corresponding to the recognition task;
the identification module is used for inputting the image to be identified into the image identification model and acquiring an identification result output by the image identification model.
In one embodiment, the pre-training model comprises at least one self-attention unit and at least one feed-forward neural network unit;
the loading module is specifically configured to add a target convolution bypass to each self-attention unit and each feedforward neural network unit in the pre-training model; adding the target additional network element at a target location in the pre-training model.
In one embodiment, the input of the target convolution bypass is the input of a self-attention unit or a feedforward neural network unit corresponding to the target convolution bypass;
the output of the target convolution bypass is used for being added to the output of a self-attention unit or a feedforward neural network unit corresponding to the target convolution bypass.
In one embodiment, the target convolution bypass comprises a first convolution subunit and a second convolution subunit, the convolution kernel of the first convolution subunit being smaller than the convolution kernel of the second convolution subunit.
In one embodiment, if the input of the pre-training model is a serialized input, the target convolution bypass further comprises a reconstruction subunit;
the reconstruction subunit is disposed before the second convolution subunit;
the reconstruction subunit is configured to reconstruct the two-dimensional spatial structure of the image tokens in the input data to generate reconstructed image tokens, and to input the reconstructed image tokens and the identification token in the input data into the second convolution subunit.
In one embodiment, the image recognition apparatus further includes:
the training module is used for acquiring an image sample set corresponding to the identification task; determining an original additional network element according to the identification task; adding an original convolution bypass and the original additional network element to the pre-training model; training the original convolution bypass and the original additional network element in the pre-training model by using the image sample set under the condition of locking model parameters of the pre-training model to obtain a trained target convolution bypass and a trained target additional network element; and storing the target convolution bypass and the target additional network element according to the identification task.
In a fourth aspect, the present application provides a model training apparatus. The apparatus comprises:
the second acquisition module is used for acquiring an image sample set corresponding to the identification task, and determining an original additional network element according to the identification task;
an adding module for adding the original convolution bypass and the original additional network element to a pre-training model;
the training module is used for training an original convolution bypass and an original additional network unit in the pre-training model by using the image sample set under the condition of locking model parameters of the pre-training model, to obtain a trained target convolution bypass and a trained target additional network unit, wherein the target convolution bypass and the target additional network unit are used for generating an image recognition model corresponding to the recognition task;
and the storage module is used for storing the target convolution bypass and the target additional network element.
In a fifth aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor implementing the image recognition method according to the first aspect or the model training method according to the second aspect when the processor executes the computer program.
In a sixth aspect, the present application also provides a computer-readable storage medium. The computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the image recognition method described in the first aspect or the model training method described in the second aspect.
In a seventh aspect, the present application also provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the image recognition method according to the first aspect or the model training method according to the second aspect.
The image recognition method, the model training method and the device firstly acquire an image to be recognized and a recognition task corresponding to the image to be recognized, secondly determine a target convolution bypass and a target additional network unit corresponding to the recognition task, wherein the target convolution bypass and the target additional network unit are generated after training according to an image sample set corresponding to the recognition task; thirdly, loading the target convolution bypass and the target additional network unit into a pre-training model to generate an image recognition model corresponding to the recognition task; and finally, inputting the image to be identified into the image identification model, and acquiring an identification result output by the image identification model. According to the method, for different recognition tasks, image recognition corresponding to the recognition tasks can be performed only by loading different convolution bypasses and additional network units on the pre-training model, and compared with the method for storing all model parameters of the pre-training model corresponding to the recognition tasks, the method effectively reduces storage cost.
Drawings
Fig. 1 is an application environment diagram of an image recognition method provided in an embodiment of the present application;
fig. 2 is a schematic flow chart of an image recognition method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a convolution bypass provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of a pre-training model loading convolution bypass provided in an embodiment of the present application;
FIG. 5 is a schematic flow chart of model training according to an embodiment of the present disclosure;
fig. 6 is a flowchart of another image recognition method according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an image recognition device according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a model training device according to an embodiment of the present application;
fig. 9 is an internal structure diagram of a computer device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
The image recognition method provided by the embodiment of the application can be applied to an application environment shown in fig. 1. When an image recognition task is required, the terminal device 101 may send an image recognition request to the server 102; after the server 102 receives the image recognition request, the image to be recognized and the recognition task may be extracted from the image recognition request, and the corresponding convolution bypass and additional network element are determined according to the recognition task. Server 102 may then load the convolution bypass and additional network elements into the pre-training model to generate an image recognition model corresponding to the recognition task. Finally, the server 102 inputs the image to be identified into the image identification model, acquires the identification result output by the image identification model, and feeds back the image identification result to the terminal device 101.
The terminal device 101 may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, internet of things devices and portable wearable devices; the internet of things devices may be smart speakers, smart televisions, smart air conditioners, smart vehicle devices and the like, and the portable wearable devices may be smart watches, smart bracelets, headsets, and the like. The server 102 may be implemented as a stand-alone server or as a server cluster of multiple servers.
In one embodiment, as shown in fig. 2, an image recognition method is provided, and the image recognition method is applied to the server in fig. 1 for illustration, and includes the following steps:
s201, acquiring an image to be identified and an identification task corresponding to the image to be identified.
In the application, when the image recognition task is required to be performed, the server can acquire the image to be recognized and the recognition task corresponding to the image to be recognized.
It should be understood that the embodiments of the present application do not limit how to acquire the image to be identified and the identification task corresponding to the image to be identified. In some embodiments, when the user needs to perform image recognition, an image recognition request may be sent to the server through the terminal device, where the image recognition request may include an image to be recognized and a recognition task corresponding to the image to be recognized. The server can acquire the image to be identified from the image identification request and an identification task corresponding to the image to be identified indicated by the user so as to carry out subsequent image identification.
In other embodiments, the server may receive the image to be identified acquired in real-time and store the image to be identified in a database. The user can send an image recognition instruction to the server through the terminal equipment to instruct the server to carry out image recognition on the images to be recognized in certain time periods and/or certain scenes, and the image recognition instruction comprises a recognition task instructed by the user. Then, the server acquires the image to be identified from the database and acquires an identification task from the image identification instruction so as to carry out subsequent image identification.
The above-mentioned identification task may be determined according to the image identification requirement of the user, which is not limited in the embodiment of the present application. By way of example, the identification tasks described above may include an image classification task, a target detection task, a semantic segmentation task, and so forth.
S202, determining a target convolution bypass and a target additional network unit corresponding to the recognition task, wherein the target convolution bypass and the target additional network unit are generated after training according to an image sample set corresponding to the recognition task.
In this step, after the server obtains the image to be identified and the identification task corresponding to the image to be identified, the target convolution bypass and the target additional network element corresponding to the identification task may be determined.
The target additional network element may be a trained additional network element corresponding to the identification task. The embodiment of the application does not limit the type of the target additional network element, and if the identification task is an image classification task, the target additional network element may include a classification header attached to a pre-training model; if the identification task is a target detection task, the target additional network element may include a detection head attached to a pre-training model; if the recognition task is a semantic segmentation task, the target additional network element may include a segmentation decoder that is attached to a pre-trained model.
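The task-to-head correspondence above can be sketched as a simple lookup. This is purely illustrative: the patent specifies only which kind of additional network element attaches for each task, and the string names used here are placeholders.

```python
# Illustrative mapping from recognition task to the kind of additional
# network unit attached to the pre-training model (names are placeholders).
ADDITIONAL_UNIT_BY_TASK = {
    "image_classification": "classification_head",
    "target_detection": "detection_head",
    "semantic_segmentation": "segmentation_decoder",
}

def additional_unit_for(task: str) -> str:
    # Resolve the kind of additional network unit for a given recognition task.
    try:
        return ADDITIONAL_UNIT_BY_TASK[task]
    except KeyError:
        raise ValueError(f"no additional network unit registered for task: {task}")
```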
The target convolution bypass may be a trained convolution bypass corresponding to the recognition task. In some embodiments, the target convolution bypass includes a first convolution subunit and a second convolution subunit, the convolution kernel of the first convolution subunit being smaller than the convolution kernel of the second convolution subunit.
Illustratively, the convolution kernel of the first convolution subunit may be 1×1 and the convolution kernel of the second convolution subunit may be 3×3. Alternatively, the convolution kernel of the first convolution subunit may be 1×1 and the convolution kernel of the second convolution subunit may be 2×2.
It should be appreciated that an activation function unit, such as a Gaussian Error Linear Unit (GELU) activation function unit, may also be provided between the first and second convolution subunits.
It should be understood that the number of the first convolution subunit and the second convolution subunit in the target convolution bypass may be one or more, and the number of the first convolution subunit and the second convolution subunit is not limited in the embodiments of the present application. For example, the target convolution bypass includes two first convolution subunits and one second convolution subunit.
Exemplarily, as shown in fig. 3, a schematic diagram of a convolution bypass provided in an embodiment of the present application: the convolution bypass includes two 1×1 convolution subunits and one 3×3 convolution subunit, where the 3×3 convolution subunit is disposed between the two 1×1 convolution subunits, and a GELU activation function subunit is disposed between adjacent convolution subunits.
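The bypass structure of fig. 3 can be sketched numerically. The following is a minimal NumPy sketch, assuming a 1×1 down-projection, a 3×3 convolution in the low-dimensional space, and a 1×1 up-projection with GELU activations between them; the channel sizes and function names are illustrative, not from the patent.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def conv1x1(x, w):
    # x: (C_in, H, W), w: (C_out, C_in) -> (C_out, H, W); a 1x1 conv is a
    # per-pixel linear map over channels
    return np.tensordot(w, x, axes=([1], [0]))

def conv3x3(x, w):
    # x: (C_in, H, W), w: (C_out, C_in, 3, 3), zero padding 1 -> (C_out, H, W)
    c, h, wid = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], h, wid))
    for i in range(h):
        for j in range(wid):
            patch = xp[:, i:i + 3, j:j + 3]
            out[:, i, j] = np.tensordot(w, patch, axes=([1, 2, 3], [0, 1, 2]))
    return out

def convolution_bypass(x, w_down, w_mid, w_up):
    # 1x1 (down-project) -> GELU -> 3x3 -> GELU -> 1x1 (up-project)
    h = gelu(conv1x1(x, w_down))
    h = gelu(conv3x3(h, w_mid))
    return conv1x1(h, w_up)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4))                 # (channels, H, W)
y = convolution_bypass(
    x,
    rng.standard_normal((2, 8)),                   # 1x1: 8 -> 2 channels
    rng.standard_normal((2, 2, 3, 3)),             # 3x3 in the 2-channel space
    rng.standard_normal((8, 2)),                   # 1x1: 2 -> 8 channels
)
```

The bottleneck shape (down-project, convolve, up-project) is what keeps the bypass's parameter count small relative to the pre-trained units it runs alongside.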
In other embodiments, if the input of the pre-training model to which the target convolution bypass is to be added is a serialized input, the target convolution bypass may further include a reconstruction subunit. The reconstruction subunit may be disposed before the second convolution subunit, and is configured to reconstruct the two-dimensional spatial structure of the image tokens in the input data to generate reconstructed image tokens, and to input the reconstructed image tokens and the identification token in the input data into the second convolution subunit. Pre-training models whose input is serialized may include the Vision Transformer model.
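The reconstruction step can be sketched as a reshape between the token sequence and the spatial grid. This is a minimal sketch of one plausible layout, assuming the identification token sits first in the sequence and image tokens are stored row-major; the function names are not from the patent.

```python
import numpy as np

def reconstruct_tokens(seq, h, w):
    """seq: (1 + h*w, C) -- identification token followed by image tokens.
    Returns the identification token and the image tokens as a (C, h, w) grid."""
    cls_tok, img_toks = seq[:1], seq[1:]
    grid = img_toks.reshape(h, w, -1).transpose(2, 0, 1)  # (C, h, w)
    return cls_tok, grid

def flatten_tokens(cls_tok, grid):
    """Inverse operation: back to the (1 + h*w, C) sequence layout."""
    c, h, w = grid.shape
    img_toks = grid.transpose(1, 2, 0).reshape(h * w, c)
    return np.concatenate([cls_tok, img_toks], axis=0)

seq = np.arange(7 * 3, dtype=float).reshape(7, 3)  # 1 id token + a 2x3 grid, C=3
cls_tok, grid = reconstruct_tokens(seq, 2, 3)
restored = flatten_tokens(cls_tok, grid)
```

After the reshape, the 3×3 convolution can exploit neighborhood structure that is invisible in the flat token sequence; flattening afterwards returns the result to the layout the transformer units expect.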
It should be noted that, in the embodiment of the present application, the convolution bypasses and the additional network elements corresponding to different recognition tasks are different, and the original convolution bypasses and the original additional network elements can be trained based on the image sample sets corresponding to different recognition tasks, so as to generate different convolution bypasses and additional network elements after fine tuning, so as to adapt to different recognition tasks. Therefore, before performing image recognition, the server needs to determine a target convolution bypass and a target additional network element corresponding to the recognition task.
S203, loading the target convolution bypass and the target additional network unit into the pre-training model to generate an image recognition model corresponding to the recognition task.
In this step, after the server determines the target convolution bypass and the target additional network element corresponding to the recognition task, the target convolution bypass and the target additional network element may be loaded into the pre-training model to generate the image recognition model corresponding to the recognition task.
The pre-training model may be a model that has been preliminarily trained by using sample data and whose parameters are then adjusted for a downstream recognition task on the basis of that preliminary training; the preliminary training may use images with or without annotation information. The embodiment of the application does not limit the type of the pre-training model, and the pre-training model may include a region-based convolutional neural network model (e.g., Mask R-CNN), a mobile network (MobileNet), a Vision Transformer model, and the like.
In some embodiments, the pre-training model may include a multi-layer network structure including at least one self-attention unit and at least one feedforward neural network unit, the self-attention unit and the feedforward neural network unit constituting the pre-training model.
It should be understood that the number of self-attention units and feedforward neural network units in the pre-training model may be determined according to practical situations, and the numbers of self-attention units and feedforward neural network units may be equal or unequal.
For example, when the pre-training model is a Vision Transformer model, 12 multi-headed self-attention modules and 12 feedforward neural network modules may be included in the pre-training model. In addition, the pre-training model can further comprise 1 embedding unit and 13 normalization units. If the recognition task is a classification task, 1 classification head can be added in the pre-training model.
It should be appreciated that the embodiments herein do not limit how the target convolution bypass and target additional network elements are loaded into the pre-training model, and in some embodiments, the server may add one target convolution bypass for each self-attention element and each feed-forward neural network element in the pre-training model, and a target additional network element at a target location in the pre-training model.
Illustratively, as shown in FIG. 4, the input of the target convolution bypass is the input of the self-attention unit or feedforward neural network unit to which the target convolution bypass corresponds. The output of the target convolution bypass is used to be added to the output of the self-attention unit or feedforward neural network unit to which the target convolution bypass corresponds.
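The wiring of fig. 4 can be sketched functionally: the bypass reads the same input as its self-attention or feedforward unit, and its output is added to that unit's output. Plain Python functions stand in for the real units here; this is a structural sketch only.

```python
def with_bypass(unit, bypass):
    # Wrap a pre-trained unit with a parallel bypass: both see the same
    # input, and their outputs are summed (the residual wiring of fig. 4).
    def wrapped(x):
        return unit(x) + bypass(x)
    return wrapped

attention_unit = lambda x: 2.0 * x   # stand-in for a frozen pre-trained unit
conv_bypass = lambda x: 0.1 * x      # stand-in for the trainable convolution bypass
wrapped_unit = with_bypass(attention_unit, conv_bypass)
```

Because the bypass output is simply added, removing or swapping the bypass leaves the pre-trained unit untouched, which is what allows one shared backbone to serve many recognition tasks.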
It will be appreciated that the above-described target location may be determined from a target additional network element. For example, if the recognition task is an image classification task, the target additional network element is a classification header, which may be added to the top of the pre-trained model, accordingly.
S204, inputting the image to be identified into the image identification model, and acquiring an identification result output by the image identification model.
In this step, after generating the image recognition model corresponding to the recognition task, the server may input the image to be recognized into the image recognition model, and obtain the recognition result output by the image recognition model.
Wherein, the identification result can be determined according to the identification task. For example, if the recognition task is an image classification task, the recognition result may be the classification result of the image to be recognized. If the recognition task is the target detection task, the corresponding recognition result may be the detection result of the image to be recognized.
The image recognition method includes the steps that firstly, an image to be recognized and a recognition task corresponding to the image to be recognized are obtained, secondly, a target convolution bypass and a target additional network unit corresponding to the recognition task are determined, and the target convolution bypass and the target additional network unit are generated after training according to an image sample set corresponding to the recognition task; thirdly, loading the target convolution bypass and the target additional network unit into the pre-training model to generate an image recognition model corresponding to the recognition task; and finally, inputting the image to be identified into an image identification model, and acquiring an identification result output by the image identification model. According to the method, for different recognition tasks, image recognition corresponding to the recognition tasks can be performed only by loading different convolution bypasses and additional network units on the pre-training model, and compared with the method for storing all model parameters corresponding to the recognition tasks, the method effectively reduces storage overhead.
Training for convolutional bypass and additional network elements is described below. Fig. 5 is a schematic flow chart of model training provided in an embodiment of the present application, and as shown in fig. 5, the model training method includes:
S501, acquiring an image sample set corresponding to an identification task.
It should be understood that in this application, the image sample set may contain images carrying annotation information. For example, if the recognition task is an image classification task, the image sample set may include a plurality of sample images X, and each sample image is labeled with a corresponding class label Y.
In some embodiments, different recognition tasks correspond to different image sample sets, where the different image sample sets may include the same sample image carrying different labeling information, or may be different sample images carrying different labeling information.
For example, the image sample set corresponding to the image classification task and the image sample set corresponding to the target detection task may include the same sample images but with different labeling information: in the image sample set corresponding to the image classification task, each sample image is labeled with image classification information, while in the image sample set corresponding to the target detection task, each sample image is labeled with target detection information. Alternatively, the sample images in the two image sample sets may be different.
S502, determining an original additional network element according to the identification task.
The original additional network element may be an additional network element corresponding to the identification task, where the original additional network element is not trained by the image sample set corresponding to the identification task.
For example, if the recognition task is an image classification task, the original additional network element may include a classification header attached to the pre-training model; if the identification task is a target detection task, the original additional network element may include a detection head attached to a pre-training model; if the recognition task is a semantic segmentation task, the original additional network element may include a segmentation decoder attached to a pre-trained model.
S503, adding the original convolution bypass and the original additional network element into the pre-training model.
In some embodiments, the pre-training model includes at least one self-attention unit and at least one feed-forward neural network unit. The server may add an original convolution bypass for each self-attention unit and each feedforward neural network unit in the pre-training model. Original additional network elements are added at target locations in the pre-training model.
The input of the original convolution bypass is the input of the self-attention unit or feedforward neural network unit corresponding to the original convolution bypass. The output of the original convolution bypass is added to the output of the self-attention unit or feedforward neural network unit corresponding to the original convolution bypass.
For example, if the pre-trained model is a Vision Transformer model, each layer of the Vision Transformer model may be composed of one multi-head self-attention module, one feedforward neural network module, and two normalization layers. Accordingly, the input and output of each layer of the Vision Transformer model can be as shown in formulas (1) and (2):
x1 = x0 + MHSA(LN(x0))  (1)
x2 = x1 + FFN(LN(x1))  (2)
where x0 is the input data, x2 is the output data, x1 is intermediate data, MHSA is the multi-head self-attention mechanism, FFN is the feedforward neural network, and LN is the normalization operation.
After adding a convolution bypass for each multi-head self-attention module and each feedforward neural network module, the input and output of each layer of the Vision Transformer model may be as shown in formulas (3) and (4):
x1 = x0 + MHSA(LN(x0)) + Convpass(LN(x0))  (3)
x2 = x1 + FFN(LN(x1)) + Convpass(LN(x1))  (4)
where x0 is the input data, x2 is the output data, x1 is intermediate data, MHSA is the multi-head self-attention mechanism, FFN is the feedforward neural network, LN is the normalization operation, and Convpass is the convolution bypass.
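The data flow of formulas (3) and (4) can be checked numerically with the toy sketch below: the bypass reads the same normalized input as the frozen unit, and its output is added to the unit's output. The MHSA/FFN/Convpass stand-ins here are placeholder functions (an identity and a zero map), not the real trained modules; shapes assume a standard ViT token sequence of 197 tokens with dimension 768.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Per-token normalization over the feature dimension (the LN in the formulas).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def vit_layer_with_bypass(x0, mhsa, ffn, bypass_attn, bypass_ffn):
    x1 = x0 + mhsa(layer_norm(x0)) + bypass_attn(layer_norm(x0))   # formula (3)
    x2 = x1 + ffn(layer_norm(x1)) + bypass_ffn(layer_norm(x1))     # formula (4)
    return x2

rng = np.random.default_rng(0)
x0 = rng.normal(size=(197, 768))   # 196 image tokens + 1 class token
identity = lambda z: z             # placeholder for the frozen MHSA / FFN
zero = lambda z: np.zeros_like(z)  # a bypass initialized to output zero

# With zero bypasses, the layer reduces exactly to the original formulas (1)-(2).
out_plain = vit_layer_with_bypass(x0, identity, identity, zero, zero)
```

Because the bypass enters additively, a bypass whose output starts at zero leaves the pre-trained computation unchanged at the start of training.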
S504, with the model parameters of the pre-training model locked, training the original convolution bypass and the original additional network unit in the pre-training model by using the image sample set, to obtain a trained target convolution bypass and a trained target additional network unit.
It should be understood that the present embodiments are not limited in how to train the original convolution bypass and the original additional network elements.
In some embodiments, with the model parameters of the pre-training model locked, the server trains the original convolution bypass and the original additional network element by gradient descent, using the cross-entropy classification loss calculated on the image sample set.
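The "lock the backbone, train only bypass and head" step amounts to partitioning the model's parameters so that only adapter parameters receive gradient updates. The sketch below shows that partition using parameter-name prefixes; the naming convention (`convpass.`, `head.`, `backbone.`) is an assumption for illustration, not from the patent.

```python
# Hypothetical sketch: select the trainable parameters for the optimizer.
# Only bypass ("convpass.") and additional-network-unit ("head.") parameters
# are trained; everything under "backbone." stays frozen.

def trainable_param_names(all_params, adapter_prefixes=("convpass.", "head.")):
    return [n for n in all_params if n.startswith(adapter_prefixes)]

params = [
    "backbone.layer0.mhsa.qkv.weight",
    "backbone.layer0.ffn.fc1.weight",
    "convpass.layer0.attn.conv3x3.weight",
    "convpass.layer0.ffn.conv3x3.weight",
    "head.weight",
    "head.bias",
]
to_train = trainable_param_names(params)
# Only the four adapter/head tensors are handed to the optimizer.
```

In a framework such as PyTorch the same effect is typically obtained by disabling gradients on the frozen parameters before constructing the optimizer.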
S505, storing the target convolution bypass and the target additional network element according to the identification task.
In some embodiments, the target convolution bypass and the target additional network element may be stored according to the corresponding identification task.
In the embodiment of the application, since only the target convolution bypass and the target additional network unit corresponding to each recognition task are saved, compared with saving all model parameters of the pre-training model for each task, the storage cost can be greatly reduced when the pre-training model is used.
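Concretely, per-task storage can keep just the small bypass and head tensors, never a copy of the shared backbone. The sketch below illustrates that filtering; the key names and the in-memory `store` dict are illustrative assumptions.

```python
# Hypothetical sketch of per-task storage: for each recognition task, save
# only the bypass ("convpass.") and additional-unit ("head.") parameters.

def save_task_adapters(store, task, state):
    store[task] = {n: v for n, v in state.items()
                   if n.startswith(("convpass.", "head."))}

store = {}
state = {"backbone.blk0.w": [1.0], "convpass.blk0.w": [0.1], "head.w": [0.2]}
save_task_adapters(store, "classification", state)
# store["classification"] holds two small tensors; the backbone weight is excluded.
```

Loading a task then means restoring these few tensors into a model assembled on top of the single shared backbone.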
In one embodiment, as shown in fig. 6, another image recognition method is provided, comprising the steps of:
s601, acquiring an image sample set corresponding to the identification task.
S602, determining an original additional network element according to the identification task.
S603, adding the original convolution bypass and the original additional network element into the pre-training model.
S604, with the model parameters of the pre-training model locked, training the original convolution bypass and the original additional network unit in the pre-training model by using the image sample set, to obtain a trained target convolution bypass and target additional network unit.
S605, saving the target convolution bypass and the target additional network element.
S606, acquiring an image to be identified and an identification task corresponding to the image to be identified.
S607, determining a target convolution bypass and a target additional network unit corresponding to the identification task, wherein the target convolution bypass and the target additional network unit are generated after training according to an image sample set corresponding to the identification task.
S608, loading the target convolution bypass and the target additional network unit into the pre-training model, and generating an image recognition model corresponding to the recognition task.
S609, inputting the image to be identified into the image identification model, and acquiring an identification result output by the image identification model.
The image recognition method includes the steps that firstly, an image to be recognized and a recognition task corresponding to the image to be recognized are obtained, secondly, a target convolution bypass and a target additional network unit corresponding to the recognition task are determined, and the target convolution bypass and the target additional network unit are generated after training according to an image sample set corresponding to the recognition task; thirdly, loading the target convolution bypass and the target additional network unit into the pre-training model to generate an image recognition model corresponding to the recognition task; and finally, inputting the image to be identified into an image identification model, and acquiring an identification result output by the image identification model. According to the method, for different recognition tasks, image recognition corresponding to the recognition tasks can be performed only by loading different convolution bypasses and additional network units on the pre-training model, and compared with the method for storing all model parameters corresponding to the recognition tasks, the method effectively reduces storage overhead.
It should be understood that, although the steps in the flowcharts related to the above embodiments are shown sequentially as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated herein, the steps are not strictly limited to this order of execution and may be performed in other orders. Moreover, at least some of the steps in the flowcharts described in the above embodiments may include a plurality of sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and these sub-steps or stages are not necessarily performed sequentially, but may be performed in turn or alternately with at least some of the other steps, sub-steps, or stages.
Based on the same inventive concept, the embodiment of the application also provides an image recognition device for realizing the image recognition method. The implementation of the solution provided by the device is similar to that described in the above method; therefore, for the specific limitations in the embodiments of the image recognition device provided below, reference may be made to the limitations of the image recognition method above, which are not repeated here.
In one embodiment, as shown in fig. 7, there is provided an image recognition apparatus 700 including: a first acquisition module 701, a loading module 702, an identification module 703 and a training module 704.
A first obtaining module 701, configured to obtain an image to be identified and an identification task corresponding to the image to be identified; determining a target convolution bypass and a target additional network unit corresponding to the identification task, wherein the target convolution bypass and the target additional network unit are generated after training according to an image sample set corresponding to the identification task;
the loading module 702 is configured to load the target convolution bypass and the target additional network element into the pre-training model, and generate an image recognition model corresponding to the recognition task;
the recognition module 703 is configured to input the image to be recognized into the image recognition model, and obtain a recognition result output by the image recognition model.
In one embodiment, the pre-training model includes at least one self-attention unit and at least one feed-forward neural network unit.
The loading module 702 is specifically configured to add a target convolution bypass to each self-attention unit and each feedforward neural network unit in the pre-training model; adding a target additional network element at a target location in the pre-training model.
In one embodiment, the input of the target convolution bypass is the input of the self-attention unit or the feedforward neural network unit to which the target convolution bypass corresponds.
The output of the target convolution bypass is used to be added to the output of the self-attention unit or feedforward neural network unit to which the target convolution bypass corresponds.
In one embodiment, the target convolution bypass comprises a first convolution subunit and a second convolution subunit, the convolution kernel of the first convolution subunit being smaller than the convolution kernel of the second convolution subunit.
In one embodiment, if the input of the pre-training model is a serialized input, the target convolution bypass further comprises a reconstruction subunit;
the reconstruction subunit is disposed before the second convolution subunit;
and the reconstruction subunit is used for reconstructing the two-dimensional spatial structure of the image tokens in the input data to generate reconstructed image tokens, and inputting the reconstructed image tokens and the identification token in the input data into the second convolution subunit.
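The reconstruction step described above can be sketched numerically: split off the identification (class) token, then reshape the remaining image tokens from a flat sequence back into their H×W grid so that the second convolution subunit (e.g. a 3×3 convolution) can operate over spatial neighbors. The 14×14 grid and channel count below are illustrative assumptions matching a standard ViT patch layout, not values stated in the patent.

```python
import numpy as np

def reconstruct_tokens(tokens, h=14, w=14):
    """tokens: (1 + H*W, C) sequence with the identification token first.
    Returns the identification token and the image tokens as an (H, W, C) map."""
    cls_token, img_tokens = tokens[:1], tokens[1:]
    grid = img_tokens.reshape(h, w, -1)  # row-major: token i -> (i // w, i % w)
    return cls_token, grid

tokens = np.arange(197 * 8, dtype=float).reshape(197, 8)  # toy (1+196, C=8) input
cls_token, grid = reconstruct_tokens(tokens)
```

After the convolution, the grid can be flattened back to a sequence and re-concatenated with the identification token so the bypass output matches the shape of the unit output it is added to.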
In one embodiment, the image recognition apparatus 700 further includes:
the training module 704 is configured to obtain an image sample set corresponding to the recognition task; determine an original additional network element according to the identification task; add the original convolution bypass and the original additional network element to the pre-training model; with the model parameters of the pre-training model locked, train the original convolution bypass and the original additional network unit in the pre-training model by using the image sample set to obtain a trained target convolution bypass and target additional network unit; and store the target convolution bypass and the target additional network element according to the identification task.
The respective modules in the image recognition apparatus described above may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
Based on the same inventive concept, the embodiment of the application also provides a model training device for realizing the model training method. The implementation of the solution provided by the device is similar to the implementation described in the above method, so the specific limitation in one or more embodiments of the model training device provided below may be referred to above for the limitation of the model training method, which is not repeated here.
In one embodiment, as shown in fig. 8, there is provided a model training apparatus 800 including: a second acquisition module 801, an addition module 802, a training module 803, and a storage module 804.
A second obtaining module 801, configured to obtain an image sample set corresponding to the identification task, and determine an original additional network element according to the identification task;
An adding module 802 for adding the original convolution bypass and the original additional network element to the pre-training model;
the training module 803 is configured to, with the model parameters of the pre-training model locked, train the original convolution bypass and the original additional network unit in the pre-training model by using the image sample set, to obtain a trained target convolution bypass and target additional network unit, where the target convolution bypass and the target additional network unit are used to generate an image recognition model corresponding to the recognition task;
a storage module 804 is configured to store the target convolution bypass and the target additional network element.
The various modules in the model training apparatus described above may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a server, and the internal structure of which may be as shown in fig. 9. The computer device includes a processor, a memory, an Input/Output interface (I/O) and a communication interface. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is for storing data. The input/output interface of the computer device is used to exchange information between the processor and the external device. The communication interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement an image recognition method or a model training method.
It will be appreciated by those skilled in the art that the structure shown in fig. 9 is merely a block diagram of a portion of the structure associated with the present application and is not limiting of the computer device to which the present application applies, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In an embodiment, there is also provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In an embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the various embodiments provided herein may include at least one of non-volatile and volatile memory. The nonvolatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical Memory, high density embedded nonvolatile Memory, resistive random access Memory (ReRAM), magnetic random access Memory (Magnetoresistive Random Access Memory, MRAM), ferroelectric Memory (Ferroelectric Random Access Memory, FRAM), phase change Memory (Phase Change Memory, PCM), graphene Memory, and the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory, and the like. By way of illustration, and not limitation, RAM can be in the form of a variety of forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM), and the like. The databases referred to in the various embodiments provided herein may include at least one of relational databases and non-relational databases. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processors referred to in the embodiments provided herein may be general purpose processors, central processing units, graphics processors, digital signal processors, programmable logic units, quantum computing-based data processing logic units, etc., without being limited thereto.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The above examples only represent a few embodiments of the present application, which are described in more detail and are not to be construed as limiting the scope of the present application. It should be noted that it would be apparent to those skilled in the art that various modifications and improvements could be made without departing from the spirit of the present application, which would be within the scope of the present application. Accordingly, the scope of protection of the present application shall be subject to the appended claims.

Claims (12)

1. An image recognition method, the method comprising:
acquiring an image to be identified and an identification task corresponding to the image to be identified;
determining a target convolution bypass and a target additional network unit corresponding to the identification task, wherein the target convolution bypass and the target additional network unit are generated after training according to an image sample set corresponding to the identification task;
Loading the target convolution bypass and the target additional network unit into a pre-training model to generate an image recognition model corresponding to the recognition task;
inputting the image to be identified into the image identification model, and acquiring an identification result output by the image identification model.
2. The method of claim 1, wherein the pre-training model comprises at least one self-attention unit and at least one feed-forward neural network unit;
the loading the target convolution bypass and the target additional network unit into a pre-training model comprises:
adding a target convolution bypass to each self-attention unit and each feedforward neural network unit in the pre-training model;
adding the target additional network element at a target location in the pre-training model.
3. The method of claim 2, wherein the input of the target convolution bypass is an input of a self-attention unit or a feed-forward neural network unit corresponding to the target convolution bypass;
the output of the target convolution bypass is used for being added to the output of a self-attention unit or a feedforward neural network unit corresponding to the target convolution bypass.
4. The method of claim 1, wherein the target convolution bypass comprises a first convolution subunit and a second convolution subunit, the convolution kernel of the first convolution subunit being smaller than the convolution kernel of the second convolution subunit.
5. The method of claim 4, wherein if the input of the pre-training model is a serialized input, the target convolution bypass further comprises a reconstruction subunit;
the reconstruction subunit is disposed before the second convolution subunit;
the reconstruction subunit is configured to reconstruct a two-dimensional spatial structure of image tokens in input data to generate reconstructed image tokens, and input the reconstructed image tokens and an identification token in the input data into the second convolution subunit.
6. The method according to any of claims 1-5, wherein prior to said determining a target convolutional bypass and a target additional network element for the identification task, the method further comprises:
acquiring an image sample set corresponding to the identification task;
determining an original additional network element according to the identification task;
adding an original convolution bypass and the original additional network element to the pre-training model;
Under the condition of locking the model parameters of the pre-training model, training an original convolution bypass and an original additional network unit in the pre-training model by using the image sample set to obtain a trained target convolution bypass and a target additional network unit;
and storing the target convolution bypass and the target additional network element according to the identification task.
7. A method of model training, the method comprising:
acquiring an image sample set corresponding to an identification task;
determining an original additional network element according to the identification task;
adding an original convolution bypass and the original additional network element to a pre-training model;
under the condition of locking the model parameters of the pre-training model, training an original convolution bypass and the original additional network unit in the pre-training model by using the image sample set to obtain a trained target convolution bypass and target additional network unit, wherein the target convolution bypass and the target additional network unit are used for generating an image recognition model corresponding to the recognition task;
and saving the target convolution bypass and the target additional network element.
8. An image recognition apparatus, the apparatus comprising:
The first acquisition module is used for acquiring an image to be identified and an identification task corresponding to the image to be identified; determining a target convolution bypass and a target additional network unit corresponding to the identification task, wherein the target convolution bypass and the target additional network unit are generated after training according to an image sample set corresponding to the identification task;
the loading module is used for loading the target convolution bypass and the target additional network unit into a pre-training model and generating an image recognition model corresponding to the recognition task;
the identification module is used for inputting the image to be identified into the image identification model and acquiring an identification result output by the image identification model.
9. A model training apparatus, the apparatus comprising:
the second acquisition module is used for acquiring an image sample set corresponding to the identification task, and determining an original additional network element according to the identification task;
an adding module for adding the original convolution bypass and the original additional network element to a pre-training model;
the training module is used for, under the condition of locking the model parameters of the pre-training model, training an original convolution bypass and the original additional network unit in the pre-training model by using the image sample set to obtain a trained target convolution bypass and target additional network unit, wherein the target convolution bypass and the target additional network unit are used for generating an image recognition model corresponding to the recognition task;
And the storage module is used for storing the target convolution bypass and the target additional network element.
10. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 7 when the computer program is executed.
11. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 7.
12. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 7.
CN202310422862.2A 2023-04-19 2023-04-19 Image recognition method, model training method and device Pending CN116385836A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310422862.2A CN116385836A (en) 2023-04-19 2023-04-19 Image recognition method, model training method and device

Publications (1)

Publication Number Publication Date
CN116385836A true CN116385836A (en) 2023-07-04

Family

ID=86970922

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310422862.2A Pending CN116385836A (en) 2023-04-19 2023-04-19 Image recognition method, model training method and device

Country Status (1)

Country Link
CN (1) CN116385836A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination