CN117132819A - Image classification method, device, equipment and storage medium - Google Patents

Image classification method, device, equipment and storage medium

Info

Publication number
CN117132819A
CN117132819A (application CN202311091521.8A)
Authority
CN
China
Prior art keywords: image, target, text, feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311091521.8A
Other languages
Chinese (zh)
Inventor
胡懋成
蔡明琦
方昕
吴江照
邹亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei Intelligent Voice Innovation Development Co ltd
Original Assignee
Hefei Intelligent Voice Innovation Development Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei Intelligent Voice Innovation Development Co ltd filed Critical Hefei Intelligent Voice Innovation Development Co ltd
Priority to CN202311091521.8A
Publication of CN117132819A
Legal status: Pending

Classifications

    • G06V 10/764 — Image or video recognition or understanding using pattern recognition or machine learning: classification, e.g. of video objects
    • G06N 3/048 — Neural networks: activation functions
    • G06N 3/08 — Neural networks: learning methods
    • G06V 10/16 — Image acquisition using multiple overlapping images; image stitching
    • G06V 10/806 — Fusion of extracted features, i.e. combining data from various sources at the sensor, preprocessing, feature extraction or classification level
    • G06V 10/809 — Fusion of classification results, e.g. where the classifiers operate on the same input data
    • G06V 10/811 — Fusion of classification results, the classifiers operating on different input data, e.g. multi-modal recognition
    • G06V 10/82 — Image or video recognition or understanding using neural networks
    • G06V 30/19173 — Character recognition: classification techniques


Abstract

The invention provides an image classification method, apparatus, device and storage medium. The method comprises: acquiring a target image in a specific field and a target text, the target text being a visual description text of the target image; extracting general image features from the target image and general text features from the target text; acquiring, according to the general image features and the general text features, target features respectively corresponding to the target image and the target text, where the target feature corresponding to the target image is a target image feature into which text features of the target text are fused and which comprises both general image features and specific-field image features, and the target feature corresponding to the target text is a target text feature into which image features of the target image are fused and which comprises both general text features and specific-field text features; and determining the category of the target image according to the target features respectively corresponding to the target image and the target text. The invention can accurately classify images in a specific field.

Description

Image classification method, device, equipment and storage medium
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to an image classification method, apparatus, device, and storage medium.
Background
Image classification refers to identifying which category an image belongs to. It is a fundamental task in computer vision and underlies many other visual tasks.
Most current image classification schemes target the general field: an image classification model is first trained with general-field training samples, and the trained model is then used to classify the images to be classified.
At present, image classification in specific fields is becoming increasingly important. However, general-field image classification schemes are poorly applicable to specific-field images; that is, classifying specific-field images with a general-field scheme yields a poor classification effect.
Disclosure of Invention
In view of the above, the present invention provides an image classification method, apparatus, device and storage medium to solve the problem that current image classification schemes are poorly applicable to images in a specific field. The technical scheme is as follows:
in a first aspect, there is provided an image classification method, comprising:
acquiring a target image and a target text in a specific field, wherein the target text is a visual description text of the target image;
extracting general image features from the target image and general text features from the target text;
acquiring target characteristics corresponding to the target image and the target text respectively according to the general image characteristics and the general text characteristics, wherein the target characteristics corresponding to the target image are target image characteristics blended with the text characteristics of the target text, the target image characteristics comprise general image characteristics and specific field image characteristics of the target image, the target characteristics corresponding to the target text are target text characteristics blended with the image characteristics of the target image, and the target text characteristics comprise general text characteristics and specific field text characteristics of the target text;
and determining the category of the target image according to the target characteristics respectively corresponding to the target image and the target text.
Optionally, the determining the category of the target image according to the target features corresponding to the target image and the target text respectively includes:
identifying entities from the target text to obtain a plurality of entities, wherein each entity is a candidate category of the target image;
extracting the features respectively corresponding to the entities from the target features corresponding to the target text;
calculating the correlation degree between the features corresponding to each entity and the target features corresponding to the target image;
and determining the category of the target image from the entities according to the calculated correlation.
Optionally, the extracting general image features from the target image and general text features from the target text, and the acquiring, according to the general image features and the general text features, target features respectively corresponding to the target image and the target text, include:
extracting general image features from the target image and general text features from the target text based on an image classification model obtained through pre-training;
based on the image classification model, acquiring target features corresponding to the target image and the target text respectively according to the general image features and the general text features;
the image classification model is obtained by training on training images in the specific field and the visual description texts of the training images.
Optionally, the obtaining the target features corresponding to the target image and the target text respectively based on the general image feature and the general text feature includes:
acquiring specific field image features fused with the text features of the target text according to the general image features and the general text features to obtain first fusion features, and acquiring specific field text features fused with the image features of the target image to obtain second fusion features;
fusing the first fusion feature with the general image feature to obtain a third fusion feature, and fusing the second fusion feature with the general text feature to obtain a fourth fusion feature;
and acquiring target features corresponding to the target image according to the third fusion features, and acquiring target features corresponding to the target text according to the fourth fusion features.
Optionally, the obtaining, according to the general image feature and the general text feature, a specific field image feature fused with the text feature of the target text to obtain a first fusion feature, and obtaining a specific field text feature fused with the image feature of the target image to obtain a second fusion feature, includes:
the general text features are fused into the general image features to obtain first features, and the general image features are fused into the general text features to obtain second features;
extracting image features from the first features to obtain third features, and extracting text features from the second features to obtain fourth features;
extracting image features from the third features to obtain fifth features, and extracting text features from the fourth features to obtain sixth features;
and merging the fourth feature into the fifth feature to obtain the first merged feature, and merging the third feature into the sixth feature to obtain the second merged feature.
Optionally, the obtaining the target feature corresponding to the target image according to the third fusion feature, and obtaining the target feature corresponding to the target text according to the fourth fusion feature includes:
processing the third fusion feature into a feature which is adapted to an image classification task to obtain a fifth fusion feature, and processing the fourth fusion feature into a feature which is adapted to the image classification task to obtain a sixth fusion feature;
and fusing the fifth fusion feature with the general image feature to obtain a target feature corresponding to the target image, and fusing the sixth fusion feature with the general text feature to obtain a target feature corresponding to the target text.
Optionally, the image classification model includes an image encoder, a text encoder and an adapter, where the image encoder is obtained by pre-training with image samples in the general field and the text encoder is obtained by pre-training with text samples in the general field;
the training process of the image classification model comprises the following steps:
acquiring general image features of the training image based on the image encoder, and acquiring general text features of visual description text of the training image based on the text encoder;
based on the adapter, acquiring target features corresponding to the training image and the visual description text of the training image respectively according to the general image features of the training image and the general text features of the visual description text of the training image, wherein the target features corresponding to the training image are training image features blended with the text features of the visual description text of the training image, the training image features comprise general image features and specific field image features of the training image, the target features corresponding to the visual description text of the training image are training text features blended with the image features of the training image, and the training text features comprise general text features and specific field text features of the visual description text of the training image;
and determining a first loss according to the target features respectively corresponding to the training image and the visual description text of the training image, and updating parameters of the image classification model according to the first loss, wherein in the training process, the parameters of the image encoder and the text encoder are not updated.
In a second aspect, there is provided an image classification apparatus, comprising: a data acquisition module, a general feature acquisition module, a target feature acquisition module and an image category determination module;
the data acquisition module is used for acquiring a target image and a target text in a specific field, wherein the target text is a visual description text of the target image;
the general feature acquisition module is used for extracting general image features from the target image and general text features from the target text;
the target feature acquiring module is configured to acquire target features corresponding to the target image and the target text respectively according to the general image feature and the general text feature, where the target features corresponding to the target image are target image features that are integrated with text features of the target text, the target image features include general image features and specific field image features of the target image, the target features corresponding to the target text are target text features that are integrated with image features of the target image, and the target text features include general text features and specific field text features of the target text;
The image category determining module is used for determining the category of the target image according to the target characteristics corresponding to the target image and the target text respectively.
In a third aspect, there is provided an image classification apparatus comprising: a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement each step of the image classification method described in any one of the above.
In a fourth aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the image classification method of any of the above.
The invention provides an image classification method. First, a target image in a specific field and the visual description text of the target image are acquired. Then, general image features of the target image and general text features of the visual description text are extracted. Next, according to the general image features and the general text features, target features respectively corresponding to the target image and the target text are acquired: the target feature corresponding to the target image is a target image feature into which text features of the target text are fused, comprising both general image features and specific-field image features of the target image; the target feature corresponding to the target text is a target text feature into which image features of the target image are fused, comprising both general text features and specific-field text features of the target text. Finally, the category of the target image is determined according to the target features respectively corresponding to the target image and the target text. When classifying a target image in a specific field, the method considers both general features and specific-field features and, in addition, exploits multi-modal features; it can therefore classify images in the specific field accurately.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required by the embodiments or the description of the prior art are briefly introduced below. Obviously, the drawings described below are merely embodiments of the present invention, and other drawings can be obtained from them by a person skilled in the art without inventive effort.
Fig. 1 is a schematic diagram of a hardware architecture according to the present invention;
fig. 2 is a schematic flow chart of an image classification method according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of an image classification model according to an embodiment of the present invention;
FIG. 4 is a flowchart of a training method of an image classification model according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an adapter in an image classification model according to an embodiment of the present invention;
FIG. 6 is a flowchart of another training method of the image classification model according to the embodiment of the present invention;
fig. 7 is a schematic structural diagram of an image classification device according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of an image classification apparatus according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, embodiments of the present invention. All other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the protection scope of the present invention.
Since image classification in specific fields is becoming increasingly important while current image classification schemes (i.e., schemes for the general field) are poorly applicable to specific-field images, the inventors studied the problem. The initial idea was: train an image classification model for the specific field in advance with training samples from that field, and then classify the images to be classified in that field with this model.
However, the number of training samples in a specific field is usually small, and an image classification model trained with few samples suffers from over-fitting: the model performs well on known samples but poorly on unknown samples, i.e., the scheme generalizes badly.
Continuing the study, the inventors considered increasing the number of training samples with data augmentation, that is, producing new samples by rotating, scaling, color-jittering, or affinely transforming existing training samples, and training the image classification model with both the original training samples and the augmented ones.
However, the distribution of augmented samples stays close to that of the original training samples, so an image classification model trained with the original samples plus augmented samples still generalizes poorly.
In view of this, the inventors continued the research and finally arrived at the effective image classification method provided herein.
Before introducing the image classification method provided by the invention, the hardware architecture related to the invention is described.
In one possible implementation manner, as shown in fig. 1, the hardware architecture related to the present invention may include: an electronic device 101 and a server 102.
By way of example, the electronic device 101 may be any electronic product that can interact with a user through one or more of a keyboard, a touchpad, a touch screen, a remote control, voice interaction or a handwriting device, such as a mobile phone, a notebook computer, a tablet computer, a palmtop computer, a personal computer, a wearable device or a smart television.
It should be noted that fig. 1 is only an example, and the types of electronic devices may be various, and are not limited to the notebook computer in fig. 1.
The server 102 may be, for example, a single server, a server cluster composed of multiple servers, or a cloud computing center. The server 102 may include a processor, a memory, network interfaces, and the like.
By way of example, the electronic device 101 may establish a connection and communicate with the server 102 over a wireless communication network; illustratively, the electronic device 101 may establish a connection and communicate with the server 102 over a wired network.
The electronic device 101 obtains an image to be classified in a specific field and the visual description text of that image and sends both to the server 102; the server 102 processes them according to the image classification method provided by the present invention to determine the image category, and then returns the category to the electronic device 101.
In another possible implementation manner, the hardware architecture related to the present invention may include: an electronic device. The electronic device is a device with a relatively strong data processing capability.
The electronic equipment can process the images to be classified and the visual description text of the images to be classified in the specific field according to the image classification method provided by the invention so as to determine the image category.
Those skilled in the art will appreciate that the above electronic devices and servers are merely examples, and that other existing or future electronic devices or servers, where applicable, also fall within the protection scope of the present invention.
The image classification method provided by the present invention will be described by the following examples.
Referring to fig. 2, a flow chart of an image classification method according to an embodiment of the present invention is shown, where the method may include:
step S201: and acquiring a target image and a target text in a specific field.
The target image is an image to be classified, and the target text is a visual description text of the target image. It should be noted that the visual description text of the target image is text describing the image content of the target image.
Step S202: extracting general image features from the target image and general text features from the target text.
Here, general image features are image features that are common to, or applicable in, all fields; similarly, general text features are text features that are common to, or applicable in, all fields.
Step S203: and acquiring target characteristics corresponding to the target image and the target text respectively according to the general image characteristics and the general text characteristics.
The target feature corresponding to the target image is a target image feature into which text features of the target text are fused; it contains image features at two levels, namely the general image features of the target image and the specific-field image features of the target image. Similarly, the target feature corresponding to the target text is a target text feature into which image features of the target image are fused; it contains text features at two levels, namely the general text features of the target text and the specific-field text features of the target text. It should be noted that specific-field image features are image features adapted to the specific field and related to its classification task; likewise, specific-field text features are text features adapted to the specific field and related to its classification task.
To finally obtain an accurate classification result, this step combines the general text features with the general image features to determine a feature that contains both levels of image features and has text features fused in, i.e., the target feature corresponding to the target image, and combines the general image features with the general text features to determine a feature that contains both levels of text features and has image features fused in, i.e., the target feature corresponding to the target text.
Step S204: and determining the category of the target image according to the target characteristics respectively corresponding to the target image and the target text.
Specifically, entities are first identified from the target text to obtain a plurality of entities; features respectively corresponding to the entities are then extracted from the target feature corresponding to the target text; next, the correlation between the feature corresponding to each entity and the target feature corresponding to the target image is calculated; and finally, the category of the target image is determined from the entities according to the calculated correlations.
The entities can be identified from the target text with an existing entity recognition method, and each identified entity serves as a candidate category of the target image.
After the correlation between the feature corresponding to each entity and the target feature corresponding to the target image is calculated, there are several options. In one possible implementation, the N entities whose features have the largest correlation with the target feature of the target image are determined as the category of the target image (N is a positive integer whose value can be set for the specific application scenario). In another possible implementation, every entity whose feature has a correlation with the target feature of the target image greater than a preset correlation threshold may be so determined. In yet another possible implementation, the entity whose feature has the largest correlation with the target feature of the target image, provided that correlation exceeds the preset correlation threshold, may be determined as the category of the target image.
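To make the selection concrete, the following is a minimal sketch of this correlate-and-pick step in PyTorch; the function name, the tensor shapes and the use of cosine similarity as the correlation measure are illustrative assumptions, not prescribed by the patent:

```python
import torch
import torch.nn.functional as F

def classify_from_entities(entity_feats, image_feat, entities, top_n=1, threshold=None):
    """entity_feats: (num_entities, d) features extracted for each candidate entity;
    image_feat: (d,) target feature of the target image;
    entities: list of entity strings recognized in the target text."""
    # Cosine similarity plays the role of the "correlation degree".
    sims = F.normalize(entity_feats, dim=-1) @ F.normalize(image_feat, dim=-1)
    order = sims.argsort(descending=True).tolist()
    picked = []
    for i in order[:top_n]:
        # Optionally require the correlation to exceed a preset threshold.
        if threshold is None or sims[i].item() > threshold:
            picked.append(entities[i])
    return picked  # predicted categories of the target image
```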
The embodiment of the invention provides an image classification method. First, a target image in a specific field and the visual description text of the target image are acquired. Then, general image features of the target image and general text features of the visual description text are extracted. Next, according to the general image features and the general text features, target features respectively corresponding to the target image and the target text are acquired (the target feature corresponding to the target image is a target image feature into which text features of the target text are fused, comprising the general image features and the specific-field image features of the target image; the target feature corresponding to the target text is a target text feature into which image features of the target image are fused, comprising the general text features and the specific-field text features of the target text). Finally, the category of the target image is determined according to the target features respectively corresponding to the target image and the target text. When classifying target images in a specific field, the method considers both general and specific-field features and also exploits multi-modal features; it can therefore classify images in the specific field accurately.
In one possible implementation, the image classification method provided in the foregoing embodiment may be implemented based on an image classification model obtained by training in advance, where the image classification model is trained with training images in the specific field and the visual description texts of the training images.
It should be noted that the concrete implementation form of the image classification method provided in the foregoing embodiment is not limited; that is, the method is not restricted to the model-based implementation and may, for example, also be implemented based on rules.
Next, the image classification method provided in the above embodiment is further described, taking the model-based implementation as an example.
Before describing the process of implementing image classification based on the image classification model obtained by training in advance, the training process of the image classification model will be described by taking the image classification model shown in fig. 3 as an example.
The image classification model shown in fig. 3 may include an image encoder 301I, a text encoder 301T and an adapter 302. The image encoder 301I is obtained by pre-training on a large number of general-field image samples, and the text encoder 301T is obtained by pre-training on a large number of general-field text samples. During the training of the image classification model, the parameters of the image encoder 301I and the text encoder 301T are fixed, i.e., they are not updated.
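A minimal sketch of this setup in PyTorch, assuming the two pre-trained encoders are supplied as modules; the class and argument names are illustrative, as the patent does not prescribe a concrete framework:

```python
import torch.nn as nn

class ImageClassificationModel(nn.Module):
    def __init__(self, image_encoder: nn.Module, text_encoder: nn.Module, adapter: nn.Module):
        super().__init__()
        self.image_encoder = image_encoder   # pre-trained on general-field images
        self.text_encoder = text_encoder     # pre-trained on general-field texts
        self.adapter = adapter               # the only trainable part

        # Freeze both encoders: their parameters are never updated.
        for p in self.image_encoder.parameters():
            p.requires_grad = False
        for p in self.text_encoder.parameters():
            p.requires_grad = False

    def forward(self, image, text):
        img_feat = self.image_encoder(image)   # general image features
        txt_feat = self.text_encoder(text)     # general text features
        return self.adapter(img_feat, txt_feat)
```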
The image classification model can be trained in various manners; this embodiment provides the following two.
Referring to fig. 4, a flowchart of a training manner of the image classification model is shown, which may include:
step S401: the original sample image and the visual description text of the original sample image are obtained from the training sample set and serve as the training image and the visual description text of the training image.
It should be noted that the training sample set includes a plurality of pieces of sample data, and each piece of sample data includes an original sample image and a visual description text of the original sample image.
Step S402: the image encoder 301I based on the image classification model acquires general image features of the training image, and the text encoder 301T based on the image classification model acquires general text features of visual descriptive text of the training image.
Since the image encoder 301I is pre-trained on a large number of general-field image samples, inputting the training image into it makes it output the general image features of the training image. Similarly, since the text encoder 301T is pre-trained on a large number of general-field text samples, inputting the visual description text of the training image into it makes it output the general text features of that text.
Step S403: the adapter 302 based on the image classification model obtains target features corresponding to the training image and the visual description text of the training image based on the general image features of the training image and the general text features of the visual description text of the training image.
The target features corresponding to the training images are training image features which are integrated with text features of visual description texts of the training images, the training image features comprise general image features and specific field image features of the training images, the target features corresponding to the visual description texts of the training images are training text features which are integrated with image features of the training images, and the training text features comprise general text features and specific field text features of the visual description texts of the training images.
The process in which the adapter 302 of the image classification model obtains, based on the general image features of the training image and the general text features of the visual description text of the training image, target features respectively corresponding to the training image and the visual description text of the training image may include:
step S4031, based on the image classification model, the adapter 302 obtains the specific domain image feature blended with the text feature of the visual description text of the training image based on the general image feature of the training image and the general text feature of the visual description text of the training image, to obtain the first fusion feature, and obtains the specific domain text feature blended with the image feature of the training image to obtain the second fusion feature.
Specifically, the implementation procedure of step S4031 may include:
step a1, the adapter 302 based on the image classification model blends the general text features of the visual description text of the training image into the general image features of the training image to obtain first features, and the adapter 302 based on the image classification model blends the general image features of the training image into the general text features of the visual description text of the training image to obtain second features.
Optionally, as shown in fig. 5, the adapter 302 of the image classification model may include a first feature adjustment module 501, a first feature stitching module 502I and a second feature stitching module 502T. The first feature adjustment module 501 adjusts the general text features of the visual description text into features that can be stitched with the general image features of the training image, and the first feature stitching module 502I stitches the adjusted text features with the general image features to obtain the first feature. Similarly, the first feature adjustment module 501 adjusts the general image features of the training image into features that can be stitched with the general text features of the visual description text, and the second feature stitching module 502T stitches the adjusted image features with the general text features to obtain the second feature.
Optionally, the first feature adjustment module 501 may consist of linear layers (e.g., two linear layers) and a ReLU activation function; it adjusts the dimension of the input features so that they can be stitched with the features of the other modality.
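A minimal sketch of such an adjustment module under those assumptions (the hidden dimension is illustrative):

```python
import torch.nn as nn

class FeatureAdjust(nn.Module):
    """Maps features of one modality to a dimension that can be
    stitched (concatenated) with features of the other modality."""
    def __init__(self, in_dim: int, out_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x):
        return self.net(x)
```

The first feature would then be obtained by concatenation, e.g. `torch.cat([image_feat, adjusted_text_feat], dim=-1)`.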
Step a2: the adapter 302 of the image classification model performs image feature extraction on the first feature to obtain the third feature, and performs text feature extraction on the second feature to obtain the fourth feature.
As shown in fig. 5, the adapter 302 of the image classification model may include a first image feature extraction module 503I and a first text feature extraction module 503T; the former extracts image features from the first feature to obtain the third feature, and the latter extracts text features from the second feature to obtain the fourth feature.
Optionally, the first image feature extraction module 503I may be, but is not limited to, a RepVGG module, which extracts image features from the first feature; the first text feature extraction module 503T may be, but is not limited to, a BERT module, which extracts text features from the second feature.
It should be noted that, since the first feature is obtained by stitching features of two modalities, the third feature obtained by image feature extraction from it contains text features in addition to image features, although image features dominate; similarly, the fourth feature contains image features in addition to text features, although text features dominate.
Step a3: the adapter 302 of the image classification model performs image feature extraction on the third feature to obtain the fifth feature, and performs text feature extraction on the fourth feature to obtain the sixth feature.
As shown in fig. 5, the adapter 302 of the image classification model may include a second image feature extraction module 504I and a second text feature extraction module 504T, where the image feature extraction may be performed on the third feature based on the second image feature extraction module 504I to obtain a fifth feature, and the text feature extraction may be performed on the fourth feature based on the second text feature extraction module 504T to obtain a sixth feature.
Alternatively, the second image feature extraction module 504I may be, but not limited to, a RepGhost module, and the second text feature extraction module 504T may be, but not limited to, a BERT module.
Step a4: the adapter 302 of the image classification model fuses the fourth feature into the fifth feature to obtain the first fusion feature, and fuses the third feature into the sixth feature to obtain the second fusion feature.
As shown in fig. 5, the adapter 302 of the image classification model may include a second feature adjustment module 505, a third feature stitching module 506I and a fourth feature stitching module 506T. The second feature adjustment module 505 adjusts the fourth feature into a feature that can be stitched with the fifth feature, and the third feature stitching module 506I stitches the adjusted feature with the fifth feature to obtain the first fusion feature. Similarly, the second feature adjustment module 505 adjusts the third feature into a feature that can be stitched with the sixth feature, and the fourth feature stitching module 506T stitches the adjusted feature with the sixth feature to obtain the second fusion feature.
Optionally, the second feature adjustment module 505 may consist of linear layers (e.g., two linear layers) and a ReLU activation function; it adjusts the dimension of the input features so that they can be stitched with the features of the other modality.
Through steps a1 to a4, the features of the two modalities are fully fused and mutually supplemented, as the following sketch illustrates.
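A minimal sketch of steps a1 to a4 as one PyTorch module; the plain linear layers stand in for the RepVGG, RepGhost and BERT extraction modules named above, and all names and dimensions are illustrative assumptions:

```python
import torch
import torch.nn as nn

def adjust(dim):
    # Stand-in for the adjustment modules 501/505: two linear layers
    # with a ReLU, matching dimensions for cross-modal stitching.
    return nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

class CrossModalFusion(nn.Module):
    """Steps a1-a4 of the adapter, with linear stand-ins for the
    modality-specific extraction backbones."""
    def __init__(self, dim: int):
        super().__init__()
        self.adj_txt, self.adj_img = adjust(dim), adjust(dim)                            # a1
        self.img_ext1, self.txt_ext1 = nn.Linear(2 * dim, dim), nn.Linear(2 * dim, dim)  # a2
        self.img_ext2, self.txt_ext2 = nn.Linear(dim, dim), nn.Linear(dim, dim)          # a3
        self.adj_f4, self.adj_f3 = adjust(dim), adjust(dim)                              # a4
        self.img_out, self.txt_out = nn.Linear(2 * dim, dim), nn.Linear(2 * dim, dim)

    def forward(self, img_feat, txt_feat):
        f1 = torch.cat([img_feat, self.adj_txt(txt_feat)], -1)  # first feature
        f2 = torch.cat([txt_feat, self.adj_img(img_feat)], -1)  # second feature
        f3 = self.img_ext1(f1)                                  # third feature
        f4 = self.txt_ext1(f2)                                  # fourth feature
        f5 = self.img_ext2(f3)                                  # fifth feature
        f6 = self.txt_ext2(f4)                                  # sixth feature
        first_fusion = self.img_out(torch.cat([f5, self.adj_f4(f4)], -1))
        second_fusion = self.txt_out(torch.cat([f6, self.adj_f3(f3)], -1))
        return first_fusion, second_fusion
```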
Step S4032: the adapter 302 of the image classification model fuses the first fusion feature with the general image features of the training image to obtain the third fusion feature, and fuses the second fusion feature with the general text features of the visual description text of the training image to obtain the fourth fusion feature.
After the feature processing of steps a1 to a4, the general features may be forgotten. Therefore, to avoid over-fitting of the model, after the first and second fusion features are obtained, the first fusion feature is fused with the general image features and the second fusion feature is fused with the general text features.
Optionally, the adapter 302 of the image classification model may fuse the first fusion feature with the general image features of the training image using an attention mechanism, and likewise fuse the second fusion feature with the general text features of the visual description text using an attention mechanism.
As shown in fig. 5, the adapter 302 of the image classification model may include a first attention module 507I and a second attention module 507T. The first attention module 507I determines the weights respectively corresponding to the first fusion feature and the general image features of the training image, weights each feature by its weight, and sums them; the weighted sum is the third fusion feature. Similarly, the second attention module 507T determines the weights respectively corresponding to the second fusion feature and the general text features of the visual description text of the training image, weights each feature by its weight, and sums them; the weighted sum is the fourth fusion feature.
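A minimal sketch of this weight-and-sum fusion; the text does not fix the exact attention form, so a simple learned two-way softmax gate is assumed here:

```python
import torch
import torch.nn as nn

class AttentionFuse(nn.Module):
    """Fuses a fusion feature with the corresponding general feature by
    learning one weight per input and summing the weighted features."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(2 * dim, 2)  # one logit per input feature

    def forward(self, fused_feat, general_feat):
        logits = self.score(torch.cat([fused_feat, general_feat], dim=-1))
        w = torch.softmax(logits, dim=-1)   # the two weights sum to 1
        return w[..., 0:1] * fused_feat + w[..., 1:2] * general_feat

# third_fusion  = AttentionFuse(dim)(first_fusion,  general_image_features)
# fourth_fusion = AttentionFuse(dim)(second_fusion, general_text_features)
```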
Step S4033: the adapter 302 of the image classification model obtains the target feature corresponding to the training image based on the third fusion feature, and obtains the target feature corresponding to the visual description text of the training image based on the fourth fusion feature.
Step S4033 can be implemented in several ways. In one possible implementation, the adapter 302 processes the third fusion feature into a feature adapted to the image classification task, used as the target feature corresponding to the training image, and processes the fourth fusion feature into a feature adapted to the image classification task, used as the target feature corresponding to the visual description text of the training image. In another possible implementation, the third fusion feature is first processed into a feature adapted to the image classification task, used as the fifth fusion feature, and the fourth fusion feature is processed likewise, used as the sixth fusion feature; the fifth fusion feature is then fused with the general image features of the training image, the fused feature serving as the target feature corresponding to the training image, and the sixth fusion feature is fused with the general text features of the visual description text, the fused feature serving as the target feature corresponding to the visual description text of the training image.
Optionally, the adapter 302 of the image classification model may fuse the fifth fusion feature with the general image features of the training image using an attention mechanism, and fuse the sixth fusion feature with the general text features of the visual description text using an attention mechanism.
For the second implementation, as shown in fig. 5, the adapter 302 of the image classification model may include a third feature adjustment module 508I, a fourth feature adjustment module 508T, a third attention module 509I and a fourth attention module 509T. The third feature adjustment module 508I processes the third fusion feature into a feature adapted to the image classification task to obtain the fifth fusion feature, and the fourth feature adjustment module 508T processes the fourth fusion feature likewise to obtain the sixth fusion feature. The third attention module 509I determines the weights respectively corresponding to the fifth fusion feature and the general image features, weights each feature by its weight, and sums them; the weighted sum is the target feature corresponding to the training image. The fourth attention module 509T determines the weights respectively corresponding to the sixth fusion feature and the general text features, weights each feature by its weight, and sums them; the weighted sum is the target feature corresponding to the visual description text of the training image.
Optionally, the third feature adjustment module 508I and the fourth feature adjustment module 508T may each consist of linear layers (e.g., two linear layers) and a ReLU activation function.
Step S404: and determining the first loss according to the training image and the target characteristics respectively corresponding to the visual description text of the training image.
Specifically, a contrastive loss may be computed between the target feature corresponding to the training image and the target feature corresponding to the visual description text of the training image and used as the first loss.
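A minimal sketch of such a contrastive loss over a training batch, assuming a CLIP-style InfoNCE formulation in which matched image/text pairs share an index in the batch (the temperature value is illustrative):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img_feats, txt_feats, temperature: float = 0.07):
    # img_feats, txt_feats: (batch, d) target features of images and texts.
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    logits = img @ txt.t() / temperature          # (batch, batch) similarities
    labels = torch.arange(img.size(0), device=img.device)
    # Each image should match its own text and vice versa.
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2
```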
Step S405: and updating parameters of the image classification model according to the first loss.
Specifically, the parameters of all parts of the image classification model other than the image encoder and the text encoder are updated.
The image classification model is trained repeatedly according to steps S401 to S405 until a training end condition is met (for example, a set number of training iterations is reached).
In order to enhance the feature extraction capability of the image classification model, this embodiment provides another training manner. Referring to fig. 6, which shows a flow chart of this training manner, it may include:
step S601: the method comprises the steps of obtaining an original sample image from a training sample set and obtaining visual description text of the original sample image.
Step S602: preprocessing an original sample image, taking the preprocessed image as a training image, and taking a visual description text of the original sample image as a visual description text of the training image.
Here, preprocessing the original sample image may include masking some image blocks in it, for example replacing them with blocks whose pixel values are all 0, i.e., black blocks; the masked image blocks can be selected randomly.
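A minimal sketch of this random block masking, assuming the image is divided into a regular grid (the grid size and masking ratio are illustrative):

```python
import torch

def mask_random_blocks(image: torch.Tensor, grid: int = 8, mask_ratio: float = 0.25):
    # image: (C, H, W); H and W are assumed divisible by `grid`.
    c, h, w = image.shape
    bh, bw = h // grid, w // grid
    n_blocks = grid * grid
    n_mask = int(n_blocks * mask_ratio)
    idx = torch.randperm(n_blocks)[:n_mask]       # randomly chosen blocks
    out = image.clone()
    for i in idx.tolist():
        r, col = divmod(i, grid)
        # Replace the block with all-zero (black) pixels.
        out[:, r * bh:(r + 1) * bh, col * bw:(col + 1) * bw] = 0.0
    return out, idx
```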
Step S603: the image encoder based on the image classification model obtains general image features of the training image, and the text encoder based on the image classification model obtains general text features of visual description text of the training image.
The specific implementation of step S603 is similar to that of step S402 and is not repeated here.
Step S604: and based on the adapter of the image classification model, acquiring target features respectively corresponding to the training image and the visual description text of the training image based on the general image features of the training image and the general text features of the visual description text of the training image.
Specifically, the implementation of step S604 may include the following. The adapter of the image classification model fuses the general text features of the visual description text of the training image into the general image features of the training image to obtain the first feature, and fuses the general image features into the general text features to obtain the second feature. It then performs image feature extraction on the first feature to obtain the third feature, and text feature extraction on the second feature to obtain the fourth feature; next, image feature extraction on the third feature yields the fifth feature, and text feature extraction on the fourth feature yields the sixth feature. The adapter fuses the fourth feature into the fifth feature to obtain the first fusion feature, and the third feature into the sixth feature to obtain the second fusion feature; it then fuses the first fusion feature with the general image features to obtain the third fusion feature, and the second fusion feature with the general text features to obtain the fourth fusion feature. The third fusion feature is processed into a feature adapted to the image classification task to obtain the fifth fusion feature, and the fourth fusion feature likewise to obtain the sixth fusion feature. Finally, the fifth fusion feature is fused with the general image features of the training image, the fused feature serving as the target feature corresponding to the training image, and the sixth fusion feature is fused with the general text features of the visual description text, the fused feature serving as the target feature corresponding to the visual description text of the training image.
It should be noted that the specific process by which the adapter of the image classification model obtains each feature can be found in the corresponding part of the first training manner and is not repeated here.
Step S605: and restoring the training image according to intermediate features generated by the adapter of the image classification model in the process of acquiring the target features corresponding to the training image, so as to obtain a restored image.
Specifically, the intermediate feature may be input to a decoder for decoding to obtain the restored image. Optionally, the decoder may comprise, in order, a convolution module, a deconvolution operation and an upsampling module.
The intermediate feature in this step may be the third fusion feature described above.
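A minimal sketch of such a decoder with the stated module order; the channel counts, kernel sizes and the initial feature-to-map projection are illustrative assumptions:

```python
import torch.nn as nn

class RestorationDecoder(nn.Module):
    """Decodes the adapter's intermediate (third fusion) feature back into
    an image: convolution -> deconvolution -> upsampling."""
    def __init__(self, feat_dim: int = 512, side: int = 8):
        super().__init__()
        self.side = side
        self.proj = nn.Linear(feat_dim, 64 * side * side)  # feature -> small map
        self.conv = nn.Conv2d(64, 64, kernel_size=3, padding=1)
        self.deconv = nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1)
        self.upsample = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(32, 3, kernel_size=3, padding=1),
        )

    def forward(self, feat):                       # feat: (batch, feat_dim)
        x = self.proj(feat).view(-1, 64, self.side, self.side)
        x = self.conv(x)
        x = self.deconv(x)                         # doubles the spatial size
        return self.upsample(x)                    # restored image
```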
Step S606: and determining a first loss according to the target characteristics respectively corresponding to the training image and the visual description text of the training image, and determining a second loss according to the restored image and the original sample image.
Specifically, a contrastive loss may be computed between the target feature corresponding to the training image and the target feature corresponding to the visual description text of the training image and used as the first loss, and a mean squared error (MSE) loss may be computed between the restored image and the original sample image and used as the second loss.
When determining the MSE loss between the restored image and the original sample image, in one possible implementation the MSE loss may be computed over the entire restored image and the entire original sample image; in another possible implementation the MSE loss may be computed only between the image blocks at the masked positions in the restored image and the corresponding image blocks in the original sample image; in yet another possible implementation, both MSE losses may be computed and then fused (for example, by a weighted sum whose weights can be set according to the specific scenario), with the fused MSE loss taken as the final MSE loss, i.e., the second loss.
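The following sketch shows one plausible realization of the two losses. The embodiment only says "contrastive loss", so the symmetric InfoNCE-style form, the function name, and the alpha and temperature hyper-parameters are assumptions; the second loss fuses the whole-image and masked-region MSE variants by a weighted sum as described above.

```python
import torch
import torch.nn.functional as F

def first_and_second_losses(tgt_img, tgt_txt, restored, original, mask,
                            alpha=0.5, temperature=0.07):
    """Sketch of step S606. tgt_img/tgt_txt: (B, D) pooled target features;
    restored/original: (B, 3, H, W); mask: (B, 1, H, W), 1 at masked pixels."""
    # First loss: symmetric contrastive loss over the batch, pulling each
    # image towards its own visual description text.
    img = F.normalize(tgt_img, dim=-1)
    txt = F.normalize(tgt_txt, dim=-1)
    logits = img @ txt.t() / temperature
    labels = torch.arange(img.shape[0], device=img.device)
    first = (F.cross_entropy(logits, labels) +
             F.cross_entropy(logits.t(), labels)) / 2
    # Second loss: whole-image MSE fused with masked-region MSE.
    whole = F.mse_loss(restored, original)
    se = (restored - original) ** 2
    m = mask.expand_as(se)
    masked = (se * m).sum() / m.sum().clamp_min(1.0)
    second = alpha * whole + (1 - alpha) * masked
    return first, second
```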
Step S607: update the parameters of the image classification model according to the first loss and the second loss.
The image classification model is trained over multiple iterations according to the above procedure of step S601 to step S607, until a training end condition is satisfied (for example, a set number of training iterations is reached).
In order to improve the classification capability of the image classification model on difficult samples, the preprocessing of the original image sample may further include: replacing part of the image blocks in the original image sample with image blocks from a difficult sample corresponding to the original image sample (the image blocks to be replaced may be chosen at random).
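A minimal sketch of this preprocessing follows, assuming square patches, zero-filling for masking, and illustrative masking and replacement ratios; none of these specifics are given by the embodiment.

```python
import torch

def preprocess(image, hard_image, patch=16, mask_ratio=0.25, replace_ratio=0.1):
    """Mask some patches of the original image sample and replace some
    randomly chosen patches with patches from the difficult sample.
    image, hard_image: (3, H, W) with H, W divisible by `patch`."""
    c, h, w = image.shape
    gh, gw = h // patch, w // patch
    n = gh * gw
    out = image.clone()
    idx = torch.randperm(n)                 # random patch order
    n_mask = int(n * mask_ratio)
    n_rep = int(n * replace_ratio)
    for k, p in enumerate(idx[:n_mask + n_rep]):
        row, col = divmod(p.item(), gw)
        ys, xs = row * patch, col * patch
        if k < n_mask:
            out[:, ys:ys+patch, xs:xs+patch] = 0.0  # masked patch
        else:                                       # hard-sample patch
            out[:, ys:ys+patch, xs:xs+patch] = hard_image[:, ys:ys+patch,
                                                          xs:xs+patch]
    return out
```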
It should be noted that a difficult sample corresponding to the original image sample is a sample that is difficult to distinguish from the original image sample. The difficult sample may be determined based on the image classification model: if the true category of the original image sample is A but the category predicted for it by the image classification model is B, then an image sample of category B is a difficult sample corresponding to the original image sample.
Considering that the classification capability of the image classification model is poor at the initial stage of training, an image sample whose category differs from that of the original image sample can first be used as the difficult sample corresponding to the original image sample; after multiple rounds of training (for example, 50 rounds), the difficult sample corresponding to the original image sample can be determined based on the partially trained image classification model.
The difficult sample corresponding to the original image sample may be determined based on the image classification model as follows: acquire, based on the image classification model, the target features respectively corresponding to the original image sample and its visual description text; identify entities in the visual description text of the original image sample to obtain a plurality of entities; extract the feature corresponding to each entity from the target feature corresponding to the visual description text; calculate the correlation between the feature corresponding to each entity and the target feature corresponding to the original image sample; then remove the entity matching the category of the original image sample, take the remaining entity whose feature has the highest correlation with the target feature of the original image sample as the target category, and take an image sample of the target category as the difficult sample corresponding to the original image sample.
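A sketch of this selection step, using cosine similarity as a stand-in for the correlation measure the embodiment leaves unspecified:

```python
import torch
import torch.nn.functional as F

def hard_sample_category(img_target, entity_feats, entities, true_category):
    """Pick the difficult-sample category: drop the true category, then take
    the remaining entity whose feature correlates most with the image's
    target feature. img_target: (D,); entity_feats: (N, D); entities: list
    of N entity strings extracted from the visual description text."""
    sims = F.cosine_similarity(entity_feats, img_target.unsqueeze(0), dim=-1)
    best, best_sim = None, float("-inf")
    for name, sim in zip(entities, sims.tolist()):
        if name == true_category:
            continue  # the true category cannot be its own hard negative
        if sim > best_sim:
            best, best_sim = name, sim
    return best  # images of this category serve as difficult samples
```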
In addition, whether the first or the second training manner is adopted, the parameters of the image encoder and the text encoder are frozen while the image classification model is trained. Freezing the parameters of the two encoders both preserves the knowledge learned from the large number of general-field samples and improves training efficiency.
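Putting the pieces together, here is a condensed sketch of one training iteration with both encoders frozen; `model`, `decoder`, `loader`, and the helper functions are the assumed names from the sketches above, and the square patch grid and loss weighting are likewise assumptions.

```python
import torch

# Freeze both encoders; only the adapter and the decoder are optimized.
for p in model.image_encoder.parameters():
    p.requires_grad = False
for p in model.text_encoder.parameters():
    p.requires_grad = False
optimizer = torch.optim.AdamW(
    (p for p in list(model.parameters()) + list(decoder.parameters())
     if p.requires_grad), lr=1e-4)

# One pass over steps S601-S607.
for images, hard_images, texts, masks, originals in loader:
    inputs = torch.stack([preprocess(i, h)
                          for i, h in zip(images, hard_images)])
    img_general = model.image_encoder(inputs)   # (B, L, D) patch tokens
    txt_general = model.text_encoder(texts)     # (B, L, D) token features
    tgt_img, tgt_txt = model.adapter(img_general, txt_general)
    # restore from the intermediate (third fusion) feature kept by the adapter
    f3 = model.adapter.third_fusion             # (B, L, D)
    side = int(f3.shape[1] ** 0.5)              # assume a square patch grid
    restored = decoder(f3.transpose(1, 2).reshape(f3.shape[0], -1, side, side))
    first, second = first_and_second_losses(
        tgt_img.mean(dim=1), tgt_txt.mean(dim=1), restored, originals, masks)
    loss = first + second                       # equal weighting is assumed
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```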
The image classification model learns not only the characteristics of the small number of samples in the specific field but also the characteristics of the large number of samples in the general field, and therefore has better generalization capability. In addition, the image classification model provided by the invention performs multi-stage interactive fusion on the features of the two modalities, so that the two modalities can be fully integrated and the model effect further improved. Furthermore, masking part of the image blocks in the original image sample, combined with the decoder and the second loss described above, improves the feature extraction capability of the image classification model, while replacing part of the image blocks in the original image sample with image blocks of the difficult sample improves the model's classification capability on difficult samples.
After training, a target image in the specific field and a target text (the visual description text of the target image) can be acquired; general image features are extracted from the target image and general text features from the target text based on the trained image classification model; then, taking the general image features and the general text features as input, the target features respectively corresponding to the target image and the target text are acquired based on the trained image classification model, and the category of the target image is determined according to these target features.
Specifically, the image encoder of the image classification model may extract the general image features from the target image, and the text encoder of the image classification model may extract the general text features from the target text.
Taking an image classification model trained in the second training manner as an example, the process of acquiring, based on the trained model, the target features respectively corresponding to the target image and the target text from the general image features of the target image and the general text features of the target text may include:
Step b1: based on the image classification model, acquire, according to the general image features of the target image and the general text features of the target text, specific field image features blended with the text features of the target text to obtain a first fusion feature, and specific field text features blended with the image features of the target image to obtain a second fusion feature.
Specifically: firstly, based on the adapter of the image classification model, the general text features of the target text are blended into the general image features of the target image to obtain a first feature, and the general image features of the target image are blended into the general text features of the target text to obtain a second feature; then image feature extraction is performed on the first feature to obtain a third feature, and text feature extraction on the second feature to obtain a fourth feature; next, image feature extraction is performed on the third feature to obtain a fifth feature, and text feature extraction on the fourth feature to obtain a sixth feature; finally, the fourth feature is blended into the fifth feature to obtain the first fusion feature, and the third feature into the sixth feature to obtain the second fusion feature.
Step b2: based on the image classification model, fuse the first fusion feature with the general image features of the target image to obtain a third fusion feature, and fuse the second fusion feature with the general text features of the target text to obtain a fourth fusion feature.
Specifically, based on the adapter of the image classification model, the first fusion feature may be fused with the general image features of the target image using an attention mechanism, and the second fusion feature may be fused with the general text features of the target text using an attention mechanism.
Step b3: based on the image classification model, acquire the target feature corresponding to the target image according to the third fusion feature, and the target feature corresponding to the target text according to the fourth fusion feature.
Specifically: firstly, based on the adapter of the image classification model, the third fusion feature is processed into a feature adapted to the image classification task to obtain a fifth fusion feature, and the fourth fusion feature is processed into a feature adapted to the image classification task to obtain a sixth fusion feature; then, based on the adapter, the fifth fusion feature is fused with the general image features of the target image to obtain the target feature corresponding to the target image, and the sixth fusion feature is fused with the general text features of the target text to obtain the target feature corresponding to the target text.
Specifically, based on the adapter of the image classification model, the fifth fusion feature may be fused with the general image features of the target image using an attention mechanism, and the sixth fusion feature with the general text features of the target text using an attention mechanism. A brief sketch of this attention-based fusion appears below.
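The attention-based fusion of steps b2 and b3 can be sketched as follows. Treating the fusion feature as the query and the general features as keys and values is an assumption, as are the dimensions and the residual connection; the embodiment only states that an attention mechanism is used.

```python
import torch
import torch.nn as nn

dim, heads = 512, 8  # assumed dimensions
attn = nn.MultiheadAttention(dim, heads, batch_first=True)

def attention_fuse(fusion_feat, general_feat):
    # fusion_feat attends to the general features; both are (B, L, dim)
    fused, _ = attn(query=fusion_feat, key=general_feat, value=general_feat)
    return fused + fusion_feat  # residual keeps the original fusion feature
```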
It should be noted that the process of acquiring the target features respectively corresponding to the target image and the target text based on the image classification model is similar to the process of acquiring, during training, the target features respectively corresponding to the training image and its visual description text; since the latter has been described in detail above, its implementation is not repeated here.
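Finally, a hedged end-to-end inference sketch. Here `model` is an assumed container bundling the frozen encoders and the adapter, the encoders are assumed to accept raw inputs and return (1, L, D) token features, and, unlike the embodiment, each entity's feature is obtained by re-encoding the entity string rather than being extracted from the target features of the target text.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def classify(model, image, text, entities):
    """Pick the entity (candidate category) whose target feature correlates
    most with the target image's target feature."""
    img_general = model.image_encoder(image)
    txt_general = model.text_encoder(text)
    tgt_img, _ = model.adapter(img_general, txt_general)
    img_vec = tgt_img.mean(dim=1)               # pool tokens to one vector
    best, best_sim = None, float("-inf")
    for entity in entities:                     # entities identified in `text`
        ent_general = model.text_encoder(entity)
        _, tgt_ent = model.adapter(img_general, ent_general)
        sim = F.cosine_similarity(img_vec, tgt_ent.mean(dim=1)).item()
        if sim > best_sim:
            best, best_sim = entity, sim
    return best                                 # predicted category
```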
In addition, if the image classification model is obtained by training in the first training manner, the implementation process of obtaining the target features corresponding to the target image and the target text respectively based on the image classification model may refer to the implementation process of obtaining the target features corresponding to the training image and the visual description text of the training image respectively based on the image classification model in the first training manner, which is not described herein in detail.
An embodiment of the present invention further provides an image classification device; the image classification device described below and the image classification method described above may be referred to in correspondence with each other.
Referring to fig. 7, a schematic structural diagram of an image classification device according to an embodiment of the invention is shown, where the image classification device may include: a data acquisition module 701, a general feature acquisition module 702, a target feature acquisition module 703, and an image category determination module 704.
The data acquisition module 701 is configured to acquire a target image and a target text in a specific field, where the target text is a visual description text of the target image.
The general feature acquisition module 702 is configured to extract general image features from the target image and general text features from the target text.
And a target feature acquiring module 703, configured to acquire target features corresponding to the target image and the target text respectively according to the general image feature and the general text feature.
The target features corresponding to the target image are target image features blended with the text features of the target text, where the target image features include general image features and specific field image features of the target image; the target features corresponding to the target text are target text features blended with the image features of the target image, where the target text features include general text features and specific field text features of the target text.
And the image category determining module 704 is configured to determine a category of the target image according to target features corresponding to the target image and the target text respectively.
Alternatively, the image category determination module 704 may include: the system comprises an entity identification sub-module, an entity characteristic acquisition sub-module, a correlation determination sub-module and an image category determination sub-module.
And the entity identification sub-module is used for identifying the entities from the target text to obtain a plurality of entities, wherein each entity is a candidate category of the target image.
And the entity characteristic obtaining sub-module is used for extracting the characteristics respectively corresponding to the entities from the target characteristics corresponding to the target text.
And the correlation determination submodule is used for calculating the correlation between the feature corresponding to each entity and the target feature corresponding to the target image.
And the image category determining sub-module is used for determining the category of the target image from the entities according to the calculated correlation.
Optionally, the general feature obtaining module 702 is specifically configured to, when extracting general image features from the target image and general text features from the target text:
and extracting general image features from the target image and general text features from the target text based on the pre-trained image classification model.
The target feature acquiring module 703 is specifically configured to, when acquiring the target features corresponding to the target image and the target text according to the general image feature and the general text feature, respectively:
based on the image classification model, acquiring target features corresponding to the target image and the target text respectively according to the general image features and the general text features;
the image classification model is obtained by training the training image in the specific field and the visual description text of the training image.
Optionally, the target feature acquiring module 703, when acquiring, based on the image classification model and according to the general image features and the general text features, the target features respectively corresponding to the target image and the target text, is specifically configured to:
based on the image classification model, acquiring specific field image features fused with text features of the target text based on the general image features and the general text features to obtain first fusion features, and acquiring specific field text features fused with image features of the target image to obtain second fusion features;
Based on the image classification model, fusing the first fusion feature with the general image feature to obtain a third fusion feature, and fusing the second fusion feature with the general text feature to obtain a fourth fusion feature;
and based on the image classification model, acquiring target features corresponding to the target image according to the third fusion features, and acquiring target features corresponding to the target text according to the fourth fusion features.
Optionally, the target feature acquiring module 703, when acquiring, based on the image classification model and according to the general image features and the general text features, the specific field image features blended with the text features of the target text to obtain the first fusion feature, and the specific field text features blended with the image features of the target image to obtain the second fusion feature, is specifically configured to:
based on the image classification model, merging the universal text feature into the universal image feature to obtain a first feature, and merging the universal image feature into the universal text feature to obtain a second feature;
based on the image classification model, extracting image features from the first features to obtain third features, and extracting text features from the second features to obtain fourth features;
Based on the image classification model, extracting image features from the third features to obtain fifth features, and extracting text features from the fourth features to obtain sixth features;
and based on the image classification model, merging the fourth feature into the fifth feature to obtain the first merged feature, and merging the third feature into the sixth feature to obtain the second merged feature.
Optionally, the target feature acquiring module 703, when acquiring, based on the image classification model, the target feature corresponding to the target image according to the third fusion feature and the target feature corresponding to the target text according to the fourth fusion feature, is specifically configured to:
processing the third fusion feature into a feature which is adapted to an image classification task based on the image classification model to obtain a fifth fusion feature, and processing the fourth fusion feature into a feature which is adapted to the image classification task to obtain a sixth fusion feature;
and based on the image classification model, fusing the fifth fusion feature with the general image feature to obtain a target feature corresponding to the target image, and fusing the sixth fusion feature with the general text feature to obtain a target feature corresponding to the target text.
Optionally, the target feature obtaining module 703 is specifically configured to, when fusing the first fusion feature with the generic image feature based on the image classification model to obtain a third fusion feature, and fusing the second fusion feature with the generic text feature to obtain a fourth fusion feature:
based on the image classification model, fusing the first fusion feature with the general image feature by adopting an attention mechanism to obtain a third fusion feature, and fusing the second fusion feature with the general text feature by adopting an attention mechanism to obtain a fourth fusion feature.
optionally, when the target feature obtaining module 703 fuses the fifth fusion feature with the generic image feature based on the image classification model to obtain a target feature corresponding to the target image, and fuses the sixth fusion feature with the generic text feature to obtain a target feature corresponding to the target text, the target feature obtaining module is specifically configured to:
and based on the image classification model, fusing the fifth fusion feature with the general image feature by adopting an attention mechanism to obtain a target feature corresponding to the target image, and fusing the sixth fusion feature with the general text feature by adopting an attention mechanism to obtain a target feature corresponding to the target text.
Optionally, the image classification model includes an image encoder, a text encoder and an adapter, where the image encoder is obtained by pre-training on image samples in the general field, and the text encoder is obtained by pre-training on text samples in the general field.
The image classification device provided by the embodiment of the invention can further comprise: and a model training module. The model training module is used for:
acquiring general image features of the training image based on the image encoder, and acquiring general text features of visual description text of the training image based on the text encoder;
based on the adapter, acquiring target features corresponding to the training image and the visual description text of the training image respectively according to the general image features of the training image and the general text features of the visual description text of the training image, wherein the target features corresponding to the training image are training image features blended with the text features of the visual description text of the training image, the training image features comprise general image features and specific field image features of the training image, the target features corresponding to the visual description text of the training image are training text features blended with the image features of the training image, and the training text features comprise general text features and specific field text features of the visual description text of the training image;
And determining a first loss according to the target characteristics respectively corresponding to the training image and the visual description text of the training image, and updating parameters of the image classification model according to the first loss, wherein in the training process, the parameters of the image encoder and the text encoder are not updated.
Optionally, the training image is an image obtained by preprocessing an original image sample obtained from a training sample set in a specific field, wherein the preprocessing comprises masking part of image blocks in the original image sample; the model training module is also for:
restoring the training image according to intermediate characteristics generated by the adapter in the process of acquiring target characteristics corresponding to the training image, so as to obtain a restored image;
determining a second loss from the restored image and the original image sample;
the updating the parameters of the image classification model according to the first loss comprises the following steps:
and updating parameters of the image classification model according to the first loss and the second loss.
Optionally, the preprocessing further includes: and replacing part of image blocks in the original image sample with the image blocks of the difficult sample corresponding to the original image sample.
When classifying target images in a specific field, the image classification device provided by the embodiment of the present invention considers both the general features and the specific field features, and makes use of multi-modal features; it can therefore classify images in the specific field accurately.
An embodiment of the present invention provides an image classification apparatus, referring to fig. 8, which shows a schematic structural diagram of the image classification apparatus, the image classification apparatus may include: a processor 801, a communication interface 802, a memory 803, and a communication bus 804;
in the embodiment of the present invention, the number of the processor 801, the communication interface 802, the memory 803 and the communication bus 804 is at least one, and the processor 801, the communication interface 802 and the memory 803 complete communication with each other through the communication bus 804;
the processor 801 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention;

the memory 803 may include a high-speed RAM memory and may further include a non-volatile memory, for example at least one magnetic disk memory;

the memory stores a program; the processor may invoke the program stored in the memory, and the program is configured to:
acquiring a target image and a target text in a specific field, wherein the target text is a visual description text of the target image;
extracting general image features from the target image and general text features from the target text;
acquiring target characteristics corresponding to the target image and the target text respectively according to the general image characteristics and the general text characteristics, wherein the target characteristics corresponding to the target image are target image characteristics blended with the text characteristics of the target text, the target image characteristics comprise general image characteristics and specific field image characteristics of the target image, the target characteristics corresponding to the target text are target text characteristics blended with the image characteristics of the target image, and the target text characteristics comprise general text characteristics and specific field text characteristics of the target text;
and determining the category of the target image according to the target characteristics respectively corresponding to the target image and the target text.
Optionally, for the refined and extended functions of the program, reference may be made to the description above.
The embodiment of the present invention also provides a readable storage medium storing a program adapted to be executed by a processor, the program being configured to:
acquiring a target image and a target text in a specific field, wherein the target text is a visual description text of the target image;
extracting general image features from the target image and general text features from the target text;
acquiring target characteristics corresponding to the target image and the target text respectively according to the general image characteristics and the general text characteristics, wherein the target characteristics corresponding to the target image are target image characteristics blended with the text characteristics of the target text, the target image characteristics comprise general image characteristics and specific field image characteristics of the target image, the target characteristics corresponding to the target text are target text characteristics blended with the image characteristics of the target image, and the target text characteristics comprise general text characteristics and specific field text characteristics of the target text;
and determining the category of the target image according to the target characteristics respectively corresponding to the target image and the target text.
Optionally, for the refined and extended functions of the program, reference may be made to the description above.
Finally, it is further noted that relational terms such as first and second are used herein solely to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between such entities or operations. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
Each embodiment in this specification is described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and identical or similar parts among the embodiments may be cross-referenced.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. An image classification method, comprising:
acquiring a target image and a target text in a specific field, wherein the target text is a visual description text of the target image;
extracting general image features from the target image and general text features from the target text;
acquiring target characteristics corresponding to the target image and the target text respectively according to the general image characteristics and the general text characteristics, wherein the target characteristics corresponding to the target image are target image characteristics blended with the text characteristics of the target text, the target image characteristics comprise general image characteristics and specific field image characteristics of the target image, the target characteristics corresponding to the target text are target text characteristics blended with the image characteristics of the target image, and the target text characteristics comprise general text characteristics and specific field text characteristics of the target text;
And determining the category of the target image according to the target characteristics respectively corresponding to the target image and the target text.
2. The image classification method according to claim 1, wherein the determining the category of the target image according to the target features respectively corresponding to the target image and the target text includes:
identifying entities from the target text to obtain a plurality of entities, wherein each entity is a candidate category of the target image;
extracting the characteristics corresponding to the entities from the target characteristics corresponding to the target text;
calculating the correlation degree between the features corresponding to each entity and the target features corresponding to the target image;
and determining the category of the target image from the entities according to the calculated correlation.
3. The image classification method according to claim 1, wherein the extracting of general image features from the target image and of general text features from the target text, and the acquiring, according to the general image features and the general text features, of target features respectively corresponding to the target image and the target text, comprise:
Extracting general image features from the target image and general text features from the target text based on an image classification model obtained through pre-training;
based on the image classification model, acquiring target features corresponding to the target image and the target text respectively according to the general image features and the general text features;
the image classification model is obtained by training the training image in the specific field and the visual description text of the training image.
4. The image classification method according to claim 3, wherein the obtaining the target features respectively corresponding to the target image and the target text based on the general image features and the general text features includes:
acquiring specific field image features fused with the text features of the target text according to the general image features and the general text features to obtain first fusion features, and acquiring specific field text features fused with the image features of the target image to obtain second fusion features;
fusing the first fusion feature with the general image feature to obtain a third fusion feature, and fusing the second fusion feature with the general text feature to obtain a fourth fusion feature;
And acquiring target features corresponding to the target image according to the third fusion features, and acquiring target features corresponding to the target text according to the fourth fusion features.
5. The method according to claim 4, wherein the obtaining, according to the general image feature and the general text feature, the specific domain image feature fused with the text feature of the target text to obtain a first fused feature, and obtaining the specific domain text feature fused with the image feature of the target image to obtain a second fused feature includes:
the general text features are fused into the general image features to obtain first features, and the general image features are fused into the general text features to obtain second features;
extracting image features from the first features to obtain third features, and extracting text features from the second features to obtain fourth features;
extracting image features from the third features to obtain fifth features, and extracting text features from the fourth features to obtain sixth features;
and merging the fourth feature into the fifth feature to obtain a first merged feature, and merging the third feature into the sixth feature to obtain a second merged feature.
6. The method according to claim 4, wherein the obtaining the target feature corresponding to the target image according to the third fusion feature and obtaining the target feature corresponding to the target text according to the fourth fusion feature includes:
processing the third fusion feature into a feature which is adapted to an image classification task to obtain a fifth fusion feature, and processing the fourth fusion feature into a feature which is adapted to the image classification task to obtain a sixth fusion feature;
and fusing the fifth fusion feature with the general image feature to obtain a target feature corresponding to the target image, and fusing the sixth fusion feature with the general text feature to obtain a target feature corresponding to the target text.
7. The image classification method according to claim 2, wherein the image classification model comprises an image encoder, a text encoder and an adapter, wherein the image encoder is obtained by pre-training image samples in a general field, and the text encoder is obtained by pre-training text samples in the general field;
the training process of the image classification model comprises the following steps:
Acquiring general image features of the training image based on the image encoder, and acquiring general text features of visual description text of the training image based on the text encoder;
based on the adapter, acquiring target features corresponding to the training image and the visual description text of the training image respectively according to the general image features of the training image and the general text features of the visual description text of the training image, wherein the target features corresponding to the training image are training image features blended with the text features of the visual description text of the training image, the training image features comprise general image features and specific field image features of the training image, the target features corresponding to the visual description text of the training image are training text features blended with the image features of the training image, and the training text features comprise general text features and specific field text features of the visual description text of the training image;
and determining a first loss according to the target characteristics respectively corresponding to the training image and the visual description text of the training image, and updating parameters of the image classification model according to the first loss, wherein in the training process, the parameters of the image encoder and the text encoder are not updated.
8. An image classification apparatus, comprising: the system comprises a data acquisition module, a general feature acquisition module, a target feature acquisition module and an image category determination module;
the data acquisition module is used for acquiring a target image and a target text in a specific field, wherein the target text is a visual description text of the target image;
the general feature acquisition module is used for extracting general image features from the target image and general text features from the target text;
the target feature acquiring module is configured to acquire target features corresponding to the target image and the target text respectively according to the general image feature and the general text feature, where the target features corresponding to the target image are target image features that are integrated with text features of the target text, the target image features include general image features and specific field image features of the target image, the target features corresponding to the target text are target text features that are integrated with image features of the target image, and the target text features include general text features and specific field text features of the target text;
The image category determining module is used for determining the category of the target image according to the target characteristics corresponding to the target image and the target text respectively.
9. An image classification apparatus, characterized by comprising: a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the respective steps of the image classification method according to any one of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the image classification method according to any one of claims 1-7.
CN202311091521.8A 2023-08-28 2023-08-28 Image classification method, device, equipment and storage medium Pending CN117132819A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311091521.8A CN117132819A (en) 2023-08-28 2023-08-28 Image classification method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311091521.8A CN117132819A (en) 2023-08-28 2023-08-28 Image classification method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117132819A true CN117132819A (en) 2023-11-28

Family

ID=88850435

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311091521.8A Pending CN117132819A (en) 2023-08-28 2023-08-28 Image classification method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117132819A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination