CN116469110A - Image classification method, device, electronic equipment and computer readable storage medium - Google Patents


Info

Publication number
CN116469110A
Authority
CN
China
Prior art keywords
image
classification
text
layer
classified
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310454595.7A
Other languages
Chinese (zh)
Inventor
李春宇
张倩
胡兴
郝碧波
倪渊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202310454595.7A priority Critical patent/CN116469110A/en
Publication of CN116469110A publication Critical patent/CN116469110A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/191 Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19173 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/18 Extraction of features or characteristics of the image
    • G06V30/182 Extraction of features or characteristics of the image by coding the contour of the pattern
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/191 Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/1918 Fusion techniques, i.e. combining data from various sources, e.g. sensor fusion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/42 Document-oriented image-based pattern recognition based on the type of document
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to the technical field of image processing and the field of digital healthcare, and discloses an image classification method, an image classification apparatus, an electronic device, and a computer-readable storage medium. The method comprises the following steps: acquiring text information of an image to be classified, and performing first coding on the text information based on a prompt text coding layer to obtain a prompt text feature vector; performing second coding on the image to be classified based on an image coding layer to obtain an image feature vector; fusing the prompt text feature vector and the image feature vector to obtain a fusion vector; and performing classification prediction on the fusion vector based on a classification layer to obtain a classification result of the image to be classified. By combining the multi-modal, multi-dimensional information of text and images, the method avoids the single-modality approaches of the prior art (pure picture classification or pure text classification), exploits the complementarity of information from different dimensions, greatly improves the classification accuracy of the image classification model, and offers good robustness.

Description

Image classification method, device, electronic equipment and computer readable storage medium
Technical Field
The present disclosure relates to the field of image processing technology and the field of digital medical treatment, and in particular, to an image classification method, an apparatus, an electronic device, and a computer-readable storage medium.
Background
Document picture classification is a technique for classifying document pictures into preset categories using natural language processing methods. As a basic natural language processing technology, document picture classification is widely applied in fields such as insurance and healthcare. For example, in insurance-claim settlement, it is necessary to classify pictures of medical documents submitted by users, such as case reports, physical examination reports, examination sheets, discharge summaries, medical invoices, and case records.
In the prior art, document pictures can be classified by a picture classification model, but for document pictures that are highly similar yet belong to different categories, the classification accuracy of the prior art still has room for improvement. Alternatively, text recognition can be performed on the document pictures to obtain text recognition results, which are then classified by a text classification model; but for document pictures with similar text recognition results yet different categories, the prior art likewise needs further improvement. A method for improving the classification accuracy of document pictures is therefore needed.
Disclosure of Invention
In view of the above, embodiments of the present application provide an image classification method, apparatus, electronic device, and computer-readable storage medium, aiming to solve the problem of low accuracy in document picture classification.
In a first aspect, an embodiment of the present application provides an image classification method, where the image classification method is implemented based on an image classification model, where the model includes a prompt text coding layer, an image coding layer, and a classification layer, where the prompt text coding layer and the image coding layer are respectively connected to the classification layer;
the method comprises the following steps:
acquiring text information of an image to be classified, and carrying out first coding on the text information based on the prompt text coding layer to obtain a prompt text feature vector;
performing second coding on the image to be classified based on the image coding layer to obtain an image feature vector;
fusing the prompt text feature vector and the image feature vector to obtain a fusion vector;
and based on the classification layer, carrying out classification prediction on the fusion vector to obtain a classification result of the image to be classified.
In a second aspect, an embodiment of the present application further provides an image classification device, where an image classification model is deployed in the device, where the model includes a prompt text coding layer, an image coding layer, and a classification layer, where the prompt text coding layer and the image coding layer are respectively connected to the classification layer;
the device comprises:
the prompt text coding unit is used for acquiring text information of the images to be classified, and carrying out first coding on the text information based on the prompt text coding layer to obtain a prompt text feature vector;
the image coding unit is used for carrying out second coding on the image to be classified based on the image coding layer to obtain an image feature vector;
the fusion unit is used for fusing the prompt text feature vector and the image feature vector to obtain a fusion vector;
and the classification unit is used for carrying out classification prediction on the fusion vector based on the classification layer to obtain a classification result of the image to be classified.
In a third aspect, an embodiment of the present application further provides an electronic device, including: a processor; and a memory arranged to store computer executable instructions that, when executed, cause the processor to perform the steps of any of the image classification methods described above.
In a fourth aspect, embodiments of the present application also provide a computer-readable storage medium storing one or more programs that, when executed by an electronic device comprising a plurality of application programs, cause the electronic device to perform the steps of any of the image classification methods described above.
At least one of the technical solutions adopted in the embodiments of the present application can achieve the following beneficial effects:
according to the method, an image classification model is constructed, and text information in pictures to be classified is subjected to first coding based on a prompt text coding layer of the model to obtain prompt text feature vectors; secondly, based on an image coding layer of the model, performing second coding on the image to be classified to obtain an image feature vector; and finally, fusing the prompt text feature vector and the image feature vector, and carrying out classification prediction on the obtained fusion vector based on a classification layer of the model to obtain an image classification result. According to the method, a single classification method of pure picture classification or pure text classification in the prior art is avoided, the complementarity of different dimensional information is utilized by combining multi-mode and multi-dimensional information of texts and images, the classification accuracy of an image classification model is greatly improved, and the robustness is good.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
FIG. 1 shows a flow diagram of an image classification method according to one embodiment provided herein;
FIG. 2 illustrates a schematic diagram of an image classification model according to one embodiment provided herein;
FIG. 3 illustrates a schematic diagram of the structure of an image encoding layer according to one embodiment provided herein;
FIG. 4 shows a flow diagram of an image classification method according to another embodiment provided herein;
FIG. 5 shows a schematic diagram of an image classification apparatus according to an embodiment provided herein;
fig. 6 shows a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
To make the objectives, technical solutions, and advantages of the present application clearer, the technical solutions of the present application will be described clearly and completely below with reference to specific embodiments of the present application and the corresponding drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art without creative effort based on the present disclosure fall within the scope of protection of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that such uses may be interchanged where appropriate, such that the embodiments of the present application described herein may be implemented in sequences other than those illustrated or described herein. Furthermore, the terms "include" and variations thereof are to be interpreted as open-ended terms that mean "include, but are not limited to".
The following describes in detail the technical solutions provided by the embodiments of the present application with reference to the accompanying drawings.
Document picture classification is a technique for classifying document pictures into preset categories using natural language processing methods. As a basic natural language processing technology, it is widely applied in fields such as insurance and healthcare; for example, in insurance-claim settlement it is necessary to classify pictures of medical documents submitted by users, such as case reports, physical examination reports, examination sheets, discharge summaries, medical invoices, and case records. In the prior art, document pictures can be classified by a picture classification model, which is strong at learning image features. However, as the business deepens, scenarios keep arising that require classifying pictures of visually similar types, such as color ultrasound document pictures of several different body parts within case records: because the pictures are highly similar, the picture classification model struggles to classify them accurately, and the classification accuracy of the prior art leaves room for improvement. Alternatively, text recognition can be performed on the document pictures and the recognition results classified with a text classification model, which alleviates some of the problems of picture-only classification; but a text classification model loses the layout information of the whole picture, so for pictures of different categories with similar text information the model is again difficult to classify accurately, and the prior art needs further improvement.
In view of this, the present application provides an image classification method that avoids the single-modality classification approaches of the prior art (pure picture classification or pure text classification); by combining the multi-modal, multi-dimensional information of text and images, it exploits the complementarity of information from different dimensions, greatly improves the classification accuracy of the image classification model, and offers good robustness.
Fig. 1 shows a flow chart of an image classification method according to an embodiment provided in the present application, and as can be seen from fig. 1, the present application at least includes steps S101 to S104:
step S101: and acquiring text information of the image to be classified, and carrying out first coding on the text information based on the prompt text coding layer to obtain a prompt text feature vector.
The image classification method of the present application is implemented based on an image classification model, fig. 2 shows a schematic structural diagram of the image classification model according to an embodiment provided in the present application, and as can be seen from fig. 2, the image classification model 200 includes a prompt text coding layer 201, an image coding layer 202 and a classification layer 203, where the prompt text coding layer 201 and the image coding layer 202 are respectively connected to the classification layer 203, and an output of the classification layer 203 is used as an output of the image classification model 200.
First, text information of an image to be classified is acquired. In some embodiments of the present application, the acquiring text information of the image to be classified includes: acquiring the image to be classified; performing text recognition processing on the images to be classified to obtain text recognition results; and performing word segmentation on the text recognition result to obtain text information of the image to be classified.
In this embodiment, the images to be classified may come from a wide variety of sources. For example, when a merchant registers on a take-out platform, the merchant needs to photograph or scan documents such as a food business license, business license, health certificate, sanitation permit, and pollution discharge permit, and upload the images to the platform, yielding document pictures of the corresponding materials. As another example, when an insurance company settles a claim, it may photograph or scan medical documents submitted by the user, such as case reports, physical examination reports, examination sheets, discharge summaries, medical invoices, and case records, yielding the corresponding document pictures.
Text recognition processing is performed on the acquired image to be classified to obtain a text recognition result, and word segmentation processing is then performed on that result to obtain the text information of the image to be classified. Specifically, in some embodiments, the text recognition result may be obtained by text extraction with optical character recognition (OCR), for example yielding the text recognition result "A limited company B clinic DR image slice"; a word segmentation tool such as jieba, SnowNLP, THULAC, or NLPIR may then be applied to the recognition result, for example yielding the text information "A/limited company/B/clinic/DR image slice" of the image to be classified.
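The OCR-then-segment pipeline described above can be sketched as follows. This is a minimal illustration with hypothetical helper names: `ocr_extract` stands in for a real OCR engine and `segment` for a real segmenter such as jieba's `lcut`, neither of which is called here.

```python
# Sketch of the text-information pipeline: OCR the image, segment the
# recognized text, and join the tokens with "/" as in the example above.

def ocr_extract(image_path: str) -> str:
    """Hypothetical stub for an OCR engine over a document image."""
    # A real implementation would invoke an OCR engine here.
    return "A limited company B clinic DR image slice"

def segment(text: str) -> list[str]:
    """Hypothetical stub; for Chinese text, jieba.lcut(text) would be used."""
    return text.split()

def text_information(image_path: str) -> str:
    """Produce the '/'-joined text information of an image to be classified."""
    tokens = segment(ocr_extract(image_path))
    return "/".join(tokens)

print(text_information("to_classify.png"))
```

In practice the segmenter, not whitespace splitting, decides the token boundaries, which is why the document's example groups "limited company" into one token.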
Then, based on the prompt text coding layer, the text information is subjected to first coding to obtain the prompt text feature vector. In some embodiments, the prompt text coding layer may be built on a text coding model such as a BERT model, a CNN model, an RNN model, or an ELMo model, which is not limited in this application.
Step S102: and carrying out second coding on the image to be classified based on the image coding layer to obtain an image characteristic vector.
The image to be classified can be input into an image coding layer to obtain an image characteristic vector. Specifically, in some embodiments of the present application, in the above method, the performing, based on the image coding layer, second coding on the image to be classified to obtain an image feature vector includes: dividing the image to be classified into a plurality of sub-images; and respectively carrying out vectorization processing on each sub-image, and taking the obtained plurality of image token vectors as image feature vectors.
Fig. 3 is a schematic structural diagram of an image coding layer according to an embodiment provided in the present application, and the second coding process described above is exemplarily described below with reference to fig. 3.
The image to be classified may be divided equally into nine sub-images, which are unfolded in a fixed order into a sequence; each sub-image in the sequence corresponds to one token. The nine tokens are input into a fully-connected layer for a linear operation, yielding token vectors 1 through 9, and these nine image token vectors can be used as the image feature vectors.
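The split-flatten-project step above can be sketched with NumPy. The image size, patch size, and 768-dimensional projection are illustrative assumptions; the weights here are random placeholders for a trained fully-connected layer.

```python
import numpy as np

# Minimal sketch of the second-encoding step: split a square image into a
# 3x3 grid of sub-images, flatten each into a token, and project every token
# with one shared fully-connected (linear) layer.
rng = np.random.default_rng(0)
image = rng.random((96, 96, 3))          # image to be classified (H, W, C)
patch = 32                               # 96 / 3 -> nine 32x32 sub-images

tokens = []
for i in range(0, 96, patch):
    for j in range(0, 96, patch):
        tokens.append(image[i:i + patch, j:j + patch].reshape(-1))  # flatten
tokens = np.stack(tokens)                # (9, 32*32*3)

W = rng.random((tokens.shape[1], 768))   # shared projection weights (untrained)
image_feature_vectors = tokens @ W       # nine image token vectors
print(image_feature_vectors.shape)       # (9, 768)
```

Sharing one projection matrix across all nine patches mirrors how a single fully-connected layer processes every token in the sequence.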
Step S103: and fusing the prompt text feature vector and the image feature vector to obtain a fusion vector.
After the prompt text feature vector and the image feature vector are obtained, they can be fused to obtain a fusion vector. Specifically, in some embodiments, the text feature vector and the image feature vector may be spliced in front-to-back or top-to-bottom order to obtain the fusion vector.
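The splicing described above is a plain concatenation; the token counts and dimensions below are illustrative assumptions.

```python
import numpy as np

# Fusion by splicing (concatenation) of the prompt text feature vector and
# the image feature vector.
prompt_text_features = np.ones((3, 768))   # e.g. three text token vectors
image_features = np.zeros((9, 768))        # e.g. nine image token vectors

# "Front-to-back" splicing: text tokens first, then image tokens.
fusion_vector = np.concatenate([prompt_text_features, image_features], axis=0)
print(fusion_vector.shape)                 # (12, 768)
```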
Step S104: and based on the classification layer, carrying out classification prediction on the fusion vector to obtain a classification result of the image to be classified.
After the fusion vector is obtained, classification prediction can be performed on it based on the classification layer, thereby obtaining the classification result of the image to be classified. The classification layer may be a fully-connected layer, or a network layer implemented based on the Light Gradient Boosting Machine (LightGBM), eXtreme Gradient Boosting (XGBoost), or Gradient Boosting Decision Tree (GBDT) algorithms, which is not limited in this application.
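For the fully-connected variant of the classification layer, the prediction reduces to a linear map plus softmax over the preset categories. The sizes and the random weights below are illustrative stand-ins for a trained layer.

```python
import numpy as np

# Sketch of a fully-connected classification layer over a (pooled) fusion
# vector: one linear map followed by softmax yields a score per category.
rng = np.random.default_rng(1)
fusion_vector = rng.random(768)            # pooled fusion vector (illustrative)
num_classes = 7

W = rng.random((768, num_classes))         # untrained placeholder weights
b = np.zeros(num_classes)
logits = fusion_vector @ W + b
scores = np.exp(logits - logits.max())     # subtract max for stability
scores /= scores.sum()                     # softmax: scores sum to 1
print(scores.shape)
```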
As can be seen from the method shown in fig. 1, the present application constructs an image classification model. First, based on the prompt text coding layer of the model, the text information in the image to be classified is subjected to first coding to obtain a prompt text feature vector; second, based on the image coding layer of the model, the image to be classified is subjected to second coding to obtain an image feature vector; finally, the prompt text feature vector and the image feature vector are fused, and the resulting fusion vector is subjected to classification prediction based on the classification layer of the model to obtain the image classification result. By combining the multi-modal, multi-dimensional information of text and images, the method avoids the single-modality classification approaches of the prior art (pure picture classification or pure text classification), exploits the complementarity of information from different dimensions, greatly improves the classification accuracy of the image classification model, and offers good robustness.
In some embodiments of the present application, in the above method, the image classification model is constructed based on a prompt learning idea and is obtained through training.
The prompt learning idea in this embodiment embeds the input into a specific template, reconstructing the downstream task for different pre-trained models. For example, for a text sentiment classification task whose original input is "I like this movie" and whose output is "positive" or "negative", introducing prompt learning with the template "[original input] because it is a '[output]' movie" turns the input into "I like this movie because it is a '[output]' movie", thereby reconstructing the downstream task as filling in the output slot. Introducing prompt learning allows the pre-trained model to learn from only a small number of samples, and a reasonably performing model can be obtained even with no samples at all.
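The template embedding above amounts to simple string interpolation; the template wording and the `[MASK]` slot marker are illustrative assumptions, not the application's exact prompt.

```python
# Minimal illustration of embedding an input into a prompt template for a
# sentiment-classification downstream task.
TEMPLATE = "{input} because it is a {label} movie"

def build_prompt(original_input: str, label_slot: str = "[MASK]") -> str:
    """Embed the original input into the template; the model fills the slot."""
    return TEMPLATE.format(input=original_input, label=label_slot)

print(build_prompt("I like this movie"))
```

A masked-language-model head would then score candidate labels ("positive", "negative") at the slot position.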
According to the embodiment, the text prompt information can be used for the image classification model by introducing the prompt learning thought, so that the defect of a single image classification method of pure picture classification and pure text classification is effectively overcome, the image classification model can be trained by small-batch training data, a good effect can be obtained, and the robustness of the image classification model and the accuracy of the model classification result are greatly improved.
In some embodiments of the present application, in the above method, the prompt text coding layer is constructed based on a BERT model, and performing first coding on the text information based on the prompt text coding layer to obtain the prompt text feature vector comprises: performing first coding on the text information based on the Embedding layer of the prompt text coding layer, and using the resulting text token vectors as the prompt text feature vector.
In this embodiment, the Embedding layer is composed of a Token Embedding layer, a Segment Embedding layer, and a Position Embedding layer. The Token Embedding layer converts each word in the text information into a word vector of fixed dimension; in the BERT model, each word is converted into a 768-dimensional vector representation. The Segment Embedding layer distinguishes the two sentences in the text information: each word of the first sentence is marked 0 and each word of the second sentence is marked 1, yielding the type vector of the text information; if the text information contains only one sentence, the type vector is all zeros. The Position Embedding layer encodes the sequence position of each word in the text information, yielding a position vector. The word vector, type vector, and position vector corresponding to each word are then added element-wise; for example, text token vector 1 for word 1, text token vector 2 for word 2, and text token vector 3 for word 3 may be obtained, and these text token vectors can be used as the prompt text feature vector.
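The three-way embedding sum above can be sketched as table lookups followed by element-wise addition. The vocabulary size, token ids, and random tables are illustrative placeholders for trained BERT embeddings.

```python
import numpy as np

# Sketch of the BERT-style Embedding layer: per-token word, segment (type),
# and position vectors are looked up and summed element-wise (768-dim).
rng = np.random.default_rng(2)
vocab, dim, seq_len = 100, 768, 3          # three-token example

token_embedding = rng.random((vocab, dim))     # Token Embedding table
segment_embedding = rng.random((2, dim))       # sentence 0 vs sentence 1
position_embedding = rng.random((seq_len, dim))

token_ids = np.array([7, 21, 42])              # word 1, word 2, word 3
segment_ids = np.zeros(seq_len, dtype=int)     # single sentence: all zeros

text_token_vectors = (token_embedding[token_ids]
                      + segment_embedding[segment_ids]
                      + position_embedding[np.arange(seq_len)])
print(text_token_vectors.shape)                # (3, 768)
```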
In some embodiments of the present application, in the above method, the performing, based on the classification layer, classification prediction on the fusion vector to obtain a classification result of the image to be classified includes: taking the prompt text feature vector contained in the fusion vector as a prompt word, taking the image feature vector contained in the fusion vector as content to be classified, and carrying out classification prediction scoring based on the classification layer; and determining the image to be classified as any one of a plurality of preset classification categories of the image classification model according to the classification prediction scoring result.
The prompt text feature vector contained in the fusion vector can be used as the prompt word, and the image feature vector contained in the fusion vector as the content to be classified. For example, the prompt word represented by the text feature vector may be "a '[classification result]' color ultrasound image", and the image feature vector is then classified under the guidance of this prompt word.
The fusion vector is classified, predicted, and scored based on the classification layer, and the image to be classified is determined to belong to one of the preset classification categories of the image classification model according to the scoring result. Specifically, in some embodiments, the preset classification categories may include case report, physical examination report, examination sheet, discharge summary, medical invoice, and case record; classification prediction on the fusion vector then yields a score vector, for example a 7×1 vector whose elements correspond in order to the preset categories. If the element corresponding to the discharge-summary category is the largest, e.g. the 0.6 in the vector [0.1, 0.1, 0.05, 0.05, 0.6, 0.05, 0.05], the category of the image to be classified is determined to be the discharge summary.
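The final category selection is an argmax over the score vector. The category list and ordering below are illustrative assumptions chosen to match the example scores.

```python
import numpy as np

# Mapping a classification-prediction score vector to a preset category:
# the category with the highest score wins.
categories = ["case report", "physical examination report", "examination sheet",
              "medical invoice", "discharge summary", "case record", "other"]
scores = np.array([0.1, 0.1, 0.05, 0.05, 0.6, 0.05, 0.05])

predicted = categories[int(np.argmax(scores))]
print(predicted)                            # discharge summary
```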
In some embodiments of the present application, in the above method, the image coding layer is constructed based on at least one of a VIT model, a CNN model, and an MLP-Mixer model.
In this embodiment, the image coding layer is constructed based on at least one of a VIT model, a CNN model, and an MLP-Mixer model. In some embodiments, the image coding layer may also be constructed based on a ConvMixer model, a res net model, or the like, which is not limited in this application.
The VIT (Vision Transformer) model combines knowledge from the fields of computer vision and natural language processing: it first splits the original picture into blocks, flattens them into a sequence, feeds the sequence into the Encoder of the original Transformer model, and finally attaches a fully-connected layer to classify the picture. The VIT model has a simple structure, good classification performance, and strong scalability.
The CNN (Convolutional Neural Network) model is widely applied in many fields and runs across different platforms, such as smartphones, security systems, and automotive driver-assistance systems. Properties such as weight sharing, local connectivity, and translation invariance have brought the CNN model great success in a variety of visual tasks, making it a preferred solution for image classification.
The MLP-Mixer model replaces convolution operations and the self-attention mechanism with multi-layer perceptrons (MLPs), completing the image classification task entirely with basic matrix multiplications, data reshaping, and nonlinear layers, which reduces the degrees of freedom of feature extraction. It alternately exchanges information across patches (token mixing) and within each patch (channel mixing). The MLP-Mixer model has a simple structure, and its performance improves steadily as the pre-training dataset grows.
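The alternation of token mixing and channel mixing can be sketched with plain matrix multiplications (a simplified layer: LayerNorm is omitted and the weights are random, so this illustrates only the data flow, not a trained model):

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU nonlinearity
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi)
                                    * (x + 0.044715 * x ** 3)))

def mixer_layer(x, w_t1, w_t2, w_c1, w_c2):
    """One simplified MLP-Mixer layer on a (patches x channels) matrix:
    token mixing exchanges information across patches (columns-wise MLP),
    channel mixing exchanges information within each patch (row-wise MLP)."""
    x = x + w_t2 @ gelu(w_t1 @ x)   # token mixing, with residual connection
    x = x + gelu(x @ w_c1) @ w_c2   # channel mixing, with residual connection
    return x

rng = np.random.default_rng(0)
s, c, h = 4, 8, 16                  # patches, channels, hidden width
x = rng.normal(size=(s, c))
out = mixer_layer(x,
                  rng.normal(size=(h, s)), rng.normal(size=(s, h)),
                  rng.normal(size=(c, h)), rng.normal(size=(h, c)))
print(out.shape)  # (4, 8) -- the layer preserves the token-matrix shape
```

Because both mixing steps are ordinary matrix products plus a nonlinearity, the layer preserves the (patches × channels) shape and can be stacked.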
Fig. 4 shows a flow chart of an image classification method according to another embodiment provided in the present application, and as can be seen from fig. 4, the present embodiment includes the following steps S401 to S411:
Step S401: acquiring an image to be classified.
Step S402: performing text recognition processing on the image to be classified to obtain a text recognition result.
Step S403: performing word segmentation processing on the text recognition result to obtain text information of the image to be classified.
Step S404: training, based on the prompt learning idea, the BERT model and the VIT model, an image classification model comprising a prompt text coding layer, an image coding layer and a classification layer.
Step S405: performing first coding on the text information based on an Embedding layer in the prompt text coding layer, and taking the obtained plurality of text token vectors as prompt text feature vectors.
Step S406: dividing the image to be classified into a plurality of sub-images based on the image coding layer.
Step S407: performing vectorization processing on each sub-image respectively, and taking the obtained plurality of image token vectors as image feature vectors.
Step S408: fusing the prompt text feature vector and the image feature vector to obtain a fusion vector.
Step S409: taking the prompt text feature vector contained in the fusion vector as a prompt word.
Step S410: taking the image feature vector contained in the fusion vector as content to be classified, and performing classification prediction scoring based on the classification layer.
Step S411: determining, according to the classification prediction scoring result, that the image to be classified belongs to one of the plurality of preset classification categories of the image classification model.
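The flow of steps S401 to S411 can be sketched end to end with placeholder components (every function below is an illustrative stub standing in for the trained BERT/VIT layers the embodiment describes; the category names and the scorer are assumptions for demonstration):

```python
# Illustrative preset categories (not necessarily the patent's exact set).
CATEGORIES = ["case report", "examination sheet", "discharge summary",
              "medical invoice", "case record"]

def encode_prompt_text(text):
    # Stub for S402-S405: OCR text -> word segmentation -> Embedding-layer
    # encoding; here each word is "encoded" simply as its length.
    return [float(len(word)) for word in text.split()]

def encode_image(pixels, patch_size=2):
    # Stub for S406-S407: split a 2-D pixel grid into patch_size-row chunks
    # and flatten each chunk into one "image token vector".
    tokens = []
    for i in range(0, len(pixels), patch_size):
        chunk = pixels[i:i + patch_size]
        tokens.append([v for row in chunk for v in row])
    return tokens

def classify(text, pixels, score_fn):
    # S408-S411: fuse prompt-text and image tokens (prompt tokens first,
    # content tokens after), score each category, and take the best one.
    fused = (encode_prompt_text(text), encode_image(pixels))
    scores = score_fn(fused)
    return CATEGORIES[scores.index(max(scores))]

# A trivial stand-in scorer replacing the trained classification layer.
label = classify("discharge summary for patient",
                 [[0.0, 1.0], [1.0, 0.0]],
                 lambda fused: [0.1, 0.1, 0.6, 0.1, 0.1])
print(label)  # discharge summary
```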
FIG. 5 shows a schematic structural diagram of an image classification device according to one embodiment provided herein, the image classification device 500 is deployed with an image classification model, the model includes a prompt text encoding layer, an image encoding layer, and a classification layer, wherein the prompt text encoding layer and the image encoding layer are respectively connected to the classification layer; the device comprises a prompt text coding unit 501, an image coding unit 502, a fusion unit 503 and a classification unit 504, wherein:
a prompt text coding unit 501, configured to obtain text information of an image to be classified, and perform first coding on the text information based on the prompt text coding layer to obtain a prompt text feature vector;
an image encoding unit 502, configured to perform second encoding on the image to be classified based on the image encoding layer, to obtain an image feature vector;
a fusion unit 503, configured to fuse the prompt text feature vector and the image feature vector to obtain a fusion vector;
and the classification unit 504 is configured to perform classification prediction on the fusion vector based on the classification layer, so as to obtain a classification result of the image to be classified.
In some embodiments of the present application, in the above apparatus, the image classification model is constructed based on a prompt learning idea and is obtained through training.
In some embodiments of the present application, in the foregoing apparatus, the prompt text coding unit 501 is configured to acquire the image to be classified; perform text recognition processing on the image to be classified to obtain a text recognition result; and perform word segmentation on the text recognition result to obtain text information of the image to be classified.
In some embodiments of the present application, in the above apparatus, the prompt text coding layer is constructed based on a BERT model; the prompt text coding unit 501 is configured to perform first coding on the text information based on an Embedding layer of the prompt text coding layer, and take the obtained plurality of text token vectors as the prompt text feature vectors.
In some embodiments of the present application, in the above apparatus, the image encoding unit 502 is configured to divide the image to be classified into a plurality of sub-images; and respectively carrying out vectorization processing on each sub-image, and taking the obtained plurality of image token vectors as image feature vectors.
In some embodiments of the present application, in the foregoing apparatus, the classification unit 504 is configured to use a prompt text feature vector included in the fusion vector as a prompt word, use an image feature vector included in the fusion vector as content to be classified, and perform classification prediction scoring based on the classification layer; and determining the image to be classified as any one of a plurality of preset classification categories of the image classification model according to the classification prediction scoring result.
In some embodiments of the present application, in the above apparatus, the image coding layer is constructed based on at least one of a VIT model, a CNN model, and an MLP-Mixer model.
It should be noted that each of the above image classification devices corresponds one-to-one with the above image classification method, and details are not repeated herein.
Fig. 6 shows a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 6, at the hardware level, the electronic device comprises a processor and, optionally, an internal bus, a network interface, and a memory. The memory may include a volatile memory, such as a random-access memory (RAM), and may further include a non-volatile memory, such as at least one disk memory. Of course, the electronic device may also include hardware required for other services.
The processor, the network interface, and the memory may be interconnected by an internal bus, which may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one bi-directional arrow is shown in Fig. 6, but this does not mean that there is only one bus or one type of bus.
The memory is used for storing a program. In particular, the program may include program code, and the program code includes computer operation instructions. The memory may include a volatile memory and a non-volatile memory, and provides instructions and data to the processor.
The processor reads the corresponding computer program from the non-volatile memory into the memory and then runs it, forming the image classification device at the logical level. The processor executes the program stored in the memory and is specifically configured to execute the method described above.
The processor may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), and the like; it may also be a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The methods, steps, and logic blocks disclosed in the embodiments of the present application may thereby be implemented or performed. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like. The steps of a method disclosed in connection with the embodiments of the present application may be embodied directly in a hardware decoding processor, or in a combination of hardware and software modules in a decoding processor. The software modules may be located in a storage medium well known in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory, and the processor reads the information in the memory and, in combination with its hardware, performs the steps of the above method.
The electronic device may execute the image classification method provided in the embodiments of the present application and implement the function of the embodiment of the image classification device shown in fig. 5, which is not described herein again.
The embodiments also provide a computer-readable storage medium storing one or more programs, the one or more programs including instructions, which when executed by an electronic device comprising a plurality of application programs, enable the electronic device to perform the image classification methods provided by the embodiments of the present application.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include a volatile memory in a computer-readable medium, a random access memory (RAM), and/or a non-volatile memory, such as a read-only memory (ROM) or a flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape/magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media does not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and changes may be made to the present application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc. which are within the spirit and principles of the present application are intended to be included within the scope of the claims of the present application.

Claims (10)

1. The image classification method is characterized by being realized based on an image classification model, wherein the model comprises a prompt text coding layer, an image coding layer and a classification layer, and the prompt text coding layer and the image coding layer are respectively connected with the classification layer;
the method comprises the following steps:
acquiring text information of an image to be classified, and carrying out first coding on the text information based on the prompt text coding layer to obtain a prompt text feature vector;
performing second coding on the image to be classified based on the image coding layer to obtain an image feature vector;
fusing the prompt text feature vector and the image feature vector to obtain a fusion vector;
and based on the classification layer, carrying out classification prediction on the fusion vector to obtain a classification result of the image to be classified.
2. The method of claim 1, wherein the image classification model is constructed based on a prompt learning concept and is trained.
3. The method according to claim 1, wherein the acquiring text information of the image to be classified comprises:
acquiring the image to be classified;
performing text recognition processing on the image to be classified to obtain a text recognition result;
and performing word segmentation on the text recognition result to obtain text information of the image to be classified.
4. The method of claim 1, wherein the hint text encoding layer is constructed based on a BERT model;
the first coding is performed on the text information based on the prompt text coding layer to obtain a prompt text feature vector, which comprises the following steps:
and carrying out first coding on the text information based on an Embedding layer of the prompt text coding layer, and taking the obtained text token vectors as the prompt text feature vectors.
5. The method according to claim 1, wherein the performing, based on the image coding layer, the second coding on the image to be classified to obtain an image feature vector includes:
dividing the image to be classified into a plurality of sub-images;
and respectively carrying out vectorization processing on each sub-image, and taking the obtained plurality of image token vectors as image feature vectors.
6. The method according to claim 1, wherein the classifying predicting the fusion vector based on the classifying layer to obtain the classification result of the image to be classified includes:
taking the prompt text feature vector contained in the fusion vector as a prompt word, taking the image feature vector contained in the fusion vector as content to be classified, and carrying out classification prediction scoring based on the classification layer;
and determining the image to be classified as any one of a plurality of preset classification categories of the image classification model according to the classification prediction scoring result.
7. The method of any one of claims 1-6, wherein the image coding layer is constructed based on at least one of a VIT model, a CNN model, and an MLP-Mixer model.
8. An image classification device is characterized in that an image classification model is deployed in the device, the model comprises a prompt text coding layer, an image coding layer and a classification layer, wherein the prompt text coding layer and the image coding layer are respectively connected with the classification layer;
the device comprises:
the prompt text coding unit is used for acquiring text information of the images to be classified, and carrying out first coding on the text information based on the prompt text coding layer to obtain a prompt text feature vector;
the image coding unit is used for carrying out second coding on the image to be classified based on the image coding layer to obtain an image feature vector;
the fusion unit is used for fusing the prompt text feature vector and the image feature vector to obtain a fusion vector;
and the classification unit is used for carrying out classification prediction on the fusion vector based on the classification layer to obtain a classification result of the image to be classified.
9. An electronic device, comprising:
a processor; and
a memory arranged to store computer executable instructions which, when executed, cause the processor to perform the steps of the image classification method of any of claims 1 to 7.
10. A computer readable storage medium storing one or more programs, which when executed by an electronic device comprising a plurality of application programs, cause the electronic device to perform the steps of the image classification method of any of claims 1-7.
CN202310454595.7A 2023-04-18 2023-04-18 Image classification method, device, electronic equipment and computer readable storage medium Pending CN116469110A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310454595.7A CN116469110A (en) 2023-04-18 2023-04-18 Image classification method, device, electronic equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310454595.7A CN116469110A (en) 2023-04-18 2023-04-18 Image classification method, device, electronic equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN116469110A true CN116469110A (en) 2023-07-21

Family

ID=87178698

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310454595.7A Pending CN116469110A (en) 2023-04-18 2023-04-18 Image classification method, device, electronic equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN116469110A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116994098A (en) * 2023-09-27 2023-11-03 西南交通大学 Large model prompt learning method based on category attribute knowledge enhancement
CN116994098B (en) * 2023-09-27 2023-12-05 西南交通大学 Large model prompt learning method based on category attribute knowledge enhancement
CN117112734A (en) * 2023-10-18 2023-11-24 中山大学深圳研究院 Semantic-based intellectual property text representation and classification method and terminal equipment
CN117112734B (en) * 2023-10-18 2024-02-02 中山大学深圳研究院 Semantic-based intellectual property text representation and classification method and terminal equipment

Similar Documents

Publication Publication Date Title
CN116469110A (en) Image classification method, device, electronic equipment and computer readable storage medium
CN110866471A (en) Face image quality evaluation method and device, computer readable medium and communication terminal
CN111275107A (en) Multi-label scene image classification method and device based on transfer learning
CN110046698A (en) Heterogeneous figure neural network generation method, device, electronic equipment and storage medium
CN111160350B (en) Portrait segmentation method, model training method, device, medium and electronic equipment
CN113011186B (en) Named entity recognition method, named entity recognition device, named entity recognition equipment and computer readable storage medium
CN111914085A (en) Text fine-grained emotion classification method, system, device and storage medium
CN114495129A (en) Character detection model pre-training method and device
CN113705733A (en) Medical bill image processing method and device, electronic device and storage medium
CN111753878A (en) Network model deployment method, equipment and medium
CN110633359A (en) Sentence equivalence judgment method and device
CN110363830B (en) Element image generation method, device and system
AU2022345509A1 (en) Convolution attention network for multi-label clinical document classification
CN114358203A (en) Training method and device for image description sentence generation module and electronic equipment
CN113255328A (en) Language model training method and application method
CN116128894A (en) Image segmentation method and device and electronic equipment
CN115601692A (en) Data processing method, training method and device of neural network model
CN112667803A (en) Text emotion classification method and device
CN114491289A (en) Social content depression detection method of bidirectional gated convolutional network
CN110889290B (en) Text encoding method and apparatus, text encoding validity checking method and apparatus
CN116524520A (en) Text recognition method and device, storage medium and electronic equipment
CN116109868A (en) Image classification model construction and small sample image classification method based on lightweight neural network
CN114510561A (en) Answer selection method, device, equipment and storage medium
CN114155417A (en) Image target identification method and device, electronic equipment and computer storage medium
CN113221762A (en) Cost balance decision method, insurance claim settlement decision method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination