CN116883737A - Classification method, computer device, and storage medium - Google Patents

Classification method, computer device, and storage medium Download PDF

Info

Publication number
CN116883737A
CN116883737A (application CN202310836197.1A)
Authority
CN
China
Prior art keywords
text
image
classification
sample
coding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310836197.1A
Other languages
Chinese (zh)
Inventor
李箴
高耀宗
詹翊强
周翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai United Imaging Intelligent Healthcare Co Ltd
Original Assignee
Shanghai United Imaging Intelligent Healthcare Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai United Imaging Intelligent Healthcare Co Ltd filed Critical Shanghai United Imaging Intelligent Healthcare Co Ltd
Priority to CN202310836197.1A priority Critical patent/CN116883737A/en
Publication of CN116883737A publication Critical patent/CN116883737A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/335Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H30/00ICT specially adapted for the handling or processing of medical images
    • G16H30/40ICT specially adapted for the handling or processing of medical images for processing medical images, e.g. editing
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Library & Information Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Radiology & Medical Imaging (AREA)
  • Epidemiology (AREA)
  • Primary Health Care (AREA)
  • Public Health (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to a classification method, a computer device, and a storage medium. The method comprises: acquiring an image to be classified, the image to be classified comprising a lesion area; and inputting the image to be classified into a classification network for lesion classification to obtain a classification result, wherein the classification network is trained on an image-text pair sample set and the classification results include a plurality of classification results for the lesion area. Because the image-text pair sample set contains not only image samples but also text samples, the samples used for training the classification network carry rich information; this improves the training effect and, in turn, the classification accuracy of the trained classification network. In addition, compared with a traditional binary classifier, the classification network can output multiple classification results, achieving fine-grained classification.

Description

Classification method, computer device, and storage medium
Technical Field
The present application relates to the field of medical image processing technology, and in particular, to a classification method, a computer device, and a storage medium.
Background
With the rapid development of medical image scanning equipment, lesion classification based on medical images has been widely applied in various medical detection scenarios.
At present, lesions in medical images are mainly classified by classifiers trained in advance on image sample sets containing lesions, and the resulting classification is usually binary; for example, an image that may contain a nodule is classified only into whether or not it contains a nodule.
However, such methods classify some rare lesions inaccurately.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a classification method, apparatus, computer device, and storage medium capable of improving classification accuracy.
In a first aspect, the present application provides a classification method. The method comprises the following steps:
acquiring an image to be classified; the image to be classified comprises a lesion area;
inputting the image to be classified into a classification network for lesion classification to obtain a classification result; the classification network is trained on an image-text pair sample set obtained after contrastive pre-training; the classification results include a plurality of classification results for the lesion area.
In one embodiment, the classification network comprises an image encoder, a text generation model, and a classifier, and inputting the image to be classified into the classification network for lesion classification to obtain the classification result comprises:
inputting the image to be classified into the image encoder for image encoding to obtain an image code;
inputting the image code into the text generation model for text generation to obtain a text code corresponding to the image code;
inputting the text code and the image code into the classifier for lesion classification to obtain the classification result.
In one embodiment, the classification network further comprises a text decoder, and the method further comprises:
inputting the text code into the text decoder for text decoding to obtain a lesion text description corresponding to the image to be classified.
In one embodiment, the classification network further comprises a fusion module, and the method further comprises:
inputting the text code and the image code into the fusion module for fusion to obtain a fused feature;
wherein inputting the text code and the image code into the classifier for lesion classification to obtain the classification result comprises:
inputting the fused feature into the classifier for lesion classification to obtain the classification result.
In one embodiment, the method further comprises:
acquiring the image-text pair sample set; the image-text pair sample set comprises a first image sample set and a first text sample set;
performing image encoding on the first image sample set to obtain an image code set, and performing text encoding on the first text sample set to obtain a text code set;
training an initial text generation model according to the image code set and the text code set to obtain the text generation model;
training an initial classifier according to the image code set and the text code set to obtain the classifier.
In one embodiment, training the initial text generation model according to the image code set and the text code set to obtain the text generation model comprises:
inputting the image code set into the initial text generation model for text generation to obtain an output text code set corresponding to the image code set;
training the initial text generation model according to the output text code set and the text code set to obtain a trained text generation model.
In one embodiment, training the initial classifier according to the image code set and the text code set to obtain the classifier comprises:
fusing the image code set and the text code set to obtain a fused feature set;
training the initial classifier according to the fused feature set to obtain a trained classifier.
In one embodiment, acquiring the image-text pair sample set comprises:
inputting a second image sample set and a second text sample set into a contrastive pre-training network for contrastive pre-training to obtain the image-text pair sample set, wherein the image-text pair sample set comprises the first image sample set and the first text sample set with the highest similarity.
In a second aspect, the application further provides a classification device. The device comprises:
the acquisition module is used for acquiring the image to be classified; the image to be classified comprises a lesion area;
the classification module is used for inputting the image to be classified into a classification network for lesion classification to obtain a classification result; the classification network is trained on an image-text pair sample set obtained by contrastive pre-training; the classification results include a plurality of classification results for the lesion area.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor which when executing the computer program performs the steps of:
acquiring an image to be classified; the image to be classified comprises a lesion area;
inputting the image to be classified into a classification network for lesion classification to obtain a classification result; the classification network is trained on an image-text pair sample set obtained by contrastive pre-training; the classification results include a plurality of classification results for the lesion area.
In a fourth aspect, the present application also provides a computer-readable storage medium. The computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:
acquiring an image to be classified; the image to be classified comprises a lesion area;
inputting the image to be classified into a classification network for lesion classification to obtain a classification result; the classification network is trained on an image-text pair sample set obtained by contrastive pre-training; the classification results include a plurality of classification results for the lesion area.
In a fifth aspect, the present application also provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps of:
acquiring an image to be classified; the image to be classified comprises a lesion area;
inputting the image to be classified into a classification network for lesion classification to obtain a classification result; the classification network is trained on an image-text pair sample set obtained by contrastive pre-training; the classification results include a plurality of classification results for the lesion area.
With the classification method, the computer device, and the storage medium, an image to be classified is acquired and input into a classification network for lesion classification to obtain a classification result, wherein the image to be classified comprises a lesion area, the classification network is trained on an image-text pair sample set, and the classification results include a plurality of classification results for the lesion area. Because the image-text pair sample set contains not only image samples but also text samples, the samples used for training the classification network carry rich information; this improves the training effect and, in turn, the classification accuracy of the trained classification network. In addition, compared with a traditional binary classifier, the classification network can output multiple classification results, achieving fine-grained classification.
Drawings
FIG. 1 is a diagram of an application environment of a classification system in one embodiment;
FIG. 2 is a flow diagram of a classification method in one embodiment;
FIG. 3 is a flow chart of a classification method according to another embodiment;
FIG. 3A is a schematic diagram of a classification network in one embodiment;
FIG. 3B is another schematic diagram of a classification network in one embodiment;
FIG. 3C is another schematic diagram of a classification network in one embodiment;
FIG. 4 is a flow chart of a classification method according to another embodiment;
FIG. 5 is a flow chart of a classification method according to another embodiment;
FIG. 6 is a flow diagram of a training method in one embodiment;
FIG. 7 is a flow chart of a training method in another embodiment;
FIG. 8 is a flow chart of a training method in another embodiment;
FIG. 9 is a schematic diagram of a multi-modal classification network in one embodiment;
FIG. 10 is a flow diagram of a classification method in one embodiment;
FIG. 11 is a schematic diagram of a sorting apparatus according to an embodiment;
FIG. 12 is a schematic view of a sorting apparatus according to another embodiment;
FIG. 13 is a schematic view showing a structure of a sorting apparatus according to another embodiment;
FIG. 14 is a schematic view showing a structure of a sorting apparatus according to another embodiment;
FIG. 15 is a schematic view showing a structure of a sorting apparatus according to another embodiment;
fig. 16 is an internal structural view of a computer device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
Currently, medical image lesion classification techniques (for example, for lung nodules) are widely used in various downstream scenarios. In related applications, most methods determine whether an input image block contains a target lesion based on the existing classification labels of the data, for example the binary problem (1, nodule) / (0, not nodule) for an image that may contain a nodule. However, in practical scenarios the input lesion image sometimes differs considerably from the features the classifier actually saw during training, so some rare lesions and hard samples are not classified correctly. Meanwhile, the model's overall understanding of various lesions is not comprehensive enough and lacks the anatomical-level semantic associations between different lesions. In addition, images showing subdivided nodule signs (3 mm nodules, large nodules, multiple nodules, ground-glass opacities, and so on) are also simply classified as nodule or not, and such a binary output does not allow lesions to be further classified and treated according to these characteristics in subsequent steps. The present application provides a classification method to solve the above problems, which the following embodiments describe in detail.
The classification method provided by the embodiments of the application can be applied to the classification system shown in FIG. 1. The scanning device 102 communicates with the processing device 104 by wire or wirelessly. The scanning device 102 scans a target object to obtain an image to be classified and sends it to the processing device 104, and the processing device 104 classifies the lesion area in the image to be classified to obtain a classification result. The processing device 104 may also train an initial classification network on the image-text pair sample set in advance to obtain the classification network used in practice. The processing device 104 may be, but is not limited to, a personal computer, a notebook computer, a smart phone, a tablet computer, or a server; a server may be implemented as a stand-alone server or as a server cluster formed by a plurality of servers.
It will be appreciated by persons skilled in the art that the architecture shown in FIG. 1 is a block diagram of only part of the architecture relevant to the present application and does not limit the systems to which the present application may be applied; a particular system may include more or fewer components than shown, combine some of the components, or arrange the components differently.
In one embodiment, as shown in fig. 2, a classification method is provided, and the method is applied to the processing device in fig. 1 for illustration, and includes the following steps:
s201, obtaining an image to be classified; the images to be classified include lesion areas.
The image to be classified is a lesion image. The lesion area may be any type of lesion, such as a lung nodule.
In the embodiment of the application, the processing device may be connected to the scanning device; when the scanning device scans and images the target object, the lesion image can be acquired from the scanning device in real time, and any acquired lesion image is used as the image to be classified. Alternatively, the processing device may acquire a lesion image from a cloud platform or by other means as the image to be classified.
S202, inputting the image to be classified into a classification network for lesion classification to obtain a classification result; the classification network is trained on an image-text pair sample set; the classification results include a plurality of classification results for the lesion area.
The classification network may be a pre-trained neural network model used to classify a lesion area in an input image along multiple dimensions. The image-text pair sample set comprises a plurality of image-text pair samples, each comprising an image sample and a corresponding text sample; the text sample is a text description of the lesion area in the image sample, and the similarity or degree of association between the image sample and the text sample is high, that is, greater than a preset similarity threshold or a preset association threshold. The plurality of classification results for the lesion area represent different results of classifying the lesion; for a lung nodule, for example, the multi-class results include small nodule, large nodule, multiple nodules, ground-glass opacity, and the like.
In the embodiment of the application, the processing device can obtain a large number of image-text pairs through contrastive pre-training to form the image-text pair sample set, where the similarity or degree of association between a paired image sample and text sample is very high, or the text sample is a text describing the lesion in the image sample. Alternatively, the processing device first acquires an image sample and produces a text description for it, thereby obtaining a text sample related to the image sample and forming an image-text pair; a large number of image-text pairs acquired in this way form the image-text pair sample set.
After the processing device acquires the image-text pair sample set, it can train a constructed initial classification network on this sample set to obtain the classification network to be applied; the trained classification network has the ability to classify lesions into multiple classification results. The processing device can be connected to the scanning device; when the scanning device scans the target object and obtains a scanned image, the image is sent to the processing device, which takes it as the image to be classified and inputs it into the pre-trained classification network for lesion classification, obtaining the classification result.
In the above classification method, the image to be classified is acquired and input into the classification network for lesion classification to obtain the classification result, wherein the image to be classified comprises a lesion area, the classification network is trained on an image-text pair sample set, and the classification results include a plurality of classification results for the lesion area. Because the image-text pair sample set contains not only image samples but also text samples, the samples used for training the classification network carry rich information; this improves the training effect and, in turn, the classification accuracy of the trained classification network. In addition, compared with a traditional binary classifier, the classification network can output multiple classification results, achieving fine-grained classification.
In one embodiment, a structure of the classification network is provided. As shown in FIG. 3A, the classification network comprises an image encoder, a text generation model, and a classifier, where the output of the image encoder is connected to the text generation model and the output of the text generation model is connected to the classifier. Based on this structure, as shown in FIG. 3, the corresponding classification method, that is, S202 "inputting the image to be classified into a classification network for lesion classification to obtain a classification result", comprises:
S301, inputting the image to be classified into the image encoder for image feature extraction to obtain an image code.
The image encoder may be a neural network model used for feature extraction of the input image. Optionally, the image encoder may be a ViT model or a ResNet model, and it encodes the input image into a vector of fixed dimension [V_0, V_1, V_2, ..., V_n].
In the embodiment of the application, when the processing device has acquired the image to be classified based on the above steps, the image to be classified can be input into the pre-trained image encoder for image feature extraction to obtain the image code; optionally, the processing device may also pre-process the image to be classified before inputting it into the image encoder to obtain the image code.
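As a non-limiting illustration, the following PyTorch sketch shows one way such an image encoder could be organised; the class name, patch size, channel count, and embedding dimension are hypothetical choices made here for clarity, not details fixed by the application.

    import torch
    import torch.nn as nn

    class ImageEncoder(nn.Module):
        """Encodes a lesion image into a fixed-dimension code [V_0, V_1, ..., V_n] (a sketch)."""
        def __init__(self, in_channels=1, embed_dim=256, patch_size=16, image_size=64, num_layers=4):
            super().__init__()
            num_patches = (image_size // patch_size) ** 2
            # A strided convolution splits the image into patch tokens (ViT-style embedding).
            self.patch_embed = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)
            self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
            layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

        def forward(self, x):
            # x: (B, C, H, W) lesion image -> (B, num_patches, embed_dim) image code.
            tokens = self.patch_embed(x).flatten(2).transpose(1, 2)
            return self.encoder(tokens + self.pos_embed)

    # Example: encode two 64x64 single-channel lesion patches into image codes of shape (2, 16, 256).
    image_code = ImageEncoder()(torch.randn(2, 1, 64, 64))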
S302, inputting the image code into the text generation model for text generation to obtain a text code corresponding to the image code.
The text code is obtained by producing a text description of the lesion in the image to be classified and extracting features from that text. The text generation model may be a neural network model used to generate a text code with a very high similarity or degree of association to the image code of the input image. Optionally, the text generation model may use the decoder structure of a Transformer.
In the embodiment of the application, when the processing device has obtained the image code based on the above steps, the image code can be input into the pre-trained text generation model for text generation, yielding a text code with a very high similarity or degree of association to the image code.
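One possible reading of this step, sketched below under stated assumptions: a Transformer decoder refines a set of learned query tokens by attending over the image code. The application describes autoregressive generation; the non-autoregressive query formulation used here is only a simplification for illustration, and all names and sizes are hypothetical.

    import torch
    import torch.nn as nn

    class TextGenerationModel(nn.Module):
        """Maps an image code to a text code with a Transformer decoder (simplified sketch)."""
        def __init__(self, embed_dim=256, text_len=32, num_layers=4):
            super().__init__()
            # Learned query tokens stand in for the initialised text code.
            self.query = nn.Parameter(torch.zeros(1, text_len, embed_dim))
            layer = nn.TransformerDecoderLayer(d_model=embed_dim, nhead=8, batch_first=True)
            self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)

        def forward(self, image_code):
            # image_code: (B, n, d) from the image encoder -> text code: (B, text_len, d).
            query = self.query.expand(image_code.size(0), -1, -1)
            return self.decoder(tgt=query, memory=image_code)

    text_code = TextGenerationModel()(torch.randn(2, 16, 256))  # (2, 32, 256)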
S303, inputting the text code and the image code into the classifier for lesion classification to obtain a classification result.
The classifier may be a neural network model used to classify the image code and the text code, or a fused code of the text code and the image code.
In the embodiment of the application, when the processing device has obtained the text code corresponding to the image code, the text code and the image code can be input into the pre-trained classifier simultaneously for lesion classification to obtain the classification result; optionally, the processing device may also fuse the image code and the text code and then input the result into the pre-trained classifier for lesion classification to obtain the classification result.
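A minimal sketch of such a classifier, assuming the two codes are spliced token-wise and pooled before a linear classification head; the four example classes follow the lung-nodule examples given later in the description and are illustrative only.

    import torch
    import torch.nn as nn

    class MultimodalClassifier(nn.Module):
        """Classifies a lesion from the image code and the text code together (a sketch)."""
        def __init__(self, embed_dim=256, num_classes=4, num_layers=2):
            super().__init__()
            layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
            self.head = nn.Linear(embed_dim, num_classes)

        def forward(self, image_code, text_code):
            tokens = torch.cat([image_code, text_code], dim=1)  # splice the two codes token-wise
            pooled = self.encoder(tokens).mean(dim=1)           # pool over all tokens
            return self.head(pooled)                            # (B, num_classes) logits

    # e.g. hypothetical classes: solid nodule, pneumonia, ground-glass opacity, no lesion
    logits = MultimodalClassifier()(torch.randn(2, 16, 256), torch.randn(2, 32, 256))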
With the classification method provided by this embodiment, the text generation model in the classification network outputs a corresponding text code from the image code, so that when the text code is input into the classifier together with the image code, the basis on which the classifier classifies is more comprehensive and reasonable, which improves the classification accuracy of the classifier.
In one embodiment, another structure of the classification network is provided. As shown in FIG. 3B, the classification network of FIG. 3A further comprises a text decoder, where the output of the image encoder is connected to the text generation model and the output of the text generation model is connected to the classifier and to the text decoder respectively. Based on this structure, as shown in FIG. 4, the classification method of FIG. 3 further comprises the step:
S304, inputting the text code into the text decoder for text decoding to obtain a lesion text description corresponding to the image to be classified.
The lesion text description corresponding to the image to be classified describes the lesion area in the image to be classified and includes rich semantic information such as the size, density, position, and signs of the lesion. For example, if the lesion area in the image to be classified is a lung nodule, the corresponding lesion text description may be: tiny ground-glass nodules in the anterior segment of the upper lobe of the left lung, the lingular segment, the posterior basal segment of the lower lobe, and the lateral basal segment of the lower lobe of the right lung. The text decoder may be a neural network model, trained in advance together with the text encoder in an adversarial network, used to decode the input text code, that is, to restore the lesion text description.
In the embodiment of the application, the processing device can acquire a large number of lesion text description samples and corresponding text code samples, and then train an adversarial network comprising an initial text encoder and an initial text decoder on them to obtain a trained text encoder and a trained text decoder, so that the trained text encoder can extract features from an input lesion text description to obtain a text code, and the corresponding text decoder can restore the text description from an input text code to obtain the lesion text description corresponding to that text code. Then, when the processing device obtains the text code corresponding to the image to be classified, the text code can be input into the pre-trained text decoder for text description restoration, obtaining the lesion text description corresponding to the image to be classified.
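The application does not spell out the decoder architecture; the sketch below assumes a small Transformer decoder that attends over the text code and predicts one vocabulary token per position, with the vocabulary size and the greedy argmax decoding being hypothetical simplifications.

    import torch
    import torch.nn as nn

    class TextDecoder(nn.Module):
        """Restores a lesion text description (as token ids) from a text code (a sketch)."""
        def __init__(self, embed_dim=256, vocab_size=8000, num_layers=2):
            super().__init__()
            layer = nn.TransformerDecoderLayer(d_model=embed_dim, nhead=8, batch_first=True)
            self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
            self.to_vocab = nn.Linear(embed_dim, vocab_size)

        @torch.no_grad()
        def forward(self, text_code):
            # text_code: (B, text_len, d); predict one token id per position (greedy decoding).
            hidden = self.decoder(tgt=text_code, memory=text_code)
            return self.to_vocab(hidden).argmax(dim=-1)  # (B, text_len) token ids

    # The token ids would be mapped back to words by the tokenizer used during training.
    token_ids = TextDecoder()(torch.randn(2, 32, 256))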
With the classification method provided by this embodiment, on one hand the text generation model in the classification network outputs the corresponding text code from the image code, so that when the text code is input into the classifier the basis on which the classifier classifies is more comprehensive and reasonable, improving its accuracy, and the classification network can output multiple classification results; on the other hand, the classification network obtains a classification result that integrates text and image multi-modal features from only the original image to be classified, which works better than a single-modality model, and it also outputs the lesion text description corresponding to the image to be classified, achieving multi-modal classification and facilitating further processing and analysis of the lesion at a later stage.
In one embodiment, another structure of the classification network is provided. As shown in FIG. 3C, the classification network of FIG. 3A or FIG. 3B further comprises a fusion module, where the output of the image encoder is connected to the text generation model, the output of the text generation model is connected to the fusion module, and the output of the fusion module is connected to the classifier. Based on this structure, as shown in FIG. 5, the classification method of FIG. 3 or FIG. 4 further comprises the step:
S305, inputting the text code and the image code into the fusion module for fusion to obtain a fused feature.
The fusion module may be a splicing layer used to splice the text code and the image code; alternatively, the fusion module may be a neural network model used to fuse the input image code and text code.
In the embodiment of the application, the processing device can train an initial fusion module on a large number of text code samples and image code samples to obtain the fusion module to be applied; the trained fusion module can fuse an image code sample and a text code sample, where the text code samples and the image code samples correspond one-to-one and belong to the same image sample. Then, when the processing device obtains the text code corresponding to the image code, the text code and the image code can be input into the pre-trained fusion module simultaneously for fusion, obtaining the fused feature. Note that the text code and the image code need to have the same size during fusion, so their sizes can be adjusted before fusion so that the two codes can later be fused accurately. In addition, for the structure of FIG. 3C, in the training stage the text generation model, the fusion module, and the classifier may be trained separately on the image code samples and corresponding text code samples, or they may be connected together and trained jointly; the actual training arrangement can be determined according to practical requirements and is not limited here.
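A sketch of a simple splicing-style fusion module (PyTorch assumed, names and dimensions hypothetical); the linear projections stand in for the size-adjustment step mentioned above before the two codes are spliced.

    import torch
    import torch.nn as nn

    class FusionModule(nn.Module):
        """Aligns the image code and text code and splices them into a fused feature (a sketch)."""
        def __init__(self, embed_dim=256):
            super().__init__()
            # Project both codes to a common dimension so they can be fused accurately.
            self.image_proj = nn.Linear(embed_dim, embed_dim)
            self.text_proj = nn.Linear(embed_dim, embed_dim)

        def forward(self, image_code, text_code):
            fused = torch.cat([self.image_proj(image_code), self.text_proj(text_code)], dim=1)
            return fused  # (B, n_image + n_text, embed_dim) fused feature

    fused = FusionModule()(torch.randn(2, 16, 256), torch.randn(2, 32, 256))  # (2, 48, 256)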
Correspondingly, when executing S303 "inputting the text code and the image code into the classifier for lesion classification to obtain the classification result", the processing device specifically performs: inputting the fused feature into the classifier for lesion classification to obtain the classification result.
In the embodiment of the application, the processing device can train an initial classifier on a feature sample set to obtain the classifier to be applied; the trained classifier can classify the lesion into multiple classification results. The feature sample set comprises fused feature samples formed from image code samples and text code samples, where the image code samples and the text code samples correspond one-to-one and belong to the same image sample. Then, when the processing device obtains the fused feature of the image code and its corresponding text code, the fused feature can be input into the pre-trained classifier for lesion classification to obtain the classification result.
With the classification method provided by this embodiment, the fusion module in the classification network fuses the text code and the image code before they are input into the classifier, so the basis on which the classifier classifies is more comprehensive and reasonable, which improves the classification accuracy, and the classification network can output multiple classification results.
In one embodiment, the present application further provides a training method, that is, a method for training the text generation model and the classifier in the classification network shown in FIG. 3A. As shown in FIG. 6, the method comprises:
s401, acquiring a sample set of image text pairs; the image text pair sample set includes a first image sample and a first text sample.
The image text pair sample set comprises a plurality of pairs of image text pair samples, the image text pair samples comprise a first image sample and a first text sample, the first image sample and the first text sample are in one-to-one correspondence, namely, the first text sample is a text description of a focus area in the first image sample, in addition, the similarity or the association degree between the first image sample and the first text sample is very high, namely, the similarity or the association degree between the first image sample and the first text sample is larger than a preset similarity threshold value.
In the embodiment of the application, the processing device can obtain a large number of image text pair samples through comparison pre-training to form an image text pair sample set, wherein the similarity or the association degree between the paired first image samples and the first text samples is high, or the first text samples are texts for describing focuses in the first image samples; or the processing equipment firstly acquires a first image sample and carries out text description on the first image sample, so that a first text sample related to the first image sample is obtained, an image text pair sample of the first image sample and the first text sample is formed, and in this way, a large number of image text pair samples are acquired, and an image text pair sample set is formed.
S402, performing image encoding on the first image sample to obtain an image code sample, and performing text encoding on the first text sample to obtain a text code sample.
In the embodiment of the application, when the processing device has acquired the first image sample and the corresponding first text sample, the first image sample can be input into a pre-trained image encoder for image encoding to obtain the image code sample, and at the same time the first text sample is input into a pre-trained text encoder for text encoding to obtain the text code sample. The image encoder encodes the input image (the first image sample) into a vector of fixed dimension, and the corresponding text encoder encodes the input text description (the first text sample) into a vector of a dimension consistent with that of the image encoder; the output image code sample and text code sample form a pair, and one batch used for training can contain several such image-text pairs.
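A short sketch of this pairing step; pooling to one vector per image and per description is an assumption made here so that the two encoders output vectors of the same dimension, as the paragraph requires, and the function and argument names are hypothetical.

    import torch

    def encode_pairs(image_encoder, text_encoder, image_batch, text_batch):
        """Encodes a batch of first image samples and first text samples into paired codes (a sketch)."""
        image_codes = image_encoder(image_batch).mean(dim=1)  # (B, d), one vector per image
        text_codes = text_encoder(text_batch).mean(dim=1)     # (B, d), same dimension d
        return image_codes, text_codes  # row i of each tensor forms one image-text pair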
S403, training the initial text generation model according to the image code sample and the text code sample to obtain the text generation model.
In the embodiment of the application, when the processing device has obtained the image code sample and the text code sample, the image code sample can be input into the initial text generation model, with the text code sample serving as supervision, to obtain the text generation model to be applied; the trained text generation model can generate the corresponding text code from an image code. Because the image code samples and the text code samples correspond one-to-one, that is, they are embedded into the same space, text semantics are anchored to image semantics. Training the initial text generation model on these samples therefore lets it generate accurate text codes with high similarity or association to the image code, makes the differences and similarities between a lesion and other lesions easy to locate in the high-dimensional space, and improves the text generation ability of the trained model.
S404, training the initial classifier according to the image code sample and the text code sample to obtain the classifier.
In the embodiment of the application, when the processing device has obtained the image code sample and the text code sample, both can be input into the initial classifier simultaneously, so that the initial classifier is trained on multi-modal features, obtaining the classifier to be applied with accurate classification ability. Because the image code samples and the text code samples correspond one-to-one, that is, they are embedded into the same space and anchor text semantics to image semantics, training the initial classifier on them makes its predictions for unknown or rare lesions more robust and reasonable.
Further, a method for training the initial text generation model is provided. As shown in FIG. 7, the method comprises:
s501, inputting the image coding sample into an initial text generation model to generate text, and obtaining an output text coding sample corresponding to the image coding sample.
The output text coding samples and the image coding samples have extremely high similarity or extremely high association degree. The initial text generation model may be a neural network model for generating text codes that have a very high similarity or relevance to the input image code.
In the embodiment of the application, when the processing equipment acquires the image coding sample, the image coding sample can be input into the constructed initial text generation model for text generation to obtain the output text coding sample related to the image coding sample.
S502, training the initial text generation model according to the output text code sample and the text code sample to obtain a trained text generation model.
When the processing device has obtained the output text code sample, the text code sample can be used as a supervision signal to train the initial text generation model, so that the trained model can generate highly relevant text codes from image codes. Specifically, the processing device may determine a target loss from the output text code sample and the text code sample, and adjust the parameters of the initial text generation model based on the target loss during training until the model converges or the target loss satisfies the training condition, obtaining the trained text generation model.
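A minimal training-loop sketch for this step; the application does not fix the target loss, so the mean-squared error between the output text code and the supervising text code is an assumption, as are the optimiser and hyper-parameters.

    import torch
    import torch.nn.functional as F

    def train_text_generator(text_gen, image_codes, target_text_codes, epochs=10, lr=1e-4):
        """Trains the initial text generation model with text code samples as supervision (a sketch)."""
        optimizer = torch.optim.Adam(text_gen.parameters(), lr=lr)
        for _ in range(epochs):
            optimizer.zero_grad()
            output_text_codes = text_gen(image_codes)                 # output text code samples
            loss = F.mse_loss(output_text_codes, target_text_codes)   # target loss (assumed MSE)
            loss.backward()
            optimizer.step()
        return text_gen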
With the above training method, the initial text generation model is trained with the text code sample as a supervision signal, so that it can generate a related text code from an input image code; the text code can then assist the image code in classification at a later stage, improving the classifier's prediction ability for unknown or rare lesions.
Further, a method for training the initial classifier is provided. As shown in FIG. 8, the method comprises:
s601, carrying out fusion processing on an image coding sample and a text coding sample to obtain a fusion coding sample;
in the embodiment of the application, when the processing equipment acquires the image coding sample and the text coding sample, the image coding sample and the text coding sample can be fused, and specifically, the image coding sample and the text coding sample can be input into a pre-trained fusion module for fusion processing, so as to obtain the fusion coding sample.
S602, training the initial classifier according to the fused feature sample to obtain a trained classifier.
When the processing device has obtained the fused feature sample, it can be input into the initial classifier for training, obtaining a trained classifier able to classify the lesion in the image to be classified.
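A sketch of this training step under the assumption that the classifier accepts the fused feature directly and that a cross-entropy loss over labelled lesion classes is used; neither detail is fixed by the application.

    import torch
    import torch.nn.functional as F

    def train_classifier(fusion, classifier, image_codes, text_codes, labels, epochs=10, lr=1e-4):
        """Trains the initial classifier on fused feature samples (a sketch)."""
        params = list(fusion.parameters()) + list(classifier.parameters())
        optimizer = torch.optim.Adam(params, lr=lr)
        for _ in range(epochs):
            optimizer.zero_grad()
            fused = fusion(image_codes, text_codes)            # fused feature samples
            loss = F.cross_entropy(classifier(fused), labels)  # assumed multi-class loss
            loss.backward()
            optimizer.step()
        return classifier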
With the above training method, the initial classifier is trained on fused feature samples, so it classifies on the basis of rich information, which improves its classification accuracy.
Further, a method for acquiring the image-text pair sample set is provided, that is, when the processing device executes S401 it specifically performs: inputting each second image sample in a second image sample set and each second text sample in a second text sample set into a contrastive pre-training network for contrastive similarity learning to obtain the image-text pair sample set, where the image-text pair sample set comprises the first image samples and first text samples with the highest similarity or degree of association.
The contrastive pre-training network comprises an image encoder and a text encoder; the image encoder may be a ViT model or a ResNet model, and the text encoder uses a Transformer encoder. Optionally, the contrastive pre-training network may use a multi-modal training model similar to Contrastive Language-Image Pre-training (CLIP) to embed text semantic information and image semantic information into the same space, thereby anchoring them to each other.
In the embodiment of the application, the image encoder encodes an input image into a vector of fixed dimension [V_0, V_1, V_2, ..., V_n], and the text encoder encodes the text description corresponding to the image into a vector of the same dimension; such an image vector and text vector form a pair, and during contrastive pre-training one batch may contain several such image-text pairs. During training, the similarity or degree of association between each text vector and each image vector in the batch is computed, and the objective function drives the originally paired first image sample and first text sample to have the highest similarity or degree of association, yielding the image-text pair samples.
Optionally, during training, any second image sample is input on the image-encoder side and several corresponding second text samples (text descriptions of the second image sample) are input on the text-encoder side; the similarity or degree of association between the output image code and each output text code is then computed, and the text description closest to the input image obtains the highest score. Likewise, when several second image samples and one second text sample are input, the image closest to the description obtains the highest similarity or association score.
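A common way to realise such an objective is the symmetric contrastive (InfoNCE) loss used by CLIP-style models, sketched below; the exact objective and the temperature value are assumptions, since the application only states that originally paired codes should obtain the highest similarity.

    import torch
    import torch.nn.functional as F

    def contrastive_loss(image_codes, text_codes, temperature=0.07):
        """Symmetric contrastive objective over one batch of image-text pairs (a sketch)."""
        image_codes = F.normalize(image_codes, dim=-1)         # (B, d)
        text_codes = F.normalize(text_codes, dim=-1)           # (B, d)
        logits = image_codes @ text_codes.t() / temperature    # (B, B) similarity matrix
        targets = torch.arange(logits.size(0), device=logits.device)
        # Each image should best match its own description, and vice versa.
        return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2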
Optionally, the image code of the image to be classified is obtained with the image encoder already trained in the preceding contrastive pre-training and is input into the text generation model. The structure of the text generation model may use the decoder structure of a Transformer, and the image code of the input image is fed into the attention layers of the text generation model (decoder). The training task (objective function) is to autoregressively generate the text code produced by the text encoder used in the contrastive pre-training: the initialised text code performs attention computations with the model weights and with the image code output by the image encoder, so that the generated text code ends up highly correlated with the image code. Then, after the generated text code and the image code produced by the image encoder are merged by splicing, they can be input into a multi-modal model (classifier) with the structure of a Transformer encoder to produce the final classification result. In the multi-modal model, taking a lung nodule as an example, the output may be a binary result, for example nodule versus not a nodule, or multiple classification results, such as solid nodule, pneumonia, ground-glass opacity, and the like.
In the embodiment of the application, the first image sample and the first text sample with the highest similarity or degree of association are obtained through contrastive pre-training, so that when the text generation model is trained on them it can give the text description closest to the input image based on the size, density, sign characteristics, and patterns of the lesion in the input image.
In summary, the present application also provides a multi-modal classification network and a corresponding training method. As shown in FIG. 9, the multi-modal classification network comprises a contrastive pre-training branch and a classification branch; the contrastive pre-training branch comprises an image encoder and a text encoder, and the classification branch comprises the image encoder, a generation model, a fusion module, the text encoder, and a multi-modal classification model. The training method based on this network is as follows: a sample image is input on the image-encoder side and encoded into a first image code, and a text is input on the text-encoder side and encoded into a first text code; through similarity learning, the contrastive pre-training branch makes the image and text codes output for an original pair obtain the highest similarity, that is, it outputs the first image code and first text code with the highest similarity as an image-text pair sample. At a later stage, the generation model and the multi-modal classification model in the classification branch are each trained on these image-text pair samples, with the first image code taken as the input sample and the first text code taken as label data, to obtain a trained generation model and multi-modal classification model.
During application of the classification branch, when an image to be classified is input on the image-encoder side and the image encoder outputs a second image code, the second image code can be input into the generation model for text generation, obtaining a second text code corresponding to the image to be classified. One path then inputs the second image code and the second text code into the trained multi-modal classification model, which outputs the multiple classification results for the lesion; the other path inputs the second text code into the text decoder, which outputs the lesion text description.
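Putting the pieces together, inference with the classification branch could look like the following sketch, assuming the fused-feature variant of the classifier described above; the function and argument names are hypothetical.

    import torch

    @torch.no_grad()
    def classify(image, image_encoder, text_gen, fusion, classifier, text_decoder):
        """End-to-end inference with the multi-modal classification network (a sketch)."""
        image_code = image_encoder(image)        # second image code
        text_code = text_gen(image_code)         # second text code generated from it
        fused = fusion(image_code, text_code)    # fused feature
        class_logits = classifier(fused)         # multiple classification results (logits)
        description = text_decoder(text_code)    # lesion text description (token ids)
        return class_logits, description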
Based on the multi-modal classification network of the above embodiments and the classification methods of the above embodiments, the present application further provides a classification method. As shown in FIG. 10, the method comprises:
s901, inputting each second image sample in the second image sample set and each second text sample in the second text sample set into a contrast pre-training network for contrast pre-training to obtain an image text pair sample set, wherein the image text pair sample set comprises a first image sample and a first text sample which are highest in similarity.
S902, training an initial text generation model based on the image text to obtain a trained text generation model, and training an initial classifier based on the image text to obtain a trained classifier.
S903, obtaining an image to be classified, inputting the image to be classified into an image encoder for feature extraction, and obtaining an image code corresponding to the image to be classified.
S904, inputting the image code into a trained text generation model to generate text, and obtaining a text code corresponding to the image code.
S905, inputting the text code and the image code into a fusion module for fusion to obtain a fusion code.
S906, inputting the fusion codes into a classifier to perform focus classification, and obtaining a classification result.
S907, inputting the text codes into a text decoder for text decoding, and obtaining focus text description corresponding to the images to be classified.
The above steps are all described in the foregoing, and the detailed description is referred to the foregoing description, which is not repeated here.
The classification method provided by the application firstly introduces the thought of multi-mode contrast pre-training, firstly generates the rich text description containing various focuses and the image blocks of the focuses through corresponding encoders to embed contrast learning, trains a text generation model, and enables the text generation model to learn the feature distinction between the semantics and the images. And finally training a robust classifier based on image and rich text description by using the text code embedding generated by the trained text generation model and the corresponding image embedding as input. In addition, when the whole classification model is used, only an original image is required to be input, the classification model can generate text embedding related to the image, and then the classifier utilizes the text embedding information and the original embedding information of the image as input to finish classification of the focus.
In the embodiment of the application, coding features carrying rich semantic information are added as the basis of comparison, so that the multi-modal model based on contrastive pre-training has better discrimination capability for rare samples. Taking a lung nodule as an example, the input contains rich semantic information: the image patch of the nodule reflects the position, size and density of the nodule and even related signs. When a rare lesion type that is absent from the existing dataset appears (for example, a nodule with very few typical features), a conventional classifier is prone to errors and classification failure. By contrast, the contrastive pre-training model anchors and projects the multi-modal features of position, size and density into a high-dimensional space, in which the closest class can be found, yielding a more accurate positioning of the sample. In addition, when the classification network used by the classification method provided by the application is applied, a classification result integrating the multi-modal features of text and image can be obtained by inputting only the original image, and the effect is better than that of a single-modality model. Meanwhile, a text description of the original image is also output, which facilitates subsequent processing.
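The "closest class in a high-dimensional embedding space" behaviour can be pictured with a toy nearest-prototype lookup; the prototype construction and the nodule categories named in the comment are assumptions for the example only.

```python
# Toy illustration of assigning a rare sample to the nearest class prototype
# by cosine similarity in embedding space; all tensors here are placeholders.
import torch
import torch.nn.functional as F

dim = 256
prototypes = F.normalize(torch.randn(4, dim), dim=-1)  # e.g. solid / part-solid / ground-glass / calcified
sample = F.normalize(torch.randn(1, dim), dim=-1)       # embedding of a rare nodule

similarity = sample @ prototypes.t()                     # cosine similarity to every class prototype
print("closest class:", similarity.argmax(dim=-1).item())
```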
It should be understood that, although the steps in the flowcharts of the above embodiments are shown sequentially as indicated by the arrows, these steps are not necessarily performed in the order indicated. Unless explicitly stated herein, the execution order of these steps is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in the above flowcharts may include a plurality of sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; their execution order is not necessarily sequential, and they may be performed in turn or alternately with at least some of the other steps or sub-steps.
Based on the same inventive concept, an embodiment of the application further provides a classification device for implementing the above classification method. The implementation of the solution provided by the device is similar to that described in the above method; therefore, for the specific limitations of one or more embodiments of the classification device provided below, reference may be made to the limitations of the classification method above, and the details are not repeated here.
In one embodiment, as shown in fig. 11, a classification device is provided, including:
an acquisition module 10, configured to acquire an image to be classified; the image to be classified includes a lesion area.
The classification module 11 is configured to input the image to be classified into a classification network for lesion classification to obtain a classification result; the classification network is obtained by training based on an image text pair sample set; the classification result includes a plurality of classification results for the lesion area.
In one embodiment, the classification module 11, as shown in fig. 12, includes:
an extracting unit 110, configured to input the image to be classified to the image encoder for image encoding extraction, so as to obtain an image encoding;
a generating unit 111, configured to input the image code to the text generating model for text generation, so as to obtain a text code corresponding to the image code;
and the classification unit 112 is configured to input the text code and the image code into the classifier for lesion classification to obtain the classification result.
In one embodiment, as shown in fig. 13, the classification device further includes:
and the decoding module 13 is configured to input the text code into the text decoder for text decoding to obtain a lesion text description corresponding to the image to be classified.
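A purely illustrative form of such a decoding module is a small autoregressive decoder that greedily emits report tokens from the text code; the GRU cell, vocabulary size and start token used below are assumptions, not the decoder actually employed.

```python
# Hedged sketch of a decoding module: a tiny GRU decoder that greedily emits
# tokens from the text code; sizes and the start-of-sequence id are assumptions.
import torch
import torch.nn as nn

dim, vocab, max_len = 256, 1000, 12
cell = nn.GRUCell(dim, dim)         # hidden state initialised from the text code
embed = nn.Embedding(vocab, dim)    # token embedding fed back at each step
head = nn.Linear(dim, vocab)        # hidden state -> token logits

def decode(text_code: torch.Tensor):
    hidden = text_code                        # (1, dim) text code as initial hidden state
    token = torch.zeros(1, dtype=torch.long)  # assumed start-of-sequence id 0
    out = []
    for _ in range(max_len):
        hidden = cell(embed(token), hidden)
        token = head(hidden).argmax(dim=-1)   # greedy token choice
        out.append(token.item())
    return out

print(decode(torch.randn(1, dim)))
```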
In one embodiment, as shown in fig. 14, the classification device further includes:
the fusion module 14 is configured to input the text code and the image code to the fusion module for fusion, so as to obtain fusion features;
correspondingly, the classification module 12 is configured to input the fusion feature into the classifier for lesion classification to obtain the classification result.
In one embodiment, as shown in fig. 15, the classification device further includes:
training module 15, including a sample acquisition unit 150, an encoding unit 151, a first training unit 152, and a second training unit 153, wherein:
the sample acquisition unit 150 is configured to acquire the image text pair sample set; the image text pair sample set includes a first image sample and a first text sample;
The encoding unit 151 is configured to perform image encoding on the first image sample to obtain an image encoding sample, and perform text encoding on the first text sample to obtain a text encoding sample.
A first training unit 152, configured to train an initial text generation model according to the image coding sample and the text coding sample, so as to obtain the text generation model;
and the second training unit 153 is configured to train the initial classifier according to the image coding sample and the text coding sample, so as to obtain the classifier.
In one embodiment, the first training unit 152 is specifically configured to input the image coding sample into the initial text generation model for text generation, so as to obtain an output text coding sample corresponding to the image coding sample; and training the initial text generation model according to the output text coding sample and the text coding sample to obtain a trained text generation model.
In one embodiment, the second training unit 153 is specifically configured to perform fusion processing on the image coding sample and the text coding sample to obtain a fusion feature, and to train the initial classifier according to the fusion feature to obtain a trained classifier.
In one embodiment, the sample acquisition unit 150 is configured to input each second image sample in the second image sample set and each second text sample in the second text sample set into a contrastive pre-training network for contrastive pre-training to obtain the image text pair sample set, where the image text pair sample set includes the first image sample and the first text sample with the highest similarity.
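One way to realise this pairing step, assuming both encoders have already been pre-trained and their outputs normalised, is to compute the full image-text similarity matrix and keep, for each image code, the text code with the highest similarity, as in the sketch below.

```python
# Illustrative pairing of each image sample with its most similar text sample
# after contrastive pre-training; the normalised embeddings are placeholders.
import torch
import torch.nn.functional as F

dim = 256
image_codes = F.normalize(torch.randn(100, dim), dim=-1)  # first image codes
text_codes = F.normalize(torch.randn(100, dim), dim=-1)   # first text codes

similarity = image_codes @ text_codes.t()            # (100, 100) cosine similarities
best = similarity.argmax(dim=1)                       # best-matching text index per image
pairs = [(i, j.item()) for i, j in enumerate(best)]   # image text pair sample set (by index)
print(pairs[:5])
```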
Each of the modules in the above classification device may be implemented in whole or in part by software, hardware, or a combination thereof. Each of the above modules may be embedded in hardware form in, or independent of, a processor in the computer device, or may be stored in software form in a memory of the computer device, so that the processor can call and execute the operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a terminal, and the internal structure thereof may be as shown in fig. 16. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless mode can be realized through WIFI, a mobile cellular network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a classification method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
It will be appreciated by those skilled in the art that the structure shown in fig. 16 is merely a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution of the present application is applied; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, including a memory and a processor, the memory storing a computer program, and the processor, when executing the computer program, implements the following steps:
acquiring an image to be classified; the image to be classified includes a lesion area;
inputting the image to be classified into a classification network for lesion classification to obtain a classification result; the classification network is obtained by training based on an image text pair sample set; the classification result includes a plurality of classification results for the lesion area.
The computer device provided in the foregoing embodiments has similar implementation principles and technical effects to those of the foregoing method embodiments, and will not be described herein in detail.
In one embodiment, a computer readable storage medium is provided having a computer program stored thereon, which when executed by a processor, performs the steps of:
acquiring an image to be classified; the image to be classified includes a lesion area;
inputting the image to be classified into a classification network for lesion classification to obtain a classification result; the classification network is obtained by training based on an image text pair sample set; the classification result includes a plurality of classification results for the lesion area.
The foregoing embodiment provides a computer readable storage medium, which has similar principles and technical effects to those of the foregoing method embodiment, and will not be described herein.
In one embodiment, a computer program product is provided comprising a computer program which, when executed by a processor, performs the steps of:
acquiring an image to be classified; the image to be classified includes a lesion area;
inputting the image to be classified into a classification network for lesion classification to obtain a classification result; the classification network is obtained by training based on an image text pair sample set; the classification result includes a plurality of classification results for the lesion area.
The foregoing embodiment provides a computer program product, which has similar principles and technical effects to those of the foregoing method embodiment, and will not be described herein.
Those skilled in the art will appreciate that all or part of the procedures in the above method embodiments may be implemented by a computer program instructing relevant hardware, where the computer program may be stored on a non-transitory computer-readable storage medium and, when executed, may include the procedures of the above method embodiments. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. The volatile memory may include random access memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM is available in many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processor referred to in the embodiments provided herein may be a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, a data processing logic unit based on quantum computing, or the like, but is not limited thereto.
The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, the combination should be considered to fall within the scope of this specification.
The foregoing embodiments illustrate only a few implementations of the application and are described in relative detail, but they should not be construed as limiting the scope of the application. It should be noted that several variations and improvements can be made by those skilled in the art without departing from the concept of the application, and these all fall within the protection scope of the application. Therefore, the protection scope of the application shall be subject to the appended claims.

Claims (10)

1. A method of classification, the method comprising:
acquiring an image to be classified; the image to be classified comprises a lesion area;
inputting the image to be classified into a classification network for lesion classification to obtain a classification result; the classification network is obtained by training based on an image text pair sample set; the classification result comprises a plurality of classification results for the lesion area.
2. The method of claim 1, wherein the classification network comprises: an image encoder, a text generation model, and a classifier; and the inputting the image to be classified into the classification network for lesion classification to obtain the classification result comprises:
inputting the image to be classified into the image encoder for image coding extraction to obtain an image code;
inputting the image code into the text generation model to generate text, and obtaining a text code corresponding to the image code;
inputting the text code and the image code into the classifier for lesion classification to obtain the classification result.
3. The method of claim 2, wherein the classification network further comprises: a text decoder, the method further comprising:
inputting the text code into the text decoder for text decoding to obtain a lesion text description corresponding to the image to be classified.
4. The method of claim 2, wherein the classification network further comprises: a fusion module, the method further comprising:
inputting the text code and the image code into the fusion module for fusion to obtain a fusion feature;
the inputting the text code and the image code into the classifier for lesion classification to obtain the classification result comprises:
inputting the fusion feature into the classifier for lesion classification to obtain the classification result.
5. The method according to any one of claims 2-4, further comprising:
acquiring an image text pair sample set; the image text pair sample set includes a first image sample and a first text sample;
performing image coding on the first image sample to obtain an image coding sample, and performing text coding on the first text sample to obtain a text coding sample;
training an initial text generation model according to the image coding sample and the text coding sample to obtain the text generation model;
training an initial classifier according to the image coding sample and the text coding sample to obtain the classifier.
6. The method of claim 5, wherein training an initial text generation model based on the image encoding samples and the text encoding samples to obtain the text generation model comprises:
inputting the image coding sample into the initial text generation model to generate text, and obtaining an output text coding sample corresponding to the image coding sample;
and training the initial text generation model according to the output text coding sample and the text coding sample to obtain a trained text generation model.
7. The method of claim 5, wherein training an initial classifier based on the image-encoded samples and the text-encoded samples results in the classifier, comprising:
carrying out fusion processing on the image coding sample and the text coding sample to obtain a fusion feature;
and training the initial classifier according to the fusion feature to obtain a trained classifier.
8. The method of claim 5, wherein the acquiring the image text pair sample set comprises:
inputting each second image sample in a second image sample set and each second text sample in a second text sample set into a contrastive pre-training network for contrastive pre-training to obtain the image text pair sample set, wherein the image text pair sample set comprises the first image sample and the first text sample with the highest similarity.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 8 when the computer program is executed.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 8.
CN202310836197.1A 2023-07-07 2023-07-07 Classification method, computer device, and storage medium Pending CN116883737A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310836197.1A CN116883737A (en) 2023-07-07 2023-07-07 Classification method, computer device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310836197.1A CN116883737A (en) 2023-07-07 2023-07-07 Classification method, computer device, and storage medium

Publications (1)

Publication Number Publication Date
CN116883737A true CN116883737A (en) 2023-10-13

Family

ID=88263722

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310836197.1A Pending CN116883737A (en) 2023-07-07 2023-07-07 Classification method, computer device, and storage medium

Country Status (1)

Country Link
CN (1) CN116883737A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117577258A (en) * 2024-01-16 2024-02-20 北京大学第三医院(北京大学第三临床医学院) PETCT (pulse-based transmission control test) similar case retrieval and prognosis prediction method
CN117577258B (en) * 2024-01-16 2024-04-02 北京大学第三医院(北京大学第三临床医学院) PETCT (pulse-based transmission control test) similar case retrieval and prognosis prediction method

Similar Documents

Publication Publication Date Title
CN110472531B (en) Video processing method, device, electronic equipment and storage medium
US20210390700A1 (en) Referring image segmentation
CN112766244B (en) Target object detection method and device, computer equipment and storage medium
CN113297975A (en) Method and device for identifying table structure, storage medium and electronic equipment
US20230102217A1 (en) Translating texts for videos based on video context
Jiang et al. A deep evaluator for image retargeting quality by geometrical and contextual interaction
CN112804558B (en) Video splitting method, device and equipment
CN115565238B (en) Face-changing model training method, face-changing model training device, face-changing model training apparatus, storage medium, and program product
CN113870395A (en) Animation video generation method, device, equipment and storage medium
CN117609550B (en) Video title generation method and training method of video title generation model
CN116883737A (en) Classification method, computer device, and storage medium
CN115083435A (en) Audio data processing method and device, computer equipment and storage medium
CN117078790A (en) Image generation method, device, computer equipment and storage medium
Belharbi et al. Deep neural networks regularization for structured output prediction
CN115147890A (en) System, method and storage medium for creating image data embedding for image recognition
CN113191355A (en) Text image synthesis method, device, equipment and storage medium
CN113689527B (en) Training method of face conversion model and face image conversion method
Huang et al. Dynamic sign language recognition based on CBAM with autoencoder time series neural network
CN116152826A (en) Handwritten character recognition method and device, storage medium and computer equipment
CN115116548A (en) Data processing method, data processing apparatus, computer device, medium, and program product
CN115359492A (en) Text image matching model training method, picture labeling method, device and equipment
CN113469197A (en) Image-text matching method, device, equipment and storage medium
AU2021240188B1 (en) Face-hand correlation degree detection method and apparatus, device and storage medium
US20240046412A1 (en) Debiasing image to image translation models
CN117893859A (en) Multi-mode text image classification method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination