CN116612324A - Small sample image classification method and device based on semantic adaptive fusion mechanism

Publication number: CN116612324A
Authority: CN (China)
Application number: CN202310561130.1A
Other languages: Chinese (zh)
Prior art keywords: feature, features, semantic, visual, support set
Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Inventors: 唐培人, 程旗, 高晓利, 李捷, 王维, 赵火军, 包庆红, 聂常赟
Assignee (original and current): Sichuan Jiuzhou Electric Group Co Ltd
Application filed by Sichuan Jiuzhou Electric Group Co Ltd
Priority: CN202310561130.1A


Classifications

    • G06V 10/765: Image or video recognition or understanding using pattern recognition or machine learning; classification using rules for classification or partitioning the feature space
    • G06N 3/0464: Computing arrangements based on biological models; neural networks; convolutional networks [CNN, ConvNet]
    • G06V 10/774: Processing image or video features in feature spaces; generating sets of training patterns, e.g. bagging or boosting
    • G06V 10/778: Processing image or video features in feature spaces; active pattern-learning, e.g. online learning of image or video features
    • G06V 10/806: Fusion, i.e. combining data from various sources, at the level of extracted features
    • G06V 10/82: Image or video recognition or understanding using neural networks
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The application discloses a small sample image classification method and device based on a semantic adaptive fusion mechanism, relating to the field of image recognition and classification. The technical scheme is as follows: high-level multi-modal features are extracted for each target category through object visual feature prototype extraction and semantic feature extraction; dimension adaptation between semantic features and visual features is provided, solving the fusion of modalities of different dimensions; through an adaptive convex combination of semantic and visual features, the modal features of different dimensions are effectively fused to obtain a multi-modal fusion feature prototype with enhanced characterization; and weight-based splicing of support set and query set samples is provided, so that features are spliced according to the importance of the sample distribution. Relation scores between samples are then obtained through a relation scoring network, and the highest score indicates the same category. For image classification under small-sample conditions, the method makes full use of multi-modal information, increases the target characterization information, and improves target classification accuracy.

Description

Small sample image classification method and device based on semantic adaptive fusion mechanism
Technical Field
The application relates to the field of image recognition and classification, in particular to a small sample image classification method and device based on a semantic adaptive fusion mechanism.
Background
Deep learning has been widely used in the fields of image recognition, speech recognition, chess playing, and the like. However, deep learning depends heavily on the amount of sample data: most deep learning algorithms require massive amounts of data for training and converge slowly, which hinders the further application of deep learning in the many scenarios where large numbers of samples are difficult to obtain.
Aiming at the problem of target identification under small-sample conditions, the prior art provides methods such as data enhancement, initialization-based optimization, and metric learning. Data enhancement expands the sample data volume to alleviate overfitting, including sample generation and feature hallucination using generative adversarial networks, but the generated samples and features are highly similar to the originals, so the target classification effect is difficult to improve effectively. Optimization-based methods follow the idea of meta-learning and aim to learn a meta-classifier that reaches good classification performance on a new task after parameter fine-tuning, but the trained model can only be pre-trained and migrated on fixed tasks. Graph-neural-network-based methods take each sample as a node and the similarity between samples as edges, and iterate the neural network model to compute the connection matrix of the graph, finally inferring the similarity between the sample to be identified and all support samples; however, the model training process consumes a large amount of memory, and the computation grows rapidly with the number of samples.
However, conventional classification methods are less accurate when classifying targets under small-sample conditions. In particular, they have the following disadvantages: first, the scarcity of samples makes target features difficult to characterize; second, fixed nearest-neighbor classifiers and linear classifiers hinder performance optimization; third, auxiliary information is introduced too directly, lacking targeted introduction adapted to the characteristics of different samples.
Disclosure of Invention
Aiming at the problem of poor target classification accuracy under small-sample conditions, the application provides a small sample image classification method and device based on a semantic adaptive fusion mechanism.
The technical aim of the application is achieved by the following technical scheme:
In a first aspect, the application provides a small sample image classification method based on a semantic adaptive fusion mechanism, comprising the following steps:
acquiring an image data set, and acquiring a support set and a query set with at least one category according to the image data set;
acquiring text description information of images of each category of the support set;
extracting visual feature prototypes of each category of the support set and extracting first semantic features of the text description information;
matching the dimension of the first semantic feature to obtain a second semantic feature with the same dimensions as the visual feature prototype;
calculating fusion weights according to the distribution of the first semantic features of the text description information of the images of each category of the support set, and fusing the visual feature prototype and the second semantic features according to the fusion weights to obtain a fusion feature prototype;
extracting visual features of the images to be tested in the query set, constructing a splicing weight calculation network to process the visual features of the query set together with the fusion feature prototype of each category of the support set and output a first weight factor corresponding to the query set and a second weight factor corresponding to the support set, splicing the image to be tested with the fusion feature prototypes of all categories of the support set along the channel dimension of the visual features according to the first weight factor and the second weight factor to obtain splicing features, and splicing the splicing features corresponding to each category of the support set along the length direction to obtain a total splicing feature;
and scoring the total spliced characteristics by using a pre-trained relational scoring network, and determining the classification result of the query set according to the scoring result.
In one embodiment, a support set and a query set having at least one category are obtained from an image dataset, specifically: and randomly selecting N types of images from the image dataset, and randomly selecting K images and T images from the images of each of the N types as a support set and a query set respectively.
In one embodiment, extracting visual feature prototypes for each category of the support set specifically includes:
constructing a visual feature prototype extraction network, wherein the visual feature prototype extraction network consists of a visual feature extraction network and a feature weight calculation network, the visual feature extraction network is a convolutional neural network, and the feature weight calculation network consists of a plurality of convolutional layers and a full connection layer;
extracting visual feature prototypes of each category of the support set based on a pre-trained visual feature prototypes extraction network;
the first semantic features of the text description information are extracted, specifically: the first semantic features of the text description information are extracted by using a word vector learning algorithm.
In one embodiment, extracting the visual feature prototype of each category of the support set based on the pre-trained visual feature prototype extraction network comprises:
extracting the target visual features of each image of each category of the support set through the visual feature extraction network, splicing the target visual features corresponding to the images of each category along the channel dimension of the features, feeding the splicing result into the feature weight calculation network, and calculating the weight of each image;
and weighting the target visual features of each image according to its weight to obtain the visual feature prototype of each category of the support set.
In one embodiment, matching the dimensions of the first semantic feature specifically includes:
constructing a characteristic dimension matching network, wherein the characteristic dimension matching network is obtained by sequentially connecting a full connection layer and a deconvolution layer;
sending the first semantic features into a full-connection layer for depth matching to obtain first semantic sub-features matched with the depth of the visual feature prototype;
and sending the first semantic sub-features into a deconvolution layer for length matching to obtain second semantic features matched with the depth and the length of the visual feature prototype.
In one embodiment, the fusion weight is calculated from the distribution of the first semantic features of the support set as \( \lambda = h(f_{se,n}^{S}) \), where \( \lambda \) represents the fusion weight, \( f_{se,n}^{S} \) represents the first semantic feature, \( h \) represents a fully connected network, \( n \) represents the category index of each category of the support set, and \( S \) represents the support set.
In one embodiment, the fusion feature prototype is calculated as \( F_{n}^{S} = \lambda\, \tilde{f}_{se,n}^{S} + (1-\lambda)\, f_{v,n}^{S} \), where \( \lambda \) represents the fusion weight, \( \tilde{f}_{se,n}^{S} \) represents the second semantic feature, \( f_{v,n}^{S} \) represents the visual feature prototype, \( n \) represents the category index of each category of the support set, and \( S \) represents the support set.
In one embodiment, a splicing weight calculation network is constructed to process the visual features of the query set and the fusion feature prototypes of the support set, specifically: the visual features of the query set and the fusion feature prototypes of the support set are respectively fed into the splicing weight calculation network, which outputs a first weight factor corresponding to the query set and a second weight factor corresponding to the support set; the visual features of each image to be tested in the query set are extracted using a pre-trained convolutional neural network; the splicing weight calculation network is formed by sequentially connecting a first convolution module, a first 2×2 max-pooling layer, a second convolution module, a second 2×2 max-pooling layer, and a fully connected layer.
In one embodiment, splicing the image to be tested and the fusion feature prototypes of all categories of the support set along the channel dimension of the visual features according to the first weight factor and the second weight factor to obtain the splicing features specifically comprises:
determining a first ratio of the first weight factor to the sum of the first weight factor and the second weight factor, and a second ratio of the second weight factor to the sum of the first weight factor and the second weight factor;
and performing feature splicing on the fusion feature prototypes and the visual features along the channel direction according to the first ratio and the second ratio, respectively, to obtain the splicing features of all categories of the support set corresponding to the query set.
In a second aspect of the present application, there is provided a small sample image classification device based on a semantic adaptive fusion mechanism, including:
the first data module is used for acquiring an image data set and obtaining a support set and a query set with at least one category according to the image data set;
the second data module is used for acquiring the text description information of the images of each category of the support set;
the feature extraction module is used for extracting visual feature prototypes of each category of the support set and extracting first semantic features of the text description information;
the dimension matching module is used for matching the dimension of the first semantic feature to obtain a second semantic feature with the same dimensions as the visual feature prototype;
the feature fusion module is used for calculating fusion weight according to the distribution of the first semantic features of the text description information of the images of each category of the support set, and fusing the visual feature prototype and the second semantic features according to the fusion weight to obtain a fusion feature prototype;
the feature splicing module is used for extracting visual features of the images to be tested in the query set, constructing a splicing weight calculation network to process the visual features of the query set together with the fusion feature prototype of each category of the support set and output a first weight factor corresponding to the query set and a second weight factor corresponding to the support set, splicing the image to be tested with the fusion feature prototypes of all categories of the support set along the channel dimension of the visual features according to the first weight factor and the second weight factor to obtain splicing features, and splicing the splicing features corresponding to each category of the support set along the length direction to obtain a total splicing feature;
and the scoring classification module is used for scoring the total spliced characteristics by utilizing a pre-trained relational scoring network, and determining the classification result of the query set according to the scoring result.
Compared with the prior art, the application has the following beneficial effects:
1. The application realizes high-level multi-modal feature extraction for each category of images through visual feature prototype extraction and first semantic feature extraction. Dimension adaptation between the first semantic features and the visual feature prototypes solves the difficulty of fusing features of different dimensions. The adaptive convex combination of the second semantic features and the visual feature prototypes yields a fusion feature prototype with enhanced characterization, overcoming the weak feature characterization caused by direct feature fusion and realizing adaptive fusion enhancement between features. Weight-based splicing of support set and query set samples is provided, so that features can be spliced according to the importance of the sample distribution, weakening the influence of irrelevant features on the subsequent relation scoring network; finally, the relation scores between samples of the image dataset are calculated by the relation scoring network, and the highest score indicates the same category. In summary, for image classification under small-sample conditions, the classification method provided by the application uses the multi-modal information of the images to increase their characterization information, thereby improving classification accuracy.
2. Based on a convolutional-neural-network visual feature extraction network and a feature weight calculation network, the application computes the weight of each image through the feature weight calculation network and weights the target visual features of each image accordingly; larger weights are assigned to salient features, enriching the feature characterization of the visual features while retaining the dominant features.
3. The application designs a modal dimension matching method based on a fully connected layer and a deconvolution layer: the fully connected layer realizes depth matching between the second semantic features and the visual feature prototypes, and the deconvolution layer realizes length-width matching between them, solving the problem that features of different sizes cannot be fused.
4. Based on the distribution of the text description information corresponding to the first semantic features, the application assigns a fusion weight to each feature through a fully connected network, realizing adaptive fusion enhancement between the semantic features of the text description information and the visual features and strengthening the characterization of the image.
Drawings
The accompanying drawings, which are included to provide a further understanding of embodiments of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the principles of the application. In the drawings:
fig. 1 is a schematic flow chart of a small sample image classification method based on a semantic adaptive fusion mechanism according to an embodiment of the present application;
FIG. 2 is a block diagram of the visual feature prototype extraction network according to an embodiment of the present application;
FIG. 3 is a first semantic feature and visual feature prototype dimension adaptation flow chart provided by an embodiment of the present application;
fig. 4 is a schematic structural diagram of a splicing weight calculation network according to an embodiment of the present application;
fig. 5 is a block diagram of a small sample image classification device based on a semantic adaptive fusion mechanism according to an embodiment of the present application.
Detailed Description
For the purpose of making apparent the objects, technical solutions and advantages of the present application, the present application will be further described in detail with reference to the following examples and the accompanying drawings, wherein the exemplary embodiments of the present application and the descriptions thereof are for illustrating the present application only and are not to be construed as limiting the present application.
It should be appreciated that the terms "first", "second", and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined by "first" or "second" may explicitly or implicitly include one or more such features. In the description of the present application, "a plurality" means two or more, unless explicitly defined otherwise.
As described in the background, conventional classification methods are less accurate when classifying targets under small-sample conditions. In particular, they have the following disadvantages: first, the scarcity of samples makes target features difficult to characterize; second, fixed nearest-neighbor classifiers and linear classifiers hinder performance optimization; third, auxiliary information is introduced too directly, lacking targeted introduction adapted to the characteristics of different samples. Therefore, an embodiment of the application provides a small sample image classification method based on a semantic adaptive fusion mechanism, applied to a terminal device on which a small sample image classification device based on the semantic adaptive fusion mechanism runs. The classification device shown in fig. 5 acquires an image dataset and obtains a support set and a query set with at least one category from the dataset; acquires the text description information of the images of each category of the support set; extracts the visual feature prototype of each category of the support set and the first semantic features of the text description information; matches the dimension of the first semantic features to obtain second semantic features with the same dimensions as the visual feature prototype; calculates fusion weights according to the distribution of the first semantic features of the text description information of the images of each category of the support set, and fuses the visual feature prototype with the second semantic features according to the fusion weights to obtain a fusion feature prototype; extracts the visual features of the images to be tested in the query set, constructs a splicing weight calculation network that processes the visual features of the query set together with the fusion feature prototype of each category of the support set and outputs a first weight factor corresponding to the query set and a second weight factor corresponding to the support set, splices the image to be tested with the fusion feature prototypes of all categories of the support set along the channel dimension of the visual features according to the two weight factors to obtain splicing features, and splices the splicing features corresponding to each category of the support set along the length direction to obtain the total splicing feature; and finally scores the total splicing feature with a pre-trained relation scoring network and determines the classification result of the query set from the scores. Based on this working principle, the classification method provided by this embodiment acquires the semantic features of an image from its text description under small-sample conditions, introduces them into small sample image classification, and combines them with the visual features of the image, increasing the characterization information of the image and thereby improving classification accuracy.
By way of example, a personal computer, a tablet computer, or the like may serve as the terminal device. The terminal device may also be called user equipment; the system further comprises a cloud server, to which the terminal device is connected by wireless communication. The wireless communication means include, but are not limited to, Bluetooth, WiFi, ZigBee, GPRS, 3G, 4G, 5G, and WiMAX.
The cloud server is used as a collecting and distributing place of information and is used for receiving, processing and storing image information; and the user sends an information acquisition instruction to the cloud server through the terminal equipment, and the cloud server sends a classification result of image classification to the terminal equipment after receiving the information acquisition instruction. The terminal equipment receives the classification result of the image for the user to check.
Referring to fig. 1 in combination with the above implementation environment, fig. 1 is a schematic flow chart of the small sample image classification method based on the semantic adaptive fusion mechanism provided by an embodiment of the present application. The flow of the method is as follows:
s110, acquiring an image data set, and obtaining a support set and a query set with at least one category according to the image data set.
Specifically, the image dataset refers to a dataset generated by an image capturing device and includes images, graphics, photos, etc., all referred to as images for short; the image capturing device may be a device, component, or instrument that converts an optical image into digital data, such as a video camera, a CCD image array, or a CMOS image array. Further, a support set and a query set with at least one category are obtained from the image dataset, specifically: N categories of images are randomly selected from the image dataset, and K images and T images are randomly selected from each of the N categories as the support set and the query set, respectively. For example, N categories of images are first randomly selected from the training dataset, K images of each of the N categories are randomly selected as the support set, and T training images are randomly extracted from each category to form the query set. In this embodiment, the number N of categories in the support set is 10, the number K of training images per category is 10, and the number T of training images in the query set is 5. It is understood that the categories are different kinds of images.
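For illustration, a minimal Python sketch of such N-way, K-shot episode sampling is given below. The dataset layout (a dictionary mapping each class label to a list of image paths) and all names are assumptions made for this example, not details fixed by the application.

```python
import random

def sample_episode(dataset, n_way=10, k_shot=10, t_query=5):
    """Sample one few-shot episode: a support set and a query set.

    dataset: dict mapping class label -> list of image paths (assumed layout).
    Returns (support, query), each a list of (image_path, class_index) pairs.
    """
    classes = random.sample(list(dataset.keys()), n_way)        # N random categories
    support, query = [], []
    for idx, cls in enumerate(classes):
        chosen = random.sample(dataset[cls], k_shot + t_query)  # disjoint picks per class
        support += [(img, idx) for img in chosen[:k_shot]]      # K support images
        query += [(img, idx) for img in chosen[k_shot:]]        # T query images
    return support, query
```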
S120, acquiring text description information of the images of each category of the support set.
Specifically, the text description information refers to the target text description obtained from Wikipedia entries. Target semantic information extraction converts the text description of an image into a one-dimensional semantic feature vector; specifically, a corpus of object descriptions such as Wikipedia can be used for training to obtain the corresponding one-dimensional semantic feature vectors.
S130, extracting visual feature prototypes of each category of the support set, and extracting first semantic features of the text description information.
In this embodiment, the feature tensors of a plurality of images are first extracted through the convolutional neural network and used to form an effective visual feature prototype with richer target characterization. The feature tensors reflect visual features such as image contour and color, and the feature prototype characterizes the visual features of the images more completely, solving the problem of incomplete characterization of the visual feature prototype.
The first semantic features of the text description information are extracted with a word vector learning algorithm (the GloVe method), which obtains the semantic information of the images and yields the corresponding first semantic feature \( f_{se,n}^{S} \in \mathbb{R}^{d_S} \) for each category of images of the support set (where \( d_S \) is the depth of the first semantic feature and \( \mathbb{R} \) denotes the tensor space). Extracting the related semantic features of the images provides a multi-modal characterization beyond the visual features for image classification, enriching the feature information of the images and improving the classification effect. Correspondingly, besides the word vector learning algorithm, the semantic features of the text description information can also be extracted by one-hot encoding or TF-IDF; the specific extraction means are conventional for those skilled in the art, so no redundant description is given here.
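As an illustrative sketch of the word-vector approach, the following Python code averages pre-trained GloVe word vectors over a class description to form a one-dimensional first semantic feature. The vector file name and the averaging scheme are assumptions; the application only specifies that a word vector learning algorithm (the GloVe method) is used.

```python
import numpy as np

def load_glove(path="glove.6B.300d.txt"):
    """Load pre-trained GloVe word vectors from a text file (path assumed)."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def first_semantic_feature(description, vectors, d_s=300):
    """Average the word vectors of a class description into one d_S-dimensional
    semantic feature vector (averaging is one common choice, assumed here)."""
    words = [w for w in description.lower().split() if w in vectors]
    if not words:
        return np.zeros(d_s, dtype=np.float32)
    return np.mean([vectors[w] for w in words], axis=0)
```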
And S140, matching the dimension of the first semantic feature to obtain a second semantic feature which is the same as the dimension of the visual feature prototype.
In this embodiment, the first semantic feature extracted by the word vector learning algorithm is a one-dimensional vector feature and cannot be directly combined with the three-dimensional visual feature prototype. To solve the problem of matching the dimensions of the two, this embodiment provides a feature dimension matching method, which specifically comprises: constructing a feature dimension matching network formed by sequentially connecting a fully connected layer and a deconvolution layer; feeding the first semantic features into the fully connected layer for depth matching to obtain first semantic sub-features matched to the depth of the visual feature prototype; and feeding the first semantic sub-features into the deconvolution layer for length matching to obtain second semantic features matched to the depth and length of the visual feature prototype.
As shown in fig. 3, this embodiment proposes a feature dimension adaptation method based on a fully connected layer and a deconvolution layer (semantic feature dimension \( \mathbb{R}^{d_S} \) → visual feature tensor dimension \( \mathbb{R}^{n_f \times n_f \times d_f} \)), where \( d_f \) is the visual feature depth and \( n_f \) is the length and width of the visual features.
First, channel dimension matching, namely depth matching, is performed: the first semantic feature \( f_{se,n}^{S} \in \mathbb{R}^{d_S} \) is fed into the fully connected layer fc1 to obtain the first semantic sub-feature \( f_{se,n}^{\prime S} \in \mathbb{R}^{d_f} \).
Then spatial dimension matching, namely length matching, is performed: the first semantic sub-feature \( f_{se,n}^{\prime S} \) is fed into the deconvolution layer deconv to obtain the final second semantic feature \( \tilde{f}_{se,n}^{S} \in \mathbb{R}^{n_f \times n_f \times d_f} \). In summary, this embodiment matches the length and depth of the semantic features to those of the visual features through the fully connected layer and the deconvolution layer, solving the problem that features of different sizes cannot be fused.
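A minimal PyTorch sketch of this fc1 + deconv dimension adaptation might look as follows; the concrete values of d_S, d_f, and n_f, and the use of single layers, are assumptions for illustration.

```python
import torch.nn as nn

class DimensionMatcher(nn.Module):
    """Maps a first semantic feature in R^{d_S} to a second semantic feature
    in R^{n_f x n_f x d_f}, as described above (hyper-parameters assumed)."""

    def __init__(self, d_s=300, d_f=64, n_f=6):
        super().__init__()
        self.fc1 = nn.Linear(d_s, d_f)                               # depth matching
        self.deconv = nn.ConvTranspose2d(d_f, d_f, kernel_size=n_f)  # 1x1 -> n_f x n_f

    def forward(self, f_se):                   # f_se: (batch, d_S)
        f_sub = self.fc1(f_se)                 # (batch, d_f) first semantic sub-feature
        f_sub = f_sub.view(-1, f_sub.size(1), 1, 1)
        return self.deconv(f_sub)              # (batch, d_f, n_f, n_f) second semantic feature
```

A transposed convolution with kernel size n_f expands the 1×1 sub-feature to the n_f×n_f spatial size of the visual feature prototype.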
And S150, calculating fusion weights according to the distribution of the first semantic features of the text description information of the images of each category of the support set, and fusing the visual feature prototype and the second semantic features according to the fusion weights to obtain a fusion feature prototype.
In this embodiment, the second semantic features and the visual feature prototypes are fused according to the data distribution of the first semantic features themselves: fusion weights are generated, and the second semantic features and the visual feature prototypes are adaptively fused according to these weights to obtain fusion feature prototypes with enhanced image characterization, thereby improving the accuracy of image classification.
S160, extracting visual features of the images to be tested in the query set, constructing a splicing weight calculation network to process the visual features of the query set together with the fusion feature prototype of each category of the support set and output a first weight factor corresponding to the query set and a second weight factor corresponding to the support set, splicing the image to be tested with the fusion feature prototypes of all categories of the support set along the channel dimension of the visual features according to the first weight factor and the second weight factor to obtain splicing features, and splicing the splicing features corresponding to each category of the support set along the length direction to obtain a total splicing feature.
Specifically, the visual features of the images to be tested in the query set are extracted using a pre-trained convolutional neural network; alternatively, image feature extraction can be realized with an RNN or another deep learning model, which is common knowledge for those skilled in the art and is not described again here. In one embodiment, as shown in fig. 4, a splicing weight calculation network is constructed to process the visual features of the query set and the fusion feature prototypes of the support set, specifically: the visual features of the query set and the fusion feature prototypes of the support set are respectively fed into the splicing weight calculation network, which outputs the first weight factor corresponding to the query set and the second weight factor corresponding to the support set; the splicing weight calculation network is formed by sequentially connecting a first convolution module, a first 2×2 max-pooling layer, a second convolution module, a second 2×2 max-pooling layer, and a fully connected layer.
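A possible PyTorch sketch of such a splicing weight calculation network is shown below; the composition of each convolution module and the sigmoid that keeps the output weight factor positive are assumptions not fixed by the description.

```python
import torch
import torch.nn as nn

class SpliceWeightNet(nn.Module):
    """Two convolution modules, each followed by 2x2 max pooling, then a
    fully connected layer emitting one weight factor (structure as described;
    channel counts and activations are assumptions)."""

    def __init__(self, d_f=64, n_f=6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(d_f, d_f, 3, padding=1), nn.BatchNorm2d(d_f), nn.ReLU(),  # first convolution module
            nn.MaxPool2d(2),                                                    # first 2x2 max pooling
            nn.Conv2d(d_f, d_f, 3, padding=1), nn.BatchNorm2d(d_f), nn.ReLU(),  # second convolution module
            nn.MaxPool2d(2),                                                    # second 2x2 max pooling
        )
        self.fc = nn.Linear(d_f * (n_f // 4) ** 2, 1)

    def forward(self, x):                      # x: (batch, d_f, n_f, n_f)
        h = self.features(x).flatten(1)
        return torch.sigmoid(self.fc(h))       # weight factor in (0, 1)
```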
And S170, scoring the total spliced characteristics by using a pre-trained relation scoring network, and determining the classification result of the query set according to the scoring result.
In this embodiment, the relation scoring network χ calculates, for each image of the query set, a relation score with each target category of the support set; a higher score represents a higher probability that the query image and the support set category are of the same class. For the t-th query image, the total splicing feature \( Fconcat_t \) is fed into the relation scoring network to obtain the relation score \( S_t = \chi(Fconcat_t) \), \( S_t \in \mathbb{R}^{1 \times N} \). The category with the highest score is taken as the prediction, and a one-hot vector \( r \) is output. For training, the overall loss is defined as the error between the relation scores and the one-hot labels, e.g. the squared error \( L = \sum_{t=1}^{T} \lVert S_t - r_t \rVert_2^2 \); the parameters of the model χ, the fully connected layer h, and the other networks are continuously updated by back-propagation under this training loss, and the above process is repeated until the parameters of each network or module converge. The images to be tested are then input into the trained relation scoring network to obtain the relation scores between the image to be classified and each category of the existing sample images, and the category label with the highest relation score is output as the classification result of the image to be tested. It should be understood that the construction and training of the relation scoring network are prior art, so no redundant description is given here.
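The following PyTorch sketch illustrates one training step under this scheme; the squared-error form of the loss against the one-hot labels and the optimizer handling are assumptions consistent with, but not dictated by, the description above.

```python
import torch.nn.functional as F

def train_step(chi, optimizer, fconcat, labels, n_way):
    """One training step of the relation scoring network chi (sketch).

    fconcat: (T, C, H, W) total splicing features, one per query image;
    labels:  (T,) ground-truth category indices.
    """
    scores = chi(fconcat)                       # (T, N) relation scores S_t
    one_hot = F.one_hot(labels, n_way).float()  # one-hot targets r_t
    loss = ((scores - one_hot) ** 2).sum()      # L = sum_t ||S_t - r_t||^2
    optimizer.zero_grad()
    loss.backward()                             # back-propagate the training loss
    optimizer.step()                            # update chi and any chained modules
    return loss.item()

# At inference, the predicted category is simply the highest-scoring one:
# predictions = chi(fconcat).argmax(dim=1)
```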
In one embodiment, please refer to the structural block diagram of the visual feature prototype extraction network shown in fig. 2, which is used to extract visual feature prototypes of each category of the support set, specifically including:
constructing a visual feature prototype extraction network, wherein the visual feature prototype extraction network consists of a visual feature extraction network and a feature weight calculation network, the visual feature extraction network is a convolutional neural network, and the feature weight calculation network consists of a plurality of convolutional layers and a full connection layer; the visual feature prototype of each category of the support set is extracted based on the pre-trained visual feature prototype extraction network.
In this embodiment, the length and width of each image are first adjusted to 224×224 to obtain the support set images \( x_{n,k}^{S} \) and the query set images \( x_{t}^{Q} \), where x and y denote the image pixel position indexes. The visual features of the support set and query set images are respectively calculated as \( f_{v,n,k}^{S} = \varphi_v(x_{n,k}^{S}) \) and \( f_{t}^{Q} = \varphi_v(x_{t}^{Q}) \), where \( \varphi_v \) is the visual feature extraction network, \( n \in [1, N] \) is the category index over all categories of the support set, \( k \in [1, K] \) is the image index within each support set category, \( f_{v,n,k}^{S} \) is the visual feature obtained from the k-th image of the n-th category of the support set, \( t \in [1, T] \) is the index of the images in the query set, and \( f_{t}^{Q} \) is the visual feature obtained from the t-th image of the query set. All visual features lie in \( \mathbb{R}^{n_f \times n_f \times d_f} \), where \( n_f \) is the length and width of the visual features and \( d_f \) is the visual feature depth.
In one embodiment, extracting the visual feature prototype of each category of the support set based on the pre-trained visual feature prototype extraction network comprises:
extracting the target visual features of each image of each category of the support set through the visual feature extraction network, splicing the target visual features corresponding to the images of each category along the channel dimension of the features, feeding the splicing result into the feature weight calculation network, and calculating the weight of each image; and weighting the target visual features of each image according to its weight to obtain the visual feature prototype of each category of the support set.
In this embodiment, for the visual features obtained from the support set images, all the visual features of each category are spliced along the channel direction and fed into the feature weight calculation network φ, which calculates the weight of each image's features in the target prototype computation as follows: \( w_{n,k} = \varphi\big(\mathrm{concat}(f_{v,n,1}^{S}, \ldots, f_{v,n,K}^{S})\big) \), where \( w_{n,k} \) is the weight corresponding to the k-th image of the n-th category, and concat denotes splicing features along the channel direction.
The visual features of the K images are weighted by these feature weights to obtain the visual feature prototype of each category of the support set, \( f_{v,n}^{S} \), i.e. \( f_{v,n}^{S} = \sum_{k=1}^{K} w_{n,k}\, f_{v,n,k}^{S} \).
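A minimal PyTorch sketch of this weighted prototype computation is given below; the softmax normalization of the K weights and the exact interface of the feature weight calculation network are assumptions for illustration.

```python
import torch

def class_prototype(feats, weight_net):
    """Weighted visual feature prototype for one support set category (sketch).

    feats: (K, d_f, n_f, n_f) visual features of the K images of the category;
    weight_net: feature weight calculation network phi, assumed to map the
    channel-wise concatenation of the K features to K scalar weights.
    """
    k = feats.size(0)
    stacked = feats.reshape(1, -1, feats.size(2), feats.size(3))  # concat along channels
    w = torch.softmax(weight_net(stacked).view(k), dim=0)         # per-image weights w_{n,k}
    return (w.view(k, 1, 1, 1) * feats).sum(dim=0)                # weighted sum -> prototype
```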
In one embodiment, the fusion weight is calculated from the distribution of the first semantic features of the support set as \( \lambda = h(f_{se,n}^{S}) \), where \( \lambda \) represents the fusion weight, \( f_{se,n}^{S} \) represents the first semantic feature, \( h \) represents a fully connected network, \( n \) represents the category index of each category of the support set, and \( S \) represents the support set.
In a further embodiment, the fusion feature prototype is calculated as \( F_{n}^{S} = \lambda\, \tilde{f}_{se,n}^{S} + (1-\lambda)\, f_{v,n}^{S} \), where \( \lambda \) represents the fusion weight, \( \tilde{f}_{se,n}^{S} \) represents the second semantic feature, \( f_{v,n}^{S} \) represents the visual feature prototype, \( n \) represents the category index of each category of the support set, and \( S \) represents the support set.
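The adaptive convex combination could be sketched in PyTorch as follows; the sigmoid that constrains λ to (0, 1) and the single-layer form of the fully connected network h are assumptions.

```python
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """Computes lambda = h(f_se) and fuses the dimension-matched semantic
    feature with the visual prototype as a convex combination (sketch)."""

    def __init__(self, d_s=300):
        super().__init__()
        self.h = nn.Linear(d_s, 1)            # fully connected network h

    def forward(self, f_se, f_se_matched, f_v):
        # f_se: (d_S,) first semantic feature;
        # f_se_matched, f_v: (d_f, n_f, n_f) second semantic feature and visual prototype
        lam = torch.sigmoid(self.h(f_se))     # fusion weight lambda, kept in (0, 1) by assumption
        return lam * f_se_matched + (1.0 - lam) * f_v  # fusion feature prototype
```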
In one embodiment, the splicing features are obtained by splicing the image to be tested and the fusion feature prototypes of all categories of the support set along the channel dimension of the visual features according to the first weight factor and the second weight factor, specifically comprising:
determining a first ratio of the first weight factor to the sum of the first weight factor and the second weight factor, and a second ratio of the second weight factor to the sum of the first weight factor and the second weight factor;
and performing feature splicing on the fusion feature prototypes and the visual features along the channel direction according to the first ratio and the second ratio, respectively, to obtain the splicing features of all categories of the support set corresponding to the query set.
Specifically, in this embodiment, the first ratio and the second ratio are common knowledge, so no redundant explanation is given; feature splicing is performed on the fusion feature prototypes and the visual features according to the first and second ratios, and the splicing feature is calculated as \( Fconcat_{t,n} = \mathrm{concat}\big( \tfrac{a_1}{a_1+a_2}\, f_{t}^{Q},\; \tfrac{a_2}{a_1+a_2}\, F_{n}^{S} \big) \), where \( a_1 \) and \( a_2 \) denote the first and second weight factors, \( \tfrac{a_1}{a_1+a_2} \) is the first ratio, \( \tfrac{a_2}{a_1+a_2} \) is the second ratio, and \( f_{t}^{Q} \) denotes the target visual feature of the t-th image of the query set. The splicing features of all categories of the support set corresponding to the query image are then spliced along the length of the image feature tensor to obtain the total splicing feature \( Fconcat_t = \mathrm{concat}(Fconcat_{t,1}, \ldots, Fconcat_{t,N}) \).
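A minimal PyTorch sketch of this two-stage splicing is given below; the channels-first tensor layout and the containers holding the per-category weight factors a1 and a2 are illustrative assumptions.

```python
import torch

def total_splice(f_q, prototypes, a1, a2):
    """Weighted splicing of one query feature with all class prototypes (sketch).

    f_q: (d_f, n_f, n_f) visual feature of the query image;
    prototypes: list of N fusion feature prototypes, each (d_f, n_f, n_f);
    a1, a2: sequences of the first/second weight factors, one pair per category.
    """
    pieces = []
    for n, proto in enumerate(prototypes):
        r1 = a1[n] / (a1[n] + a2[n])          # first ratio (query share)
        r2 = a2[n] / (a1[n] + a2[n])          # second ratio (support share)
        pieces.append(torch.cat([r1 * f_q, r2 * proto], dim=0))  # splice along channels
    return torch.cat(pieces, dim=1)           # splice categories along the length direction
```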
Based on the same inventive concept, an embodiment of the application further provides a small sample image classification device based on a semantic adaptive fusion mechanism. For image classification under small-sample conditions, the classification device uses the multi-modal information of images to increase their characterization information, thereby improving classification accuracy. Referring to fig. 5, fig. 5 is a structural block diagram of the small sample image classification device based on the semantic adaptive fusion mechanism provided by the embodiment of the application, which comprises:
a first data module 510, configured to obtain an image dataset, and obtain a support set and a query set with at least one category according to the image dataset;
a second data module 520, configured to obtain text description information of the images of each category of the support set;
a feature extraction module 530, configured to extract visual feature prototypes for each category of the support set and to extract first semantic features of the textual description information;
the dimension matching module 540 is configured to match the dimension of the first semantic feature to obtain a second semantic feature with the same dimensions as the visual feature prototype;
the feature fusion module 550 is used for calculating fusion weights according to the distribution of the first semantic features of the text description information of the images of each category of the support set, and fusing the visual feature prototype and the second semantic features according to the fusion weights to obtain a fusion feature prototype;
the feature splicing module 560 is configured to extract visual features of the images to be tested in the query set, construct a splicing weight calculation network that processes the visual features of the query set together with the fusion feature prototype of each category of the support set and outputs a first weight factor corresponding to the query set and a second weight factor corresponding to the support set, splice the image to be tested with the fusion feature prototypes of all categories of the support set along the channel dimension of the visual features according to the first weight factor and the second weight factor to obtain splicing features, and splice the splicing features corresponding to each category of the support set along the length direction to obtain a total splicing feature;
the scoring module 570 is configured to score the total spliced features using a pre-trained relational scoring network, and determine a classification result of the query set according to the scoring result.
Therefore, the classification device provided by this embodiment realizes high-level multi-modal feature extraction for each category of images through visual feature prototype extraction and first semantic feature extraction; performs dimension adaptation between the first semantic features and the visual feature prototypes, solving the difficulty of fusing features of different dimensions; obtains a fusion feature prototype with enhanced characterization through the adaptive convex combination of the second semantic features and the visual feature prototypes, overcoming the weak feature characterization caused by direct feature fusion and realizing adaptive fusion enhancement between features; and provides weight-based splicing of support set and query set samples, so that features can be spliced according to the importance of the sample distribution, weakening the influence of irrelevant features on the subsequent relation scoring network. Finally, the relation scores between samples of the image dataset are calculated by the relation scoring network, and the highest score indicates the same category.
The embodiment of the application also discloses terminal equipment. The terminal device comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the steps in the small sample image classification method based on the semantic adaptive fusion mechanism described in the above embodiment when executing the computer program.
The terminal device comprises a processor, a memory, a communication interface, a display screen, and an input device connected through a system bus. The processor of the terminal device provides computing and control capabilities. The memory of the terminal device comprises a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system and a computer program, and the internal memory provides an environment for running them. The communication interface of the terminal device performs wired or wireless communication with an external terminal; the wireless mode can be realized through WiFi, an operator network, Near Field Communication (NFC), or other technologies. The display screen of the terminal device may be a liquid crystal display or an electronic ink display, and the input device may be a touch layer covering the display screen, keys, a trackball or a touchpad arranged on the housing of the terminal device, or an external keyboard, touchpad, or mouse.
It will be appreciated by those skilled in the art that the above structure of the terminal device is merely the portion related to the technical solution of the present application and does not limit the terminal device to which the technical solution is applied; a specific terminal device may include more or fewer components than described above, combine certain components, or have a different arrangement of components.
The embodiment of the application also discloses a computer readable storage medium. The computer readable storage medium stores a computer program, which when executed by a processor, implements the steps in the small sample image classification method based on the semantic adaptive fusion mechanism described in the foregoing embodiments.
The foregoing specific embodiments further describe the objects, technical solutions, and advantages of the application in detail. It should be understood that the foregoing is only a description of specific embodiments of the application and is not intended to limit the scope of protection of the application; any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the application shall be included in the scope of protection of the application.

Claims (10)

1. A small sample image classification method based on a semantic adaptive fusion mechanism, characterized by comprising the following steps:
acquiring an image data set, and acquiring a support set and a query set with at least one category according to the image data set;
acquiring text description information of images of each category of the support set;
extracting visual feature prototypes of each category of the support set and extracting first semantic features of the text description information;
matching the dimension of the first semantic feature to obtain a second semantic feature with the same dimensions as the visual feature prototype;
calculating fusion weights according to the distribution of the first semantic features of the text description information of the images of each category of the support set, and fusing the visual feature prototype and the second semantic features according to the fusion weights to obtain a fusion feature prototype;
extracting visual features of the images to be tested in the query set, constructing a splicing weight calculation network to process the visual features of the query set together with the fusion feature prototype of each category of the support set and output a first weight factor corresponding to the query set and a second weight factor corresponding to the support set, splicing the image to be tested with the fusion feature prototypes of all categories of the support set along the channel dimension of the visual features according to the first weight factor and the second weight factor to obtain splicing features, and splicing the splicing features corresponding to each category of the support set along the length direction to obtain a total splicing feature;
and scoring the total spliced characteristics by using a pre-trained relational scoring network, and determining the classification result of the query set according to the scoring result.
2. The small sample image classification method based on the semantic adaptive fusion mechanism according to claim 1, wherein a support set and a query set with at least one category are obtained according to an image dataset, specifically: and randomly selecting N types of images from the image dataset, and randomly selecting K images and T images from the images of each of the N types as a support set and a query set respectively.
3. The small sample image classification method based on the semantic adaptive fusion mechanism according to claim 1, wherein the extracting of the visual feature prototype of each category of the support set specifically comprises:
constructing a visual feature prototype extraction network, wherein the visual feature prototype extraction network consists of a visual feature extraction network and a feature weight calculation network, the visual feature extraction network is a convolutional neural network, and the feature weight calculation network consists of a plurality of convolutional layers and a full connection layer;
extracting visual feature prototypes of each category of the support set based on a pre-trained visual feature prototypes extraction network;
the first semantic feature of the text description information is extracted, specifically: and extracting the first semantic features of the text description information by using a word vector learning algorithm.
4. A small sample image classification method based on semantic adaptive fusion mechanism according to claim 3, wherein extracting a visual feature prototype of each category of the support set based on a pre-trained visual feature prototype extraction network comprises:
extracting the target visual features of each image of each category of the support set through the visual feature extraction network, splicing the target visual features corresponding to the images of each category along the channel dimension of the features, feeding the splicing result into the feature weight calculation network, and calculating the weight of each image;
and weighting the target visual features of each image according to its weight to obtain the visual feature prototype of each category of the support set.
5. The small sample image classification method based on the semantic adaptive fusion mechanism according to claim 1, wherein matching the dimension of the first semantic features specifically comprises:
constructing a feature dimension matching network by sequentially connecting a fully connected layer and a deconvolution layer;
sending the first semantic features into the fully connected layer for depth matching, to obtain first semantic sub-features matching the depth of the visual feature prototype;
and sending the first semantic sub-features into the deconvolution layer for length matching, to obtain second semantic features matching both the depth and the length of the visual feature prototype.
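A minimal sketch of this two-stage matching, with assumed dimensions:

```python
import torch
import torch.nn as nn

sem_dim, C, L = 300, 64, 16

depth_match = nn.Linear(sem_dim, C)                     # depth matching
length_match = nn.ConvTranspose1d(C, C, kernel_size=L)  # length matching (deconvolution)

def match_dims(e):
    """e: (sem_dim,) first semantic feature -> (C, L) second semantic feature."""
    sub = depth_match(e)                          # (C,) first semantic sub-feature
    return length_match(sub[None, :, None])[0]    # (C, L) second semantic feature

second = match_dims(torch.randn(sem_dim))
```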
6. The small sample image classification method based on the semantic adaptive fusion mechanism according to claim 1, wherein the formula for calculating the fusion weight from the distribution of the first semantic features of the support set is $\lambda = h(e_n^S)$, wherein $\lambda$ represents the fusion weight, $e_n^S$ represents the first semantic feature, $h$ represents a fully connected network, $n$ represents the category index of each category of the support set, and $S$ represents the support set.
7. The small sample image classification method based on the semantic adaptive fusion mechanism according to claim 1 or 6, wherein the formula for calculating the fusion feature prototype is $\hat{p}_n^S = \lambda\, p_n^S + (1 - \lambda)\, \tilde{e}_n^S$, wherein $\lambda$ represents the fusion weight, $\tilde{e}_n^S$ represents the second semantic feature, $p_n^S$ represents the visual feature prototype, $n$ represents the category index of each category of the support set, and $S$ represents the support set.
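Read together, claims 6 and 7 compute a convex combination gated by the semantics. A minimal sketch; the sigmoid squashing of h's output (to keep λ in (0, 1)) and all dimensions are assumptions:

```python
import torch
import torch.nn as nn

sem_dim, C, L = 300, 64, 16
h = nn.Sequential(nn.Linear(sem_dim, 1), nn.Sigmoid())  # stand-in for h

def fuse(e_first, e_second, proto):
    """e_first: (sem_dim,); e_second and proto: (C, L)."""
    lam = h(e_first)                            # fusion weight λ
    return lam * proto + (1 - lam) * e_second   # fusion feature prototype

fused = fuse(torch.randn(sem_dim), torch.randn(C, L), torch.randn(C, L))
```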
8. The small sample image classification method based on the semantic adaptive fusion mechanism according to claim 1, wherein constructing the splicing weight calculation network to calculate the weight factors of the visual features and the fusion feature prototypes specifically comprises:
respectively sending the visual features of the query set and the fusion feature prototype of the support set into the splicing weight calculation network, and outputting a first weight factor corresponding to the query set and a second weight factor corresponding to the support set; the visual features of each image to be detected in the query set are extracted with a pre-trained convolutional neural network; the splicing weight calculation network is formed by sequentially connecting a first convolution module, a first 2×2 max pooling layer, a second convolution module, a second 2×2 max pooling layer, and a fully connected layer.
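A sketch of that architecture with assumed channel counts; the query feature and each fusion feature prototype pass through the same network separately, each yielding one weight factor:

```python
import torch
import torch.nn as nn

C, L = 64, 16                                       # feature map treated as C x L

splice_weight_net = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),   # first convolution module
    nn.MaxPool2d(2),                                         # first 2x2 max pooling
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),  # second convolution module
    nn.MaxPool2d(2),                                         # second 2x2 max pooling
    nn.Flatten(), nn.Linear(32 * (C // 4) * (L // 4), 1),    # fully connected layer
    nn.Softplus())                                           # keep factors positive

w_q = splice_weight_net(torch.randn(1, 1, C, L))    # first weight factor (query)
w_s = splice_weight_net(torch.randn(1, 1, C, L))    # second weight factor (support)
```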
9. The small sample image classification method based on the semantic adaptive fusion mechanism according to claim 1, wherein splicing the visual features of the image to be detected and the fusion feature prototypes of each category of the support set along the channel dimension of the visual features according to the first weight factor and the second weight factor to obtain splicing features specifically comprises:
determining a first duty cycle, namely the ratio of the first weight factor to the sum of the first weight factor and the second weight factor, and a second duty cycle, namely the ratio of the second weight factor to that sum;
and splicing the fusion feature prototype and the visual features along the channel direction according to the first duty cycle and the second duty cycle, respectively, to obtain the splicing features of the query set with respect to each category of the support set.
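A worked numeric example of the duty cycles, with assumed weight factors:

```python
w_q, w_s = 3.0, 1.0                  # assumed first / second weight factors
r_q = w_q / (w_q + w_s)              # first duty cycle  = 0.75
r_s = w_s / (w_q + w_s)              # second duty cycle = 0.25
# The channel-wise splice then scales the query visual features by 0.75
# and the fusion feature prototype by 0.25 before concatenation.
```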
10. A small sample image classification device based on a semantic adaptive fusion mechanism, comprising:
the first data module is used for acquiring an image data set and obtaining a support set and a query set with at least one category according to the image data set;
the second data module is used for acquiring the text description information of the images of each category of the support set;
the feature extraction module is used for extracting visual feature prototypes of each category of the support set and extracting first semantic features of the text description information;
the dimension matching module is used for matching the dimension of the first semantic features to obtain second semantic features with the same dimensions as the visual feature prototype;
the feature fusion module is used for calculating a fusion weight according to the distribution of the first semantic features of the text description information of the images of each category of the support set, and fusing the visual feature prototype and the second semantic features according to the fusion weight to obtain a fusion feature prototype;
the feature splicing module is used for extracting visual features of the images to be detected in the query set, constructing a splicing weight calculation network that, from the visual features of the query set and the fusion feature prototype of each category of the support set, outputs a first weight factor corresponding to the query set and a second weight factor corresponding to the support set, splicing the visual features of the image to be detected and the fusion feature prototype of each category of the support set along the channel dimension of the visual features according to the first weight factor and the second weight factor to obtain splicing features, and splicing the splicing features corresponding to each category of the support set along the length direction to obtain a total splicing feature;
and the scoring classification module is used for scoring the total splicing feature with a pre-trained relation scoring network and determining the classification result of the query set according to the scoring result.
CN202310561130.1A 2023-05-17 2023-05-17 Small sample image classification method and device based on semantic self-adaptive fusion mechanism Pending CN116612324A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310561130.1A CN116612324A (en) 2023-05-17 2023-05-17 Small sample image classification method and device based on semantic self-adaptive fusion mechanism

Publications (1)

Publication Number Publication Date
CN116612324A true CN116612324A (en) 2023-08-18

Family

ID=87677567

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310561130.1A Pending CN116612324A (en) 2023-05-17 2023-05-17 Small sample image classification method and device based on semantic self-adaptive fusion mechanism

Country Status (1)

Country Link
CN (1) CN116612324A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116994076A (en) * 2023-09-28 2023-11-03 中国海洋大学 Small sample image recognition method based on double-branch mutual learning feature generation
CN116994076B (en) * 2023-09-28 2024-01-19 中国海洋大学 Small sample image recognition method based on double-branch mutual learning feature generation
CN117095187A (en) * 2023-10-16 2023-11-21 四川大学 Meta-learning visual language understanding and positioning method
CN117095187B (en) * 2023-10-16 2023-12-19 四川大学 Meta-learning visual language understanding and positioning method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination