CN116612324A - Small sample image classification method and device based on semantic adaptive fusion mechanism

Publication number: CN116612324A
Authority: CN (China)
Application number: CN202310561130.1A
Other languages: Chinese (zh)
Prior art keywords: feature, features, semantic, visual, support set
Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Inventors: 唐培人, 程旗, 高晓利, 李捷, 王维, 赵火军, 包庆红, 聂常赟
Assignee (original and current): Sichuan Jiuzhou Electric Group Co Ltd
Application filed by Sichuan Jiuzhou Electric Group Co Ltd
Priority: CN202310561130.1A


Classifications

    • G06V 10/765: Image or video recognition or understanding using pattern recognition or machine learning; classification using rules for classification or partitioning the feature space
    • G06N 3/0464: Computing arrangements based on biological models; neural networks; convolutional networks [CNN, ConvNet]
    • G06V 10/774: Processing image or video features in feature spaces; generating sets of training patterns, e.g. bagging or boosting
    • G06V 10/778: Processing image or video features in feature spaces; active pattern-learning, e.g. online learning of image or video features
    • G06V 10/806: Fusion, i.e. combining data from various sources, at the level of extracted features
    • G06V 10/82: Image or video recognition or understanding using neural networks
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The application discloses a small sample image classification method and device based on a semantic adaptive fusion mechanism, relating to the field of image recognition and classification. The technical scheme is as follows: high-level multi-modal features are extracted for each target category through object visual feature prototype extraction and semantic feature extraction; dimension adaptation between semantic features and visual features is provided, solving the fusion of modalities of different dimensions; through an adaptive convex combination of semantic and visual features, the modal features of different dimensions are effectively fused to obtain a multi-modal fusion feature prototype with enhanced characterization; and weight-based splicing of support set and query set samples is provided, so that features are spliced according to the importance of the sample distribution. Relation scores between samples are then obtained through a relation scoring network, and the highest score indicates the same category. For image classification under small-sample conditions, the method makes full use of multi-modal information, increases the target characterization information, and improves target classification accuracy.

Description

Small sample image classification method and device based on semantic adaptive fusion mechanism
Technical Field
The application relates to the field of image recognition and classification, in particular to a small sample image classification method and device based on a semantic adaptive fusion mechanism.
Background
Deep learning has been widely used in the fields of image recognition, speech recognition, chess playing, and the like. However, deep learning depends heavily on the amount of sample data: most deep learning algorithms require massive amounts of data for training and converge slowly, which hinders the further application of deep learning in the many scenarios where large numbers of samples are difficult to obtain.
Aiming at the problem of target identification under small-sample conditions, the prior art provides methods such as data enhancement, initialization-based optimization, and metric learning. Data enhancement expands the sample data volume to alleviate overfitting, including sample generation and feature hallucination using generative adversarial networks, but the generated samples and features are highly similar to the originals, so the target classification effect is difficult to improve effectively. Optimization-based methods follow the idea of meta-learning and aim to learn a meta-classifier that reaches good classification performance on a new task after parameter fine-tuning, but the trained model can only be pre-trained and migrated on fixed tasks. Graph-neural-network-based methods take each sample as a node and the similarity between samples as edges, and iterate the neural network model to compute the connection matrix of the graph, finally inferring the similarity between the sample to be identified and all support samples; however, the model training process consumes a large amount of memory, and the computation grows rapidly with the number of samples.
However, conventional classification methods are less accurate when classifying targets under small-sample conditions. In particular, they have the following disadvantages: first, the scarcity of samples makes target features difficult to characterize; second, fixed nearest-neighbor classifiers and linear classifiers hinder performance optimization; third, auxiliary information is introduced too directly, lacking targeted introduction adapted to the characteristics of different samples.
Disclosure of Invention
Aiming at the problem of poor target classification accuracy under small-sample conditions, the application provides a small sample image classification method and device based on a semantic adaptive fusion mechanism.
The technical aim of the application is achieved by the following technical scheme:
In a first aspect, the application provides a small sample image classification method based on a semantic adaptive fusion mechanism, comprising the following steps:
acquiring an image data set, and acquiring a support set and a query set with at least one category according to the image data set;
acquiring text description information of images of each category of the support set;
extracting visual feature prototypes of each category of the support set and extracting first semantic features of the text description information;
matching the dimension of the first semantic feature to obtain a second semantic feature with the same dimensions as the visual feature prototype;
calculating fusion weights according to the distribution of the first semantic features of the text description information of the images of each category of the support set, and fusing the visual feature prototype and the second semantic features according to the fusion weights to obtain a fusion feature prototype;
extracting visual features of the images to be tested in the query set, constructing a splicing weight calculation network to process the visual features of the query set together with the fusion feature prototype of each category of the support set and output a first weight factor corresponding to the query set and a second weight factor corresponding to the support set, splicing the image to be tested with the fusion feature prototypes of all categories of the support set along the channel dimension of the visual features according to the first weight factor and the second weight factor to obtain splicing features, and splicing the splicing features corresponding to each category of the support set along the length direction to obtain a total splicing feature;
and scoring the total spliced characteristics by using a pre-trained relational scoring network, and determining the classification result of the query set according to the scoring result.
In one embodiment, a support set and a query set having at least one category are obtained from an image dataset, specifically: and randomly selecting N types of images from the image dataset, and randomly selecting K images and T images from the images of each of the N types as a support set and a query set respectively.
In one embodiment, extracting visual feature prototypes for each category of the support set specifically includes:
constructing a visual feature prototype extraction network, wherein the visual feature prototype extraction network consists of a visual feature extraction network and a feature weight calculation network, the visual feature extraction network is a convolutional neural network, and the feature weight calculation network consists of a plurality of convolutional layers and a full connection layer;
extracting visual feature prototypes of each category of the support set based on a pre-trained visual feature prototypes extraction network;
the first semantic features of the text description information are extracted, specifically: the first semantic features of the text description information are extracted by using a word vector learning algorithm.
In one embodiment, extracting the visual feature prototype of each category of the support set based on the pre-trained visual feature prototype extraction network comprises:
extracting the target visual features of each image of each category of the support set through the visual feature extraction network, splicing the target visual features corresponding to the images of each category along the channel dimension of the features, feeding the splicing result into the feature weight calculation network, and calculating the weight of each image;
and weighting the target visual features of each image according to its weight to obtain the visual feature prototype of each category of the support set.
In one embodiment, matching the dimensions of the first semantic feature specifically includes:
constructing a characteristic dimension matching network, wherein the characteristic dimension matching network is obtained by sequentially connecting a full connection layer and a deconvolution layer;
sending the first semantic features into a full-connection layer for depth matching to obtain first semantic sub-features matched with the depth of the visual feature prototype;
and sending the first semantic sub-features into a deconvolution layer for length matching to obtain second semantic features matched with the depth and the length of the visual feature prototype.
In one embodiment, the fusion weight is calculated from the distribution of the first semantic features of the support set as \( \lambda = h(f_{se,n}^{S}) \), where \( \lambda \) represents the fusion weight, \( f_{se,n}^{S} \) represents the first semantic feature, \( h \) represents a fully connected network, \( n \) represents the category index of each category of the support set, and \( S \) represents the support set.
In one embodiment, the fusion feature prototype is calculated as \( F_{n}^{S} = \lambda\, \tilde{f}_{se,n}^{S} + (1-\lambda)\, f_{v,n}^{S} \), where \( \lambda \) represents the fusion weight, \( \tilde{f}_{se,n}^{S} \) represents the second semantic feature, \( f_{v,n}^{S} \) represents the visual feature prototype, \( n \) represents the category index of each category of the support set, and \( S \) represents the support set.
In one embodiment, a splicing weight calculation network is constructed to process the visual features of the query set and the fusion feature prototypes of the support set, specifically: the visual features of the query set and the fusion feature prototypes of the support set are respectively fed into the splicing weight calculation network, which outputs a first weight factor corresponding to the query set and a second weight factor corresponding to the support set; the visual features of each image to be tested in the query set are extracted using a pre-trained convolutional neural network; the splicing weight calculation network is formed by sequentially connecting a first convolution module, a first 2×2 max-pooling layer, a second convolution module, a second 2×2 max-pooling layer, and a fully connected layer.
In one embodiment, splicing the image to be tested and the fusion feature prototypes of all categories of the support set along the channel dimension of the visual features according to the first weight factor and the second weight factor to obtain the splicing features specifically comprises:
determining a first ratio of the first weight factor to the sum of the first weight factor and the second weight factor, and a second ratio of the second weight factor to the sum of the first weight factor and the second weight factor;
and performing feature splicing on the fusion feature prototypes and the visual features along the channel direction according to the first ratio and the second ratio, respectively, to obtain the splicing features of all categories of the support set corresponding to the query set.
In a second aspect of the present application, there is provided a small sample image classification device based on a semantic adaptive fusion mechanism, including:
the first data module is used for acquiring an image data set and obtaining a support set and a query set with at least one category according to the image data set;
the second data module is used for acquiring the text description information of the images of each category of the support set;
the feature extraction module is used for extracting visual feature prototypes of each category of the support set and extracting first semantic features of the text description information;
the dimension matching module is used for matching the dimension of the first semantic feature to obtain a second semantic feature with the same dimensions as the visual feature prototype;
the feature fusion module is used for calculating fusion weight according to the distribution of the first semantic features of the text description information of the images of each category of the support set, and fusing the visual feature prototype and the second semantic features according to the fusion weight to obtain a fusion feature prototype;
the feature splicing module is used for extracting visual features of the images to be tested in the query set, constructing a splicing weight calculation network to process the visual features of the query set together with the fusion feature prototype of each category of the support set and output a first weight factor corresponding to the query set and a second weight factor corresponding to the support set, splicing the image to be tested with the fusion feature prototypes of all categories of the support set along the channel dimension of the visual features according to the first weight factor and the second weight factor to obtain splicing features, and splicing the splicing features corresponding to each category of the support set along the length direction to obtain a total splicing feature;
and the scoring classification module is used for scoring the total spliced characteristics by utilizing a pre-trained relational scoring network, and determining the classification result of the query set according to the scoring result.
Compared with the prior art, the application has the following beneficial effects:
1. The application realizes high-level multi-modal feature extraction for each category of images through visual feature prototype extraction and first semantic feature extraction. Dimension adaptation between the first semantic features and the visual feature prototypes solves the difficulty of fusing features of different dimensions. The adaptive convex combination of the second semantic features and the visual feature prototypes yields a fusion feature prototype with enhanced characterization, overcoming the weak feature characterization caused by direct feature fusion and realizing adaptive fusion enhancement between features. Weight-based splicing of support set and query set samples is provided, so that features can be spliced according to the importance of the sample distribution, weakening the influence of irrelevant features on the subsequent relation scoring network; finally, the relation scores between samples of the image dataset are calculated by the relation scoring network, and the highest score indicates the same category. In summary, for image classification under small-sample conditions, the classification method provided by the application uses the multi-modal information of the images to increase their characterization information, thereby improving classification accuracy.
2. Based on a convolutional-neural-network visual feature extraction network and a feature weight calculation network, the application computes the weight of each image through the feature weight calculation network and weights the target visual features of each image accordingly; larger weights are assigned to salient features, enriching the feature characterization of the visual features while retaining the dominant features.
3. The application designs a modal dimension matching method based on a fully connected layer and a deconvolution layer: the fully connected layer realizes depth matching between the second semantic features and the visual feature prototypes, and the deconvolution layer realizes length-width matching between them, solving the problem that features of different sizes cannot be fused.
4. Based on the distribution of the text description information corresponding to the first semantic features, the application assigns a fusion weight to each feature through a fully connected network, realizing adaptive fusion enhancement between the semantic features of the text description information and the visual features and strengthening the characterization of the image.
Drawings
The accompanying drawings, which are included to provide a further understanding of embodiments of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the principles of the application. In the drawings:
fig. 1 is a schematic flow chart of a small sample image classification method based on a semantic adaptive fusion mechanism according to an embodiment of the present application;
FIG. 2 is a block diagram of the visual feature prototype extraction network according to an embodiment of the present application;
FIG. 3 is a first semantic feature and visual feature prototype dimension adaptation flow chart provided by an embodiment of the present application;
fig. 4 is a schematic structural diagram of a splicing weight calculation network according to an embodiment of the present application;
fig. 5 is a block diagram of a small sample image classification device based on a semantic adaptive fusion mechanism according to an embodiment of the present application.
Detailed Description
For the purpose of making apparent the objects, technical solutions and advantages of the present application, the present application will be further described in detail with reference to the following examples and the accompanying drawings, wherein the exemplary embodiments of the present application and the descriptions thereof are for illustrating the present application only and are not to be construed as limiting the present application.
It should be appreciated that the terms "first", "second", and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined by "first" or "second" may explicitly or implicitly include one or more such features. In the description of the present application, "a plurality" means two or more, unless explicitly defined otherwise.
As described in the background, conventional classification methods are less accurate when classifying targets under small-sample conditions. In particular, they have the following disadvantages: first, the scarcity of samples makes target features difficult to characterize; second, fixed nearest-neighbor classifiers and linear classifiers hinder performance optimization; third, auxiliary information is introduced too directly, lacking targeted introduction adapted to the characteristics of different samples. Therefore, an embodiment of the application provides a small sample image classification method based on a semantic adaptive fusion mechanism, applied to a terminal device on which a small sample image classification device based on the semantic adaptive fusion mechanism runs. The classification device shown in fig. 5 acquires an image dataset and obtains a support set and a query set with at least one category from the dataset; acquires the text description information of the images of each category of the support set; extracts the visual feature prototype of each category of the support set and the first semantic features of the text description information; matches the dimension of the first semantic features to obtain second semantic features with the same dimensions as the visual feature prototype; calculates fusion weights according to the distribution of the first semantic features of the text description information of the images of each category of the support set, and fuses the visual feature prototype with the second semantic features according to the fusion weights to obtain a fusion feature prototype; extracts the visual features of the images to be tested in the query set, constructs a splicing weight calculation network that processes the visual features of the query set together with the fusion feature prototype of each category of the support set and outputs a first weight factor corresponding to the query set and a second weight factor corresponding to the support set, splices the image to be tested with the fusion feature prototypes of all categories of the support set along the channel dimension of the visual features according to the two weight factors to obtain splicing features, and splices the splicing features corresponding to each category of the support set along the length direction to obtain the total splicing feature; and finally scores the total splicing feature with a pre-trained relation scoring network and determines the classification result of the query set from the scores. Based on this working principle, the classification method provided by this embodiment acquires the semantic features of an image from its text description under small-sample conditions, introduces them into small sample image classification, and combines them with the visual features of the image, increasing the characterization information of the image and thereby improving classification accuracy.
By way of example, a personal computer, a tablet computer, or the like may serve as the terminal device. The terminal device may also be called user equipment; the system further comprises a cloud server, to which the terminal device is connected by wireless communication. The wireless communication means include, but are not limited to, Bluetooth, WiFi, ZigBee, GPRS, 3G, 4G, 5G, and WiMAX.
The cloud server is used as a collecting and distributing place of information and is used for receiving, processing and storing image information; and the user sends an information acquisition instruction to the cloud server through the terminal equipment, and the cloud server sends a classification result of image classification to the terminal equipment after receiving the information acquisition instruction. The terminal equipment receives the classification result of the image for the user to check.
Referring to fig. 1 in combination with the above implementation environment, fig. 1 is a schematic flow chart of the small sample image classification method based on the semantic adaptive fusion mechanism provided by an embodiment of the present application. The flow of the method is as follows:
s110, acquiring an image data set, and obtaining a support set and a query set with at least one category according to the image data set.
Specifically, the image dataset refers to a dataset generated by an image capturing device and includes images, graphics, photos, etc., all referred to as images for short; the image capturing device may be a device, component, or instrument that converts an optical image into digital data, such as a video camera, a CCD image array, or a CMOS image array. Further, a support set and a query set with at least one category are obtained from the image dataset, specifically: N categories of images are randomly selected from the image dataset, and K images and T images are randomly selected from each of the N categories as the support set and the query set, respectively. For example, N categories of images are first randomly selected from the training dataset, K images of each of the N categories are randomly selected as the support set, and T training images are randomly extracted from each category to form the query set. In this embodiment, the number N of categories in the support set is 10, the number K of training images per category is 10, and the number T of training images in the query set is 5. It is understood that the categories are different kinds of images.
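For illustration, a minimal Python sketch of such N-way, K-shot episode sampling is given below. The dataset layout (a dictionary mapping each class label to a list of image paths) and all names are assumptions made for this example, not details fixed by the application.

```python
import random

def sample_episode(dataset, n_way=10, k_shot=10, t_query=5):
    """Sample one few-shot episode: a support set and a query set.

    dataset: dict mapping class label -> list of image paths (assumed layout).
    Returns (support, query), each a list of (image_path, class_index) pairs.
    """
    classes = random.sample(list(dataset.keys()), n_way)        # N random categories
    support, query = [], []
    for idx, cls in enumerate(classes):
        chosen = random.sample(dataset[cls], k_shot + t_query)  # disjoint picks per class
        support += [(img, idx) for img in chosen[:k_shot]]      # K support images
        query += [(img, idx) for img in chosen[k_shot:]]        # T query images
    return support, query
```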
S120, acquiring text description information of the images of each category of the support set.
Specifically, the text description information refers to the target text description obtained from Wikipedia entries. Target semantic information extraction converts the text description of an image into a one-dimensional semantic feature vector; specifically, a corpus of object descriptions such as Wikipedia can be used for training to obtain the corresponding one-dimensional semantic feature vectors.
S130, extracting visual feature prototypes of each category of the support set, and extracting first semantic features of the text description information.
In this embodiment, the feature tensors of a plurality of images are first extracted through the convolutional neural network and used to form an effective visual feature prototype with richer target characterization. The feature tensors reflect visual features such as image contour and color, and the feature prototype characterizes the visual features of the images more completely, solving the problem of incomplete characterization of the visual feature prototype.
The first semantic features of the text description information are extracted with a word vector learning algorithm (the GloVe method), which obtains the semantic information of the images and yields the corresponding first semantic feature \( f_{se,n}^{S} \in \mathbb{R}^{d_S} \) for each category of images of the support set (where \( d_S \) is the depth of the first semantic feature and \( \mathbb{R} \) denotes the tensor space). Extracting the related semantic features of the images provides a multi-modal characterization beyond the visual features for image classification, enriching the feature information of the images and improving the classification effect. Correspondingly, besides the word vector learning algorithm, the semantic features of the text description information can also be extracted by one-hot encoding or TF-IDF; the specific extraction means are conventional for those skilled in the art, so no redundant description is given here.
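As an illustrative sketch of the word-vector approach, the following Python code averages pre-trained GloVe word vectors over a class description to form a one-dimensional first semantic feature. The vector file name and the averaging scheme are assumptions; the application only specifies that a word vector learning algorithm (the GloVe method) is used.

```python
import numpy as np

def load_glove(path="glove.6B.300d.txt"):
    """Load pre-trained GloVe word vectors from a text file (path assumed)."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def first_semantic_feature(description, vectors, d_s=300):
    """Average the word vectors of a class description into one d_S-dimensional
    semantic feature vector (averaging is one common choice, assumed here)."""
    words = [w for w in description.lower().split() if w in vectors]
    if not words:
        return np.zeros(d_s, dtype=np.float32)
    return np.mean([vectors[w] for w in words], axis=0)
```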
And S140, matching the dimension of the first semantic feature to obtain a second semantic feature which is the same as the dimension of the visual feature prototype.
In this embodiment, the first semantic feature extracted by the word vector learning algorithm is a one-dimensional vector feature and cannot be directly combined with the three-dimensional visual feature prototype. To solve the problem of matching the dimensions of the two, this embodiment provides a feature dimension matching method, which specifically comprises: constructing a feature dimension matching network formed by sequentially connecting a fully connected layer and a deconvolution layer; feeding the first semantic features into the fully connected layer for depth matching to obtain first semantic sub-features matched to the depth of the visual feature prototype; and feeding the first semantic sub-features into the deconvolution layer for length matching to obtain second semantic features matched to the depth and length of the visual feature prototype.
As shown in fig. 3, this embodiment proposes a feature dimension adaptation method based on a fully connected layer and a deconvolution layer (semantic feature dimension \( \mathbb{R}^{d_S} \) → visual feature tensor dimension \( \mathbb{R}^{n_f \times n_f \times d_f} \)), where \( d_f \) is the visual feature depth and \( n_f \) is the length and width of the visual features.
First, channel dimension matching, namely depth matching, is performed: the first semantic feature \( f_{se,n}^{S} \in \mathbb{R}^{d_S} \) is fed into the fully connected layer fc1 to obtain the first semantic sub-feature \( f_{se,n}^{\prime S} \in \mathbb{R}^{d_f} \).
Then spatial dimension matching, namely length matching, is performed: the first semantic sub-feature \( f_{se,n}^{\prime S} \) is fed into the deconvolution layer deconv to obtain the final second semantic feature \( \tilde{f}_{se,n}^{S} \in \mathbb{R}^{n_f \times n_f \times d_f} \). In summary, this embodiment matches the length and depth of the semantic features to those of the visual features through the fully connected layer and the deconvolution layer, solving the problem that features of different sizes cannot be fused.
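A minimal PyTorch sketch of this fc1 + deconv dimension adaptation might look as follows; the concrete values of d_S, d_f, and n_f, and the use of single layers, are assumptions for illustration.

```python
import torch.nn as nn

class DimensionMatcher(nn.Module):
    """Maps a first semantic feature in R^{d_S} to a second semantic feature
    in R^{n_f x n_f x d_f}, as described above (hyper-parameters assumed)."""

    def __init__(self, d_s=300, d_f=64, n_f=6):
        super().__init__()
        self.fc1 = nn.Linear(d_s, d_f)                               # depth matching
        self.deconv = nn.ConvTranspose2d(d_f, d_f, kernel_size=n_f)  # 1x1 -> n_f x n_f

    def forward(self, f_se):                   # f_se: (batch, d_S)
        f_sub = self.fc1(f_se)                 # (batch, d_f) first semantic sub-feature
        f_sub = f_sub.view(-1, f_sub.size(1), 1, 1)
        return self.deconv(f_sub)              # (batch, d_f, n_f, n_f) second semantic feature
```

A transposed convolution with kernel size n_f expands the 1×1 sub-feature to the n_f×n_f spatial size of the visual feature prototype.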
And S150, calculating fusion weights according to the distribution of the first semantic features of the text description information of the images of each category of the support set, and fusing the visual feature prototype and the second semantic features according to the fusion weights to obtain a fusion feature prototype.
In this embodiment, the second semantic features and the visual feature prototypes are fused according to the data distribution of the first semantic features themselves: fusion weights are generated, and the second semantic features and the visual feature prototypes are adaptively fused according to these weights to obtain fusion feature prototypes with enhanced image characterization, thereby improving the accuracy of image classification.
S160, extracting visual features of the images to be tested in the query set, constructing a splicing weight calculation network to process the visual features of the query set together with the fusion feature prototype of each category of the support set and output a first weight factor corresponding to the query set and a second weight factor corresponding to the support set, splicing the image to be tested with the fusion feature prototypes of all categories of the support set along the channel dimension of the visual features according to the first weight factor and the second weight factor to obtain splicing features, and splicing the splicing features corresponding to each category of the support set along the length direction to obtain a total splicing feature.
Specifically, the visual features of the images to be tested in the query set are extracted using a pre-trained convolutional neural network; alternatively, image feature extraction can be realized with an RNN or another deep learning model, which is common knowledge for those skilled in the art and is not described again here. In one embodiment, as shown in fig. 4, a splicing weight calculation network is constructed to process the visual features of the query set and the fusion feature prototypes of the support set, specifically: the visual features of the query set and the fusion feature prototypes of the support set are respectively fed into the splicing weight calculation network, which outputs the first weight factor corresponding to the query set and the second weight factor corresponding to the support set; the splicing weight calculation network is formed by sequentially connecting a first convolution module, a first 2×2 max-pooling layer, a second convolution module, a second 2×2 max-pooling layer, and a fully connected layer.
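A possible PyTorch sketch of such a splicing weight calculation network is shown below; the composition of each convolution module and the sigmoid that keeps the output weight factor positive are assumptions not fixed by the description.

```python
import torch
import torch.nn as nn

class SpliceWeightNet(nn.Module):
    """Two convolution modules, each followed by 2x2 max pooling, then a
    fully connected layer emitting one weight factor (structure as described;
    channel counts and activations are assumptions)."""

    def __init__(self, d_f=64, n_f=6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(d_f, d_f, 3, padding=1), nn.BatchNorm2d(d_f), nn.ReLU(),  # first convolution module
            nn.MaxPool2d(2),                                                    # first 2x2 max pooling
            nn.Conv2d(d_f, d_f, 3, padding=1), nn.BatchNorm2d(d_f), nn.ReLU(),  # second convolution module
            nn.MaxPool2d(2),                                                    # second 2x2 max pooling
        )
        self.fc = nn.Linear(d_f * (n_f // 4) ** 2, 1)

    def forward(self, x):                      # x: (batch, d_f, n_f, n_f)
        h = self.features(x).flatten(1)
        return torch.sigmoid(self.fc(h))       # weight factor in (0, 1)
```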
And S170, scoring the total spliced characteristics by using a pre-trained relation scoring network, and determining the classification result of the query set according to the scoring result.
In this embodiment, the relation scoring network χ calculates, for each image of the query set, a relation score with each target category of the support set; a higher score represents a higher probability that the query image and the support set category are of the same class. For the t-th query image, the total splicing feature \( Fconcat_t \) is fed into the relation scoring network to obtain the relation score \( S_t = \chi(Fconcat_t) \), \( S_t \in \mathbb{R}^{1 \times N} \). The category with the highest score is taken as the prediction, and a one-hot vector \( r \) is output. For training, the overall loss is defined as the error between the relation scores and the one-hot labels, e.g. the squared error \( L = \sum_{t=1}^{T} \lVert S_t - r_t \rVert_2^2 \); the parameters of the model χ, the fully connected layer h, and the other networks are continuously updated by back-propagation under this training loss, and the above process is repeated until the parameters of each network or module converge. The images to be tested are then input into the trained relation scoring network to obtain the relation scores between the image to be classified and each category of the existing sample images, and the category label with the highest relation score is output as the classification result of the image to be tested. It should be understood that the construction and training of the relation scoring network are prior art, so no redundant description is given here.
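The following PyTorch sketch illustrates one training step under this scheme; the squared-error form of the loss against the one-hot labels and the optimizer handling are assumptions consistent with, but not dictated by, the description above.

```python
import torch.nn.functional as F

def train_step(chi, optimizer, fconcat, labels, n_way):
    """One training step of the relation scoring network chi (sketch).

    fconcat: (T, C, H, W) total splicing features, one per query image;
    labels:  (T,) ground-truth category indices.
    """
    scores = chi(fconcat)                       # (T, N) relation scores S_t
    one_hot = F.one_hot(labels, n_way).float()  # one-hot targets r_t
    loss = ((scores - one_hot) ** 2).sum()      # L = sum_t ||S_t - r_t||^2
    optimizer.zero_grad()
    loss.backward()                             # back-propagate the training loss
    optimizer.step()                            # update chi and any chained modules
    return loss.item()

# At inference, the predicted category is simply the highest-scoring one:
# predictions = chi(fconcat).argmax(dim=1)
```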
In one embodiment, please refer to the structural block diagram of the visual feature prototype extraction network shown in fig. 2, which is used to extract visual feature prototypes of each category of the support set, specifically including:
constructing a visual feature prototype extraction network, wherein the visual feature prototype extraction network consists of a visual feature extraction network and a feature weight calculation network, the visual feature extraction network is a convolutional neural network, and the feature weight calculation network consists of a plurality of convolutional layers and a full connection layer; the visual feature prototype of each category of the support set is extracted based on the pre-trained visual feature prototype extraction network.
In this embodiment, the length and width of each image are first adjusted to 224×224 to obtain the support set images \( x_{n,k}^{S} \) and the query set images \( x_{t}^{Q} \), where x and y denote the image pixel position indexes. The visual features of the support set and query set images are respectively calculated as \( f_{v,n,k}^{S} = \varphi_v(x_{n,k}^{S}) \) and \( f_{t}^{Q} = \varphi_v(x_{t}^{Q}) \), where \( \varphi_v \) is the visual feature extraction network, \( n \in [1, N] \) is the category index over all categories of the support set, \( k \in [1, K] \) is the image index within each support set category, \( f_{v,n,k}^{S} \) is the visual feature obtained from the k-th image of the n-th category of the support set, \( t \in [1, T] \) is the index of the images in the query set, and \( f_{t}^{Q} \) is the visual feature obtained from the t-th image of the query set. All visual features lie in \( \mathbb{R}^{n_f \times n_f \times d_f} \), where \( n_f \) is the length and width of the visual features and \( d_f \) is the visual feature depth.
In one embodiment, extracting the visual feature prototype of each category of the support set based on the pre-trained visual feature prototype extraction network comprises:
extracting the target visual features of each image of each category of the support set through the visual feature extraction network, splicing the target visual features corresponding to the images of each category along the channel dimension of the features, feeding the splicing result into the feature weight calculation network, and calculating the weight of each image; and weighting the target visual features of each image according to its weight to obtain the visual feature prototype of each category of the support set.
In this embodiment, for the visual features obtained from the support set images, all the visual features of each category are spliced along the channel direction and fed into the feature weight calculation network φ, which calculates the weight of each image's features in the target prototype computation as follows: \( w_{n,k} = \varphi\big(\mathrm{concat}(f_{v,n,1}^{S}, \ldots, f_{v,n,K}^{S})\big) \), where \( w_{n,k} \) is the weight corresponding to the k-th image of the n-th category, and concat denotes splicing features along the channel direction.
The visual features of the K images are weighted by these feature weights to obtain the visual feature prototype of each category of the support set, \( f_{v,n}^{S} \), i.e. \( f_{v,n}^{S} = \sum_{k=1}^{K} w_{n,k}\, f_{v,n,k}^{S} \).
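A minimal PyTorch sketch of this weighted prototype computation is given below; the softmax normalization of the K weights and the exact interface of the feature weight calculation network are assumptions for illustration.

```python
import torch

def class_prototype(feats, weight_net):
    """Weighted visual feature prototype for one support set category (sketch).

    feats: (K, d_f, n_f, n_f) visual features of the K images of the category;
    weight_net: feature weight calculation network phi, assumed to map the
    channel-wise concatenation of the K features to K scalar weights.
    """
    k = feats.size(0)
    stacked = feats.reshape(1, -1, feats.size(2), feats.size(3))  # concat along channels
    w = torch.softmax(weight_net(stacked).view(k), dim=0)         # per-image weights w_{n,k}
    return (w.view(k, 1, 1, 1) * feats).sum(dim=0)                # weighted sum -> prototype
```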
In one embodiment, the fusion weight is calculated from the distribution of the first semantic features of the support set as \( \lambda = h(f_{se,n}^{S}) \), where \( \lambda \) represents the fusion weight, \( f_{se,n}^{S} \) represents the first semantic feature, \( h \) represents a fully connected network, \( n \) represents the category index of each category of the support set, and \( S \) represents the support set.
In a further embodiment, the fusion feature prototype is calculated as \( F_{n}^{S} = \lambda\, \tilde{f}_{se,n}^{S} + (1-\lambda)\, f_{v,n}^{S} \), where \( \lambda \) represents the fusion weight, \( \tilde{f}_{se,n}^{S} \) represents the second semantic feature, \( f_{v,n}^{S} \) represents the visual feature prototype, \( n \) represents the category index of each category of the support set, and \( S \) represents the support set.
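The adaptive convex combination could be sketched in PyTorch as follows; the sigmoid that constrains λ to (0, 1) and the single-layer form of the fully connected network h are assumptions.

```python
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """Computes lambda = h(f_se) and fuses the dimension-matched semantic
    feature with the visual prototype as a convex combination (sketch)."""

    def __init__(self, d_s=300):
        super().__init__()
        self.h = nn.Linear(d_s, 1)            # fully connected network h

    def forward(self, f_se, f_se_matched, f_v):
        # f_se: (d_S,) first semantic feature;
        # f_se_matched, f_v: (d_f, n_f, n_f) second semantic feature and visual prototype
        lam = torch.sigmoid(self.h(f_se))     # fusion weight lambda, kept in (0, 1) by assumption
        return lam * f_se_matched + (1.0 - lam) * f_v  # fusion feature prototype
```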
In one embodiment, the splicing features are obtained by splicing the image to be tested and the fusion feature prototypes of all categories of the support set along the channel dimension of the visual features according to the first weight factor and the second weight factor, specifically comprising:
determining a first ratio of the first weight factor to the sum of the first weight factor and the second weight factor, and a second ratio of the second weight factor to the sum of the first weight factor and the second weight factor;
and performing feature splicing on the fusion feature prototypes and the visual features along the channel direction according to the first ratio and the second ratio, respectively, to obtain the splicing features of all categories of the support set corresponding to the query set.
Specifically, in this embodiment, the first ratio and the second ratio are common knowledge, so no redundant explanation is given; feature splicing is performed on the fusion feature prototypes and the visual features according to the first and second ratios, and the splicing feature is calculated as \( Fconcat_{t,n} = \mathrm{concat}\big( \tfrac{a_1}{a_1+a_2}\, f_{t}^{Q},\; \tfrac{a_2}{a_1+a_2}\, F_{n}^{S} \big) \), where \( a_1 \) and \( a_2 \) denote the first and second weight factors, \( \tfrac{a_1}{a_1+a_2} \) is the first ratio, \( \tfrac{a_2}{a_1+a_2} \) is the second ratio, and \( f_{t}^{Q} \) denotes the target visual feature of the t-th image of the query set. The splicing features of all categories of the support set corresponding to the query image are then spliced along the length of the image feature tensor to obtain the total splicing feature \( Fconcat_t = \mathrm{concat}(Fconcat_{t,1}, \ldots, Fconcat_{t,N}) \).
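A minimal PyTorch sketch of this two-stage splicing is given below; the channels-first tensor layout and the containers holding the per-category weight factors a1 and a2 are illustrative assumptions.

```python
import torch

def total_splice(f_q, prototypes, a1, a2):
    """Weighted splicing of one query feature with all class prototypes (sketch).

    f_q: (d_f, n_f, n_f) visual feature of the query image;
    prototypes: list of N fusion feature prototypes, each (d_f, n_f, n_f);
    a1, a2: sequences of the first/second weight factors, one pair per category.
    """
    pieces = []
    for n, proto in enumerate(prototypes):
        r1 = a1[n] / (a1[n] + a2[n])          # first ratio (query share)
        r2 = a2[n] / (a1[n] + a2[n])          # second ratio (support share)
        pieces.append(torch.cat([r1 * f_q, r2 * proto], dim=0))  # splice along channels
    return torch.cat(pieces, dim=1)           # splice categories along the length direction
```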
Based on the same inventive concept, an embodiment of the application further provides a small sample image classification device based on a semantic adaptive fusion mechanism. For image classification under small-sample conditions, the classification device uses the multi-modal information of images to increase their characterization information, thereby improving classification accuracy. Referring to fig. 5, fig. 5 is a structural block diagram of the small sample image classification device based on the semantic adaptive fusion mechanism provided by the embodiment of the application, which comprises:
a first data module 510, configured to obtain an image dataset, and obtain a support set and a query set with at least one category according to the image dataset;
a second data module 520, configured to obtain text description information of the images of each category of the support set;
a feature extraction module 530, configured to extract visual feature prototypes for each category of the support set and to extract first semantic features of the textual description information;
the dimension matching module 540 is configured to match the dimension of the first semantic feature to obtain a second semantic feature with the same dimensions as the visual feature prototype;
the feature fusion module 550 is used for calculating fusion weights according to the distribution of the first semantic features of the text description information of the images of each category of the support set, and fusing the visual feature prototype and the second semantic features according to the fusion weights to obtain a fusion feature prototype;
the feature splicing module 560 is configured to extract visual features of the images to be tested in the query set, construct a splicing weight calculation network that processes the visual features of the query set together with the fusion feature prototype of each category of the support set and outputs a first weight factor corresponding to the query set and a second weight factor corresponding to the support set, splice the image to be tested with the fusion feature prototypes of all categories of the support set along the channel dimension of the visual features according to the first weight factor and the second weight factor to obtain splicing features, and splice the splicing features corresponding to each category of the support set along the length direction to obtain a total splicing feature;
the scoring module 570 is configured to score the total spliced features using a pre-trained relational scoring network, and determine a classification result of the query set according to the scoring result.
Therefore, the classification device provided by this embodiment realizes high-level multi-modal feature extraction for each category of images through visual feature prototype extraction and first semantic feature extraction; performs dimension adaptation between the first semantic features and the visual feature prototypes, solving the difficulty of fusing features of different dimensions; obtains a fusion feature prototype with enhanced characterization through the adaptive convex combination of the second semantic features and the visual feature prototypes, overcoming the weak feature characterization caused by direct feature fusion and realizing adaptive fusion enhancement between features; and provides weight-based splicing of support set and query set samples, so that features can be spliced according to the importance of the sample distribution, weakening the influence of irrelevant features on the subsequent relation scoring network. Finally, the relation scores between samples of the image dataset are calculated by the relation scoring network, and the highest score indicates the same category.
The embodiment of the application also discloses terminal equipment. The terminal device comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the steps in the small sample image classification method based on the semantic adaptive fusion mechanism described in the above embodiment when executing the computer program.
The terminal device comprises a processor, a memory, a communication interface, a display screen, and an input device connected through a system bus. The processor of the terminal device provides computing and control capabilities. The memory of the terminal device comprises a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system and a computer program, and the internal memory provides an environment for running them. The communication interface of the terminal device performs wired or wireless communication with an external terminal; the wireless mode can be realized through WiFi, an operator network, Near Field Communication (NFC), or other technologies. The display screen of the terminal device may be a liquid crystal display or an electronic ink display, and the input device may be a touch layer covering the display screen, keys, a trackball or a touchpad arranged on the housing of the terminal device, or an external keyboard, touchpad, or mouse.
It will be appreciated by those skilled in the art that the above structure of the terminal device is merely the portion related to the technical solution of the present application and does not limit the terminal device to which the technical solution is applied; a specific terminal device may include more or fewer components than described above, combine certain components, or have a different arrangement of components.
The embodiment of the application also discloses a computer readable storage medium. The computer readable storage medium stores a computer program, which when executed by a processor, implements the steps in the small sample image classification method based on the semantic adaptive fusion mechanism described in the foregoing embodiments.
The foregoing specific embodiments further describe the objects, technical solutions, and advantages of the application in detail. It should be understood that the foregoing is only a description of specific embodiments of the application and is not intended to limit the scope of protection of the application; any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the application shall be included in the scope of protection of the application.

Claims (10)

1. A small sample image classification method based on a semantic adaptive fusion mechanism, characterized by comprising the following steps:
acquiring an image data set, and acquiring a support set and a query set with at least one category according to the image data set;
acquiring text description information of images of each category of the support set;
extracting visual feature prototypes of each category of the support set and extracting first semantic features of the text description information;
matching the dimension of the first semantic feature to obtain a second semantic feature with the same dimensions as the visual feature prototype;
calculating fusion weights according to the distribution of the first semantic features of the text description information of the images of each category of the support set, and fusing the visual feature prototype and the second semantic features according to the fusion weights to obtain a fusion feature prototype;
extracting visual features of the images to be tested in the query set, constructing a splicing weight calculation network to process the visual features of the query set together with the fusion feature prototype of each category of the support set and output a first weight factor corresponding to the query set and a second weight factor corresponding to the support set, splicing the image to be tested with the fusion feature prototypes of all categories of the support set along the channel dimension of the visual features according to the first weight factor and the second weight factor to obtain splicing features, and splicing the splicing features corresponding to each category of the support set along the length direction to obtain a total splicing feature;
and scoring the total spliced characteristics by using a pre-trained relational scoring network, and determining the classification result of the query set according to the scoring result.
2. The small sample image classification method based on the semantic adaptive fusion mechanism according to claim 1, wherein a support set and a query set with at least one category are obtained according to an image dataset, specifically: and randomly selecting N types of images from the image dataset, and randomly selecting K images and T images from the images of each of the N types as a support set and a query set respectively.
3. The small sample image classification method based on the semantic adaptive fusion mechanism according to claim 1, wherein the extracting of the visual feature prototype of each category of the support set specifically comprises:
constructing a visual feature prototype extraction network, wherein the visual feature prototype extraction network consists of a visual feature extraction network and a feature weight calculation network, the visual feature extraction network is a convolutional neural network, and the feature weight calculation network consists of a plurality of convolutional layers and a full connection layer;
extracting visual feature prototypes of each category of the support set based on a pre-trained visual feature prototypes extraction network;
the first semantic feature of the text description information is extracted, specifically: and extracting the first semantic features of the text description information by using a word vector learning algorithm.
4. A small sample image classification method based on semantic adaptive fusion mechanism according to claim 3, wherein extracting a visual feature prototype of each category of the support set based on a pre-trained visual feature prototype extraction network comprises:
extracting the target visual features of each image of each category of the support set through the visual feature extraction network, splicing the target visual features corresponding to the images of each category along the channel dimension of the features, feeding the splicing result into the feature weight calculation network, and calculating the weight of each image;
and weighting the target visual features of each image according to its weight to obtain the visual feature prototype of each category of the support set.
5. The small sample image classification method based on the semantic adaptive fusion mechanism according to claim 1, wherein matching the dimension of the first semantic features specifically comprises:
constructing a feature dimension matching network by sequentially connecting a fully connected layer and a deconvolution layer;
sending the first semantic features into the fully connected layer for depth matching, to obtain first semantic sub-features matching the depth of the visual feature prototype;
and sending the first semantic sub-features into the deconvolution layer for length matching, to obtain second semantic features matching both the depth and the length of the visual feature prototype.
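A minimal sketch of this two-stage matching, with assumed dimensions:

```python
import torch
import torch.nn as nn

sem_dim, C, L = 300, 64, 16

depth_match = nn.Linear(sem_dim, C)                     # depth matching
length_match = nn.ConvTranspose1d(C, C, kernel_size=L)  # length matching (deconvolution)

def match_dims(e):
    """e: (sem_dim,) first semantic feature -> (C, L) second semantic feature."""
    sub = depth_match(e)                          # (C,) first semantic sub-feature
    return length_match(sub[None, :, None])[0]    # (C, L) second semantic feature

second = match_dims(torch.randn(sem_dim))
```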
6. The small sample image classification method based on the semantic adaptive fusion mechanism according to claim 1, wherein the formula for calculating the fusion weight from the distribution of the first semantic features of the support set is $\lambda = h(e_n^S)$, wherein $\lambda$ represents the fusion weight, $e_n^S$ represents the first semantic feature, $h$ represents a fully connected network, $n$ represents the category index of each category of the support set, and $S$ represents the support set.
7. The small sample image classification method based on the semantic adaptive fusion mechanism according to claim 1 or 6, wherein the formula for calculating the fusion feature prototype is $\hat{p}_n^S = \lambda\, p_n^S + (1 - \lambda)\, \tilde{e}_n^S$, wherein $\lambda$ represents the fusion weight, $\tilde{e}_n^S$ represents the second semantic feature, $p_n^S$ represents the visual feature prototype, $n$ represents the category index of each category of the support set, and $S$ represents the support set.
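Read together, claims 6 and 7 compute a convex combination gated by the semantics. A minimal sketch; the sigmoid squashing of h's output (to keep λ in (0, 1)) and all dimensions are assumptions:

```python
import torch
import torch.nn as nn

sem_dim, C, L = 300, 64, 16
h = nn.Sequential(nn.Linear(sem_dim, 1), nn.Sigmoid())  # stand-in for h

def fuse(e_first, e_second, proto):
    """e_first: (sem_dim,); e_second and proto: (C, L)."""
    lam = h(e_first)                            # fusion weight λ
    return lam * proto + (1 - lam) * e_second   # fusion feature prototype

fused = fuse(torch.randn(sem_dim), torch.randn(C, L), torch.randn(C, L))
```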
8. The small sample image classification method based on the semantic adaptive fusion mechanism according to claim 1, wherein constructing the splicing weight calculation network to calculate the weight factors of the visual features and the fusion feature prototypes specifically comprises:
respectively sending the visual features of the query set and the fusion feature prototype of the support set into the splicing weight calculation network, and outputting a first weight factor corresponding to the query set and a second weight factor corresponding to the support set; the visual features of each image to be detected in the query set are extracted with a pre-trained convolutional neural network; the splicing weight calculation network is formed by sequentially connecting a first convolution module, a first 2×2 max pooling layer, a second convolution module, a second 2×2 max pooling layer, and a fully connected layer.
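A sketch of that architecture with assumed channel counts; the query feature and each fusion feature prototype pass through the same network separately, each yielding one weight factor:

```python
import torch
import torch.nn as nn

C, L = 64, 16                                       # feature map treated as C x L

splice_weight_net = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),   # first convolution module
    nn.MaxPool2d(2),                                         # first 2x2 max pooling
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),  # second convolution module
    nn.MaxPool2d(2),                                         # second 2x2 max pooling
    nn.Flatten(), nn.Linear(32 * (C // 4) * (L // 4), 1),    # fully connected layer
    nn.Softplus())                                           # keep factors positive

w_q = splice_weight_net(torch.randn(1, 1, C, L))    # first weight factor (query)
w_s = splice_weight_net(torch.randn(1, 1, C, L))    # second weight factor (support)
```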
9. The small sample image classification method based on the semantic adaptive fusion mechanism according to claim 1, wherein splicing the visual features of the image to be detected and the fusion feature prototypes of each category of the support set along the channel dimension of the visual features according to the first weight factor and the second weight factor to obtain splicing features specifically comprises:
determining a first duty cycle, namely the ratio of the first weight factor to the sum of the first weight factor and the second weight factor, and a second duty cycle, namely the ratio of the second weight factor to that sum;
and splicing the fusion feature prototype and the visual features along the channel direction according to the first duty cycle and the second duty cycle, respectively, to obtain the splicing features of the query set with respect to each category of the support set.
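A worked numeric example of the duty cycles, with assumed weight factors:

```python
w_q, w_s = 3.0, 1.0                  # assumed first / second weight factors
r_q = w_q / (w_q + w_s)              # first duty cycle  = 0.75
r_s = w_s / (w_q + w_s)              # second duty cycle = 0.25
# The channel-wise splice then scales the query visual features by 0.75
# and the fusion feature prototype by 0.25 before concatenation.
```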
10. A small sample image classification device based on a semantic adaptive fusion mechanism, comprising:
the first data module is used for acquiring an image data set and obtaining a support set and a query set with at least one category according to the image data set;
the second data module is used for acquiring the text description information of the images of each category of the support set;
the feature extraction module is used for extracting visual feature prototypes of each category of the support set and extracting first semantic features of the text description information;
the dimension matching module is used for matching the dimension of the first semantic features to obtain second semantic features with the same dimensions as the visual feature prototype;
the feature fusion module is used for calculating a fusion weight according to the distribution of the first semantic features of the text description information of the images of each category of the support set, and fusing the visual feature prototype and the second semantic features according to the fusion weight to obtain a fusion feature prototype;
the feature splicing module is used for extracting visual features of the images to be detected in the query set, constructing a splicing weight calculation network that, from the visual features of the query set and the fusion feature prototype of each category of the support set, outputs a first weight factor corresponding to the query set and a second weight factor corresponding to the support set, splicing the visual features of the image to be detected and the fusion feature prototype of each category of the support set along the channel dimension of the visual features according to the first weight factor and the second weight factor to obtain splicing features, and splicing the splicing features corresponding to each category of the support set along the length direction to obtain a total splicing feature;
and the scoring classification module is used for scoring the total splicing feature with a pre-trained relation scoring network and determining the classification result of the query set according to the scoring result.
CN202310561130.1A 2023-05-17 2023-05-17 Small sample image classification method and device based on semantic self-adaptive fusion mechanism Pending CN116612324A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310561130.1A CN116612324A (en) 2023-05-17 2023-05-17 Small sample image classification method and device based on semantic self-adaptive fusion mechanism

Publications (1)

Publication Number Publication Date
CN116612324A true CN116612324A (en) 2023-08-18

Family

ID=87677567

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310561130.1A Pending CN116612324A (en) 2023-05-17 2023-05-17 Small sample image classification method and device based on semantic self-adaptive fusion mechanism

Country Status (1)

Country Link
CN (1) CN116612324A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116994076A (en) * 2023-09-28 2023-11-03 中国海洋大学 Small sample image recognition method based on double-branch mutual learning feature generation
CN116994076B (en) * 2023-09-28 2024-01-19 中国海洋大学 Small sample image recognition method based on double-branch mutual learning feature generation
CN117095187A (en) * 2023-10-16 2023-11-21 四川大学 Meta-learning visual language understanding and positioning method
CN117095187B (en) * 2023-10-16 2023-12-19 四川大学 Meta-learning visual language understanding and positioning method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination