CN112801159B - Zero/few-shot machine learning method and system fusing images and their text descriptions


Info

Publication number
CN112801159B
CN112801159B (granted from application CN202110083109.6A; earlier application publication CN112801159A)
Authority
CN
China
Prior art keywords
image
mapping
category
text
encoder
Prior art date
Legal status
Active
Application number
CN202110083109.6A
Other languages
Chinese (zh)
Other versions
CN112801159A (en)
Inventor
黄健
潘崇煜
刘权
郝建国
张中杰
龚建兴
Current Assignee
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date: 2021-01-21
Filing date: 2021-01-21
Publication date: 2022-07-19
Application filed by National University of Defense Technology on 2021-01-21
Priority to CN202110083109.6A
Publication of CN112801159A: 2021-05-14
Application granted; publication of CN112801159B: 2022-07-19
Legal status: Active

Classifications

    • G06F18/253 Fusion techniques of extracted features (pattern recognition)
    • G06F18/214 Generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/24 Classification techniques
    • G06N3/04 Neural network architecture, e.g. interconnection topology
    • G06N3/084 Learning by backpropagation, e.g. using gradient descent

Abstract

The invention discloses a zero/few-shot machine learning method and system that fuse images with their text descriptions, comprising an image convolutional encoder, a text mapping encoder, and a relation metric network. Starting from the fusion of image features with mapped text features, the invention can rapidly recognize and classify data of new categories given only a few labeled samples or only a category text description. It performs well in zero/few-shot learning experiments, with high recognition accuracy and good learning and generalization ability.

Description

Zero/few-shot machine learning method and system fusing images and their text descriptions
Technical Field
The invention relates to weakly supervised machine learning under small-sample conditions in the field of artificial intelligence, and in particular to a zero/few-shot machine learning method and system that fuse images with their text descriptions.
Background
In recent years, with the continuous development of deep learning, supervised learning trained on large-scale labeled data has achieved remarkable results. In fields such as economics, defense, and medicine, however, large-scale labeled data are hard to obtain: manual annotation is time-consuming and labor-intensive, and in many cases large amounts of labeled data simply do not exist. At the same time, for a new concept for which abundant labeled data cannot be collected, some prior descriptive information can often be obtained in advance, such as attribute information of the new category, a text description, or even a category name carrying semantic information. Classifying and recognizing images of new category concepts given only a few labeled samples or a textual feature description therefore becomes an effective way to bring deep learning into practical applications.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: aiming at the above problems of the prior art, the invention provides a zero/few-shot machine learning method and system that fuse images with their text descriptions. Starting from the fusion of image features and mapped text features, and targeting the typical application of image recognition and classification given only a few labeled samples or a text description, the invention adopts image convolutional encoding, text mapping encoding, and relation-network learning. It can rapidly recognize and classify data of new categories given only a few labeled samples or only a category text description, and exhibits good performance, high recognition accuracy, and strong learning and generalization ability.
In order to solve the technical problems, the invention adopts the following technical scheme:
A zero/few-shot machine learning method fusing images and their text descriptions comprises the following steps:
1) For each image sample and its category text description vector (I_i^t, u_i^t) in the support set D_t^support: compute the image feature vector x_i^t = Φ(I_i^t; θ*_CNN) of the image sample I_i^t with the trained image convolutional encoder, where θ*_CNN denotes the trained image convolutional encoder parameters; compute the mapped feature vector s_i^t = Ψ(u_i^t; θ*_Mapping) of the category text description vector u_i^t with the trained text mapping encoder, where θ*_Mapping denotes the trained text mapping encoder parameters.
2) For the k-th category image feature vectors x_{k,1}, x_{k,2}, …, x_{k,N} in the support set D_t^support, take their mean as the category image feature center point x_k; based on the category image feature center point x_k and the mapped feature vector s_k of the category, compute the category feature characterization vector v_k, where k = 1, 2, …, M, M is the total number of categories in the support set D_t^support, and N is the number of samples per category in D_t^support.
3) For a sample to be identified I^query in the test set D_t^query, first compute its image feature vector x^query = Φ(I^query; θ*_CNN) with the trained image convolutional encoder; then compute, with the trained relation metric network, the similarity between the image feature vector x^query and each category feature characterization vector v_k in the support sample set:

Re(x^query, v_k) = φ(x^query, v_k; θ*_Relation), k = 1, 2, …, M

where θ*_Relation denotes the trained relation metric network parameters.
4) Among the M categories of the support set D_t^support, select the category with the largest similarity Re(x^query, v_k) as the predicted category label y^query of the sample to be identified.
Optionally, the category image feature center point x_k in step 2) is computed as:

x_k = (1/N) Σ_{i=1}^{N} x_{k,i}

where x_{k,i} is the i-th image feature vector of the k-th category and N is the number of samples per category in the support set D_t^support.
Optionally, the category feature characterization vector v_k in step 2) is computed as:

v_k = (x_k + s_k) / 2

where x_k is the image feature center point of the k-th category and s_k is the mapped feature vector of the k-th category.
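By way of illustration only (this sketch does not form part of the claimed method), the computations of step 2) can be written in PyTorch-style Python. The tensor layout and the equal-weight fusion are assumptions consistent with the formulas above:

```python
import torch

def class_prototypes(support_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
    """Fuse per-class image feature centers with mapped text features.

    support_feats: (M, N, d) image features x_{k,i} from the convolutional encoder
    text_feats:    (M, d)    mapped text features s_k from the text mapping encoder
    returns:       (M, d)    category feature characterization vectors v_k
    """
    x_k = support_feats.mean(dim=1)   # category image feature center points x_k
    return 0.5 * (x_k + text_feats)   # v_k = (x_k + s_k) / 2 (equal weights assumed)
```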
Optionally, the image convolutional encoder is a convolutional neural network encoder comprising 4 sequentially connected convolutional layers; each convolutional layer has 64 channels with 3×3 convolution kernels followed by batch normalization and a ReLU nonlinearity, and 2×2 max pooling is used to downsample the feature map after each of the first two convolutional layers.
Optionally, the text mapping encoder is a fully connected neural network model comprising an input layer, a hidden layer, and an output layer; the hidden layer uses ReLU as its nonlinearity and its dimension is set equal to the output dimension of the image convolutional encoder, while the output layer dimension is kept consistent with the image feature vector dimension. The fully connected model uses a pruning (dropout) operation during training; in the testing stage, all fully connected parameters participate in the prediction of the sample to be identified.
Optionally, the relation metric network comprises 2 convolutional layers and 2 fully connected layers. It concatenates the two center-point feature vectors of a support sample and a sample under test, passes them sequentially through the 2 convolutional layers and the 2 fully connected layers, and finally outputs the similarity of the two center-point feature vectors. The convolutional layers are set to 64 channels with 3×3 kernels, and each convolution is followed by batch normalization, a ReLU nonlinearity, and a 2×2 max pooling dimension-reduction operation.
Optionally, step 1) is preceded by the steps of training the image convolutional encoder, the text mapping encoder, and the relation metric network:
S1) Sample a batch training set (D_s^train, D_s^val) from the training set D_s in M-way N-shot Q-query fashion. For each image sample I_i^s therein, compute the image feature vector x_i^s = Φ(I_i^s; θ_CNN) with the image convolutional encoder, where θ_CNN denotes the image convolutional encoder parameters; for the category text description vector u_i^s corresponding to image sample I_i^s, compute its mapped feature vector s_i^s = Ψ(u_i^s; θ_Mapping) with the text mapping encoder, where θ_Mapping denotes the text mapping encoder parameters.
S2) For the k-th (k = 1, 2, …, M) category image feature vectors x_{k,1}, x_{k,2}, …, x_{k,N} in the meta-training set D_s^train, take their mean as the category image feature center point x_k; then, based on x_k and the mapped feature vector s_k of the category, compute the category feature characterization vector v_k. Each category contains N image samples, but samples of the same category share the same category description vector, which the text mapping encoder encodes into the mapped feature vector of that category.
S3) For any sample of the meta-test set D_s^val = {(x_i^s, s_i^s), i = 1, 2, …, M*Q}, compute through the relation network the similarity between the image feature x_i^s and each category feature characterization vector v_k:

Re(x_i^s, v_k) = φ(x_i^s, v_k; θ_Relation), k = 1, 2, …, M

where θ_Relation denotes the relation metric network parameters, M is the number of categories per batch, and Q is the number of samples per category in the meta-test set D_s^val of each batch of data in the training phase.
S4) Compute the forward error L = L_1 + λL_2 generated by each meta-test sample x_i^s, where L_1 = MSE(Re(x_i^s, v_k), l(x_i^s, v_k)) is the classification loss, the label l(x_i^s, v_k) ∈ {0, 1} indicating whether the two samples x_i^s and v_k belong to the same category, and L_2 = MSE(x_i^s, s_i^s) is the matching loss representing the error between the image feature space and the semantic mapping space; MSE is a mean squared error function. On this basis, compute the accumulated error over the M*Q meta-test samples as the error of the training batch, and update the image convolutional encoder parameters θ_CNN, the text mapping encoder parameters θ_Mapping, and the relation metric network parameters θ_Relation by error backpropagation and gradient descent.
S5) Judge whether a preset ending condition is reached. If so, end the training, record the finally obtained image convolutional encoder parameters θ_CNN, text mapping encoder parameters θ_Mapping, and relation metric network parameters θ_Relation, and exit; otherwise jump to step S1) and continue training.
Optionally, the preset ending condition in step S5) is that the number of iterations reaches a preset upper limit or the classification loss reaches an error loss lower limit.
In addition, the invention also provides a zero/few-shot machine learning device fusing images and their text descriptions, comprising a microprocessor and a memory connected with each other, wherein the microprocessor is programmed or configured to execute the steps of the aforementioned zero/few-shot machine learning method fusing images and their text descriptions.
Additionally, the present invention also provides a computer-readable storage medium having stored therein a computer program programmed or configured to execute the aforementioned zero/few-shot machine learning method fusing images and their text descriptions.
Compared with the prior art, the invention has the following advantages: oriented to zero/few-shot machine learning that fuses images with their text descriptions, the method starts from the fusion of image features and mapped text features and, targeting the typical application of image recognition and classification given only a few labeled samples or a text description, adopts image convolutional encoding, text mapping encoding, and relation-network learning. It can rapidly recognize and classify data of new categories given only a few labeled samples or only a category text description, showing good performance in zero/few-shot learning experiments, high recognition accuracy, and strong learning and generalization ability.
Drawings
FIG. 1 is a schematic diagram of a test/application flow of a method according to an embodiment of the present invention.
Fig. 2 is a schematic flow chart of a network architecture according to an embodiment of the present invention.
FIG. 3 is a schematic structural diagram of an image convolution encoder according to an embodiment of the present invention.
FIG. 4 is a block diagram of a text mapping encoder according to an embodiment of the present invention.
Fig. 5 is a schematic structural diagram of a relationship metric network according to an embodiment of the present invention.
Fig. 6 is a schematic diagram of a network training process in the embodiment of the present invention.
FIG. 7 is a diagram illustrating test results in an embodiment of the present invention.
Detailed Description
As shown in fig. 1 and fig. 2, the zero/few-shot machine learning method fusing images and their text descriptions in this embodiment comprises:
1) For each image sample and its category text description vector (I_i^t, u_i^t) in the support set D_t^support: compute the image feature vector x_i^t = Φ(I_i^t; θ*_CNN) of the image sample I_i^t with the trained image convolutional encoder, where θ*_CNN denotes the trained image convolutional encoder parameters; compute the mapped feature vector s_i^t = Ψ(u_i^t; θ*_Mapping) of the category text description vector u_i^t with the trained text mapping encoder, where θ*_Mapping denotes the trained text mapping encoder parameters.
2) For the k-th category image feature vectors x_{k,1}, x_{k,2}, …, x_{k,N} in the support set D_t^support, take their mean as the category image feature center point x_k; based on x_k and the mapped feature vector s_k of the category, compute the category feature characterization vector v_k, where k = 1, 2, …, M, M is the total number of categories in D_t^support, and N is the number of samples per category in D_t^support.
3) For a sample to be identified I^query in the test set D_t^query, first compute its image feature vector x^query = Φ(I^query; θ*_CNN) with the trained image convolutional encoder; then compute, with the trained relation metric network, the similarity between x^query and each category feature characterization vector v_k in the support sample set:

Re(x^query, v_k) = φ(x^query, v_k; θ*_Relation), k = 1, 2, …, M

where θ*_Relation denotes the trained relation metric network parameters.
4) Among the M categories of the support set D_t^support, select the category with the largest similarity Re(x^query, v_k) as the predicted category label y^query of the sample to be identified, which can be expressed as:

y^query = argmax_{k=1,…,M} Re(x^query, v_k)

thus realizing image recognition of the sample to be identified.
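For illustration, steps 1) to 4) of this embodiment can be sketched as follows in PyTorch-style Python. The module interfaces (cnn, mapper, relation), the tensor shapes, and the equal-weight fusion are assumptions, not a verbatim implementation of the patent:

```python
import torch

def predict_query(query_img, support_imgs, class_texts, cnn, mapper, relation):
    """Predict the class of one query image from an M-way N-shot support set.

    query_img:    (3, 84, 84)       sample to be identified I^query
    support_imgs: (M, N, 3, 84, 84) labeled support images
    class_texts:  (M, t)            category text description vectors
    cnn, mapper:  trained encoders Φ and Ψ; relation: trained metric φ
    """
    M, N = support_imgs.shape[:2]
    with torch.no_grad():
        x = cnn(support_imgs.flatten(0, 1)).view(M, N, -1)  # features x_{k,i}
        v = 0.5 * (x.mean(dim=1) + mapper(class_texts))     # prototypes v_k
        x_q = cnn(query_img.unsqueeze(0))                   # x^query, shape (1, d)
        scores = relation(x_q.expand(M, -1), v)             # Re(x^query, v_k)
    return int(scores.argmax())                             # predicted label y^query
```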
In this embodiment, the category image feature center point x_k in step 2) is computed as:

x_k = (1/N) Σ_{i=1}^{N} x_{k,i}

where x_{k,i} is the i-th image feature vector of the k-th category and N is the number of samples per category in the support set D_t^support.
In this embodiment, the category feature characterization vector v_k in step 2) is computed as:

v_k = (x_k + s_k) / 2

where x_k is the image feature center point of the k-th category and s_k is the mapped feature vector of the k-th category.
The zero/few-shot machine learning system used by the method of this embodiment is configured as shown in fig. 2. The main structure of the network contains two data streams, an image encoding stream and a text mapping stream; an image sample and its text description pass through the two streams respectively to form feature vectors, and the two features are fused to form a category center point. The relation network measures the category center points of the multi-class support samples against the sample under test to form a similarity vector over all categories, which is used either to form the classification loss for training or for online classification of the sample under test.
The image convolutional encoder is used to characterize and encode image data. As shown in fig. 3, the image convolutional encoder in this embodiment is a convolutional neural network (CNN) encoder comprising 4 sequentially connected convolutional layers; each convolutional layer has 64 channels with 3×3 kernels followed by batch normalization and a ReLU nonlinearity, and 2×2 max pooling downsamples the feature map after each of the first two convolutional layers. In this embodiment, all input images are set to 84×84 pixel resolution and form 4096-dimensional image features after convolutional network encoding.
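A minimal PyTorch sketch of such an encoder follows; the padding choice is an assumption, and with "same" padding an 84×84 input yields 64×21×21 flattened features rather than 4096, so the exact reduction to 4096 dimensions is not specified by the patent and is omitted here:

```python
import torch
from torch import nn

class ImageEncoder(nn.Module):
    """Four 64-channel 3x3 conv layers with batch normalization and ReLU;
    2x2 max pooling after each of the first two layers (padding assumed)."""

    def __init__(self):
        super().__init__()
        def block(c_in: int, pool: bool) -> nn.Sequential:
            layers = [nn.Conv2d(c_in, 64, kernel_size=3, padding=1),
                      nn.BatchNorm2d(64), nn.ReLU()]
            if pool:
                layers.append(nn.MaxPool2d(2))
            return nn.Sequential(*layers)
        self.features = nn.Sequential(block(3, True), block(64, True),
                                      block(64, False), block(64, False))

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        # img: (B, 3, 84, 84) -> maps (B, 64, 21, 21) -> flat feature vectors
        return self.features(img).flatten(1)
```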
The text mapping encoder is used to characterize and encode the text description information. As shown in fig. 4, the text mapping encoder in this embodiment is a fully connected neural network model comprising an input layer, a hidden layer, and an output layer; the hidden layer uses ReLU as its nonlinearity and its dimension is set equal to the output dimension of the image convolutional encoder, while the output layer dimension matches the image feature vector dimension. The model uses a pruning (dropout) operation during training; in the testing stage, all fully connected parameters participate in the prediction of the sample to be identified. The input layer dimension depends on the type of text description information, such as a category name, a multidimensional category attribute vector, or a long text description. In this embodiment the hidden layer dimension is set to 4096, and to avoid parameter overfitting during model training, dropout with rate 0.5 is adopted.
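A sketch of this encoder follows; the parameter names are assumptions:

```python
import torch
from torch import nn

class TextMapper(nn.Module):
    """Fully connected text mapping encoder: text vector -> 4096-d ReLU hidden
    layer with dropout 0.5 -> output matching the image feature dimension."""

    def __init__(self, text_dim: int, feat_dim: int, hidden_dim: int = 4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(p=0.5),          # active in train(), disabled in eval()
            nn.Linear(hidden_dim, feat_dim),
        )

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        return self.net(u)              # mapped feature s = Ψ(u; θ_Mapping)
```

Note that nn.Dropout is automatically bypassed in eval() mode, matching the requirement that all fully connected parameters participate at test time.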
An image sample and its text description are encoded by the convolutional encoder and the mapping encoder to form, respectively, the image feature and the mapped feature of a sample of a given category. After the two features are weighted and averaged, they form the category's sample center-point feature vector. The relation metric network is the functional module that judges the similarity relation between the center-point feature vectors of a support sample and a sample under test. As shown in fig. 5, the relation metric network in this embodiment comprises 2 convolutional layers and 2 fully connected layers: it concatenates the two center-point feature vectors of the support sample and the sample under test, passes them sequentially through the 2 convolutional layers and the 2 fully connected layers, and finally outputs the similarity of the two center-point feature vectors. The convolutional layers are set to 64 channels with 3×3 kernels, and each convolution is followed by batch normalization, a ReLU nonlinearity, and a 2×2 max pooling dimension-reduction operation.
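For illustration, a PyTorch-style sketch of such a relation metric network follows. The patent fixes only the layer counts and the convolution settings; reshaping the two flattened feature vectors back into 64-channel maps before the convolutions and the width of the first fully connected layer are assumptions:

```python
import torch
from torch import nn

class RelationNet(nn.Module):
    """2 conv + 2 fully connected relation metric over a concatenated pair."""

    def __init__(self, map_size: int = 21, hidden: int = 8):
        super().__init__()
        def block(c_in: int) -> nn.Sequential:
            return nn.Sequential(nn.Conv2d(c_in, 64, kernel_size=3, padding=1),
                                 nn.BatchNorm2d(64), nn.ReLU(), nn.MaxPool2d(2))
        self.map_size = map_size
        self.conv = nn.Sequential(block(128), block(64))   # concat doubles channels
        side = map_size // 4                                # two 2x2 poolings
        self.fc = nn.Sequential(nn.Linear(64 * side * side, hidden), nn.ReLU(),
                                nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # a, b: (B, 64 * map_size**2) flattened center-point feature vectors
        pair = torch.cat([a, b], dim=1).view(-1, 128, self.map_size, self.map_size)
        return self.fc(self.conv(pair).flatten(1)).squeeze(1)  # similarity in (0, 1)
```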
For zero/few-shot image classification, this embodiment adopts the meta-learning training strategy commonly used in the field, i.e., the M-way N-shot Q-query experimental setup. During training, model parameters are iteratively updated on batches of data, and each batch is selected as follows: M categories are randomly selected from the training set D_s; for each category, N samples are randomly selected to form the meta-training set D_s^train, and another Q samples are randomly selected to form the meta-test set D_s^val; the batch (D_s^train, D_s^val) is then used to update the model parameters. In the testing stage, M categories are randomly selected from the test set D_t, N samples of each category are randomly selected as the support set D_t^support, and further samples are randomly selected as the test set D_t^query; the samples in D_t^query are classified and their predicted category labels are output. A sketch of this episodic sampling follows.
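This sketch assumes the dataset is organized as a dict mapping each class label to a list of sample ids; that layout, and the function name, are assumptions for illustration:

```python
import random

def sample_episode(samples_by_class: dict, M: int = 5, N: int = 1, Q: int = 15):
    """Draw one M-way N-shot Q-query episode.

    samples_by_class: class label -> list of sample ids (assumed layout)
    returns: (support, query) lists of (sample_id, episode_label) pairs
    """
    classes = random.sample(sorted(samples_by_class), M)   # M random categories
    support, query = [], []
    for episode_label, cls in enumerate(classes):
        picks = random.sample(samples_by_class[cls], N + Q)
        support += [(sid, episode_label) for sid in picks[:N]]  # meta-training set
        query += [(sid, episode_label) for sid in picks[N:]]    # meta-test set
    return support, query
```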
As shown in fig. 6, step 1) in this embodiment is preceded by the following steps of training the image convolutional encoder, the text mapping encoder, and the relation metric network:
S1) Sample a batch training set (D_s^train, D_s^val) from the training set D_s in M-way N-shot Q-query fashion. For each image sample I_i^s therein, compute the image feature vector x_i^s = Φ(I_i^s; θ_CNN) with the image convolutional encoder, where θ_CNN denotes the image convolutional encoder parameters; for the category text description vector u_i^s corresponding to image sample I_i^s, compute its mapped feature vector s_i^s = Ψ(u_i^s; θ_Mapping) with the text mapping encoder, where θ_Mapping denotes the text mapping encoder parameters.
S2) For the k-th (k = 1, 2, …, M) category image feature vectors x_{k,1}, x_{k,2}, …, x_{k,N} in the meta-training set D_s^train, take their mean as the category image feature center point x_k; then, based on x_k and the mapped feature vector s_k of the category, compute the category feature characterization vector v_k. Each category contains N image samples, but samples of the same category share the same category description vector, which the text mapping encoder encodes into the mapped feature vector of that category.
S3) For any sample of the meta-test set D_s^val = {(x_i^s, s_i^s), i = 1, 2, …, M*Q}, compute through the relation network the similarity between the image feature x_i^s and each category feature characterization vector v_k:

Re(x_i^s, v_k) = φ(x_i^s, v_k; θ_Relation), k = 1, 2, …, M

where θ_Relation denotes the relation metric network parameters, M is the number of categories per batch, and Q is the number of samples per category in the meta-test set D_s^val of each batch of data in the training phase.
S4) Compute the forward error L = L_1 + λL_2 generated by each meta-test sample x_i^s, where L_1 = MSE(Re(x_i^s, v_k), l(x_i^s, v_k)) is the classification loss, the label l(x_i^s, v_k) ∈ {0, 1} indicating whether the two samples x_i^s and v_k belong to the same category, and L_2 = MSE(x_i^s, s_i^s) is the matching loss representing the error between the image feature space and the semantic mapping space; MSE is a mean squared error function. On this basis, compute the accumulated error over the M*Q meta-test samples as the error of the training batch, and update the image convolutional encoder parameters θ_CNN, the text mapping encoder parameters θ_Mapping, and the relation metric network parameters θ_Relation by error backpropagation and gradient descent.
S5) Judge whether a preset ending condition is reached. If so, end the training, record the finally obtained image convolutional encoder parameters θ_CNN, text mapping encoder parameters θ_Mapping, and relation metric network parameters θ_Relation, and exit; otherwise jump to step S1) and continue training.
The preset ending condition in step S5) may be that the number of iterations reaches a preset upper limit or that the classification loss reaches an error loss lower limit.
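For illustration, the combined loss of step S4) can be sketched as follows; batching the relation outputs into an (M*Q, M) matrix and the value of λ are assumptions (the patent does not specify λ):

```python
import torch
import torch.nn.functional as F

def episode_loss(scores, same_class, x_img, s_txt, lam=1.0):
    """Forward error L = L1 + λ·L2 accumulated over the M*Q meta-test samples.

    scores:     (M*Q, M) relation outputs Re(x_i^s, v_k)
    same_class: (M*Q, M) float 0/1 labels l(x_i^s, v_k)
    x_img:      (M*Q, d) image features of the meta-test samples
    s_txt:      (M*Q, d) mapped text features of their categories
    """
    L1 = F.mse_loss(scores, same_class)  # classification loss
    L2 = F.mse_loss(x_img, s_txt)        # matching loss between the feature spaces
    return L1 + lam * L2
```

Calling loss.backward() followed by an optimizer step over the three modules' parameters then realizes the joint update of θ_CNN, θ_Mapping, and θ_Relation by backpropagation and gradient descent.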
For zero/few-shot image classification, the present example employs AI-competition datasets with real-world backgrounds, including the ZheJiangLab Cup (ZLC) zero-shot image classification dataset released by Zhejiang Lab in 2018 and the Large-scale Attribute Dataset (LAD) used in the zero-shot learning track of the 2018 global AI Challenger competition. The ZLC dataset comprises 4 independent parts, A, B, C, and D; the LAD dataset contains 5 broad subsets, namely animals, fruits, vehicles, electronics, and hairstyles. Specific information on these datasets is given in table 1.
Table 1. Zero/few-shot image classification datasets.

[Table 1 appears as an image in the original publication; it lists the statistics of the ZLC parts A, B, C, D and of the LAD subsets (animals, fruits, vehicles, electronics, hairstyles).]
In this example, a 5-way 1-shot 15-query experimental setting is selected. In each experiment, the recognition accuracy is computed as the ratio of correctly recognized samples to the total number of samples, and the mean accuracy over 600 experiments on each dataset is taken as the final evaluation metric. As comparison experiments, this example selects methods under different architectures as baselines, including the few-shot learning model Relation Network proposed by Sung et al. in 2018 and the zero-shot learning Deep Embedding Model (DEM) proposed by Zhang et al. in 2017. All compared methods were run under the same experimental data and settings, and the results are shown in table 2 and fig. 7.
Table 2. Recognition accuracy (%) of the few-shot image classification example tests.

Method                      ZLC_A  ZLC_B  ZLC_C  ZLC_D  Animals  Electronics  Fruits  Hairstyles  Vehicles
Relation Network            49.68  49.92  52.33  48.48  47.94    45.85        49.62   33.88       54.73
Deep Embedding Model (DEM)  26.97  29.92  30.20  30.34  38.58    31.73        34.11   27.98       43.23
Present method              50.76  50.42  56.03  52.21  50.21    46.03        51.14   34.16       58.87
As can be seen from table 2 and fig. 7, compared with the Relation Network and the Deep Embedding Model, the method of this embodiment achieves the highest recognition accuracy on all experimental datasets, indicating that it can effectively classify and recognize new categories given limited samples and their textual attribute descriptions, and that it has good learning and generalization ability.
In summary, zero/few-shot learning is a weakly supervised machine learning approach for classifying and recognizing images of new category concepts given only a few labeled samples or text descriptions. The text description is typically a description of various attributes of the image category, but may simply be the category name. The method comprises 3 functional modules: an image convolutional encoder, a text mapping encoder, and a relation metric network. An image and its corresponding text description are encoded into the image feature space by the image convolutional encoder and the text mapping encoder respectively, forming the image features and their mapped features; the two features are fused to form the category center points, and finally the relation metric network performs learning and classification of images under zero/few-shot conditions. Aiming at image recognition with few or even no samples, the embodiment of the invention provides a zero/few-shot learning method that incorporates a matching loss function. Current large-scale image classification and recognition models rely on image data alone and do not consider the semantic information contained in text descriptions and category names; the embodiment of the invention fuses images with their text description information and proposes a parallel network architecture containing image encoding and text mapping encoding to solve the image recognition problem with few or even no samples. Meanwhile, the method integrates the matching loss between image and text mapping features into the model training process, which reduces the semantic-gap problem between the image feature space and the text feature space, accelerates model convergence, and improves training efficiency. In zero/few-shot image classification experiments, the method achieves good recognition results; in zero/few-shot learning experiments carried out on large-scale image datasets against several comparison methods, it obtains recognition accuracy superior to typical current algorithms, showing that the learning system has good learning and generalization ability.
In addition, this embodiment also provides a zero/few-shot machine learning device fusing images and their text descriptions, comprising a microprocessor and a memory connected with each other, wherein the microprocessor is programmed or configured to execute the steps of the aforementioned zero/few-shot machine learning method fusing images and their text descriptions.
Furthermore, this embodiment also provides a computer-readable storage medium having stored therein a computer program programmed or configured to execute the aforementioned zero/few-shot machine learning method fusing images and their text descriptions.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, and optical storage) having computer-usable program code embodied therein. The present application is described with reference to flowcharts and/or block diagrams of methods, apparatus (systems), and computer program products according to its embodiments; computer program instructions may be provided to a processor of a computer or other programmable data processing apparatus such that the instructions executed by the processor create means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the functions specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process, such that the instructions executed on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description covers only preferred embodiments of the present invention, and the protection scope of the present invention is not limited to the above embodiments; all technical solutions falling under the idea of the present invention belong to its protection scope. It should be noted that modifications and refinements that would occur to those skilled in the art without departing from the principle of the present invention are also considered to be within its protection scope.

Claims (7)

1. A zero/few-shot machine learning method fusing images and their text descriptions, characterized by comprising the following steps:
1) for each image sample and its category text description vector (I_i^t, u_i^t) in the support set D_t^support: computing the image feature vector x_i^t = Φ(I_i^t; θ*_CNN) of the image sample I_i^t based on a trained image convolutional encoder, where θ*_CNN denotes the trained image convolutional encoder parameters; computing the mapped feature vector s_i^t = Ψ(u_i^t; θ*_Mapping) of the category text description vector u_i^t based on a trained text mapping encoder, where θ*_Mapping denotes the trained text mapping encoder parameters; the image convolutional encoder is a convolutional neural network encoder comprising 4 sequentially connected convolutional layers, each convolutional layer having 64 channels with 3×3 convolution kernels followed by batch normalization and a ReLU nonlinearity, and 2×2 max pooling is used to downsample the feature map after each of the first two convolutional layers; the text mapping encoder is a fully connected neural network model comprising an input layer, a hidden layer and an output layer, the hidden layer using ReLU as its nonlinearity with its dimension set equal to the output dimension of the image convolutional encoder, and the dimension of the output layer being consistent with the dimension of the image feature vector; in the testing stage, all fully connected parameters participate in the prediction of the sample to be identified;
2) for the k-th category image feature vectors x_{k,1}, x_{k,2}, …, x_{k,N} in the support set D_t^support, taking their mean as the category image feature center point x_k; computing the category feature characterization vector v_k based on the category image feature center point x_k and the mapped feature vector s_k of the category, where k = 1, 2, …, M, M is the total number of categories in the support set D_t^support, and N is the number of samples per category in D_t^support;
3) for a sample to be identified I^query in the test set D_t^query, first computing its image feature vector x^query = Φ(I^query; θ*_CNN) based on the trained image convolutional encoder; then computing, based on a trained relation metric network, the similarity Re(x^query, v_k) = φ(x^query, v_k; θ*_Relation) between the image feature vector x^query and each category feature characterization vector v_k in the support sample set, where θ*_Relation denotes the trained relation metric network parameters and k = 1, 2, …, M; the relation metric network comprises 2 convolutional layers and 2 fully connected layers, and is configured to concatenate the two center-point feature vectors of a support sample and a sample under test, pass them sequentially through the 2 convolutional layers and the 2 fully connected layers, and finally output the similarity of the two center-point feature vectors, the convolutional layers being set to 64 channels with 3×3 convolution kernels, each convolution being followed by batch normalization, a ReLU nonlinearity and a 2×2 max pooling dimension-reduction operation;
4) among the M categories of the support set D_t^support, selecting the category with the largest similarity Re(x^query, v_k) as the predicted category label y^query of the sample to be identified.
2. The zero/few-shot machine learning method fusing images and their text descriptions according to claim 1, characterized in that the category image feature center point x_k in step 2) is computed as:

x_k = (1/N) Σ_{i=1}^{N} x_{k,i}

where x_{k,i} is the i-th image feature vector of the k-th category and N is the number of samples per category in the support set D_t^support.
3. The zero/few-shot machine learning method fusing images and their text descriptions according to claim 1, characterized in that the category feature characterization vector v_k in step 2) is computed as:

v_k = (x_k + s_k) / 2

where x_k is the image feature center point of the k-th category and s_k is the mapped feature vector of the k-th category.
4. The zero/few-shot machine learning method fusing images and their text descriptions according to claim 1, characterized in that step 1) is preceded by the step of training the image convolutional encoder, the text mapping encoder and the relation metric network:
S1) sampling a batch training set (D_s^train, D_s^val) from the training set D_s in M-way N-shot Q-query fashion; for each image sample I_i^s therein, computing the image feature vector x_i^s = Φ(I_i^s; θ_CNN) with the image convolutional encoder, where θ_CNN denotes the image convolutional encoder parameters; for the category text description vector u_i^s corresponding to image sample I_i^s, computing its mapped feature vector s_i^s = Ψ(u_i^s; θ_Mapping) with the text mapping encoder, where θ_Mapping denotes the text mapping encoder parameters;
S2) for the k-th category image feature vectors x_{k,1}, x_{k,2}, …, x_{k,N} in the meta-training set D_s^train, taking their mean as the category image feature center point x_k, where k = 1, 2, …, M; then computing the category feature characterization vector v_k based on the category image feature center point x_k and the mapped feature vector s_k of the category, where each category contains N image samples but samples of the same category correspond to the same category description vector, which the text mapping encoder encodes into the mapped feature vector of that category;
S3) for any sample of the meta-test set D_s^val = {(x_i^s, s_i^s), i = 1, 2, …, M*Q}, computing through the relation network the similarity Re(x_i^s, v_k) = φ(x_i^s, v_k; θ_Relation) between the image feature x_i^s and the category feature characterization vector v_k, where θ_Relation denotes the relation metric network parameters, k = 1, 2, …, M, M is the total number of categories, and Q is the number of samples per category in the meta-test set D_s^val of each batch of data in the training phase;
S4) computing the forward error L = L_1 + λL_2 generated by each meta-test sample x_i^s, where L_1 = MSE(Re(x_i^s, v_k), l(x_i^s, v_k)) is the classification loss, the label l(x_i^s, v_k) ∈ {0, 1} indicating whether the two samples x_i^s and v_k belong to the same category, and L_2 = MSE(x_i^s, s_i^s) is the matching loss representing the error between the image feature space and the semantic mapping space, MSE being a mean squared error function; on this basis, computing the accumulated error of the M*Q meta-test samples as the error of the training batch, and updating the image convolutional encoder parameters θ_CNN, the text mapping encoder parameters θ_Mapping and the relation metric network parameters θ_Relation by error backpropagation and gradient descent;
S5) judging whether a preset ending condition is reached; if so, ending the training, recording the finally obtained image convolutional encoder parameters θ_CNN, text mapping encoder parameters θ_Mapping and relation metric network parameters θ_Relation, and exiting; otherwise jumping to step S1) to continue training.
5. The zero/few-shot machine learning method fusing images and their text descriptions according to claim 4, characterized in that the preset ending condition in step S5) is that the number of iterations reaches a preset upper limit or the classification loss reaches an error loss lower limit.
6. A zero/few-shot machine learning device fusing images and their text descriptions, comprising a microprocessor and a memory connected with each other, characterized in that the microprocessor is programmed or configured to execute the steps of the zero/few-shot machine learning method fusing images and their text descriptions according to any one of claims 1-5.
7. A computer-readable storage medium having stored therein a computer program programmed or configured to execute the zero/few-shot machine learning method fusing images and their text descriptions according to any one of claims 1-5.
CN202110083109.6A (priority date 2021-01-21, filed 2021-01-21): Zero/few-shot machine learning method and system fusing images and their text descriptions. Status: Active. Granted as CN112801159B.

Priority Applications (1)

CN202110083109.6A (priority date 2021-01-21, filed 2021-01-21): Zero/few-shot machine learning method and system fusing images and their text descriptions

Publications (2)

CN112801159A (application publication): 2021-05-14
CN112801159B (granted publication): 2022-07-19

Family

ID=75811067

Family Applications (1)

CN202110083109.6A (CN, Active; priority date 2021-01-21, filed 2021-01-21): Zero/few-shot machine learning method and system fusing images and their text descriptions

Families Citing this family (2)

CN113377990B (priority 2021-06-09, published 2022-06-14, University of Electronic Science and Technology of China): Video/picture-text cross-modal matching training method based on meta-self-learning
CN114898193A (priority 2022-07-11, published 2022-08-12, Zhejiang Lab): Manifold-learning-based image feature fusion method and device and image classification system

Citations (3)

CN110232341A (priority 2019-05-30, published 2019-09-13, Chongqing University of Posts and Telecommunications): Semi-supervised learning image recognition method based on a convolution-stacked denoising coding network
CN110363239A (priority 2019-07-04, published 2019-10-22, National University of Defense Technology): Few-shot machine learning method, system and medium oriented to multi-modal data
CN111291212A (priority 2020-01-24, published 2020-06-16, Fudan University): Zero-shot sketch image retrieval method and system based on graph convolutional neural networks

Family Cites Families (1)

US8768050B2 (priority 2011-06-13, published 2014-07-01, Microsoft Corporation): Accurate text classification through selective use of image data

Non-Patent Citations (2)

Chongyu Pan et al., "Towards zero-shot learning generalization via a cosine distance loss", Neurocomputing, 2020-03-14, pp. 167-176.
Pan Chongyu (潘崇煜), "A survey of weakly supervised learning methods fusing zero-shot learning and few-shot learning" (融合零样本学习和小样本学习的弱监督学习方法综述), Systems Engineering and Electronics (系统工程与电子技术), October 2020, pp. 2246-2256.

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant