CN112801159B - Zero/few-shot machine learning method and system fusing images and their text descriptions


Info

Publication number
CN112801159B
CN112801159B (granted from application CN202110083109.6A; earlier application publication CN112801159A)
Authority
CN
China
Prior art keywords
image
mapping
category
text
encoder
Prior art date
Legal status
Active
Application number
CN202110083109.6A
Other languages
Chinese (zh)
Other versions
CN112801159A (en)
Inventor
黄健
潘崇煜
刘权
郝建国
张中杰
龚建兴
Current Assignee
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date: 2021-01-21
Filing date: 2021-01-21
Publication date: 2022-07-19
Application filed by National University of Defense Technology on 2021-01-21
Priority to CN202110083109.6A
Publication of CN112801159A: 2021-05-14
Application granted; publication of CN112801159B: 2022-07-19
Legal status: Active

Classifications

    • G06F18/253 Fusion techniques of extracted features (pattern recognition)
    • G06F18/214 Generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/24 Classification techniques
    • G06N3/04 Neural network architecture, e.g. interconnection topology
    • G06N3/084 Learning by backpropagation, e.g. using gradient descent

Abstract

The invention discloses a zero/few-shot machine learning method and system that fuse images with their text descriptions, comprising an image convolutional encoder, a text mapping encoder, and a relation metric network. Starting from the fusion of image features with mapped text features, the invention can rapidly recognize and classify data of new categories given only a few labeled samples or only a category text description. It performs well in zero/few-shot learning experiments, with high recognition accuracy and good learning and generalization ability.

Description

Zero/few-shot machine learning method and system fusing images and their text descriptions
Technical Field
The invention relates to weakly supervised machine learning under small-sample conditions in the field of artificial intelligence, and in particular to a zero/few-shot machine learning method and system that fuse images with their text descriptions.
Background
In recent years, with the continuous development of deep learning, supervised learning trained on large-scale labeled data has achieved remarkable results. In fields such as economics, defense, and medicine, however, large-scale labeled data are hard to obtain: manual annotation is time-consuming and labor-intensive, and in many cases large amounts of labeled data simply do not exist. At the same time, for a new concept for which abundant labeled data cannot be collected, some prior descriptive information can often be obtained in advance, such as attribute information of the new category, a text description, or even a category name carrying semantic information. Classifying and recognizing images of new category concepts given only a few labeled samples or a textual feature description therefore becomes an effective way to bring deep learning into practical applications.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: aiming at the above problems of the prior art, the invention provides a zero/few-shot machine learning method and system that fuse images with their text descriptions. Starting from the fusion of image features and mapped text features, and targeting the typical application of image recognition and classification given only a few labeled samples or a text description, the invention adopts image convolutional encoding, text mapping encoding, and relation-network learning. It can rapidly recognize and classify data of new categories given only a few labeled samples or only a category text description, and exhibits good performance, high recognition accuracy, and strong learning and generalization ability.
In order to solve the technical problems, the invention adopts the following technical scheme:
A zero/few-shot machine learning method fusing images and their text descriptions comprises the following steps:
1) For each image sample and its category text description vector (I_i^t, u_i^t) in the support set D_t^support: compute the image feature vector x_i^t = Φ(I_i^t; θ*_CNN) of the image sample I_i^t with the trained image convolutional encoder, where θ*_CNN denotes the trained image convolutional encoder parameters; compute the mapped feature vector s_i^t = Ψ(u_i^t; θ*_Mapping) of the category text description vector u_i^t with the trained text mapping encoder, where θ*_Mapping denotes the trained text mapping encoder parameters.
2) For the k-th category image feature vectors x_{k,1}, x_{k,2}, …, x_{k,N} in the support set D_t^support, take their mean as the category image feature center point x_k; based on the category image feature center point x_k and the mapped feature vector s_k of the category, compute the category feature characterization vector v_k, where k = 1, 2, …, M, M is the total number of categories in the support set D_t^support, and N is the number of samples per category in D_t^support.
3) For a sample to be identified I^query in the test set D_t^query, first compute its image feature vector x^query = Φ(I^query; θ*_CNN) with the trained image convolutional encoder; then compute, with the trained relation metric network, the similarity between the image feature vector x^query and each category feature characterization vector v_k in the support sample set:

Re(x^query, v_k) = φ(x^query, v_k; θ*_Relation), k = 1, 2, …, M

where θ*_Relation denotes the trained relation metric network parameters.
4) Among the M categories of the support set D_t^support, select the category with the largest similarity Re(x^query, v_k) as the predicted category label y^query of the sample to be identified.
Optionally, the category image feature center point x_k in step 2) is computed as:

x_k = (1/N) Σ_{i=1}^{N} x_{k,i}

where x_{k,i} is the i-th image feature vector of the k-th category and N is the number of samples per category in the support set D_t^support.
Optionally, the category feature characterization vector v_k in step 2) is computed as:

v_k = (x_k + s_k) / 2

where x_k is the image feature center point of the k-th category and s_k is the mapped feature vector of the k-th category.
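By way of illustration only (this sketch does not form part of the claimed method), the computations of step 2) can be written in PyTorch-style Python. The tensor layout and the equal-weight fusion are assumptions consistent with the formulas above:

```python
import torch

def class_prototypes(support_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
    """Fuse per-class image feature centers with mapped text features.

    support_feats: (M, N, d) image features x_{k,i} from the convolutional encoder
    text_feats:    (M, d)    mapped text features s_k from the text mapping encoder
    returns:       (M, d)    category feature characterization vectors v_k
    """
    x_k = support_feats.mean(dim=1)   # category image feature center points x_k
    return 0.5 * (x_k + text_feats)   # v_k = (x_k + s_k) / 2 (equal weights assumed)
```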
Optionally, the image convolutional encoder is a convolutional neural network encoder comprising 4 sequentially connected convolutional layers; each convolutional layer has 64 channels with 3×3 convolution kernels followed by batch normalization and a ReLU nonlinearity, and 2×2 max pooling is used to downsample the feature map after each of the first two convolutional layers.
Optionally, the text mapping encoder is a fully connected neural network model comprising an input layer, a hidden layer, and an output layer; the hidden layer uses ReLU as its nonlinearity and its dimension is set equal to the output dimension of the image convolutional encoder, while the output layer dimension is kept consistent with the image feature vector dimension. The fully connected model uses a pruning (dropout) operation during training; in the testing stage, all fully connected parameters participate in the prediction of the sample to be identified.
Optionally, the relation metric network comprises 2 convolutional layers and 2 fully connected layers. It concatenates the two center-point feature vectors of a support sample and a sample under test, passes them sequentially through the 2 convolutional layers and the 2 fully connected layers, and finally outputs the similarity of the two center-point feature vectors. The convolutional layers are set to 64 channels with 3×3 kernels, and each convolution is followed by batch normalization, a ReLU nonlinearity, and a 2×2 max pooling dimension-reduction operation.
Optionally, step 1) is preceded by the steps of training the image convolutional encoder, the text mapping encoder, and the relation metric network:
S1) Sample a batch training set (D_s^train, D_s^val) from the training set D_s in M-way N-shot Q-query fashion. For each image sample I_i^s therein, compute the image feature vector x_i^s = Φ(I_i^s; θ_CNN) with the image convolutional encoder, where θ_CNN denotes the image convolutional encoder parameters; for the category text description vector u_i^s corresponding to image sample I_i^s, compute its mapped feature vector s_i^s = Ψ(u_i^s; θ_Mapping) with the text mapping encoder, where θ_Mapping denotes the text mapping encoder parameters.
S2) For the k-th (k = 1, 2, …, M) category image feature vectors x_{k,1}, x_{k,2}, …, x_{k,N} in the meta-training set D_s^train, take their mean as the category image feature center point x_k; then, based on x_k and the mapped feature vector s_k of the category, compute the category feature characterization vector v_k. Each category contains N image samples, but samples of the same category share the same category description vector, which the text mapping encoder encodes into the mapped feature vector of that category.
S3) For any sample of the meta-test set D_s^val = {(x_i^s, s_i^s), i = 1, 2, …, M*Q}, compute through the relation network the similarity between the image feature x_i^s and each category feature characterization vector v_k:

Re(x_i^s, v_k) = φ(x_i^s, v_k; θ_Relation), k = 1, 2, …, M

where θ_Relation denotes the relation metric network parameters, M is the number of categories per batch, and Q is the number of samples per category in the meta-test set D_s^val of each batch of data in the training phase.
S4) Compute the forward error L = L_1 + λL_2 generated by each meta-test sample x_i^s, where L_1 = MSE(Re(x_i^s, v_k), l(x_i^s, v_k)) is the classification loss, the label l(x_i^s, v_k) ∈ {0, 1} indicating whether the two samples x_i^s and v_k belong to the same category, and L_2 = MSE(x_i^s, s_i^s) is the matching loss representing the error between the image feature space and the semantic mapping space; MSE is a mean squared error function. On this basis, compute the accumulated error over the M*Q meta-test samples as the error of the training batch, and update the image convolutional encoder parameters θ_CNN, the text mapping encoder parameters θ_Mapping, and the relation metric network parameters θ_Relation by error backpropagation and gradient descent.
S5) Judge whether a preset ending condition is reached. If so, end the training, record the finally obtained image convolutional encoder parameters θ_CNN, text mapping encoder parameters θ_Mapping, and relation metric network parameters θ_Relation, and exit; otherwise jump to step S1) and continue training.
Optionally, the preset ending condition in step S5) is that the number of iterations reaches a preset upper limit or the classification loss reaches an error loss lower limit.
In addition, the invention also provides a zero/few-shot machine learning device fusing images and their text descriptions, comprising a microprocessor and a memory connected with each other, wherein the microprocessor is programmed or configured to execute the steps of the aforementioned zero/few-shot machine learning method fusing images and their text descriptions.
Additionally, the present invention also provides a computer-readable storage medium having stored therein a computer program programmed or configured to execute the aforementioned zero/few-shot machine learning method fusing images and their text descriptions.
Compared with the prior art, the invention has the following advantages: oriented to zero/few-shot machine learning that fuses images with their text descriptions, the method starts from the fusion of image features and mapped text features and, targeting the typical application of image recognition and classification given only a few labeled samples or a text description, adopts image convolutional encoding, text mapping encoding, and relation-network learning. It can rapidly recognize and classify data of new categories given only a few labeled samples or only a category text description, showing good performance in zero/few-shot learning experiments, high recognition accuracy, and strong learning and generalization ability.
Drawings
FIG. 1 is a schematic diagram of a test/application flow of a method according to an embodiment of the present invention.
Fig. 2 is a schematic flow chart of a network architecture according to an embodiment of the present invention.
FIG. 3 is a schematic structural diagram of an image convolution encoder according to an embodiment of the present invention.
FIG. 4 is a block diagram of a text mapping encoder according to an embodiment of the present invention.
Fig. 5 is a schematic structural diagram of a relationship metric network according to an embodiment of the present invention.
Fig. 6 is a schematic diagram of a network training process in the embodiment of the present invention.
FIG. 7 is a diagram illustrating test results in an embodiment of the present invention.
Detailed Description
As shown in fig. 1 and fig. 2, the zero/few-shot machine learning method fusing images and their text descriptions in this embodiment comprises:
1) For each image sample and its category text description vector (I_i^t, u_i^t) in the support set D_t^support: compute the image feature vector x_i^t = Φ(I_i^t; θ*_CNN) of the image sample I_i^t with the trained image convolutional encoder, where θ*_CNN denotes the trained image convolutional encoder parameters; compute the mapped feature vector s_i^t = Ψ(u_i^t; θ*_Mapping) of the category text description vector u_i^t with the trained text mapping encoder, where θ*_Mapping denotes the trained text mapping encoder parameters.
2) For the k-th category image feature vectors x_{k,1}, x_{k,2}, …, x_{k,N} in the support set D_t^support, take their mean as the category image feature center point x_k; based on x_k and the mapped feature vector s_k of the category, compute the category feature characterization vector v_k, where k = 1, 2, …, M, M is the total number of categories in D_t^support, and N is the number of samples per category in D_t^support.
3) For a sample to be identified I^query in the test set D_t^query, first compute its image feature vector x^query = Φ(I^query; θ*_CNN) with the trained image convolutional encoder; then compute, with the trained relation metric network, the similarity between x^query and each category feature characterization vector v_k in the support sample set:

Re(x^query, v_k) = φ(x^query, v_k; θ*_Relation), k = 1, 2, …, M

where θ*_Relation denotes the trained relation metric network parameters.
4) Among the M categories of the support set D_t^support, select the category with the largest similarity Re(x^query, v_k) as the predicted category label y^query of the sample to be identified, which can be expressed as:

y^query = argmax_{k=1,…,M} Re(x^query, v_k)

thus realizing image recognition of the sample to be identified.
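For illustration, steps 1) to 4) of this embodiment can be sketched as follows in PyTorch-style Python. The module interfaces (cnn, mapper, relation), the tensor shapes, and the equal-weight fusion are assumptions, not a verbatim implementation of the patent:

```python
import torch

def predict_query(query_img, support_imgs, class_texts, cnn, mapper, relation):
    """Predict the class of one query image from an M-way N-shot support set.

    query_img:    (3, 84, 84)       sample to be identified I^query
    support_imgs: (M, N, 3, 84, 84) labeled support images
    class_texts:  (M, t)            category text description vectors
    cnn, mapper:  trained encoders Φ and Ψ; relation: trained metric φ
    """
    M, N = support_imgs.shape[:2]
    with torch.no_grad():
        x = cnn(support_imgs.flatten(0, 1)).view(M, N, -1)  # features x_{k,i}
        v = 0.5 * (x.mean(dim=1) + mapper(class_texts))     # prototypes v_k
        x_q = cnn(query_img.unsqueeze(0))                   # x^query, shape (1, d)
        scores = relation(x_q.expand(M, -1), v)             # Re(x^query, v_k)
    return int(scores.argmax())                             # predicted label y^query
```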
In this embodiment, the category image feature center point x_k in step 2) is computed as:

x_k = (1/N) Σ_{i=1}^{N} x_{k,i}

where x_{k,i} is the i-th image feature vector of the k-th category and N is the number of samples per category in the support set D_t^support.
In this embodiment, the category feature characterization vector v_k in step 2) is computed as:

v_k = (x_k + s_k) / 2

where x_k is the image feature center point of the k-th category and s_k is the mapped feature vector of the k-th category.
The zero/few-shot machine learning system used by the method of this embodiment is configured as shown in fig. 2. The main structure of the network contains two data streams, an image encoding stream and a text mapping stream; an image sample and its text description pass through the two streams respectively to form feature vectors, and the two features are fused to form a category center point. The relation network measures the category center points of the multi-class support samples against the sample under test to form a similarity vector over all categories, which is used either to form the classification loss for training or for online classification of the sample under test.
The image convolutional encoder is used to characterize and encode image data. As shown in fig. 3, the image convolutional encoder in this embodiment is a convolutional neural network (CNN) encoder comprising 4 sequentially connected convolutional layers; each convolutional layer has 64 channels with 3×3 kernels followed by batch normalization and a ReLU nonlinearity, and 2×2 max pooling downsamples the feature map after each of the first two convolutional layers. In this embodiment, all input images are set to 84×84 pixel resolution and form 4096-dimensional image features after convolutional network encoding.
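A minimal PyTorch sketch of such an encoder follows; the padding choice is an assumption, and with "same" padding an 84×84 input yields 64×21×21 flattened features rather than 4096, so the exact reduction to 4096 dimensions is not specified by the patent and is omitted here:

```python
import torch
from torch import nn

class ImageEncoder(nn.Module):
    """Four 64-channel 3x3 conv layers with batch normalization and ReLU;
    2x2 max pooling after each of the first two layers (padding assumed)."""

    def __init__(self):
        super().__init__()
        def block(c_in: int, pool: bool) -> nn.Sequential:
            layers = [nn.Conv2d(c_in, 64, kernel_size=3, padding=1),
                      nn.BatchNorm2d(64), nn.ReLU()]
            if pool:
                layers.append(nn.MaxPool2d(2))
            return nn.Sequential(*layers)
        self.features = nn.Sequential(block(3, True), block(64, True),
                                      block(64, False), block(64, False))

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        # img: (B, 3, 84, 84) -> maps (B, 64, 21, 21) -> flat feature vectors
        return self.features(img).flatten(1)
```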
The text mapping encoder is used to characterize and encode the text description information. As shown in fig. 4, the text mapping encoder in this embodiment is a fully connected neural network model comprising an input layer, a hidden layer, and an output layer; the hidden layer uses ReLU as its nonlinearity and its dimension is set equal to the output dimension of the image convolutional encoder, while the output layer dimension matches the image feature vector dimension. The model uses a pruning (dropout) operation during training; in the testing stage, all fully connected parameters participate in the prediction of the sample to be identified. The input layer dimension depends on the type of text description information, such as a category name, a multidimensional category attribute vector, or a long text description. In this embodiment the hidden layer dimension is set to 4096, and to avoid parameter overfitting during model training, dropout with rate 0.5 is adopted.
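A sketch of this encoder follows; the parameter names are assumptions:

```python
import torch
from torch import nn

class TextMapper(nn.Module):
    """Fully connected text mapping encoder: text vector -> 4096-d ReLU hidden
    layer with dropout 0.5 -> output matching the image feature dimension."""

    def __init__(self, text_dim: int, feat_dim: int, hidden_dim: int = 4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(p=0.5),          # active in train(), disabled in eval()
            nn.Linear(hidden_dim, feat_dim),
        )

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        return self.net(u)              # mapped feature s = Ψ(u; θ_Mapping)
```

Note that nn.Dropout is automatically bypassed in eval() mode, matching the requirement that all fully connected parameters participate at test time.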
An image sample and its text description are encoded by the convolutional encoder and the mapping encoder to form, respectively, the image feature and the mapped feature of a sample of a given category. After the two features are weighted and averaged, they form the category's sample center-point feature vector. The relation metric network is the functional module that judges the similarity relation between the center-point feature vectors of a support sample and a sample under test. As shown in fig. 5, the relation metric network in this embodiment comprises 2 convolutional layers and 2 fully connected layers: it concatenates the two center-point feature vectors of the support sample and the sample under test, passes them sequentially through the 2 convolutional layers and the 2 fully connected layers, and finally outputs the similarity of the two center-point feature vectors. The convolutional layers are set to 64 channels with 3×3 kernels, and each convolution is followed by batch normalization, a ReLU nonlinearity, and a 2×2 max pooling dimension-reduction operation.
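For illustration, a PyTorch-style sketch of such a relation metric network follows. The patent fixes only the layer counts and the convolution settings; reshaping the two flattened feature vectors back into 64-channel maps before the convolutions and the width of the first fully connected layer are assumptions:

```python
import torch
from torch import nn

class RelationNet(nn.Module):
    """2 conv + 2 fully connected relation metric over a concatenated pair."""

    def __init__(self, map_size: int = 21, hidden: int = 8):
        super().__init__()
        def block(c_in: int) -> nn.Sequential:
            return nn.Sequential(nn.Conv2d(c_in, 64, kernel_size=3, padding=1),
                                 nn.BatchNorm2d(64), nn.ReLU(), nn.MaxPool2d(2))
        self.map_size = map_size
        self.conv = nn.Sequential(block(128), block(64))   # concat doubles channels
        side = map_size // 4                                # two 2x2 poolings
        self.fc = nn.Sequential(nn.Linear(64 * side * side, hidden), nn.ReLU(),
                                nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # a, b: (B, 64 * map_size**2) flattened center-point feature vectors
        pair = torch.cat([a, b], dim=1).view(-1, 128, self.map_size, self.map_size)
        return self.fc(self.conv(pair).flatten(1)).squeeze(1)  # similarity in (0, 1)
```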
For zero/few-shot image classification, this embodiment adopts the meta-learning training strategy commonly used in the field, i.e., the M-way N-shot Q-query experimental setup. During training, model parameters are iteratively updated on batches of data, and each batch is selected as follows: M categories are randomly selected from the training set D_s; for each category, N samples are randomly selected to form the meta-training set D_s^train, and another Q samples are randomly selected to form the meta-test set D_s^val; the batch (D_s^train, D_s^val) is then used to update the model parameters. In the testing stage, M categories are randomly selected from the test set D_t, N samples of each category are randomly selected as the support set D_t^support, and further samples are randomly selected as the test set D_t^query; the samples in D_t^query are classified and their predicted category labels are output. A sketch of this episodic sampling follows.
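This sketch assumes the dataset is organized as a dict mapping each class label to a list of sample ids; that layout, and the function name, are assumptions for illustration:

```python
import random

def sample_episode(samples_by_class: dict, M: int = 5, N: int = 1, Q: int = 15):
    """Draw one M-way N-shot Q-query episode.

    samples_by_class: class label -> list of sample ids (assumed layout)
    returns: (support, query) lists of (sample_id, episode_label) pairs
    """
    classes = random.sample(sorted(samples_by_class), M)   # M random categories
    support, query = [], []
    for episode_label, cls in enumerate(classes):
        picks = random.sample(samples_by_class[cls], N + Q)
        support += [(sid, episode_label) for sid in picks[:N]]  # meta-training set
        query += [(sid, episode_label) for sid in picks[N:]]    # meta-test set
    return support, query
```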
As shown in fig. 6, step 1) in this embodiment is preceded by the following steps of training the image convolutional encoder, the text mapping encoder, and the relation metric network:
S1) Sample a batch training set (D_s^train, D_s^val) from the training set D_s in M-way N-shot Q-query fashion. For each image sample I_i^s therein, compute the image feature vector x_i^s = Φ(I_i^s; θ_CNN) with the image convolutional encoder, where θ_CNN denotes the image convolutional encoder parameters; for the category text description vector u_i^s corresponding to image sample I_i^s, compute its mapped feature vector s_i^s = Ψ(u_i^s; θ_Mapping) with the text mapping encoder, where θ_Mapping denotes the text mapping encoder parameters.
S2) For the k-th (k = 1, 2, …, M) category image feature vectors x_{k,1}, x_{k,2}, …, x_{k,N} in the meta-training set D_s^train, take their mean as the category image feature center point x_k; then, based on x_k and the mapped feature vector s_k of the category, compute the category feature characterization vector v_k. Each category contains N image samples, but samples of the same category share the same category description vector, which the text mapping encoder encodes into the mapped feature vector of that category.
S3) For any sample of the meta-test set D_s^val = {(x_i^s, s_i^s), i = 1, 2, …, M*Q}, compute through the relation network the similarity between the image feature x_i^s and each category feature characterization vector v_k:

Re(x_i^s, v_k) = φ(x_i^s, v_k; θ_Relation), k = 1, 2, …, M

where θ_Relation denotes the relation metric network parameters, M is the number of categories per batch, and Q is the number of samples per category in the meta-test set D_s^val of each batch of data in the training phase.
S4) Compute the forward error L = L_1 + λL_2 generated by each meta-test sample x_i^s, where L_1 = MSE(Re(x_i^s, v_k), l(x_i^s, v_k)) is the classification loss, the label l(x_i^s, v_k) ∈ {0, 1} indicating whether the two samples x_i^s and v_k belong to the same category, and L_2 = MSE(x_i^s, s_i^s) is the matching loss representing the error between the image feature space and the semantic mapping space; MSE is a mean squared error function. On this basis, compute the accumulated error over the M*Q meta-test samples as the error of the training batch, and update the image convolutional encoder parameters θ_CNN, the text mapping encoder parameters θ_Mapping, and the relation metric network parameters θ_Relation by error backpropagation and gradient descent.
S5) Judge whether a preset ending condition is reached. If so, end the training, record the finally obtained image convolutional encoder parameters θ_CNN, text mapping encoder parameters θ_Mapping, and relation metric network parameters θ_Relation, and exit; otherwise jump to step S1) and continue training.
The preset ending condition in step S5) may be that the number of iterations reaches a preset upper limit or that the classification loss reaches an error loss lower limit.
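For illustration, the combined loss of step S4) can be sketched as follows; batching the relation outputs into an (M*Q, M) matrix and the value of λ are assumptions (the patent does not specify λ):

```python
import torch
import torch.nn.functional as F

def episode_loss(scores, same_class, x_img, s_txt, lam=1.0):
    """Forward error L = L1 + λ·L2 accumulated over the M*Q meta-test samples.

    scores:     (M*Q, M) relation outputs Re(x_i^s, v_k)
    same_class: (M*Q, M) float 0/1 labels l(x_i^s, v_k)
    x_img:      (M*Q, d) image features of the meta-test samples
    s_txt:      (M*Q, d) mapped text features of their categories
    """
    L1 = F.mse_loss(scores, same_class)  # classification loss
    L2 = F.mse_loss(x_img, s_txt)        # matching loss between the feature spaces
    return L1 + lam * L2
```

Calling loss.backward() followed by an optimizer step over the three modules' parameters then realizes the joint update of θ_CNN, θ_Mapping, and θ_Relation by backpropagation and gradient descent.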
For zero/few-shot image classification, the present example employs AI-competition datasets with real-world backgrounds, including the ZheJiangLab Cup (ZLC) zero-shot image classification dataset released by Zhejiang Lab in 2018 and the Large-scale Attribute Dataset (LAD) used in the zero-shot learning track of the 2018 global AI Challenger competition. The ZLC dataset comprises 4 independent parts, A, B, C, and D; the LAD dataset contains 5 broad subsets, namely animals, fruits, vehicles, electronics, and hairstyles. Specific information on these datasets is given in table 1.
Table 1. Zero/few-shot image classification datasets.

[Table 1 appears as an image in the original publication; it lists the statistics of the ZLC parts A, B, C, D and of the LAD subsets (animals, fruits, vehicles, electronics, hairstyles).]
In this example, a 5-way 1-shot 15-query experimental setting is selected. In each experiment, the recognition accuracy is computed as the ratio of correctly recognized samples to the total number of samples, and the mean accuracy over 600 experiments on each dataset is taken as the final evaluation metric. As comparison experiments, this example selects methods under different architectures as baselines, including the few-shot learning model Relation Network proposed by Sung et al. in 2018 and the zero-shot learning Deep Embedding Model (DEM) proposed by Zhang et al. in 2017. All compared methods were run under the same experimental data and settings, and the results are shown in table 2 and fig. 7.
Table 2. Recognition accuracy (%) of the few-shot image classification example tests.

Method                      ZLC_A  ZLC_B  ZLC_C  ZLC_D  Animals  Electronics  Fruits  Hairstyles  Vehicles
Relation Network            49.68  49.92  52.33  48.48  47.94    45.85        49.62   33.88       54.73
Deep Embedding Model (DEM)  26.97  29.92  30.20  30.34  38.58    31.73        34.11   27.98       43.23
Present method              50.76  50.42  56.03  52.21  50.21    46.03        51.14   34.16       58.87
As can be seen from table 2 and fig. 7, compared with the Relation Network and the Deep Embedding Model, the method of this embodiment achieves the highest recognition accuracy on all experimental datasets, indicating that it can effectively classify and recognize new categories given limited samples and their textual attribute descriptions, and that it has good learning and generalization ability.
In summary, zero/few-shot learning is a weakly supervised machine learning approach for classifying and recognizing images of new category concepts given only a few labeled samples or text descriptions. The text description is typically a description of various attributes of the image category, but may simply be the category name. The method comprises 3 functional modules: an image convolutional encoder, a text mapping encoder, and a relation metric network. An image and its corresponding text description are encoded into the image feature space by the image convolutional encoder and the text mapping encoder respectively, forming the image features and their mapped features; the two features are fused to form the category center points, and finally the relation metric network performs learning and classification of images under zero/few-shot conditions. Aiming at image recognition with few or even no samples, the embodiment of the invention provides a zero/few-shot learning method that incorporates a matching loss function. Current large-scale image classification and recognition models rely on image data alone and do not consider the semantic information contained in text descriptions and category names; the embodiment of the invention fuses images with their text description information and proposes a parallel network architecture containing image encoding and text mapping encoding to solve the image recognition problem with few or even no samples. Meanwhile, the method integrates the matching loss between image and text mapping features into the model training process, which reduces the semantic-gap problem between the image feature space and the text feature space, accelerates model convergence, and improves training efficiency. In zero/few-shot image classification experiments, the method achieves good recognition results; in zero/few-shot learning experiments carried out on large-scale image datasets against several comparison methods, it obtains recognition accuracy superior to typical current algorithms, showing that the learning system has good learning and generalization ability.
In addition, this embodiment also provides a zero/few-shot machine learning device fusing images and their text descriptions, comprising a microprocessor and a memory connected with each other, wherein the microprocessor is programmed or configured to execute the steps of the aforementioned zero/few-shot machine learning method fusing images and their text descriptions.
Furthermore, this embodiment also provides a computer-readable storage medium having stored therein a computer program programmed or configured to execute the aforementioned zero/few-shot machine learning method fusing images and their text descriptions.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, and optical storage) having computer-usable program code embodied therein. The present application is described with reference to flowcharts and/or block diagrams of methods, apparatus (systems), and computer program products according to its embodiments; computer program instructions may be provided to a processor of a computer or other programmable data processing apparatus such that the instructions executed by the processor create means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the functions specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process, such that the instructions executed on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description covers only preferred embodiments of the present invention, and the protection scope of the present invention is not limited to the above embodiments; all technical solutions falling under the idea of the present invention belong to its protection scope. It should be noted that modifications and refinements that would occur to those skilled in the art without departing from the principle of the present invention are also considered to be within its protection scope.

Claims (7)

1. A zero/few-shot machine learning method fusing images and their text descriptions, characterized by comprising the following steps:
1) for each image sample and its category text description vector (I_i^t, u_i^t) in the support set D_t^support: computing the image feature vector x_i^t = Φ(I_i^t; θ*_CNN) of the image sample I_i^t based on a trained image convolutional encoder, where θ*_CNN denotes the trained image convolutional encoder parameters; computing the mapped feature vector s_i^t = Ψ(u_i^t; θ*_Mapping) of the category text description vector u_i^t based on a trained text mapping encoder, where θ*_Mapping denotes the trained text mapping encoder parameters; the image convolutional encoder is a convolutional neural network encoder comprising 4 sequentially connected convolutional layers, each convolutional layer having 64 channels with 3×3 convolution kernels followed by batch normalization and a ReLU nonlinearity, and 2×2 max pooling is used to downsample the feature map after each of the first two convolutional layers; the text mapping encoder is a fully connected neural network model comprising an input layer, a hidden layer and an output layer, the hidden layer using ReLU as its nonlinearity with its dimension set equal to the output dimension of the image convolutional encoder, and the dimension of the output layer being consistent with the dimension of the image feature vector; in the testing stage, all fully connected parameters participate in the prediction of the sample to be identified;
2) for the k-th category image feature vectors x_{k,1}, x_{k,2}, …, x_{k,N} in the support set D_t^support, taking their mean as the category image feature center point x_k; computing the category feature characterization vector v_k based on the category image feature center point x_k and the mapped feature vector s_k of the category, where k = 1, 2, …, M, M is the total number of categories in the support set D_t^support, and N is the number of samples per category in D_t^support;
3) for a sample to be identified I^query in the test set D_t^query, first computing its image feature vector x^query = Φ(I^query; θ*_CNN) based on the trained image convolutional encoder; then computing, based on a trained relation metric network, the similarity Re(x^query, v_k) = φ(x^query, v_k; θ*_Relation) between the image feature vector x^query and each category feature characterization vector v_k in the support sample set, where θ*_Relation denotes the trained relation metric network parameters and k = 1, 2, …, M; the relation metric network comprises 2 convolutional layers and 2 fully connected layers, and is configured to concatenate the two center-point feature vectors of a support sample and a sample under test, pass them sequentially through the 2 convolutional layers and the 2 fully connected layers, and finally output the similarity of the two center-point feature vectors, the convolutional layers being set to 64 channels with 3×3 convolution kernels, each convolution being followed by batch normalization, a ReLU nonlinearity and a 2×2 max pooling dimension-reduction operation;
4) among the M categories of the support set D_t^support, selecting the category with the largest similarity Re(x^query, v_k) as the predicted category label y^query of the sample to be identified.
2. The zero/few-shot machine learning method fusing images and their text descriptions according to claim 1, characterized in that the category image feature center point x_k in step 2) is computed as:

x_k = (1/N) Σ_{i=1}^{N} x_{k,i}

where x_{k,i} is the i-th image feature vector of the k-th category and N is the number of samples per category in the support set D_t^support.
3. The zero/few-shot machine learning method fusing images and their text descriptions according to claim 1, characterized in that the category feature characterization vector v_k in step 2) is computed as:

v_k = (x_k + s_k) / 2

where x_k is the image feature center point of the k-th category and s_k is the mapped feature vector of the k-th category.
4. The zero/few-shot machine learning method fusing images and their text descriptions according to claim 1, characterized in that step 1) is preceded by the step of training the image convolutional encoder, the text mapping encoder and the relation metric network:
S1) sampling a batch training set (D_s^train, D_s^val) from the training set D_s in M-way N-shot Q-query fashion; for each image sample I_i^s therein, computing the image feature vector x_i^s = Φ(I_i^s; θ_CNN) with the image convolutional encoder, where θ_CNN denotes the image convolutional encoder parameters; for the category text description vector u_i^s corresponding to image sample I_i^s, computing its mapped feature vector s_i^s = Ψ(u_i^s; θ_Mapping) with the text mapping encoder, where θ_Mapping denotes the text mapping encoder parameters;
S2) for the k-th category image feature vectors x_{k,1}, x_{k,2}, …, x_{k,N} in the meta-training set D_s^train, taking their mean as the category image feature center point x_k, where k = 1, 2, …, M; then computing the category feature characterization vector v_k based on the category image feature center point x_k and the mapped feature vector s_k of the category, where each category contains N image samples but samples of the same category correspond to the same category description vector, which the text mapping encoder encodes into the mapped feature vector of that category;
S3) for any sample of the meta-test set D_s^val = {(x_i^s, s_i^s), i = 1, 2, …, M*Q}, computing through the relation network the similarity Re(x_i^s, v_k) = φ(x_i^s, v_k; θ_Relation) between the image feature x_i^s and the category feature characterization vector v_k, where θ_Relation denotes the relation metric network parameters, k = 1, 2, …, M, M is the total number of categories, and Q is the number of samples per category in the meta-test set D_s^val of each batch of data in the training phase;
S4) computing the forward error L = L_1 + λL_2 generated by each meta-test sample x_i^s, where L_1 = MSE(Re(x_i^s, v_k), l(x_i^s, v_k)) is the classification loss, the label l(x_i^s, v_k) ∈ {0, 1} indicating whether the two samples x_i^s and v_k belong to the same category, and L_2 = MSE(x_i^s, s_i^s) is the matching loss representing the error between the image feature space and the semantic mapping space, MSE being a mean squared error function; on this basis, computing the accumulated error of the M*Q meta-test samples as the error of the training batch, and updating the image convolutional encoder parameters θ_CNN, the text mapping encoder parameters θ_Mapping and the relation metric network parameters θ_Relation by error backpropagation and gradient descent;
S5) judging whether a preset ending condition is reached; if so, ending the training, recording the finally obtained image convolutional encoder parameters θ_CNN, text mapping encoder parameters θ_Mapping and relation metric network parameters θ_Relation, and exiting; otherwise jumping to step S1) to continue training.
5. The zero/few-shot machine learning method fusing images and their text descriptions according to claim 4, characterized in that the preset ending condition in step S5) is that the number of iterations reaches a preset upper limit or the classification loss reaches an error loss lower limit.
6. A zero/few-shot machine learning device fusing images and their text descriptions, comprising a microprocessor and a memory connected with each other, characterized in that the microprocessor is programmed or configured to execute the steps of the zero/few-shot machine learning method fusing images and their text descriptions according to any one of claims 1-5.
7. A computer-readable storage medium having stored therein a computer program programmed or configured to execute the zero/few-shot machine learning method fusing images and their text descriptions according to any one of claims 1-5.
CN202110083109.6A (priority date 2021-01-21, filed 2021-01-21): Zero/few-shot machine learning method and system fusing images and their text descriptions. Status: Active. Granted as CN112801159B.

Priority Applications (1)

CN202110083109.6A (priority date 2021-01-21, filed 2021-01-21): Zero/few-shot machine learning method and system fusing images and their text descriptions

Publications (2)

CN112801159A (application publication): 2021-05-14
CN112801159B (granted publication): 2022-07-19

Family

ID=75811067

Family Applications (1)

CN202110083109.6A (CN, Active; priority date 2021-01-21, filed 2021-01-21): Zero/few-shot machine learning method and system fusing images and their text descriptions

Families Citing this family (2)

CN113377990B (priority 2021-06-09, published 2022-06-14, University of Electronic Science and Technology of China): Video/picture-text cross-modal matching training method based on meta-self-learning
CN114898193A (priority 2022-07-11, published 2022-08-12, Zhejiang Lab): Manifold-learning-based image feature fusion method and device and image classification system

Citations (3)

CN110232341A (priority 2019-05-30, published 2019-09-13, Chongqing University of Posts and Telecommunications): Semi-supervised learning image recognition method based on a convolution-stacked denoising coding network
CN110363239A (priority 2019-07-04, published 2019-10-22, National University of Defense Technology): Few-shot machine learning method, system and medium oriented to multi-modal data
CN111291212A (priority 2020-01-24, published 2020-06-16, Fudan University): Zero-shot sketch image retrieval method and system based on graph convolutional neural networks

Family Cites Families (1)

US8768050B2 (priority 2011-06-13, published 2014-07-01, Microsoft Corporation): Accurate text classification through selective use of image data

Non-Patent Citations (2)

Chongyu Pan et al., "Towards zero-shot learning generalization via a cosine distance loss", Neurocomputing, 2020-03-14, pp. 167-176.
Pan Chongyu (潘崇煜), "A survey of weakly supervised learning methods fusing zero-shot learning and few-shot learning" (融合零样本学习和小样本学习的弱监督学习方法综述), Systems Engineering and Electronics (系统工程与电子技术), October 2020, pp. 2246-2256.

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant