CN112257808B - Integrated collaborative training method and device for zero sample classification and terminal equipment - Google Patents

Integrated collaborative training method and device for zero sample classification and terminal equipment

Info

Publication number
CN112257808B
CN112257808B, CN202011202927.5A
Authority
CN
China
Prior art keywords
invisible
class
attribute
network
training
Prior art date
Legal status
Active
Application number
CN202011202927.5A
Other languages
Chinese (zh)
Other versions
CN112257808A (en)
Inventor
郭毅博
范一鸣
王海迪
孟文化
姜晓恒
徐明亮
Current Assignee
Zhengzhou University
Original Assignee
Zhengzhou University
Priority date
Filing date
Publication date
Application filed by Zhengzhou University filed Critical Zhengzhou University
Priority to CN202011202927.5A priority Critical patent/CN112257808B/en
Publication of CN112257808A publication Critical patent/CN112257808A/en
Application granted granted Critical
Publication of CN112257808B publication Critical patent/CN112257808B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/254 Fusion techniques of classification results, e.g. of results related to same input data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G06N 20/20 Ensemble learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/259 Fusion by voting

Abstract

The invention relates to an integrated collaborative training method and device for zero sample classification and to terminal equipment. An acquired data set is divided into a training set and a test set, which are respectively called the visible class and the invisible class. Attribute prediction networks with different structures are trained, and two of them are selected as the main network and the secondary network. Attribute mapping parameters are calculated, and virtual features of the invisible classes are synthesized according to the attribute mapping parameters; the virtual features are combined with a plurality of classifiers to complete the training of the classifiers. The invisible class features are then extracted with the main and secondary networks and predicted with the classifiers; according to a classifier voting mechanism, pseudo labels are assigned to the invisible class samples that satisfy the conditions, and the pseudo-labeled samples are added to the training set to train the attribute prediction networks again, which improves the prediction accuracy of the network model. Meanwhile, different ZSL embedding methods can be used for training and for selecting the main and secondary networks, so the method is easily extended to other zero sample learning methods and improves their performance.

Description

Integrated collaborative training method and device for zero sample classification and terminal equipment
Technical Field
The invention relates to an integrated collaborative training method and device for zero sample classification and terminal equipment.
Background
Owing to the effectiveness of deep learning on image recognition problems, supervised image recognition methods have achieved remarkable results in many fields. However, a considerable number of labeled samples are usually needed to train a sufficiently good recognition model, and a model trained on known samples can only recognize the object classes contained in the training set; it lacks the ability to recognize object classes not contained in the training set. In real life, image data for some categories is scarce, the number of image categories to be recognized keeps growing, and retraining the model every time data of a new category is added is costly, so the image recognition field should not rely entirely on methods that require a large number of samples. The more challenging zero-sample learning is therefore proposed, which aims to recognize target instances from images of categories never seen before.
Early research on zero-sample learning dates back to 2008, when Larochelle H et al. applied a zero-data learning method to a character classification problem. Palatucci M et al. formally proposed the concept of zero-shot learning (ZSL) in the following year. In the same year, Lampert C H et al. proposed an attribute-based class transfer learning mechanism and the Animals with Attributes (AWA) data set, and first proposed Direct Attribute Prediction (DAP) and Indirect Attribute Prediction (IAP) with image recognition as the application scenario. Because this line of research differs from the way of thinking of the traditional image recognition task and meets the development needs of the image recognition field, zero-sample learning began to attract wide attention. In a zero-sample task, every category is provided with a relevant description, such as common attribute features shared among categories (color, wings, crawling, tail, and the like), and the mapping between images and category labels in the supervised image recognition problem is converted into mappings between images, semantics and categories.
In early ZSL methods such as DAP and IAP, the classifier for each attribute is trained independently, and the relationships between attributes within a class are not considered. Therefore, recent ZSL methods almost all design different constraint terms on image visual features or semantics to learn the mapping between image visual features and class embeddings, or construct a common embedding space for images and semantic attributes. For example, the SJE proposed by Akata Z et al. in 2015 completes compatibility modeling from the visual space to the semantic space by training a structured support vector machine; the EXEM proposed by Changpinyo S et al. in 2017 projects semantic information onto visual feature centers in the visual space; and the LDF proposed by Li Y et al. in 2018 constructs a latent feature embedding space to associate visual and semantic information. However, the final objective of ZSL is to predict object classes not contained in the training set. Since the same attribute in known and unknown classes often has different appearances (taking the tail as an example, a pig's tail differs greatly in appearance from the tails of animals such as tigers and zebras), a domain shift problem may arise, that is, the visual features corresponding to the same attribute may differ greatly between classes. When the network model is used to classify the test set, new classes that have never been seen are often classified into the known classes of the training set, resulting in poor prediction accuracy of the network model.
Disclosure of Invention
The invention aims to provide an integrated collaborative training method and device for zero sample classification and terminal equipment, so as to solve the problem that network models obtained by existing training methods have poor prediction accuracy.
In order to solve the problems, the invention adopts the following technical scheme:
an integrated collaborative training method for zero sample classification, comprising:
acquiring a data set and an attribute library thereof, dividing the data set into a training set and a testing set, and respectively calling the training set and the testing set as a visible class and an invisible class;
training attribute prediction networks with different structures, and selecting two networks as a main network and an auxiliary network according to the robustness and generalization capability of the different networks to invisible classes;
calculating the mapping relation between the visible class attributes and the invisible class attributes in the attribute library to obtain attribute mapping parameters, respectively extracting the image characteristics of the visible class by using a main network and a secondary network, and synthesizing the virtual characteristics of the invisible class according to the attribute mapping parameters;
the virtual features and the classifiers are combined to complete the training of the classifiers, the invisible features are extracted by using a main network and a secondary network, the invisible features are predicted by using the classifiers, the invisible classes meeting the conditions are endowed with pseudo labels according to a classifier voting mechanism, and the invisible classes endowed with the pseudo labels are added into a training set to train the attribute prediction network again.
Preferably, the dividing of the data set into a training set and a test set, which are respectively referred to as a visible class and an invisible class, includes:
dividing the data set into a training set and a test set, which are respectively called the visible class D^S and the invisible class D^U;
wherein the visible class samples are labeled, D^S = {(x_i^s, y_i^s)}, i = 1, …, l, where x_i^s denotes the i-th picture in the visible class data set D^S, y_i^s is the category label of x_i^s, and y_i^s ∈ Y^S; the invisible class samples are unlabeled, D^U = {x_j^u}, j = 1, …, n, with label space Y^U, where Y^U ∩ Y^S = ∅ and Y^U ∪ Y^S = Y; for each y ∈ Y, there is a semantic attribute A_y = {a_1, a_2, …, a_(l+n)} associated with it.
Preferably, the training of attribute prediction networks with different structures and the selection of two networks as a primary network and a secondary network according to the robustness and generalization capability of the different networks to invisible classes includes:
training attribute prediction networks of different structures, wherein an attribute prediction network consists of a feature extraction function θ(·) and a classification function φ_main, the feature extraction function θ(·) being given by formula (1) and the classification function φ_main by formula (2):
θ(x) = CNN(x; W_cnn)   (1)
φ_main(θ(x)) = W_main · θ(x)   (2)
wherein W_cnn denotes the parameters of the convolutional layers in the network, x denotes the input image sample, and W_main denotes the parameters of the fully connected layer of the network;
sending the visible classes into the attribute prediction network for training through formula (3), the optimizer being an adaptive moment estimation optimizer:
L = - Σ_i [ a_i · log σ(φ_main(θ(x_i))) + (1 - a_i) · log(1 - σ(φ_main(θ(x_i)))) ]   (3)
wherein σ denotes the sigmoid activation function and a_i is the attribute label of x_i;
the output of the network is the predicted invisible class attribute, which is input into formula (4) to predict the category of the invisible class:
ŷ = argmax_(c ∈ Y^U) cos(φ_pre, A_c^U)   (4)
wherein φ_pre denotes the semantic attribute predicted by the network and A_c^U denotes the real invisible class semantic attributes in the attribute library;
taking formula (5) as the evaluation index, and selecting the two networks with the highest evaluation index as the primary and secondary networks:
acc = (1/γ) · Σ_(c=1…γ) acc_c^(Top-1)   (5)
wherein acc_c^(Top-1) denotes the Top-1 accuracy of the c-th invisible class and γ denotes the total number of invisible classes in the test set.
Preferably, the calculating of the mapping relationship between visible class and invisible class attributes in the attribute library to obtain attribute mapping parameters, extracting image features of the visible classes with the primary and secondary networks respectively, and synthesizing virtual features of the invisible classes according to the attribute mapping parameters includes:
regularizing the attributes in the attribute library, and calculating the attribute mapping parameters W_map between the invisible classes and the visible classes through formula (6):
W_map = argmin_W ||A^U - W·A^S||² + λ·||W||²   (6)
wherein A^U denotes the attributes of the invisible classes and A^S denotes the attributes of the visible classes;
the visible class features are obtained as the class-wise prototype (mean) features extracted by the primary and secondary networks;
the virtual features of the invisible classes are obtained by applying the attribute mapping parameters to the visible class features, i.e. P^U = W_map · P^S.
Preferably, the combining of the virtual features with a plurality of classifiers to complete the training of the classifiers, extracting the invisible class features with the primary and secondary networks, predicting the invisible class features with the classifiers, assigning pseudo labels to the invisible classes that satisfy the conditions according to a classifier voting mechanism, and adding the pseudo-labeled invisible classes to the training set to train the attribute prediction network again includes:
training the classifiers according to the synthesized virtual features of the invisible classes and their corresponding labels, obtaining the predicted semantic attributes through formula (7), and predicting the invisible classes through formula (8):
φ_pre = F_classification(θ(x^u))   (7)
ŷ = argmax_(c ∈ Y^U) cos(φ_pre, A_c^U)   (8)
wherein F_classification denotes the classifier used by the network;
and according to the classifier voting mechanism, assigning pseudo labels to the invisible classes that satisfy the conditions, adding the pseudo-labeled invisible classes to the training set, and training the attribute prediction network again until training ends.
An integrated co-training apparatus for zero sample classification, comprising:
the data dividing module is used for acquiring a data set and an attribute library thereof, dividing the data set into a training set and a test set, and respectively calling the training set and the test set as a visible class and an invisible class;
the main network acquisition module and the auxiliary network acquisition module are used for training attribute prediction networks with different structures, and selecting two networks as a main network and an auxiliary network according to the robustness and generalization capability of the different networks to invisible classes;
the invisible class virtual feature synthesis module is used for calculating the mapping relation between visible class attributes and invisible class attributes in the attribute library to obtain attribute mapping parameters, respectively extracting the image features of the visible class by using the main network and the auxiliary network, and synthesizing the invisible class virtual features according to the attribute mapping parameters;
and the network training module is used for combining the virtual features with the plurality of classifiers to complete the training of the classifiers, extracting invisible features by using the main and auxiliary networks, predicting the invisible features by using the classifiers, endowing the invisible classes meeting the conditions with pseudo labels according to a classifier voting mechanism, adding the invisible classes endowed with the pseudo labels into a training set, and training the attribute prediction network again.
A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the integrated co-training method for zero sample classification as described above when executing the computer program.
The invention has the beneficial effects that: the data set is divided into a training set and a test set, respectively called the visible class and the invisible class; attribute prediction networks with different structures are trained, and two of them are selected as the main and auxiliary networks according to the robustness and generalization capability of the different networks to the invisible classes; invisible class features are then synthesized by combining the attribute mapping parameters between the visible and invisible classes; finally, a plurality of classifiers trained on the synthesized features are used to assign pseudo labels to the invisible classes, and the pseudo-labeled samples are added to the training set to train the attribute prediction networks again, which improves the prediction accuracy of the network model.
Drawings
In order to more clearly illustrate the technical solution of the embodiments of the present invention, the drawings required to be used in the embodiments will be briefly described as follows:
fig. 1 is a schematic overall flowchart of an integrated collaborative training method for zero sample classification according to an embodiment of the present application;
FIG. 2 is a flowchart of an algorithm corresponding to the integrated co-training method for zero sample classification;
FIG. 3 is a graph showing the relationship between the number of training cycles and TOP-1 accuracy;
FIG. 4 is a schematic overall structure diagram of an integrated cooperative training apparatus for zero sample classification according to a second embodiment of the present application;
fig. 5 is a schematic structural diagram of a terminal device according to a third embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection". Similarly, the phrase "if it is determined" or "if a [ described condition or event ] is detected" may be interpreted contextually to mean "upon determining" or "in response to determining" or "upon detecting [ described condition or event ]" or "in response to detecting [ described condition or event ]".
Furthermore, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used for distinguishing between descriptions and not necessarily for describing or implying relative importance.
Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless otherwise specifically stated.
The integrated collaborative training method for zero sample classification provided by the embodiment of the application can be applied to terminal equipment such as a mobile phone, a tablet computer, a notebook computer and a personal computer, and the embodiment of the application does not limit the specific type of the terminal equipment. That is, the carrier of the client corresponding to the integrated collaborative training method for zero sample classification provided in the embodiment of the present application may be any one of the above terminal devices.
In order to explain the technical means described in the present application, the following description will be given by way of specific embodiments.
The first part is network selection: the fully connected layers of pre-trained convolutional neural networks with different structures are modified and used directly to learn the mapping between visual features and semantics, and two networks are selected as the main network and the auxiliary network according to the robustness and generalization capability of the different networks to the test samples. The second part is pseudo label prediction: first, features are extracted from the training classes and the test classes with the main and auxiliary networks, the mapping parameters between the semantics of the test class samples and the training class samples are calculated, and these parameters are combined with the training class features to generate virtual features of the test classes; then, different classifiers are constructed, the virtual features of the test classes and their corresponding labels are sent to the classifiers for training, the features extracted from the test classes are predicted, pseudo labels are assigned to the test class samples that satisfy the conditions according to a classifier voting mechanism, the pseudo-labeled test class samples are added to the training set, and the convolutional neural networks are trained again. Finally, the pseudo label prediction process is repeated until the network accuracy no longer changes obviously.
Referring to fig. 1, which is a flowchart of an implementation process of an integrated collaborative training method for zero sample classification according to an embodiment of the present application, for convenience of description, only a part related to the embodiment of the present application is shown.
The integrated collaborative training method for zero sample classification comprises the following steps:
step S101: acquiring a data set and an attribute library thereof, dividing the data set into a training set and a testing set, and respectively calling the training set and the testing set as a visible class and an invisible class:
Acquiring a data set and its attribute library, and dividing the data set into a training set and a test set, for example according to a preset ratio; the training set and the test set are respectively called the visible class D^S and the invisible class D^U.
Wherein the visible class samples are labeled, D^S = {(x_i^s, y_i^s)}, i = 1, …, l, i.e. there are l labeled pictures in total, where x_i^s denotes the i-th picture in the visible class data set D^S and y_i^s ∈ Y^S is the category label of x_i^s; the invisible class samples are unlabeled, D^U = {x_j^u}, j = 1, …, n, i.e. there are n unlabeled pictures in total, with label space Y^U, where Y^U ∩ Y^S = ∅ and Y^U ∪ Y^S = Y. For each y ∈ Y, there is a semantic attribute A_y = {a_1, a_2, …, a_(l+n)} associated with it.
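Purely for illustration, the following minimal sketch shows one way the notation above could be represented in code; the class names, image shapes and attribute dimension are hypothetical examples rather than values prescribed by the method.

```python
import numpy as np

# Hypothetical illustration of step S101: a visible (seen) class set D_S with labels,
# an invisible (unseen) class set D_U without labels, and a semantic attribute library A_y.
rng = np.random.default_rng(0)

visible_classes = ["horse", "zebra", "tiger"]      # Y_S (example names only)
invisible_classes = ["pig", "whale"]               # Y_U, disjoint from Y_S
all_classes = visible_classes + invisible_classes  # Y = Y_S ∪ Y_U

n_attributes = 16                                  # size of each A_y (arbitrary here)
attribute_library = {y: rng.random(n_attributes) for y in all_classes}

# D_S: l labelled samples (x_i^s, y_i^s); random tensors stand in for images.
l = 30
D_S = [(rng.random((3, 64, 64)), rng.choice(visible_classes)) for _ in range(l)]

# D_U: n unlabelled samples x_j^u whose true classes come only from Y_U.
n = 10
D_U = [rng.random((3, 64, 64)) for _ in range(n)]

print(len(D_S), "labelled visible-class samples,", len(D_U), "unlabelled invisible-class samples")
```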
Step S102: training attribute prediction networks with different structures, and selecting two networks as a main network and an auxiliary network according to the robustness and generalization capability of different networks to invisible classes:
Fig. 2 is a flowchart of the algorithm corresponding to the integrated collaborative training method for zero sample classification provided by the present invention. The algorithm starts from network selection, the purpose of which is to select two networks with strong robustness and generalization capability for the invisible classes to form a collaborative network; according to their performance, the two networks are called the main network and the auxiliary network. For the same data set, models learned by different network architectures produce different prediction distributions on the test set. Meanwhile, following the standard co-training rule, random noise is added to the training set to construct different data sets, and after the main and auxiliary networks are selected, the auxiliary network is retrained on the data set constructed with random noise; the assumption is that the data can be classified from different angles to achieve a complementary effect. For the embedding-based ZSL method, the training data needs to be sent into a convolutional neural network to obtain visual features, which are projected into the semantic space. Based on the idea of the co-training algorithm, by combining the noise-added data set, networks with different structures and semantic classifiers, and exploiting the correlation between global features and image semantics, the prediction error in embedding the visible classes is alleviated adaptively from different angles of the image, and the predicted invisible classes are added to the training set, thereby alleviating the domain shift problem caused by using only the visible classes during training.
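The "random noise is added to the training set to construct different data sets" idea can be sketched as follows; the Gaussian noise level and the toy image tensors are illustrative assumptions only.

```python
import numpy as np

def make_noisy_copy(images, noise_std=0.05, seed=0):
    """Return a noise-perturbed copy of the training images so that the auxiliary
    network can be retrained on a slightly different view of the same data,
    in the spirit of diversifying the two co-training views."""
    rng = np.random.default_rng(seed)
    noisy = []
    for img in images:
        noisy_img = img + rng.normal(0.0, noise_std, size=img.shape)
        noisy.append(np.clip(noisy_img, 0.0, 1.0))  # keep pixel values in a valid range
    return noisy

# Example: perturb a small batch of random "images" for the auxiliary network.
images = [np.random.default_rng(1).random((3, 64, 64)) for _ in range(4)]
noisy_images = make_noisy_copy(images)
print(len(noisy_images), "noisy copies created")
```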
The specific implementation of step S102 is given below:
Training attribute prediction networks of different structures, wherein an attribute prediction network consists of a feature extraction function θ(·) and a classification function φ_main, the feature extraction function θ(·) being given by formula (1) and the classification function φ_main by formula (2):
θ(x) = CNN(x; W_cnn)   (1)
φ_main(θ(x)) = W_main · θ(x)   (2)
wherein W_cnn denotes the parameters of the convolutional layers in the network, x denotes the input image sample, and W_main denotes the parameters of the fully connected layer of the network.
The visible classes are fed into the attribute prediction network and trained through formula (3). For the training problem of the network in this embodiment, the loss function is minimized over the labeled training samples (x_i, a_i), and different embedding methods have different loss functions; formula (3) uses a binary cross-entropy loss to update the network parameters, and the optimizer is an adaptive moment estimation (Adam) optimizer:
L = - Σ_i [ a_i · log σ(φ_main(θ(x_i))) + (1 - a_i) · log(1 - σ(φ_main(θ(x_i)))) ]   (3)
wherein σ denotes the sigmoid activation function and a_i is the attribute label of x_i.
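A minimal PyTorch-style sketch of this training step is given below, assuming a ResNet backbone plays the role of θ(·; W_cnn) and a single fully connected layer plays φ_main(·; W_main), with a binary cross-entropy loss and the Adam optimizer; the backbone choice, attribute dimension and hyperparameters are illustrative assumptions, not values fixed by the patent.

```python
import torch
import torch.nn as nn
from torchvision import models

class AttributePredictionNet(nn.Module):
    """θ(·): convolutional feature extractor; φ_main(·): fully connected head
    that maps visual features to the semantic attribute vector."""
    def __init__(self, n_attributes=16):
        super().__init__()
        backbone = models.resnet18()            # any CNN backbone can play θ here;
        in_features = backbone.fc.in_features   # pretrained weights would normally be loaded
        backbone.fc = nn.Identity()             # keep only the feature extractor θ(x)
        self.theta = backbone
        self.phi_main = nn.Linear(in_features, n_attributes)

    def forward(self, x):
        return self.phi_main(self.theta(x))     # raw attribute logits

net = AttributePredictionNet(n_attributes=16)
criterion = nn.BCEWithLogitsLoss()              # sigmoid + binary cross-entropy, formula (3)
optimizer = torch.optim.Adam(net.parameters(), lr=1e-4)

# One illustrative training step on a toy batch (random data stands in for D_S).
x = torch.rand(8, 3, 224, 224)
a = torch.rand(8, 16).round()                   # binary attribute labels a_i
loss = criterion(net(x), a)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(float(loss))
```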
The output of the network is the predicted invisible class attribute, which is input into formula (4) to predict the category of the invisible class; in this embodiment, cosine similarity is used as the metric of closeness between the predicted semantics and the invisible class semantics:
ŷ = argmax_(c ∈ Y^U) cos(φ_pre, A_c^U)   (4)
wherein φ_pre denotes the semantic attribute predicted by the network and A_c^U denotes the real invisible class semantic attributes in the attribute library.
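A small numpy sketch of this cosine-similarity decision follows; the attribute matrix and the predicted attribute vector are random stand-ins used only for illustration.

```python
import numpy as np

def predict_class_by_cosine(phi_pre, unseen_attribute_matrix):
    """Return the index of the unseen class whose semantic attribute vector
    is closest (by cosine similarity) to the predicted attributes phi_pre."""
    a = unseen_attribute_matrix / np.linalg.norm(unseen_attribute_matrix, axis=1, keepdims=True)
    p = phi_pre / np.linalg.norm(phi_pre)
    return int(np.argmax(a @ p))

rng = np.random.default_rng(0)
A_U = rng.random((5, 16))     # 5 unseen classes x 16 attributes (arbitrary toy sizes)
phi_pre = rng.random(16)      # network-predicted semantic attributes
print("predicted unseen class index:", predict_class_by_cosine(phi_pre, A_U))
```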
Taking formula (5) as the evaluation index, the two networks with the highest evaluation index are selected as the main and auxiliary networks:
acc = (1/γ) · Σ_(c=1…γ) acc_c^(Top-1)   (5)
wherein acc_c^(Top-1) denotes the Top-1 accuracy of the c-th invisible class and γ denotes the total number of invisible classes in the test set.
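The sketch below illustrates the mean per-class Top-1 evaluation index of formula (5) and the selection of the two best candidate networks; the candidate scores at the end are placeholder values for illustration, not experimental results.

```python
import numpy as np

def mean_per_class_top1(y_true, y_pred):
    """acc = (1/γ) Σ_c acc_c^(Top-1): average the Top-1 accuracy over the γ
    invisible classes so that rare classes count as much as frequent ones."""
    classes = np.unique(y_true)
    per_class = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(per_class))

def select_main_and_auxiliary(scores_by_network):
    """Return the names of the two candidate networks with the highest evaluation index."""
    ranked = sorted(scores_by_network, key=scores_by_network.get, reverse=True)
    return ranked[0], ranked[1]

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
print("mean per-class Top-1:", mean_per_class_top1(y_true, y_pred))

# Placeholder evaluation scores for four candidate backbones (made-up example values).
scores = {"ResNet-101": 0.61, "ResNet-50": 0.58, "VGG-19": 0.52, "GoogLeNet": 0.50}
print("main, auxiliary =", select_main_and_auxiliary(scores))
```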
It should be understood that, in order to select networks with stronger robustness and generalization capability, as a specific embodiment, experiments are performed on the current mainstream convolutional neural networks VGG, GoogLeNet and ResNet as well as EfficientNet, the strongest network on ImageNet, and the two networks with the best performance are selected as the main and auxiliary networks. As shown in Table 1, which lists the average TOP-1 accuracy of different convolutional neural networks on different data sets, the ResNet series networks perform best.
TABLE 1: average TOP-1 accuracy of different convolutional neural networks on different data sets (values shown as an image in the original)
Step S103: calculating the mapping relation between the visible class attributes and the invisible class attributes in the attribute library to obtain attribute mapping parameters, respectively extracting the image characteristics of the visible class by using a main network and an auxiliary network, and synthesizing the virtual characteristics of the invisible class according to the attribute mapping parameters:
Regularizing the attributes in the attribute library, and calculating the attribute mapping parameters W_map between the invisible classes and the visible classes through formula (6):
W_map = argmin_W ||A^U - W·A^S||² + λ·||W||²   (6)
wherein A^U denotes the attributes of the invisible classes and A^S denotes the attributes of the visible classes.
Prototype features are the central features of all samples of each class; based on the semantic information, virtual prototypes of the invisible classes are synthesized from the visible class prototype features and used as the input of the semantic classifiers in the proposed algorithm. The visible class prototype features (i.e. the visible class features) are obtained by averaging, for each visible class, the features θ(x) extracted by the main and auxiliary networks over all samples of that class.
The virtual prototype features of the invisible classes (i.e. the virtual features of the invisible classes) are obtained by ridge regression, i.e. by applying the attribute mapping parameters to the visible class prototype features: P^U = W_map · P^S.
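Under the assumption that the attribute mapping of formula (6) is a ridge-regression style linear map from visible class attributes to invisible class attributes, the synthesis of virtual prototypes can be sketched as follows; the regularization strength and matrix sizes are illustrative only.

```python
import numpy as np

def attribute_mapping(A_unseen, A_seen, lam=1.0):
    """Closed-form ridge solution of W ≈ argmin ||A_unseen - W A_seen||^2 + lam ||W||^2:
    W = A_unseen A_seen^T (A_seen A_seen^T + lam I)^(-1)."""
    gram = A_seen @ A_seen.T + lam * np.eye(A_seen.shape[0])
    return A_unseen @ A_seen.T @ np.linalg.inv(gram)

def class_prototypes(features, labels, classes):
    """Prototype feature of each class = mean feature of its samples."""
    return np.stack([features[labels == c].mean(axis=0) for c in classes])

rng = np.random.default_rng(0)
A_seen = rng.random((3, 16))     # 3 visible classes x 16 attributes (toy sizes)
A_unseen = rng.random((2, 16))   # 2 invisible classes x 16 attributes

feats = rng.random((30, 512))    # features θ(x) extracted by the main/auxiliary network
labels = rng.integers(0, 3, 30)  # visible-class labels for those samples

P_seen = class_prototypes(feats, labels, classes=[0, 1, 2])
W_map = attribute_mapping(A_unseen, A_seen)
P_unseen_virtual = W_map @ P_seen   # invisible-class virtual prototype features
print(P_unseen_virtual.shape)       # (2, 512)
```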
step S104: the virtual features and the classifiers are combined to complete the training of the classifiers, the invisible features are extracted by using a main network and a secondary network, the invisible features are predicted by using the classifiers, pseudo labels are given to the invisible classes meeting the conditions according to a classifier voting mechanism, the invisible classes given the pseudo labels are added into a training set to train the attribute prediction network again:
The virtual features are combined with a plurality of classifiers to complete the training of the classifiers, and the invisible class features are extracted with the main and auxiliary networks and predicted with the classifiers: the synthesized invisible class virtual features P^U and their corresponding labels are fed into the classifiers, and the classifiers are trained. In this embodiment, current mainstream classifiers such as LASSO regression, ridge regression, Bayesian ridge regression, linear regression, support vector machines and random forests can be considered.
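A hedged scikit-learn sketch of training such a classifier bank on the synthesized virtual prototypes is shown below, with the regressors mapping features to semantic attribute vectors (formula (7)); all data are random toy values, and the particular regressors kept (LASSO, ridge, Bayesian ridge) follow the choice discussed around Table 2.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge, BayesianRidge
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(0)
P_virtual = rng.random((20, 512))   # synthesized invisible-class virtual features
A_virtual = rng.random((20, 16))    # their corresponding semantic attribute targets

# One bank of attribute regressors per network (main and auxiliary); BayesianRidge
# is single-output, so it is wrapped to predict the whole attribute vector.
classifier_bank = {
    "lasso": MultiOutputRegressor(Lasso(alpha=0.01)),
    "ridge": Ridge(alpha=1.0),
    "bayesian_ridge": MultiOutputRegressor(BayesianRidge()),
}
for name, clf in classifier_bank.items():
    clf.fit(P_virtual, A_virtual)

# Formula (7): predicted semantic attributes for one unseen-class feature vector.
theta_xu = rng.random((1, 512))
phi_pre = {name: clf.predict(theta_xu)[0] for name, clf in classifier_bank.items()}
print({name: p.shape for name, p in phi_pre.items()})
```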
The predicted semantic attributes are obtained by formula (7):
φ_pre = F_classification(θ(x^u))   (7)
wherein F_classification denotes the classifier used by the network. Considering the samples from multiple angles improves generalization performance and reduces the risk of a single classifier falling into a local minimum. The invisible classes are then predicted by formula (8):
ŷ = argmax_(c ∈ Y^U) cos(φ_pre, A_c^U)   (8)
As a specific embodiment, under certain hardware conditions, for example with an i7-8700K CPU, the average TOP-1 accuracy and training time of different classifiers on the CUB data set are counted, as shown in Table 2. Considering both accuracy and time, LASSO regression, ridge regression and Bayesian ridge regression can be selected as the classifiers.
TABLE 2: average TOP-1 accuracy and training time of different classifiers on the CUB data set (values shown as an image in the original)
Then, according to the classifier voting mechanism, pseudo labels are assigned to the invisible class samples that satisfy the conditions. The general classifier voting rule means that several classifiers often give different prediction results, so a voting classifier is built on top of these base classifiers and the class with the most votes is taken as the predicted class. Here, the main and auxiliary networks each have three corresponding classifiers, and the following voting rule is formulated: the best 5 prediction results are selected from the classifier prediction results according to accuracy; it is then judged whether the number of invisible class samples labeled under a 4-vote rule is more than half of the total number of invisible class samples; if so, the 4-vote rule is adopted, otherwise a 3-vote rule is adopted to assign pseudo labels to D^U, and the 3-vote rule is adopted once the fourth loop is reached. The accuracy of the network model on the SUN data set is counted under different vote numbers Z, as shown in FIG. 3; it can be seen that the accuracy no longer increases in the third cycle, and, considering this trend, the network model reaches its best performance under this labeling rule.
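A simplified sketch of the voting idea follows: each classifier votes for an invisible class, and a sample only receives a pseudo label when at least a minimum number of votes agree. This is a simplified stand-in for the 5-best / 4-vote / 3-vote rule described above, not its exact implementation.

```python
from collections import Counter

def vote_pseudo_label(votes, min_votes):
    """Return the majority class if it received at least `min_votes` votes,
    otherwise None (the sample is left unlabeled in this cycle)."""
    cls, count = Counter(votes).most_common(1)[0]
    return cls if count >= min_votes else None

# Each row: the class indices predicted for one unseen sample by the classifiers
# of the main and auxiliary networks (toy values for illustration only).
all_votes = [
    [1, 1, 1, 2, 1],   # strong agreement  -> pseudo label 1
    [0, 1, 2, 3, 4],   # no agreement      -> no pseudo label this cycle
]
for votes in all_votes:
    print(votes, "->", vote_pseudo_label(votes, min_votes=3))
```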
After the invisible class samples are labeled, they are added to the training set and the network is trained again until the accuracy no longer changes obviously (generally, the accuracy no longer changes obviously by the 5th cycle); that is, the invisible class samples assigned pseudo labels are added to the training set and the attribute prediction networks are retrained (steps S103 and S104 are repeated) until training ends (for example, the accuracy no longer changes obviously, or the number of cycles reaches a preset number).
Table 3 compares the experimental results of the method provided in the present application with other existing methods; the LFGAA proposed by Liu Y et al. in 2019 is also introduced as the method used to train the networks in the network selection module. It can be seen that the performance of the method provided in the present application is improved compared with other existing methods.
TABLE 3: comparison of experimental results with existing methods (values shown as an image in the original)
In Table 3, y_u denotes the average Top-1 accuracy on the invisible class samples, y_s denotes the average Top-1 accuracy on the tested visible class samples, and "-" indicates that the prior art did not disclose this result.
As can be seen from the above steps, this embodiment makes use of invisible class virtual prototype features synthesized from semantic information. Although these virtual prototype features cannot be guaranteed to be identical to real images, reliable invisible class labels are obtained from them and put into the training set to retrain the network, so the prediction error in embedding the visible classes can be overcome and the domain shift problem caused by training only with the visible classes is further alleviated.
Fig. 4 shows a structural block diagram of an integrated cooperative training apparatus for zero sample classification provided in embodiment two of the present application, and for convenience of description, only the parts related to the embodiment of the present application are shown.
Referring to fig. 4, an integrated co-training apparatus 200 for zero sample classification includes:
the data dividing module 201 is configured to obtain a data set and an attribute library thereof, divide the data set into a training set and a test set, and respectively refer to the training set and the test set as a visible class and an invisible class;
the primary and secondary network acquisition module 202 is used for training attribute prediction networks with different structures, and selecting two networks as primary and secondary networks according to the robustness and generalization capability of the different networks to invisible classes;
the invisible class virtual feature synthesis module 203 is used for calculating the mapping relationship between the visible class and the invisible class attributes in the attribute library to obtain attribute mapping parameters, respectively extracting the image features of the visible class by using the main and auxiliary networks, and synthesizing the invisible class virtual features according to the attribute mapping parameters;
the network training module 204 is configured to combine the virtual features with multiple classifiers to complete training of the classifiers, extract invisible features using the primary and secondary networks, predict the invisible features using the classifiers, assign pseudo labels to the invisible classes meeting the conditions according to a classifier voting mechanism, add the invisible classes assigned with the pseudo labels into a training set, and train the attribute prediction network again.
It should be noted that, since the above devices/modules are based on the same concept as the method embodiment, the information interaction and execution process between them, as well as their specific functions and technical effects, can be found in the section on the embodiment of the integrated collaborative training method for zero sample classification and are not described herein again.
It is clear to those skilled in the art that, for the convenience and simplicity of description, the above division of the functional modules is only used for illustration, and in practical applications, the above function distribution may be performed by different functional modules according to needs, that is, the internal structure of the integrated cooperative training apparatus for zero sample classification 200 is divided into different functional modules to perform all or part of the above described functions. Each functional module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional modules are only used for distinguishing one functional module from another, and are not used for limiting the protection scope of the application. The specific working process of each functional module in the above description may refer to the corresponding process in the foregoing embodiment of the integrated collaborative training method for zero sample classification, and is not described herein again.
Fig. 5 is a schematic structural diagram of a terminal device according to a third embodiment of the present application. As shown in fig. 5, the terminal device 300 includes: a processor 302, a memory 301, and a computer program 303 stored in the memory 301 and operable on the processor 302. The number of the processors 302 is at least one, and fig. 5 takes one as an example. The processor 302, when executing the computer program 303, implements the implementation steps of the above-described integrated co-training method for zero sample classification, i.e., the steps shown in fig. 1.
The specific implementation process of the terminal device 300 can be seen in the above embodiment of the integrated collaborative training method for zero sample classification.
Illustratively, the computer program 303 may be partitioned into one or more modules/units that are stored in the memory 301 and executed by the processor 302 to accomplish the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution process of the computer program 303 in the terminal device 300.
The terminal device 300 may be a computing device such as a desktop computer, a notebook, a palm computer, a main control, or a mobile terminal such as a mobile phone. Terminal device 300 may include, but is not limited to, a processor and a memory. Those skilled in the art will appreciate that fig. 5 is only an example of the terminal device 300 and does not constitute a limitation of the terminal device 300, and may include more or less components than those shown, or combine some of the components, or different components, for example, the terminal device 300 may further include input and output devices, network access devices, buses, etc.
The Processor 302 may be a CPU (Central Processing Unit), other general-purpose Processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array), other Programmable logic device, a discrete Gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The storage 301 may be an internal storage unit of the terminal device 300, such as a hard disk or a memory. The memory 301 may also be an external storage device of the terminal device 300, such as a plug-in hard disk, SMC (Smart Media Card), SD (Secure Digital Card), flash Card, or the like provided on the terminal device 300. Further, the memory 301 may also include both an internal storage unit of the terminal device 300 and an external storage device. The memory 301 is used for storing an operating system, application programs, a boot loader, data, and other programs, such as program codes of the computer program 303. The memory 301 may also be used to temporarily store data that has been output or is to be output.
An embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program may implement the steps in the above embodiment of the integrated collaborative training method for zero sample classification.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, all or part of the processes in the embodiment of the integrated collaborative training method for zero sample classification described above may be implemented by instructing relevant hardware by a computer program, and the computer program 303 may be stored in a computer-readable storage medium, and when being executed by the processor 302, the computer program 303 may implement the steps of the embodiment of the integrated collaborative training method for zero sample classification described above. Wherein the computer program 303 comprises computer program code, and the computer program 303 code may be in a source code form, an object code form, an executable file or some intermediate form, and the like. The computer-readable medium may include at least: any entity or device capable of carrying computer program code to a photographing apparatus/terminal apparatus, a recording medium, computer Memory, ROM (Read-Only Memory), RAM (Random Access Memory), electrical carrier wave signal, telecommunication signal, and software distribution medium. Such as a usb-disk, a removable hard disk, a magnetic or optical disk, etc. In some jurisdictions, computer-readable media may not be an electrical carrier signal or a telecommunications signal in accordance with legislative and proprietary practices.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other ways. For example, the above-described embodiments of the apparatus/terminal device are merely illustrative, and for example, the division of the modules or units is only one type of logical function division, and other division manners may be available in actual implementation, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be through some interfaces, indirect coupling or communication connection of devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (4)

1. An integrated collaborative training method for zero sample classification, comprising:
acquiring a data set and an attribute library thereof, dividing the data set into a training set and a testing set, and respectively calling the training set and the testing set as a visible class and an invisible class;
training attribute prediction networks with different structures, and selecting two networks as a main network and an auxiliary network according to the robustness and generalization capability of different networks to invisible classes;
calculating the mapping relation between the visible class attributes and the invisible class attributes in the attribute library to obtain attribute mapping parameters, respectively extracting the image characteristics of the visible class by using a main network and a secondary network, and synthesizing the virtual characteristics of the invisible class according to the attribute mapping parameters;
combining the virtual features with a plurality of classifiers to complete the training of the classifiers, extracting invisible features by using a main network and a secondary network, predicting the invisible features by using the classifiers, giving pseudo labels to the invisible classes meeting conditions according to a classifier voting mechanism, adding the invisible classes given with the pseudo labels into a training set, and re-training an attribute prediction network;
the dividing of the data set into a training set and a test set, which are respectively referred to as a visible class and an invisible class, includes:
dividing the data set into a training set and a test set, which are respectively called the visible class D^S and the invisible class D^U;
wherein the visible class samples are labeled, D^S = {(x_i^s, y_i^s)}, i = 1, …, l, where x_i^s denotes the i-th picture in the visible class data set D^S, y_i^s is the category label of x_i^s, and y_i^s ∈ Y^S; the invisible class samples are unlabeled, D^U = {x_j^u}, j = 1, …, n, with label space Y^U, where Y^U ∩ Y^S = ∅ and Y^U ∪ Y^S = Y; for each y ∈ Y, there is a semantic attribute A_y = {a_1, a_2, …, a_(l+n)} associated with it;
the training of attribute prediction networks with different structures and the selection of two networks as a main network and an auxiliary network according to the robustness and generalization capability of the different networks to invisible classes includes:
training attribute prediction networks of different structures, wherein an attribute prediction network consists of a feature extraction function θ(·) and a classification function φ_main, the feature extraction function θ(·) being given by formula (1) and the classification function φ_main by formula (2):
θ(x) = CNN(x; W_cnn)   (1)
φ_main(θ(x)) = W_main · θ(x)   (2)
wherein W_cnn denotes the parameters of the convolutional layers in the network, x denotes the input image sample, and W_main denotes the parameters of the fully connected layer of the network;
sending the visible classes into the attribute prediction network for training through formula (3), the optimizer being an adaptive moment estimation optimizer:
L = - Σ_i [ a_i · log σ(φ_main(θ(x_i))) + (1 - a_i) · log(1 - σ(φ_main(θ(x_i)))) ]   (3)
wherein σ denotes the sigmoid activation function and a_i is the attribute label of x_i;
the output of the network is the predicted invisible class attribute, which is input into formula (4) to predict the category of the invisible class:
ŷ = argmax_(c ∈ Y^U) cos(φ_pre, A_c^U)   (4)
wherein φ_pre denotes the semantic attribute predicted by the network and A_c^U denotes the real invisible class semantic attributes in the attribute library;
taking formula (5) as the evaluation index, and selecting the two networks with the highest evaluation index as the main network and the auxiliary network:
acc = (1/γ) · Σ_(c=1…γ) acc_c^(Top-1)   (5)
wherein acc_c^(Top-1) denotes the Top-1 accuracy of the c-th invisible class and γ denotes the total number of invisible classes in the test set;
the method comprises the following steps of calculating the mapping relation between attributes of a visible class and an invisible class in an attribute library to obtain attribute mapping parameters, respectively extracting image features of the visible class by using a main network and a secondary network, and synthesizing virtual features of the invisible class according to the attribute mapping parameters, and comprises the following steps:
regularizing the attributes in the attribute library, and calculating the attribute mapping parameters between the invisible classes and the visible classes through a formula (6)
Figure FDA0003878169950000029
Figure FDA0003878169950000031
Wherein the content of the first and second substances,
Figure FDA0003878169950000032
an attribute representing an invisible class is provided,
Figure FDA0003878169950000033
an attribute representing a visible class;
visible type feature usage
Figure FDA0003878169950000034
Obtaining;
virtual feature adoption of invisible classes
Figure FDA0003878169950000035
In a manner described above.
2. The integrated collaborative training method for zero sample classification as claimed in claim 1, wherein the training of the classifier is completed by combining the virtual features with a plurality of classifiers, the invisible class features are extracted by using a primary and secondary network, the invisible class features are predicted by using the classifiers, the invisible class meeting the conditions is assigned with a pseudo label according to a classifier voting mechanism, and the invisible class assigned with the pseudo label is added into a training set to train the attribute prediction network again, including:
training the classifiers according to the synthesized virtual features of the invisible classes and their corresponding labels, obtaining the predicted semantic attributes through formula (7), and predicting the invisible classes through formula (8):
φ_pre = F_classification(θ(x^u))   (7)
ŷ = argmax_(c ∈ Y^U) cos(φ_pre, A_c^U)   (8)
wherein F_classification denotes the classifier used by the network;
and according to a classifier voting mechanism, giving a pseudo label to the invisible class meeting the conditions, adding the invisible class given with the pseudo label into a training set, and training the attribute prediction network again until the training is finished.
3. An integrated co-training apparatus for zero sample classification, comprising:
the data dividing module is used for acquiring the data set and the attribute library thereof, dividing the data set into a training set and a testing set, and respectively calling the training set and the testing set as a visible class and an invisible class;
the main and auxiliary network acquisition module is used for training attribute prediction networks with different structures, and selecting two networks as a main network and an auxiliary network according to the robustness and generalization capability of the different networks to invisible classes;
the invisible class virtual feature synthesis module is used for calculating the mapping relation between visible class attributes and invisible class attributes in the attribute library to obtain attribute mapping parameters, respectively extracting the image features of the visible class by using the main network and the auxiliary network, and synthesizing the invisible class virtual features according to the attribute mapping parameters;
the network training module is used for combining the virtual features with the plurality of classifiers to complete the training of the classifiers, extracting invisible features by using a main network and a secondary network, predicting the invisible features by using the classifiers, endowing the invisible classes meeting the conditions with pseudo labels according to a classifier voting mechanism, adding the invisible classes endowed with the pseudo labels into a training set, and training the attribute prediction network again;
the dividing of the data set into a training set and a test set, which are respectively called the visible classes and the invisible classes, includes:
dividing the data set into a training set and a test set, which are respectively called the visible class D_S and the invisible class D_U;
wherein the visible-class samples and their labels form the set shown in the corresponding expression [formula images not reproduced], in which the i-th picture of the visible-class data set D_S is paired with its category label; the invisible-class samples and their labels form the corresponding set [formula images not reproduced]; Y_U ∪ Y_S = Y; and for each y ∈ Y there is an associated semantic attribute A_y = {a_1, a_2, ..., a_{l+n}};
the training of attribute prediction networks with different structures and the selection of two networks as the main network and the auxiliary network according to the robustness and generalization capability of the different networks on the invisible classes includes:
training attribute prediction networks of different structures, wherein each attribute prediction network consists of a feature extraction function and a classification function φ_main, the feature extraction function being formula (1) and the classification function φ_main being formula (2) [formula images not reproduced], wherein W_cnn denotes the parameters of the convolutional layers in the network, x denotes an input image sample, and W_main denotes the parameters of the fully connected layer of the network;
sending the visible classes into the attribute prediction network for training through formula (3) [formula image not reproduced], the optimizer being an adaptive moment estimation (Adam) optimizer, wherein σ denotes the sigmoid activation function and a_i is the attribute label of x_i;
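Formulas (1) to (3) appear only as images in the original filing. One plausible reading, sketched below in PyTorch-style Python, is a convolutional feature extractor (parameters W_cnn) followed by a fully connected attribute head (parameters W_main) with a sigmoid activation, trained with a binary cross-entropy style objective and the Adam optimizer; the class and function names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AttributePredictor(nn.Module):
    """Sketch of an attribute prediction network: convolutional feature
    extraction (assumed formula (1)) plus a fully connected layer mapping
    features to semantic-attribute scores (assumed formulas (2)-(3))."""
    def __init__(self, backbone, feat_dim, attr_dim):
        super().__init__()
        self.backbone = backbone                 # convolutional layers, parameters W_cnn
        self.fc = nn.Linear(feat_dim, attr_dim)  # fully connected layer, parameters W_main

    def forward(self, x):
        features = self.backbone(x)
        return torch.sigmoid(self.fc(features))  # sigmoid activation σ on the attribute scores

def train_step(model, optimizer, images, attribute_labels):
    """One optimization step with the adaptive moment estimation (Adam) optimizer."""
    optimizer.zero_grad()
    predicted = model(images)
    loss = nn.functional.binary_cross_entropy(predicted, attribute_labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```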
the output of the network is the predicted invisible-class attributes, and the predicted invisible-class attributes are input into formula (4) to predict the category of the invisible class [formula image not reproduced], wherein φ_pre denotes the semantic attribute predicted by the network and the other term denotes the real invisible-class semantic attributes in the attribute library;
taking formula (5) as the evaluation index [formula image not reproduced], and selecting the two networks with the highest evaluation index as the main network and the auxiliary network, wherein the per-class term denotes the top-1 accuracy of the c-th invisible category and γ denotes the total number of invisible categories in the test set;
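Formulas (4) and (5) are likewise shown only as images. Under the surrounding definitions, a natural reading, sketched here as an assumption rather than the patent's exact formulation, assigns each sample to the invisible class whose library attribute vector is nearest to the predicted attributes φ_pre, and ranks candidate networks by their mean per-class top-1 accuracy over the γ invisible categories; the distance measure and function names are illustrative.

```python
import numpy as np

def predict_unseen_class(pred_attrs, class_attrs):
    """Assumed reading of formula (4): pick, for each sample, the invisible
    class whose semantic attribute vector is closest to the predicted one.
    pred_attrs: (n_samples, attr_dim); class_attrs: (n_unseen, attr_dim).
    Returned labels index the rows of class_attrs."""
    dists = np.linalg.norm(pred_attrs[:, None, :] - class_attrs[None, :, :], axis=-1)
    return dists.argmin(axis=1)

def mean_per_class_top1(y_true, y_pred):
    """Assumed reading of formula (5): average the top-1 accuracy over the
    invisible categories present in the test set, so each class counts equally."""
    classes = np.unique(y_true)
    per_class = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(per_class))
```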
the calculating of the mapping relation between the attributes of the visible classes and the attributes of the invisible classes in the attribute library to obtain attribute mapping parameters, the respectively extracting of the image features of the visible classes by using the main network and the auxiliary network, and the synthesizing of the virtual features of the invisible classes according to the attribute mapping parameters includes:
regularizing the attributes in the attribute library, and calculating the attribute mapping parameters between the invisible classes and the visible classes through formula (6) [formula image not reproduced], wherein the two attribute terms in formula (6) denote the attributes of the invisible classes and the attributes of the visible classes, respectively;
the features of the visible classes are obtained with the corresponding feature expression, and the virtual features of the invisible classes are obtained with the corresponding synthesis expression [both shown only as images in the original filing].
4. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the integrated co-training method for zero sample classification according to any of claims 1-2 when executing the computer program.
CN202011202927.5A 2020-11-02 2020-11-02 Integrated collaborative training method and device for zero sample classification and terminal equipment Active CN112257808B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011202927.5A CN112257808B (en) 2020-11-02 2020-11-02 Integrated collaborative training method and device for zero sample classification and terminal equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011202927.5A CN112257808B (en) 2020-11-02 2020-11-02 Integrated collaborative training method and device for zero sample classification and terminal equipment

Publications (2)

Publication Number Publication Date
CN112257808A CN112257808A (en) 2021-01-22
CN112257808B true CN112257808B (en) 2022-11-11

Family

ID=74267569

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011202927.5A Active CN112257808B (en) 2020-11-02 2020-11-02 Integrated collaborative training method and device for zero sample classification and terminal equipment

Country Status (1)

Country Link
CN (1) CN112257808B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112949688A (en) * 2021-02-01 2021-06-11 哈尔滨市科佳通用机电股份有限公司 Motor train unit bottom plate rubber damage fault detection method, system and device
CN113283514A (en) * 2021-05-31 2021-08-20 高新兴科技集团股份有限公司 Unknown class classification method, device and medium based on deep learning
CN113688879A (en) * 2021-07-30 2021-11-23 南京理工大学 Generalized zero sample learning classification method based on confidence degree distribution external detection
CN113807420B (en) * 2021-09-06 2024-03-19 湖南大学 Domain self-adaptive target detection method and system considering category semantic matching
CN114005005B (en) * 2021-12-30 2022-03-22 深圳佑驾创新科技有限公司 Double-batch standardized zero-instance image classification method

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110163258A (en) * 2019-04-24 2019-08-23 浙江大学 A kind of zero sample learning method and system reassigning mechanism based on semantic attribute attention
CN110826638A (en) * 2019-11-12 2020-02-21 福州大学 Zero sample image classification model based on repeated attention network and method thereof

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10908616B2 (en) * 2017-05-05 2021-02-02 Hrl Laboratories, Llc Attribute aware zero shot machine vision system via joint sparse representations


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Global Semantic Consistency for Zero-Shot Learning;Fan Wu et al.;《https://arxiv.org/abs/1806.08503》;20180625;第1-18页 *
Zero-shot multi-label image classification based on deep ranking learning; Li Huihui; China Master's Theses Full-text Database (Information Science and Technology); 20200615; Vol. 2020, No. 06; pp. I138-960 *

Also Published As

Publication number Publication date
CN112257808A (en) 2021-01-22

Similar Documents

Publication Publication Date Title
CN112257808B (en) Integrated collaborative training method and device for zero sample classification and terminal equipment
EP3598342B1 (en) Method and device for identifying object
Geman et al. Visual turing test for computer vision systems
CN111797893B (en) Neural network training method, image classification system and related equipment
WO2019100724A1 (en) Method and device for training multi-label classification model
Kao et al. Visual aesthetic quality assessment with a regression model
CN103268317B (en) Image is carried out the system and method for semantic annotations
CN110827129B (en) Commodity recommendation method and device
Yang et al. Benchmarking commercial emotion detection systems using realistic distortions of facial image datasets
CN109284675B (en) User identification method, device and equipment
US20230022387A1 (en) Method and apparatus for image segmentation model training and for image segmentation
JP2017062781A (en) Similarity-based detection of prominent objects using deep cnn pooling layers as features
CN110245714B (en) Image recognition method and device and electronic equipment
CN110377733B (en) Text-based emotion recognition method, terminal equipment and medium
CN115443490A (en) Image auditing method and device, equipment and storage medium
WO2020023760A1 (en) System and method for clustering products by combining attribute data with image recognition
Karaoglu et al. Detect2rank: Combining object detectors using learning to rank
Crabbé et al. Label-free explainability for unsupervised models
JP2022014776A (en) Activity detection device, activity detection system, and activity detection method
CN111325237A (en) Image identification method based on attention interaction mechanism
CN112749737A (en) Image classification method and device, electronic equipment and storage medium
CN114419378B (en) Image classification method and device, electronic equipment and medium
CN113657087B (en) Information matching method and device
CN113837257A (en) Target detection method and device
CN109885745A (en) A kind of user draws a portrait method, apparatus, readable storage medium storing program for executing and terminal device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant