WO2018032354A1 - Method and apparatus for zero-shot learning - Google Patents

Method and apparatus for zero-shot learning

Info

Publication number
WO2018032354A1
Authority
WO
WIPO (PCT)
Prior art keywords
features
dictionary
multimedia content
visual
model
Prior art date
Application number
PCT/CN2016/095512
Other languages
French (fr)
Inventor
Yunlong YU
Original Assignee
Nokia Technologies Oy
Nokia Technologies (Beijing) Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nokia Technologies Oy, Nokia Technologies (Beijing) Co., Ltd. filed Critical Nokia Technologies Oy
Priority to CN201680088517.8A priority Critical patent/CN109643384A/en
Priority to PCT/CN2016/095512 priority patent/WO2018032354A1/en
Priority to EP16913114.1A priority patent/EP3500978A4/en
Publication of WO2018032354A1 publication Critical patent/WO2018032354A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F 18/2135 Feature extraction based on approximation criteria, e.g. principal component analysis
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2413 Classification techniques based on distances to training or reference patterns
    • G06F 18/24147 Distances to closest patterns, e.g. nearest neighbour classification
    • G06F 18/28 Determining representative reference patterns, e.g. by averaging or distorting; Generating dictionaries
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/20 Scenes; Scene-specific elements in augmented reality scenes
    • G06V 20/50 Context or environment of the image
    • G06V 20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle


Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present disclosure provide a method, apparatus and computer program product for ZSL. The method comprises: constructing a dictionary model based on visual features and semantic features of multimedia content of seen classes, the semantic features corresponding to the visual features; reconstructing visual features of multimedia content of unseen classes using the dictionary model and semantic features of multimedia content of unseen classes; and determining a class of a testing sample based on comparison of a visual feature of the testing sample and reconstructed visual features.

Description

METHOD AND APPARATUS FOR ZERO-SHOT LEARNING
FIELD OF THE INVENTION
 Embodiments of the present disclosure generally relate to information processing, and more particularly to a method, apparatus and computer program product for Zero-Shot Learning (ZSL) .
BACKGROUND
 ZSL refers to a learning process where no training samples are available to discriminate new classes (also called unseen classes). It aims at improving the scalability of conventional classification methods. The need for ZSL arises frequently in practice because the enormous number of real-world object classes is constantly changing, and it would be too time-consuming and expensive to obtain human-annotated labels for each of these classes.
 ZSL can be widely used in applications such as natural scene understanding, object recognition, autonomous vehicles, virtual reality, and so on. For example, in the application of autonomous vehicles, surrounding objects need to be recognized. Conventional recognition methods need to predefine a set of classes and then train a model to recognize objects in these classes. However, if an object belongs to an unseen class, the model will fail to recognize it. ZSL is proposed to solve this problem. With ZSL, the model can recognize objects not only in the seen classes but also in the unseen classes.
 Conventional methods for ZSL generally apply one transformation matrix to embed the visual features of testing samples into a semantic space, or two transformation matrices to embed both the visual features and the semantic features of the testing samples into the semantic space. In this way, a connection between the visual features and the semantic features is bridged, and the class of a testing sample from an unseen class can be inferred by using the nearest neighbor method. However, the conventional methods for ZSL cannot reflect the intrinsic structures in the semantic space, leading to unsatisfying performance.
SUMMARY
 In general, example embodiments of the present disclosure include a method, apparatus and computer program product for ZSL.
 In a first aspect of the present disclosure, a method is provided. The method comprises: constructing a dictionary model based on visual features and semantic features of multimedia content of seen classes, the semantic features corresponding to the visual features; reconstructing visual features of multimedia content of unseen classes using the dictionary model and semantic features of multimedia content of unseen classes; and determining a class of a testing sample based on comparison of a visual feature of the testing sample and reconstructed visual features.
 In some embodiments, determining the class of the testing sample comprises: in response to the visual feature of the testing sample being closest to one of the reconstructed visual features, designating the class of the testing sample to be a class associated with the one of the reconstructed visual features.
 In some embodiments, constructing the dictionary model comprises: randomly initializing model parameters for the dictionary model; and updating the model parameters so as to obtain a minimum of an objective function for the dictionary model, the objective function being defined at least by the model parameters.
 In some embodiments, the model parameters include at least one of the following: a dictionary matrix, a dictionary coefficient matrix and a transformation matrix.
 In some embodiments, the objective function for the dictionary model is formulized as:
min_{D, P, C} ||X - DC||_F^2 + λ||C - PY||_F^2
wherein ||·||_F represents an operation of solving an F-norm, X represents the visual features of multimedia content of the seen classes, Y represents the semantic features of multimedia content of the seen classes, D represents a dictionary matrix, P represents a transformation matrix, C represents a dictionary coefficient matrix, and λ represents a predetermined constant.
 In some embodiments, the semantic features of the multimedia content include at least one of the following: semantic attributes and distributed text representations of the multimedia content.
 In some embodiments, the visual features of the multimedia content include at  least one of the following: color features, texture features, motion features and Convolutional Neural Network features of the multimedia content.
 In a second aspect of the present disclosure, an apparatus is provided. The apparatus comprises at least one processor and at least one memory including computer program code. The at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus at least to: construct a dictionary model based on visual features and semantic features of multimedia content of seen classes, the semantic features corresponding to the visual features; reconstruct visual features of multimedia content of unseen classes using the dictionary model and semantic features of multimedia content of unseen classes; and determine a class of a testing sample based on comparison of a visual feature of the testing sample and reconstructed visual features.
 In a third aspect of the present disclosure, an apparatus is provided. The apparatus comprises means for performing the method in the first aspect of the present disclosure.
 In a fourth aspect of the present disclosure, a computer program product is provided. The computer program product comprises at least one computer readable non-transitory memory medium having program code stored thereon, the program code which, when executed by an apparatus, causes the apparatus to perform the method in the first aspect of the present disclosure.
 It is to be understood that the Summary is not intended to identify key or essential features of embodiments of the present disclosure, nor is it intended to be used to limit the scope of the present disclosure. Other features of the present disclosure will become easily comprehensible through the description below.
BRIEF DESCRIPTION OF THE DRAWINGS
 Through the more detailed description of some embodiments of the present disclosure in the accompanying drawings, the above and other objects, features and advantages of the present disclosure will become more apparent, wherein:
 Fig. 1 schematically shows an architecture in which embodiments of the present disclosure can be implemented;
 Fig. 2 is a flowchart of a method in accordance with embodiments of the present disclosure; and
 Fig. 3 shows a block diagram of an example computer system suitable for implementing embodiments of the present invention.
 Throughout the drawings, same or similar reference numerals represent the same or similar element.
DETAILED DESCRIPTION
 Principles of the present disclosure will now be described with reference to some example embodiments. It is to be understood that these embodiments are described for the purpose of illustration only and to help those skilled in the art understand and implement the present disclosure, without suggesting any limitations as to the scope of the invention. The invention described herein can be implemented in various manners other than the ones described below.
 As used herein, the term “includes” and its variants are to be read as open-ended terms that mean “includes, but is not limited to.” The term “based on” is to be read as “based at least in part on.” The terms “one embodiment” and “an embodiment” are to be read as “at least one embodiment.” The term “another embodiment” is to be read as “at least one other embodiment.” Other definitions, explicit and implicit, may be included below.
 In general, embodiments of the present disclosure tackle the ZSL task with the idea of dictionary learning. In particular, in accordance with the embodiments of the present disclosure, a dictionary model is constructed by using visual features and semantic features of multimedia content of seen classes. With the dictionary model, semantic features of multimedia content of unseen classes are embedded into the visual space. A class of a testing sample of multimedia content of unseen classes is determined based on comparison of a visual feature of the testing sample and reconstructed visual features. Details of the embodiments of the present disclosure will be described with reference to Figs. 1 to 3.
 Reference is first made to Fig. 1, which schematically shows an architecture 100 in which embodiments of the present disclosure can be implemented. It is to be understood that the structure and functionality of the architecture 100 are described only for the purpose of illustration without suggesting any limitations as to the scope of the present disclosure described herein. The present disclosure described herein can be embodied with a different structure and/or functionality.
 The architecture 100 includes a training system 110 and a testing system 120.  The training system 110 is configured to receive visual features 112 of multimedia content of seen classes and semantic features 114 of multimedia content of seen classes. The semantic features 114 correspond to the visual features 112. The training system 110 is further configured to construct a dictionary model based on the visual features 112 and the semantic features 114.
 Examples of multimedia content include, but are not limited to, images, video and the like. Examples of the visual features 112 include, but are not limited to, color features, texture features, motion features, Convolutional Neural Network (CNN) features and the like. Examples of the semantic features 114 include, but are not limited to, semantic attributes of the multimedia content, distributed text representations of the multimedia content and the like.
 The testing system 120 is configured to receive the dictionary model from the training system 110, semantic features 126 of multimedia content of unseen classes, and a visual feature 128 of a testing sample. The testing system 120 is further configured to output a classification result of the testing sample.
 Specifically, the testing system 120 includes a reconstructing unit 122 and a classifier 124. The reconstructing unit 122 is configured to reconstruct visual features of multimedia content of unseen classes using the dictionary model and the semantic features 126 of multimedia content of unseen classes.
 The classifier 124 is configured to receive the reconstructed visual features of the unseen classes from the reconstructing unit 122 and the visual feature 128 of the testing sample. The classifier 124 is further configured to determine a class of the testing sample based on comparison of the visual feature of the testing sample and the reconstructed visual features. The classifier 124 is further configured to output the classification result of the testing sample.
 Fig. 2 shows a flowchart of a method 200 for ZSL in accordance with embodiments of the present disclosure. The method 200 may be implemented in the architecture 100 as shown in Fig. 1.
 As shown, the method 200 is entered in step 210, where the training system 110 constructs a dictionary model based on the visual features 112 and semantic features 114 of multimedia content of seen classes.
 It is to be appreciated that any known feature extraction methods may be used to extract the visual features 112 and the corresponding semantic features 114 from training samples of the seen classes, and that the description thereof is omitted for the purpose of conciseness.
 Generally, the dictionary model may be associated with one or more model parameters. In this regard, the dictionary model may be constructed by training the model parameters with the training system 110. In some embodiments, the model parameters comprise at least one of a dictionary matrix, a dictionary coefficient matrix and a transformation matrix.
 In addition, for the purpose of constructing the dictionary model, an objective function for the dictionary model may be predetermined, and the objective function may be defined at least by the model parameters for the dictionary model. As a non-limiting example, the objective function of the dictionary model may be formulized as below:
min_{D, P, C} ||X - DC||_F^2 + λ||C - PY||_F^2                (1)
where ||·||_F represents an operation of solving an F-norm, and F may be in the range of 2 to 4; X ∈ R^{d_x × N} represents the visual features of the training samples of the seen classes; Y ∈ R^{d_y × N} represents the semantic features of the training samples corresponding to the visual features; N represents the number of the training samples of the seen classes; d_x and d_y represent the dimensionalities of the matrices X and Y, respectively; D represents a dictionary matrix; P represents a transformation matrix; C ∈ R^{d × N} represents a dictionary coefficient matrix, where d represents the dimensionality of the dictionary coefficient matrix C; and λ represents a predetermined constant for balancing the importance of the two terms in Equation (1) and may be in the range of 0.001 to 1000.
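 As a rough, non-authoritative illustration of the quantities involved, the sketch below evaluates the two terms of Equation (1) with NumPy. The dimensions, the random stand-in data and the helper name objective are assumptions introduced purely for illustration.
```python
import numpy as np

# Illustrative sizes (not prescribed by the application): d_x visual dimensions,
# d_y semantic dimensions, d dictionary atoms, N seen-class training samples.
d_x, d_y, d, N = 4096, 85, 300, 1000
rng = np.random.default_rng(0)

X = rng.standard_normal((d_x, N))   # visual features of seen-class training samples
Y = rng.standard_normal((d_y, N))   # corresponding semantic features
D = rng.standard_normal((d_x, d))   # dictionary matrix
P = rng.standard_normal((d, d_y))   # transformation matrix (semantic -> coefficient space)
C = rng.standard_normal((d, N))     # dictionary coefficient matrix
lam = 1.0                           # balancing constant lambda

def objective(X, Y, D, P, C, lam):
    """Value of Equation (1): ||X - DC||_F^2 + lambda * ||C - PY||_F^2."""
    reconstruction_term = np.linalg.norm(X - D @ C, 'fro') ** 2
    alignment_term = np.linalg.norm(C - P @ Y, 'fro') ** 2
    return reconstruction_term + lam * alignment_term

print(objective(X, Y, D, P, C, lam))
```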
 In some embodiments, constructing the dictionary model comprises randomly initializing model parameters for the dictionary model, and updating the model parameters so as to obtain a minimum of an objective function for the dictionary model. In other words, the model parameters are optimized so as to obtain a minimum of an objective function for the dictionary model. For example, in the case of Equation (1), the dictionary matrix D, the dictionary coefficient matrix C and the transformation matrix P may be optimized so as to obtain a minimum of the objective function denoted by Equation (1) as below:
min_{D, P, C} ||X - DC||_F^2 + λ||C - PY||_F^2,  s.t. ||d_i||_2^2 ≤ 1 for all i                (2)
where d_i represents the i-th base vector in the dictionary matrix D, i ∈ {1, 2, ..., N}, and I represents an identity matrix.
 In some embodiments, a joint optimization process may be used for optimizing the dictionary matrix D, the dictionary coefficient matrix C and the transformation matrix P.
 Consider a non-limiting example of the joint optimization process. First, the dictionary matrix D and the transformation matrix P may be randomly initialized, respectively. Then, the dictionary coefficient matrix C may be optimized by using Equation (3) as below:
min_C ||X - DC||_F^2 + λ||C - PY||_F^2                (3)
The optimized dictionary coefficient matrix C may be represented as:
C = (D^T D + λI)^{-1} (λPY + D^T X)                (4)
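 In code, Equation (4) can be transcribed directly; the sketch below is one possible form (the helper name update_C and the use of a linear solve instead of an explicit matrix inverse are implementation choices, not something the application prescribes).
```python
import numpy as np

def update_C(X, Y, D, P, lam):
    """Closed-form coefficient update of Equation (4):
    C = (D^T D + lambda*I)^(-1) (lambda*P*Y + D^T X), with D and P held fixed."""
    d = D.shape[1]
    lhs = D.T @ D + lam * np.eye(d)   # (d x d), symmetric positive definite for lam > 0
    rhs = lam * (P @ Y) + D.T @ X     # (d x N)
    return np.linalg.solve(lhs, rhs)  # more stable than forming the inverse explicitly
```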
 Next, the dictionary matrix D and the dictionary coefficient matrix C may be fixed. Then, the transformation matrix P may be optimized by using Equation (5) as below:
min_P λ||C - PY||_F^2 + τ||P||_F^2                (5)
where τ represents a regularization constant.
The optimized transformation matrix P may be represented as:
P = λCY^T (λYY^T + τI)^{-1}                  (6)
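 Equation (6) can be transcribed in the same way. Here tau stands for the regularization constant τ; its default value below is an assumption for illustration.
```python
import numpy as np

def update_P(C, Y, lam, tau=1e-3):
    """Closed-form transformation update of Equation (6):
    P = lambda*C*Y^T (lambda*Y*Y^T + tau*I)^(-1), with C held fixed."""
    d_y = Y.shape[0]
    gram = lam * (Y @ Y.T) + tau * np.eye(d_y)          # (d_y x d_y), symmetric
    # Solve P @ gram = lambda * C @ Y^T by solving the transposed system.
    return np.linalg.solve(gram, (lam * (C @ Y.T)).T).T
```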
 Afterwards, the transformation matrix P and the dictionary coefficient matrix C may be fixed. Then, the dictionary matrix D may be optimized by using Equation (7) as below:
min_D ||X - DC||_F^2,  s.t. ||d_i||_2^2 ≤ 1 for all i                (7)
Equation (7) may be solved by the known Alternating Direction Method of Multipliers (ADMM) and the description thereof is omitted for the purpose of conciseness.
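 The ADMM details are not spelled out in the application; the following is one possible ADMM-style solver for Equation (7), in which an auxiliary copy Z of the dictionary carries the unit-norm constraint on each base vector. The penalty parameter rho and the iteration count are illustrative assumptions.
```python
import numpy as np

def update_D_admm(X, C, D_init, rho=1.0, n_iter=50):
    """One possible ADMM solver for Equation (7):
    min_D ||X - DC||_F^2  s.t.  every column of D has l2-norm <= 1."""
    D = D_init.copy()
    Z = D.copy()                       # auxiliary variable carrying the constraint
    U = np.zeros_like(D)               # scaled dual variable
    d = C.shape[0]
    for _ in range(n_iter):
        # D-step: ridge-like least squares pulling D towards Z - U.
        lhs = 2.0 * (C @ C.T) + rho * np.eye(d)          # (d x d)
        rhs = 2.0 * (X @ C.T) + rho * (Z - U)            # (d_x x d)
        D = np.linalg.solve(lhs, rhs.T).T                # solves D @ lhs = rhs
        # Z-step: project each column of D + U onto the unit l2 ball.
        V = D + U
        norms = np.maximum(np.linalg.norm(V, axis=0), 1.0)
        Z = V / norms
        # Dual update.
        U = U + D - Z
    return Z                           # feasible dictionary estimate
```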
 It is to be understood that the operations of optimizing the dictionary matrix D, the dictionary coefficient matrix C and the transformation matrix P, as described above, may be performed iteratively until a convergence condition is satisfied.
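 Tying the three updates together, the joint optimization might be organized as in the sketch below, which reuses the update_C, update_P and update_D_admm helpers from the preceding snippets; the stopping tolerance and the iteration cap are assumptions.
```python
import numpy as np

def train_dictionary_model(X, Y, d, lam=1.0, max_iter=100, tol=1e-4, seed=0):
    """Alternating optimization of D, C and P for Equation (2).
    Assumes the update_C, update_P and update_D_admm helpers sketched above."""
    rng = np.random.default_rng(seed)
    d_x, d_y = X.shape[0], Y.shape[0]
    D = rng.standard_normal((d_x, d))    # random initialization of D ...
    P = rng.standard_normal((d, d_y))    # ... and of P, as described above
    prev_obj = np.inf
    for _ in range(max_iter):
        C = update_C(X, Y, D, P, lam)    # Equation (4)
        P = update_P(C, Y, lam)          # Equation (6)
        D = update_D_admm(X, C, D)       # Equation (7), ADMM-style step
        obj = (np.linalg.norm(X - D @ C, 'fro') ** 2
               + lam * np.linalg.norm(C - P @ Y, 'fro') ** 2)
        if abs(prev_obj - obj) < tol * max(1.0, obj):    # crude convergence test
            break
        prev_obj = obj
    return D, P, C
```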
 Referring back to Fig. 2, in step 220, the reconstructing unit 122 reconstructs visual features of multimedia content of unseen classes using the dictionary model and semantic features 126 of multimedia content of unseen classes.
 Still consider the non-limiting example described above. It is assumed that the semantic features 126 may be represented by y_v, v ∈ {1, 2, ..., m}, where m represents the number of unseen classes. Then, the visual features may be reconstructed by multiplying the optimized dictionary matrix D and the optimized transformation matrix P by the semantic features y_v. That is, the reconstructed visual features of multimedia content of unseen classes may be represented as DPy_v, v ∈ {1, 2, ..., m}.
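 In code, step 220 then reduces to a single matrix product per unseen class. In the sketch below, Y_unseen is an assumed name for a matrix holding the class-level semantic features y_v as columns.
```python
import numpy as np

def reconstruct_unseen_prototypes(D, P, Y_unseen):
    """Reconstructed visual features DPy_v for every unseen class v.
    Y_unseen: (d_y x m) matrix whose v-th column is the semantic feature y_v."""
    return D @ (P @ Y_unseen)   # (d_x x m): one reconstructed visual prototype per class
```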
 In step 230, the classifier 124 determines a class of a testing sample based on comparison of a visual feature of the testing sample and reconstructed visual features.
 In some embodiments, the classifier 124 may determine the class of the testing sample by using the nearest neighbor method. In this regard, determining the class of the testing sample comprises: in response to the visual feature of the testing sample being closest to one of the reconstructed visual features, designating the class of the testing sample to be a class associated with the one of the reconstructed visual features. It is to be appreciated that the nearest neighbor method is described by way of example without suggesting any limitation to the scope of the present disclosure. The classifier 124 may determine the class of the testing sample by using other suitable methods than the nearest neighbor method.
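 A minimal nearest-neighbor version of step 230 could look as follows; the Euclidean distance is an illustrative choice, since the application does not fix a particular distance measure.
```python
import numpy as np

def classify(x_test, prototypes, class_labels):
    """Assign the testing sample to the unseen class whose reconstructed visual
    feature (a column of `prototypes`) is closest to the sample's visual feature."""
    # Euclidean distance from the test feature to every reconstructed prototype.
    dists = np.linalg.norm(prototypes - x_test[:, None], axis=0)
    return class_labels[int(np.argmin(dists))]
```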
 To sum up, in accordance with the embodiments of the present disclosure, the dictionary model is constructed based on the visual features and semantic features of multimedia content of seen classes. In other words, the dictionary model is learned from both the visual space and the semantic space. Thus, compared with conventional models, the dictionary model may reflect the intrinsic structures in the semantic space, leading to better classification performance.
 In addition, because the model parameters for the dictionary model are jointly optimized, better classification performance may also be guaranteed. Further, in the embodiments of the present disclosure, because no sparse constraints are imposed on the dictionary coefficient matrix C, the optimization process can be implemented very quickly.
 Fig. 3 shows a block diagram of an example computer system suitable for implementing embodiments of the present invention. As shown, the computer system 300 comprises a central processing unit (CPU) 301 which is capable of performing various processes in accordance with a program stored in a read only memory (ROM) 302 or a program loaded from a storage unit 308 to a random access memory (RAM) 303. In the RAM 303, data required when the CPU 301 performs the various processes or the like is also stored as required. The CPU 301, the ROM 302 and the RAM 303 are connected to one another via a bus 304. An input/output (I/O) interface 305 is also connected to the bus 304.
 The following components are connected to the I/O interface 305: an input unit 306 including a keyboard, a mouse, or the like; an output unit 307 including a display such as a cathode ray tube (CRT) , a liquid crystal display (LCD) , or the like, and a loudspeaker or the like; the storage unit 308 including a hard disk or the like; and a communication unit 309 including a network interface card such as a LAN card, a modem, or the like. The communication unit 309 performs a communication process via the network such as the internet. A drive 310 is also connected to the I/O interface 305 as required. A removable medium 311, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like, is mounted on the drive 310 as required, so that a computer program read therefrom is installed into the storage unit 308 as required.
 Specifically, in accordance with embodiments of the present invention, the processes described above with reference to Fig. 2 may be implemented as computer software programs. For example, embodiments of the present invention comprise a computer program product including a computer program tangibly embodied on a machine readable medium, the computer program including program code for performing the method 200. In such embodiments, the computer program may be downloaded and installed from the network via the communication unit 309, and/or installed from the removable medium 311.
 Generally speaking, various example embodiments of the present disclosure may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. Some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device. While various aspects of the example embodiments of the present disclosure are illustrated and described as block diagrams,  flowcharts, or using some other pictorial representation, it will be appreciated that the blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
 Additionally, various blocks shown in the flowcharts may be viewed as method steps, and/or as operations that result from operation of computer program code, and/or as a plurality of coupled logic circuit elements constructed to carry out the associated function (s) . For example, embodiments of the present disclosure include a computer program product comprising a computer program tangibly embodied on a machine readable medium, the computer program containing program codes configured to carry out the methods as described above.
 In the context of the disclosure, a machine readable medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine readable medium may be a machine readable signal medium or a machine readable storage medium. A machine readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of the machine readable storage medium would include an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
 Computer program code for carrying out methods of the present invention may be written in any combination of one or more programming languages. These computer program codes may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor of the computer or other programmable data processing apparatus, cause the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on a computer, partly on the computer, as a stand-alone software package, partly on the computer and partly on a remote computer or entirely on the remote computer or server.
 Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are contained in the above discussions, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable sub-combination.
 Various modifications, adaptations to the foregoing example embodiments of the present disclosure may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings. Any and all modifications will still fall within the scope of the non-limiting and example embodiments of the present disclosure. Furthermore, other embodiments of the present disclosure set forth herein will come to mind to one skilled in the art to which these embodiments of the present disclosure pertain having the benefit of the teachings presented in the foregoing descriptions and the drawings.
 It will be appreciated that the embodiments of the present disclosure are not to be limited to the specific embodiments as discussed above and that modifications and other embodiments are intended to be included within the scope of the appended claims. Although specific terms are used herein, they are used in a generic and descriptive sense only and not for purposes of limitation.

Claims (16)

  1. A method, comprising:
    constructing a dictionary model based on visual features and semantic features of multimedia content of seen classes, the semantic features corresponding to the visual features;
    reconstructing visual features of multimedia content of unseen classes using the dictionary model and semantic features of multimedia content of unseen classes; and
    determining a class of a testing sample based on comparison of a visual feature of the testing sample and reconstructed visual features.
  2. The method of Claim 1, wherein determining the class of the testing sample comprises:
    in response to the visual feature of the testing sample being closest to one of the reconstructed visual features, designating the class of the testing sample to be a class associated with the one of the reconstructed visual features.
  3. The method of Claim 1, wherein constructing the dictionary model comprises:
    randomly initializing model parameters for the dictionary model; and
    updating the model parameters so as to obtain a minimum of an objective function for the dictionary model, the objective function being defined at least by the model parameters.
  4. The method of Claim 3, wherein the model parameters include at least one of the following: a dictionary matrix, a dictionary coefficient matrix and a transformation matrix.
  5. The method of Claim 4, wherein the objective function for the dictionary model is formulized as:
    min_{D, P, C} ||X - DC||_F^2 + λ||C - PY||_F^2
    wherein ||·||_F represents an operation of solving an F-norm, X represents the visual features of multimedia content of the seen classes, Y represents the semantic features of multimedia content of the seen classes, D represents a dictionary matrix, P represents a transformation matrix, C represents a dictionary coefficient matrix, and λ represents a predetermined constant.
  6. The method of any of Claims 1 to 5, wherein the semantic features of the multimedia content include at least one of the following: semantic attributes and distributed text representations of the multimedia content.
  7. The method of any of Claims 1 to 5, wherein the visual features of the multimedia content include at least one of the following: color features, texture features, motion features and Convolutional Neural Network features of the multimedia content.
  8. An apparatus, comprising:
    at least one processor; and
    at least one memory including computer program code;
    the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus at least to:
    construct a dictionary model based on visual features and semantic features of multimedia content of seen classes, the semantic features corresponding to the visual features;
    reconstruct visual features of multimedia content of unseen classes using the dictionary model and semantic features of multimedia content of unseen classes; and
    determine a class of a testing sample based on comparison of a visual feature of the testing sample and reconstructed visual features.
  9. The apparatus of Claim 8, wherein the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus to:
    designate the class of the testing sample to be a class associated with one of the reconstructed visual features in response to the visual feature of the testing sample being closest to the one of the reconstructed visual features.
  10. The apparatus of Claim 8, wherein the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus to:
    randomly initialize model parameters for the dictionary model; and
    update the model parameters to obtain a minimum of an objective function for the dictionary model so as to construct the dictionary model, the objective function being  defined at least by the model parameters.
  11. The apparatus of Claim 10, wherein the model parameters include at least one of the following: a dictionary matrix, a dictionary coefficient matrix and a transformation matrix.
  12. The apparatus of Claim 11, wherein the objective function for the dictionary model is formulized as:
    min_{D, P, C} ||X - DC||_F^2 + λ||C - PY||_F^2
    wherein ||·||_F represents an operation of solving an F-norm, X represents the visual features of multimedia content of the seen classes, Y represents the semantic features of multimedia content of the seen classes, D represents a dictionary matrix, P represents a transformation matrix, C represents a dictionary coefficient matrix, and λ represents a predetermined constant.
  13. The apparatus of any of Claims 8 to 12, wherein the semantic features of the multimedia content include at least one of the following: semantic attributes and distributed text representations of the multimedia content.
  14. The apparatus of any of Claims 8 to 12, wherein the visual features of the multimedia content include at least one of the following: color features, texture features, motion features and Convolutional Neural Network features of the multimedia content.
  15. An apparatus comprising means for performing the method according to any of Claims 1 to 7.
  16. A computer program product comprising at least one computer readable non-transitory memory medium having program code stored thereon, the program code which, when executed by an apparatus, causes the apparatus to perform the method according to any of Claims 1 to 7.
PCT/CN2016/095512 2016-08-16 2016-08-16 Method and apparatus for zero-shot learning WO2018032354A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN201680088517.8A CN109643384A (en) 2016-08-16 2016-08-16 Method and apparatus for zero sample learning
PCT/CN2016/095512 WO2018032354A1 (en) 2016-08-16 2016-08-16 Method and apparatus for zero-shot learning
EP16913114.1A EP3500978A4 (en) 2016-08-16 2016-08-16 Method and apparatus for zero-shot learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/095512 WO2018032354A1 (en) 2016-08-16 2016-08-16 Method and apparatus for zero-shot learning

Publications (1)

Publication Number Publication Date
WO2018032354A1 true WO2018032354A1 (en) 2018-02-22

Family

ID=61196222

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/095512 WO2018032354A1 (en) 2016-08-16 2016-08-16 Method and apparatus for zero-shot learning

Country Status (3)

Country Link
EP (1) EP3500978A4 (en)
CN (1) CN109643384A (en)
WO (1) WO2018032354A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112380374A (en) * 2020-10-23 2021-02-19 华南理工大学 Zero sample image classification method based on semantic expansion
CN114627312A (en) * 2022-05-17 2022-06-14 中国科学技术大学 Zero sample image classification method, system, equipment and storage medium
CN116109877A (en) * 2023-04-07 2023-05-12 中国科学技术大学 Combined zero-sample image classification method, system, equipment and storage medium

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110580501B (en) * 2019-08-20 2023-04-25 天津大学 Zero sample image classification method based on variational self-coding countermeasure network
CN112418257B (en) * 2019-08-22 2023-04-18 四川大学 Effective zero sample learning method based on potential visual attribute mining
CN110826638B (en) * 2019-11-12 2023-04-18 福州大学 Zero sample image classification model based on repeated attention network and method thereof
CN111914903B (en) * 2020-07-08 2022-10-25 西安交通大学 Generalized zero sample target classification method and device based on external distribution sample detection and related equipment
CN116051909B (en) * 2023-03-06 2023-06-16 中国科学技术大学 Direct push zero-order learning unseen picture classification method, device and medium
CN117541882B (en) * 2024-01-05 2024-04-19 南京信息工程大学 Instance-based multi-view vision fusion transduction type zero sample classification method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130148881A1 (en) * 2011-12-12 2013-06-13 Alibaba Group Holding Limited Image Classification
EP2642427A2 (en) * 2012-03-21 2013-09-25 Intellectual Ventures Fund 83 LLC Video concept classification using temporally-correlated grouplets
CN103400160A (en) * 2013-08-20 2013-11-20 中国科学院自动化研究所 Zero training sample behavior identification method
CN105701514A (en) * 2016-01-15 2016-06-22 天津大学 Multi-modal canonical correlation analysis method for zero sample classification
CN105740879A (en) * 2016-01-15 2016-07-06 天津大学 Zero-sample image classification method based on multi-mode discriminant analysis

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103646256A (en) * 2013-12-17 2014-03-19 上海电机学院 Image characteristic sparse reconstruction based image classification method
CN105184260B (en) * 2015-09-10 2019-03-08 北京大学 A kind of image characteristic extracting method and pedestrian detection method and device
CN105512679A (en) * 2015-12-02 2016-04-20 天津大学 Zero sample classification method based on extreme learning machine
CN105718940B (en) * 2016-01-15 2019-03-29 天津大学 The zero sample image classification method based on factorial analysis between multiple groups

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130148881A1 (en) * 2011-12-12 2013-06-13 Alibaba Group Holding Limited Image Classification
EP2642427A2 (en) * 2012-03-21 2013-09-25 Intellectual Ventures Fund 83 LLC Video concept classification using temporally-correlated grouplets
CN103400160A (en) * 2013-08-20 2013-11-20 中国科学院自动化研究所 Zero training sample behavior identification method
CN105701514A (en) * 2016-01-15 2016-06-22 天津大学 Multi-modal canonical correlation analysis method for zero sample classification
CN105740879A (en) * 2016-01-15 2016-07-06 天津大学 Zero-sample image classification method based on multi-mode discriminant analysis

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3500978A4 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112380374A (en) * 2020-10-23 2021-02-19 华南理工大学 Zero sample image classification method based on semantic expansion
CN112380374B (en) * 2020-10-23 2022-11-18 华南理工大学 Zero sample image classification method based on semantic expansion
CN114627312A (en) * 2022-05-17 2022-06-14 中国科学技术大学 Zero sample image classification method, system, equipment and storage medium
CN114627312B (en) * 2022-05-17 2022-09-06 中国科学技术大学 Zero sample image classification method, system, equipment and storage medium
CN116109877A (en) * 2023-04-07 2023-05-12 中国科学技术大学 Combined zero-sample image classification method, system, equipment and storage medium
CN116109877B (en) * 2023-04-07 2023-06-20 中国科学技术大学 Combined zero-sample image classification method, system, equipment and storage medium

Also Published As

Publication number Publication date
CN109643384A (en) 2019-04-16
EP3500978A4 (en) 2020-01-22
EP3500978A1 (en) 2019-06-26

Similar Documents

Publication Publication Date Title
WO2018032354A1 (en) Method and apparatus for zero-shot learning
US11328171B2 (en) Image retrieval method and apparatus
US11062453B2 (en) Method and system for scene parsing and storage medium
US20190304065A1 (en) Transforming source domain images into target domain images
US11120305B2 (en) Learning of detection model using loss function
CN108154222B (en) Deep neural network training method and system and electronic equipment
CN109272043B (en) Training data generation method and system for optical character recognition and electronic equipment
US20190385086A1 (en) Method of knowledge transferring, information processing apparatus and storage medium
US20180130203A1 (en) Automated skin lesion segmentation using deep side layers
WO2019129032A1 (en) Remote sensing image recognition method and apparatus, storage medium and electronic device
CN108230346B (en) Method and device for segmenting semantic features of image and electronic equipment
KR20220122566A (en) Text recognition model training method, text recognition method, and apparatus
WO2019240964A1 (en) Teacher and student based deep neural network training
CN113139628B (en) Sample image identification method, device and equipment and readable storage medium
CN113822428A (en) Neural network training method and device and image segmentation method
US11164004B2 (en) Keyframe scheduling method and apparatus, electronic device, program and medium
US20220092407A1 (en) Transfer learning with machine learning systems
EP4303767A1 (en) Model training method and apparatus
CN108154153B (en) Scene analysis method and system and electronic equipment
US20190205728A1 (en) Method for visualizing neural network models
US20230021551A1 (en) Using training images and scaled training images to train an image segmentation model
JP2022161564A (en) System for training machine learning model recognizing character of text image
CN114330588A (en) Picture classification method, picture classification model training method and related device
JP2022185143A (en) Text detection method, and text recognition method and device
US9928408B2 (en) Signal processing

Legal Events

Date Code Title Description
121  Ep: the epo has been informed by wipo that ep was designated in this application
     Ref document number: 16913114; Country of ref document: EP; Kind code of ref document: A1
NENP  Non-entry into the national phase
     Ref country code: DE
ENP  Entry into the national phase
     Ref document number: 2016913114; Country of ref document: EP; Effective date: 20190318