CN106227836B - Unsupervised joint visual concept learning system and method based on images and text

Unsupervised joint visual concept learning system and method based on images and text

Info

Publication number
CN106227836B
Authority
CN
China
Prior art keywords
visual concept
learning
cardinality
module
nouns
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610595620.3A
Other languages
Chinese (zh)
Other versions
CN106227836A (en)
Inventor
熊红凯 (Hongkai Xiong)
倪赛杰 (Saijie Ni)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University
Priority to CN201610595620.3A priority Critical patent/CN106227836B/en
Publication of CN106227836A publication Critical patent/CN106227836A/en
Application granted granted Critical
Publication of CN106227836B publication Critical patent/CN106227836B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 - Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/5866 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, manually generated location and time information
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2155 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Library & Information Science (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an unsupervised joint visual concept learning system and method based on images and text, comprising a text parsing module, a cardinality instance learning module and a multi-task clustering module, wherein: the text parsing module uses the sentences that social media attach to images to extract the corresponding nouns as visual concepts, together with their cardinal numbers, as additional constraint information for the next module; the cardinality instance learning module trains a classifier for each visual concept with a cardinality-guided multiple-instance learning method; the multi-task clustering module handles diversity among concepts, i.e., it clusters nouns referring to similar objects into one large class as a visual concept. By unsupervised automatic learning, the invention effectively overcomes the complexity of manual labeling on large-scale data.

Description

Unsupervised joint visual concept learning system and method based on images and text
Technical Field
The invention relates to visual concept learning in the field of computer vision, and in particular to an unsupervised joint visual concept learning system and method based on images and text.
Background
In the field of computer vision, conventional image classification and object detection methods rely more or less on manual annotation, such as image-level or instance-level labels. In recent years, with the development of computer technology and the advent of big data, large-scale visual concept learning has become an emerging research hotspot; manually annotating millions or even tens of millions of samples is extremely difficult, so using unsupervised learning for large-scale visual concept learning is a pressing need.
Since learning visual concepts from pictures alone is particularly difficult, most existing methods are supervised or weakly supervised. They fall mainly into two categories: search-engine-based and social-resource-based methods. Search-engine-based methods use the BING API and the like to collect training pictures for keyword queries and then take the keywords as category labels of the visual concepts; social-resource-based methods directly use the pictures of a social platform, together with their accompanying textual descriptions, for joint visual concept learning.
In "NEIL: Extracting visual knowledge from web data", published at the 2013 IEEE International Conference on Computer Vision (ICCV), Chen et al. proposed a search-engine-based visual concept learning method: a set of pictures is collected for each concept, common-sense relations (such as positional relations) among the instances in the pictures are then mined iteratively, and the detector of each visual concept is continuously refined with the retrieved results. However, this search-engine-based method requires manually specifying the categories of visual concepts, which is infeasible in practical applications because of their sheer number; moreover, the retrieved images are much simpler than natural images, so the diversity of each object cannot be learned.
Socher et al published "group composition information for defining and describing images with content" in the 2013 NIPS Deep L earning Workshop meeting, which presented a social networking resource-based visual concept learning method.
Disclosure of Invention
Aiming at the above defects in the prior art, the invention provides an unsupervised joint visual concept learning system and method based on images and text, whose unsupervised automatic learning effectively overcomes the complexity of manual labeling on large-scale data.
According to a first object of the present invention, there is provided an unsupervised joint visual concept learning method based on images and text, comprising:
a text parsing step: for a given sentence description, performing part-of-speech tagging on each word in the sentence with a text parsing tool and extracting the corresponding nouns, the singular and plural nouns serving as labels for the cardinality instance learning step; besides the nouns themselves, extracting the cardinality, i.e., the quantity, corresponding to each noun as additional constraint information for the cardinality instance learning step;
a cardinality instance learning step: first extracting the salient regions in the image corresponding to the sentence description, and then using the cardinality information extracted in the text parsing step to guide multiple-instance learning in training a classifier for each visual concept, i.e., extracting from each image the number of objects given by the corresponding cardinality, so as to improve the classification accuracy of visual concept learning and obtain the visual concept classifiers; each visual concept classifier trained in this step serves as input to the multi-task clustering step;
a multi-task clustering step: using the visual concept classifiers trained in the cardinality instance learning step, gathering nouns that refer to similar objects into one large class via multi-task clustering as a visual concept, so as to handle diversity among concepts and obtain more compact and robust visual concepts.
Preferably, in the text parsing step, the extraction of noun cardinalities is divided into "exact" and "approximate": an "exact" cardinality is determined by the numeral modifier preceding the noun, while an "approximate" plural-noun cardinality is defined as "2", since at least two such objects appear in the image.
Preferably, the cardinality instance learning step operates at the level of image region blocks rather than the whole image, because a natural image often contains multiple objects.
Preferably, in the cardinality instance learning step, an image containing no corresponding instance is called a "negative bag" and an image containing at least one corresponding instance is called a "positive bag"; the classification error of each "negative bag" is the maximum of the scores of all instances in the bag, while the classification error of each "positive bag" is the average error over the corresponding cardinality of instances; the final classification error function is the sum of the classification errors of all "positive bags" and "negative bags".
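For concreteness, writing $g_k(x_i)$ for the score of region block $x_i$ under the $k$-th classifier, the bag errors just described can be sketched as follows (a sketch only: the per-instance loss $\ell$ is an assumption, as the invention fixes only the max and cardinality-average aggregations):

$$E_k(X_{\mathrm{neg}}) = \ell\Big(\max_{i} g_k(x_i)\Big), \qquad E_k(X_{\mathrm{pos}}) = \frac{1}{n_k}\sum_{j=1}^{n_k} \ell\big(g_k(x_{(j)})\big)$$

where $x_{(1)}, \ldots, x_{(n_k)}$ denote the $n_k$ highest-scoring region blocks in a positive bag; the final error function sums these terms over all positive and negative bags.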
More preferably, compared with methods that extract only one positive instance per bag, the cardinality instance learning step extracts more instances from each image and obtains a classifier with stronger generalization.
More preferably, in the cardinality instance learning step, the classification error function is trained by stochastic gradient descent until the network converges.
Preferably, in the multi-task clustering step, the objective function consists of two terms: a clustering error and a regularization error.
More preferably, the regularization error comprises a penalty function measuring the weight magnitude and regularization functions measuring the similarity between classes.
According to a second object of the present invention, there is provided an unsupervised joint visual concept learning system based on images and text, comprising a text parsing module, a cardinality instance learning module and a multi-task clustering module, wherein:
the text parsing module, for a given sentence description, performs part-of-speech tagging on each word in the sentence with a text parsing tool, extracts the corresponding nouns, and takes the singular and plural nouns as labels for the cardinality instance learning module; besides the nouns themselves, it extracts the cardinality, i.e., the quantity corresponding to each noun, as additional constraint information for the cardinality instance learning module;
the cardinality instance learning module first extracts the salient regions in the image corresponding to the sentence description, and then uses the cardinality information extracted by the previous module to guide multiple-instance learning in training a classifier for each visual concept, i.e., it extracts from each image the number of objects given by the corresponding cardinality, improving the classification accuracy of visual concept learning and obtaining the visual concept classifiers; each visual concept classifier trained by this module serves as input to the next module;
the multi-task clustering module, using the visual concept classifiers trained by the cardinality instance learning module, gathers nouns that refer to similar objects into one large class via multi-task clustering as a visual concept, handling diversity among concepts and obtaining more compact and robust visual concepts.
Preferably, besides the singular and plural nouns themselves, which serve as labels of visual concepts, the text parsing module extracts each noun's corresponding cardinality as additional constraint information for the next module.
Preferably, the text parsing module divides the extraction of noun cardinalities into "exact" and "approximate": an "exact" cardinality is determined by the numeral modifier preceding the noun, while an "approximate" plural-noun cardinality (e.g., for "some") is defined as "2", since at least two such objects appear in the image; extracting noun cardinalities provides information for the next module and improves scene understanding.
The cardinality instance learning module first extracts the salient regions in each image and then uses the cardinality information to guide multiple-instance learning in training a classifier for each visual concept, i.e., it extracts from each image the number of objects given by the corresponding cardinality; compared with conventional multiple-instance learning, which extracts only one positive instance per bag, it extracts as many positive instances as the scene description specifies, improving the classification accuracy of visual concept learning.
Preferably, the cardinality instance learning module processes image region blocks rather than whole images, because a natural image often contains multiple objects (such as "sky", "beach" and "tourist"); feeding the whole image to a traditional image classification method would yield poor object detection results.
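The sketch below illustrates one way to obtain such region blocks, assuming OpenCV's spectral-residual saliency detector (from opencv-contrib-python) as a stand-in for the saliency method, which the invention does not name; the Otsu threshold and the minimum area are likewise illustrative assumptions.

```python
import cv2
import numpy as np

def salient_region_blocks(image_bgr, min_area=400):
    """Crop candidate region blocks from the salient parts of an image."""
    # Spectral-residual saliency: a stand-in for the unspecified method.
    sal = cv2.saliency.StaticSaliencySpectralResidual_create()
    ok, smap = sal.computeSaliency(image_bgr)          # float map in [0, 1]
    if not ok:
        return []
    binary = (smap * 255).astype(np.uint8)
    _, binary = cv2.threshold(binary, 0, 255,
                              cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    # Each connected salient component becomes one region block.
    n, _, stats, _ = cv2.connectedComponentsWithStats(binary)
    blocks = []
    for i in range(1, n):                              # label 0 is background
        x, y, w, h, area = stats[i]
        if area >= min_area:                           # drop tiny speckles
            blocks.append(image_bgr[y:y + h, x:x + w])
    return blocks
```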
Preferably, the cardinality instance learning module trains a classifier for each visual concept extracted by the previous module using multiple-instance learning, which differs from traditional classifier training in that each positive bag is only guaranteed to contain at least one positive instance rather than exclusively positive instances, while negative bags contain only negative instances.
Preferably, in the cardinality instance learning module, the classification error of each "negative bag" (i.e., an image containing no such instance) is the maximum of all instance scores in the bag; the classification error of each "positive bag" (i.e., an image containing at least one corresponding instance) is the average error over the corresponding cardinality of instances.
Preferably, compared with methods that extract only one positive instance per bag, the cardinality instance learning module extracts more instances from each image and obtains a classifier with stronger generalization, thereby improving scene understanding and object detection.
Preferably, the error function of the cardinality instance learning module is trained by stochastic gradient descent until the network converges.
The multi-task clustering module handles diversity among concepts: for example, both "girl" and "policeman" refer to "person", so to obtain a more robust classifier, nouns referring to similar objects are gathered into one large class by multi-task clustering as a visual concept.
Preferably, because of the diversity of the extracted nouns (e.g., "girl" and "policeman" both refer to "person"), nouns referring to similar objects are clustered into one large class by multi-task clustering as a visual concept, to obtain a more robust classifier.
Preferably, the objective function of the multi-task clustering module consists of two terms: a clustering error and a regularization error.
More preferably, the regularization error comprises a penalty function measuring the weight magnitude and regularization functions measuring the similarity between classes.
Compared with the prior art, the invention has the following beneficial effects:
Manual labeling on existing large-scale data is complex to carry out: existing search-engine-based methods require manually specifying the categories of visual concepts, and the retrieved images are too simple and lack diversity; existing social-resource-based methods do not consider the similarity between concepts, which causes redundancy among visual concepts and prevents obtaining robust object detectors and classifiers.
Aiming at these problems, the invention adopts unsupervised visual concept learning: using natural language processing and salient-region extraction, it proposes a cardinality-guided multiple-instance learning method to train a classifier for each visual concept; meanwhile, a multi-task clustering method is proposed to gather similar nouns into one class, yielding a more robust visual concept classification. The complexity of manual labeling on existing large-scale data is thereby well overcome.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a flow chart of a method according to an embodiment of the present invention;
fig. 2 is a block diagram of a system according to an embodiment of the invention.
Detailed Description
The present invention is described in detail below with reference to specific embodiments. The following embodiments will assist those skilled in the art in further understanding the invention, but do not limit it in any way. It should be noted that persons skilled in the art can make variations and modifications without departing from the spirit of the invention, all of which fall within the scope of the present invention.
As shown in Fig. 1, aiming at the complexity of manual labeling on large-scale data, the present invention provides an unsupervised joint visual concept learning method based on images and text:
a text parsing step: for a given sentence description, performing part-of-speech tagging on each word in the sentence with a text parsing tool and extracting the corresponding nouns, the singular and plural nouns serving as labels for the cardinality instance learning step; besides the nouns themselves, extracting the cardinality, i.e., the quantity, corresponding to each noun as additional constraint information for the cardinality instance learning step;
a cardinality instance learning step: first extracting the salient regions in the image corresponding to the sentence description, and then using the cardinality information extracted in the text parsing step to guide multiple-instance learning in training a classifier for each visual concept, i.e., extracting from each image the number of objects given by the corresponding cardinality, so as to improve the classification accuracy of visual concept learning and obtain the visual concept classifiers; each visual concept classifier trained in this step serves as input to the multi-task clustering step;
a multi-task clustering step: gathering nouns that refer to similar objects into one large class via multi-task clustering as a visual concept, so as to handle diversity among concepts and obtain more compact and robust visual concepts.
The specific implementation of each step is described below together with the corresponding modules of the system embodiment.
As shown in Fig. 2, which is a structural block diagram of the unsupervised joint visual concept learning system based on images and text that corresponds to and implements the above method, the system comprises a text parsing module, a cardinality instance learning module and a multi-task clustering module, wherein:
the text parsing module, for a given sentence description, performs part-of-speech tagging on each word in the sentence with a text parsing tool, extracts the corresponding nouns, and takes the singular and plural nouns as labels for the cardinality instance learning module; besides the nouns themselves, it extracts the cardinality, i.e., the quantity corresponding to each noun, as additional constraint information for the cardinality instance learning module;
the cardinality instance learning module first extracts the salient regions in the image corresponding to the sentence description, and then uses the cardinality information extracted by the previous module to guide multiple-instance learning in training a classifier for each visual concept, i.e., it extracts from each image the number of objects given by the corresponding cardinality, improving the classification accuracy of visual concept learning and obtaining the visual concept classifiers; each visual concept classifier trained by this module serves as input to the next module;
the multi-task clustering module gathers nouns that refer to similar objects into one large class via multi-task clustering as a visual concept, handling diversity among concepts and obtaining more compact and robust visual concepts.
In this embodiment, the text parsing module divides the extraction of noun cardinalities into two types, "exact" and "approximate": an "exact" cardinality is determined by the numeral modifiers preceding the noun, while an "approximate" plural-noun cardinality (e.g., for "some") is defined as "2", since at least two such objects appear in the image.
Thus, the cardinality vector of each image can be written as $N = [n_1, n_2, \ldots, n_K]$, where $n_k = 0$ if the $k$-th noun in the list is not mentioned for the image, and otherwise $n_k$ equals the cardinality extracted for that noun.
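As an illustration of this parsing and cardinality extraction, here is a minimal sketch assuming NLTK's off-the-shelf POS tagger in place of the unspecified parsing tool; the numeral map and the crude plural lemmatization are likewise illustrative assumptions.

```python
import nltk  # needs the 'punkt' and 'averaged_perceptron_tagger' data

# Hypothetical numeral map; extend as needed.
NUM_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}

def cardinality_vector(sentence, vocab):
    """Cardinality vector N = [n_1, ..., n_K]: n_k = 0 if the k-th noun of
    `vocab` is absent from the sentence, else its extracted cardinality."""
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence.lower()))
    counts = {}
    num = None                        # numeral seen just before a noun
    for word, tag in tagged:
        if tag == "CD":               # cardinal number: remember it
            num = int(word) if word.isdigit() else NUM_WORDS.get(word)
        elif tag == "NN":             # singular noun: "exact" cardinality
            counts[word] = num or 1
            num = None
        elif tag == "NNS":            # plural noun: "exact" if a numeral
            lemma = word.rstrip("s")  # precedes, else "approximate" -> 2
            counts[lemma] = num or 2
            num = None
        else:
            num = None
    return [counts.get(noun, 0) for noun in vocab]

# e.g. cardinality_vector("Two dogs play with a man on some boats",
#                         ["dog", "man", "boat"])  ->  [2, 1, 2]
```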
In this embodiment, the cardinality instance learning module trains the classifier of each visual concept using multiple-instance learning. The score of the $k$-th classifier on a salient region block $x$ is defined as

$$g_k(x) = w_k^{\top} \Phi x \qquad (1)$$

where $\Phi$ is an $h \times d$ matrix shared by all classifiers that maps the original $d$-dimensional features to $h$ dimensions, $w_k$ is the weight of the $k$-th visual concept classifier, and $x$ is the feature representation of the region block.
In this embodiment, the classification error of each "negative bag" (i.e., an image containing no such instance) is the maximum of all instance scores in the bag, and the classification error of each "positive bag" (i.e., an image containing at least one corresponding instance) is the average error over the corresponding cardinality of instances. Thus the classification score of each picture $X$ for the $k$-th concept is

$$G_k(X) = \frac{1}{n_k} \sum_{i \in S_k(X)} g_k(x_i) \qquad (2)$$

where $S_k(X)$ is the set of "key instances", i.e., the $n_k$ region blocks with the highest scores $g_k(x_i)$ in the image, and $n_k$ is the instance cardinality of the class contained in the bag.
In this embodiment, compared with methods that extract only one positive instance per bag, the cardinality instance learning module extracts more instances from each image and obtains a classifier with stronger generalization.
In this embodiment, the error function of the cardinality instance learning module is trained by stochastic gradient descent until the network converges.
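Putting the score definition, the bag aggregation and the SGD training together, a runnable PyTorch sketch might look as follows; the hinge loss, dimensions and initialization are assumptions, since the embodiment fixes only the score form $g_k(x) = w_k^{\top}\Phi x$, the max / top-$n_k$ aggregation and the use of stochastic gradient descent.

```python
import torch

K, d, h = 4, 4096, 256                   # classes, input dim, mapped dim
phi = torch.nn.Parameter(0.01 * torch.randn(h, d))  # shared h x d mapping
W = torch.nn.Parameter(0.01 * torch.randn(K, h))    # per-class weights w_k
opt = torch.optim.SGD([phi, W], lr=1e-3)

def bag_loss(X, k, n_k):
    """X: (num_blocks, d) region-block features of one image (one bag);
    n_k = 0 marks a negative bag for class k, n_k > 0 its cardinality."""
    scores = W[k] @ (phi @ X.T)          # g_k(x_i) = w_k^T (Phi x_i)
    if n_k == 0:
        # negative bag: the error is driven by the maximum instance score
        return torch.clamp(1 + scores.max(), min=0)
    # positive bag: average error over the top-n_k "key instances"
    top = scores.topk(min(n_k, scores.numel())).values
    return torch.clamp(1 - top, min=0).mean()

# one SGD step over a toy batch of (features, class index, cardinality) bags
batch = [(torch.randn(12, d), 2, 3), (torch.randn(9, d), 0, 0)]
opt.zero_grad()
loss = sum(bag_loss(X, k, n) for X, k, n in batch)
loss.backward()
opt.step()
```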
In this embodiment, because of the diversity of the extracted nouns (e.g., "airplane" and "helicopter" both refer to "plane"), nouns referring to similar objects are clustered into one large class by multi-task clustering as a visual concept, to obtain a more robust classifier. Denote the mapped region-block feature by $x'_i = \Phi x_i$; the score of this region block is then $g_k(x_i) = w_k^{\top} x'_i$, where $w_k$ is the weight of the $k$-th visual concept classifier and $\Phi$ is the $h \times d$ matrix shared by all classifiers that maps the original $d$-dimensional features to $h$ dimensions, both obtained by training, and $x_i$ is the feature representation of the region block.
In this embodiment, the objective function of the multi-task clustering module consists of two terms, a clustering error and a regularization error:

$$\min_{W, V} \; L(W) + \Omega(W, V) \qquad (3)$$

where the clustering error $L(W)$ is the average classification error

$$L(W) = \frac{1}{M} \sum_{i=1}^{M} \sum_{k=1}^{K} \ell\big(w_k^{\top} x'_i,\, y_{ik}\big) \qquad (4)$$

in which $M$ is the total number of class instances, $K$ is the number of all classes, $w_k$ is the weight of the $k$-th visual concept classifier with $W = [w_1, \ldots, w_k, \ldots, w_K]$, and $x'_i$ is the mapped feature representation of the $i$-th region block.

The regularization error $\Omega(W, V)$ comprises a penalty function measuring the weight magnitude and regularization functions measuring the similarity between classes:

$$\Omega(W, V) = \Omega_{\mathrm{mag}}(W) + \alpha\, \Omega_{\mathrm{inter}}(W, V) + \beta\, \Omega_{\mathrm{intra}}(W, V) \qquad (5)$$

Here $\Omega_{\mathrm{mag}}$ is a magnitude penalty on the weights $W$; $\Omega_{\mathrm{inter}}$ and $\Omega_{\mathrm{intra}}$ regularize the weights between and within cluster classes, respectively; $\alpha$ and $\beta$ are the regularization coefficients; and $V = A (A^{\top} A)^{-1} A^{\top}$, where $A \in \{0, 1\}^{K \times T}$ is the cluster-label assignment of the visual concepts, with $A(k, t) = 1$ if the $k$-th visual concept belongs to the $t$-th cluster category, and $K$ and $T$ are the numbers of visual concept categories and cluster categories, respectively.
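The embodiment names the three regularization terms without giving their closed forms; one plausible instantiation, borrowing the projection-matrix penalties of clustered multi-task learning (Jacob et al., NIPS 2008) and taking $\Omega_{\mathrm{mag}}$ as the squared Frobenius norm, is sketched below.

```python
import numpy as np

def regularizer(W, A, alpha, beta):
    """W: (K, h) classifier weights; A: (K, T) 0/1 cluster assignment."""
    K = W.shape[0]
    V = A @ np.linalg.inv(A.T @ A) @ A.T        # V = A (A^T A)^{-1} A^T
    omega_mag = np.sum(W ** 2)                  # magnitude penalty on W
    P0 = np.full((K, K), 1.0 / K)               # projection onto the mean
    omega_inter = np.trace(W.T @ (V - P0) @ W)  # between-cluster spread
    omega_intra = np.trace(W.T @ (np.eye(K) - V) @ W)  # within-cluster spread
    return omega_mag + alpha * omega_inter + beta * omega_intra
```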
For this non-convex optimization problem, a convex relaxation is adopted: optimizing over a convex set of positive semi-definite matrices yields the parameters $W$ and $V$.
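As an illustration of this relaxation step, the sketch below (using CVXPY) assumes the standard relaxation of $V = A(A^{\top}A)^{-1}A^{\top}$ to the convex set $\{V : 0 \preceq V \preceq I,\ \mathrm{tr}(V) = T\}$ and solves for $V$ with $W$ held fixed; the alternating scheme and the choice of convex set are assumptions, as the embodiment does not spell them out.

```python
import cvxpy as cp
import numpy as np

def relaxed_cluster_matrix(W, T, alpha, beta):
    """With W (K, h) fixed, solve the relaxed SDP for the cluster matrix V."""
    K = W.shape[0]
    P0 = np.full((K, K), 1.0 / K)
    V = cp.Variable((K, K), PSD=True)            # relaxed V: symmetric PSD
    objective = cp.Minimize(
        alpha * cp.trace(W.T @ (V - P0) @ W)     # between-cluster term
        + beta * cp.trace(W.T @ (np.eye(K) - V) @ W))  # within-cluster term
    constraints = [V << np.eye(K), cp.trace(V) == T]
    cp.Problem(objective, constraints).solve()
    return V.value
```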
Effects of the implementation
Following the above steps, the system and steps of the summary of the invention were implemented. The experimental data come from 120,000 samples of the MicroSoft COCO dataset, each sample comprising one picture and five sentence descriptions. Four major classes were selected for the experiments, namely "person", "vehicle", "airplane" and "monitor"; accordingly, 10873 pictures of the training set were used for training and 2568 pictures of the validation set for testing. The features are 4096-dimensional vectors computed by a convolutional neural network. For the object detection application, the embodiment system is compared against strongly supervised, weakly supervised and unsupervised methods: strong supervision is represented by the DPM and R-CNN methods, weak supervision by the PR method, and the unsupervised setting by the PBM method. The average accuracies they obtain on the four classes are 0.349, 0.506, 0.268 and 0.218, respectively, while the average accuracy of the method of the invention is 0.454, a clear improvement.
Experiments show that the unsupervised joint visual concept learning system based on images and text performs well on the object detection problem.
Specific embodiments of the present invention have been described above. It is to be understood that the invention is not limited to these specific embodiments; those skilled in the art can make various changes and modifications within the scope of the claims without departing from the spirit of the invention.

Claims (9)

1. An unsupervised joint visual concept learning method based on images and text, characterized by comprising the following steps:
a text parsing step: for a given sentence description, performing part-of-speech tagging on each word in the sentence with a text parsing tool and extracting the corresponding nouns, the singular and plural nouns serving as labels for the cardinality instance learning step; besides the nouns themselves, extracting the cardinality, i.e., the quantity, corresponding to each noun as additional constraint information for the cardinality instance learning step;
a cardinality instance learning step: first extracting the salient regions in the image corresponding to the sentence description, and then using the cardinality information extracted in the text parsing step to guide multiple-instance learning in training a classifier for each visual concept, i.e., extracting from each image the number of objects given by the corresponding cardinality, so as to improve the classification accuracy of visual concept learning and obtain the visual concept classifiers; each visual concept classifier trained in this step serves as input to the multi-task clustering step;
a multi-task clustering step: using the visual concept classifiers trained in the cardinality instance learning step, gathering nouns that refer to similar objects into one large class via multi-task clustering as a visual concept, so as to handle diversity among concepts and obtain more compact and robust visual concepts.
2. The unsupervised joint visual concept learning method based on images and text according to claim 1, characterized in that, in the text parsing step, the extraction of noun cardinalities is divided into "exact" and "approximate": an "exact" cardinality is determined by the numeral modifier preceding the noun, while an "approximate" plural-noun cardinality is defined as "2", since at least two such objects appear in the image.
3. The method as claimed in claim 1, characterized in that the cardinality instance learning step operates at the image region-block level rather than the whole-image level, because a natural image often contains multiple objects.
4. The unsupervised joint visual concept learning method based on images and text according to claim 1, characterized in that, in the cardinality instance learning step, images containing no corresponding instance are called "negative bags" and images containing at least one corresponding instance are called "positive bags"; the classification error of each "negative bag" is the maximum of all instance scores in the bag, and the classification error of each "positive bag" is the average error over the corresponding cardinality of instances; the final classification error function is the sum of the classification errors of all "positive bags" and "negative bags".
5. The method as claimed in claim 4, characterized in that, compared with methods that extract only one positive instance per bag, the cardinality instance learning step extracts more instances from each image and obtains a classifier with stronger generalization.
6. The method of claim 5, characterized in that, in the cardinality instance learning step, the classification error function is trained by stochastic gradient descent until the network converges.
7. The unsupervised joint visual concept learning method based on images and text according to any one of claims 1-6, characterized in that the objective function of the multi-task clustering step consists of both a clustering error and a regularization error.
8. The method of claim 7, characterized in that the regularization error comprises a penalty function measuring the weight magnitude and regularization functions measuring the similarity between classes.
9. An unsupervised joint visual concept learning system based on images and text for implementing the method of any one of claims 1-8, characterized by comprising a text parsing module, a cardinality instance learning module and a multi-task clustering module, wherein:
the text parsing module, for a given sentence description, performs part-of-speech tagging on each word in the sentence with a text parsing tool, extracts the corresponding nouns, and takes the singular and plural nouns as labels for the cardinality instance learning module; besides the nouns themselves, it extracts the cardinality, i.e., the quantity corresponding to each noun, as additional constraint information for the cardinality instance learning module;
the cardinality instance learning module first extracts the salient regions in the image corresponding to the sentence description, and then uses the cardinality information extracted by the previous module to guide multiple-instance learning in training a classifier for each visual concept, i.e., it extracts from each image the number of objects given by the corresponding cardinality, improving the classification accuracy of visual concept learning and obtaining the visual concept classifiers; each visual concept classifier trained by this module serves as input to the next module;
the multi-task clustering module, using the visual concept classifiers trained by the cardinality instance learning module, gathers nouns that refer to similar objects into one large class via multi-task clustering as a visual concept, handling diversity among concepts and obtaining more compact and robust visual concepts.
CN201610595620.3A 2016-07-26 2016-07-26 Unsupervised joint visual concept learning system and method based on images and text Active CN106227836B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610595620.3A CN106227836B (en) 2016-07-26 2016-07-26 Unsupervised joint visual concept learning system and method based on images and text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610595620.3A CN106227836B (en) 2016-07-26 2016-07-26 Unsupervised joint visual concept learning system and method based on images and text

Publications (2)

Publication Number Publication Date
CN106227836A CN106227836A (en) 2016-12-14
CN106227836B true CN106227836B (en) 2020-07-14

Family

ID=57533062

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610595620.3A Active CN106227836B (en) 2016-07-26 2016-07-26 Unsupervised joint visual concept learning system and method based on images and text

Country Status (1)

Country Link
CN (1) CN106227836B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106682696B (en) * 2016-12-29 2019-10-08 华中科技大学 The more example detection networks and its training method refined based on online example classification device
CN106815604B (en) * 2017-01-16 2019-09-27 大连理工大学 Method for viewing points detecting based on fusion of multi-layer information
US10803356B2 (en) * 2017-04-07 2020-10-13 Hrl Laboratories, Llc Method for understanding machine-learning decisions based on camera data
CN108205684B (en) * 2017-04-25 2022-02-11 北京市商汤科技开发有限公司 Image disambiguation method, device, storage medium and electronic equipment
CN108062574B (en) * 2017-12-31 2020-06-16 厦门大学 Weak supervision target detection method based on specific category space constraint


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9965704B2 (en) * 2014-10-31 2018-05-08 Paypal, Inc. Discovering visual concepts from weakly labeled image collections

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102930297A (en) * 2012-11-05 2013-02-13 北京理工大学 Emotion recognition method for enhancing coupling hidden markov model (HMM) voice-vision fusion
CN104217225A (en) * 2014-09-02 2014-12-17 中国科学院自动化研究所 A visual target detection and labeling method
CN104217008A (en) * 2014-09-17 2014-12-17 中国科学院自动化研究所 Interactive type labeling method and system for Internet figure video
CN105469041A (en) * 2015-11-19 2016-04-06 上海交通大学 Facial point detection system based on multi-task regularization and layer-by-layer supervision neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Grounded compositional semantics for finding and describing images with sentences";Socher等;《NIPS Deep Learning Workshop》;20131231;1-12页 *
"Neil:Extracting visual knowledge from web data";Chen等;《IEEE International Conference on Computer Vision》;20131208;1409-1416页 *

Also Published As

Publication number Publication date
CN106227836A (en) 2016-12-14


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant