CN106227836B - Unsupervised joint visual concept learning system and method based on images and text

Unsupervised joint visual concept learning system and method based on images and text

Info

Publication number
CN106227836B
Authority
CN
China
Prior art keywords
visual concept
learning
cardinality
module
nouns
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610595620.3A
Other languages
Chinese (zh)
Other versions
CN106227836A (en)
Inventor
熊红凯 (Hongkai Xiong)
倪赛杰 (Saijie Ni)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University
Priority to CN201610595620.3A priority Critical patent/CN106227836B/en
Publication of CN106227836A publication Critical patent/CN106227836A/en
Application granted granted Critical
Publication of CN106227836B publication Critical patent/CN106227836B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 - Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/5866 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, manually generated location and time information
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2155 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Library & Information Science (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an unsupervised joint visual concept learning system and method based on images and text, comprising a text parsing module, a cardinality instance learning module and a multi-task clustering module, wherein: the text parsing module uses the sentences that social media attach to images to extract the corresponding nouns as visual concepts, together with their cardinal numbers, as additional constraint information for the next module; the cardinality instance learning module trains a classifier for each visual concept with a cardinality-guided multiple-instance learning method; the multi-task clustering module handles diversity among concepts, i.e., it clusters nouns referring to similar objects into one large class as a visual concept. By unsupervised automatic learning, the invention effectively overcomes the complexity of manual labeling on large-scale data.

Description

Unsupervised joint visual concept learning system and method based on images and text
Technical Field
The invention relates to visual concept learning in the field of computer vision, and in particular to an unsupervised joint visual concept learning system and method based on images and text.
Background
In the field of computer vision, conventional image classification and object detection methods rely more or less on manual annotation, such as image-level or instance-level labels. In recent years, with the development of computer technology and the advent of big data, large-scale visual concept learning has become an emerging research hotspot; manually annotating millions or even tens of millions of samples is extremely difficult, so using unsupervised learning for large-scale visual concept learning is a pressing need.
Since learning visual concepts from pictures alone is particularly difficult, most existing methods are supervised or weakly supervised. They fall mainly into two categories: search-engine-based and social-resource-based methods. Search-engine-based methods use the BING API and the like to collect training pictures for keyword queries and then take the keywords as category labels of the visual concepts; social-resource-based methods directly use the pictures of a social platform, together with their accompanying textual descriptions, for joint visual concept learning.
In "NEIL: Extracting visual knowledge from web data", published at the 2013 IEEE International Conference on Computer Vision (ICCV), Chen et al. proposed a search-engine-based visual concept learning method: a set of pictures is collected for each concept, common-sense relations (such as positional relations) among the instances in the pictures are then mined iteratively, and the detector of each visual concept is continuously refined with the retrieved results. However, this search-engine-based method requires manually specifying the categories of visual concepts, which is infeasible in practical applications because of their sheer number; moreover, the retrieved images are much simpler than natural images, so the diversity of each object cannot be learned.
Socher et al published "group composition information for defining and describing images with content" in the 2013 NIPS Deep L earning Workshop meeting, which presented a social networking resource-based visual concept learning method.
Disclosure of Invention
Aiming at the above defects in the prior art, the invention provides an unsupervised joint visual concept learning system and method based on images and text, whose unsupervised automatic learning effectively overcomes the complexity of manual labeling on large-scale data.
According to a first object of the present invention, there is provided an unsupervised joint visual concept learning method based on images and text, comprising:
a text parsing step: for a given sentence description, performing part-of-speech tagging on each word in the sentence with a text parsing tool and extracting the corresponding nouns, the singular and plural nouns serving as labels for the cardinality instance learning step; besides the nouns themselves, extracting the cardinality, i.e., the quantity, corresponding to each noun as additional constraint information for the cardinality instance learning step;
a cardinality instance learning step: first extracting the salient regions in the image corresponding to the sentence description, and then using the cardinality information extracted in the text parsing step to guide multiple-instance learning in training a classifier for each visual concept, i.e., extracting from each image the number of objects given by the corresponding cardinality, so as to improve the classification accuracy of visual concept learning and obtain the visual concept classifiers; each visual concept classifier trained in this step serves as input to the multi-task clustering step;
a multi-task clustering step: using the visual concept classifiers trained in the cardinality instance learning step, gathering nouns that refer to similar objects into one large class via multi-task clustering as a visual concept, so as to handle diversity among concepts and obtain more compact and robust visual concepts.
Preferably, in the text parsing step, the extraction of noun cardinalities is divided into "exact" and "approximate": an "exact" cardinality is determined by the numeral modifier preceding the noun, while an "approximate" plural-noun cardinality is defined as "2", since at least two such objects appear in the image.
Preferably, the cardinality instance learning step operates at the level of image region blocks rather than the whole image, because a natural image often contains multiple objects.
Preferably, in the cardinality instance learning step, an image containing no corresponding instance is called a "negative bag" and an image containing at least one corresponding instance is called a "positive bag"; the classification error of each "negative bag" is the maximum of the scores of all instances in the bag, while the classification error of each "positive bag" is the average error over the corresponding cardinality of instances; the final classification error function is the sum of the classification errors of all "positive bags" and "negative bags".
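For concreteness, writing $g_k(x_i)$ for the score of region block $x_i$ under the $k$-th classifier, the bag errors just described can be sketched as follows (a sketch only: the per-instance loss $\ell$ is an assumption, as the invention fixes only the max and cardinality-average aggregations):

$$E_k(X_{\mathrm{neg}}) = \ell\Big(\max_{i} g_k(x_i)\Big), \qquad E_k(X_{\mathrm{pos}}) = \frac{1}{n_k}\sum_{j=1}^{n_k} \ell\big(g_k(x_{(j)})\big)$$

where $x_{(1)}, \ldots, x_{(n_k)}$ denote the $n_k$ highest-scoring region blocks in a positive bag; the final error function sums these terms over all positive and negative bags.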
More preferably, compared with methods that extract only one positive instance per bag, the cardinality instance learning step extracts more instances from each image and obtains a classifier with stronger generalization.
More preferably, in the cardinality instance learning step, the classification error function is trained by stochastic gradient descent until the network converges.
Preferably, in the multi-task clustering step, the objective function consists of two terms: a clustering error and a regularization error.
More preferably, the regularization error comprises a penalty function measuring the weight magnitude and regularization functions measuring the similarity between classes.
According to a second object of the present invention, there is provided an unsupervised joint visual concept learning system based on images and text, comprising a text parsing module, a cardinality instance learning module and a multi-task clustering module, wherein:
the text parsing module, for a given sentence description, performs part-of-speech tagging on each word in the sentence with a text parsing tool, extracts the corresponding nouns, and takes the singular and plural nouns as labels for the cardinality instance learning module; besides the nouns themselves, it extracts the cardinality, i.e., the quantity corresponding to each noun, as additional constraint information for the cardinality instance learning module;
the cardinality instance learning module first extracts the salient regions in the image corresponding to the sentence description, and then uses the cardinality information extracted by the previous module to guide multiple-instance learning in training a classifier for each visual concept, i.e., it extracts from each image the number of objects given by the corresponding cardinality, improving the classification accuracy of visual concept learning and obtaining the visual concept classifiers; each visual concept classifier trained by this module serves as input to the next module;
the multi-task clustering module, using the visual concept classifiers trained by the cardinality instance learning module, gathers nouns that refer to similar objects into one large class via multi-task clustering as a visual concept, handling diversity among concepts and obtaining more compact and robust visual concepts.
Preferably, besides the singular and plural nouns themselves, which serve as labels of visual concepts, the text parsing module extracts each noun's corresponding cardinality as additional constraint information for the next module.
Preferably, the text parsing module divides the extraction of noun cardinalities into "exact" and "approximate": an "exact" cardinality is determined by the numeral modifier preceding the noun, while an "approximate" plural-noun cardinality (e.g., for "some") is defined as "2", since at least two such objects appear in the image; extracting noun cardinalities provides information for the next module and improves scene understanding.
The cardinality instance learning module first extracts the salient regions in each image and then uses the cardinality information to guide multiple-instance learning in training a classifier for each visual concept, i.e., it extracts from each image the number of objects given by the corresponding cardinality; compared with conventional multiple-instance learning, which extracts only one positive instance per bag, it extracts as many positive instances as the scene description specifies, improving the classification accuracy of visual concept learning.
Preferably, the cardinality instance learning module processes image region blocks rather than whole images, because a natural image often contains multiple objects (such as "sky", "beach" and "tourist"); feeding the whole image to a traditional image classification method would yield poor object detection results.
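The sketch below illustrates one way to obtain such region blocks, assuming OpenCV's spectral-residual saliency detector (from opencv-contrib-python) as a stand-in for the saliency method, which the invention does not name; the Otsu threshold and the minimum area are likewise illustrative assumptions.

```python
import cv2
import numpy as np

def salient_region_blocks(image_bgr, min_area=400):
    """Crop candidate region blocks from the salient parts of an image."""
    # Spectral-residual saliency: a stand-in for the unspecified method.
    sal = cv2.saliency.StaticSaliencySpectralResidual_create()
    ok, smap = sal.computeSaliency(image_bgr)          # float map in [0, 1]
    if not ok:
        return []
    binary = (smap * 255).astype(np.uint8)
    _, binary = cv2.threshold(binary, 0, 255,
                              cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    # Each connected salient component becomes one region block.
    n, _, stats, _ = cv2.connectedComponentsWithStats(binary)
    blocks = []
    for i in range(1, n):                              # label 0 is background
        x, y, w, h, area = stats[i]
        if area >= min_area:                           # drop tiny speckles
            blocks.append(image_bgr[y:y + h, x:x + w])
    return blocks
```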
Preferably, the cardinality instance learning module trains a classifier for each visual concept extracted by the previous module using multiple-instance learning, which differs from traditional classifier training in that each positive bag is only guaranteed to contain at least one positive instance rather than exclusively positive instances, while negative bags contain only negative instances.
Preferably, in the cardinality instance learning module, the classification error of each "negative bag" (i.e., an image containing no such instance) is the maximum of all instance scores in the bag; the classification error of each "positive bag" (i.e., an image containing at least one corresponding instance) is the average error over the corresponding cardinality of instances.
Preferably, compared with methods that extract only one positive instance per bag, the cardinality instance learning module extracts more instances from each image and obtains a classifier with stronger generalization, thereby improving scene understanding and object detection.
Preferably, the error function of the cardinality instance learning module is trained by stochastic gradient descent until the network converges.
The multi-task clustering module handles diversity among concepts: for example, both "girl" and "policeman" refer to "person", so to obtain a more robust classifier, nouns referring to similar objects are gathered into one large class by multi-task clustering as a visual concept.
Preferably, because of the diversity of the extracted nouns (e.g., "girl" and "policeman" both refer to "person"), nouns referring to similar objects are clustered into one large class by multi-task clustering as a visual concept, to obtain a more robust classifier.
Preferably, the objective function of the multi-task clustering module consists of two terms: a clustering error and a regularization error.
More preferably, the regularization error comprises a penalty function measuring the weight magnitude and regularization functions measuring the similarity between classes.
Compared with the prior art, the invention has the following beneficial effects:
Manual labeling on existing large-scale data is complex to carry out: existing search-engine-based methods require manually specifying the categories of visual concepts, and the retrieved images are too simple and lack diversity; existing social-resource-based methods do not consider the similarity between concepts, which causes redundancy among visual concepts and prevents obtaining robust object detectors and classifiers.
Aiming at these problems, the invention adopts unsupervised visual concept learning: using natural language processing and salient-region extraction, it proposes a cardinality-guided multiple-instance learning method to train a classifier for each visual concept; meanwhile, a multi-task clustering method is proposed to gather similar nouns into one class, yielding a more robust visual concept classification. The complexity of manual labeling on existing large-scale data is thereby well overcome.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a flow chart of a method according to an embodiment of the present invention;
fig. 2 is a block diagram of a system according to an embodiment of the invention.
Detailed Description
The present invention is described in detail below with reference to specific embodiments. The following embodiments will assist those skilled in the art in further understanding the invention, but do not limit it in any way. It should be noted that persons skilled in the art can make variations and modifications without departing from the spirit of the invention, all of which fall within the scope of the present invention.
As shown in Fig. 1, aiming at the complexity of manual labeling on large-scale data, the present invention provides an unsupervised joint visual concept learning method based on images and text:
a text parsing step: for a given sentence description, performing part-of-speech tagging on each word in the sentence with a text parsing tool and extracting the corresponding nouns, the singular and plural nouns serving as labels for the cardinality instance learning step; besides the nouns themselves, extracting the cardinality, i.e., the quantity, corresponding to each noun as additional constraint information for the cardinality instance learning step;
a cardinality instance learning step: first extracting the salient regions in the image corresponding to the sentence description, and then using the cardinality information extracted in the text parsing step to guide multiple-instance learning in training a classifier for each visual concept, i.e., extracting from each image the number of objects given by the corresponding cardinality, so as to improve the classification accuracy of visual concept learning and obtain the visual concept classifiers; each visual concept classifier trained in this step serves as input to the multi-task clustering step;
a multi-task clustering step: gathering nouns that refer to similar objects into one large class via multi-task clustering as a visual concept, so as to handle diversity among concepts and obtain more compact and robust visual concepts.
The specific implementation of each step is described below together with the corresponding modules of the system embodiment.
As shown in Fig. 2, which is a structural block diagram of the unsupervised joint visual concept learning system based on images and text that corresponds to and implements the above method, the system comprises a text parsing module, a cardinality instance learning module and a multi-task clustering module, wherein:
the text parsing module, for a given sentence description, performs part-of-speech tagging on each word in the sentence with a text parsing tool, extracts the corresponding nouns, and takes the singular and plural nouns as labels for the cardinality instance learning module; besides the nouns themselves, it extracts the cardinality, i.e., the quantity corresponding to each noun, as additional constraint information for the cardinality instance learning module;
the cardinality instance learning module first extracts the salient regions in the image corresponding to the sentence description, and then uses the cardinality information extracted by the previous module to guide multiple-instance learning in training a classifier for each visual concept, i.e., it extracts from each image the number of objects given by the corresponding cardinality, improving the classification accuracy of visual concept learning and obtaining the visual concept classifiers; each visual concept classifier trained by this module serves as input to the next module;
the multi-task clustering module gathers nouns that refer to similar objects into one large class via multi-task clustering as a visual concept, handling diversity among concepts and obtaining more compact and robust visual concepts.
In this embodiment, the text parsing module divides the extraction of noun cardinalities into two types, "exact" and "approximate": an "exact" cardinality is determined by the numeral modifiers preceding the noun, while an "approximate" plural-noun cardinality (e.g., for "some") is defined as "2", since at least two such objects appear in the image.
Thus, the cardinality vector of each image can be written as $N = [n_1, n_2, \ldots, n_K]$, where $n_k = 0$ if the $k$-th noun in the list is not mentioned for the image, and otherwise $n_k$ equals the cardinality extracted for that noun.
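As an illustration of this parsing and cardinality extraction, here is a minimal sketch assuming NLTK's off-the-shelf POS tagger in place of the unspecified parsing tool; the numeral map and the crude plural lemmatization are likewise illustrative assumptions.

```python
import nltk  # needs the 'punkt' and 'averaged_perceptron_tagger' data

# Hypothetical numeral map; extend as needed.
NUM_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}

def cardinality_vector(sentence, vocab):
    """Cardinality vector N = [n_1, ..., n_K]: n_k = 0 if the k-th noun of
    `vocab` is absent from the sentence, else its extracted cardinality."""
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence.lower()))
    counts = {}
    num = None                        # numeral seen just before a noun
    for word, tag in tagged:
        if tag == "CD":               # cardinal number: remember it
            num = int(word) if word.isdigit() else NUM_WORDS.get(word)
        elif tag == "NN":             # singular noun: "exact" cardinality
            counts[word] = num or 1
            num = None
        elif tag == "NNS":            # plural noun: "exact" if a numeral
            lemma = word.rstrip("s")  # precedes, else "approximate" -> 2
            counts[lemma] = num or 2
            num = None
        else:
            num = None
    return [counts.get(noun, 0) for noun in vocab]

# e.g. cardinality_vector("Two dogs play with a man on some boats",
#                         ["dog", "man", "boat"])  ->  [2, 1, 2]
```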
In this embodiment, the cardinality instance learning module trains the classifier of each visual concept using multiple-instance learning. The score of the $k$-th classifier on a salient region block $x$ is defined as

$$g_k(x) = w_k^{\top} \Phi x \qquad (1)$$

where $\Phi$ is an $h \times d$ matrix shared by all classifiers that maps the original $d$-dimensional features to $h$ dimensions, $w_k$ is the weight of the $k$-th visual concept classifier, and $x$ is the feature representation of the region block.
In this embodiment, the classification error of each "negative bag" (i.e., an image containing no such instance) is the maximum of all instance scores in the bag, and the classification error of each "positive bag" (i.e., an image containing at least one corresponding instance) is the average error over the corresponding cardinality of instances. Thus the classification score of each picture $X$ for the $k$-th concept is

$$G_k(X) = \frac{1}{n_k} \sum_{i \in S_k(X)} g_k(x_i) \qquad (2)$$

where $S_k(X)$ is the set of "key instances", i.e., the $n_k$ region blocks with the highest scores $g_k(x_i)$ in the image, and $n_k$ is the instance cardinality of the class contained in the bag.
In this embodiment, compared with methods that extract only one positive instance per bag, the cardinality instance learning module extracts more instances from each image and obtains a classifier with stronger generalization.
In this embodiment, the error function of the cardinality instance learning module is trained by stochastic gradient descent until the network converges.
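Putting the score definition, the bag aggregation and the SGD training together, a runnable PyTorch sketch might look as follows; the hinge loss, dimensions and initialization are assumptions, since the embodiment fixes only the score form $g_k(x) = w_k^{\top}\Phi x$, the max / top-$n_k$ aggregation and the use of stochastic gradient descent.

```python
import torch

K, d, h = 4, 4096, 256                   # classes, input dim, mapped dim
phi = torch.nn.Parameter(0.01 * torch.randn(h, d))  # shared h x d mapping
W = torch.nn.Parameter(0.01 * torch.randn(K, h))    # per-class weights w_k
opt = torch.optim.SGD([phi, W], lr=1e-3)

def bag_loss(X, k, n_k):
    """X: (num_blocks, d) region-block features of one image (one bag);
    n_k = 0 marks a negative bag for class k, n_k > 0 its cardinality."""
    scores = W[k] @ (phi @ X.T)          # g_k(x_i) = w_k^T (Phi x_i)
    if n_k == 0:
        # negative bag: the error is driven by the maximum instance score
        return torch.clamp(1 + scores.max(), min=0)
    # positive bag: average error over the top-n_k "key instances"
    top = scores.topk(min(n_k, scores.numel())).values
    return torch.clamp(1 - top, min=0).mean()

# one SGD step over a toy batch of (features, class index, cardinality) bags
batch = [(torch.randn(12, d), 2, 3), (torch.randn(9, d), 0, 0)]
opt.zero_grad()
loss = sum(bag_loss(X, k, n) for X, k, n in batch)
loss.backward()
opt.step()
```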
In this embodiment, because of the diversity of the extracted nouns (e.g., "airplane" and "helicopter" both refer to "plane"), nouns referring to similar objects are clustered into one large class by multi-task clustering as a visual concept, to obtain a more robust classifier. Denote the mapped region-block feature by $x'_i = \Phi x_i$; the score of this region block is then $g_k(x_i) = w_k^{\top} x'_i$, where $w_k$ is the weight of the $k$-th visual concept classifier and $\Phi$ is the $h \times d$ matrix shared by all classifiers that maps the original $d$-dimensional features to $h$ dimensions, both obtained by training, and $x_i$ is the feature representation of the region block.
In this embodiment, the objective function of the multi-task clustering module consists of two terms, a clustering error and a regularization error:

$$\min_{W, V} \; L(W) + \Omega(W, V) \qquad (3)$$

where the clustering error $L(W)$ is the average classification error

$$L(W) = \frac{1}{M} \sum_{i=1}^{M} \sum_{k=1}^{K} \ell\big(w_k^{\top} x'_i,\, y_{ik}\big) \qquad (4)$$

in which $M$ is the total number of class instances, $K$ is the number of all classes, $w_k$ is the weight of the $k$-th visual concept classifier with $W = [w_1, \ldots, w_k, \ldots, w_K]$, and $x'_i$ is the mapped feature representation of the $i$-th region block.

The regularization error $\Omega(W, V)$ comprises a penalty function measuring the weight magnitude and regularization functions measuring the similarity between classes:

$$\Omega(W, V) = \Omega_{\mathrm{mag}}(W) + \alpha\, \Omega_{\mathrm{inter}}(W, V) + \beta\, \Omega_{\mathrm{intra}}(W, V) \qquad (5)$$

Here $\Omega_{\mathrm{mag}}$ is a magnitude penalty on the weights $W$; $\Omega_{\mathrm{inter}}$ and $\Omega_{\mathrm{intra}}$ regularize the weights between and within cluster classes, respectively; $\alpha$ and $\beta$ are the regularization coefficients; and $V = A (A^{\top} A)^{-1} A^{\top}$, where $A \in \{0, 1\}^{K \times T}$ is the cluster-label assignment of the visual concepts, with $A(k, t) = 1$ if the $k$-th visual concept belongs to the $t$-th cluster category, and $K$ and $T$ are the numbers of visual concept categories and cluster categories, respectively.
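The embodiment names the three regularization terms without giving their closed forms; one plausible instantiation, borrowing the projection-matrix penalties of clustered multi-task learning (Jacob et al., NIPS 2008) and taking $\Omega_{\mathrm{mag}}$ as the squared Frobenius norm, is sketched below.

```python
import numpy as np

def regularizer(W, A, alpha, beta):
    """W: (K, h) classifier weights; A: (K, T) 0/1 cluster assignment."""
    K = W.shape[0]
    V = A @ np.linalg.inv(A.T @ A) @ A.T        # V = A (A^T A)^{-1} A^T
    omega_mag = np.sum(W ** 2)                  # magnitude penalty on W
    P0 = np.full((K, K), 1.0 / K)               # projection onto the mean
    omega_inter = np.trace(W.T @ (V - P0) @ W)  # between-cluster spread
    omega_intra = np.trace(W.T @ (np.eye(K) - V) @ W)  # within-cluster spread
    return omega_mag + alpha * omega_inter + beta * omega_intra
```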
For this non-convex optimization problem, a convex relaxation is adopted: optimizing over a convex set of positive semi-definite matrices yields the parameters $W$ and $V$.
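As an illustration of this relaxation step, the sketch below (using CVXPY) assumes the standard relaxation of $V = A(A^{\top}A)^{-1}A^{\top}$ to the convex set $\{V : 0 \preceq V \preceq I,\ \mathrm{tr}(V) = T\}$ and solves for $V$ with $W$ held fixed; the alternating scheme and the choice of convex set are assumptions, as the embodiment does not spell them out.

```python
import cvxpy as cp
import numpy as np

def relaxed_cluster_matrix(W, T, alpha, beta):
    """With W (K, h) fixed, solve the relaxed SDP for the cluster matrix V."""
    K = W.shape[0]
    P0 = np.full((K, K), 1.0 / K)
    V = cp.Variable((K, K), PSD=True)            # relaxed V: symmetric PSD
    objective = cp.Minimize(
        alpha * cp.trace(W.T @ (V - P0) @ W)     # between-cluster term
        + beta * cp.trace(W.T @ (np.eye(K) - V) @ W))  # within-cluster term
    constraints = [V << np.eye(K), cp.trace(V) == T]
    cp.Problem(objective, constraints).solve()
    return V.value
```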
Effects of the implementation
Following the above steps, the system and steps of the summary of the invention were implemented. The experimental data come from 120,000 samples of the MicroSoft COCO dataset, each sample comprising one picture and five sentence descriptions. Four major classes were selected for the experiments, namely "person", "vehicle", "airplane" and "monitor"; accordingly, 10873 pictures of the training set were used for training and 2568 pictures of the validation set for testing. The features are 4096-dimensional vectors computed by a convolutional neural network. For the object detection application, the embodiment system is compared against strongly supervised, weakly supervised and unsupervised methods: strong supervision is represented by the DPM and R-CNN methods, weak supervision by the PR method, and the unsupervised setting by the PBM method. The average accuracies they obtain on the four classes are 0.349, 0.506, 0.268 and 0.218, respectively, while the average accuracy of the method of the invention is 0.454, a clear improvement.
Experiments show that the unsupervised joint visual concept learning system based on images and text performs well on the object detection problem.
Specific embodiments of the present invention have been described above. It is to be understood that the invention is not limited to these specific embodiments; those skilled in the art can make various changes and modifications within the scope of the claims without departing from the spirit of the invention.

Claims (9)

1. An unsupervised joint visual concept learning method based on images and text, characterized by comprising the following steps:
a text parsing step: for a given sentence description, performing part-of-speech tagging on each word in the sentence with a text parsing tool and extracting the corresponding nouns, the singular and plural nouns serving as labels for the cardinality instance learning step; besides the nouns themselves, extracting the cardinality, i.e., the quantity, corresponding to each noun as additional constraint information for the cardinality instance learning step;
a cardinality instance learning step: first extracting the salient regions in the image corresponding to the sentence description, and then using the cardinality information extracted in the text parsing step to guide multiple-instance learning in training a classifier for each visual concept, i.e., extracting from each image the number of objects given by the corresponding cardinality, so as to improve the classification accuracy of visual concept learning and obtain the visual concept classifiers; each visual concept classifier trained in this step serves as input to the multi-task clustering step;
a multi-task clustering step: using the visual concept classifiers trained in the cardinality instance learning step, gathering nouns that refer to similar objects into one large class via multi-task clustering as a visual concept, so as to handle diversity among concepts and obtain more compact and robust visual concepts.
2. The unsupervised joint visual concept learning method based on images and text according to claim 1, characterized in that, in the text parsing step, the extraction of noun cardinalities is divided into "exact" and "approximate": an "exact" cardinality is determined by the numeral modifier preceding the noun, while an "approximate" plural-noun cardinality is defined as "2", since at least two such objects appear in the image.
3. The method as claimed in claim 1, characterized in that the cardinality instance learning step operates at the image region-block level rather than the whole-image level, because a natural image often contains multiple objects.
4. The unsupervised joint visual concept learning method based on images and text according to claim 1, characterized in that, in the cardinality instance learning step, images containing no corresponding instance are called "negative bags" and images containing at least one corresponding instance are called "positive bags"; the classification error of each "negative bag" is the maximum of all instance scores in the bag, and the classification error of each "positive bag" is the average error over the corresponding cardinality of instances; the final classification error function is the sum of the classification errors of all "positive bags" and "negative bags".
5. The method as claimed in claim 4, characterized in that, compared with methods that extract only one positive instance per bag, the cardinality instance learning step extracts more instances from each image and obtains a classifier with stronger generalization.
6. The method of claim 5, characterized in that, in the cardinality instance learning step, the classification error function is trained by stochastic gradient descent until the network converges.
7. The unsupervised joint visual concept learning method based on images and text according to any one of claims 1-6, characterized in that the objective function of the multi-task clustering step consists of both a clustering error and a regularization error.
8. The method of claim 7, characterized in that the regularization error comprises a penalty function measuring the weight magnitude and regularization functions measuring the similarity between classes.
9. An unsupervised joint visual concept learning system based on images and text for implementing the method of any one of claims 1-8, characterized by comprising a text parsing module, a cardinality instance learning module and a multi-task clustering module, wherein:
the text parsing module, for a given sentence description, performs part-of-speech tagging on each word in the sentence with a text parsing tool, extracts the corresponding nouns, and takes the singular and plural nouns as labels for the cardinality instance learning module; besides the nouns themselves, it extracts the cardinality, i.e., the quantity corresponding to each noun, as additional constraint information for the cardinality instance learning module;
the cardinality instance learning module first extracts the salient regions in the image corresponding to the sentence description, and then uses the cardinality information extracted by the previous module to guide multiple-instance learning in training a classifier for each visual concept, i.e., it extracts from each image the number of objects given by the corresponding cardinality, improving the classification accuracy of visual concept learning and obtaining the visual concept classifiers; each visual concept classifier trained by this module serves as input to the next module;
the multi-task clustering module, using the visual concept classifiers trained by the cardinality instance learning module, gathers nouns that refer to similar objects into one large class via multi-task clustering as a visual concept, handling diversity among concepts and obtaining more compact and robust visual concepts.
CN201610595620.3A 2016-07-26 2016-07-26 Unsupervised joint visual concept learning system and method based on images and text Active CN106227836B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610595620.3A CN106227836B (en) 2016-07-26 2016-07-26 Unsupervised joint visual concept learning system and method based on images and text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610595620.3A CN106227836B (en) 2016-07-26 2016-07-26 Unsupervised joint visual concept learning system and method based on images and text

Publications (2)

Publication Number Publication Date
CN106227836A CN106227836A (en) 2016-12-14
CN106227836B true CN106227836B (en) 2020-07-14

Family

ID=57533062

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610595620.3A Active CN106227836B (en) 2016-07-26 2016-07-26 Unsupervised joint visual concept learning system and method based on images and text

Country Status (1)

Country Link
CN (1) CN106227836B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106682696B (en) * 2016-12-29 2019-10-08 华中科技大学 The more example detection networks and its training method refined based on online example classification device
CN106815604B (en) * 2017-01-16 2019-09-27 大连理工大学 Method for viewing points detecting based on fusion of multi-layer information
US10803356B2 (en) * 2017-04-07 2020-10-13 Hrl Laboratories, Llc Method for understanding machine-learning decisions based on camera data
CN108205684B (en) * 2017-04-25 2022-02-11 北京市商汤科技开发有限公司 Image disambiguation method, device, storage medium and electronic equipment
CN108062574B (en) * 2017-12-31 2020-06-16 厦门大学 Weak supervision target detection method based on specific category space constraint


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9965704B2 (en) * 2014-10-31 2018-05-08 Paypal, Inc. Discovering visual concepts from weakly labeled image collections

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102930297A (en) * 2012-11-05 2013-02-13 北京理工大学 Emotion recognition method for enhancing coupling hidden markov model (HMM) voice-vision fusion
CN104217225A (en) * 2014-09-02 2014-12-17 中国科学院自动化研究所 A visual target detection and labeling method
CN104217008A (en) * 2014-09-17 2014-12-17 中国科学院自动化研究所 Interactive type labeling method and system for Internet figure video
CN105469041A (en) * 2015-11-19 2016-04-06 上海交通大学 Facial point detection system based on multi-task regularization and layer-by-layer supervision neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Grounded compositional semantics for finding and describing images with sentences";Socher等;《NIPS Deep Learning Workshop》;20131231;1-12页 *
"Neil:Extracting visual knowledge from web data";Chen等;《IEEE International Conference on Computer Vision》;20131208;1409-1416页 *

Also Published As

Publication number Publication date
CN106227836A (en) 2016-12-14


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant