EP2907079A1 - Procede de classification d'un objet multimodal - Google Patents
- Publication number
- EP2907079A1 (application EP13774134.4A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- modality
- matrix
- multimedia
- dictionary
- recoding
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/40—Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/583—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/5846—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/254—Fusion techniques of classification results, e.g. of results related to same input data
- G06F18/256—Fusion techniques of classification results, e.g. of results related to same input data of results relating to different input data, e.g. multimodal recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
- G06V30/413—Classification of content, e.g. text, photographs or tables
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/10—Recognition assisted with metadata
Definitions
- the present invention relates to a method of classifying a multimodal object.
- the present invention is in the field of the detection and automatic recognition of multimodal objects called "multimedia", that is to say described in at least two ways, for example objects formed by an image and a set of textual words associated with this image. More specifically, the present invention is in the field of supervised classification. It can be applied in particular to the classification and search of multimedia information in databases.
- a document or object called "multimedia" essentially comprises several modalities.
- a multimedia object may consist of an image accompanied by textual information, commonly designated "tags" according to the English terminology.
- a multimedia object may also consist of a web page with one or more images and textual content.
- a multimedia object may also consist, for example, of a digitized document divided into several channels, for example a channel comprising textual information coming from an optical character recognition method, commonly referred to as OCR, and a channel comprising the illustrations and photographs identified in the document.
- a multimedia object may also consist, for example, of a video sequence separated into several channels, for example a visual channel comprising the images of the video sequence, a sound channel comprising the soundtrack of the sequence, a textual channel comprising, for example, subtitles or textual information derived from a speech-to-text transcription method, and a channel comprising metadata relating to the video sequence, for example relating to the date, the author, the title, the format of the sequence, etc.
- a query may be multimedia in the desired form or limited to one of the modalities of the multimedia object sought; for example, in the case where the searched multimedia object is an image associated with textual tags, a query may comprise visual information alone, or textual information alone.
- the search then consists in finding in the database the multimedia documents most similar to the query, for example in order to present them in order of relevance.
- the processing of a multimedia document is delicate, because of the heterogeneous nature of the modalities defining it.
- in the classification of images associated with textual content, the visual modality can be transformed into feature vectors forming a low-level visual description; the textual modality can be mapped onto a dictionary reflecting a particular language or a subdomain thereof.
- in a supervised classification technique, features are extracted from a plurality of objects in order to feed a learning system, together with labels, so as to produce a model, this processing being performed offline.
- a so-called test object similarly undergoes a feature extraction, the extracted features being compared with the model produced offline to allow a prediction, these steps being performed online.
- in order to overcome the problem related to the heterogeneity of the modalities, it is possible, according to a first technique known as late fusion, to carry out the description and the classification of multimedia objects separately for the different modalities according to which the object is defined, then to merge the results obtained for the different modalities at a late stage.
- the late fusion technique is described in detail hereinafter with reference to FIG. 2.
- according to an alternative method, called early fusion, the modalities are merged at the feature extraction stage.
- the early fusion technique is described in detail hereinafter with reference to FIG. 3.
- An object of the present invention is to provide a more compact method of describing multimedia objects than the known methods, making it possible to combine the different modalities of multimedia objects so as to best describe their content, the method being able to operate independently of the actual content of the objects.
- the invention proposes that signatures be determined for multimedia objects, these signatures resulting from a combination of information according to the different modalities.
- the present invention is thus based on an early fusion technique, and relies firstly on multimedia codes allowing words according to a first modality, for example textual, to be coded on words according to a second modality, for example visual, extracted from a multimedia object, and secondly on the determination of "bag of multimedia words" signatures, analogous to the bag-of-words techniques used for monomodal objects, explained below.
- the subject of the invention is a method for classifying a multimodal test object, referred to as a multimedia test object, described according to at least a first and a second modality, characterized in that it comprises an offline step of constructing a recoding matrix.
- said recoding matrix is constructed at least by the following steps:
- said first modality may be textual
- said second modality may be visual
- the test object being a test image associated with textual tags
- said dictionary following the first modality being a textual dictionary
- said dictionary following the second modality being a visual dictionary
- the classification method may comprise a sequence of at least the following steps, performed offline:
- an unsupervised classification step, called clustering, of the normalized recoding matrix, generating the multimedia dictionary.
- the classification method may comprise a sequence of at least the following steps, performed online:
- the recoding step may be based on a locality-constrained linear coding technique.
- said normalization step may comprise a normalization of the recoding matrix row-wise according to the L1 norm.
- said clustering step can be performed using a K-means algorithm.
- the present invention also relates to a device for classifying a test object comprising means adapted for implementing a classification method according to one of the described embodiments.
- the present invention also relates to a computer program comprising instructions for implementing a classification method according to one of the described embodiments.
- An advantage provided by the present invention is that a method according to one of the described embodiments requires learning only a single multimedia model.
- FIG. 1 a diagram illustrating a supervised classification technique of images
- FIG. 2 a diagram illustrating a technique for supervised classification of multimodal documents, according to a late fusion method
- FIG. 3 a diagram illustrating a technique for supervised classification of multimodal documents, according to an early fusion method
- FIG. 4 a logic diagram illustrating a method of classifying a multimedia object according to an exemplary embodiment of the present invention
- FIG. 5 a diagram illustrating the principle of constructing a recoding matrix and a multimedia dictionary, in a method as illustrated by FIG. 4;
- Figure 6 is a diagram illustrating the main input and output data in a method as illustrated in Figure 4.
- Figure 7 is a diagram schematically illustrating a visual context recognition device according to an exemplary embodiment of the present invention.
- Figure 1 shows a diagram illustrating the supervised classification technique, introduced previously. It should be noted that the example illustrated in FIG. 1 applies to the classification of all types of objects, for example visual objects such as images, or textual objects.
- a supervised classification method comprises, in particular, a learning phase 11 carried out offline, and a testing phase 13 carried out online.
- the learning phase 11 and the test phase 13 each comprise a feature extraction step 111, 131 making it possible to describe an object, for example an image, with a vector of determined dimension.
- the learning phase 11 consists in extracting the features of a large number of learning objects 113; a series of signatures and the corresponding labels 112 feed a learning module 115, implementing a learning step and then producing a model 135.
- the test phase 13 consists in describing, by means of the feature extraction step 131, an object called test object 133 by a vector of the same nature as during the learning phase 11. This vector is applied to the input of the aforementioned model 135.
- the model 135 outputs a prediction 137 of the label of the test object 133.
- the prediction associates with the test object the most relevant label (or labels) among the set of possible labels.
- this relevance is calculated by means of a decision function associated with the model learned on the learning base, depending on the learning algorithm used.
- the label of an object indicates its degree of belonging to each of the concepts considered. For example, if three classes are considered, for example the classes "beach", "city" and "mountain", the label is a three-dimensional vector, each component of which is a real number, for example between 0 if the object does not contain the concept, and 1 if the object certainly contains the concept.
- the learning technique may be based on a technique known per se, such as the large-margin separator technique, commonly referred to as SVM ("Support Vector Machine"), a technique called "boosting", or a technique of the type designated by the acronym MKL ("Multiple Kernel Learning").
- Figure 2 presents a diagram illustrating a technique of supervised classification of multimodal documents, using a late fusion method.
- a supervised classification system for multimedia objects comprises, in particular, a learning phase 11 carried out offline, and a testing phase 13 carried out online.
- the learning phase 11 and the test phase 13 each comprise two feature extraction steps 111, 111' and 131, 131' making it possible to describe a multimedia object, bimodal in the example illustrated by the figure, for example an image associated with textual content.
- the learning phase 11 comprises a feature extraction step 111 according to a first modality, for example visual, and a feature extraction step 111' according to a second modality, for example textual.
- the learning phase 11 consists in extracting the features of a large number of learning objects 113; a series of signatures and corresponding labels 112 feed a first learning module 115 relating to the first modality, and a second learning module 115' relating to the second modality, the two learning modules 115, 115' implementing a learning step and then producing respectively a first model 135 according to the first modality, and a second model 135' according to the second modality.
- the test phase 13 consists in describing, by means of two feature extraction steps 131, 131' respectively according to the first and the second modality, a so-called test object 133 by vectors of the same nature, respectively according to the first and the second modality, as during the learning phase 11. These two vectors are applied respectively to the inputs of the two models 135, 135' above. Each model 135, 135' produces at its output respectively a first prediction 137 relating to the first modality and a second prediction 137' relating to the second modality, of labels of the test object 133. The labels following the two modalities are then merged in a fusion step 23, producing a single multimodal label; the fusion step 23 is thus applied only online. The prediction associates with the test object the most relevant label (or labels) among a set of possible labels.
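By way of illustration only (not part of the patent text), the late-fusion scheme of FIG. 2 can be sketched in Python, assuming numpy and scikit-learn and toy synthetic data; SVM classifiers stand in for the learning modules 115, 115', and averaging of class-membership scores stands in for the fusion step 23, one common choice among several:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Toy bimodal learning base: 100 objects with a 64-d visual signature,
# a 32-d textual signature, and one of 3 possible labels.
X_vis, X_txt = rng.normal(size=(100, 64)), rng.normal(size=(100, 32))
y = rng.integers(0, 3, size=100)

# Offline: one model per modality, trained independently (modules 115, 115').
model_vis = SVC(probability=True).fit(X_vis, y)
model_txt = SVC(probability=True).fit(X_txt, y)

# Online: per-modality predictions (137, 137') are merged late (step 23),
# here by averaging the class-membership scores of the two modalities.
x_vis, x_txt = rng.normal(size=(1, 64)), rng.normal(size=(1, 32))
p_fused = (model_vis.predict_proba(x_vis) + model_txt.predict_proba(x_txt)) / 2
print(p_fused.argmax(axis=1))  # index of the most relevant label
```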
- Figure 3 presents a diagram illustrating a technique of supervised classification of multimodal documents, according to an early fusion method.
- a supervised classification system for multimedia objects comprises, in particular, a learning phase 11 carried out offline and a test phase 13 carried out online.
- the learning phase 11 and the test phase 13 each comprise two feature extraction steps 111, 111' and 131, 131' making it possible to describe a multimedia object, bimodal in the example illustrated by the figure, for example an image associated with textual content.
- the learning phase 11 comprises a feature extraction step 111 according to a first modality, for example visual, and a feature extraction step 111' according to a second modality, for example textual.
- an early fusion step 31 makes it possible to generate multimedia characteristics 310 from the features extracted according to the first and the second modality at the feature extraction steps 111, 111'.
- a learning module 115 implementing a learning step makes it possible to generate a multimedia model 335 from the multimedia characteristics 310 generated during the early fusion step 31 and a plurality of labels 112.
- an early fusion step 33, operating identically to the early fusion step 31 applied during the learning phase 11, makes it possible to generate multimedia characteristics 330 online, from the features extracted according to the first and second modalities at the feature extraction steps 131, 131' from a test object 133.
- the multimedia model 335 outputs a prediction 337 of the label of the test object 133.
- the prediction associates with the test object the most relevant label (or labels) among the set of possible labels.
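Again as an illustration only, a minimal early-fusion sketch under the same assumptions (numpy, scikit-learn, toy data); concatenation of the per-modality features is one simple way to realize the fusion steps 31 and 33, after which a single multimedia model 335 is learned:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_vis, X_txt = rng.normal(size=(100, 64)), rng.normal(size=(100, 32))
y = rng.integers(0, 3, size=100)

# Early fusion (step 31): merge per-modality features into multimedia
# characteristics (310) before any learning, here by concatenation.
X_mm = np.hstack([X_vis, X_txt])
model = SVC(probability=True).fit(X_mm, y)  # single multimedia model (335)

# Online (step 33): the test object undergoes the same fusion, and a
# single prediction (337) is produced by the multimedia model.
x_test = np.hstack([rng.normal(size=(1, 64)), rng.normal(size=(1, 32))])
print(model.predict_proba(x_test).argmax(axis=1))
```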
- a classification method according to the present invention is based on the early fusion principle illustrated above.
- a classification method according to the present invention applies in particular to the feature extraction steps.
- Such extraction techniques involve a step of extracting local descriptors from an image, then reconstructing a final signature by a so-called "bag of visual words" approach, commonly designated by the initials BOV corresponding to the English terminology "Bag of Visual Words" or "Bag of Visterms".
- one or a plurality of local descriptors are extracted from the image considered, from pixels or "patches" sampled densely in the image, or more generally from sites in the image.
- local descriptors are associated with as many patches, which can in particular be defined by their location or locality, for example by coordinates (x, y) in a Cartesian coordinate system in which is also defined the domain of considered image, a patch that can be limited to one pixel, or consist of a block of a plurality of pixels.
- the local descriptors are then recoded, during a coding step, in a feature space, according to a reference dictionary commonly referred to as a "codebook".
- the recoded vectors are then aggregated during an aggregation or "pooling" step into a single signature vector.
- the coding step can in particular be based on a so-called "hard coding" technique, commonly referred to as "Hard Coding" or by the corresponding acronym HC.
- Hard coding techniques are for example described in the publication by S. Lazebnik, C. Schmid and J. Ponce, "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories", cited above, or in the publication by J. Sivic and A. Zisserman, "Video Google: a text retrieval approach to object matching in videos", ICCV, 2003.
- a local descriptor is recoded into a vector having a single "1" at the dimension corresponding to the index of its nearest neighbor in the reference dictionary, and "0" everywhere else.
- a hard-coding step thus leads to the production of a histogram of occurrences of the visual words of the reference dictionary, a visual word of the reference dictionary being considered present when it is the closest to a local descriptor of the image considered.
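A minimal sketch of hard coding followed by average-type pooling, in Python with numpy (the descriptors and codebook here are random stand-ins, not the patent's data):

```python
import numpy as np

def bov_hard_coding(descriptors, codebook):
    """Hard coding: each local descriptor votes for its single nearest
    visual word; the image signature is the occurrence histogram,
    normalized here as an average-type pooling.

    descriptors: (n, d) local descriptors of one image
    codebook:    (K_v, d) reference dictionary W_v
    """
    # Squared Euclidean distance between every descriptor and every word.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)  # index of the nearest visual word
    hist = np.bincount(nearest, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)

rng = np.random.default_rng(0)
signature = bov_hard_coding(rng.normal(size=(200, 128)), rng.normal(size=(50, 128)))
```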
- the coding step can also be based on a so-called "soft coding" technique, commonly referred to as "Soft Coding" or by the acronym SC.
- a soft coding technique is notably described in the publication by J. Van Gemert, C. Veenman, A. Smeulders and J. Geusebroek, "Visual word ambiguity", PAMI, 2009.
- a local descriptor is recoded according to its similarity to each of the visual words of the reference dictionary.
- the similarity is for example calculated as a decreasing function of the distance, typically an exponential of the opposite of the distance.
- the coding step can also be based on a so-called "locality-constrained linear coding" technique, commonly referred to as "Locality-constrained Linear Coding" or by the corresponding acronym LLC. LLC-like techniques are described in particular in the publication by S. Gao, I. Tsang, L. Chia and P. Zhao, "Local features are not lonely - Laplacian sparse coding for image classification", CVPR, 2011, in the publication by L. Liu, L. Wang and X. Liu, "In defense of soft-assignment coding", CVPR, 2011, or in the publication by J. Yang, K. Yu, Y. Gong and T. Huang. The principle of this technique is to restrict soft-type coding to the nearest neighbors of the descriptors in the feature space, for example from 5 to 20 nearest neighbors in the reference dictionary. In this way, the coding noise can be reduced significantly.
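A sketch of a locality-constrained soft assignment of this kind, assuming Euclidean distances and an exponential similarity; the restriction to the k nearest words is what distinguishes it from plain soft coding:

```python
import numpy as np

def locality_constrained_coding(descriptor, codebook, k=5, sigma=1.0):
    """Soft assignment restricted to the k nearest words of the reference
    dictionary (locality constraint), which reduces coding noise.
    """
    dist = np.linalg.norm(codebook - descriptor, axis=1)
    neighbors = np.argsort(dist)[:k]            # k nearest visual words
    code = np.zeros(len(codebook))
    weights = np.exp(-sigma * dist[neighbors])  # closer words weigh more
    code[neighbors] = weights / weights.sum()   # normalized responses
    return code
```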
- the coding step can also be based on a so-called "locally constrained salient coding" technique, commonly referred to as "Locally-constrained Salient Coding", where each descriptor is coded only on its nearest neighbor by associating it with a response, referred to as "saliency", which depends on the relative distances of the nearest neighbors to the descriptor. In other words, the smaller the distance from the nearest neighbor to the descriptor compared to the distances of the other close neighbors to the same descriptor, the greater the saliency.
- a technique of the "saliency coding" type is particularly described in the publication by Y. Huang, K. Huang, Y. Yu and T. Tan, "Salient coding for image classification", CVPR, 2011.
- FIG. 4 presents a logic diagram illustrating a method of classifying a multimedia object according to an exemplary embodiment of the present invention.
- the exemplary embodiment described below with reference to FIG. 4 applies to the description and classification of image-type multimedia objects associated with textual content, for example textual tags. It should be noted that this is a non-limiting example of the present invention, and that modalities other than visual or textual ones can be envisaged and treated in a similar manner. In addition, the example described below applies to bimodal objects, but a larger number of modalities may be considered.
- the classification method may comprise a first preliminary step 401, making it possible to compute the local visual characteristics on a learning base, and to deduce therefrom a visual dictionary W_v of size K_v, for example by an unsupervised classification method, denoted by the term "clustering", for example according to the K-means algorithm, making it possible to partition the local descriptors into a plurality k of sets so as to minimize the error of reconstruction of the descriptors by the centroid inside each partition. It is also possible to use other methods of learning the reference dictionary, such as for example the random drawing of local descriptors, or sparse coding.
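As a sketch of step 401 (under the assumption of scikit-learn's KMeans and random stand-in descriptors), the K_v centroids of the clustered local descriptors form the visual dictionary W_v:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Local visual descriptors extracted on the learning base (e.g. 128-d,
# SIFT-like); random stand-ins here.
local_descriptors = rng.normal(size=(10000, 128))

# Step 401: unsupervised clustering; the K_v centroids form the visual
# dictionary W_v.
K_v = 256
W_v = KMeans(n_clusters=K_v, n_init=1, random_state=0).fit(local_descriptors).cluster_centers_
```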
- the classification method may also comprise a second preliminary step 403, which may for example be performed before, after, or in parallel with the first preliminary step 401, making it possible to construct a textual dictionary W_T by selecting textual tags representative of a learning corpus, or by using a specific ad hoc dictionary, the textual dictionary W_T being of size K_T.
- each multimedia object, that is to say each image with textual content in the example described, is represented by a plurality of textual tags, able to be subsequently coded by one of the K_T possible textual tags forming the textual dictionary W_T, and a plurality of visual words, able to be subsequently coded by one of the K_v possible visual words forming the visual dictionary W_v.
- the classification method may then comprise a step 405 of extracting the local characteristics of the image, in which the local characteristics of the image are extracted and coded on the visual dictionary W_v, and then aggregated according to a pooling technique.
- the coding may for example be a hard coding and consist in determining the occurrence of visual words of the visual dictionary W v closest to the local characteristics of the image, followed for example by an aggregation of average type.
- the extraction step 405 mentioned above may be followed by a step 407 of constructing a matrix for recoding the textual tags, with K_v rows and K_T columns, denoted X, whose coefficients are denoted X(i, j), i being an integer between 1 and K_v, and j being an integer between 1 and K_T, the recoding matrix X expressing the frequency of each visual word of the visual dictionary W_v for each textual tag of the textual dictionary W_T.
- the construction step 407 may, for example, begin with a recoding matrix X that is all zeros, and then increment the coefficient X(i, j) by 1 each time a learning image associated with the textual tag j has a local visual characteristic close to the visual word i.
- the construction step 407 of the recoding matrix X may be followed by a normalization step 409 of the recoding matrix X, for example according to the L1 norm per row.
- the normalization step 409 of the recoding matrix X can then be followed by a clustering step 411 on the columns of the recoding matrix X, for example according to a K-means algorithm or another of the clustering algorithms cited previously.
- a multimedia dictionary W_m can be obtained, whose size is K_m.
- the multimedia dictionary W_m then forms a new representation space for the multimedia objects; the rows of the multimedia dictionary W_m thus constitute multimedia words.
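A minimal sketch of the offline steps 407 to 411, under the conventions of this section (X with K_v rows and K_T columns, L1 normalization per row, clustering of the columns); numpy and scikit-learn are assumed, and all names are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_multimedia_dictionary(tag_sets, visual_signatures, K_T, K_m):
    """Steps 407-411: build the recoding matrix X (K_v rows x K_T columns),
    L1-normalize its rows, then cluster its columns; the K_m centroids
    are the multimedia words forming the dictionary W_m.

    tag_sets:          per learning image, the set of its tag indices
    visual_signatures: (N, K_v) visual word occurrences per image
    """
    K_v = visual_signatures.shape[1]
    X = np.zeros((K_v, K_T))
    # Step 407: for each tag j of each image, accumulate the occurrences
    # of every visual word i of that image.
    for tags, v in zip(tag_sets, visual_signatures):
        for j in tags:
            X[:, j] += v
    # Step 409: L1 normalization per row.
    X /= np.maximum(np.abs(X).sum(axis=1, keepdims=True), 1e-12)
    # Step 411: clustering of the columns of X (each column = one tag
    # recoded on the visual words); the centroids are W_m.
    W_m = KMeans(n_clusters=K_m, n_init=1, random_state=0).fit(X.T).cluster_centers_
    return X, W_m
```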
- Each textual tag, represented by a column of the recoding matrix X, can then be recoded on this new representation space, during a recoding step 413.
- Several coding methods can be applied.
- the coding may in particular be based on one of the abovementioned techniques, that is to say on a "hard coding" technique, a "soft coding" technique, a "locality-constrained linear coding" technique, or a "locally constrained salient coding" technique.
- a textual tag x_i, that is to say a column of the recoding matrix X of a given image, is the descriptor that must be coded on the multimedia dictionary W_m according to relation (1) below:
- x_i denotes a column of the recoding matrix X corresponding to the textual tag considered; z_i, a vector of size K_m, is the code recoding x_i on the multimedia dictionary; N_k(x_i) denotes the set of the k nearest neighbors of the vector x_i (k may for example be chosen equal to 5); σ denotes a control parameter: the larger it is, the less influence the most remote multimedia words have on the coding; m_j and m_r are the multimedia words obtained previously.
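Relation (1) itself is not reproduced in this extract. Based on the symbol definitions above and on the locality-constrained soft-assignment techniques cited earlier, a plausible reconstruction, reading N_k(x_i) as the k nearest multimedia words of x_i, is (an assumption, not the patent's verbatim formula):

$$
z_i(j) =
\begin{cases}
\dfrac{\exp\left(-\sigma \lVert x_i - m_j \rVert\right)}{\displaystyle\sum_{m_r \in N_k(x_i)} \exp\left(-\sigma \lVert x_i - m_r \rVert\right)} & \text{if } m_j \in N_k(x_i) \\
0 & \text{otherwise}
\end{cases}
$$

With this form, a larger σ indeed reduces the influence of the most remote multimedia words, consistent with the description of the control parameter above.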
- the recoding step 413 may finally be followed by an aggregation step 415, called “pooling", aggregating the recoded text tags into a single vector representing the image.
- the aggregation step can be based on a sum, an average, or the maximum of each dimension, that is to say the maximum per multimedia word, the latter method being commonly referred to as "max pooling".
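A minimal sketch of the online steps 413 and 415 under the same conventions (locality-constrained recoding of each tag column on W_m, then max pooling); numpy assumed, names hypothetical:

```python
import numpy as np

def multimedia_signature(tag_columns, W_m, k=5, sigma=1.0):
    """Steps 413-415: recode each textual tag of the test image (a column
    of X) on the multimedia dictionary W_m with a locality-constrained
    soft assignment, then aggregate the codes by max pooling.

    tag_columns: (n_tags, K_v) columns of X for the tags of the test image
    W_m:         (K_m, K_v) multimedia dictionary (rows = multimedia words)
    """
    codes = []
    for x in tag_columns:
        dist = np.linalg.norm(W_m - x, axis=1)
        nn = np.argsort(dist)[:k]          # k nearest multimedia words
        z = np.zeros(len(W_m))
        w = np.exp(-sigma * dist[nn])
        z[nn] = w / w.sum()
        codes.append(z)
    Z = np.stack(codes)                    # recoded matrix Z (n_tags x K_m)
    return Z.max(axis=0)                   # "max pooling" per multimedia word
```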
- FIG. 5 presents a diagram illustrating the principle of constructing a recoding matrix X and a multimedia dictionary W_m, implemented during the construction step 407 in a method as illustrated by FIG. 4.
- a visual word occurrence matrix 501 can be learned on a learning base comprising a plurality of N images.
- the occurrence matrix of visual words 501 thus comprises N rows and K_v columns.
- Each row of the visual word occurrence matrix 501 contains the visual signature of one of the N images.
- a first intermediate matrix 503, denoted by V, can be constructed, comprising K_T columns, each column corresponding to a textual tag.
- the intermediate matrix 503 can be constructed from a null matrix; then, in a given column of the intermediate matrix 503, for each image among the plurality N, the presence or absence of each textual tag is noted, the presence of a textual tag in an image introducing the value "1" in the column corresponding to this textual tag.
- an image I_m is associated with the textual tags t_1 and t_j
- an image I_n is associated with the textual tags t_1 and t_k.
- the visual words for which the textual tag in question is present can be collected, that is to say the visual words associated with the value 1 in the column of the first intermediate matrix 503 corresponding to the textual tag considered; this action can form a process step, represented by a block 504 in FIG. 5.
- a second intermediate matrix 505 can then be constructed, this matrix comprising K_v columns and K_T rows. For each row, that is to say for each textual tag of the textual dictionary W_T, an aggregation of the occurrences of the corresponding visual words collected during the previous step is carried out. For example, the occurrences of visual words for which a given textual tag is present can be summed; an average or a maximum can also be retained.
- the coefficients composing the second intermediate matrix 505 can be formulated according to the following relation (2):
- d_k denotes the k-th document in the training base D
- t_j denotes a textual tag in the set of textual tags T_dk relating to the document d_k
- V(i, k) denotes the occurrence of the i-th visual word in the document d_k.
- the coefficients composing the second intermediate matrix 505 can be formulated according to the following relation (3):
- D denotes the learning base comprising N images
- d_k denotes the k-th document in the training base D
- t_j denotes a textual tag in the set of textual tags T_dk relating to the document d_k
- V(i, k) denotes the occurrence of the i-th visual word in the document d_k.
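Relations (2) and (3) themselves are not reproduced in this extract. Consistent with the sum, average and maximum aggregations described above, plausible forms for the coefficients of the second intermediate matrix 505 (assumptions, not the patent's verbatim formulas) are a summed variant:

$$
\tilde{X}(j, i) = \sum_{\substack{d_k \in D \\ t_j \in T_{d_k}}} V(i, k)
$$

and, for example, a maximum variant:

$$
\tilde{X}(j, i) = \max_{\substack{d_k \in D \\ t_j \in T_{d_k}}} V(i, k)
$$

where the sum or maximum runs over the documents d_k of the learning base D whose tag set T_dk contains the tag t_j.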
- the recoding matrix X can then be obtained from a normalization of the second intermediate matrix 505, for example per row according to the L1 norm.
- the multimedia dictionary W_m can then be obtained from a clustering on the columns of the recoding matrix X, for example according to a K-means algorithm or another of the clustering algorithms mentioned above.
- Figure 6 shows a diagram illustrating the main input and output data in a classification method according to the logic diagram of Figure 4, as described above.
- FIG. 6 illustrates an example of a recoding matrix X, whose columns correspond to as many textual tags of the textual dictionary W_T, and whose rows correspond to as many visual words of the visual dictionary W_v.
- the recoding matrix X allows the construction of the multimedia dictionary W_m, via a clustering step 411 as described above with reference to FIG. 4.
- Each textual tag of a test image 533 can then be recoded on the multimedia dictionary W_m, during the recoding step 413 described above with reference to FIG. 4.
- a recoded matrix Z can thus be obtained.
- the recoded matrix Z comprises as many rows as textual tags associated with the test image 533, and as many columns as multimedia words of the multimedia dictionary W_m.
- FIG. 7 is a diagram schematically illustrating a visual context recognition device according to an exemplary embodiment of the present invention.
- a classification device may be implemented by dedicated computing means or via software instructions executed by a microprocessor connected to a data memory.
- FIG. 7 describes the classification device in a non-limiting manner in terms of software modules, it being understood that certain modules described can be subdivided into several modules, or grouped together.
- the classification device 70 receives as input a multimedia object I in digital form, for example captured by input means arranged upstream, not shown in the figure.
- a microprocessor 700 connected to a data memory 702 allows the implementation of software modules whose software instructions are stored in the data memory 702 or a dedicated memory.
- the images, textual tags or other objects according to the defined modalities, and the descriptors, may be stored in a memory 704 forming a database.
- the classification device may be configured to implement a classification method according to one of the described embodiments.
- the implementation of a classification method can be carried out by means of a computer program comprising instructions provided for this purpose.
- the computer program can be recorded on a processor-readable recording medium.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Multimedia (AREA)
- Databases & Information Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Software Systems (AREA)
- Medical Informatics (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Health & Medical Sciences (AREA)
- Library & Information Science (AREA)
- Image Analysis (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
FR1259769A FR2996939B1 (fr) | 2012-10-12 | 2012-10-12 | Procede de classification d'un objet multimodal |
PCT/EP2013/070776 WO2014056819A1 (fr) | 2012-10-12 | 2013-10-07 | Procede de classification d'un objet multimodal |
Publications (1)
Publication Number | Publication Date |
---|---|
EP2907079A1 (fr) | 2015-08-19 |
Family
ID=47741006
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP13774134.4A Withdrawn EP2907079A1 (fr) | 2012-10-12 | 2013-10-07 | Procede de classification d'un objet multimodal |
Country Status (4)
Country | Link |
---|---|
US (1) | US9569698B2 (fr) |
EP (1) | EP2907079A1 (fr) |
FR (1) | FR2996939B1 (fr) |
WO (1) | WO2014056819A1 (fr) |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104239359B (zh) * | 2013-06-24 | 2017-09-01 | 富士通株式会社 | 基于多模态的图像标注装置以及方法 |
CN103561276B (zh) * | 2013-11-07 | 2017-01-04 | 北京大学 | 一种图像视频编解码方法 |
US9964499B2 (en) * | 2014-11-04 | 2018-05-08 | Toshiba Medical Systems Corporation | Method of, and apparatus for, material classification in multi-energy image data |
CN105095863B (zh) * | 2015-07-14 | 2018-05-25 | 西安电子科技大学 | 基于相似性权值的半监督字典学习的人体行为识别方法 |
CA3017697C (fr) * | 2016-03-17 | 2021-01-26 | Imagia Cybernetics Inc. | Procede et systeme pour traiter une tache avec robustesse par rapport a des informations d'entree manquantes |
KR102593438B1 (ko) | 2017-11-17 | 2023-10-24 | 삼성전자주식회사 | 뉴럴 네트워크 학습 방법 및 디바이스 |
US11528248B2 (en) * | 2020-06-10 | 2022-12-13 | Bank Of America Corporation | System for intelligent multi-modal classification in a distributed technical environment |
CN113642598B (zh) * | 2021-06-25 | 2024-02-23 | 南京邮电大学 | 基于显著性编码和软分配的局部聚合描述子向量算法 |
CN117476247B (zh) * | 2023-12-27 | 2024-04-19 | 杭州乐九医疗科技有限公司 | 一种疾病多模态数据智能分析方法 |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6298169B1 (en) * | 1998-10-27 | 2001-10-02 | Microsoft Corporation | Residual vector quantization for texture pattern compression and decompression |
US6968092B1 (en) * | 2001-08-21 | 2005-11-22 | Cisco Systems Canada Co. | System and method for reduced codebook vector quantization |
US7129954B2 (en) * | 2003-03-07 | 2006-10-31 | Kabushiki Kaisha Toshiba | Apparatus and method for synthesizing multi-dimensional texture |
JP4199170B2 (ja) * | 2004-07-20 | 2008-12-17 | 株式会社東芝 | 高次元テクスチャマッピング装置、方法及びプログラム |
US8532927B2 (en) * | 2008-11-07 | 2013-09-10 | Intellectual Ventures Fund 83 Llc | Generating photogenic routes from starting to destination locations |
US8442309B2 (en) * | 2009-06-04 | 2013-05-14 | Honda Motor Co., Ltd. | Semantic scene segmentation using random multinomial logit (RML) |
US8171049B2 (en) * | 2009-09-18 | 2012-05-01 | Xerox Corporation | System and method for information seeking in a multimedia collection |
FR2989494B1 (fr) * | 2012-04-16 | 2014-05-09 | Commissariat Energie Atomique | Procede de reconnaissance d'un contexte visuel d'une image et dispositif correspondant |
US20140229307A1 (en) * | 2013-02-12 | 2014-08-14 | Ebay Inc. | Method of identifying outliers in item categories |
- 2012
- 2012-10-12 FR FR1259769A patent/FR2996939B1/fr not_active Expired - Fee Related
- 2013
- 2013-10-07 WO PCT/EP2013/070776 patent/WO2014056819A1/fr active Application Filing
- 2013-10-07 EP EP13774134.4A patent/EP2907079A1/fr not_active Withdrawn
- 2013-10-07 US US14/434,723 patent/US9569698B2/en not_active Expired - Fee Related
Non-Patent Citations (2)
Title |
---|
None * |
See also references of WO2014056819A1 * |
Also Published As
Publication number | Publication date |
---|---|
FR2996939B1 (fr) | 2014-12-19 |
US9569698B2 (en) | 2017-02-14 |
WO2014056819A1 (fr) | 2014-04-17 |
US20150294194A1 (en) | 2015-10-15 |
FR2996939A1 (fr) | 2014-04-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP2907079A1 (fr) | Procede de classification d'un objet multimodal | |
US11709883B2 (en) | Image based content search and recommendations | |
US8213725B2 (en) | Semantic event detection using cross-domain knowledge | |
JP5351958B2 (ja) | デジタルコンテンツ記録のための意味論的イベント検出 | |
US20230376527A1 (en) | Generating congruous metadata for multimedia | |
US10528613B2 (en) | Method and apparatus for performing a parallel search operation | |
US20120179704A1 (en) | Textual query based multimedia retrieval system | |
Natarajan et al. | BBN VISER TRECVID 2013 Multimedia Event Detection and Multimedia Event Recounting Systems. | |
US10489681B2 (en) | Method of clustering digital images, corresponding system, apparatus and computer program product | |
Jiang | Super: towards real-time event recognition in internet videos | |
EP2839410B1 (fr) | Procédé de reconnaissance d'un contexte visuel d'une image et dispositif correspondant | |
Kalaiarasi et al. | Clustering of near duplicate images using bundled features | |
JP5592337B2 (ja) | コンテンツ変換方法、コンテンツ変換装置及びコンテンツ変換プログラム | |
Guzman-Zavaleta et al. | A robust and low-cost video fingerprint extraction method for copy detection | |
Bhattacharya et al. | A survey of landmark recognition using the bag-of-words framework | |
Belhi et al. | CNN Features vs Classical Features for Largescale Cultural Image Retrieval | |
US12020484B2 (en) | Methods and systems for grouping of media based on similarities between features of the media | |
US20220292809A1 (en) | Methods and systems for grouping of media based on similarities between features of the media | |
Diou et al. | Vitalas at trecvid-2008 | |
Lisena et al. | Understanding videos with face recognition: a complete pipeline and applications | |
SOMASEKAR et al. | VECTORIZATION USING LONG SHORT-TERM MEMORY NEURAL NETWORK FOR CONTENT-BASED IMAGE RETRIEVAL MODEL | |
Ishikawa et al. | Uni-and multimodal methods for single-and multi-label recognition | |
Wang et al. | Scene image retrieval via re-ranking semantic and packed dense interestpoints | |
EP3420470A1 (fr) | Procédé de description de documents multimedia par traduction inter-modalités, système et programme d'ordinateur associés | |
Anwar et al. | Recent progress in attributes based learning: A survey |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase | Free format text: ORIGINAL CODE: 0009012 |
 | 17P | Request for examination filed | Effective date: 20150325 |
 | AK | Designated contracting states | Kind code of ref document: A1. Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
 | AX | Request for extension of the european patent | Extension state: BA ME |
 | DAX | Request for extension of the european patent (deleted) | |
 | STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: EXAMINATION IS IN PROGRESS |
 | 17Q | First examination report despatched | Effective date: 20190129 |
 | GRAP | Despatch of communication of intention to grant a patent | Free format text: ORIGINAL CODE: EPIDOSNIGR1 |
 | RIC1 | Information provided on ipc code assigned before grant | Ipc: G06K 9/00 20060101AFI20200113BHEP. Ipc: G06F 17/18 20060101ALI20200113BHEP. Ipc: G06F 16/40 20190101ALI20200113BHEP. Ipc: G06K 9/62 20060101ALI20200113BHEP |
 | STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: GRANT OF PATENT IS INTENDED |
 | INTG | Intention to grant announced | Effective date: 20200220 |
 | STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN |
 | 18D | Application deemed to be withdrawn | Effective date: 20200702 |