CN101714262B - Method for reconstructing three-dimensional scene of single image - Google Patents

Method for reconstructing three-dimensional scene of single image

Info

Publication number
CN101714262B
Authority
CN
China
Prior art keywords
scene
dimensional
image
information
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN2009102424163A
Other languages
Chinese (zh)
Other versions
CN101714262A (en)
Inventor
王亦洲
张哲斌
高文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Shenrui Bolian Technology Co Ltd
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University filed Critical Peking University
Priority to CN2009102424163A priority Critical patent/CN101714262B/en
Publication of CN101714262A publication Critical patent/CN101714262A/en
Application granted Critical
Publication of CN101714262B publication Critical patent/CN101714262B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a method for reconstructing a three-dimensional scene from a single image, which comprises the following steps: an image input step, in which each frame of an image sequence is input; a feature extraction step, in which features are extracted from the scene of the image, scene classification and object recognition are carried out on the extracted features to acquire semantic information, and monocular geometric information is extracted from the same features; an object detection step, in which object detection is carried out on the extracted features with the scene classification as a reference; a three-dimensional graphic primitive model selection step, in which a three-dimensional graphic primitive model is selected according to the result of the object detection; and a scene three-dimensional model generation step, in which the scene three-dimensional model is inferred and verified from the scene semantic prior, the three-dimensional graphic primitive model and the monocular geometric information to generate the final scene three-dimensional model.

Description

Method for reconstructing a three-dimensional scene from a single image
Technical field
The present invention relates to a method for reconstructing a three-dimensional scene from a single image, and belongs to the technical fields of computer vision and image processing.
Background art
Recovering the three-dimensional structure of a scene from a single image is a classical and typically ill-posed problem in computer vision. The difficulty is that the image is formed by projecting the scene content from three-dimensional space onto a two-dimensional imaging plane through a camera, so geometric information about the three-dimensional structure of the scene is inevitably lost, which makes the problem ill-posed. Humans, however, can accurately recognize three-dimensional information such as the spatial structure and depth relations of a scene from a single image through their own cognition. Therefore, work on three-dimensional scene reconstruction from a single image in the field of computer vision has always sought to imitate human cognition: starting from the pixel-level information of the image, it obtains the various kinds of information that help the three-dimensional scene to be understood and thereby realizes three-dimensional scene reconstruction.
Research on three-dimensional reconstruction from a single image has long been one of the hot topics in computer vision, and a large number of papers proposing new methods or theories are published every year. Classical single-image three-dimensional reconstruction methods include: inferring vanishing points and vanishing lines from the parallel lines and parallel planes present in the scene, thereby obtaining scene geometric information; using texture consistency constraints in the scene to recover the spatial depth (layer) information of positions with similar texture from their scale relations; and inferring the spatial information of the scene from the sharpness with which different regions are imaged, which in natural scenes is determined by atmospheric density, fog or the camera focal length.
The obvious problem with the above methods is their strong dependence on the data: clearly not every image exhibits all the cues they require, so the reconstruction methods cannot always be applied. The root cause is that they process only low-level image features and ignore the crucial role that high-level scene semantics and their associated constraints play in understanding the (three-dimensional structure of the) scene, which creates a semantic gap between image information and scene information. In recent years, the rapid development in computer vision of scene-understanding models, scene classification techniques, object recognition techniques and machine learning methods has aimed precisely at filling this gap. For example, Saxena et al. at Stanford University and Hoiem et al. at the University of Illinois at Urbana-Champaign proposed using machine learning to fuse multiple bottom-up cues to learn a scene model and then infer the three-dimensional structure of the scene. Saxena uses depth maps acquired with a depth-sensing device for model learning and establishes the relation between image features and depth under a Markov random field (MRF) model, and on that basis performs three-dimensional scene reconstruction. However, because the scene is represented with a three-dimensional mesh model, there is no explicit association between the three-dimensional model and particular object categories, so higher-level semantic information cannot be used to assist the reconstruction. Hoiem et al. first roughly divide the scene into sky, ground and upright foreground objects standing on the ground, and train classifiers to relate image features to these categories, so that the geometric attribute (sky, ground or upright foreground object) of each pixel in the image can be inferred; this approach likewise does not make the object categories explicit.
In addition, Peking University filed application No. 200810224347.9 on March 11, 2009, entitled "An image-based three-dimensional reconstruction method" (Patent Document 1). Patent Document 1 discloses a multi-image three-dimensional reconstruction method based on feature-point constraints, which comprises: performing three-dimensional reconstruction of the feature points of each image, that is, computing the feature points of each image, reconstructing each feature point in three dimensions, and determining the search range of the spatial position of the point to be reconstructed; and sampling points within the search range, projecting all sampled points onto the images, and then obtaining the position of the point to be reconstructed in space according to color consistency. Patent Document 1 also discloses a single-image three-dimensional reconstruction technique, which uses the results of statistical learning either to classify certain structures in the scene such as sky, ground and building facades or to obtain the relation between a feature description of the scene and depth, and uses these classification results or feature-depth correspondences to carry out a simple three-dimensional reconstruction of the scene. However, it still does not solve the technical problems described above.
Summary of the invention
The object of the present invention is to provide a method for reconstructing a three-dimensional scene from a single image based on scene classification and object recognition. Compared with existing single-image reconstruction methods, this method greatly relaxes the constraints on the image data, has wider applicability, and at the same time improves the performance of the reconstruction. It borrows from machine learning the idea of establishing relations between scene geometric attributes and image features, and at the same time introduces a representation based on three-dimensional graphic primitives; by reasoning and computing over the three-dimensional graphic primitives and their combinations, it forms a three-dimensional scene reconstruction method that fuses low-level image features, high-level semantic information (scene category, object category) and a basic-primitive representation.
The method for reconstructing a three-dimensional scene from a single image according to the first aspect of the present invention comprises the following steps: an image input step, in which each image in an image sequence is input; a feature extraction step, in which features are extracted from the scene of the image, scene classification and object recognition are carried out on the extracted features to obtain semantic information, and monocular geometric information is extracted from the same features; an object detection step, in which object detection is carried out on the extracted features with reference to the scene classification; a three-dimensional graphic primitive model selection step, in which a three-dimensional graphic primitive model is selected according to the result of the object detection; and a scene three-dimensional model generation step, in which the scene three-dimensional model is inferred and verified from the scene semantic prior, the three-dimensional graphic primitive model and the monocular geometric information, thereby generating the final scene three-dimensional model.
In the above method for reconstructing a three-dimensional scene from a single image, an object-part detection step is further included between the object detection step and the three-dimensional graphic primitive model selection step, in which object parts are detected based on the result of the object detection; in the three-dimensional graphic primitive model selection step, the three-dimensional graphic primitive model is then selected according to the results of both the object detection and the object-part detection.
In the above method, in the feature extraction step, a context-sensitive image grammar over the scene of the image, together with bottom-up and top-down inference, is used to describe the semantic information of the scene hierarchically, describing the scene of the image in four levels: a scene classification layer, an object layer, an object-part layer and an image feature layer.
In the above method, in the hierarchical description of the scene semantic information, the basic semantic information of the scene of the image is obtained through scene classification and object recognition, while the context relations among the various components of the scene, i.e. a prior model, are used to reinforce the semantic information and to constrain the basic semantic information.
In the above method, a Markov random field is used to describe the spatial and semantic relations among the elements within the object layer and the object-part layer, and a context-free grammar is used to model the containment or membership relations between elements of different layers, thereby forming a unified description from basic pixel information up to image feature information, object-part information, object category information and scene category information.
In the above method, the semantic constraint relations and the monocular geometric information are used to verify and combine the three-dimensional graphic primitives, thereby solving for the scene three-dimensional model of the whole scene.
In the above method for reconstructing a three-dimensional scene from a single image, the following mathematical model (Model 1) is used in the scene three-dimensional model generation step:

$$M \sim P(M \mid I) \propto P(I \mid M)\,P(M)$$

$$M^{*} = \arg\max_{M} P(I \mid M)\,P(M)$$

where P(I|M) is the likelihood model, I is the input single image, and M is the three-dimensional interpretation of the single image, i.e. M represents the scene three-dimensional model, with

$$M = (n, m_1, m_2, \ldots, m_n), \qquad m_i = (l_i, \theta_i)$$

The scene three-dimensional model M consists of n submodels m_i; each submodel m_i is specified by a class label l_i, which indicates what kind of object it is, and by a parameter θ_i, which specifies the position and pose of the submodel in world coordinates.

The likelihood model P(I|M) has the following form:

$$P(I \mid M) = \prod_{i=1}^{n} P(I \mid s_i, m_i)\,\prod_{i=1}^{n} P(s_i \mid m_i)$$

In the likelihood model P(I|M), the factor P(I|s_i, m_i) is expressed through a fitting term φ_i(s_i, m_i, f_i(I)), which measures, under the inferred scene semantic labeling, the degree of fit with the corresponding part of the original image sequence; f_i(I) denotes the image features corresponding to the projection of the three-dimensional submodel m_i into the original image. The factor P(s_i|m_i) expresses the reliability of selecting a particular three-dimensional primitive given the inferred semantic information.

The prior model is

$$P(M) = \prod_{k=1}^{C} P_k(n_k)\;\prod_{\substack{i=1 \\ (i,j)\in\varepsilon}}^{n} Q_i(m_i, m_j)\;\prod_{l=2}^{4}\prod_{i} H_{li}\big(s_{li}, S_{(l-1)i}\big)$$

$$\qquad\;\; = \prod_{k=1}^{C} P_k(n_k)\;\prod_{i=1}^{n}\exp\Big\{-\sum_{(i,j)\in\varepsilon}\psi_i(m_i, m_j)\Big\}\;\prod_{l=2}^{4}\prod_{i}\exp\big\{-\eta_{li}\big(s_{li}, S_{(l-1)i}\big) - \gamma_{li}(s_{li}, s_{lj})\big\}$$

where P_k(n_k) denotes the prior on the number n_k of submodels of object class k; ψ_i(m_i, m_j) describes, within the whole scene three-dimensional model, the consistency in object class, position, pose and scale between a submodel and the submodels around it; η_{li}(s_{li}, S_{(l-1)i}) denotes the prior on the relation between the semantic information s_{li} of the i-th node in layer l and the set S_{(l-1)i} of its child nodes in the layer below; and γ_{li}(s_{li}, s_{lj}) denotes the prior on the semantic relation between neighboring nodes in the same layer, where i, j, k and l are natural numbers and C is a natural number greater than or equal to k.
In the above method for reconstructing a three-dimensional scene from a single image, the image features include appearance features and geometric features; the appearance features include at least color, texture and illumination, and the geometric features are obtained from vanishing lines or from texture similarity relations and the degree of image blur.
In the above method, a classifier is used to label in the image all types of objects it contains, and based on these labels the three-dimensional model primitives of the corresponding object or object-part categories are chosen from a three-dimensional model database as the initial geometric models of each object type in the current image; the classifier is obtained by training for object recognition and object-part recognition, with the scene classification added as a constraint.
Compared with the prior art, the present invention has the following beneficial effects:
First, a unified mathematical model for image scene understanding is proposed, covering scene classification, object recognition, object-part composition and so on, so that a hierarchical understanding of the image scene can be achieved. This hierarchical semantic information first of all gives an overall description of the scene, which relaxes the restrictions on the data; secondly, it can drive the reconstruction algorithm.
Second, a method for selecting three-dimensional graphic primitive models based on image semantic information is proposed, so that image information and graphics information are fused. The semantically driven selection of three-dimensional graphic primitive models provides the basic units from which the whole three-dimensional scene is built, and thereby avoids the distortion problems that arise in existing monocular reconstruction methods from representing the whole scene with a mesh-based three-dimensional model.
Third, because a unified scene representation is used, the scene model proposed by this method can be learned and computed well using combined bottom-up and top-down inference. It mainly uses object (part) recognition to drive the selection of model primitives, which accelerates the computation, and it uses priors such as the context relations in the scene to combine and verify the whole scene model, which improves the accuracy of scene understanding and scene reconstruction.
Other advantages of the present invention will be described in the following description; the advantages and beneficial effects of the present invention can be seen from the embodiments or derived directly by those skilled in the art from the embodiments together with practical experience.
Description of drawings
The present invention can be understood more completely, and many of its attendant advantages appreciated, by reference to the following detailed description considered in conjunction with the accompanying drawings. The drawings described herein are provided for further understanding of the present invention and constitute a part of it; the illustrative embodiments of the present invention and their description are used to explain the present invention and do not constitute an improper limitation of it. In the drawings:
Fig. 1 is a flowchart of the method for reconstructing a three-dimensional scene from a single image according to the first embodiment of the present invention.
Fig. 2 is a flowchart of the method for reconstructing a three-dimensional scene from a single image according to the second embodiment of the present invention.
Fig. 3 is a flowchart of the method for reconstructing a three-dimensional scene from a single image according to the third embodiment of the present invention.
Embodiment
Fig. 1 is a flowchart of the method for reconstructing a three-dimensional scene from a single image according to the first embodiment of the present invention. As shown in Fig. 1, the method comprises the following steps:
S110, image input step: each image in an image sequence is input;
S120, feature extraction step: features are extracted from the scene of the image, scene classification and object recognition are carried out on the extracted features to obtain semantic information, and at the same time monocular geometric information is extracted from the same features;
S130, object detection step: object detection is carried out on the extracted features with reference to the scene classification;
S140, three-dimensional graphic primitive model selection step: a three-dimensional graphic primitive model is selected according to the result of the object detection; and
S150, scene three-dimensional model generation step: the scene three-dimensional model is inferred and verified from the scene semantic prior, the three-dimensional graphic primitive model and the monocular geometric information, thereby generating the final scene three-dimensional model.
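For illustration only, the following Python sketch shows how steps S110 to S150 feed one another. Every helper function used in it (extract_features, classify_scene, extract_monocular_geometry, detect_objects, infer_scene_model) and the primitive_db object are hypothetical placeholders standing in for the classifiers, detectors and solvers described in this embodiment; they are not part of the disclosed method.

```python
# Minimal sketch of the S110-S150 pipeline; all helpers are placeholders.

def reconstruct_single_image(image, primitive_db):
    # S120: extract low-level features, then derive the semantic information
    # (scene category) and the monocular geometric cues from them.
    features = extract_features(image)
    scene_class = classify_scene(features)
    mono_geometry = extract_monocular_geometry(features)  # vanishing lines, texture, blur

    # S130: object detection on the extracted features, with the scene
    # classification as a reference.
    detections = detect_objects(features, scene_class)

    # S140: select a three-dimensional graphic primitive model for each
    # detected object class from the model database.
    primitives = [primitive_db.lookup(d.class_label) for d in detections]

    # S150: infer and verify the scene three-dimensional model from the scene
    # semantic prior, the chosen primitives and the monocular geometry.
    return infer_scene_model(scene_class, detections, primitives, mono_geometry)
```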
Fig. 2 is a flowchart of the method for reconstructing a three-dimensional scene from a single image according to the second embodiment of the present invention. As shown in Fig. 2, between the object detection step S130 and the three-dimensional graphic primitive model selection step S140 there is further included an object-part detection step S135, in which object parts are detected based on the result of the object detection; in the three-dimensional graphic primitive model selection step, the three-dimensional graphic primitive model is then selected according to the results of both the object detection and the object-part detection.
Fig. 3 is a flowchart of the method for reconstructing a three-dimensional scene from a single image according to the third embodiment of the present invention. As shown in Fig. 3, the single-image three-dimensional scene reconstruction method of the present invention proposes a unified, hierarchical mathematical representation of single-image scene understanding (semantics and geometry); it combines methods such as scene classification, object recognition and monocular three-dimensional geometric information extraction, fuses the context relations that make up the scene, and uses combined bottom-up and top-down inference both to learn the model and to carry out the analytical computation of scene semantic understanding, thereby forming a complete single-image reconstruction method.
In the single-image three-dimensional scene reconstruction method of the present invention, scene classification, object recognition and object-part recognition are used to obtain the basic semantic information about the scene, while the context relations among the various components of the scene are used to reinforce the semantic information and to constrain the basic semantic information.
In addition, the result of the semantic inference is used to select basic three-dimensional graphic primitive models, which establishes the link between the image information and the elementary units of the scene three-dimensional model; this yields semantically driven selection of three-dimensional graphic primitive models and provides the basic building blocks for the three-dimensional model of the whole scene.
In addition, the semantic constraint relations and the monocular geometric information are used to verify and combine the three-dimensional graphic primitive models, thereby constructing the three-dimensional model of the whole scene; the detailed procedure is given in the embodiment below.
Specifically, a stochastic context-sensitive image grammar is used as a reference for describing the semantic information of the scene hierarchically. The scene description of an image is divided into four levels: a scene classification layer, an object layer, an object-part layer and an image feature layer. Within the object layer and the object-part layer, a Markov random field (MRF) is used to describe the spatial and semantic relations among the elements of the layer; between layers, a stochastic context-free grammar (SCFG) is used to model the containment or membership relations between elements of adjacent layers, thereby forming a unified description from basic pixel information up to image feature information, object-part information, object category information and scene category information.
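As a rough illustration of how such a four-level description might be held in memory, the following sketch stores the SCFG parent-child links and the MRF neighbor links explicitly so both kinds of relation can be scored later; the class and field names are assumptions introduced here, not terms of the patent.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Node:
    label: str                                                  # semantic label at this level
    features: Dict[str, float] = field(default_factory=dict)    # attached image features
    children: List["Node"] = field(default_factory=list)        # next layer down (SCFG links)
    neighbors: List["Node"] = field(default_factory=list)       # same layer (MRF links)

@dataclass
class SceneDescription:
    scene_class: Node        # level 1: scene classification layer
    objects: List[Node]      # level 2: object layer; levels 3 and 4 (object
                             # parts, image features) hang off `children`
```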
Under the Bayesian framework, we formulate single-image three-dimensional scene reconstruction based on scene classification and object recognition as a maximum a posteriori estimation problem: given the input image, compute the optimal three-dimensional model, i.e. the model that best explains the three-dimensional information conveyed by the input image. The probabilistic model has the following form:

$$M \sim P(M \mid I) \propto P(I \mid M)\,P(M)$$

$$M^{*} = \arg\max_{M} P(I \mid M)\,P(M)$$

where P(I|M) is the likelihood model, I is the input single scene image, and M is the three-dimensional interpretation of the image, i.e. the three-dimensional model. The model has the following form:

$$M = (n, m_1, m_2, \ldots, m_n), \qquad m_i = (l_i, \theta_i)$$

The meaning of M is: the scene three-dimensional model M consists of n submodels; each submodel is specified by a class label l_i, which indicates which kind of object it is (such as a vehicle, a building, a tree or a pedestrian), and by a parameter θ_i, which specifies the position and pose of the submodel in world coordinates. During execution, the corresponding basic model, i.e. the three-dimensional graphic primitive model, is called from the model library according to the class label l_i.
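The representation M = (n, m_1, ..., m_n) with m_i = (l_i, θ_i) can be transcribed almost directly into code. In the sketch below, the pose parameter θ_i is expanded into position, orientation and scale, which is an assumption consistent with, but not stated by, the text above; the class names are illustrative.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class SubModel:
    class_label: str                          # l_i: which kind of object (vehicle, building, tree, ...)
    position: Tuple[float, float, float]      # part of theta_i: location in world coordinates
    orientation: Tuple[float, float, float]   # part of theta_i: pose
    scale: float = 1.0                        # assumed additional parameter

@dataclass
class SceneModel:
    submodels: List[SubModel]                 # M = (n, m_1, ..., m_n); n == len(submodels)

    def instantiate(self, primitive_db):
        # Call the basic primitive from the model library according to each
        # class label, as described above.
        return [primitive_db.lookup(m.class_label) for m in self.submodels]
```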
Likelihood model
The likelihood model P(I|M) has the following form:

$$P(I \mid M) = \prod_{i=1}^{n} P(I \mid s_i, m_i)\,\prod_{i=1}^{n} P(s_i \mid m_i)$$

In this likelihood model, the factor P(I|s_i, m_i) is expressed through a fitting term φ_i(s_i, m_i, f_i(I)), which measures, under the inferred scene semantic labeling, the degree of fit with the corresponding part (features) of the original image. Here f_i(I) denotes the image features corresponding to the projection of the three-dimensional submodel m_i into the original image; these features include both appearance features (color, texture, illumination, etc.) and geometric features (provided mainly by vanishing lines, and also obtainable from texture similarity relations and the degree of image blur). When computing φ_i(s_i, m_i, f_i(I)), the appearance features effectively help determine which kind of object the current image region most resembles, and thus provide the basis for selecting the three-dimensional submodel, while the geometric features measure the discrepancy between the projection of the selected three-dimensional model and the geometric features of the image (for example the distance between the projected contour and the line features of the image).
The factor P(s_i|m_i) expresses the reliability of selecting a particular three-dimensional primitive given the inferred semantic information. The inferred scene-model primitive, together with the semantic labels and features of the corresponding class, helps to accurately recover the position, pose and scale of the three-dimensional model in the world coordinate system. In the computation, the above model in fact evaluates, for each region of the image, how well the semantic hypothesis and the model hypothesis fit the raw image information, where i is a natural number and C is a natural number greater than or equal to k.
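The likelihood can be evaluated in energy form as sketched below; project_to_image, appearance_and_geometry_cost and primitive_reliability are assumed placeholder functions standing in for the fitting term φ_i(s_i, m_i, f_i(I)) and the reliability term P(s_i|m_i), and are not defined by the patent.

```python
def log_likelihood(image_features, scene_model, semantic_labels):
    # Schematic log P(I|M): for each submodel m_i, compare its projection
    # f_i(I) with the corresponding image region under the inferred semantic
    # label s_i (fitting term phi_i), and add the log-reliability of choosing
    # that primitive (the P(s_i|m_i) factor).
    total = 0.0
    for m_i, s_i in zip(scene_model.submodels, semantic_labels):
        projected = project_to_image(m_i)                  # placeholder projection helper
        fit_cost = appearance_and_geometry_cost(s_i, m_i, projected, image_features)
        log_reliability = primitive_reliability(s_i, m_i)  # placeholder log P(s_i|m_i)
        total += -fit_cost + log_reliability
    return total
```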
Prior model
The prior model in the maximum a posteriori formula given above can be decomposed as:

$$P(M) = \prod_{k=1}^{C} P_k(n_k)\;\prod_{\substack{i=1 \\ (i,j)\in\varepsilon}}^{n} Q_i(m_i, m_j)\;\prod_{l=2}^{4}\prod_{i} H_{li}\big(s_{li}, S_{(l-1)i}\big)$$

$$\qquad\;\; = \prod_{k=1}^{C} P_k(n_k)\;\prod_{i=1}^{n}\exp\Big\{-\sum_{(i,j)\in\varepsilon}\psi_i(m_i, m_j)\Big\}\;\prod_{l=2}^{4}\prod_{i}\exp\big\{-\eta_{li}\big(s_{li}, S_{(l-1)i}\big) - \gamma_{li}(s_{li}, s_{lj})\big\}$$

This prior model consists of three parts. In the first part, P_k(n_k) denotes the prior on the number n_k of submodels of object class k. In the second part, ψ_i(m_i, m_j) describes, within the whole three-dimensional model, the consistency between a submodel and the submodels around it in terms of object class, position, pose, scale and so on; for example, vehicles should appear on the road, and roadside trees are usually arranged in rows. The third part describes the priors on the semantic relations within each layer and between layers of the hierarchical scene description: η_{li}(s_{li}, S_{(l-1)i}) denotes the prior on the relation between the semantic information s_{li} of the i-th node in a given layer l and the set S_{(l-1)i} of its child nodes in the layer below, and γ_{li}(s_{li}, s_{lj}) denotes the prior on the semantic relations between neighboring nodes in the same layer, where i, j, k and l are natural numbers.
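A corresponding sketch of log P(M) is given below, with one assumed placeholder function per factor: class_count_prior for P_k(n_k), pairwise_consistency_cost for ψ, and between_layer_cost / within_layer_cost for η and γ. It reuses the SceneDescription and SceneModel containers sketched earlier and is only an illustration of the decomposition.

```python
import math
from collections import Counter

def log_prior(scene_model, scene_description, neighbor_pairs):
    # Term 1: prior P_k(n_k) on the number of submodels of each object class.
    counts = Counter(m.class_label for m in scene_model.submodels)
    total = sum(math.log(class_count_prior(k, n)) for k, n in counts.items())

    # Term 2: pairwise consistency psi(m_i, m_j) between neighbouring submodels
    # in class, position, pose and scale (e.g. cars lie on the road).
    for m_i, m_j in neighbor_pairs:
        total -= pairwise_consistency_cost(m_i, m_j)

    # Term 3: hierarchy terms for the object and object-part layers:
    # eta(node, children) between a node and the layer below, and
    # gamma(node, neighbour) between nodes of the same layer.
    for obj in scene_description.objects:
        for node in [obj] + obj.children:
            total -= between_layer_cost(node, node.children)
            for nb in node.neighbors:
                total -= within_layer_cost(node, nb)
    return total
```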
Semantically driven selection of three-dimensional primitive models
While image features (appearance features and some geometric features) are being extracted, the image is given an initial labeling: a trained classifier (such as AdaBoost or an SVM) is used to label in the image all types of objects it contains. Based on these labels, the three-dimensional graphic primitive models of the corresponding object categories can be chosen from the scene three-dimensional model database as the initial geometric models of each object type in the current image.
Establishing the three-dimensional graphic primitive models is the basis for forming the final scene model, and at the same time a key prerequisite for computing the three-dimensional model quickly (during the computation, the extraction of basic three-dimensional model primitives is driven by the semantic information). This includes determining how the primitives are represented, their attributes, and the set of relations among primitive instances. Here, starting from the perspective of human cognition and basic common knowledge about real environments, and drawing on the idea of parameterized GEON models, we build by hand a library of common primitive models that covers the different types of models, their attributes, and the relations (mutual exclusion, compatibility, etc.) among different models.
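One possible shape for such a hand-built primitive library is sketched below: a dictionary keyed by object class plus a table of pairwise relations (mutual exclusion, compatibility). The class and method names are illustrative assumptions, not part of the disclosure.

```python
# Assumed shape of a hand-built primitive library: each entry maps an object
# class to a parameterised basic 3D model (in the spirit of parameterised
# GEON primitives), plus pairwise relations between classes.

class PrimitiveLibrary:
    def __init__(self):
        self.models = {}       # class label -> parameterised primitive template
        self.relations = {}    # (label_a, label_b) -> "exclusive" | "compatible"

    def add(self, label, template, relations=None):
        self.models[label] = template
        for other, kind in (relations or {}).items():
            self.relations[(label, other)] = kind
            self.relations[(other, label)] = kind   # relations are symmetric

    def lookup(self, label):
        # Initial geometric model for a detected object class.
        return self.models[label]

    def compatible(self, label_a, label_b):
        return self.relations.get((label_a, label_b), "compatible") != "exclusive"
```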
Computation of the probabilistic model
The core strategy of the model computation is as follows: methods such as scene classification, object recognition and monocular three-dimensional geometric information extraction are fused, the context relations that make up the scene are incorporated, and bottom-up and top-down computation mechanisms are used under the Bayesian framework. By maximizing the posterior probability of the generated three-dimensional scene model, the scene semantics are parsed and each object in the scene is reconstructed in three dimensions, including choosing its model and solving for its parameters (position, pose, scale). The bottom-up process mainly computes the object categories and selects the basic three-dimensional primitive models accordingly; the top-down process mainly completes the verification of the scene semantic information and finishes the computation of the whole scene model.
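The combination of the two passes can be sketched as a simple search that maximizes log P(I|M) + log P(M). The greedy perturbation strategy and all helper functions below (initial_submodel, build_scene_description, neighbor_pairs, perturb, and the placeholders carried over from the earlier sketches) are assumptions made for illustration only; the text does not fix a particular optimizer.

```python
def map_inference(image, primitive_db, n_iters=100):
    # Bottom-up pass: classify the scene, detect objects, and instantiate one
    # submodel per detection from the primitive library.
    features = extract_features(image)
    scene_class = classify_scene(features)
    detections = detect_objects(features, scene_class)
    model = SceneModel([initial_submodel(d, primitive_db) for d in detections])
    labels = [d.class_label for d in detections]
    hierarchy = build_scene_description(features, scene_class, detections)

    def posterior(m):
        # log P(I|M) + log P(M), reusing the two sketches above.
        return (log_likelihood(features, m, labels)
                + log_prior(m, hierarchy, neighbor_pairs(m)))

    # Top-down pass: verify and refine by perturbing submodel labels and poses,
    # keeping any proposal that raises the posterior score.
    best, best_score = model, posterior(model)
    for _ in range(n_iters):
        candidate = perturb(best)            # placeholder proposal move
        score = posterior(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best
```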
The present invention has been described above in detail, and specific examples have been used herein to explain its principles and embodiments; the description of the above embodiments is only intended to help understand the method of the present invention and its core idea. At the same time, those of ordinary skill in the art may, in accordance with the idea of the present invention, make changes both to the specific embodiments and to the range of application. In summary, this description should not be construed as a limitation of the present invention.

Claims (5)

1. A method for reconstructing a three-dimensional scene from a single image, characterized by comprising the following steps:
an image input step, in which each image in an image sequence is input;
a feature extraction step, in which features are extracted from the scene of the image, scene classification and object recognition are carried out on the extracted features to obtain scene semantic information, and at the same time monocular geometric information is extracted from the same features;
an object detection step, in which, while features are extracted from the scene of the image, a classifier is used to label in each image all types of objects it contains, and object detection is carried out with reference to the scene classification, the classifier being obtained by training for object recognition and object-part recognition with the scene classification added as a constraint;
a three-dimensional graphic primitive model selection step, in which three-dimensional graphic primitive models are chosen from a three-dimensional model database according to the labeled object types; and
a scene three-dimensional model generation step, in which the scene three-dimensional model is inferred and verified from the scene semantic information, the three-dimensional graphic primitive models and the monocular geometric information, thereby generating the final scene three-dimensional model.
2. The method for reconstructing a three-dimensional scene from a single image according to claim 1, characterized in that an object-part detection step is further included between the object detection step and the three-dimensional graphic primitive model selection step, in which object parts are detected based on the result of the object detection, and in that, in the three-dimensional graphic primitive model selection step, the three-dimensional graphic primitive model is selected according to the results of both the object detection and the object-part detection.
3. The method for reconstructing a three-dimensional scene from a single image according to claim 2, characterized in that, in the feature extraction step, a context-sensitive image grammar over the scene of the image, together with bottom-up and top-down inference, is used to describe the scene semantic information hierarchically, describing the scene of the image in four levels: a scene classification layer, an object layer, an object-part layer and an image feature layer.
4. The method for reconstructing a three-dimensional scene from a single image according to claim 3, characterized in that, in the hierarchical description of the scene semantic information, the scene semantic information of the image is obtained through scene classification and object recognition, while the context relations among the various components of the scene, i.e. a prior model, are used to reinforce the semantic information and to constrain the scene semantic information.
5. The method for reconstructing a three-dimensional scene from a single image according to claim 3, characterized in that a Markov random field is used to describe the spatial and semantic relations among the elements within the object layer and the object-part layer, and a context-free grammar is used to model the containment or membership relations between elements of different layers, thereby forming a unified description from basic pixel information up to image feature information, object-part information, object category information and scene category information.
CN2009102424163A 2009-12-10 2009-12-10 Method for reconstructing three-dimensional scene of single image Active CN101714262B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2009102424163A CN101714262B (en) 2009-12-10 2009-12-10 Method for reconstructing three-dimensional scene of single image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2009102424163A CN101714262B (en) 2009-12-10 2009-12-10 Method for reconstructing three-dimensional scene of single image

Publications (2)

Publication Number Publication Date
CN101714262A CN101714262A (en) 2010-05-26
CN101714262B true CN101714262B (en) 2011-12-21

Family

ID=42417875

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2009102424163A Active CN101714262B (en) 2009-12-10 2009-12-10 Method for reconstructing three-dimensional scene of single image

Country Status (1)

Country Link
CN (1) CN101714262B (en)

Families Citing this family (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102335905A (en) * 2010-07-15 2012-02-01 鸿富锦精密工业(深圳)有限公司 Error-percussion system and method, and shooting type tool with error-percussion system
CN101908217B (en) * 2010-08-10 2012-05-23 北京大学 Intelligent image editing method and system suitable for tree-shaped objects
CN102223553B (en) * 2011-05-27 2013-03-20 山东大学 Method for converting two-dimensional video into three-dimensional video automatically
CN102426695A (en) * 2011-09-30 2012-04-25 北京航空航天大学 Virtual-real illumination fusion method of single image scene
CN102999755B (en) * 2012-10-30 2015-10-28 上海交通大学 The separation method of many object detection in digital picture
CN103021017B (en) * 2012-12-04 2015-05-20 上海交通大学 Three-dimensional scene rebuilding method based on GPU acceleration
CN103177391B (en) * 2013-01-31 2016-01-27 中国人民解放军后勤工程学院 A kind of facilities and equipment supervisory system based on three-dimensional live and system integration method
CN103198522B (en) * 2013-04-23 2015-08-12 清华大学 Three-dimensional scene models generation method
CN105303384A (en) * 2014-07-21 2016-02-03 上海羽舟网络科技有限公司 3D display method and system of products
CN104637090B (en) * 2015-02-06 2017-07-07 南京大学 A kind of indoor scene modeling method based on single picture
CN104899921B (en) * 2015-06-04 2017-12-22 杭州电子科技大学 Single-view videos human body attitude restoration methods based on multi-modal own coding model
CN105825550B (en) * 2016-03-15 2018-06-19 中国科学院沈阳应用生态研究所 Take the complex three-dimensional building model cutting modeling method of consistency into account
CN105913485B (en) * 2016-04-06 2019-02-12 北京小小牛创意科技有限公司 A kind of generation method and device of three-dimensional virtual scene
FR3054347B1 (en) * 2016-07-19 2019-08-23 Safran METHOD AND DEVICE FOR AIDING NAVIGATION OF A VEHICLE
CN106251314A (en) * 2016-08-19 2016-12-21 深圳市唯特视科技有限公司 A kind of method that image reasoning is rebuild
CN106846485A (en) * 2016-12-30 2017-06-13 Tcl集团股份有限公司 A kind of indoor three-dimensional modeling method and device
CN107292234B (en) * 2017-05-17 2020-06-30 南京邮电大学 Indoor scene layout estimation method based on information edge and multi-modal features
CN109117690A (en) * 2017-06-23 2019-01-01 百度在线网络技术(北京)有限公司 Drivable region detection method, device, equipment and storage medium
CN107507126B (en) * 2017-07-27 2020-09-18 和创懒人(大连)科技有限公司 Method for restoring 3D scene by using RGB image
WO2019043458A2 (en) * 2017-08-30 2019-03-07 Bioaxial Sas Superresolution metrology methods based on singular distributions and deep learning
CN109344677B (en) 2017-11-07 2021-01-15 长城汽车股份有限公司 Method, device, vehicle and storage medium for recognizing three-dimensional object
WO2019136636A1 (en) * 2018-01-10 2019-07-18 深圳前海达闼云端智能科技有限公司 Image recognition method and system, electronic device, and computer program product
CN108446710A (en) * 2018-01-31 2018-08-24 高睿鹏 Indoor plane figure fast reconstructing method and reconstructing system
CN108389222A (en) * 2018-02-28 2018-08-10 天津大学 A kind of method for reconstructing three-dimensional model based on image sequence
CN108506170A (en) * 2018-03-08 2018-09-07 上海扩博智能技术有限公司 Fan blade detection method, system, equipment and storage medium
CN110197529B (en) * 2018-08-30 2022-11-11 杭州维聚科技有限公司 Indoor space three-dimensional reconstruction method
US11610115B2 (en) * 2018-11-16 2023-03-21 Nvidia Corporation Learning to generate synthetic datasets for training neural networks
CN110120090B (en) * 2019-04-01 2020-09-25 贝壳找房(北京)科技有限公司 Three-dimensional panoramic model construction method and device and readable storage medium
CN110264502B (en) * 2019-05-17 2021-05-18 华为技术有限公司 Point cloud registration method and device
CN111986127B (en) * 2019-05-22 2022-03-08 腾讯科技(深圳)有限公司 Image processing method and device, computer equipment and storage medium
CN110599592B (en) * 2019-09-12 2023-01-06 北京工商大学 Three-dimensional indoor scene reconstruction method based on text
CN111444858A (en) * 2020-03-30 2020-07-24 哈尔滨工程大学 Mobile robot scene understanding method
CN114049444B (en) * 2022-01-13 2022-04-15 深圳市其域创新科技有限公司 3D scene generation method and device
CN117078867B (en) * 2023-10-16 2023-12-12 北京渲光科技有限公司 Three-dimensional reconstruction method, three-dimensional reconstruction device, storage medium and electronic equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7289648B2 (en) * 2003-08-08 2007-10-30 Microsoft Corp. System and method for modeling three dimensional objects from a single image
CN101373518A (en) * 2008-06-28 2009-02-25 合肥工业大学 Method for constructing prototype vector and reconstructing sequence parameter based on semantic information in image comprehension

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7289648B2 (en) * 2003-08-08 2007-10-30 Microsoft Corp. System and method for modeling three dimensional objects from a single image
CN101373518A (en) * 2008-06-28 2009-02-25 合肥工业大学 Method for constructing prototype vector and reconstructing sequence parameter based on semantic information in image comprehension

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Renaud Detry et al. A Probabilistic Framework for 3D Visual Object Representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009, Vol. 31, No. 10. *
Yizhou Wang et al. Modeling Complex Motion by Tracking and Editing Hidden Markov Graphs. Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. *
Sun Jin et al. Research on the Framework Structure of Computer Vision Systems. Computer Engineering and Applications, 2004, No. 12. *

Also Published As

Publication number Publication date
CN101714262A (en) 2010-05-26

Similar Documents

Publication Publication Date Title
CN101714262B (en) Method for reconstructing three-dimensional scene of single image
US10417526B2 (en) Object recognition method and device
CN111723585B (en) Style-controllable image text real-time translation and conversion method
CN111612807B (en) Small target image segmentation method based on scale and edge information
CN101719286B (en) Multiple viewpoints three-dimensional scene reconstructing method fusing single viewpoint scenario analysis and system thereof
CN111028235B (en) Image segmentation method for enhancing edge and detail information by utilizing feature fusion
CN108765279A (en) A kind of pedestrian's face super-resolution reconstruction method towards monitoring scene
CN113158862B (en) Multitasking-based lightweight real-time face detection method
CN108564120B (en) Feature point extraction method based on deep neural network
CN110827312B (en) Learning method based on cooperative visual attention neural network
CN110047139A (en) A kind of specified target three-dimensional rebuilding method and system
CN105096311A (en) Technology for restoring depth image and combining virtual and real scenes based on GPU (Graphic Processing Unit)
WO2023212997A1 (en) Knowledge distillation based neural network training method, device, and storage medium
CN111125403B (en) Aided design drawing method and system based on artificial intelligence
CN102147812A (en) Three-dimensional point cloud model-based landmark building image classifying method
CN110599411A (en) Image restoration method and system based on condition generation countermeasure network
CN103714556A (en) Moving target tracking method based on pyramid appearance model
CN111091151A (en) Method for generating countermeasure network for target detection data enhancement
CN104463962B (en) Three-dimensional scene reconstruction method based on GPS information video
CN114677479A (en) Natural landscape multi-view three-dimensional reconstruction method based on deep learning
CN115049556A (en) StyleGAN-based face image restoration method
Li et al. Line drawing guided progressive inpainting of mural damages
CN114519819A (en) Remote sensing image target detection method based on global context awareness
CN116703996A (en) Monocular three-dimensional target detection algorithm based on instance-level self-adaptive depth estimation
Guillard et al. Uclid-net: Single view reconstruction in object space

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20180223

Address after: Room 1106, Tianchuang Technology Building, No. 8 Caihefang Road, Haidian District, Beijing 100080

Patentee after: Beijing Shenrui Bolian Technology Co., Ltd.

Address before: No. 5 Yiheyuan Road, Zhongguancun, Beijing 100871

Patentee before: Peking University