WO2014040559A1 - Method and apparatus for scene recognition - Google Patents

Method and apparatus for scene recognition

Info

Publication number
WO2014040559A1
Authority
WO
WIPO (PCT)
Prior art keywords
scene
local
target
identified
detectors
Prior art date
Application number
PCT/CN2013/083501
Other languages
English (en)
French (fr)
Inventor
姜育刚
刘洁
王栋
郑莹斌
薛向阳
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Priority to EP13837155.4A priority Critical patent/EP2884428A4/en
Publication of WO2014040559A1 publication Critical patent/WO2014040559A1/zh
Priority to US14/657,121 priority patent/US9465992B2/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/35Categorising the entire scene, e.g. birthday party or wedding scene
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements

Definitions

  • The present invention relates to the field of information technology and, more particularly, to a method and apparatus for scene recognition.

Background
  • Image scene recognition refers to automatically processing and analyzing an image using its visual information, and determining and identifying the specific scene it contains (such as a kitchen, a street, or mountains). Determining the scene in an image not only helps in understanding the overall semantic content of the image, but also provides a basis for identifying specific targets and events in the image; scene recognition therefore plays an important role in automatic image understanding by computers. Scene recognition technology can be applied to many practical problems, such as intelligent image management and retrieval.
  • Existing scene recognition technology first describes the visual information of the image, a process also called visual feature extraction; the extracted visual features are then matched (or classified) using templates (or classifiers) that have already been obtained for different scenes, and the final scene recognition result is acquired.
  • A common method of extracting visual features is to compute statistics representing the low-level visual information in the image. These visual features include features describing color information, features describing texture information, and features describing shape information. After the low-level visual information is obtained, these features can be classified by a pre-trained classifier to obtain the final recognition result. The main disadvantage of this method is that low-level visual features have weak discriminative power for different scenes and cannot effectively distinguish and identify scenes with similar color, texture, and other information (such as a study and a library), which degrades scene recognition performance.
  • Another existing method uses mid-level feature representations (also called "attributes") for scene recognition. Such methods first require the design of a large number of visual concept detectors. The detection results of the visual concept detectors are concatenated to form a mid-level feature representation, which is finally classified by a classifier to obtain the recognition result.
  • The main disadvantages of this method include: 1. It uses detection results for whole annotated objects (such as "athlete" or "soccer ball") as mid-level features, which limits descriptive power: if only part of an object appears in the scene (for example, "only the athlete's legs are visible"), it cannot be detected. 2. The detector set may contain duplicates: one detector is trained for each object class annotated in each training image set, and because the images of some classes have similar meaning (for example, "referee" and "athlete"), the detectors trained from these classes may be duplicated or highly similar; on one hand this causes a high-dimensional feature disaster, and on the other hand results detected repeatedly suppress detections that occur less often, degrading scene recognition performance.
  • Embodiments of the present invention provide a method and apparatus for scene recognition, which can improve scene recognition performance.
  • In a first aspect, a method for scene recognition is provided, comprising: training a plurality of local detectors from a training image set, where one of the plurality of local detectors corresponds to one local region of one type of target, and a target of that type includes at least two local regions; detecting a to-be-identified scene using the plurality of local detectors, and acquiring features of target-based local regions of the to-be-identified scene; and identifying the to-be-identified scene according to the features of the target-based local regions of the to-be-identified scene.
  • In a possible implementation, the method further includes: merging those of the plurality of local detectors whose similarity is higher than a predetermined threshold to obtain a composite local detector set; the detecting of the to-be-identified scene using the plurality of local detectors and the acquiring of the features of the target-based local regions is then specifically implemented as: detecting the to-be-identified scene using the local detectors in the composite local detector set, and acquiring the features of the target-based local regions of the to-be-identified scene.
  • the similarity includes a degree of similarity between features of the local regions of the training images corresponding to the plurality of local detectors.
  • In a possible implementation, the identifying of the to-be-identified scene according to the features of the target-based local regions is specifically implemented as: classifying the features of the target-based local regions of the to-be-identified scene using a classifier, and acquiring a scene recognition result.
  • In a possible implementation, the acquiring of the features of the target-based local regions of the to-be-identified scene is specifically implemented as: acquiring a response map of the to-be-identified scene with each local detector that detects the to-be-identified scene; dividing the response map into a plurality of grid cells; taking the maximum response value in each cell as the feature of that cell; taking the features of all cells of the response map as the features corresponding to the response map; and taking the features corresponding to the response maps acquired by all local detectors that detect the to-be-identified scene as the features of the target-based local regions of the to-be-identified scene.
  • In a second aspect, an apparatus for scene recognition is provided, comprising: a generating module, configured to train a plurality of local detectors from a training image set, where one of the plurality of local detectors corresponds to one local region of one type of target, and a target of that type includes at least two local regions; a detecting module, configured to detect a to-be-identified scene using the plurality of local detectors obtained by the generating module, and to acquire features of target-based local regions of the to-be-identified scene; and an identification module, configured to identify the to-be-identified scene according to the features of the target-based local regions acquired by the detecting module.
  • In a possible implementation, the apparatus further includes: a merging module, configured to merge those of the plurality of local detectors whose similarity is higher than a predetermined threshold to obtain a composite local detector set; the detecting module is further configured to detect the to-be-identified scene using the local detectors in the composite local detector set, and to acquire the features of the target-based local regions of the to-be-identified scene.
  • the similarity includes a degree of similarity between features of the local regions of the training images corresponding to the plurality of local detectors.
  • In a possible implementation, the identification module is specifically configured to classify the features of the target-based local regions of the to-be-identified scene using a classifier, and to acquire a scene recognition result.
  • In a possible implementation, the detecting module is specifically configured to acquire a response map of the to-be-identified scene with each local detector that detects the to-be-identified scene, divide the response map into a plurality of grid cells, take the maximum response value in each cell as the feature of that cell, take the features of all cells of the response map as the features corresponding to the response map, and take the features corresponding to the response maps acquired by all local detectors that detect the to-be-identified scene as the features of the target-based local regions of the to-be-identified scene.
  • With the method and apparatus for scene recognition, the to-be-identified scene is detected using local detectors corresponding to local regions of targets, and the acquired features of target-based local regions of the to-be-identified scene can represent the image information more completely, thereby improving scene recognition performance.
  • FIG. 1 is a schematic flowchart of a method for scene recognition according to an embodiment of the present invention.
  • FIG. 2 is a schematic diagram of an example of a method of scene recognition according to an embodiment of the present invention.
  • FIG. 3 is another schematic flowchart of a method for scene recognition according to an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of another example of a method of scene recognition according to an embodiment of the present invention.
  • FIG. 5 is still another schematic flowchart of a method for scene recognition according to an embodiment of the present invention.
  • FIG. 6 is a schematic block diagram of an apparatus for scene recognition according to an embodiment of the present invention.
  • FIG. 7 is another schematic block diagram of an apparatus for scene recognition according to an embodiment of the present invention.
  • FIG. 8 is a schematic block diagram of an apparatus for scene recognition according to another embodiment of the present invention.

Detailed Description
  • FIG. 1 shows a schematic flowchart of a method 100 for scene recognition according to an embodiment of the present invention. As shown in FIG. 1, the method 100 includes: S110, training a plurality of local detectors from a training image set, where one of the plurality of local detectors corresponds to one local region of one type of target, and a target of that type includes at least two local regions; S120, detecting a to-be-identified scene using the plurality of local detectors, and acquiring features of target-based local regions of the to-be-identified scene; and S130, identifying the to-be-identified scene according to the features of the target-based local regions of the to-be-identified scene.
  • A detector that corresponds to a whole target cannot capture the features of a local region of the target when only that local region appears in the image, which hurts scene recognition performance. In this embodiment, the scene recognition apparatus first trains a plurality of local detectors from the training image set, where one local detector corresponds to one local region of one type of target; it then detects the to-be-identified scene using the plurality of local detectors, acquires the features of the target-based local regions of the scene, and identifies the scene according to those features. Since a local detector corresponds to a local region of a target, detecting the scene with the local detectors yields the features of the local regions of the targets.
  • The method for scene recognition in this embodiment of the present invention thus detects the to-be-identified scene using local detectors corresponding to local regions of targets, and the acquired features of target-based local regions of the scene can represent the image information more completely, thereby improving scene recognition performance.
  • In S110, the scene recognition apparatus trains a plurality of local detectors from the training image set.
  • In this embodiment of the present invention, each type of target is divided into a plurality of local regions, that is, each type of target includes at least two local regions.
  • Generating the local detectors requires an annotated training image set; the annotation must include not only the target category present in the image (for example, "referee") but also the specific position of the whole target in the image (the position of each target part is not required). For each type of target, generally 100 or more samples are needed.
  • On the basis of the annotated samples, the local detectors for each type of target can be obtained with the existing Deformable Part-based Models ("DPM") algorithm. Based on input parameters (such as the number of parts), the DPM algorithm automatically identifies the most distinctive parts of each type of target (such as the "head", "torso", and "lower limbs" of a "referee"), thereby producing local detectors corresponding to these parts.
  • In S120, the scene recognition apparatus detects the to-be-identified scene using the plurality of local detectors, and acquires the features of the target-based local regions of the to-be-identified scene.
  • After the local detectors are generated, the scene recognition apparatus detects the to-be-identified scene with them and obtains the features of the local regions corresponding to each local detector; these local-region features together constitute the features of the target-based local regions of the scene. As shown in FIG. 2, an image is detected with local detectors corresponding to different parts of the human body (for example, the head, torso, upper arm, forearm, and leg), and the features of the different parts of each target (the people in FIG. 2) are obtained, which together constitute the features of the entire image scene based on different parts of the human body.
  • Optionally, as shown in FIG. 3, the method 100 further includes: S140, merging those of the plurality of local detectors whose similarity is higher than a predetermined threshold to obtain a composite local detector set.
  • Correspondingly, step S120 includes:
  • S121, detecting the to-be-identified scene using the local detectors in the composite local detector set, and acquiring the features of the target-based local regions of the to-be-identified scene.
  • Different types of targets may share common local regions, for example, an athlete's head and a referee's head.
  • To avoid repeated local detection, local detectors with high similarity among the plurality of local detectors may be merged, that is, local detectors whose similarity is higher than the predetermined threshold are merged, and the merged local detectors are then used to detect the to-be-identified scene.
  • The composite local detector set denotes the set of local detectors obtained after merging the plurality of local detectors. If only some of the local detectors are merged, the composite set includes the merged local detectors and the remaining unmerged local detectors; if all local detectors are merged, the composite set includes only merged local detectors.
  • The merging of local detectors can be based on the information of the corresponding local image regions.
  • Optionally, the semantics of the regions to be merged may be constrained to ensure that merged local detectors are semantically highly correlated. For example, the "head" of a "referee" and the "head" of an "athlete" may be merged, whereas merging with the "head" of a "cat" is not allowed.
  • The similarity of local detectors includes the degree of similarity between features of the local regions of the training images corresponding to the local detectors. For example, in the set of local detectors to be merged, for each local detector, its corresponding local image regions are found on its corresponding training images, and the similarity of the local detectors is obtained from the similarity of low-level features (color, texture, etc.) of these local training image regions. Local detectors with high similarity, that is, similarity above a predetermined threshold (for example, 0.8), can be merged.
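  • The text does not fix an exact similarity measure, only that it is computed from low-level features (color, texture, etc.) of the corresponding local training image regions. A minimal sketch, assuming color histograms as the low-level feature and cosine similarity as the measure (both are illustrative assumptions, not choices stated in the patent):

```python
import numpy as np

def color_histogram(patch, bins=8):
    # L1-normalized 3-D RGB histogram of an image patch (H x W x 3, uint8).
    # The specific descriptor is an assumption; any low-level color/texture
    # feature mentioned in the text could be substituted.
    hist, _ = np.histogramdd(patch.reshape(-1, 3),
                             bins=(bins, bins, bins),
                             range=((0, 256),) * 3)
    return hist.ravel() / max(hist.sum(), 1.0)

def detector_similarity(patches_a, patches_b):
    # Similarity of two local detectors: cosine similarity of the mean
    # low-level features of their corresponding training image regions.
    fa = np.mean([color_histogram(p) for p in patches_a], axis=0)
    fb = np.mean([color_histogram(p) for p in patches_b], axis=0)
    return float(fa @ fb / (np.linalg.norm(fa) * np.linalg.norm(fb) + 1e-12))

# Detectors whose similarity exceeds the predetermined threshold (0.8 in the
# text) become candidates for merging.
```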
  • A simple merging method is upper-left-corner-aligned averaging: the filter matrices corresponding to the local detectors to be merged are aligned at their upper-left corners and averaged.
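  • A minimal sketch of this upper-left-corner-aligned averaging; the text does not say how filters of unequal size are handled, so zero-padding each filter to the largest shape before averaging is an assumption here:

```python
import numpy as np

def merge_filters(filters):
    # Merge the filter matrices of similar local detectors by aligning
    # their upper-left corners and averaging element-wise.
    h = max(f.shape[0] for f in filters)
    w = max(f.shape[1] for f in filters)
    acc = np.zeros((h, w))
    for f in filters:
        padded = np.zeros((h, w))
        padded[:f.shape[0], :f.shape[1]] = f  # upper-left alignment
        acc += padded
    return acc / len(filters)
```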
  • As shown in FIG. 4, local detector p1 is trained from training image set A and local detector p2 from training image set B; both p1 and p2 correspond to the head, and merging p1 and p2 yields local detector p. If detection were performed with p1 and p2, each target would be detected twice, whereas with the merged local detector p each target is detected only once, avoiding repeated detection.
  • Therefore, with the method for scene recognition in this embodiment of the present invention, by merging the local detectors and detecting the to-be-identified scene with the merged detectors, the acquired features of the scene not only represent the image information completely but also avoid repeated local detection, effectively reducing the dimensionality of the feature information and thereby improving scene recognition performance.
  • Optionally, acquiring the features of the target-based local regions of the to-be-identified scene includes: acquiring a response map of the to-be-identified scene with each local detector that detects the scene; dividing the response map into a plurality of grid cells; taking the maximum response value in each cell as the feature of that cell; taking the features of all cells of a response map as the features corresponding to that response map; and taking the features corresponding to the response maps acquired by all local detectors as the features of the target-based local regions of the to-be-identified scene.
  • Given an image, for each local detector (the merged detector if merging has been performed), a response map for that detector is generated by sliding a window over the image. As shown in FIG. 5, each of local detector 1 through local detector N detects the image from which features are to be extracted, that is, the image of the to-be-identified scene, and each local detector generates one response map.
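  • As a concrete illustration of the sliding-window step, the sketch below computes a response map for one local detector by taking the filter response (a dot product) at every window position. Both the dense feature map (e.g., a HOG-like representation) and the filter weights are assumptions; the text specifies only that a window is slid over the image:

```python
import numpy as np

def response_map(feature_map, filt):
    # Slide the detector's filter over the image feature map and record
    # the response (dot product) at every window position.
    fh, fw = filt.shape
    H, W = feature_map.shape
    out = np.empty((H - fh + 1, W - fw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(feature_map[y:y + fh, x:x + fw] * filt)
    return out
```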
  • Optionally, the response map generated by each local detector can be divided in three ways (1×1, 3×1, and 2×2). For each resulting cell, the maximum response value in the cell is taken as the feature of that cell, so each local detector generates an 8-dimensional (1×1 + 3×1 + 2×2) response feature.
  • The final feature, that is, the feature of the target-based local regions of the to-be-identified scene, is obtained by concatenating the features generated by all local detectors. If the number of local detectors is N, the finally generated local-region feature dimension is 8N. It should be noted that the example of FIG. 5 is only intended to help a person skilled in the art better understand the embodiments of the present invention, not to limit their scope.
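  • A minimal sketch of the per-detector 8-dimensional feature and the 8N-dimensional concatenation. Whether the 3×1 division means three horizontal bands or three vertical ones is not stated, so three horizontal bands are assumed:

```python
import numpy as np

def grid_max_pool(resp, rows, cols):
    # Split a response map into rows x cols cells and take each cell's max.
    r = np.linspace(0, resp.shape[0], rows + 1).astype(int)
    c = np.linspace(0, resp.shape[1], cols + 1).astype(int)
    return [resp[r[i]:r[i + 1], c[j]:c[j + 1]].max()
            for i in range(rows) for j in range(cols)]

def detector_feature(resp):
    # 8-dim feature per detector: 1*1 + 3*1 + 2*2 grid maxima.
    return (grid_max_pool(resp, 1, 1)      # 1 value
            + grid_max_pool(resp, 3, 1)    # 3 values
            + grid_max_pool(resp, 2, 2))   # 4 values

def scene_feature(response_maps):
    # Concatenate the 8-dim features of all N response maps -> 8N dims.
    return np.concatenate([detector_feature(r) for r in response_maps])
```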
  • Optionally, the image may also be transformed to multiple scales and the above features computed separately at each scale. For example, the input image is scaled down to half size and up to double size, giving two images of different scales. Features are computed on these two images in the same way, each yielding an 8N-dimensional feature; together with the features of the original image, the overall feature description is 3×8×N dimensional. Using multi-scale images makes the final feature more robust to scale changes of the target parts.
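  • The multi-scale variant simply repeats the pipeline on rescaled copies of the image. A sketch reusing the illustrative helpers above (`response_map`, `scene_feature`) and assuming OpenCV is available for resizing; the dense feature extractor is supplied by the caller:

```python
import cv2
import numpy as np

def multiscale_scene_feature(image, filters, extract_feature_map):
    # Compute the 8N-dim feature at half, original, and double scale and
    # concatenate them into one 3*8*N-dim descriptor.
    feats = []
    for scale in (0.5, 1.0, 2.0):
        img_s = cv2.resize(image, None, fx=scale, fy=scale)
        fmap = extract_feature_map(img_s)  # dense features; assumed supplied
        maps = [response_map(fmap, f) for f in filters]
        feats.append(scene_feature(maps))  # 8N dims per scale
    return np.concatenate(feats)           # 3*8*N dims
```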
  • In S130, the scene recognition apparatus identifies the to-be-identified scene according to the features of the target-based local regions of the to-be-identified scene.
  • After the features of the target-based local regions are obtained, the apparatus identifies the scene based on these features.
  • Optionally, S130 includes:
  • classifying the features of the target-based local regions of the to-be-identified scene using a classifier, and acquiring a scene recognition result.
  • Specifically, for each scene category, a classifier first needs to be trained on the features of target-based local regions proposed in the embodiments of the present invention. For example, a Support Vector Machine ("SVM") classifier with a linear kernel function can be used.
  • Given a scene category, training samples for that scene are first collected with image-level annotation, that is, whether the image contains the scene, and the features proposed in the embodiments of the present invention, namely the features of target-based local regions, are extracted; a linear-kernel SVM classifier is then trained with these training samples. If there are multiple scene categories, multiple classifiers are trained.
  • Given a new image, the trained scene classifier classifies the features of the target-based local regions of the image scene and outputs a recognition confidence for the scene corresponding to the classifier; a high recognition confidence means the to-be-identified scene is similar to the scene corresponding to the classifier, thereby giving the scene recognition result.
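  • A minimal sketch of the classification stage, assuming scikit-learn's LinearSVC as the linear-kernel SVM (the text names only a linear-kernel SVM, not a specific library). One binary classifier is trained per scene category on image-level labels, and the signed distance to the hyperplane serves as the recognition confidence:

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_scene_classifiers(X, labels, categories):
    # X: (n_images, 3*8*N) local-region features; labels: scene per image.
    # Train one one-vs-rest linear SVM per scene category.
    clfs = {}
    for cat in categories:
        y = np.array([1 if lab == cat else 0 for lab in labels])
        clfs[cat] = LinearSVC(C=1.0).fit(X, y)
    return clfs

def recognize(x, clfs):
    # Return (best_category, confidence) for one feature vector.
    scores = {cat: float(clf.decision_function(x.reshape(1, -1))[0])
              for cat, clf in clfs.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]
```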
  • Therefore, the method for scene recognition in this embodiment of the present invention detects the to-be-identified scene with local detectors corresponding to local regions of targets, so that the acquired features of target-based local regions represent the image information more completely; further, by merging the local detectors and detecting the scene with the merged detectors, the acquired features not only represent the image information completely but also avoid repeated local detection, effectively reducing the dimensionality of the feature information and thereby improving scene recognition performance.
  • It should be understood that the sequence numbers of the above processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and shall not constitute any limitation on the implementation of the embodiments of the present invention.
  • FIG. 6 shows a schematic block diagram of an apparatus 600 for scene recognition in accordance with an embodiment of the present invention.
  • the apparatus 600 includes:
  • a generating module 610, configured to train a plurality of local detectors from a training image set, where one of the plurality of local detectors corresponds to one local region of one type of target, and a target of that type includes at least two local regions;
  • a detecting module 620, configured to detect a to-be-identified scene using the plurality of local detectors obtained by the generating module 610, and to acquire features of target-based local regions of the to-be-identified scene;
  • an identification module 630, configured to identify the to-be-identified scene according to the features of the target-based local regions acquired by the detecting module 620.
  • In this embodiment of the present invention, the generating module 610 first trains a plurality of local detectors from the training image set, where one local detector corresponds to one local region of one type of target; the detecting module 620 then detects the to-be-identified scene using the plurality of local detectors and acquires the features of the target-based local regions of the scene; and the identification module 630 identifies the scene according to those features. Since a local detector corresponds to a local region of a target, detecting the scene with the local detectors yields the features of the local regions of the targets.
  • Therefore, the apparatus for scene recognition detects the to-be-identified scene using local detectors corresponding to local regions of targets, and the acquired features of target-based local regions of the scene can represent the image information more completely, thereby improving scene recognition performance.
  • The generating module 610 uses an annotated training image set; the annotation must include not only the target category present in the image (for example, "referee") but also the specific position of the whole target in the image (the position of each target part is not required). For each type of target, generally 100 or more samples are needed. On the basis of the annotated samples, the existing DPM algorithm is used to obtain the local detectors for each type of target. Based on input parameters (such as the number of parts), the DPM algorithm automatically identifies the most distinctive parts of each type of target (such as the "head", "torso", and "lower limbs" of a "referee"), thereby producing local detectors corresponding to these parts.
  • The detecting module 620 detects the to-be-identified scene with these local detectors and obtains the features of the local regions corresponding to each local detector; these local-region features constitute the features of the target-based local regions of the scene. For example, an image is detected with local detectors corresponding to different parts of the human body (for example, the head, torso, upper arm, forearm, and leg) as shown in FIG. 2, and the features of the different parts of each target (the people in FIG. 2) are obtained, which together constitute the features of the entire image scene based on different parts of the human body.
  • Optionally, as shown in FIG. 7, the apparatus 600 further includes: a merging module 640, configured to merge those of the plurality of local detectors whose similarity is higher than a predetermined threshold to obtain a composite local detector set;
  • the detecting module 620 is further configured to detect the to-be-identified scene using the local detectors in the composite local detector set, and to acquire the features of the target-based local regions of the to-be-identified scene.
  • To avoid repeated local detection, the merging module 640 merges those of the plurality of local detectors with high similarity, that is, local detectors whose similarity is higher than the predetermined threshold, and the detecting module 620 then detects the to-be-identified scene with the merged local detectors.
  • Optionally, the similarity includes the degree of similarity between features of the local regions of the training images corresponding to the plurality of local detectors.
  • For example, in the set of local detectors to be merged, for each local detector, its corresponding local image regions are found on its corresponding training images, and the similarity of the local detectors is obtained from the similarity of low-level features (color, texture, etc.) of these local training image regions.
  • Local detectors with high similarity, that is, similarity above a predetermined threshold (for example, 0.8), can be merged.
  • A simple merging method is upper-left-corner-aligned averaging: the filter matrices corresponding to the local detectors to be merged are aligned at their upper-left corners and averaged.
  • With the apparatus for scene recognition, by merging the local detectors and detecting the to-be-identified scene with the merged detectors, the acquired features of the scene not only represent the image information completely but also avoid repeated local detection, effectively reducing the dimensionality of the feature information and thereby improving scene recognition performance.
  • Optionally, the detecting module 620 is specifically configured to acquire a response map of the to-be-identified scene with each local detector that detects the scene, divide the response map into a plurality of grid cells, take the maximum response value in each cell as the feature of that cell, take the features of all cells of the response map as the features corresponding to the response map, and take the features corresponding to the response maps acquired by all local detectors that detect the scene as the features of the target-based local regions of the to-be-identified scene.
  • The identification module 630 identifies the to-be-identified scene according to the features of the target-based local regions acquired by the detecting module 620.
  • Optionally, the identification module 630 is specifically configured to classify the features of the target-based local regions of the to-be-identified scene using a classifier, and to acquire a scene recognition result.
  • For example, multiple linear-kernel SVM classifiers are first trained with training samples. Given a new image, the identification module 630 classifies the features of the target-based local regions of the image scene with the trained scene classifiers and outputs the recognition confidence of the scene corresponding to each classifier, thereby obtaining the scene recognition result.
  • The apparatus 600 for scene recognition according to this embodiment of the present invention may correspond to the execution body of the method for scene recognition according to the embodiments of the present invention, and the above and other operations and/or functions of the modules in the apparatus 600 are respectively intended to implement the corresponding processes of the methods in FIG. 1 to FIG. 5; for brevity, details are not described here again.
  • The apparatus for scene recognition detects the to-be-identified scene using local detectors corresponding to local regions of targets, so that the acquired features of target-based local regions represent the image information more completely; further, by merging the local detectors and detecting the scene with the merged detectors, the acquired features not only represent the image information completely but also avoid repeated local detection, effectively reducing the dimensionality of the feature information and thereby improving scene recognition performance.
  • FIG. 8 shows a schematic block diagram of an apparatus 800 for scene recognition in accordance with another embodiment of the present invention. As shown in FIG. 8, the apparatus 800 includes: a processor 810, an input device 820, and an output device 830;
  • The processor 810 trains a plurality of local detectors from a training image set input by the input device 820, where one of the plurality of local detectors corresponds to one local region of one type of target, and a target of that type includes at least two local regions; detects the to-be-identified scene input by the input device 820 using the plurality of local detectors, and acquires features of target-based local regions of the to-be-identified scene; identifies the to-be-identified scene according to these features; and outputs the recognition result through the output device 830.
  • The apparatus for scene recognition detects the to-be-identified scene using local detectors corresponding to local regions of targets, and the acquired features of target-based local regions can represent the image information more completely, thereby improving scene recognition performance.
  • Optionally, the processor 810 is further configured to merge those of the plurality of local detectors whose similarity is higher than a predetermined threshold to obtain a composite local detector set, detect the to-be-identified scene using the local detectors in the composite local detector set, and acquire the features of the target-based local regions of the to-be-identified scene.
  • the similarity includes a degree of similarity between features of the local regions of the training images corresponding to the plurality of local detectors.
  • By merging the local detectors and detecting the to-be-identified scene with the merged detectors, the apparatus not only acquires features that represent the image information completely, but also avoids repeated local detection and effectively reduces the dimensionality of the feature information, thereby improving scene recognition performance.
  • the processor 810 is specifically configured to use a classifier to classify features of the target-based local area of the to-be-identified scene to obtain a scene recognition result.
  • Optionally, the processor 810 is specifically configured to acquire a response map of the to-be-identified scene with each local detector that detects the scene, divide the response map into a plurality of grid cells, take the maximum response value in each cell as the feature of that cell, take the features of all cells of the response map as the features corresponding to the response map, and take the features corresponding to the response maps acquired by all local detectors that detect the scene as the features of the target-based local regions of the to-be-identified scene.
  • The apparatus 800 for scene recognition according to this embodiment of the present invention may correspond to the execution body of the method for scene recognition according to the embodiments of the present invention, and the above and other operations and/or functions of the modules in the apparatus 800 are respectively intended to implement the corresponding processes of the methods in FIG. 1 to FIG. 5; for brevity, details are not described here again.
  • The apparatus for scene recognition detects the to-be-identified scene using local detectors corresponding to local regions of targets, so that the acquired features of target-based local regions represent the image information more completely; further, by merging the local detectors and detecting the scene with the merged detectors, the acquired features not only represent the image information completely but also avoid repeated local detection, effectively reducing the dimensionality of the feature information and thereby improving scene recognition performance.
  • It should be understood that the term "and/or" merely describes an association relationship between associated objects and indicates that three relationships may exist. For example, A and/or B may indicate three cases: A alone, both A and B, and B alone. In addition, the character "/" in this document generally indicates that the associated objects before and after it are in an "or" relationship.
  • In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other ways.
  • The device embodiments described above are merely illustrative. The division of the units is only a logical function division; in actual implementation there may be other division manners, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed.
  • The mutual couplings or direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices, or units, and may also be electrical, mechanical, or other forms of connection.
  • The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network elements. Some or all of the units may be selected according to actual needs to achieve the objectives of the embodiments of the present invention.
  • In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
  • When the integrated unit is implemented in the form of a software functional unit and sold or used as a standalone product, it may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of the present invention essentially, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product, which is stored in a storage medium and includes a number of instructions for instructing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present invention. The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Abstract

The present invention discloses a method and apparatus for scene recognition. The method includes: training a plurality of local detectors from a training image set, where one of the plurality of local detectors corresponds to one local region of one type of target, and a target of that type includes at least two local regions; detecting a to-be-identified scene using the plurality of local detectors, and acquiring features of target-based local regions of the to-be-identified scene; and identifying the to-be-identified scene according to the features of the target-based local regions of the to-be-identified scene. With the method and apparatus for scene recognition of the embodiments of the present invention, the to-be-identified scene is detected using local detectors corresponding to local regions of targets, and the acquired features of target-based local regions can represent the image information more completely, thereby improving scene recognition performance.

Description

Method and Apparatus for Scene Recognition

This application claims priority to Chinese Patent Application No. 201210341511.0, filed with the Chinese Patent Office on September 14, 2012 and entitled "Method and apparatus for scene recognition", which is incorporated herein by reference in its entirety.

Technical Field

The present invention relates to the field of information technology and, more particularly, to a method and apparatus for scene recognition.

Background

Image scene recognition refers to automatically processing and analyzing an image using its visual information, and determining and identifying the specific scene it contains (such as a kitchen, a street, or mountains). Determining the scene in an image not only helps in understanding the overall semantic content of the image, but also provides a basis for identifying specific targets and events in the image; scene recognition therefore plays an important role in automatic image understanding by computers. Scene recognition technology can be applied to many practical problems, such as intelligent image management and retrieval.

Existing scene recognition technology first describes the visual information of the image, a process also called visual feature extraction; the extracted visual features are then matched (or classified) using templates (or classifiers) that have already been obtained for different scenes, and the final scene recognition result is acquired.

A common method of extracting visual features is to compute statistics representing the low-level visual information in the image. These visual features include features describing color information, features describing texture information, and features describing shape information. After the low-level visual information is obtained, these features can be classified by a pre-trained classifier to obtain the final recognition result. The main disadvantage of this method is that low-level visual features have weak discriminative power for different scenes and cannot effectively distinguish and identify scenes with similar color, texture, and other information (such as a study and a library), which degrades scene recognition performance.

Another existing method uses mid-level feature representations (also called "attributes") for scene recognition. Such methods first require the design of a large number of visual concept detectors. The detection results of the visual concept detectors are concatenated to form a mid-level feature representation, which is finally classified by a classifier to obtain the recognition result. The main disadvantages of this method include: 1. It uses detection results for whole annotated objects (such as "athlete" or "soccer ball") as mid-level features, which limits descriptive power: if only part of an object appears in the scene (for example, "only the athlete's legs are visible"), it cannot be detected. 2. The detector set may contain duplicates: one detector is trained for each object class annotated in each training image set, and because the images of some classes have similar meaning (for example, "referee" and "athlete"), the detectors trained from these classes may be duplicated or highly similar; on one hand this causes a high-dimensional feature disaster, and on the other hand results detected repeatedly suppress detections that occur less often, degrading scene recognition performance.

Summary of the Invention
Embodiments of the present invention provide a method and apparatus for scene recognition, which can improve scene recognition performance.

In a first aspect, a method for scene recognition is provided, including: training a plurality of local detectors from a training image set, where one of the plurality of local detectors corresponds to one local region of one type of target, and a target of that type includes at least two local regions; detecting a to-be-identified scene using the plurality of local detectors, and acquiring features of target-based local regions of the to-be-identified scene; and identifying the to-be-identified scene according to the features of the target-based local regions of the to-be-identified scene.

In a first possible implementation, the method further includes: merging those of the plurality of local detectors whose similarity is higher than a predetermined threshold to obtain a composite local detector set; the detecting of the to-be-identified scene using the plurality of local detectors and the acquiring of the features of the target-based local regions is specifically implemented as: detecting the to-be-identified scene using the local detectors in the composite local detector set, and acquiring the features of the target-based local regions of the to-be-identified scene.

In a second possible implementation, with reference to the first possible implementation of the first aspect, the similarity includes the degree of similarity between features of local regions of the training images corresponding to the plurality of local detectors.

In a third possible implementation, with reference to the first aspect or the first or second possible implementation of the first aspect, the identifying of the to-be-identified scene according to the features of the target-based local regions is specifically implemented as: classifying the features of the target-based local regions of the to-be-identified scene using a classifier, and acquiring a scene recognition result.

In a fourth possible implementation, with reference to the first aspect or the first, second, or third possible implementation of the first aspect, the acquiring of the features of the target-based local regions of the to-be-identified scene is specifically implemented as: acquiring a response map of the to-be-identified scene with each local detector that detects the to-be-identified scene; and dividing the response map into a plurality of grid cells, taking the maximum response value in each cell as the feature of that cell, taking the features of all cells of the response map as the features corresponding to the response map, and taking the features corresponding to the response maps acquired by all local detectors that detect the to-be-identified scene as the features of the target-based local regions of the to-be-identified scene.

In a second aspect, an apparatus for scene recognition is provided, including: a generating module, configured to train a plurality of local detectors from a training image set, where one of the plurality of local detectors corresponds to one local region of one type of target, and a target of that type includes at least two local regions; a detecting module, configured to detect a to-be-identified scene using the plurality of local detectors obtained by the generating module, and to acquire features of target-based local regions of the to-be-identified scene; and an identification module, configured to identify the to-be-identified scene according to the features of the target-based local regions acquired by the detecting module.

In a first possible implementation, the apparatus further includes: a merging module, configured to merge those of the plurality of local detectors whose similarity is higher than a predetermined threshold to obtain a composite local detector set; the detecting module is further configured to detect the to-be-identified scene using the local detectors in the composite local detector set, and to acquire the features of the target-based local regions of the to-be-identified scene.

In a second possible implementation, with reference to the first possible implementation of the second aspect, the similarity includes the degree of similarity between features of local regions of the training images corresponding to the plurality of local detectors.

In a third possible implementation, with reference to the second aspect or the first or second possible implementation of the second aspect, the identification module is specifically configured to classify the features of the target-based local regions of the to-be-identified scene using a classifier, and to acquire a scene recognition result.

In a fourth possible implementation, with reference to the second aspect or the first, second, or third possible implementation of the second aspect, the detecting module is specifically configured to acquire a response map of the to-be-identified scene with each local detector that detects the to-be-identified scene, divide the response map into a plurality of grid cells, take the maximum response value in each cell as the feature of that cell, take the features of all cells of the response map as the features corresponding to the response map, and take the features corresponding to the response maps acquired by all local detectors that detect the to-be-identified scene as the features of the target-based local regions of the to-be-identified scene.

Based on the above technical solutions, with the method and apparatus for scene recognition of the embodiments of the present invention, the to-be-identified scene is detected using local detectors corresponding to local regions of targets, and the acquired features of target-based local regions of the to-be-identified scene can represent the image information more completely, thereby improving scene recognition performance.

Brief Description of the Drawings

To describe the technical solutions of the embodiments of the present invention more clearly, the accompanying drawings required by the embodiments are briefly introduced below. Apparently, the drawings in the following description show merely some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of a method for scene recognition according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of an example of a method for scene recognition according to an embodiment of the present invention. FIG. 3 is another schematic flowchart of a method for scene recognition according to an embodiment of the present invention. FIG. 4 is a schematic diagram of another example of a method for scene recognition according to an embodiment of the present invention. FIG. 5 is still another schematic flowchart of a method for scene recognition according to an embodiment of the present invention. FIG. 6 is a schematic block diagram of an apparatus for scene recognition according to an embodiment of the present invention.
FIG. 7 is another schematic block diagram of an apparatus for scene recognition according to an embodiment of the present invention.
FIG. 8 is a schematic block diagram of an apparatus for scene recognition according to another embodiment of the present invention.

Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Apparently, the described embodiments are some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
FIG. 1 shows a schematic flowchart of a method 100 for scene recognition according to an embodiment of the present invention. As shown in FIG. 1, the method 100 includes:
S110: Train a plurality of local detectors from a training image set, where one of the plurality of local detectors corresponds to one local region of one type of target, and a target of that type includes at least two local regions;
S120: Detect a to-be-identified scene using the plurality of local detectors, and acquire features of target-based local regions of the to-be-identified scene;
S130: Identify the to-be-identified scene according to the features of the target-based local regions of the to-be-identified scene.
A detector that corresponds to a whole target cannot capture the features of a local region of the target when only that local region appears in the image, which hurts scene recognition performance. In this embodiment of the present invention, the scene recognition apparatus first trains a plurality of local detectors from a training image set, where one local detector corresponds to one local region of one type of target; it then detects the to-be-identified scene using the plurality of local detectors, acquires the features of the target-based local regions of the scene, and identifies the scene according to those features. Since a local detector corresponds to a local region of a target, detecting the scene with the local detectors yields the features of the local regions of the targets.
Therefore, with the method for scene recognition of this embodiment of the present invention, the to-be-identified scene is detected using local detectors corresponding to local regions of targets, and the acquired features of target-based local regions can represent the image information more completely, thereby improving scene recognition performance.
In S110, the scene recognition apparatus trains a plurality of local detectors from the training image set. In this embodiment of the present invention, each type of target is divided into a plurality of local regions, that is, each type of target includes at least two local regions. Generating the local detectors requires an annotated training image set; the annotation must include not only the target category present in the image (for example, "referee") but also the specific position of the whole target in the image (the position of each target part is not required). For each type of target, generally 100 or more samples are needed. On the basis of the annotated samples, the local detectors for each type of target can be obtained with the existing Deformable Part-based Models ("DPM") algorithm. Based on input parameters (such as the number of parts), the DPM algorithm automatically identifies the most distinctive parts of each type of target (such as the "head", "torso", and "lower limbs" of a "referee"), thereby producing local detectors corresponding to these parts.
In S120, the scene recognition apparatus detects the to-be-identified scene using the plurality of local detectors, and acquires the features of the target-based local regions of the to-be-identified scene.
After the local detectors are generated, the scene recognition apparatus detects the to-be-identified scene with them and obtains the features of the local regions corresponding to each local detector; these local-region features constitute the features of the target-based local regions of the scene. As shown in FIG. 2, an image is detected with local detectors corresponding to different parts of the human body (for example, the head, torso, upper arm, forearm, and leg), and the features of the different parts of each target (the people in FIG. 2) are obtained, which together constitute the features of the entire image scene based on different parts of the human body.
In this embodiment of the present invention, as shown in FIG. 3, the method 100 optionally further includes:
S140: Merge those of the plurality of local detectors whose similarity is higher than a predetermined threshold to obtain a composite local detector set;
Correspondingly, step S120 includes:
S121: Detect the to-be-identified scene using the local detectors in the composite local detector set, and acquire the features of the target-based local regions of the to-be-identified scene. Different types of targets may share common local regions, for example, an athlete's head and a referee's head. To avoid repeated local detection, local detectors with high similarity among the plurality of local detectors may be merged, that is, local detectors whose similarity is higher than the predetermined threshold are merged, and the merged local detectors are then used to detect the to-be-identified scene.
In this embodiment of the present invention, the composite local detector set denotes the set of local detectors obtained after merging the plurality of local detectors. If some of the local detectors are merged, the composite local detector set includes the merged local detectors and the remaining unmerged local detectors; if all local detectors are merged, the composite local detector set includes only merged local detectors.
The merging of local detectors can be based on the information of the corresponding local image regions. Optionally, the semantics of the regions to be merged may be constrained to ensure that merged local detectors are semantically highly correlated. For example, the "head" of a "referee" and the "head" of an "athlete" may be merged, whereas merging with the "head" of a "cat" is not allowed.
Optionally, the similarity of local detectors includes the degree of similarity between features of the local regions of the training images corresponding to the local detectors. For example, in the set of local detectors to be merged, for each local detector, its corresponding local image regions are found on its corresponding training images, and the similarity of the local detectors is obtained from the similarity of low-level features (color, texture, etc.) of these local training image regions. Local detectors with high similarity, that is, similarity above a predetermined threshold (for example, 0.8), can be merged. A simple merging method is upper-left-corner-aligned averaging: the filter matrices corresponding to the local detectors to be merged are aligned at their upper-left corners and averaged.
As shown in FIG. 4, local detector p1 is trained from training image set A and local detector p2 from training image set B; both p1 and p2 correspond to the head, and merging p1 and p2 yields local detector p. If detection were performed with p1 and p2, each target would be detected twice, whereas with the merged local detector p each target is detected only once, avoiding repeated detection.
Therefore, with the method for scene recognition of this embodiment of the present invention, by merging the local detectors and detecting the to-be-identified scene with the merged detectors, the acquired features of the scene not only represent the image information completely but also avoid repeated local detection, effectively reducing the dimensionality of the feature information and thereby improving scene recognition performance.
In this embodiment of the present invention, optionally, acquiring the features of the target-based local regions of the to-be-identified scene includes:
acquiring a response map of the to-be-identified scene with each local detector that detects the to-be-identified scene; and
dividing the response map into a plurality of grid cells, taking the maximum response value in each cell as the feature of that cell, taking the features of all cells of the response map as the features corresponding to the response map, and taking the features corresponding to the response maps acquired by all local detectors that detect the to-be-identified scene as the features of the target-based local regions of the to-be-identified scene.
Given an image, for each local detector (the merged detector if merging has been performed), a response map for that detector is generated by sliding a window over the image. As shown in FIG. 5, each of local detector 1 through local detector N detects the image from which features are to be extracted, that is, the image of the to-be-identified scene, and each local detector generates one response map. Optionally, the response map generated by each local detector can be divided in three ways (1×1, 3×1, and 2×2). For each resulting cell, the maximum response value in the cell is taken as the feature of that cell, so each local detector generates an 8-dimensional (1×1 + 3×1 + 2×2) response feature. The final feature, that is, the feature of the target-based local regions of the to-be-identified scene, is obtained by concatenating/combining the features generated by all local detectors. If the number of local detectors is N, the finally generated local-region feature dimension is 8N. It should be noted that the example of FIG. 5 is only intended to help a person skilled in the art better understand the embodiments of the present invention, not to limit their scope.
Optionally, the image may also be transformed to multiple scales and the above features computed separately at each scale. For example, the input image is scaled down to half size and up to double size, giving two images of different scales. Features are computed on these two images in the same way, each yielding an 8N-dimensional feature. Together with the features of the original image, the overall feature description is 3×8×N dimensional. Using multi-scale images makes the final feature more robust to scale changes of the target parts.
In S130, the scene recognition apparatus identifies the to-be-identified scene according to the features of the target-based local regions of the to-be-identified scene.
After the features of the target-based local regions of the to-be-identified scene are obtained, the apparatus identifies the scene based on these features. Optionally, S130 includes:
classifying the features of the target-based local regions of the to-be-identified scene using a classifier, and acquiring a scene recognition result.
Specifically, for each scene category, a classifier first needs to be trained on the features of target-based local regions according to the embodiments of the present invention. For example, a Support Vector Machine ("SVM") classifier with a linear kernel function can be used. Given a scene category, training samples for that scene are first collected with image-level annotation, that is, whether the image contains the scene, and the features proposed in the embodiments of the present invention, namely the features of target-based local regions, are extracted; a linear-kernel SVM classifier is then trained with these training samples. If there are multiple scene categories, multiple classifiers are trained. Given a new image, the trained scene classifier classifies the features of the target-based local regions of the image scene and outputs the recognition confidence for the scene corresponding to the classifier; a high recognition confidence means the to-be-identified scene is similar to the scene corresponding to the classifier, thereby giving the scene recognition result.
Therefore, with the method for scene recognition of this embodiment of the present invention, the to-be-identified scene is detected with local detectors corresponding to local regions of targets, so that the acquired features of target-based local regions represent the image information more completely; further, by merging the local detectors and detecting the scene with the merged detectors, the acquired features not only represent the image information completely but also avoid repeated local detection, effectively reducing the dimensionality of the feature information and thereby improving scene recognition performance.
It should be understood that, in the various embodiments of the present invention, the sequence numbers of the above processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and shall not constitute any limitation on the implementation of the embodiments of the present invention.
The method for scene recognition according to the embodiments of the present invention has been described above in detail with reference to FIG. 1 to FIG. 5; the apparatus for scene recognition according to the embodiments of the present invention is described below with reference to FIG. 6 to FIG. 8.
FIG. 6 shows a schematic block diagram of an apparatus 600 for scene recognition according to an embodiment of the present invention. As shown in FIG. 6, the apparatus 600 includes:
a generating module 610, configured to train a plurality of local detectors from a training image set, where one of the plurality of local detectors corresponds to one local region of one type of target, and a target of that type includes at least two local regions;
a detecting module 620, configured to detect a to-be-identified scene using the plurality of local detectors obtained by the generating module 610, and to acquire features of target-based local regions of the to-be-identified scene;
an identification module 630, configured to identify the to-be-identified scene according to the features of the target-based local regions acquired by the detecting module 620.
In this embodiment of the present invention, the generating module 610 first trains a plurality of local detectors from the training image set, where one local detector corresponds to one local region of one type of target; the detecting module 620 then detects the to-be-identified scene using the plurality of local detectors and acquires the features of the target-based local regions of the scene; and the identification module 630 identifies the scene according to those features. Since a local detector corresponds to a local region of a target, detecting the scene with the local detectors yields the features of the local regions of the targets.
Therefore, with the apparatus for scene recognition of this embodiment of the present invention, the to-be-identified scene is detected using local detectors corresponding to local regions of targets, and the acquired features of target-based local regions can represent the image information more completely, thereby improving scene recognition performance.
The generating module 610 uses an annotated training image set; the annotation must include not only the target category present in the image (for example, "referee") but also the specific position of the whole target in the image (the position of each target part is not required). For each type of target, generally 100 or more samples are needed. On the basis of the annotated samples, the existing DPM algorithm is used to obtain the local detectors for each type of target. Based on input parameters (such as the number of parts), the DPM algorithm automatically identifies the most distinctive parts of each type of target (such as the "head", "torso", and "lower limbs" of a "referee"), thereby producing local detectors corresponding to these parts.
The detecting module 620 detects the to-be-identified scene with these local detectors and obtains the features of the local regions corresponding to each local detector; these local-region features constitute the features of the target-based local regions of the scene. For example, an image is detected with local detectors corresponding to different parts of the human body (for example, the head, torso, upper arm, forearm, and leg) as shown in FIG. 2, and the features of the different parts of each target (the people in FIG. 2) are obtained, which together constitute the features of the entire image scene based on different parts of the human body.
In this embodiment of the present invention, as shown in FIG. 7, the apparatus 600 optionally further includes: a merging module 640, configured to merge those of the plurality of local detectors whose similarity is higher than a predetermined threshold to obtain a composite local detector set;
the detecting module 620 is further configured to detect the to-be-identified scene using the local detectors in the composite local detector set, and to acquire the features of the target-based local regions of the to-be-identified scene.
Different types of targets may share common local regions, for example, an athlete's head and a referee's head. To avoid repeated local detection, the merging module 640 merges those of the plurality of local detectors with high similarity, that is, local detectors whose similarity is higher than the predetermined threshold, and the detecting module 620 then detects the to-be-identified scene with the merged local detectors.
In this embodiment of the present invention, optionally, the similarity includes the degree of similarity between features of the local regions of the training images corresponding to the plurality of local detectors.
For example, in the set of local detectors to be merged, for each local detector, its corresponding local image regions are found on its corresponding training images, and the similarity of the local detectors is obtained from the similarity of low-level features (color, texture, etc.) of these local training image regions. Local detectors with high similarity, that is, above a predetermined threshold (for example, 0.8), can be merged. A simple merging method is upper-left-corner-aligned averaging: the filter matrices corresponding to the local detectors to be merged are aligned at their upper-left corners and averaged.
With the apparatus for scene recognition of this embodiment of the present invention, by merging the local detectors and detecting the to-be-identified scene with the merged detectors, the acquired features of the scene not only represent the image information completely but also avoid repeated local detection, effectively reducing the dimensionality of the feature information and thereby improving scene recognition performance.
In this embodiment of the present invention, optionally, the detecting module 620 is specifically configured to acquire a response map of the to-be-identified scene with each local detector that detects the scene, divide the response map into a plurality of grid cells, take the maximum response value in each cell as the feature of that cell, take the features of all cells of the response map as the features corresponding to the response map, and take the features corresponding to the response maps acquired by all local detectors that detect the scene as the features of the target-based local regions of the to-be-identified scene.
The identification module 630 identifies the to-be-identified scene according to the features of the target-based local regions acquired by the detecting module 620.
Optionally, the identification module 630 is specifically configured to classify the features of the target-based local regions of the to-be-identified scene using a classifier, and to acquire a scene recognition result.
For example, multiple linear-kernel SVM classifiers are first trained with training samples. Given a new image, the identification module 630 classifies the features of the target-based local regions of the image scene with the trained scene classifiers and outputs the recognition confidence of the scene corresponding to each classifier, thereby obtaining the scene recognition result.
The apparatus 600 for scene recognition according to this embodiment of the present invention may correspond to the execution body of the method for scene recognition according to the embodiments of the present invention, and the above and other operations and/or functions of the modules in the apparatus 600 are respectively intended to implement the corresponding processes of the methods in FIG. 1 to FIG. 5; for brevity, details are not described here again.
With the apparatus for scene recognition of this embodiment of the present invention, the to-be-identified scene is detected using local detectors corresponding to local regions of targets, so that the acquired features of target-based local regions represent the image information more completely; further, by merging the local detectors and detecting the scene with the merged detectors, the acquired features not only represent the image information completely but also avoid repeated local detection, effectively reducing the dimensionality of the feature information and thereby improving scene recognition performance. FIG. 8 shows a schematic block diagram of an apparatus 800 for scene recognition according to another embodiment of the present invention. As shown in FIG. 8, the apparatus 800 includes: a processor 810, an input device 820, and an output device 830;
The processor 810 trains a plurality of local detectors from a training image set input by the input device 820, where one of the plurality of local detectors corresponds to one local region of one type of target, and a target of that type includes at least two local regions; detects the to-be-identified scene input by the input device 820 using the plurality of local detectors, and acquires features of target-based local regions of the to-be-identified scene; identifies the to-be-identified scene according to the features of the target-based local regions; and outputs the recognition result through the output device 830.
With the apparatus for scene recognition of this embodiment of the present invention, the to-be-identified scene is detected using local detectors corresponding to local regions of targets, and the acquired features of target-based local regions can represent the image information more completely, thereby improving scene recognition performance.
Optionally, the processor 810 is further configured to merge those of the plurality of local detectors whose similarity is higher than a predetermined threshold to obtain a composite local detector set, detect the to-be-identified scene using the local detectors in the composite local detector set, and acquire the features of the target-based local regions of the to-be-identified scene.
Optionally, the similarity includes the degree of similarity between features of the local regions of the training images corresponding to the plurality of local detectors.
With the apparatus for scene recognition of this embodiment of the present invention, by merging the local detectors and detecting the to-be-identified scene with the merged detectors, the acquired features of the scene not only represent the image information completely but also avoid repeated local detection, effectively reducing the dimensionality of the feature information and thereby improving scene recognition performance.
Optionally, the processor 810 is specifically configured to classify the features of the target-based local regions of the to-be-identified scene using a classifier, and to acquire a scene recognition result.
Optionally, the processor 810 is specifically configured to acquire a response map of the to-be-identified scene with each local detector that detects the scene, divide the response map into a plurality of grid cells, take the maximum response value in each cell as the feature of that cell, take the features of all cells of the response map as the features corresponding to the response map, and take the features corresponding to the response maps acquired by all local detectors that detect the scene as the features of the target-based local regions of the to-be-identified scene.
The apparatus 800 for scene recognition according to this embodiment of the present invention may correspond to the execution body of the method for scene recognition according to the embodiments of the present invention, and the above and other operations and/or functions of the modules in the apparatus 800 are respectively intended to implement the corresponding processes of the methods in FIG. 1 to FIG. 5; for brevity, details are not described here again.
With the apparatus for scene recognition of this embodiment of the present invention, the to-be-identified scene is detected using local detectors corresponding to local regions of targets, so that the acquired features of target-based local regions represent the image information more completely; further, by merging the local detectors and detecting the scene with the merged detectors, the acquired features not only represent the image information completely but also avoid repeated local detection, effectively reducing the dimensionality of the feature information and thereby improving scene recognition performance.
It should be understood that, in the embodiments of the present invention, the term "and/or" merely describes an association relationship between associated objects and indicates that three relationships may exist. For example, A and/or B may indicate three cases: A alone, both A and B, and B alone. In addition, the character "/" in this document generally indicates that the associated objects before and after it are in an "or" relationship.
A person of ordinary skill in the art may be aware that the units and algorithm steps of the examples described with reference to the embodiments disclosed herein can be implemented by electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are performed by hardware or software depends on the particular application and the design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each particular application, but such implementation should not be considered beyond the scope of the present invention.
It can be clearly understood by a person skilled in the art that, for convenience and brevity of description, for the specific working processes of the systems, apparatuses, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments; details are not described here again.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division of the units is merely a logical function division, and there may be other division manners in actual implementation, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings or direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may also be electrical, mechanical, or other forms of connection.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network elements. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments of the present invention.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
When the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of the present invention essentially, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product, which is stored in a storage medium and includes a number of instructions for instructing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present invention. The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing descriptions are merely specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any modification or replacement readily conceivable by a person skilled in the art within the technical scope disclosed in the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A method for scene recognition, comprising:
training a plurality of local detectors from a training image set, wherein one of the plurality of local detectors corresponds to one local region of one type of target, and the type of target comprises at least two local regions;
detecting a to-be-identified scene using the plurality of local detectors, and acquiring features of target-based local regions of the to-be-identified scene; and
identifying the to-be-identified scene according to the features of the target-based local regions of the to-be-identified scene.
2. The method according to claim 1, further comprising:
merging those of the plurality of local detectors whose similarity is higher than a predetermined threshold to obtain a composite local detector set;
wherein the detecting of the to-be-identified scene using the plurality of local detectors and the acquiring of the features of the target-based local regions of the to-be-identified scene comprises:
detecting the to-be-identified scene using the local detectors in the composite local detector set, and acquiring the features of the target-based local regions of the to-be-identified scene.
3. The method according to claim 2, wherein the similarity comprises a degree of similarity between features of local regions of the training images corresponding to the plurality of local detectors.
4. The method according to any one of claims 1 to 3, wherein the identifying of the to-be-identified scene according to the features of the target-based local regions comprises:
classifying the features of the target-based local regions of the to-be-identified scene using a classifier, and acquiring a scene recognition result.
5. The method according to any one of claims 1 to 4, wherein the acquiring of the features of the target-based local regions of the to-be-identified scene comprises:
acquiring a response map of the to-be-identified scene with each local detector that detects the to-be-identified scene; and
dividing the response map into a plurality of grid cells, taking the maximum response value in each cell as the feature of that cell, taking the features of all cells of the response map as the features corresponding to the response map, and taking the features corresponding to the response maps acquired by all local detectors that detect the to-be-identified scene as the features of the target-based local regions of the to-be-identified scene.
6. An apparatus for scene recognition, comprising:
a generating module, configured to train a plurality of local detectors from a training image set, wherein one of the plurality of local detectors corresponds to one local region of one type of target, and the type of target comprises at least two local regions;
a detecting module, configured to detect a to-be-identified scene using the plurality of local detectors obtained by the generating module, and to acquire features of target-based local regions of the to-be-identified scene; and
an identification module, configured to identify the to-be-identified scene according to the features of the target-based local regions acquired by the detecting module.
7. The apparatus according to claim 6, further comprising:
a merging module, configured to merge those of the plurality of local detectors whose similarity is higher than a predetermined threshold to obtain a composite local detector set;
wherein the detecting module is further configured to detect the to-be-identified scene using the local detectors in the composite local detector set, and to acquire the features of the target-based local regions of the to-be-identified scene.
8. The apparatus according to claim 7, wherein the similarity comprises a degree of similarity between features of local regions of the training images corresponding to the plurality of local detectors.
9. The apparatus according to any one of claims 6 to 8, wherein the identification module is specifically configured to classify the features of the target-based local regions of the to-be-identified scene using a classifier, and to acquire a scene recognition result.
10. The apparatus according to any one of claims 6 to 9, wherein the detecting module is specifically configured to acquire a response map of the to-be-identified scene with each local detector that detects the to-be-identified scene, divide the response map into a plurality of grid cells, take the maximum response value in each cell as the feature of that cell, take the features of all cells of the response map as the features corresponding to the response map, and take the features corresponding to the response maps acquired by all local detectors that detect the to-be-identified scene as the features of the target-based local regions of the to-be-identified scene.
PCT/CN2013/083501 2012-09-14 2013-09-13 Method and apparatus for scene recognition WO2014040559A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP13837155.4A EP2884428A4 (en) 2012-09-14 2013-09-13 SCENE RECOGNITION AND DEVICE
US14/657,121 US9465992B2 (en) 2012-09-14 2015-03-13 Scene recognition method and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201210341511.0A CN103679189B (zh) 2012-09-14 2012-09-14 Method and apparatus for scene recognition
CN201210341511.0 2012-09-14

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US14/657,121 Continuation US9465992B2 (en) 2012-09-14 2015-03-13 Scene recognition method and apparatus

Publications (1)

Publication Number Publication Date
WO2014040559A1 true WO2014040559A1 (zh) 2014-03-20

Family

ID=50277642

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2013/083501 WO2014040559A1 (zh) 2012-09-14 2013-09-13 Method and apparatus for scene recognition

Country Status (4)

Country Link
US (1) US9465992B2 (zh)
EP (1) EP2884428A4 (zh)
CN (1) CN103679189B (zh)
WO (1) WO2014040559A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112395917A (zh) * 2019-08-15 2021-02-23 纳恩博(北京)科技有限公司 Region identification method and apparatus, storage medium, and electronic apparatus

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105095902B (zh) * 2014-05-23 2018-12-25 华为技术有限公司 Picture feature extraction method and apparatus
CN104318208A (zh) * 2014-10-08 2015-01-28 合肥工业大学 Video scene detection method based on graph partitioning and instance learning
CN105678267A (zh) * 2016-01-08 2016-06-15 浙江宇视科技有限公司 Scene recognition method and apparatus
CN105809146B (zh) * 2016-03-28 2019-08-30 北京奇艺世纪科技有限公司 Image scene recognition method and apparatus
CN106295523A (zh) * 2016-08-01 2017-01-04 马平 SVM-based pedestrian flow detection method for public places
US10061984B2 (en) 2016-10-24 2018-08-28 Accenture Global Solutions Limited Processing an image to identify a metric associated with the image and/or to determine a value for the metric
CN108229493A (zh) * 2017-04-10 2018-06-29 商汤集团有限公司 Object verification method and apparatus, and electronic device
CN109389136A (zh) * 2017-08-08 2019-02-26 上海为森车载传感技术有限公司 Classifier training method
CN108830908A (zh) * 2018-06-15 2018-11-16 天津大学 Rubik's cube color recognition method based on an artificial neural network
CN108900769B (zh) * 2018-07-16 2020-01-10 Oppo广东移动通信有限公司 Image processing method and apparatus, mobile terminal, and computer-readable storage medium
CN112560840B (zh) * 2018-09-20 2023-05-12 西安艾润物联网技术服务有限责任公司 Method for recognizing multiple recognition regions, recognition terminal, and readable storage medium
CN111488751A (zh) * 2019-01-29 2020-08-04 北京骑胜科技有限公司 Two-dimensional code image processing method and apparatus, electronic device, and storage medium
CN109858565B (zh) * 2019-02-28 2022-08-12 南京邮电大学 Deep-learning-based home indoor scene recognition method fusing global features and local object information
CN109919244B (zh) * 2019-03-18 2021-09-07 北京字节跳动网络技术有限公司 Method and apparatus for generating a scene recognition model
CN110245628B (zh) * 2019-06-19 2023-04-18 成都世纪光合作用科技有限公司 Method and apparatus for detecting a personnel discussion scene
CN111144378B (zh) * 2019-12-30 2023-10-31 众安在线财产保险股份有限公司 Target object recognition method and apparatus
CN111368761B (zh) * 2020-03-09 2022-12-16 腾讯科技(深圳)有限公司 Shop business status recognition method and apparatus, readable storage medium, and device
CN111580060B (zh) * 2020-04-21 2022-12-13 北京航空航天大学 Target pose recognition method and apparatus, and electronic device
CN113205037B (zh) * 2021-04-28 2024-01-26 北京百度网讯科技有限公司 Event detection method and apparatus, electronic device, and readable storage medium
CN113486942A (zh) * 2021-06-30 2021-10-08 武汉理工光科股份有限公司 Repeated fire alarm determination method and apparatus, electronic device, and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101277394A (zh) * 2007-02-19 2008-10-01 精工爱普生株式会社 Information processing method, information processing device, and program
CN101996317A (zh) * 2010-11-01 2011-03-30 中国科学院深圳先进技术研究院 Method and apparatus for recognizing markers on a human body
CN102426653A (zh) * 2011-10-28 2012-04-25 西安电子科技大学 Static human body detection method based on second-generation Bandelet transform and star model

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5963670A (en) * 1996-02-12 1999-10-05 Massachusetts Institute Of Technology Method and apparatus for classifying and identifying images
EP1959668A3 (en) 2007-02-19 2009-04-22 Seiko Epson Corporation Information processing method, information processing apparatus, and program
CN101127029A (zh) * 2007-08-24 2008-02-20 复旦大学 Method for training an SVM classifier in large-scale data classification problems
JP4772839B2 (ja) * 2008-08-13 2011-09-14 株式会社エヌ・ティ・ティ・ドコモ Image identification method and imaging apparatus
CN101968884A (zh) * 2009-07-28 2011-02-09 索尼株式会社 Method and apparatus for detecting a target in a video image
US9659364B2 (en) * 2010-03-11 2017-05-23 Koninklijke Philips N.V. Probabilistic refinement of model-based segmentation
US20120213426A1 (en) * 2011-02-22 2012-08-23 The Board Of Trustees Of The Leland Stanford Junior University Method for Implementing a High-Level Image Representation for Image Analysis

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101277394A (zh) * 2007-02-19 2008-10-01 精工爱普生株式会社 Information processing method, information processing device, and program
CN101996317A (zh) * 2010-11-01 2011-03-30 中国科学院深圳先进技术研究院 Method and apparatus for recognizing markers on a human body
CN102426653A (zh) * 2011-10-28 2012-04-25 西安电子科技大学 Static human body detection method based on second-generation Bandelet transform and star model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP2884428A4 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112395917A (zh) * 2019-08-15 2021-02-23 纳恩博(北京)科技有限公司 Region identification method and apparatus, storage medium, and electronic apparatus
CN112395917B (zh) * 2019-08-15 2024-04-12 纳恩博(北京)科技有限公司 Region identification method and apparatus, storage medium, and electronic apparatus

Also Published As

Publication number Publication date
EP2884428A1 (en) 2015-06-17
CN103679189B (zh) 2017-02-01
US9465992B2 (en) 2016-10-11
CN103679189A (zh) 2014-03-26
EP2884428A4 (en) 2015-10-21
US20150186726A1 (en) 2015-07-02

Similar Documents

Publication Publication Date Title
WO2014040559A1 (zh) Method and apparatus for scene recognition
Ji et al. Interactive body part contrast mining for human interaction recognition
Dantone et al. Human pose estimation using body parts dependent joint regressors
Wolf et al. Evaluation of video activity localizations integrating quality and quantity measurements
US9098740B2 (en) Apparatus, method, and medium detecting object pose
Nghiem et al. Head detection using kinect camera and its application to fall detection
Wang et al. Lying pose recognition for elderly fall detection
Gao et al. Multi-perspective and multi-modality joint representation and recognition model for 3D action recognition
Li et al. Modeling occlusion by discriminative and-or structures
Mousse et al. Percentage of human-occupied areas for fall detection from two views
Chen et al. TriViews: A general framework to use 3D depth data effectively for action recognition
Ren et al. 3d object detection with latent support surfaces
Iazzi et al. Fall detection based on posture analysis and support vector machine
Zhou et al. Human action recognition toward massive-scale sport sceneries based on deep multi-model feature fusion
Kinghorn et al. Deep learning based image description generation
CN113139415A (zh) 视频关键帧提取方法、计算机设备和存储介质
WO2014006786A1 (ja) 特徴量抽出装置および特徴量抽出方法
Ahmad et al. Embedded deep vision in smart cameras for multi-view objects representation and retrieval
Huang et al. Person re-identification based on hierarchical bipartite graph matching
Alese et al. Design and implementation of gait recognition system
Trong et al. A survey about view-invariant human action recognition
Shao et al. A comparative study of video-based object recognition from an egocentric viewpoint
Protopapadakis et al. Multidimensional trajectory similarity estimation via spatial-temporal keyframe selection and signal correlation analysis
Mousse et al. A multi-view human bounding volume estimation for posture recognition in elderly monitoring system
Ren et al. Human fall detection model with lightweight network and tracking in video

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13837155

Country of ref document: EP

Kind code of ref document: A1

REEP Request for entry into the european phase

Ref document number: 2013837155

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 2013837155

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: DE