WO2019007253A1 - Image recognition method, apparatus and device, and readable medium - Google Patents

Image recognition method, apparatus and device, and readable medium

Info

Publication number
WO2019007253A1
WO2019007253A1 (PCT/CN2018/093350; CN2018093350W)
Authority
WO
WIPO (PCT)
Prior art keywords
feature
feature set
image
target object
context
Prior art date
Application number
PCT/CN2018/093350
Other languages
French (fr)
Chinese (zh)
Inventor
李博 (Li Bo)
张伦 (Zhang Lun)
楚汝峰 (Chu Rufeng)
Original Assignee
阿里巴巴集团控股有限公司 (Alibaba Group Holding Limited)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Limited (阿里巴巴集团控股有限公司)
Publication of WO2019007253A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/40: Extraction of image or video features
    • G06V10/46: Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462: Salient features, e.g. scale invariant feature transforms [SIFT]

Definitions

  • the present application relates to the field of image processing technologies, and in particular, to an image recognition method, apparatus and device, and a readable medium.
  • image recognition technology is used to classify target objects, which has wide application value in products such as driverless cars and smart refrigerators.
  • In existing approaches, a feature extraction model is generally used to extract features from the entire image containing the target object and generate a feature image of the entire image, the feature image being composed of the extracted features.
  • The features include at least one of image features such as a color feature, a texture feature, a shape feature, and a spatial relationship feature. A single fixed-size rectangular frame is then used to frame the target object (such as a car, food, etc.) in the feature image, and the framed features are selected as the target features, which are input into a classification model for classification.
  • the same target object may appear in different areas of the image containing the target object.
  • For example, most of the ingredients in a smart refrigerator are placed in the refrigerator at random by the user, and information about the food inside the refrigerator is obtained from the image.
  • Current image recognition techniques are prone to erroneous recognition results when identifying such images.
  • the present application provides an image recognition method, apparatus and device, and a readable medium.
  • an image recognition method including the steps of: acquiring an image to be identified; obtaining a feature image of the image to be identified, where the feature image is used to describe features of the image; selecting at least two feature sets describing the target object from the obtained feature image; and identifying the target object based on the selected feature sets.
  • an electronic device including:
  • a memory that stores processor executable instructions
  • the processor is coupled to the memory for reading program instructions stored in the memory and, in response, performing the following operations: acquiring an image to be identified; obtaining a feature image of the image to be identified; selecting at least two feature sets describing the target object from the obtained feature image; and identifying the target object based on the selected feature sets.
  • an image recognition apparatus including:
  • An image acquisition module configured to acquire an image to be identified
  • a feature extraction module configured to obtain a feature image of the image to be identified, wherein the feature image is used to describe a feature of the image to be identified;
  • a feature selection module configured to select at least two feature sets describing the target object from the obtained feature images
  • a target recognition module is configured to identify the target object based on the selected feature set.
  • one or more machine-readable media having stored thereon instructions that, when executed by one or more processors, cause a terminal device to perform the method described above.
  • Because, when selecting features capable of describing the target object from the feature image, multiple feature sets are selected from different regions of the feature image, similar target objects at different positions in the image can be effectively represented, thereby enabling the target object to be identified accurately.
  • FIG. 1 is a flowchart of an image recognition method according to an exemplary embodiment of the present application.
  • FIG. 2a is a block diagram of a system for image recognition shown in an exemplary embodiment of the present application
  • FIG. 2b is an interaction diagram of an image recognition method according to another exemplary embodiment of the present application.
  • 2c is a schematic diagram of a pooling operation and an implementation process of adjusting pixels in an image recognition method according to an exemplary embodiment of the present application;
  • 2d is a schematic diagram of a target recognition process in an image recognition method illustrated by an exemplary embodiment of the present application
  • FIG. 3 is a logic block diagram of an image recognition apparatus according to an exemplary embodiment of the present application.
  • FIG. 4 is a hardware configuration diagram of an image recognition apparatus according to an exemplary embodiment of the present application.
  • Although the terms first, second, third, etc. may be used to describe various information in this application, such information should not be limited to these terms. These terms are only used to distinguish the same type of information from each other.
  • first information may also be referred to as the second information without departing from the scope of the present application.
  • second information may also be referred to as the first information.
  • The word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
  • FIG. 1 is a flowchart of an image recognition method according to an exemplary embodiment of the present application.
  • the embodiment can be applied to various electronic devices having image processing functions, and may include the following steps S101-S104:
  • Step S101 Acquire an image to be identified.
  • Step S102 Obtain a feature image of the image to be identified, and the feature image is used to describe a feature of the image to be identified.
  • Step S103 Select at least two feature sets describing the target object from the obtained feature images.
  • Step S104 Identify the target object based on the selected feature set.
  • the acquired image may be an image directly collected by an image acquisition module (such as a camera), or may be image data after image preprocessing.
  • The image preprocessing mentioned herein may include processing that is useful for improving image recognition accuracy, for example: color space transformation of scene text images, position correction of word images in text images, denoising of character images, and the like.
  • A feature extraction algorithm such as a convolutional neural network model, a classifier, or a multi-level network structure may be used to extract features and generate a feature image; each region of the feature image contains the various extracted features.
  • When the target object is identified, the specific location of the target object also needs to be located.
  • a full convolutional neural network capable of effectively retaining the position information of the target object may be used to characterize the acquired image.
  • the full convolutional neural network may include a full convolutional layer of AlexNet, GoogleNet, VGGNet, ResNet, or other convolutional neural network models.
  • Features describing the target object may be selected from the obtained feature image on a per-region basis, and the selected features within the same region constitute one feature set describing the target object. Different feature sets may contain different numbers of features, and the selected feature sets may occupy regions of different sizes in the feature image.
  • The convolution result of a region containing the target object is greater than a predetermined threshold. This threshold can be set by the designer based on the distribution of the convolution results of positive and negative samples on the validation set when training the classifier and the feature extraction model; its value is generally greater than or equal to 0 and less than 1, such as 0.3 or 0.5.
  • the feature set whose convolution operation result is greater than a predetermined threshold may be used as a feature set describing the target object by the following sliding window technique:
  • a plurality of candidate feature sets are selected from the obtained feature image by using sliding windows of various sizes; the sliding window sizes may include 8*16, 8*8, 16*16, 16*32, 16*8, 32*64, 32*32, 32*16, and 64*32.
  • a convolution operation is performed on the selected candidate feature set.
  • a candidate feature set having a convolution operation result greater than a predetermined threshold is selected as a feature set describing the target object.
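  • The sliding-window step above can be sketched as follows. This is a minimal illustration only, assuming a 2-D NumPy feature map; the stride value is an assumption not specified in the text, and the window sizes are the nine sizes listed above.

```python
import numpy as np

# The nine sliding-window sizes (h, w) listed in the text.
WINDOW_SIZES = [(8, 16), (8, 8), (16, 16), (16, 32), (16, 8),
                (32, 64), (32, 32), (32, 16), (64, 32)]

def candidate_feature_sets(feature_map, stride=8):
    """Slide windows of each size over the 2-D feature map and
    yield (cx, cy, w, h) region identifiers with their patches."""
    H, W = feature_map.shape
    for (h, w) in WINDOW_SIZES:
        for y in range(0, H - h + 1, stride):
            for x in range(0, W - w + 1, stride):
                patch = feature_map[y:y + h, x:x + w]
                # Region identifier [cx, cy, w0, h0] as in the text.
                yield (x + w / 2, y + h / 2, w, h), patch

feature_map = np.random.rand(64, 64)
candidates = list(candidate_feature_sets(feature_map))
print(len(candidates))
```

Each candidate patch would then be scored by the convolution operation described in the text before thresholding.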
  • The feature image may be a convolutional feature block obtained by performing feature extraction on the acquired image with the full convolutional neural network. When feature sets describing the target object are selected using sliding windows, the convolutional feature block is searched for regions that may contain the target object: for each position on the block, rectangular frames (sliding windows) of sizes 8*16, 8*8, 16*16, 16*32, 16*8, 32*64, 32*32, 32*16, and 64*32 are used to mark rectangular regions.
  • Each rectangular region can be marked with a region identifier such as a 4-dimensional vector [cx, cy, w0, h0], where (cx, cy) indicates the coordinates of the center point of the rectangle, and w0 and h0 represent the width and height of the rectangle, corresponding to its size.
  • Because the height and width of the convolutional feature block are large, marking every position on the block with rectangular frames (sliding windows) of sizes 8*16, 8*8, 16*16, 16*32, 16*8, 32*64, 32*32, 32*16, and 64*32 and then extracting the features in each rectangular region would produce a very large number of feature sets. The convolution computation would be correspondingly large, and many feature sets describing the target object could result, increasing the computational cost of the image recognition process and reducing image recognition efficiency.
  • To address this, considering that the convolution result of a feature set describing the target object is larger than that of feature sets describing non-target objects, the feature sets whose convolution results exceed the predetermined threshold may be determined as candidate feature sets, and from the determined candidates the top N feature sets by convolution result are selected as the feature sets describing the target object, where N is greater than 1 and smaller than the total number of determined candidate feature sets. When the number of candidate feature sets is large, N may be 300.
  • In implementation, each feature set composed of the features in each rectangular region may be input into a predetermined feature set screening model. The screening model performs a convolution operation on each feature set, determines the feature sets whose convolution results exceed the predetermined threshold as candidate feature sets, and selects from them the top N feature sets by convolution result as the feature sets describing the target object, where N is greater than 1 and less than the total number of determined candidate feature sets.
  • The designer of the present application can determine the specific value of N, such as 300, according to the application scenario and the computing power of the electronic device running the image recognition method of the present application.
  • the feature set screening model mentioned here may be a deep neural network model, a network model of a multi-level structure, or a probability model based on image color, edge, and super-pixel features.
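  • The threshold-then-top-N screening described above can be sketched as follows. The scores here are illustrative stand-ins for the convolution results, and the threshold and N values are examples from the text.

```python
def screen_feature_sets(scored_sets, threshold=0.5, top_n=300):
    """scored_sets: list of (score, feature_set) pairs, where score is
    the convolution result of the candidate feature set.
    Keep candidates above the threshold, then return the top N by score."""
    candidates = [(s, f) for s, f in scored_sets if s > threshold]
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return candidates[:top_n]

# Toy example: four candidates, keep the top 2 above threshold 0.5.
scored = [(0.9, "A"), (0.2, "B"), (0.7, "C"), (0.55, "D")]
print(screen_feature_sets(scored, threshold=0.5, top_n=2))
```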
  • The acquired image may include a variety of background information in addition to the target object, and this background information may interfere with target recognition to a certain degree.
  • Contextual features may be added during the target recognition process, the added contextual features including local context features and/or global context features.
  • the context feature of the feature set is selected from the feature image, and then the target object is identified according to the selected feature set and the context feature.
  • With context features, the target recognition process can use more features related to the target object, and implausible recognition results can be excluded. For example, a ship usually appears together with the sea; if a ship is detected together with a tree, a target object recognition error is indicated.
  • When the context features of a feature set are selected from the feature image (the context features of a feature set may refer to the context features corresponding to the different features of the target object in the feature set), the region to which the local context features belong may be formed by taking the center point of the region to which the feature set describing the target object belongs as the reference point and enlarging each side length of that region by 0.5 times; the features of this enlarged region are then extracted as the local context features of the feature set.
  • The side length of the region to which the local context features belong is thus 1.5 times the side length of the region to which the feature set belongs, so it can include more features related to the target object, which helps recognize target objects of relatively small size.
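  • The 0.5x side-length enlargement about the region's center can be written directly from the description above; region identifiers follow the [cx, cy, w0, h0] convention used in the text.

```python
def local_context_region(cx, cy, w0, h0, expand=0.5):
    """Enlarge each side of the feature set's region by `expand` times
    about its center point (cx, cy). With expand=0.5 the resulting
    side lengths are 1.5x the originals, as described in the text."""
    return (cx, cy, w0 * (1 + expand), h0 * (1 + expand))

# A region centered at (10, 20) of size 8x16 grows to 12x24.
print(local_context_region(10.0, 20.0, 8.0, 16.0))
```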
  • After the feature sets and context features are obtained, the target object may be identified based on them, for example by inputting the selected feature sets describing the target object and the context features into a trained classifier for target classification. However, this operation faces a heavy computational load, and the classifier is prone to overfitting when computing over a large number of features.
  • the present application may perform a pooling operation on the feature set and the context feature of the feature set, and then identify the target object according to the feature set and the context feature obtained by the pooling operation.
  • The pooling mentioned here is used to reduce the dimensionality of the feature set and the probability of overfitting by aggregating the statistics of features at different locations. For example, when the pooling operation is performed, each feature within a certain area of the feature set can be replaced by the average (or maximum) of the features in that area.
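  • The average-or-maximum pooling just described can be sketched as follows. This is a generic NumPy illustration of non-overlapping area pooling, not the patent's exact operator; the 2x2 area size is an assumption.

```python
import numpy as np

def pool(block, size=2, mode="max"):
    """Downsample a 2-D feature block by aggregating non-overlapping
    size x size areas with their max (or average)."""
    H, W = block.shape
    H2, W2 = H // size, W // size
    # Reshape so each pooling area occupies axes 1 and 3, then reduce.
    areas = block[:H2 * size, :W2 * size].reshape(H2, size, W2, size)
    if mode == "max":
        return areas.max(axis=(1, 3))
    return areas.mean(axis=(1, 3))

x = np.array([[1., 2., 5., 6.],
              [3., 4., 7., 8.]])
print(pool(x, 2, "max"))
print(pool(x, 2, "mean"))
```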
  • the specified features extracted by each feature extraction channel may be pooled separately;
  • the specified features extracted by the feature extraction channel have different coordinates in the feature image.
  • For example, the region to which each feature set belongs within the convolutional feature block can be divided into three parts. When the pooling operation is performed, the features (designated features) in the first part, extracted by the first feature extraction channel of the full convolutional network, are pooled separately; the features (designated features) in the second part, extracted by the second feature extraction channel, are pooled separately; and the features (designated features) in the third part, extracted by the third feature extraction channel, are pooled separately.
  • the feature set and the context feature obtained by the pooling operation may also be adjusted to matched pixels; and then the target object is identified according to the adjusted feature set and the context feature.
  • The matched pixel sizes are generally smaller than the pixel sizes of the feature sets, and the designer of the present application can determine the matched pixel sizes according to the application scenario and the computing power of the electronic device running the image recognition method of the present application.
  • In some scenarios, the matched pixel sizes may include at least two of 3*12, 12*3, 5*10, 10*5, and 7*7.
  • The selected feature set is adjusted to the matched pixel size, and the target object is then identified according to the adjusted feature set.
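  • Adjusting a feature set to one of the matched pixel sizes can be sketched as an adaptive average pooling. This is an assumption for illustration: the patent text does not fix the resizing operator, and the bin-splitting scheme below is one common choice.

```python
import numpy as np

# Matched pixel sizes listed in the text.
MATCHED_SIZES = [(3, 12), (12, 3), (5, 10), (10, 5), (7, 7)]

def adjust_to_pixels(block, out_h, out_w):
    """Adaptive average pooling: map an H x W block to out_h x out_w
    by averaging over roughly equal bins along each axis."""
    H, W = block.shape
    ys = np.linspace(0, H, out_h + 1).astype(int)
    xs = np.linspace(0, W, out_w + 1).astype(int)
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = block[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].mean()
    return out

block = np.random.rand(32, 16)
print(adjust_to_pixels(block, 7, 7).shape)
```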
  • The context features of a feature set at a matched pixel size may be used as the features describing one branch of the target object. The number of features of the branch is H0 × W0 × ((3 × hi × wi) × (C+1)), where H0, W0, C0 respectively represent the height, width, and channel number of the feature image (e.g., the convolutional feature block); hi × wi ∈ {3×12, 12×3, 5×10, 10×5, 7×7}; C represents the number of categories of the target object, and the +1 counts the background as an additional category. Each position on hi × wi is a 3×(C+1)-dimensional vector, i.e., it includes three (C+1)-dimensional vectors.
  • The pixel-adjusted features may be input into the target recognition model. In the process of identifying the target, the target recognition model generates, for each feature set of each branch and the context features of that feature set, a category vector and a position offset vector of the region to which the feature set belongs.
  • the target recognition model mentioned here may be a classification model such as a classifier.
  • The length of the category vector may be (C+1), and each vector element may represent the probability pj, j ∈ {0, ..., C}, that the target object belongs to a certain category, where 0 represents the background class.
  • the target recognition model determines a final target class vector and target position offset vector based on predetermined vector screening criteria.
  • The position offset vector may be a 4-dimensional vector [Δx, Δy, Δw, Δh] representing the position offset of the region to which the feature set belongs; its four dimensions correspond to the four dimensions of the region identifier. The region identifier of the target object after adjustment by the offset is [cx + w0·Δx, cy + h0·Δy, w0·Δw, h0·Δh].
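  • The adjustment formula above translates directly into code; the function name and example values are illustrative only.

```python
def apply_offset(region, offset):
    """region: [cx, cy, w0, h0]; offset: [dx, dy, dw, dh].
    Returns [cx + w0*dx, cy + h0*dy, w0*dw, h0*dh], as in the text."""
    cx, cy, w0, h0 = region
    dx, dy, dw, dh = offset
    return [cx + w0 * dx, cy + h0 * dy, w0 * dw, h0 * dh]

# Shift an 8x16 region centered at (10, 20) by a small predicted offset.
print(apply_offset([10, 20, 8, 16], [0.1, -0.05, 1.2, 0.9]))
```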
  • For example, the category vector with the largest vector element may be selected from the category vectors corresponding to the feature sets of each branch as the finally identified target category vector, and the position offset vector of the feature set corresponding to the target category vector is then the finally identified position offset vector.
  • When the category vector with the largest vector element is selected as the finally recognized target category vector, it can be selected according to the following formula:
  • score represents a category vector
  • the elements on each dimension of the vector represent the probability that the target object belongs to the corresponding category
  • C represents the number of categories
  • A represents the sub-index (the number of types of predetermined pixels).
  • For example, if there are 2 types of target objects, one being a dog and the other a cat, then C = 2: the first dimension of the category vector indicates the probability that the target object belongs to the cat category, and the second dimension indicates the probability that it belongs to the dog category.
  • Each branch has a maximum value (for example, score2, score3, ...). The largest of these maxima is selected as the final maximum value, and the category vector to which the selected final maximum value belongs is determined as the target category vector, from which the category of the target object can be determined.
  • the target category vector may also be determined based on the mean, minimum, and median of all dimensions of the category vector.
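  • The default selection rule above, taking each branch's maximum element and then the largest of those maxima, can be sketched as follows; the branch score vectors are illustrative.

```python
def select_target_category(branch_scores):
    """branch_scores: list of per-branch category vectors (lists of
    per-category values). Pick the vector containing the single
    largest element across all branches, and that element's index."""
    best = max(branch_scores, key=max)
    return best, best.index(max(best))

# Three branches; the second holds the overall largest element (0.9).
branches = [[0.1, 0.6, 0.3], [0.2, 0.2, 0.9], [0.4, 0.1, 0.5]]
vec, idx = select_target_category(branches)
print(vec, idx)
```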
  • the target recognition model may output the target class vector and the target position offset vector as the recognition result, where the target class vector is the category vector to which the largest vector element of the class vector corresponding to each feature set belongs.
  • the target position offset vector is a position offset vector of a feature set corresponding to the target category vector.
  • If the target category vector is ci, i ∈ {1, ..., C+1}, its vector elements are not probability values of the target object belonging to the corresponding categories.
  • In that case, Softmax can be used to convert the target category vector into a probability-form target category vector Pi:

    Pi = exp(ci) / Σj exp(cj), j ∈ {1, ..., C+1}
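  • The Softmax conversion can be sketched as follows; the max-subtraction step is a standard numerical-stability detail, not something stated in the text.

```python
import math

def softmax(c):
    """Convert a raw target category vector c into probability form:
    P_i = exp(c_i) / sum_j exp(c_j)."""
    m = max(c)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in c]
    total = sum(exps)
    return [e / total for e in exps]

p = softmax([2.0, 1.0, 0.1])
print(p, sum(p))
```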
  • The category to which the target object belongs can then be obtained. Combining the initial position [cx, cy, w0, h0] of the region to which the feature set corresponding to the target category vector belongs with the target position offset vector in the recognition result gives the position of the target object in the image: [cx + w0·Δx, cy + h0·Δy, w0·Δw, h0·Δh].
  • The image recognition of the embodiment of the present application can detect the category and location of all target objects in an image. If the image to be recognized is an image of the storage compartment of a smart refrigerator, the target objects are the foods stored in the compartment.
  • Based on the image recognition results, statistics on relevant information in the smart refrigerator field can further be collected, such as the number of ingredients in the same category and the number of ingredients across all categories. Based on these statistics, food can be managed accurately and intelligently: for example, the operating mode of the refrigerator can be changed to keep the food in its best storage state; users can learn the quantity, preservation state, and quality of the food in the refrigerator anytime, anywhere through a mobile phone or computer; and users can be reminded to replenish food regularly.
  • In the field of driverless cars, the road conditions in front of the automobile can be accurately recognized, and corresponding driving operations can be performed based on them, for example bypassing obstacles during unmanned driving.
  • FIG. 2a is a block diagram of a system 200 for implementing image recognition according to an exemplary embodiment of the present application.
  • The system 200 is applicable to various electronic devices having image processing functions and may include, connected in sequence, a camera 210, a full convolutional neural network 220, a feature set generation module 230, a feature set screening model 240, a pooling operation module 260, a pixel adjustment module 270, and a target recognition model 280. It further includes a context acquisition module 250 connected to the full convolutional neural network 220, the feature set generation module 230, the feature set screening model 240, and the pooling operation module 260.
  • the camera 210 directly captures an image corresponding to the scene.
  • In other examples, another image capturing device may be used instead of the camera 210 to collect images of the corresponding scene.
  • the full convolutional neural network 220 performs feature extraction on the image acquired by the image acquisition module 210 to generate a convolutional feature block (feature image).
  • the feature set generating module 230 is configured to extract features from regions of the convolution feature block where the target object may exist to form a feature set.
  • the feature set screening model 240 is configured to filter out feature sets capable of better describing the target object from the extracted feature sets.
  • the context obtaining module 250 is configured to extract, according to the selected region of each feature set, the context feature of each selected feature set from the convolution feature block.
  • The pooling operation module 260 is configured to perform pooling operations on the feature sets describing the target object and their context features, respectively, to reduce the number of features and the computational load of the target recognition process, thereby improving the efficiency of image recognition.
  • the pixel adjustment module 270 is configured to adjust the feature set and the context feature after the pooling operation to the matched pixels, respectively.
  • the target recognition model 280 is configured to identify the category of the target object based on the pixel-adjusted feature, and in some examples, may further be used to locate the location of the target object within the image.
  • Assume the designer of the present application applies image recognition to the smart refrigerator in advance, with rectangular frames (sliding windows) of sizes 8*16, 8*8, 16*16, 16*32, 16*8, 32*64, 32*32, 32*16, and 64*32, the matched pixel sizes set to 3*12, 12*3, 5*10, 10*5, and 7*7, and the predetermined vector screening criterion being to choose the largest vector element.
  • The camera 210 installed in the smart refrigerator takes a picture inside the refrigerator to generate the image to be recognized (step S201) and transmits the image to the full convolutional neural network 220 (step S202). The full convolutional neural network 220 performs feature extraction on the image to generate a convolutional feature block (S203) and transmits the block to the feature set generation module 230 and the context acquisition module 250 (S204, S205). The feature set generation module 230 extracts candidate feature sets from the convolutional feature block using rectangular frames of various sizes.
  • The feature sets capable of describing the target object are selected by computing the convolution of each feature set (S208), and the selected feature sets are sent to the context acquisition module 250 and the pooling operation module 260 (S209, S210). The context acquisition module 250 requests from the feature set generation module 230 the region identifiers of the regions to which the feature sets describing the target object belong (S211), and the feature set generation module 230 sends the corresponding region identifiers to the context acquisition module 250 in response to the request (S212).
  • The context acquisition module 250 then determines, based on the received region identifiers, the region identifiers of the regions in the convolutional feature block to which the local context features of the feature sets belong (S213). When determining the region identifier of the region to which a local context feature belongs, each side length of the region to which the feature set belongs may be enlarged by 0.5 times about the region's center point.
  • In step S214, the context acquisition module 250 extracts the local context features from the corresponding regions of the convolutional feature block based on the determined region identifiers.
  • The entire convolutional feature block can also be determined as a global context feature.
  • In step S215, the context acquisition module 250 sends the extracted context features to the pooling operation module 260.
  • In step S216, the pooling operation module 260 performs pooling operations on the received feature sets and context features, respectively.
  • In step S217, the pooling operation module 260 delivers the pooled feature sets and context features to the pixel adjustment module 270.
  • In step S218, the pixel adjustment module 270 adjusts the received feature sets and context features to the corresponding matched pixel sizes.
  • The process of the pooling operations and pixel adjustment can be seen in FIG. 2c.
  • the product of w and h in FIG. 2c represents the specific value of the matched pixel, and FIG. 2c only shows a feature set describing the target object.
  • In the pooling and pixel adjustment process, the feature set consists of a first group of features 510, a second group of features 520, and a third group of features 530: the first group 510 comprises the features extracted and output by the first feature extraction channel of the full convolutional neural network 220, the second group 520 those of the second feature extraction channel, and the third group 530 those of the third feature extraction channel.
  • Before the pooling operation, the region of the feature set is divided into three parts by the two broken lines in the figure: the top part is the first part, the part between the two broken lines is the second part, and the bottom part is the third part. The first part of the first group of features 510 is pooled separately, the second part of the second group of features 520 is pooled separately, and the third part of the third group of features 530 is pooled separately.
  • The features generated by these separate pooling operations are then adjusted to the matched pixel size, producing the pixel-adjusted feature set composed of the fourth group of features 540, the fifth group of features 550, and the sixth group of features 560 shown in FIG. 2c: the first group 510 becomes the fourth group 540 after pooling and adjustment, the second group 520 becomes the fifth group 550, and the third group 530 becomes the sixth group 560.
  • the pooling operation and the pixel adjustment process of other feature sets and context features are similar to those shown in FIG. 2c, and details are not described herein again.
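As an illustration only (not part of the claimed embodiments), the region-wise pooling and pixel adjustment described above can be sketched in Python as follows; the pooling window, the use of averaging as the pooling statistic, and nearest-neighbour resizing are assumptions, since the text does not fix them:

```python
import numpy as np

def pool_region(features, pool=2):
    # Average-pool a 2-D feature region with a pool x pool window (stride = pool).
    h, w = features.shape
    h2, w2 = h // pool, w // pool
    return features[:h2 * pool, :w2 * pool].reshape(h2, pool, w2, pool).mean(axis=(1, 3))

def adjust_pixels(features, out_h, out_w):
    # Resize pooled features to the matched pixel size (out_h x out_w)
    # by nearest-neighbour sampling (the resize method is an assumption).
    h, w = features.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return features[rows][:, cols]

# One group of features (e.g. the first set of features 510), split into three
# horizontal parts by the two break lines of FIG. 2c; only the part assigned
# to this group is pooled, then adjusted to the matched pixels w * h = 4 * 4.
group = np.arange(96, dtype=float).reshape(12, 8)
top_part, middle_part, bottom_part = np.array_split(group, 3, axis=0)
pooled = pool_region(top_part)          # the part pooled for this group
adjusted = adjust_pixels(pooled, 4, 4)  # pixel-adjusted result (4 x 4)
```

The other groups would be processed the same way on their respective parts, yielding the pixel-adjusted feature set.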
  • in step S219, the pixel adjustment module 270 delivers the pixel-adjusted feature sets and context features to the target recognition model 280.
  • the target recognition model 280 identifies the target object based on the input feature set and the context feature, and outputs the target class vector and the target position offset vector of the target object.
  • the category vectors 611, 612, 613 may be (C+1)-dimensional vectors, where each element represents the probability p_j, j ∈ {0, ..., C}, that the target object belongs to a certain category, with j = 0 representing the background class; the position offset vectors 614, 615, 616 may be 4-dimensional vectors, each such vector representing the position offset [Δx, Δy, Δw, Δh] of the region to which the feature set belongs.
  • the target recognition model 280 filters the category vectors 611, 612, 613 and other category vectors not shown based on predetermined vector screening criteria to determine a final target category vector 621 and a target position offset vector 622.
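By way of illustration, decoding one prediction from a category vector and a position offset vector can be sketched as follows; the additive offset parameterization and the argmax selection are assumptions for the sketch, since the text does not specify the screening criteria in detail:

```python
import numpy as np

def decode_prediction(category_vec, offset_vec, region):
    # category_vec: (C+1)-dimensional probabilities p_j, j in {0, ..., C},
    # with j = 0 the background class; offset_vec: [dx, dy, dw, dh] applied
    # to the region identifier [cx, cy, w, h] (additive offsets assumed).
    j = int(np.argmax(category_vec))
    cx, cy, w, h = region
    dx, dy, dw, dh = offset_vec
    refined = [cx + dx, cy + dy, w + dw, h + dh]
    return j, float(category_vec[j]), refined

label, prob, box = decode_prediction(
    np.array([0.1, 0.7, 0.2]),        # background, class 1, class 2
    np.array([1.0, -2.0, 4.0, 0.0]),
    region=[50.0, 40.0, 16.0, 32.0],
)
```

A screening step such as the one performed by the target recognition model 280 would then keep only the best-scoring decoded predictions.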
  • the image recognition method of the present application, when selecting features capable of describing the target object from the feature image, selects multiple feature sets from different regions of the feature image; these feature sets can effectively represent target objects of the same kind at different positions in the image, so the target object can be recognized more accurately.
  • in view of the differences in shape and size among target objects, the image recognition method of the present application may select multiple feature sets from multiple regions of different sizes in the feature image when extracting the features describing the target object, using feature sets of different pixel sizes to describe target objects of different sizes and shapes. It may also identify the target object by combining the local context features and global context features of the feature sets describing the target object, so that the target object can be recognized and/or located more accurately.
  • the amount of computation of the image recognition process can be further reduced, and the recognition efficiency improved.
  • when the image recognition method of the embodiments of the present application is applied in various scenarios, it is likely to face large-scale data similar to Internet data, and the real-time requirements of the application are high.
  • in this case, C/C++ or assembly language can be used to implement the program instructions corresponding to the image recognition method of the present application.
  • the present application also provides an embodiment of the image recognition apparatus.
  • FIG. 3 is a logic block diagram of an image recognition apparatus according to an exemplary embodiment of the present application.
  • the apparatus may include an image acquisition module 310, a feature extraction module 320, a feature selection module 330, and a target recognition module 340.
  • the image obtaining module 310 is configured to acquire an image to be identified.
  • the feature extraction module 320 is configured to obtain a feature image of the image to be identified, and the feature image is used to describe a feature of the image to be identified.
  • the feature selection module 330 is configured to select at least two feature sets describing the target object from the obtained feature images.
  • the target identification module 340 is configured to identify the target object based on the selected feature set.
  • the regions of the feature image to which the selected feature sets belong have different sizes.
  • the size of the region to which the feature set belongs in the feature image may include:
  • the image recognition apparatus of the present application may further include:
  • a context selection module configured to select a context feature of the feature set from the feature image.
  • the target recognition module 340 is further configured to identify the target object according to the selected feature set and the context feature.
  • the contextual features include local context features and/or global context features.
  • the side length of the region to which the local context feature of the feature set belongs is 1.5 times the side length of the region to which the feature set belongs.
  • the image recognition apparatus of the present application may further include:
  • the pooling operation module is configured to perform a pooling operation on the selected feature set and the context feature of the feature set respectively.
  • the target recognition module 340 is further configured to identify the target object according to the feature set and the context feature obtained by the pooling operation.
  • the image recognition apparatus of the present application may further include:
  • a pixel adjustment module configured to adjust the feature set and the context feature obtained by the pooling operation to matched pixels.
  • the target recognition module 340 is further configured to identify the target object according to the adjusted feature set and the context feature.
  • the pooling operation module is further configured to perform a pooling operation on the specified features extracted by each feature extraction channel when performing the pooling operation on the selected feature set and the context feature of the feature set respectively;
  • the specified features extracted by the different feature extraction channels have different coordinates in the feature image.
  • the target recognition module 340 can also be used to:
  • the target object is identified based on the adjusted feature set.
  • matching pixels include at least two of the following:
  • since the device embodiments basically correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant details.
  • the device embodiments described above are merely illustrative. The units or modules described as separate components may or may not be physically separate, and the components displayed as units or modules may or may not be physical units or modules; they may be located in one place, or distributed across multiple network units or modules. Some or all of the modules may be selected according to actual needs to achieve the objectives of the present application. Those of ordinary skill in the art can understand and implement the solution without creative effort.
  • Embodiments of the image recognition apparatus of the present application can be applied to an electronic device.
  • This can be implemented by a computer chip or an entity, or by a product having a certain function.
  • in an example, the electronic device is a computer, and the specific form of the computer may be a personal computer, a laptop computer, a cellular phone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email transceiver device, a game console, a tablet, a wearable device, an Internet TV, a smart locomotive, a driverless car, a smart refrigerator, another smart home device, or a combination of any of these devices.
  • the device embodiments may be implemented by software, or by hardware or a combination of hardware and software.
  • taking software implementation as an example, as a logical device, the apparatus is formed by the processor of the electronic device in which it is located reading the corresponding computer program instructions from a readable medium, such as a non-volatile memory, into memory for execution.
  • at the hardware level, as shown in FIG. 4, which is a hardware structure diagram of the electronic device in which the image recognition apparatus of the present application is located, in addition to the processor, memory, network interface, and non-volatile memory shown in FIG. 4, the electronic device in the embodiment may also include other hardware according to its actual functions, and details are not described herein again.
  • the electronic device may include a memory storing processor-executable instructions; the processor may be coupled to the memory, and be configured to read the program instructions stored in the memory and, in response, perform the following operations: acquiring an image to be identified; obtaining a feature image of the image to be identified, the feature image being used to describe features of the image to be identified; selecting at least two feature sets describing the target object from the obtained feature image; and identifying the target object based on the selected feature sets.
  • the embodiments of the present application further provide a computer storage medium, where the storage medium stores program instructions, the program instructions including: acquiring an image to be identified; obtaining a feature image of the image to be identified, the feature image being used to describe features of the image to be identified; selecting at least two feature sets describing the target object from the obtained feature image; and identifying the target object based on the selected feature sets.
  • Embodiments of the present application may take the form of a computer program product embodied on one or more storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) in which program code is embodied.
  • Computer-usable storage media include both permanent and non-persistent, removable and non-removable media, and information storage can be implemented by any method or technology.
  • information can be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic tape cartridges, magnetic tape storage or other magnetic storage devices, or any other non-transmission media that can be used to store information accessible by a computing device.
  • for the operations performed by the processor, reference may be made to the related description in the foregoing method embodiments, and details are not described herein again.

Abstract

The present application provides an image recognition method, apparatus and device, and a readable medium. The method comprises: obtaining an image to be recognized; obtaining a feature image of the image to be recognized, the feature image being used for describing features of the image to be recognized; selecting at least two feature sets describing target objects from the obtained feature image; and recognizing the target objects on the basis of the selected feature sets. By implementing the present application, when features capable of describing a target object are selected from a feature image, multiple feature sets are selected from different regions in the feature image and can effectively represent the same type of target objects at different positions in the image, and therefore, the target objects can be more accurately recognized.

Description

Image recognition method, apparatus and device, and readable medium
This application claims priority to Chinese Patent Application No. 201710546203.4, filed on July 6, 2017 and entitled "Image recognition method, apparatus and device, and readable medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of image processing technologies, and in particular, to an image recognition method, apparatus and device, and a readable medium.
Background
With the development of computer technology and the wide application of computer vision principles, using image recognition technology to classify target objects has broad application value in products such as driverless cars and smart refrigerators. When current image recognition technology recognizes a target object, a feature extraction model is generally first used to extract features from the entire image containing the target object and generate a feature image of the entire image. The feature image is composed of the extracted features, which include at least one of image features such as color features, texture features, shape features, and spatial relationship features. A single fixed-size rectangular frame is then used to frame the features describing the target object (such as a car, food, etc.) in the feature image, the framed features are selected as target features, and the target features are input into a classification model for classification.
However, when image recognition technology is applied to some products, the same kind of target object may appear in different regions of the captured image containing the target objects. For example, the ingredients in a smart refrigerator are mostly placed in the refrigerator at random by the user, and therefore appear at random positions in an image captured of the ingredients inside the smart refrigerator. Current image recognition techniques are prone to erroneous recognition results when recognizing such images.
Summary
In view of this, the present application provides an image recognition method, apparatus and device, and a readable medium.
According to a first aspect of the embodiments of the present application, an image recognition method is provided, including the steps of:
acquiring an image to be identified;
obtaining a feature image of the image to be identified, the feature image being used to describe features of the image to be identified;
selecting at least two feature sets describing a target object from the obtained feature image;
identifying the target object based on the selected feature sets.
According to a second aspect of the embodiments of the present application, an electronic device is provided, including:
a processor; and
a memory storing processor-executable instructions;
wherein the processor is coupled to the memory, and is configured to read the program instructions stored in the memory and, in response, perform the following operations:
acquiring an image to be identified;
obtaining a feature image of the image to be identified, the feature image being used to describe features of the image to be identified;
selecting at least two feature sets describing a target object from the obtained feature image;
identifying the target object based on the selected feature sets.
According to a third aspect of the embodiments of the present application, an image recognition apparatus is provided, including:
an image acquisition module, configured to acquire an image to be identified;
a feature extraction module, configured to obtain a feature image of the image to be identified, the feature image being used to describe features of the image to be identified;
a feature selection module, configured to select at least two feature sets describing a target object from the obtained feature image;
a target recognition module, configured to identify the target object based on the selected feature sets.
According to a fourth aspect of the embodiments of the present application, one or more machine-readable media are provided, having instructions stored thereon that, when executed by one or more processors, cause a terminal device to perform the method described above.
By implementing the embodiments provided by the present application, when features capable of describing a target object are selected from the feature image, multiple feature sets are selected from different regions of the feature image; these feature sets can effectively represent target objects of the same kind at different positions in the image, so the target object can be recognized more accurately.
Brief Description of the Drawings
FIG. 1 is a flowchart of an image recognition method according to an exemplary embodiment of the present application;
FIG. 2a is a block diagram of a system for image recognition according to an exemplary embodiment of the present application;
FIG. 2b is an interaction diagram of an image recognition method according to another exemplary embodiment of the present application;
FIG. 2c is a schematic diagram of the pooling operation and the pixel adjustment process in an image recognition method according to an exemplary embodiment of the present application;
FIG. 2d is a schematic diagram of the target recognition process in an image recognition method according to an exemplary embodiment of the present application;
FIG. 3 is a logic block diagram of an image recognition apparatus according to an exemplary embodiment of the present application;
FIG. 4 is a hardware structure diagram of an image recognition apparatus according to an exemplary embodiment of the present application.
Detailed Description
Exemplary embodiments will be described in detail herein, examples of which are illustrated in the accompanying drawings. Where the following description refers to the drawings, the same numbers in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present application; rather, they are merely examples of apparatuses and methods consistent with some aspects of the present application as detailed in the appended claims.
The terminology used in the present application is for the purpose of describing particular embodiments only and is not intended to limit the present application. The singular forms "a", "the" and "said" used in the present application and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in the present application to describe various information, such information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of the present application, first information may also be referred to as second information, and similarly, second information may also be referred to as first information. Depending on the context, the word "if" as used herein may be interpreted as "when", "while", or "in response to determining".
Referring to FIG. 1, FIG. 1 is a flowchart of an image recognition method according to an exemplary embodiment of the present application. This embodiment can be applied to various electronic devices having image processing functions, and may include the following steps S101 to S104:
Step S101: acquire an image to be identified.
Step S102: obtain a feature image of the image to be identified, the feature image being used to describe features of the image to be identified.
Step S103: select at least two feature sets describing a target object from the obtained feature image.
Step S104: identify the target object based on the selected feature sets.
In the embodiments of the present application, the acquired image may be an image directly captured by an image acquisition module (such as a camera), or image data after image preprocessing. The image preprocessing mentioned here may include image processing that is beneficial to improving recognition accuracy, for example, color space transformation of scene text images, position correction of word images in text word images, and denoising of character images.
For the acquired image, a feature extraction algorithm such as a convolutional neural network model, a classifier, or a multi-level network structure may be used to extract features and generate a feature image, each region of which contains the various extracted features.
In some examples, recognizing the target object requires locating its specific position. In order to locate the target object accurately, a fully convolutional neural network, which can effectively retain the position information of the target object, may be used to extract features from the acquired image. The fully convolutional neural network may include the fully convolutional layers of AlexNet, GoogleNet, VGGNet, ResNet, or other convolutional neural network models.
After the feature image is obtained, considering that target objects of different sizes and shapes may exist in the image, the feature sets describing the target object may be extracted on a region basis: features describing the target object are selected from the obtained feature image, and the features selected from the same region of the feature image constitute one feature set describing the target object. Regions of different sizes contain different amounts of features, so the selected feature sets belong to regions of different sizes in the feature image.
In general, the convolution result of a region containing the target object is greater than a predetermined threshold. The predetermined threshold can be set by the designer, when training the classifier and the feature extraction model, according to the distribution of the convolution results of positive and negative samples on the validation set; its value is generally greater than or equal to 0 and less than 1, for example 0.3 or 0.5. In some examples, feature sets whose convolution result is greater than the predetermined threshold can be selected as the feature sets describing the target object by the following sliding-window technique:
selecting multiple candidate feature sets from the obtained feature image using sliding windows of various sizes, where the sliding window sizes may include 8*16, 8*8, 16*16, 16*32, 16*8, 32*64, 32*32, 32*16, and 64*32;
performing a convolution operation on the selected candidate feature sets;
selecting the candidate feature sets whose convolution result is greater than the predetermined threshold as the feature sets describing the target object.
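The sliding-window selection steps above can be sketched in Python as follows; this is an illustrative sketch only, in which the stride, the scoring function standing in for the convolution operation, and the handling of window boundaries are assumptions not fixed by the text:

```python
import numpy as np

WINDOW_SIZES = [(8, 16), (8, 8), (16, 16), (16, 32), (16, 8),
                (32, 64), (32, 32), (32, 16), (64, 32)]  # (w0, h0)

def candidate_regions(feat_h, feat_w, stride=8):
    # Enumerate region identifiers [cx, cy, w0, h0] for every window size;
    # (cx, cy) is the window centre, w0/h0 its width and height.
    regions = []
    for w0, h0 in WINDOW_SIZES:
        for cy in range(h0 // 2, feat_h - h0 // 2 + 1, stride):
            for cx in range(w0 // 2, feat_w - w0 // 2 + 1, stride):
                regions.append((cx, cy, w0, h0))
    return regions

def select_feature_sets(feature_image, regions, score_fn, threshold=0.5):
    # Keep the candidate feature sets whose score exceeds the predetermined
    # threshold; score_fn stands in for the convolution operation.
    kept = []
    for cx, cy, w0, h0 in regions:
        patch = feature_image[cy - h0 // 2: cy + h0 // 2,
                              cx - w0 // 2: cx + w0 // 2]
        if score_fn(patch) > threshold:
            kept.append(((cx, cy, w0, h0), patch))
    return kept

feat = np.zeros((64, 64))
feat[8:24, 8:24] = 1.0  # a bright area where a target may exist
hits = select_feature_sets(feat, candidate_regions(64, 64), lambda p: p.mean())
```

On a uniform background no candidate exceeds the threshold, while windows over the bright area are kept.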
In practical applications, the feature image may be a convolutional feature block output by the fully convolutional neural network after feature extraction from the acquired image. When a sliding window is used to select the feature sets describing the target object, regions where the target object may exist can be searched for on the convolutional feature block. For each position on the convolutional feature block, rectangular frames (sliding windows) of sizes 8*16, 8*8, 16*16, 16*32, 16*8, 32*64, 32*32, 32*16, and 64*32 are used to mark out rectangular regions. Each rectangular region can be labeled with a region identifier such as a 4-dimensional vector [c_x, c_y, w_0, h_0], where (c_x, c_y) denotes the coordinates of the center point of the rectangular frame, and w_0 and h_0 denote the width and height of the rectangular frame, corresponding to its size.
The features in each marked rectangular region are then extracted, and the features in one rectangular region form one candidate feature set; that rectangular region is the region to which the feature set belongs in the feature image (the convolutional feature block). A convolution operation is then performed on each candidate feature set, the rectangular regions of the candidate feature sets whose convolution results are greater than the predetermined threshold are determined to be regions where the target object may exist, and those candidate feature sets are selected as the feature sets describing the target object. The finally selected feature sets may also be divided into different kinds of feature sets based on the sizes of the rectangular regions to which they belong. In other examples, the designer may set the side lengths of the sliding windows, and the ratios between different side lengths, to other values according to the specific application scenario of image recognition, which is not limited in this application.
In some examples, the height and width of the convolutional feature block are both large. If, for each position on the convolutional feature block, rectangular frames (sliding windows) of sizes 8*16, 8*8, 16*16, 16*32, 16*8, 32*64, 32*32, 32*16, and 64*32 are used to mark out rectangular regions and the features in every rectangular region are extracted, a large number of feature sets will be formed, the amount of convolution computation will be large, and the number of selected feature sets describing the target object may also be large, which in turn increases the computation of the image recognition process and reduces recognition efficiency. To solve these problems, considering that the convolution results of feature sets describing the target object are greater than those of feature sets of non-target objects, the feature sets whose convolution results exceed the predetermined threshold can be determined as candidate feature sets, and among the determined candidate feature sets, the N feature sets with the largest convolution results can be selected as the feature sets describing the target object, where N is greater than 1 and less than the total number of determined candidate feature sets; when the number of candidate feature sets is large, N may be 300.
In practical applications, the feature sets formed from the features in each rectangular region may be input into a predetermined feature-set screening model, which performs a convolution operation on each feature set, determines the feature sets whose convolution results exceed the predetermined threshold as candidate feature sets, and selects, among the determined candidate feature sets, the N feature sets with the largest convolution results as the feature sets describing the target object, where N is greater than 1 and less than the total number of determined candidate feature sets. The designer may determine the specific value of N, such as 300, according to the application scenario and the computing power of the electronic device running the image recognition method of the present application. The feature-set screening model mentioned here may be a deep neural network model, a network model with a multi-level structure, or a probability model based on image color, edge, and super-pixel features.
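The threshold-then-top-N screening described above can be sketched as a small helper (an illustration only; the screening model itself is not reproduced here):

```python
def top_n_feature_sets(candidates, scores, threshold=0.5, n=300):
    # Keep the candidates whose convolution score exceeds the predetermined
    # threshold, then select the n with the largest scores (e.g. n = 300).
    scored = [(s, c) for s, c in zip(scores, candidates) if s > threshold]
    scored.sort(key=lambda sc: sc[0], reverse=True)
    return [c for _, c in scored[:n]]

# Four candidate feature sets with their (stand-in) convolution scores:
selected = top_n_feature_sets(["a", "b", "c", "d"], [0.9, 0.4, 0.7, 0.8], n=2)
```

Here "b" is filtered out by the threshold, and only the two highest-scoring survivors are kept.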
In some scenarios, in addition to the target object, the acquired image may contain a variety of background information, which interferes with target recognition to some extent. In order to reduce the negative influence of the background information on the recognition process, context features may be added to the target recognition process; the added context features include local context features and/or global context features.
In practical applications, after the feature sets describing the target object are selected from the obtained feature image, the context features of the feature sets may be selected from the feature image, and the target object may then be identified according to the selected feature sets and context features. After context features are added, the target recognition process can handle more features related to the target object, which on the one hand facilitates identifying target objects of relatively small size, and on the other hand can rule out impossible target objects. For example, ships and the sea always appear together; if a ship is detected together with trees, the recognition of the target object is wrong.
In some examples, when the context features of a feature set are selected from the feature image (the context features of a feature set may refer to the context features corresponding to the different features describing the target object in that feature set), the following may be done after the feature sets describing the target object have been selected: for each selected feature set, take the center point of the region to which the feature set belongs as the reference point and increase the side length of that region by 0.5 times to form the region of the local context features, then extract the features of that region as the local context features of the feature set. After local context features are extracted in this way, the side length of the region to which a feature set's local context features belong is 1.5 times the side length of the region to which the feature set belongs, so it can contain more features related to the target object, which facilitates recognizing small target objects.
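The region enlargement described above can be sketched as follows. The `(cx, cy, w, h)` center/size box format and the function name are assumptions for illustration, not part of the application:

```python
def local_context_region(cx, cy, w, h, scale=0.5):
    """Enlarge a feature-set region about its center point.

    Increasing each side length by `scale` (0.5) times yields a
    context region whose sides are 1.5x the original, with the
    same center, as described in the text.
    """
    return (cx, cy, w * (1 + scale), h * (1 + scale))

# A 16x32 region centered at (40, 40) grows to 24x48, same center.
print(local_context_region(40, 40, 16, 32))  # → (40, 40, 24.0, 48.0)
```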
After the feature sets and context features describing the target object are extracted, the target object may be recognized based on these features, for example by inputting the extracted feature sets and context features into a trained classifier for target classification. Doing so directly, however, faces a huge computational burden, and the classifier is also prone to overfitting when operating on a large number of features. To solve this problem, in the present application a pooling operation may be performed on the feature sets and on the context features of the feature sets, respectively, and the target object is then recognized according to the pooled feature sets and context features. The pooling mentioned here is used to reduce the dimensionality of the feature sets and the probability of overfitting, and generally aggregates statistics of features at different positions; for example, during a pooling operation, the average (or maximum) of the features in a region of a feature set may replace those features.
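A minimal sketch of the mean/max pooling just described, assuming a region of features is represented as a plain nested list (an illustrative layout, not the application's data structure):

```python
def pool_region(features, mode="mean"):
    """Replace all features in a region with a single aggregate
    statistic (their mean or maximum), reducing dimensionality."""
    flat = [v for row in features for v in row]
    return max(flat) if mode == "max" else sum(flat) / len(flat)

region = [[1.0, 2.0],
          [3.0, 6.0]]
print(pool_region(region))          # mean pooling → 3.0
print(pool_region(region, "max"))   # max pooling  → 6.0
```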
In other examples, to further improve target recognition efficiency, when the feature sets and their context features are pooled, the specified features extracted by each feature extraction channel may be pooled separately; the specified features extracted by different feature extraction channels have different coordinates in the feature image. For example, if the feature image is a convolutional feature block extracted by a fully convolutional network, the region to which each feature set belongs within the convolutional feature block may be divided into three parts. During pooling, the features (the specified features) in the first part, extracted by the first feature extraction channel of the fully convolutional network, are pooled on their own; the features in the second part, extracted by the second feature extraction channel, are pooled on their own; and the features in the third part, extracted by the third feature extraction channel, are pooled on their own. After this, on the one hand, fewer layers of the deep neural network are needed for target recognition; on the other hand, the relative positional relationships of the target object can be recognized, which facilitates accurate localization of the target object.
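The channel-wise strip pooling above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the application's implementation: the three channels' features are plain nested lists, the region is split into three horizontal strips, and each channel is mean-pooled only over "its" strip:

```python
def position_sensitive_pool(channels):
    """Pool each feature-extraction channel over a different third
    of the region: channel 0 over the top strip, channel 1 over
    the middle strip, channel 2 over the bottom strip."""
    pooled = []
    for i, grid in enumerate(channels):
        h = len(grid)
        start, stop = i * h // 3, (i + 1) * h // 3
        strip = [v for row in grid[start:stop] for v in row]
        pooled.append(sum(strip) / len(strip))  # mean-pool the strip
    return pooled

# Three 3x2 channel grids; each channel keeps only its own strip.
c = [[[1, 1], [9, 9], [9, 9]],   # channel 0 → top strip, mean 1.0
     [[9, 9], [2, 2], [9, 9]],   # channel 1 → middle strip, mean 2.0
     [[9, 9], [9, 9], [3, 3]]]   # channel 2 → bottom strip, mean 3.0
print(position_sensitive_pool(c))  # → [1.0, 2.0, 3.0]
```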
In addition, to improve target recognition efficiency, the pooled feature sets and context features may also be adjusted to the matched pixel sizes, and the target object is then recognized according to the adjusted feature sets and context features. The matched pixel size is generally smaller than the pixel size of each feature set; the designer may determine the matched pixel sizes according to the application scenario and the computing power of the electronic device running the image recognition method of the present application. In some scenarios, considering that the image may contain target objects of various sizes and shapes, the matched pixel sizes may include at least two of 3*12, 12*3, 5*10, 10*5, and 7*7.
In addition, in the embodiments of the present application, when the target object is recognized based on the selected feature sets, the selected feature sets may be adjusted to the matched pixel sizes, and the target object is then recognized according to the adjusted feature sets.
If the matched pixel sizes take multiple values, the feature sets and context features at one matched pixel size may be used as the features of one branch describing the target object. The number of features of a branch is H_0×W_0×((3×h_i×w_i)×(C+1)), where H_0, W_0, and C_0 respectively represent the height, width, and number of channels of the feature image (e.g., the convolutional feature block), h_i×w_i ∈ {3×12, 12×3, 5×10, 10×5, 7×7}, C represents the number of categories of the target object, and +1 counts the background as an additional target category. Each position on h_i×w_i is a 3×(C+1)-dimensional vector, which comprises three (C+1)-dimensional vectors.
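The branch feature count above can be checked with a small computation. The feature-image size used here is hypothetical, chosen only to make the arithmetic concrete:

```python
def branch_feature_count(H0, W0, hi, wi, C):
    """Number of features in one branch: H0*W0*((3*hi*wi)*(C+1)),
    where +1 counts the background as an extra category."""
    return H0 * W0 * (3 * hi * wi) * (C + 1)

# For a hypothetical 40x40 feature image, the 7x7 matched pixel
# size, and C = 2 target categories:
print(branch_feature_count(40, 40, 7, 7, 2))  # → 705600
```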
After the pooling operation and pixel adjustment are performed in sequence on the extracted feature sets and context features, the pixel-adjusted features may be input into a target recognition model. During recognition, for each feature set of each branch, together with the context features of that feature set, the target recognition model produces a category vector and a position offset vector of the region to which the feature set belongs. The target recognition model mentioned here may be a classification model such as a classifier.
The length of the category vector may be (C+1), and each vector element may represent the probability p_j, j ∈ {0, ..., C}, that the target object belongs to a certain category, where 0 represents the background class. The target recognition model then determines a final target category vector and target position offset vector according to a predetermined vector screening criterion.
The position offset vector may be a 4-dimensional vector whose elements represent the position offset [Δx, Δy, Δw, Δh] of the region to which the feature set belongs. This offset vector corresponds to the 4-dimensional vector [cx, cy, w0, h0], where Δx, Δy, Δw, and Δh are the offsets by which cx, cy, w0, and h0 respectively need to be adjusted. After the position of the target object is adjusted, the corresponding vector is [cx + w0·Δx, cy + h0·Δy, w0·Δw, h0·Δh].
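The offset adjustment above can be sketched directly; the list representation of boxes and offsets is an assumption for illustration:

```python
def apply_offset(box, offset):
    """Adjust a region [cx, cy, w0, h0] by its predicted position
    offset [dx, dy, dw, dh], yielding
    [cx + w0*dx, cy + h0*dy, w0*dw, h0*dh] as in the text."""
    cx, cy, w0, h0 = box
    dx, dy, dw, dh = offset
    return [cx + w0 * dx, cy + h0 * dy, w0 * dw, h0 * dh]

# Shift a 20x10 box right/down and halve its size.
print(apply_offset([50.0, 30.0, 20.0, 10.0], [0.25, 0.5, 0.5, 0.5]))
# → [55.0, 35.0, 10.0, 5.0]
```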
If the predetermined vector screening criterion is to select the largest vector element, the category vector containing the largest vector element may be selected from the category vectors corresponding to the feature sets of each branch as the finally recognized target category vector, and the position offset vector of the feature set corresponding to that target category vector is taken as the finally recognized position offset vector. In some examples, when the category vector with the largest vector element is selected as the finally recognized target category vector, it may be selected according to the following formula:
score* = max_{a∈{1,...,A}} ( max_{j∈{0,...,C}} score_{a,j} )
Here, score represents a category vector, the element in each dimension represents the probability that the target object belongs to the corresponding category, C represents the number of categories, and A represents the number of branches (the number of predetermined pixel sizes). In one example, there are 2 classes of target object, one being dog and the other cat, so C=2; the first dimension of the category vector represents the likelihood that the target object belongs to the cat category, and the second dimension the likelihood that it belongs to the dog category. The category vector may be written as score=[0.3, 0.9]; the first max (inside the brackets of the above formula) takes the maximum of 0.3 and 0.9. After the first max, each branch has a maximum value, say score2, score3, ...; the second max (outside the brackets) then selects the overall maximum from score2, score3, ... of the different branches, and the category vector to which the selected overall maximum belongs is determined as the target category vector, which determines the category of the target object. In other examples, the target category vector may also be determined according to the mean, minimum, or median over all dimensions of the category vectors.
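The two-level max selection above can be sketched as follows. Plain nested lists stand in for the branches and their category vectors; this is an illustration of the screening criterion, not the application's implementation:

```python
def select_target_vector(branches):
    """Two-level max: first take the largest element of each
    category vector, then take the maximum across all vectors of
    all branches; return the winning category vector."""
    best_vec, best_val = None, float("-inf")
    for vectors in branches:
        for score in vectors:
            m = max(score)       # first max: within one vector
            if m > best_val:     # second max: across branches
                best_val, best_vec = m, score
    return best_vec

branches = [
    [[0.3, 0.9]],               # branch 1: per-vector max 0.9
    [[0.6, 0.2], [0.4, 0.5]],   # branch 2: per-vector maxes 0.6, 0.5
]
print(select_target_vector(branches))  # → [0.3, 0.9]
```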
After the target category vector is determined, the target recognition model may output the target category vector and the target position offset vector as the recognition result, where the target category vector is the category vector to which the largest vector element among the category vectors corresponding to the feature sets belongs, and the target position offset vector is the position offset vector of the feature set corresponding to the target category vector.
In some examples, the target category vector is c_i, i ∈ {1, ..., C+1}, and its vector elements are not probability values that the target object belongs to the corresponding categories. Before the target category vector is output, it may be converted into a probability-form target category vector by applying softmax. The softmax formula is as follows:
p_i = e^{c_i} / Σ_{j=1}^{C+1} e^{c_j}
where p_i is the probability-form target category vector.
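The softmax conversion can be written out directly; the sample input values are arbitrary:

```python
import math

def softmax(c):
    """Convert a raw target category vector c_i into the
    probability form p_i = exp(c_i) / sum_j exp(c_j)."""
    exps = [math.exp(x) for x in c]
    total = sum(exps)
    return [e / total for e in exps]

p = softmax([2.0, 1.0, 0.1])
print([round(x, 3) for x in p])  # probabilities summing to 1
```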
After the target recognition model outputs the recognition result, the category to which the target object belongs can be obtained. Then, combining the initial position [cx, cy, w0, h0] of the region of the feature set corresponding to the target category vector with the target position offset vector in the recognition result, the position of the target object in the image can be obtained as [cx + w0·Δx, cy + h0·Δy, w0·Δw, h0·Δh].
In summary, the image recognition of the embodiments of the present application can detect the categories and positions of all target objects in an image. If the image to be recognized is an image captured of the storage compartment of a smart refrigerator and the target objects are the food items stored there, then, based on the recognition results, relevant statistics can further be collected in the smart refrigerator field, such as counting the number of items of the same category or the number of items of all categories. Based on the statistics, the food can then be managed intelligently, accurately, and effectively: for example, the refrigerator's operating mode can be switched to always keep the food in the best storage state, users can check the quantity and freshness information of the food in the refrigerator anytime and anywhere via a mobile phone or computer, and users can be reminded to restock food regularly.
In addition, when the image recognition of the embodiments of the present application is applied to a driverless car, the road conditions in front of the car can be accurately recognized, and corresponding driving operations, such as avoiding obstacles while driving unmanned, can be performed based on the road conditions.
Referring to Fig. 2a, Fig. 2a is a block diagram of a system 200 for implementing image recognition according to an exemplary embodiment of the present application. The system 200 is applicable to various electronic devices with image processing capabilities and may include a camera 210, a fully convolutional neural network 220, a feature set generation module 230, a feature set screening model 240, a pooling operation module 260, a pixel adjustment module 270, and a target recognition model 280 connected in sequence, as well as a context acquisition module 250 connected to the fully convolutional neural network 220, the feature set generation module 230, the feature set screening model 240, and the pooling operation module 260, respectively.
The camera 210 directly captures images of the corresponding scene. In other examples, an image collection device may be used instead of the camera 210 to collect images of the corresponding scene from the corresponding area.
The fully convolutional neural network 220 performs feature extraction on the image acquired by the camera 210 to generate a convolutional feature block (the feature image).
The feature set generation module 230 is configured to extract features from the regions of the convolutional feature block where target objects may exist, forming the feature sets.
The feature set screening model 240 is configured to screen out, from the extracted feature sets, the feature sets that can better describe the target object.
The context acquisition module 250 is configured to extract the context features of each screened-out feature set from the convolutional feature block, based on the region to which that feature set belongs.
The pooling operation module 260 is configured to perform pooling operations on the feature sets and context features describing the target object, respectively, to reduce the feature quantity and the computational load of the target recognition process, thereby improving the accuracy of image recognition.
The pixel adjustment module 270 is configured to adjust the pooled feature sets and context features to the matched pixel sizes, respectively.
The target recognition model 280 is configured to recognize the category of the target object based on the pixel-adjusted features; in some examples, it may further be used to locate the position of the target object in the image.
An application example is described below with reference to Figs. 2a to 2d.
In this example, the designers of the present application applied image recognition to a smart refrigerator in advance, selecting rectangular boxes (sliding windows) of sizes 8*16, 8*8, 16*16, 16*32, 16*8, 32*64, 32*32, 32*16, and 64*32, setting the matched pixel sizes to the five values 3*12, 12*3, 5*10, 10*5, and 7*7, and setting the predetermined vector screening criterion to selecting the largest vector element.
The camera 210 installed in the smart refrigerator captures a picture of the interior of the refrigerator to generate the image to be recognized (step S201) and sends the image to the fully convolutional neural network 220 (step S202). The fully convolutional neural network 220 performs feature extraction on the image to generate a convolutional feature block (S203) and sends the convolutional feature block to the feature set generation module 230 and the context acquisition module 250 (S204, S205). The feature set generation module 230 extracts features from the convolutional feature block using rectangular boxes of various sizes to obtain the feature sets, stores the region identifier of the region to which each feature set belongs in the convolutional feature block (S206), and sends the obtained feature sets to the feature set screening model 240 (S207). The feature set screening model 240 selects the feature sets that can describe the target object by computing the convolution of each feature set (S208) and sends the selected feature sets to the context acquisition module 250 and the pooling operation module 260 (S209, S210). The context acquisition module 250 requests from the feature set generation module 230 the region identifiers of the regions to which the feature sets describing the target object belong (S211), and the feature set generation module 230 sends the corresponding region identifiers to the context acquisition module 250 in response to the request (S212). The context acquisition module 250 then determines, based on the received region identifiers, the region identifiers of the regions in the convolutional feature block to which the local context features of the feature sets describing the target object belong (S213); when determining the region identifier of the region of the local context features, the side length of the region of a feature set describing the target object may be enlarged by 0.5 times about the region's center point.
In step S214, the context acquisition module 250 extracts the local context features from the corresponding regions of the convolutional feature block based on the determined region identifiers. In other examples, the convolutional feature block may also be determined as the global context features.
In step S215, the context acquisition module 250 sends the extracted context features to the pooling operation module 260.
In step S216, the pooling operation module 260 performs pooling operations on the received feature sets and context features, respectively.
In step S217, the pooling operation module 260 delivers the pooled feature sets and context features to the pixel adjustment module 270.
In step S218, the pixel adjustment module 270 adjusts the received feature sets and context features to each of the matched pixel sizes, respectively.
In some examples, the process of pooling and pixel adjustment may be understood with reference to Fig. 2c, where the product of w and h represents the specific value of the matched pixel size. Fig. 2c shows the pooling and pixel adjustment process for only one feature set describing the target object. This feature set consists of a first group of features 510, a second group of features 520, and a third group of features 530: the first group 510 is extracted and output by the first feature extraction channel of the fully convolutional neural network 220, the second group 520 by the second feature extraction channel, and the third group 530 by the third feature extraction channel. Before the feature set is pooled, each of the three groups of features is divided equally into three parts by region, as shown by the three regions separated by the two dashed lines in the figure: the top region is the first part, the region between the two dashed lines is the second part, and the bottom region is the third part.
When the feature set is pooled, the first part of the first group of features 510 is pooled on its own, the second part of the second group of features 520 is pooled on its own, and the third part of the third group of features 530 is pooled on its own. The features generated by these separate pooling operations are then adjusted to the matched pixel size, generating the pixel-adjusted feature set composed of the fourth group of features 540, the fifth group of features 550, and the sixth group of features 560 shown in Fig. 2c: the first group 510 becomes the fourth group 540 after pooling and pixel adjustment, the second group 520 becomes the fifth group 550, and the third group 530 becomes the sixth group 560. The pooling and pixel adjustment processes for the other feature sets and for the context features are similar to that shown in Fig. 2c and are not repeated here.
In step S219, the pixel adjustment module 270 delivers the pixel-adjusted feature sets and context features to the target recognition model 280.
In step S220, the target recognition model 280 recognizes the target object based on the input feature sets and context features, and outputs the target category vector and target position offset vector of the target object.
The specific target recognition process may be understood with reference to Fig. 2d. The category vectors 611, 612, and 613 and the position offset vectors 614, 615, and 616 in Fig. 2d are produced by the target recognition model 280 for each feature set of the three branches (this example shows only three branches), together with the context features of those feature sets. The length of the category vectors 611, 612, and 613 may be (C+1), and each vector element may represent the probability p_j, j ∈ {0, ..., C}, that the target object belongs to a certain category, where 0 represents the background class. The position offset vectors 614, 615, and 616 may be 4-dimensional vectors whose elements represent the position offset [Δx, Δy, Δw, Δh] of the region to which the corresponding feature set belongs.
The target recognition model 280 screens the category vectors 611, 612, and 613 and other category vectors not shown according to the predetermined vector screening criterion, and determines a final target category vector 621 and target position offset vector 622.
It can be seen from the above embodiments that, when the image recognition method of the present application selects features capable of describing the target object from the feature image, it selects multiple feature sets from different regions of the feature image, which can effectively represent target objects of the same kind at different positions in the image, so the target object can be recognized more accurately.
Furthermore, to account for differences in the shape and size of target objects, the image recognition method of the present application may, when extracting features describing the target object, select multiple feature sets from multiple regions of different sizes in the feature image, using feature sets of different pixel sizes to describe target objects of different sizes and shapes. It may also recognize the target object by combining the local context features and global context features of the feature sets describing the target object; the category of the target object can thus be obtained, and/or the target object located, more accurately. Moreover, by performing pooling operations and pixel adjustment on the feature sets describing the target object and on their local and global context features, the computational cost of the image recognition process can be further reduced and the recognition efficiency improved.
In addition, when the image recognition method of the embodiments of the present application is applied to various scenarios, it is likely to face large-scale data similar to Internet data, and the applications have high real-time requirements. To meet these requirements, the program instructions corresponding to the image recognition method of the present application may be implemented in C/C++ or assembly language.
Corresponding to the embodiments of the aforementioned image recognition method, the present application further provides embodiments of an image recognition apparatus.
Referring to Fig. 3, Fig. 3 is a logic block diagram of an image recognition apparatus according to an exemplary embodiment of the present application. The apparatus may include an image acquisition module 310, a feature extraction module 320, a feature selection module 330, and a target recognition module 340.
The image acquisition module 310 is configured to acquire the image to be recognized.
The feature extraction module 320 is configured to obtain the feature image of the image to be recognized, where the feature image is used to describe the features of the image to be recognized.
The feature selection module 330 is configured to select at least two feature sets describing the target object from the obtained feature image.
The target recognition module 340 is configured to recognize the target object based on the selected feature sets.
In some examples, the regions to which the selected feature sets belong in the feature image have different sizes.
As an example, the sizes of the regions to which the feature sets belong in the feature image may include:
8*16, 8*8, 16*16, 16*32, 16*8, 32*64, 32*32, 32*16, 64*32.
In some examples, the image recognition apparatus of the present application may further include:
a context selection module, configured to select the context features of the feature sets from the feature image.
The target recognition module 340 may be further configured to recognize the target object according to the selected feature sets and context features.
As an example, the context features include local context features and/or global context features.
As an example, the side length of the region to which the local context features of a feature set belong is 1.5 times the side length of the region to which the feature set belongs.
In some examples, the image recognition apparatus of the present application may further include:
a pooling operation module, configured to perform pooling operations on the selected feature sets and on the context features of the feature sets, respectively.
目标识别模块340还可以用于根据所述池化操作所得的特征集和上下文特征对目标 对象进行识别。The target recognition module 340 is further configured to identify the target object according to the feature set and the context feature obtained by the pooling operation.
一些例子中,本申请的图像识别装置还可以包括:In some examples, the image recognition apparatus of the present application may further include:
像素调整模块,用于将所述池化操作所得的特征集和上下文特征调整到匹配的像素。And a pixel adjustment module, configured to adjust the feature set and the context feature obtained by the pooling operation to matched pixels.
目标识别模块340还可以用于根据调整后的特征集和上下文特征对目标对象进行识别。The target recognition module 340 is further configured to identify the target object according to the adjusted feature set and the context feature.
作为例子,所述池化操作模块在对选取的特征集、以及所述特征集的上下文特征分别进行池化操作时,还用于对各特征提取通道所提取的指定特征分别进行池化操作;不同特征提取通道所提取的指定特征在所述特征图像中的坐标不同。As an example, the pooling operation module is further configured to perform a pooling operation on the specified features extracted by each feature extraction channel when performing the pooling operation on the selected feature set and the context feature of the feature set respectively; The specified features extracted by the different feature extraction channels have different coordinates in the feature image.
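One way to read the channel-wise pooling above is that each extraction channel pools a feature at its own coordinates inside the region. The sketch below makes that concrete with an assumed grid assignment per channel; this is a hypothetical reading for illustration, not the claimed implementation.

```python
import numpy as np

def per_channel_pool(feat, box, grid=(2, 2)):
    """feat: (C, H, W) feature image; box: (y, x, h, w) region.
    Channel c max-pools only the grid cell it is assigned to, so different
    channels pool features at different coordinates (assumed assignment)."""
    y, x, h, w = box
    gh, gw = grid
    out = np.empty(feat.shape[0], dtype=feat.dtype)
    for c in range(feat.shape[0]):
        gy, gx = divmod(c % (gh * gw), gw)
        y0, y1 = y + gy * h // gh, y + (gy + 1) * h // gh
        x0, x1 = x + gx * w // gw, x + (gx + 1) * w // gw
        out[c] = feat[c, y0:y1, x0:x1].max()
    return out
```

With a 2*2 grid, channel 0 pools the top-left quarter of the region while channel 1 pools the top-right quarter, i.e. the pooled coordinates differ per channel.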
In some examples, the target recognition module 340 may further be configured to:
adjust the selected feature sets to matched pixel sizes;
recognize the target object based on the adjusted feature sets.
As an example, the matched pixel sizes include at least two of the following:
3*12, 12*3, 5*10, 10*5, 7*7.
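Adjusting differently sized feature sets to one of the matched pixel shapes can be sketched with a simple adaptive max pooling. The max-pooling rule is an assumption; the text only requires that the adjusted outputs land on one of the listed shapes.

```python
import numpy as np

MATCHED_SHAPES = [(3, 12), (12, 3), (5, 10), (10, 5), (7, 7)]

def adaptive_max_pool(feat2d, out_h, out_w):
    """Pool an arbitrary (H, W) feature set down to (out_h, out_w);
    the pooling rule (max over near-equal bins) is an assumption."""
    in_h, in_w = feat2d.shape
    out = np.empty((out_h, out_w), dtype=feat2d.dtype)
    for i in range(out_h):
        y0 = i * in_h // out_h
        y1 = max((i + 1) * in_h // out_h, y0 + 1)
        for j in range(out_w):
            x0 = j * in_w // out_w
            x1 = max((j + 1) * in_w // out_w, x0 + 1)
            out[i, j] = feat2d[y0:y1, x0:x1].max()
    return out
```

After this step, a 32*64 feature set and a 16*8 feature set can both be adjusted to, say, 7*7, so they feed a recognizer with fixed input dimensions.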
For the implementation of the functions and roles of the units (or modules) in the above apparatus, refer to the implementation of the corresponding steps in the above method; details are not repeated here.
Since the apparatus embodiments substantially correspond to the method embodiments, reference may be made to the description of the method embodiments for the relevant parts. The apparatus embodiments described above are merely illustrative: the units or modules described as separate components may or may not be physically separate, and the components shown as units or modules may or may not be physical units or modules; that is, they may be located in one place or distributed across multiple network units or modules. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solution of the present application, which persons of ordinary skill in the art can understand and implement without creative effort.
The embodiments of the image recognition apparatus of the present application can be applied to an electronic device, and may be implemented by a computer chip or entity, or by a product having a certain function. In a typical implementation, the electronic device is a computer, which may take the form of a personal computer, laptop computer, cellular phone, camera phone, smartphone, personal digital assistant, media player, navigation device, email transceiver device, game console, tablet computer, wearable device, Internet television, smart locomotive, driverless car, smart refrigerator, other smart home device, or any combination of these devices.
The apparatus embodiments may be implemented by software, by hardware, or by a combination of software and hardware. Taking a software implementation as an example, a logical apparatus is formed by the processor of the electronic device in which it is located reading the corresponding computer program instructions from a readable medium, such as a non-volatile memory, into memory and running them. At the hardware level, FIG. 4 shows a hardware structural diagram of the electronic device in which the image recognition apparatus of the present application is located; in addition to the processor, memory, network interface, and non-volatile memory shown in FIG. 4, the electronic device in the embodiments may further include other hardware according to its actual functions, which is not repeated here. The memory of the electronic device may store processor-executable instructions, and the processor may be coupled to the memory and configured to read the program instructions stored in the memory and, in response, perform the following operations: acquiring an image to be recognized; obtaining a feature image of the image to be recognized, the feature image describing features of the image to be recognized; selecting, from the obtained feature image, at least two feature sets describing a target object; and recognizing the target object based on the selected feature sets.
In addition, an embodiment of the present application further provides a computer storage medium storing program instructions, the program instructions including:
acquiring an image to be recognized;
obtaining a feature image of the image to be recognized, the feature image describing features of the image to be recognized;
selecting, from the obtained feature image, at least two feature sets describing a target object;
recognizing the target object based on the selected feature sets.
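Taken together, the four program-instruction steps can be sketched end to end. The feature extractor, the candidate regions, and the mean-score recognizer below are simplified stand-ins chosen for illustration, not the claimed convolutional model.

```python
import numpy as np

def extract_feature_image(image):
    # Stand-in for a real feature extractor (e.g. a CNN backbone).
    return image.astype(np.float32)

def select_feature_sets(feat, boxes):
    # Crop at least two candidate regions out of the feature image.
    return [feat[y:y + h, x:x + w] for (y, x, h, w) in boxes]

def recognize(feature_sets, labels):
    # Assumed scoring rule: pick the label of the highest-mean feature set.
    scores = [float(fs.mean()) for fs in feature_sets]
    best = int(np.argmax(scores))
    return labels[best], scores[best]

image = np.zeros((64, 64))
image[8:16, 8:24] = 1.0                      # synthetic "target object"
boxes = [(8, 8, 8, 16), (32, 32, 16, 16)]    # two feature sets, sizes differ
feats = select_feature_sets(extract_feature_image(image), boxes)
label, score = recognize(feats, ["target", "background"])
```

Here the 8*16 feature set covering the bright region scores highest, so the synthetic target is recognized; in the disclosed scheme the scoring would come from the trained network rather than a mean.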
Embodiments of the present application may take the form of a computer program product implemented on one or more storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing program code. Computer-usable storage media include permanent and non-permanent, removable and non-removable media, and information storage may be implemented by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission media that can be used to store information accessible by a computing device.
In other embodiments, for the operations performed by the processor, reference may be made to the related description in the above method embodiments, which is not repeated here.
The above are merely preferred embodiments of the present application and are not intended to limit the present application. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within the scope of protection of the present application.

Claims (43)

  1. An image recognition method, comprising the steps of:
    acquiring an image to be recognized;
    obtaining a feature image of the image to be recognized, the feature image describing features of the image to be recognized;
    selecting, from the obtained feature image, at least two feature sets describing a target object; and
    recognizing the target object based on the selected feature sets.
  2. The method according to claim 1, wherein after the feature sets describing the target object are selected from the obtained feature image, the method further comprises:
    selecting context features of the feature sets from the feature image;
    and wherein recognizing the target object based on the selected feature sets comprises:
    recognizing the target object according to the selected feature sets and the context features.
  3. The method according to claim 2, wherein the context features comprise local context features and/or global context features.
  4. The method according to claim 3, wherein the side length of the region to which a local context feature of a feature set belongs is 1.5 times the side length of the region to which that feature set belongs.
  5. The method according to claim 2, wherein recognizing the target object according to the selected feature sets and the context features comprises:
    performing pooling operations on the selected feature sets and on the context features of the feature sets; and
    recognizing the target object according to the feature sets and context features resulting from the pooling operations.
  6. The method according to claim 5, wherein when the pooling operations are performed on the selected feature sets and on the context features of the feature sets, the specified features extracted by each feature extraction channel are pooled separately, the specified features extracted by different feature extraction channels having different coordinates in the feature image.
  7. The method according to claim 5, wherein after the pooling operations are performed on the selected feature sets and on the context features of the feature sets, the method further comprises:
    adjusting the feature sets and context features resulting from the pooling operations to matched pixel sizes;
    and wherein recognizing the target object according to the feature sets and context features resulting from the pooling operations comprises:
    recognizing the target object according to the adjusted feature sets and context features.
  8. The method according to claim 1, wherein recognizing the target object based on the selected feature sets comprises:
    adjusting the selected feature sets to matched pixel sizes; and
    recognizing the target object based on the adjusted feature sets.
  9. The method according to claim 7 or 8, wherein the matched pixel sizes include at least two of the following:
    3*12, 12*3, 5*10, 10*5, 7*7.
  10. The method according to claim 1, wherein the regions of the feature image to which the selected feature sets belong differ in size.
  11. The method according to claim 10, wherein the sizes of the regions of the feature image to which the feature sets belong include:
    8*16, 8*8, 16*16, 16*32, 16*8, 32*64, 32*32, 32*16, 64*32.
  12. The method according to claim 1, wherein the feature image is obtained by any one of the following:
    a convolutional neural network model, a classifier, or a multi-level network structure.
  13. The method according to claim 1, wherein the convolution operation results of the feature sets describing the target object are greater than a predetermined threshold.
  14. The method according to claim 1, wherein the convolution operation results of the feature sets describing the target object are greater than the convolution operation results of the feature sets of other, non-target objects.
  15. The method according to claim 1, wherein the recognition result comprises a target category vector and a target position offset vector of the target object, the target category vector being the category vector, among the category vectors corresponding to the feature sets, that contains the largest vector element, and the target position offset vector being the position offset vector of the feature set corresponding to the target category vector.
  16. The method according to claim 1, wherein the image to be recognized is an image obtained by photographing a storage compartment of a smart refrigerator, and the target object is food stored in the storage compartment of the smart refrigerator.
  17. An electronic device, comprising:
    a processor; and
    a memory storing processor-executable instructions;
    wherein the processor is coupled to the memory and configured to read the program instructions stored in the memory and, in response, perform the following operations:
    acquiring an image to be recognized;
    obtaining a feature image of the image to be recognized;
    selecting, from the obtained feature image, at least two feature sets describing a target object, the regions of the feature image to which the selected feature sets belong differing in size; and
    recognizing the target object based on the selected feature sets.
  18. The electronic device according to claim 17, wherein the processor is further configured to perform the following operations:
    selecting context features of the feature sets from the feature image; and
    recognizing the target object according to the selected feature sets and the context features.
  19. The electronic device according to claim 18, wherein the context features comprise local context features and/or global context features.
  20. The electronic device according to claim 19, wherein the side length of the region to which a local context feature of a feature set belongs is 1.5 times the side length of the region to which that feature set belongs.
  21. The electronic device according to claim 18, wherein the processor is further configured to perform the following operations:
    performing pooling operations on the selected feature sets and on the context features of the feature sets; and
    recognizing the target object according to the feature sets and context features resulting from the pooling operations.
  22. The electronic device according to claim 21, wherein the processor is further configured to perform the following operation:
    when the pooling operations are performed on the selected feature sets and on the context features of the feature sets, pooling separately the specified features extracted by each feature extraction channel, the specified features extracted by different feature extraction channels having different coordinates in the feature image.
  23. The electronic device according to claim 21, wherein the processor is further configured to perform the following operations:
    adjusting the feature sets and context features resulting from the pooling operations to matched pixel sizes; and
    recognizing the target object according to the adjusted feature sets and context features.
  24. The electronic device according to claim 17, wherein the processor is further configured to perform the following operations:
    adjusting the selected feature sets to matched pixel sizes; and
    recognizing the target object based on the adjusted feature sets.
  25. The electronic device according to claim 23 or 24, wherein the matched pixel sizes include at least two of the following:
    3*12, 12*3, 5*10, 10*5, 7*7.
  26. The electronic device according to claim 17, wherein the regions of the feature image to which the selected feature sets belong differ in size.
  27. The electronic device according to claim 26, wherein the sizes of the regions of the feature image to which the feature sets belong include:
    8*16, 8*8, 16*16, 16*32, 16*8, 32*64, 32*32, 32*16, 64*32.
  28. The electronic device according to claim 17, wherein the feature image is obtained by any one of the following:
    a convolutional neural network model, a classifier, or a multi-level network structure.
  29. The electronic device according to claim 17, wherein the convolution operation results of the feature sets describing the target object are greater than a predetermined threshold.
  30. The electronic device according to claim 17, wherein the convolution operation results of the feature sets describing the target object are greater than the convolution operation results of the feature sets of other, non-target objects.
  31. The electronic device according to claim 17, wherein the recognition result comprises a target category vector and a target position offset vector of the target object, the target category vector being the category vector, among the category vectors corresponding to the feature sets, that contains the largest vector element, and the target position offset vector being the position offset vector of the feature set corresponding to the target category vector.
  32. The electronic device according to claim 17, wherein the image to be recognized is an image obtained by photographing a storage compartment of a smart refrigerator, and the target object is food stored in the storage compartment of the smart refrigerator.
  33. An image recognition apparatus, comprising:
    an image acquisition module, configured to acquire an image to be recognized;
    a feature extraction module, configured to obtain a feature image of the image to be recognized, the feature image describing features of the image to be recognized;
    a feature selection module, configured to select, from the obtained feature image, at least two feature sets describing a target object; and
    a target recognition module, configured to recognize the target object based on the selected feature sets.
  34. The apparatus according to claim 33, wherein the regions of the feature image to which the selected feature sets belong differ in size.
  35. The apparatus according to claim 34, wherein the sizes of the regions of the feature image to which the feature sets belong include:
    8*16, 8*8, 16*16, 16*32, 16*8, 32*64, 32*32, 32*16, 64*32.
  36. The apparatus according to claim 33, further comprising:
    a context selection module, configured to select context features of the feature sets from the feature image;
    wherein the target recognition module is further configured to recognize the target object according to the selected feature sets and the context features.
  37. The apparatus according to claim 36, wherein the context features comprise local context features and/or global context features.
  38. The apparatus according to claim 36, further comprising:
    a pooling operation module, configured to perform pooling operations on the selected feature sets and on the context features of the feature sets;
    wherein the target recognition module is further configured to recognize the target object according to the feature sets and context features resulting from the pooling operations.
  39. The apparatus according to claim 38, wherein the pooling operation module is further configured to: when performing the pooling operations on the selected feature sets and on the context features of the feature sets, pool separately the specified features extracted by each feature extraction channel, the specified features extracted by different feature extraction channels having different coordinates in the feature image.
  40. The apparatus according to claim 38, wherein the target recognition module is further configured to:
    adjust the feature sets and context features resulting from the pooling operations to matched pixel sizes; and
    recognize the target object according to the adjusted feature sets and context features.
  41. The apparatus according to claim 33, wherein the target recognition module is further configured to:
    adjust the selected feature sets to matched pixel sizes; and
    recognize the target object based on the adjusted feature sets.
  42. The apparatus according to claim 40 or 41, wherein the matched pixel sizes include at least two of the following:
    3*12, 12*3, 5*10, 10*5, 7*7.
  43. One or more machine-readable media having instructions stored thereon that, when executed by one or more processors, cause a terminal device to perform the method according to any one of claims 1 to 16.
PCT/CN2018/093350 2017-07-06 2018-06-28 Image recognition method, apparatus and device, and readable medium WO2019007253A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710546203.4A CN109214403B (en) 2017-07-06 2017-07-06 Image recognition method, device and equipment and readable medium
CN201710546203.4 2017-07-06

Publications (1)

Publication Number Publication Date
WO2019007253A1 true WO2019007253A1 (en) 2019-01-10

Family

ID=64949696

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/093350 WO2019007253A1 (en) 2017-07-06 2018-06-28 Image recognition method, apparatus and device, and readable medium

Country Status (2)

Country Link
CN (1) CN109214403B (en)
WO (1) WO2019007253A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110991460A (en) * 2019-10-16 2020-04-10 北京航空航天大学 Image recognition processing method, device, equipment and storage medium
CN111325263A (en) * 2020-02-14 2020-06-23 腾讯科技(深圳)有限公司 Image processing method and device, intelligent microscope, readable storage medium and equipment
CN111798018A (en) * 2019-04-09 2020-10-20 Oppo广东移动通信有限公司 Behavior prediction method, behavior prediction device, storage medium and electronic equipment
CN112270671A (en) * 2020-11-10 2021-01-26 杭州海康威视数字技术股份有限公司 Image detection method, image detection device, electronic equipment and storage medium

Families Citing this family (4)

Publication number Priority date Publication date Assignee Title
CN110223279B (en) * 2019-05-31 2021-10-08 上海商汤智能科技有限公司 Image processing method and device and electronic equipment
SG10201913005YA (en) * 2019-12-23 2020-09-29 Sensetime Int Pte Ltd Method, apparatus, and system for recognizing target object
CN111860687A (en) * 2020-07-31 2020-10-30 中国铁塔股份有限公司 Image identification method and device, electronic equipment and storage medium
CN113537309B (en) * 2021-06-30 2023-07-28 北京百度网讯科技有限公司 Object identification method and device and electronic equipment

Citations (3)

Publication number Priority date Publication date Assignee Title
US20030179931A1 (en) * 2002-03-19 2003-09-25 Hung-Ming Sun Region-based image recognition method
CN106529467A (en) * 2016-11-07 2017-03-22 南京邮电大学 Group behavior identification method based on multi-feature fusion
CN106803090A (en) * 2016-12-05 2017-06-06 中国银联股份有限公司 A kind of image-recognizing method and device

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
JP4569837B2 (en) * 2007-03-30 2010-10-27 アイシン・エィ・ダブリュ株式会社 Feature information collecting apparatus and feature information collecting method
CN105005794B (en) * 2015-07-21 2018-06-05 太原理工大学 Merge the image pixel semanteme marking method of more granularity contextual informations

Cited By (7)

Publication number Priority date Publication date Assignee Title
CN111798018A (en) * 2019-04-09 2020-10-20 Oppo广东移动通信有限公司 Behavior prediction method, behavior prediction device, storage medium and electronic equipment
CN110991460A (en) * 2019-10-16 2020-04-10 北京航空航天大学 Image recognition processing method, device, equipment and storage medium
CN110991460B (en) * 2019-10-16 2023-11-21 北京航空航天大学 Image recognition processing method, device, equipment and storage medium
CN111325263A (en) * 2020-02-14 2020-06-23 腾讯科技(深圳)有限公司 Image processing method and device, intelligent microscope, readable storage medium and equipment
CN111325263B (en) * 2020-02-14 2023-04-07 腾讯科技(深圳)有限公司 Image processing method and device, intelligent microscope, readable storage medium and equipment
CN112270671A (en) * 2020-11-10 2021-01-26 杭州海康威视数字技术股份有限公司 Image detection method, image detection device, electronic equipment and storage medium
CN112270671B (en) * 2020-11-10 2023-06-02 杭州海康威视数字技术股份有限公司 Image detection method, device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN109214403B (en) 2023-02-28
CN109214403A (en) 2019-01-15

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18827536

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18827536

Country of ref document: EP

Kind code of ref document: A1