WO2019007253A1

WO2019007253A1 - Image recognition method, apparatus and device, and readable medium

Info

Publication number: WO2019007253A1
Application number: PCT/CN2018/093350
Authority: WO
Inventors: 李博; 张伦; 楚汝峰
Original assignee: 阿里巴巴集团控股有限公司
Priority date: 2017-07-06
Filing date: 2018-06-28
Publication date: 2019-01-10
Also published as: CN109214403B; CN109214403A

Abstract

The present application provides an image recognition method, apparatus and device, and a readable medium. The method comprises: obtaining an image to be recognized; obtaining a feature image of the image to be recognized, the feature image being used for describing features of the image to be recognized; selecting at least two feature sets describing target objects from the obtained feature image; and recognizing the target objects on the basis of the selected feature sets. By implementing the present application, when features capable of describing a target object are selected from a feature image, multiple feature sets are selected from different regions in the feature image and can effectively represent the same type of target objects at different positions in the image, and therefore, the target objects can be more accurately recognized.

Description

Image recognition method, device and device, readable medium

The present application claims priority to Chinese Patent Application No. PCT Application No. No. No. No. No. No. No. No. No. .

Technical field

The present application relates to the field of image processing technologies, and in particular, to an image recognition method, apparatus and device, and a readable medium.

Background technique

With the development of computer technology and the wide application of computer vision principles, image recognition technology is used to classify target objects, which has wide application value in products such as driverless cars and smart refrigerators. When the current image recognition technology recognizes the target object, the feature extraction model is generally used to extract the feature of the entire image containing the target object, and a feature image of the entire image is generated, and the feature image is composed of the extracted features and extracted. The feature includes at least one of image features such as a color feature, a texture feature, a shape feature, and a spatial relationship feature, and then a single fixed-size rectangular frame is used to frame the feature object (such as a car, food, etc.) in the feature image. Then, the selected feature is selected as the target feature, and then the target feature is input into the classification model for classification.

However, when image recognition technology is applied to some products, the same target object may appear in different areas of the image containing the target object. For example, most of the ingredients in the smart refrigerator are randomly placed in the refrigerator by the user. The food inside the refrigerator is obtained from the image. Current image recognition techniques are prone to erroneous recognition results when identifying such images.

Summary of the invention

In view of this, the present application provides an image recognition method, apparatus and device, and a readable medium.

According to a first aspect of the embodiments of the present application, an image recognition method is provided, including the steps of:

Obtain an image to be identified;

Obtaining a feature image of the image to be identified, the feature image being used to describe features of the image to be identified;

Selecting at least two feature sets describing the target object from the obtained feature images;

The target object is identified based on the selected feature set.

According to a second aspect of the embodiments of the present application, an electronic device is provided, including:

processor;

a memory that stores processor executable instructions;

The processor is coupled to the memory for reading program instructions stored in the memory, and in response, performing the following operations:

Obtain an image to be identified;

The target object is identified based on the selected feature set.

According to a third aspect of the embodiments of the present application, an image recognition apparatus is provided, including:

An image acquisition module, configured to acquire an image to be identified;

a feature extraction module, configured to obtain a feature image of the image to be identified, wherein the feature image is used to describe a feature of the image to be identified;

a feature selection module, configured to select at least two feature sets describing the target object from the obtained feature images;

A target recognition module is configured to identify the target object based on the selected feature set.

According to a fourth aspect of embodiments of the present application, there is provided one or more machine readable medium having stored thereon instructions that, when executed by one or more processors, cause the terminal device to perform the method described above.

When implementing the embodiment provided by the present application, when selecting a feature capable of describing the target object from the feature image, selecting a plurality of feature sets from different regions in the feature image can effectively represent similar target objects in different positions in the image, thereby enabling Accurately identify the target object.

DRAWINGS

1 is a flowchart of an image recognition method according to an exemplary embodiment of the present application;

2a is a block diagram of a system for image recognition shown in an exemplary embodiment of the present application;

FIG. 2b is an interaction diagram of an image recognition method according to another exemplary embodiment of the present application; FIG.

2c is a schematic diagram of a pooling operation and an implementation process of adjusting pixels in an image recognition method according to an exemplary embodiment of the present application;

2d is a schematic diagram of a target recognition process in an image recognition method illustrated by an exemplary embodiment of the present application;

FIG. 3 is a logic block diagram of an image recognition apparatus according to an exemplary embodiment of the present application; FIG.

FIG. 4 is a hardware configuration diagram of an image recognition apparatus according to an exemplary embodiment of the present application.

Detailed ways

Exemplary embodiments will be described in detail herein, examples of which are illustrated in the accompanying drawings. The following description refers to the same or similar elements in the different figures unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Instead, they are merely examples of devices and methods consistent with aspects of the present application as detailed in the appended claims.

The terminology used in the present application is for the purpose of describing particular embodiments, and is not intended to be limiting. The singular forms "a", "the" and "the" It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.

It should be understood that although the terms first, second, third, etc. may be used to describe various information in this application, such information should not be limited to these terms. These terms are only used to distinguish the same type of information from each other. For example, the first information may also be referred to as the second information without departing from the scope of the present application. Similarly, the second information may also be referred to as the first information. Depending on the context, the word "if" as used herein may be interpreted as "when" or "when" or "in response to determination."

Please refer to FIG. 1. FIG. 1 is a flowchart of an image recognition method according to an exemplary embodiment of the present application. The embodiment can be applied to various electronic devices having image processing functions, and may include the following steps S101-S104:

Step S101: Acquire an image to be identified.

Step S102: Obtain a feature image of the image to be identified, and the feature image is used to describe a feature of the image to be identified.

Step S103: Select at least two feature sets describing the target object from the obtained feature images.

Step S104: Identify the target object based on the selected feature set.

In the embodiment of the present application, the acquired image may be an image directly collected by an image acquisition module (such as a camera), or may be image data after image preprocessing. The image preprocessing mentioned herein may include improving image recognition accuracy. Useful image processing, for example, color space transformation on scene text images, position correction of word images in text word images, denoising processing on character images, and the like.

For the obtained image, a feature extraction algorithm such as a convolutional neural network model, a classifier, and a multi-level network structure may be used to extract features to generate a feature image, and each region of the feature image contains various extracted features.

In some examples, when the target object is identified, the specific location of the target object needs to be located. In order to accurately locate the target object, a full convolutional neural network capable of effectively retaining the position information of the target object may be used to characterize the acquired image. Extraction, the full convolutional neural network may include a full convolutional layer of AlexNet, GoogleNet, VGGNet, ResNet, or other convolutional neural network models.

After obtaining the feature image, considering that there may be target objects of different sizes and shapes in the image, when extracting the feature set describing the target object, the feature describing the target object may be selected from the obtained feature image based on the region, and the feature image is selected from the feature image. The selected features in the same area constitute a feature set describing the target object. When the area size is different, the feature set contains different feature quantities, and the selected feature set has different sizes in the feature image.

In general, the convolution operation result of the region containing the target object is greater than a predetermined threshold, which can be set by the designer when verifying the distribution of the convolution operation result of the positive and negative samples on the set when training the classifier and the feature extraction model. The value is generally greater than or equal to 0 and less than 1, such as 0.3, 0.5, etc. In some examples, the feature set whose convolution operation result is greater than a predetermined threshold may be used as a feature set describing the target object by the following sliding window technique:

A plurality of candidate feature sets are selected from the obtained feature images by using sliding windows of various sizes, and the size of the sliding window may include 8*16, 8*8, 16*16, 16*32, 16*8, 32* 64, 32*32, 32*16, and 64*32.

A convolution operation is performed on the selected candidate feature set.

A candidate feature set having a convolution operation result greater than a predetermined threshold is selected as a feature set describing the target object.

In practical applications, the volume feature image may be a convolution feature block obtained by extracting the acquired image by the full convolutional neural network, and when the feature set describing the target object is selected by using a sliding window, the convolution feature block may be searched for. There may be areas of the target object. For each position on the convolution feature block, the sizes are 8*16, 8*8, 16*16, 16*32, 16*8, 32*64, 32*32, 32*16, and 64*32, respectively. The rectangular frame (sliding window) marks each rectangular area, wherein the rectangular frame can mark each rectangular area with a region identification such as a 4-dimensional vector [c _x , c _y , w ₀ , h ₀ ], (c _x , c _y ) Indicates the coordinates of the center point of the rectangle. w ₀ and h ₀ represent the width and height of the rectangle, respectively, corresponding to the size of the rectangle.

Then extracting the features in each of the marked rectangular regions, and forming a feature in a rectangular region into a candidate feature set, where the rectangular region is the region of the feature set in the feature image (convolution feature block), and then Each of the candidate feature sets is subjected to a convolution operation, and the rectangular feature region to which the candidate feature set whose convolution operation result is greater than the predetermined threshold is determined is the region where the target object may exist, and the candidate feature set whose convolution operation result is greater than the predetermined threshold is selected as the description. The feature set of the target object, these finally selected feature sets may also be divided into different kinds of feature sets based on the size of the rectangular area to which they belong. In other examples, the designer of the present application can also set the side length of the sliding window and the ratio between different side lengths to other values according to the specific application scenario of the image recognition, which is not limited in this application.

In some examples, the height and width of the convolutional feature block are large, if for each position on the convolutional feature block, the size is 8*16, 8*8, 16*16, 16*32, respectively. 16*8, 32*64, 32*32, 32*16, and 64*32 rectangular frames (sliding windows) mark each rectangular area, and then extract the features in each rectangular area, which will constitute a large number of feature sets. The calculation of the convolution operation is large, and the number of feature sets describing the target object may be more, which may increase the calculation amount of the image recognition process and reduce the image recognition efficiency. In order to solve these problems, the description target is considered. The result of the convolution operation of the feature set of the object is larger than the convolution operation result of the feature set of the other non-target object, and the feature set whose convolution operation result exceeds the predetermined threshold may be determined as the candidate feature set, and the determined candidate feature set is collected. The result of the product operation is that the feature set of the top N is selected as the feature set describing the target object, and N is greater than 1 and smaller than the total number of the determined candidate feature sets. When the number of candidate feature sets is large, N may be 300.

In practical applications, each feature set composed of features in each rectangular region may be input into a predetermined feature set screening model, and the feature set screening model performs convolution operation on each feature set, and the convolution operation result exceeds a predetermined threshold. The feature set is determined as a candidate feature set, and the determined candidate feature set, the convolution operation result is a top N feature set selected as a feature set describing the target object, where N is greater than 1 and less than the total number of the determined candidate feature sets. The designer of the present application can determine the specific value of N, such as 300, according to the application scenario and the computing power of the electronic device running the image recognition method of the present application. The feature set screening model mentioned here may be a deep neural network model, a network model of a multi-level structure, or a probability model based on image color, edge, and super-pixel features.

In some scenarios, the obtained image may include a variety of background information in addition to the target object. These background information may cause a certain degree of trouble to the target recognition. In order to reduce the negative influence of the background information on the recognition process, Contextual features may be added during the target recognition process, the added contextual features including local context features and/or global context features.

In practical applications, after selecting a feature set describing the target object from the obtained feature image, the context feature of the feature set is selected from the feature image, and then the target object is identified according to the selected feature set and the context feature. . After adding the context feature, the target recognition process can process more features related to the target object. On the one hand, it is easy to identify the target object with a smaller volume, and on the other hand, the target object can be excluded. For example, the ship and the sea are always together. If the ship and the tree are detected together, the target object recognition error is indicated.

In some examples, when the context feature of the feature set is selected from the feature image (the context feature of the feature set may refer to a context feature corresponding to different features of the target object in the feature set), the target object may be selected and described. After the feature set, for each feature set selected, the center point of the region to which the feature set belongs may be the reference point, and the side length of the region to which the feature set belongs is increased by 0.5 times to form the region to which the local context feature belongs, and then The feature of the region is extracted as a local context feature of the feature set. After extracting the local context feature, the edge length of the region to which the local context feature of the feature set belongs is 1.5 times the side length of the region to which the feature set belongs, and may include more features related to the target object, so as to facilitate the comparison of the volume. Small target object.

After extracting the feature set and the context feature describing the target object, the target object may be identified based on the features, such as extracting the extracted feature set of the target object and the context feature into the trained classifier for target classification, but The operation faces a huge computational challenge, and the classifier is prone to overfitting when computing over a large number of features. In order to solve this problem, the present application may perform a pooling operation on the feature set and the context feature of the feature set, and then identify the target object according to the feature set and the context feature obtained by the pooling operation. The pooling mentioned here is used to reduce the dimension of the feature set and the probability of over-fitting. Generally, the statistics of the features of different locations are aggregated. For example, when the pooling operation is performed, a certain area of the feature set can be used. The average (or maximum) of each feature replaces each feature.

In other examples, in order to further improve the target recognition efficiency, when the feature set and the context feature of the feature set are separately pooled, the specified features extracted by each feature extraction channel may be pooled separately; The specified features extracted by the feature extraction channel have different coordinates in the feature image. For example, if the feature image is a convolutional feature block extracted by the full convolution network, the region to which each feature set belongs within the convolutional feature block can be divided into three parts, and when the pooling operation is performed, the entire convolutional network is separately The feature (designated feature) in the first partial region extracted by the first feature extraction channel is subjected to a pooling operation, and the feature (designated feature) in the second partial region extracted by the second feature extraction channel of the full convolution network is separately pooled In operation, the feature (designated feature) in the third partial region extracted by the third feature extraction channel of the full convolution network is separately pooled. After this operation, on the one hand, the number of layers of the deep neural network required for identifying the target is small, and on the other hand, the relative positional relationship of the target object can be recognized, so that the target object can be accurately located.

In addition, in order to improve the target recognition efficiency, the feature set and the context feature obtained by the pooling operation may also be adjusted to matched pixels; and then the target object is identified according to the adjusted feature set and the context feature. The matched pixels are generally smaller than the pixels of each feature set, and the designer of the present application can determine matching pixels according to the application scenario and the computing power of the electronic device running the image recognition method of the present application. In some scenarios, the image is considered. There may be target objects of different sizes and shapes, and the matched pixels may include at least two of 3*12, 12*3, 5*10, 10*5, and 7*7.

In addition, in the embodiment of the present application, when the target object is identified based on the selected feature set, the selected feature set is adjusted to the matched pixel, and then the target object is identified according to the adjusted feature set.

If the matched pixels have multiple values, the feature set context feature of a matched pixel may be used as a feature describing a branch of the target object, and the number of features of the branch is H ₀ ×W ₀ ×((3×h _i × w _i )×(C+1)), where H ₀ , W ₀ , C ₀ respectively represent the height, width and channel number of the feature image (eg, convolutional feature block), h _i ×w _i ∈{3×12 , 12 × 3, 5 × 10, 10 × 5, 7 × 7}, C represents the number of categories of the target object, +1 is the background is also counted as a target category, each position on h _i × w _i is A vector of 3 × (C + 1) dimensions, which includes three (C + 1)-dimensional vectors.

After the extracted feature set and the context feature are sequentially subjected to the pooling operation and the pixel adjustment, the pixel-adjusted feature may be input into the target recognition model, and the target recognition model corresponds to each feature set of each branch in the process of identifying the target. And the context feature of the feature set, a class vector and a position offset vector of the region to which the feature set belongs are generated. The target recognition model mentioned here may be a classification model such as a classifier.

The length of the category vector may be (C+1), and each dimension vector element may represent a probability p _j , j ∈ {0, . . . , C} of the target object belonging to a certain category, where 0 represents a background class. The target recognition model then determines a final target class vector and target position offset vector based on predetermined vector screening criteria.

The position offset vector may be a 4-dimensional vector, and each dimension element of the vector represents a position offset vector [Δ _x , Δ _y , Δ _w , Δ _h ] of the region to which the feature set belongs, and the position offset vector corresponds to 4 dimensions. The vector [c _x , c _y , w ₀ , h ₀ ], where Δx, Δy, Δw, and Δh are the offsets that cx, cy, w0, and h0 need to adjust, respectively. The corresponding vector after the target object is adjusted is [c _x + w ₀ · Δ _x , c _y + h ₀ · Δ _y , w ₀ · Δ _w , h ₀ · Δ _h ].

If the predetermined vector filtering criterion is to select the largest vector element, the category vector with the largest vector element may be selected from the category vector corresponding to each feature set of each branch as the final identified target category vector, and then the target category vector is corresponding. The position offset vector of the feature set is the finally identified position offset vector. In some examples, when the category vector with the largest vector element is selected as the final recognized target class vector, it can be selected according to the following formula:

Where score represents a category vector, the elements on each dimension of the vector represent the probability that the target object belongs to the corresponding category, C represents the number of categories, and A represents the sub-index (the number of types of predetermined pixels). In one example, there are 2 types of target objects, one is a dog and the other is a cat, then C=2, the first dimension of the category vector indicates the possibility that the target object belongs to the cat category, and the second dimension represents The target object belongs to the category of the dog. The category vector can be represented by score=[0.3,0.9]. The maximum value of 0.3 and 0.9 can be obtained by the first max in the brackets of the above formula. After the first seek max, each branch has a maximum value, assuming score2, score3, ..., through the second time of the above formula in the above formula to find max, can be from different branches of score2, score3, ... The maximum value is selected as the maximum maximum value, and the category vector to which the selected final maximum value belongs is determined as the target category vector capable of determining the category to which the target object belongs. In other examples, the target category vector may also be determined based on the mean, minimum, and median of all dimensions of the category vector.

After determining the target class vector, the target recognition model may output the target class vector and the target position offset vector as the recognition result, where the target class vector is the category vector to which the largest vector element of the class vector corresponding to each feature set belongs. The target position offset vector is a position offset vector of a feature set corresponding to the target category vector.

In some examples, the target class vector is c _i , i ∈ {1,..., C+1}, and the vector element is not the probability value of the target object belonging to the corresponding category. Before the target class vector is output, the request may be obtained. Softmax converts the target category vector into a probability form target category vector. The formula for Softmax is as follows:

Pi is the probability form target category vector.

After the target recognition model outputs the recognition result, the category to which the target object belongs can be obtained, and then the initial position [c _x , c _y , w ₀ , h ₀ ] of the region to which the feature set corresponding to the target category vector is associated and the target in the recognition result are combined. The position offset vector can obtain the position of the target object in the image: [c _x + w ₀ · Δ _x , c _y + h ₀ · Δ _y , w ₀ · Δ _w , h ₀ · Δ _h ].

In summary, the image recognition of the embodiment of the present application can detect the category and location of all target objects in an image. If the image to be recognized is an image obtained by the storage room of the smart refrigerator, the target object is a smart refrigerator. The food stored in the storage room, based on the image recognition results, can further carry out statistics on relevant information in the field of smart refrigerators, such as: counting the number of ingredients in the same category, the number of ingredients in all categories, etc., and then based on the statistical results can be Accurate and effective intelligent management of food, such as: can change the operating mode of the refrigerator, always keep the food in the best storage state, allowing users to know the quantity of food in the refrigerator, preservation and quality information anytime, anywhere through mobile phones or computers, and can remind users Regular food supplements, etc.

In addition, when the image recognition of the embodiment of the present application is applied to an unmanned automobile, the road condition in front of the automobile can be accurately recognized, and corresponding driving operations are performed based on the road condition, such as: obstacles are bypassed when driving unmanned.

Please refer to FIG. 2a. FIG. 2a is a block diagram of a system 200 for implementing image recognition according to an exemplary embodiment of the present application. The system 200 is applicable to various electronic devices having image processing functions, and may include sequentially connected cameras. 210. The full convolutional neural network 220, the feature set generation module 230, the feature set screening model 240, the pooling operation module 260, the pixel adjustment module 270, and the target recognition model 280, further include a full convolutional neural network 220 and a feature set, respectively. The generation module 230, the feature set screening model 240, and the context acquisition module 250 connected to the pooling operation module 260.

The camera 210 directly captures an image corresponding to the scene. In other examples, the image capturing device may be used instead of the camera 210 to collect images of the corresponding scene from the corresponding regions.

The full convolutional neural network 220 performs feature extraction on the image acquired by the image acquisition module 210 to generate a convolutional feature block (feature image).

The feature set generating module 230 is configured to extract features from regions of the convolution feature block where the target object may exist to form a feature set.

The feature set screening model 240 is configured to filter out feature sets capable of better describing the target object from the extracted feature sets.

The context obtaining module 250 is configured to extract, according to the selected region of each feature set, the context feature of each selected feature set from the convolution feature block.

The pooling operation module 260 is configured to respectively perform a feature set describing the target object and perform a pooling operation to reduce the feature quantity, improve the calculation amount of the target recognition process, and thereby improve the accuracy of the image recognition.

The pixel adjustment module 270 is configured to adjust the feature set and the context feature after the pooling operation to the matched pixels, respectively.

The target recognition model 280 is configured to identify the category of the target object based on the pixel-adjusted feature, and in some examples, may further be used to locate the location of the target object within the image.

An application example is listed below in conjunction with Figures 2a to 2d.

In this example, the designer of the present application applies image recognition to the smart refrigerator in advance, and the sizes are 8*16, 8*8, 16*16, 16*32, 16*8, 32*64, 32*32, respectively. , 32*16 and 64*32 rectangular boxes (sliding windows), the matching pixels are set to 5*3, 12*3, 5*10, 10*5 and 7*7, and the predetermined vector is filtered. The criterion is to choose the largest vector element.

The camera 210 installed in the smart refrigerator takes a picture in the smart refrigerator to generate an image to be recognized (step S201), and transmits the image to the full convolutional neural network 220 (step S202), and the full convolutional neural network 220 characterizes the image. Extracting a generated convolution feature block (S203), and transmitting the convolution feature block to the feature set generation module 230 and the context acquisition module 250 (S204, S205), and the feature set generation module 230 adopts a rectangular frame of various sizes from the convolution feature block. Extracting features, obtaining each feature set, storing an area identifier of a region to which each feature set belongs in the convolution feature block (S206), and transmitting the obtained feature set to the feature set screening model 240 (S207), the feature screening model 240 The feature set capable of describing the target object is selected by calculating the convolution of each feature set (S208), and the selected feature set is sent to the context obtaining module 250 and the pooling operation module 260 (S209, S210), and the context obtaining module 250 The feature set generation module 230 requests an area identifier that can describe an area to which the feature set of the target object belongs (S211), and the feature set generation module 230 sends the corresponding area identifier in response to the request. The context acquisition module 250 is obtained (S212), and then the context acquisition module 250 determines an area identifier of the area to which the local context feature of the feature set of the target object belongs in the convolution feature block based on the received area identifiers (S213), and determines When the area identifier of the area to which the local context feature belongs is described, the center point of the area to which the feature set of the target object belongs may be described as centering, and the side length thereof is expanded by 0.5 times.

In step S214, the context acquisition module 250 extracts local context features from corresponding regions of the convolutional feature block based on the determined region identification. In other examples, the convolution feature block can also be determined as a global top and bottom feature.

In step S215, the context acquisition module 250 sends the extracted context features to the pooling operation module 260.

In step S216, the pooling operation module 260 performs a pooling operation on the received feature set and context features, respectively.

In step S217, the pooling operation module 260 delivers the feature set and context features after the pooling operation to the pixel adjustment module 270.

In step S218, the pixel adjustment module 270 respectively adjusts the received feature set context features to the matched pixels corresponding to the various matched pixels.

In some examples, the process of pooling operations and adjusting pixels can be referred to FIG. 2c. The product of w and h in FIG. 2c represents the specific value of the matched pixel, and FIG. 2c only shows a feature set describing the target object. The pooling operation and the adjustment pixel process, the feature set consisting of a first set of features 510, a second set of features 520, and a third set of features 530, the first set of features 510 being the first feature extraction channel of the full convolutional neural network 220 The extracted and outputted features, the second set of features 520 are features extracted and output by the second feature extraction channel of the full convolutional neural network 220, and the third set of features 530 are extracted for the third feature extraction channel of the full convolutional neural network 220 and The features of the output are divided into three parts according to the region before the pooling operation of the feature set, respectively, in the three regions defined by two broken lines in the figure, the top layer is In the first part, the middle part of the two broken lines is the second part, and the middle part is the third part.

When the feature set is pooled, the first part of the first set of features 510 is pooled separately, and the second part of the second set of features 520 is pooled separately. The third portion of the group feature 530 performs a pooling operation. Then, the features generated by the individual pooling operation are respectively adjusted to the matched pixels, and the pixel-adjusted feature set is generated, which is composed of the fourth group feature 540, the fifth group feature 550, and the sixth group feature 560 shown in FIG. 2c. A set of features 510 are pooled and adjusted to become a fourth set of features 540, the second set of features 520 are pooled and adjusted to become a fifth set of features 550, and the third set of features 530 are pooled After the pixel is adjusted and becomes the sixth group feature 540, the pooling operation and the pixel adjustment process of other feature sets and context features are similar to those shown in FIG. 2c, and details are not described herein again.

In step S219, the pixel adjustment module 270 delivers the feature set and context features after the adjustment of the pixels to the target recognition model 280.

In step S220, the target recognition model 280 identifies the target object based on the input feature set and the context feature, and outputs the target class vector and the target position offset vector of the target object.

For a specific target recognition process, reference may be made to FIG. 2d, the

category vectors

611, 612, and 613 in FIG. 2d, and the position offset

vectors

614, 615, and 616, respectively, for the target recognition model 280 corresponding to three branches (this example only shows three Each feature set of the branch, and the contextual features of the feature set, the resulting class vector and position offset vector. The length of the

category vectors

611, 612, and 613 may be (C+1), and each dimension vector element may represent a probability p _j , j ∈ {0,..., C}, 0 of the target object belonging to a certain category. The background class is represented; the position offset

vector

614, 615, 616 may be a 4-dimensional vector, and each dimension element of this vector represents a position offset vector [Δ _x , Δ _y , Δ _w , Δ _h ] of the region to which the feature set belongs.

The target recognition model 280 filters the

category vectors

611, 612, 613 and other category vectors not shown based on predetermined vector screening criteria to determine a final target category vector 621 and a target position offset vector 622.

It can be seen from the above embodiment that, when the image recognition method of the present application selects a feature capable of describing the target object from the feature image, multiple feature sets are selected from different regions in the feature image, which can effectively represent similar targets in different positions in the image. The object, therefore, can more accurately identify the target object.

Furthermore, the image recognition method of the present application may select a plurality of feature sets from a plurality of regions having different sizes in the feature image when extracting the features describing the target object for the difference in shape and size of each target object, using different pixels. The feature set respectively describes the target object with different sizes and shapes, and can also identify the target object by combining the local context feature and the global context feature describing the feature set of the target object, so that the target object can be more accurately obtained, and / or locate the target object. Moreover, by performing a pooling operation and pixel adjustment on the feature set describing the target object, the local context feature of the feature set, and the global context feature, the calculation amount of the image recognition process can be further reduced, and the recognition efficiency is improved.

In addition, when the image recognition method of the embodiment of the present application is applied to each scenario, it is likely to face large-scale data similar to Internet data, and the real-time requirement of the application is high. To satisfy these, C/C++ or C/C++ can be used. The assembly language implements the program instructions corresponding to the image recognition method of the present application.

Corresponding to the embodiment of the aforementioned image recognition method, the present application also provides an embodiment of the image recognition apparatus.

Referring to FIG. 3, FIG. 3 is a logic block diagram of an image recognition apparatus according to an exemplary embodiment of the present application. The apparatus may include an image acquisition module 310, a feature extraction module 320, a feature selection module 330, and a target recognition module 340.

The image obtaining module 310 is configured to acquire an image to be identified.

The feature extraction module 320 is configured to obtain a feature image of the image to be identified, and the feature image is used to describe a feature of the image to be identified.

The feature selection module 330 is configured to select at least two feature sets describing the target object from the obtained feature images.

The target identification module 340 is configured to identify the target object based on the selected feature set.

In some examples, the selected feature set has different sizes of regions to which the feature image belongs.

As an example, the size of the region to which the feature set belongs in the feature image may include:

8*16, 8*8, 16*16, 16*32, 16*8, 32*64, 32*32, 32*16, 64*32.

In some examples, the image recognition apparatus of the present application may further include:

a context selection module, configured to select a context feature of the feature set from the feature image.

The target recognition module 340 is further configured to identify the target object according to the selected feature set and the context feature.

As an example, the contextual features include local context features and/or global context features.

As an example, the side length of the region to which the local context feature of the feature set belongs is 1.5 times the side length of the region to which the feature set belongs.

The pooling operation module is configured to perform a pooling operation on the selected feature set and the context feature of the feature set respectively.

The target recognition module 340 is further configured to identify the target object according to the feature set and the context feature obtained by the pooling operation.

And a pixel adjustment module, configured to adjust the feature set and the context feature obtained by the pooling operation to matched pixels.

The target recognition module 340 is further configured to identify the target object according to the adjusted feature set and the context feature.

As an example, the pooling operation module is further configured to perform a pooling operation on the specified features extracted by each feature extraction channel when performing the pooling operation on the selected feature set and the context feature of the feature set respectively; The specified features extracted by the different feature extraction channels have different coordinates in the feature image.

In some examples, the target recognition module 340 can also be used to:

Adjusting the selected feature set to matching pixels;

The target object is identified based on the adjusted feature set.

As an example, matching pixels include at least two of the following:

3*12, 12*3, 5*10, 10*5, 7*7.

The implementation process of the functions and functions of each unit (or module) in the above device is specifically described in the implementation process of the corresponding steps in the foregoing method, and details are not described herein again.

For the device embodiment, since it basically corresponds to the method embodiment, reference may be made to the partial description of the method embodiment. The device embodiments described above are merely illustrative, wherein the units or modules described as separate components may or may not be physically separate, and the components displayed as units or modules may or may not be physical units. Or modules, which can be located in one place, or distributed to multiple network units or modules. Some or all of the modules may be selected according to actual needs to achieve the objectives of the present application. Those of ordinary skill in the art can understand and implement without any creative effort.

Embodiments of the image recognition apparatus of the present application can be applied to an electronic device. This can be implemented by a computer chip or an entity, or by a product having a certain function. In a typical implementation, the electronic device is a computer, and the specific form of the computer may be a personal computer, a laptop computer, a cellular phone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email transceiver device. , game consoles, tablets, wearables, Internet TVs, smart locomotives, driverless cars, smart refrigerators, other smart home devices, or a combination of any of these devices.

The device embodiment may be implemented by software, or may be implemented by hardware or a combination of hardware and software. Taking the software implementation as an example, as a logical device, a processor of the electronic device in which it is located reads a corresponding computer program instruction in a readable medium such as a non-volatile memory into a memory. From a hardware level, as shown in FIG. 4, a hardware structure diagram of an electronic device in which the image recognition apparatus of the present application is located, except for the processor, the memory, the network interface, and the nonvolatile memory shown in FIG. The electronic device in which the device is located in the embodiment may also include other hardware according to the actual function of the electronic device, and details are not described herein again. The storage processor of the electronic device may be a memory of executable instructions; the processor may be coupled to the memory for reading the program instructions stored by the memory and, in response, performing the operations of: acquiring an image to be identified; obtaining a to-be-identified a feature image of the image, the feature image is used to describe a feature of the image to be identified; at least two feature sets describing the target object are selected from the obtained feature image; and the target object is identified based on the selected feature set.

In addition, the embodiment of the present application further provides a computer storage medium, where the storage medium stores program instructions, where the program instructions include:

Obtain an image to be identified;

The target object is identified based on the selected feature set.

Embodiments of the present application may take the form of a computer program product embodied on one or more storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) in which program code is embodied. Computer-usable storage media include both permanent and non-persistent, removable and non-removable media, and information storage can be implemented by any method or technology. The information can be computer readable instructions, data structures, modules of programs, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory. (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disk read only memory (CD-ROM), digital versatile disk (DVD) or other optical storage, Magnetic tape cartridges, magnetic tape storage or other magnetic storage devices or any other non-transportable media can be used to store information that can be accessed by a computing device.

In other embodiments, the operations performed by the processor may be referred to the related description in the foregoing method embodiments, and details are not described herein.

The above is only the preferred embodiment of the present application, and is not intended to limit the present application. Any modifications, equivalent substitutions, improvements, etc., which are made within the spirit and principles of the present application, should be included in the present application. Within the scope of protection.

Claims

An image recognition method, comprising the steps of:

Obtain an image to be identified;

Obtaining a feature image of the image to be identified, the feature image being used to describe features of the image to be identified;

Selecting at least two feature sets describing the target object from the obtained feature images;

The target object is identified based on the selected feature set.
The method according to claim 1, wherein after selecting a feature set describing the target object from the obtained feature images, the method further comprises:

Selecting a context feature of the feature set from the feature image;

The identifying the target object based on the selected feature set includes:

The target object is identified according to the selected feature set and context features.
The method of claim 2 wherein the contextual features comprise local context features and/or global context features.
The method according to claim 3, wherein the side length of the region to which the local context feature of the feature set belongs is 1.5 times the side length of the region to which the feature set belongs.
The method according to claim 2, wherein the identifying the target object according to the selected feature set and the context feature comprises:

Performing a pooling operation on the selected feature set and the context feature of the feature set respectively;

The target object is identified according to the feature set and the context feature obtained by the pooling operation.
The method according to claim 5, wherein when the selected feature set and the context feature of the feature set are respectively pooled, the specified features extracted by each feature extraction channel are respectively pooled. The specified features extracted by the different feature extraction channels have different coordinates in the feature image.
The method according to claim 5, wherein after the pooling operation is performed on the selected feature set and the context feature of the feature set, the method further includes:

Adjusting the feature set and context features obtained by the pooling operation to matched pixels;

The identifying the target object according to the feature set and the context feature obtained by the pooling operation, including:

The target object is identified based on the adjusted feature set and context characteristics.
The method according to claim 1, wherein the identifying the target object based on the selected feature set comprises:

Adjusting the selected feature set to matching pixels;

The target object is identified based on the adjusted feature set.
The method according to claim 7 or 8, wherein the matched pixels comprise at least two of the following:

3*12, 12*3, 5*10, 10*5, 7*7.
The method according to claim 1, wherein the selected feature set has different sizes of regions to which the feature image belongs.
The method according to claim 10, wherein the size of the region to which the feature set belongs in the feature image comprises:

8*16, 8*8, 16*16, 16*32, 16*8, 32*64, 32*32, 32*16, 64*32.
The method of claim 1 wherein the feature image is obtained by any of the following algorithms:

Convolutional neural network model, classifier, multi-level network structure.
The method of claim 1 wherein the result of the convolution operation describing the feature set of the target object is greater than a predetermined threshold.
The method according to claim 1, wherein the convolution operation result describing the feature set of the target object is larger than the convolution operation result of the feature set of the other non-target object.
The method according to claim 1, wherein the recognition result comprises a target class vector of the target object and a target position offset vector, wherein the target class vector is a category to which the largest vector element of the class vectors corresponding to each feature set belongs to a vector, the target position offset vector being a position offset vector of a feature set corresponding to the target category vector.
The method according to claim 1, wherein the image to be recognized is an image obtained by photographing a storage compartment of a smart refrigerator, and the target object is an foodstuff stored in a storage compartment of the smart refrigerator.
An electronic device, comprising:

processor;

a memory that stores processor executable instructions;

The processor is coupled to the memory for reading program instructions stored in the memory, and in response, performing the following operations:

Obtain an image to be identified;

Obtaining a feature image of the image to be identified;

Selecting at least two feature sets describing the target object from the obtained feature images, where the selected feature set has different sizes in the region to which the feature image belongs;

The target object is identified based on the selected feature set.
The electronic device of claim 17, wherein the processor is further configured to perform the following operations:

Selecting a context feature of the feature set from the feature image;

The target object is identified according to the selected feature set and context features.
The electronic device of claim 18, wherein the contextual features comprise local context features and/or global context features.
The electronic device according to claim 19, wherein the side length of the region to which the local context feature of the feature set belongs is 1.5 times the side length of the region to which the feature set belongs.
The electronic device of claim 18, wherein the processor is further configured to perform the following operations:

Performing a pooling operation on the selected feature set and the context feature of the feature set respectively;

The target object is identified according to the feature set and the context feature obtained by the pooling operation.
The electronic device of claim 21, wherein the processor is further configured to perform the following operations:

When the selected feature set and the context feature of the feature set are respectively pooled, the specified features extracted by the feature extraction channels are respectively pooled; the specified features extracted by the different feature extraction channels are in the The coordinates in the feature image are different.
The electronic device of claim 21, wherein the processor is further configured to perform the following operations:

Adjusting the feature set and context features obtained by the pooling operation to matched pixels;

The target object is identified based on the adjusted feature set and context characteristics.
The electronic device of claim 17, wherein the processor is further configured to perform the following operations:

Adjusting the selected feature set to matching pixels;

The target object is identified based on the adjusted feature set.
The electronic device according to claim 23 or 24, wherein the matched pixels comprise at least two of the following:

3*12, 12*3, 5*10, 10*5, 7*7.
The electronic device according to claim 17, wherein the selected feature set has a different size of a region to which the feature image belongs.
The electronic device according to claim 26, wherein the size of the region to which the feature set belongs in the feature image comprises:

8*16, 8*8, 16*16, 16*32, 16*8, 32*64, 32*32, 32*16, 64*32.
The electronic device of claim 17, wherein the feature image is obtained by any of the following algorithms:

Convolutional neural network model, classifier, multi-level network structure.
The electronic device according to claim 17, wherein the convolution operation result describing the feature set of the target object is greater than a predetermined threshold.
The electronic device according to claim 17, wherein the convolution operation result describing the feature set of the target object is larger than the convolution operation result of the feature set of the other non-target object.
The electronic device according to claim 17, wherein the recognition result comprises a target class vector of the target object and a target position offset vector, wherein the target class vector is the largest vector element of the class vectors corresponding to each feature set. a class vector, the target position offset vector being a position offset vector of a feature set corresponding to the target class vector.
The electronic device according to claim 17, wherein the image to be recognized is an image obtained by photographing a storage compartment of the smart refrigerator, and the target object is an foodstuff stored in a storage compartment of the smart refrigerator.
An image recognition device, comprising:

An image acquisition module, configured to acquire an image to be identified;

a feature extraction module, configured to obtain a feature image of the image to be identified, wherein the feature image is used to describe a feature of the image to be identified;

a feature selection module, configured to select at least two feature sets describing the target object from the obtained feature images;

A target recognition module is configured to identify the target object based on the selected feature set.
The apparatus according to claim 33, wherein the selected feature set has a different size of a region to which the feature image belongs.
The device according to claim 34, wherein the size of the region to which the feature set belongs in the feature image comprises:

8*16, 8*8, 16*16, 16*32, 16*8, 32*64, 32*32, 32*16, 64*32.
The device of claim 33, wherein the device further comprises:

a context selection module, configured to select a context feature of the feature set from the feature image;

The target recognition module is further configured to identify the target object according to the selected feature set and the context feature.
The apparatus of claim 36 wherein the contextual features comprise local context features and/or global context features.
The device of claim 36, wherein the device further comprises:

a pooling operation module, configured to perform a pooling operation on the selected feature set and the context feature of the feature set respectively;

The target recognition module is further configured to identify the target object according to the feature set and the context feature obtained by the pooling operation.
The device according to claim 38, wherein the pooling operation module is further configured to extract a channel for each feature when performing a pooling operation on the selected feature set and the context feature of the feature set respectively The extracted specified features are respectively subjected to a pooling operation; the specified features extracted by the different feature extraction channels have different coordinates in the feature image.
The device according to claim 38, wherein the target recognition module is further configured to:

Adjusting the feature set and context features obtained by the pooling operation to matched pixels;

The target object is identified based on the adjusted feature set and context characteristics.
The device according to claim 33, wherein the target recognition module is further configured to:

Adjusting the selected feature set to matching pixels;

The target object is identified based on the adjusted feature set.
The apparatus according to claim 40 or 41, wherein the matched pixels comprise at least two of the following:

3*12, 12*3, 5*10, 10*5, 7*7.
One or more machine-readable mediums having stored thereon instructions that, when executed by one or more processors, cause the terminal device to perform the method of any of claims 1-16.