CN115471715A - Image recognition method - Google Patents
- Publication number: CN115471715A
- Application number: CN202110584522.0A
- Authority
- CN
- China
- Prior art keywords
- tensor
- depth
- sub
- image
- values
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention relates to an image recognition method, which comprises the following steps: inputting an image to a detection model to obtain a heat map tensor, a reference depth tensor, a weight tensor and a sub-target tensor; obtaining K position index values from the heat map tensor; obtaining a fusion tensor based on the weight tensor and the sub-target tensor; obtaining a predicted depth tensor based on the fusion tensor and the reference depth tensor; taking out K vectors from the predicted depth tensor by referring to the K position index values; and performing a projection matrix transformation on the K vectors to obtain K coordinate vectors in real space. The method can complete two tasks simultaneously, namely detecting the target object and detecting the sub-targets included in the target object, in a single inference pass, without building a separate model for each task, and can therefore reduce latency on consumer terminals.
Description
Technical Field
The present invention relates to an object tracking algorithm, and more particularly, to an image recognition method.
Background
Research and applications related to gestures and hand poses provide a way to communicate with computer systems. With the development of computer vision technologies such as Augmented Reality (AR), Virtual Reality (VR), and large-screen display systems, hand applications are gradually evolving from conventional hand gesture recognition toward hand pose estimation and tracking. Compared with simple gesture recognition, if the state of the whole hand, such as the position of each knuckle (joint), can be known, operating with both hands becomes more natural and smoother, and the range of applications broadens.
In general, a conventional hand pose tracking system needs at least two stages of model processing, namely a hand detection model and a finger-node (knuckle) detection model. The hand detection model detects the hand position in each image; the finger-node detection model then calculates the actual position of the finger nodes of each hand in two-dimensional or three-dimensional space, and the result is passed to the system for subsequent recognition or operation.
However, as the demands on computer vision technology grow, both real-time performance and high frame rate (FPS) analysis and recognition are required. The conventional two-stage hand pose tracking system can therefore introduce high latency and degrade the quality of experience (QoE), and the process also involves complicated pre-processing and post-processing, which makes it difficult to deploy on consumer terminals such as mobile phones or VR/AR glasses.
The background section is provided only to aid in understanding the present disclosure, and therefore it may include material that does not constitute prior art known to a person skilled in the art. The statements in the background section do not represent an admission regarding the content of the invention or the scope of the claims.
Disclosure of Invention
The present invention provides an image recognition method that can find the positions of the sub-targets included in a target object in an image in a single stage.
The image recognition method of the embodiment of the invention comprises the following steps: inputting an image to a detection model to obtain a heat map tensor, a reference depth tensor, a weight tensor and a sub-target tensor; obtaining K position index values from the heat map tensor; obtaining a fusion tensor based on the weight tensor and the sub-target tensor; obtaining a predicted depth tensor based on the fusion tensor and the reference depth tensor; taking out K vectors from the predicted depth tensor by referring to the K position index values; and performing a projection matrix transformation on the K vectors to obtain K coordinate vectors in real space. The heat map tensor includes a plurality of probability values for predicting the appearance of a target object in a plurality of blocks corresponding to a plurality of position index values of the image, and the target object includes a plurality of sub-targets. The reference depth tensor comprises a first depth value corresponding to each block, the first depth value predicting the distance between the image capturing device that captured the image and each block. The weight tensor comprises a plurality of weight values used to optimize the sub-targets. The sub-target tensor comprises a plurality of coordinate positions for predicting the sub-targets in the image and second depth values of the sub-targets. The fusion tensor comprises a plurality of fused depth values obtained based on the weight values and the second depth values. The predicted depth tensor comprises a plurality of predicted depth values obtained based on the fused depth values and the first depth values.
The method of the above embodiment can simultaneously complete two tasks, namely the detection target object and the sub-targets included in the detection target object, through one-time reasoning, without establishing a model based on individual tasks.
Drawings
Fig. 1 is a block diagram of an electronic device according to an embodiment of the invention.
Fig. 2 is a flowchart illustrating an image recognition method according to an embodiment of the invention.
FIG. 3 is a block diagram of an image recognition model according to an embodiment of the invention.
FIG. 4 is a schematic diagram of a finger node of a hand in accordance with one embodiment of the present invention.
Fig. 5A and 5B are schematic diagrams illustrating detection results according to an embodiment of the invention.
Description of the reference numerals
100 electronic device
110 processor
120 memory
300 image
310 detection model
320 heat map tensor
330 reference depth tensor
340 weight tensor
350 sub-target tensor
360 position index List
370 fusion tensor
380 predicted depth tensor
390 target List
J01-J21 finger nodes
S205-S230 steps of the image recognition method
Detailed Description
The foregoing and other features and advantages of the invention will be apparent from the following, more particular description of preferred embodiments of the invention, as illustrated in the accompanying drawings. Directional terms as referred to in the following examples, for example: up, down, left, right, front or rear, etc., are simply directions with reference to the drawings. Accordingly, the directional terminology used is intended to be illustrative and is not intended to be limiting.
The invention provides an image recognition method, which can be implemented by an electronic device. To make the content of the present invention clearer, the following specific embodiments are given as examples according to which the invention can actually be implemented.
Fig. 1 is a block diagram of an electronic device according to an embodiment of the invention. Referring to fig. 1, an electronic device 100 includes a processor 110 and a memory 120. The processor 110 is coupled to a memory 120.
The processor 110 may be hardware (e.g., a chipset or processor), a software component (e.g., an operating system or application), or a combination of hardware and software components with computational processing capability. The processor 110 is, for example, a central processing unit (CPU), a graphics processing unit (GPU), or another programmable microprocessor, digital signal processor (DSP), programmable controller, application-specific integrated circuit (ASIC), programmable logic device (PLD), or the like.
The memory 120 is, for example, any type of fixed or removable random access memory, read-only memory, flash memory, secure digital card, hard disk, or other similar device or combination of devices. The memory 120 stores a plurality of program code segments, which are installed and executed by the processor 110 to implement the image recognition method.
Fig. 2 is a flowchart illustrating an image recognition method according to an embodiment of the invention. FIG. 3 is a block diagram of an image recognition model according to an embodiment of the invention. The image recognition model of the present embodiment is a one-stage neural network (NN) model. The input to the image recognition model is a two-dimensional image 300 of any type, and the output is a target list 390 that includes a plurality of sub-target combinations ranked by probability value.
Referring to fig. 2 and 3, in step S205, an image 300 is input to the detection model 310 to obtain a heat map (Heat-Map) tensor 320, a reference depth tensor 330, a weight tensor 340 and a sub-target tensor 350. Here, the tensor dimension of the image 300 is, for example, [H, L, C], where H is the height of the image, L is the width (length) of the image, and C is the number of channels of the image. For example, if the input source is a color image (RGB-based image), C = 3; if the input source is a depth-based image, C = 1.
The heat map tensor 320 includes probability values for predicting the appearance of the target object in the blocks corresponding to the position index values of the image 300. The target object also includes a plurality of sub-targets. The reference depth tensor 330 includes a first depth value (serving as a reference depth) corresponding to each block of the image 300. The first depth value predicts the distance between the image capturing device that captured the image 300 and each block. The weight tensor 340 includes a plurality of weight values used to optimize the plurality of sub-targets. The sub-target tensor 350 includes a coordinate position for predicting each sub-target in the image 300 and a second depth value corresponding to each sub-target.
The detection model 310 is a convolutional neural network (CNN)-based feature extractor. The architecture of the detection model 310 is similar to the YOLO version 4 (YOLOv4) algorithm. The detection model 310 is a model architecture with a single input and multiple outputs, and the tensors of the multiple outputs are downscaled by an integer factor S. For example, if the resolution of the image 300 is H × L, the resolution of the obtained heat map tensor 320, reference depth tensor 330, weight tensor 340, and sub-target tensor 350 is H/S × L/S.
If the device source of the input image 300 is a color camera, the detection model 310 is trained using a color image data set. If the device source of the input image 300 is a depth imaging device, the detection model 310 is trained using a depth image data set. Each data set includes the three-dimensional positions of a plurality of target objects and the projection matrix of the image capturing device.
Here, the detected target object is a hand, and the sub-targets are finger nodes of the hand. FIG. 4 is a schematic diagram of finger nodes defining a hand, in accordance with one embodiment of the present invention. The finger nodes of the hand may be defined as 21 finger nodes J01-J21 as shown in fig. 4. The image recognition model of the present embodiment can detect K hands and their respective 21 corresponding finger nodes in the image 300.
In this case, the heat map tensor 320 includes probability values for predicting the appearance of a hand; the reference depth tensor 330 includes the distance (the first depth value) between the image capturing device that captured the image 300 and each hand; the weight tensor 340 includes weight values for optimizing the finger nodes; and the sub-target tensor 350 includes the predicted coordinate position of each finger node in the image 300 and the second depth value corresponding to each finger node. The second depth value corresponding to each finger node is the distance of that finger node from the wrist.
The tensor dimension of the heat map tensor 320 is [H/S, L/S, 2], where the 1st and 2nd dimensions represent the position index value (i, j) of a block, with i ∈ {1, 2, …, H/S} and j ∈ {1, 2, …, L/S}, and the 3rd dimension "2" represents the probability values that each position index value (i, j) corresponds to the occurrence of two targets (i.e., "left hand" and "right hand"). That is, the image 300 input to the detection model 310 is divided into H/S × L/S blocks of equal size, and two probability values are estimated for each block: the probability that a left hand appears and the probability that a right hand appears. Thus, the heat map tensor 320 includes H/S × L/S × 2 block data. Each probability value is between 0 and 1.
The tensor dimension of the reference depth tensor 330 is [H/S, L/S, 1], where the 1st and 2nd dimensions represent the position index value (i, j) of a block, and the 3rd dimension "1" represents that each block represented by a position index value (i, j) corresponds to 1 first depth value. The reference depth tensor 330 includes H/S × L/S × 1 first depth values.
The tensor dimension of the weight tensor 340 is [H/S, L/S, N], where the 1st and 2nd dimensions represent the position index value (i, j) of a block, and the 3rd dimension "N" represents the optimization weights corresponding to the N finger nodes included in the block represented by each position index value (i, j). The weight tensor 340 includes H/S × L/S × N weight values.
The tensor dimension of the sub-target tensor 350 is [H/S, L/S, N, 3], where the 1st and 2nd dimensions represent the position index value (i, j) of a block, the 3rd dimension "N" represents that each block represented by a position index value (i, j) corresponds to N finger nodes, and the 4th dimension "3" represents the predicted coordinate position of each finger node in x, y, and z. The sub-target tensor 350 includes H/S × L/S × N sets of coordinate positions (x, y, z), where x and y represent the position of a finger node in the image and z represents the depth value (i.e., the second depth value) of the finger node.
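To make the four output shapes concrete, the following sketch builds placeholder tensors with the dimensions described above. The variable names and example sizes (H = L = 256, S = 8, N = 21) are illustrative assumptions, not specifics from the patent:

```python
import numpy as np

# Illustrative sizes (assumptions): input H x L = 256 x 256, stride S = 8,
# N = 21 finger nodes, matching the dimensions described in the text.
H, L, S, N = 256, 256, 8, 21
h, w = H // S, L // S  # the image is divided into an h x w grid of blocks

heat_map   = np.zeros((h, w, 2))       # left/right-hand probability per block
ref_depth  = np.zeros((h, w, 1))       # first depth value (camera-to-block)
weights    = np.zeros((h, w, N))       # optimization weight per finger node
sub_target = np.zeros((h, w, N, 3))    # (x, y, z) per node; z = second depth value

print(heat_map.shape, ref_depth.shape, weights.shape, sub_target.shape)
```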
Next, in step S210, K position index values are obtained from the heat map tensor 320. For example, from the H/S × L/S × 2 block data included in the heat map tensor 320, the K position index values corresponding to the K block data with the highest probability values are extracted and recorded in the position index list 360, where K is the number of target objects (e.g., hands). For example, the position index list 360 records the position index values (gx_1, gy_1), (gx_2, gy_2), …, (gx_K, gy_K).
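The top-K extraction in step S210 can be sketched as follows. The patent does not specify how the two per-block probabilities (left/right hand) are reduced to a single score, so taking the channel-wise maximum here is an assumption for illustration:

```python
import numpy as np

def top_k_positions(heat_map, k):
    """Return the (i, j) indices of the k blocks with the highest probability.

    heat_map has shape (H/S, L/S, 2); the two channels are reduced with max(),
    which is an assumption — the patent only says the highest-probability
    block data are taken out.
    """
    scores = heat_map.max(axis=2)               # best-of-two-channels per block
    flat = np.argsort(scores, axis=None)[::-1]  # flat indices, descending score
    return [tuple(int(x) for x in np.unravel_index(idx, scores.shape))
            for idx in flat[:k]]

heat_map = np.zeros((4, 4, 2))
heat_map[1, 2, 0] = 0.9   # strong left-hand response at block (1, 2)
heat_map[3, 0, 1] = 0.7   # weaker right-hand response at block (3, 0)
print(top_k_positions(heat_map, 2))  # -> [(1, 2), (3, 0)]
```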
In step S215, a fusion tensor 370 is obtained based on the weight tensor 340 and the sub-target tensor 350. Here, the fusion tensor 370 is obtained by convolving the weight tensor 340 with the sub-target tensor 350 using the following formula. The fusion tensor 370 includes a plurality of fused depth values obtained based on the weight values and the second depth values.
Here, ks is the kernel size, W is the weight tensor 340, V is the sub-target tensor 350, a ∈ {1, 2, …, H/S}, b ∈ {1, 2, …, L/S}, c ∈ {1, 2, …, N}, N is the number of sub-targets (i.e., the number of finger nodes), and d ∈ {1, 2, 3} (representing the three axes x, y, z). O(a, b, c, d) is the fusion tensor 370, whose tensor dimension is [H/S, L/S, N, 3]. The 4th dimension "3" represents the coordinate position of each finger node on the x, y, and z axes, and the depth value corresponding to z is the fused depth value after convolution.
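The formula itself did not survive extraction. Based only on the surrounding definitions (a per-node convolution of the weight tensor W over the sub-target tensor V with kernel size ks, producing O(a, b, c, d)), one plausible reconstruction, offered purely as an assumption and not as the patent's actual formula, is a depthwise convolution in which each finger-node channel c is weighted independently:

```latex
O(a, b, c, d) = \sum_{i=-\lfloor ks/2 \rfloor}^{\lfloor ks/2 \rfloor}
                \sum_{j=-\lfloor ks/2 \rfloor}^{\lfloor ks/2 \rfloor}
                W(a+i,\, b+j,\, c) \cdot V(a+i,\, b+j,\, c,\, d)
```

Under this reading, the weights in W aggregate each node's (x, y, z) prediction over a ks × ks neighborhood of blocks, which matches the stated role of the weight tensor as optimizing the sub-targets.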
Thereafter, in step S220, a predicted depth tensor 380 is obtained based on the fusion tensor 370 and the reference depth tensor 330. The predicted depth tensor 380 includes a plurality of predicted depth values obtained based on the fused depth values and the first depth values. Specifically, the predicted depth tensor 380 is obtained by adding the fused depth value corresponding to each position index value in the fusion tensor 370 (i.e., the z value in the 4th dimension of the fusion tensor 370) to the first depth value corresponding to each position index value in the reference depth tensor 330 (i.e., the value in the 3rd dimension of the reference depth tensor 330). This is because the predicted depth value from the image capturing device to a knuckle is the sum of the distance between the image capturing device and the hand (the first depth value) and the distance from that knuckle to the wrist (the fused depth value).
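The depth addition of step S220 reduces to a broadcast sum over the z channel. The sizes below are illustrative assumptions; the broadcasting pattern (a per-block reference depth added to every node's z value) follows the description above:

```python
import numpy as np

# Camera-to-node depth = camera-to-hand depth (reference) + node-to-wrist
# depth (fused z value). Shapes are illustrative.
h, w, N = 2, 2, 3
fusion    = np.ones((h, w, N, 3))     # (x, y, z) per node; z = fused depth = 1.0
ref_depth = np.full((h, w, 1), 5.0)   # first depth value per block = 5.0

predicted = fusion.copy()
predicted[..., 2] += ref_depth        # (h, w, 1) broadcasts over the N nodes

assert predicted[0, 0, 0, 2] == 6.0   # 1.0 (fused) + 5.0 (reference)
assert predicted[0, 0, 0, 0] == 1.0   # x and y components are unchanged
```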
Then, in step S225, K vectors are taken out from the predicted depth tensor 380 by referring to the K position index values. Based on the position index values recorded in the position index list 360 obtained from the heat map tensor 320, the corresponding K vectors are taken out of the predicted depth tensor 380 to obtain the target list 390. The positions of the N finger nodes are recorded in each vector. For example, the target list 390 includes the vectors (J_1_1, J_1_2, …, J_1_N), (J_2_1, J_2_2, …, J_2_N), …, (J_K_1, J_K_2, …, J_K_N).
For the 1st position index value (gx_1, gy_1) of the position index list 360, the corresponding vector is (J_1_1, J_1_2, …, J_1_N), where "J_1_1", "J_1_2", …, "J_1_N" respectively represent the positions of the N finger nodes at the position index value (gx_1, gy_1). For the 2nd position index value (gx_2, gy_2), the corresponding vector is (J_2_1, J_2_2, …, J_2_N), where "J_2_1", "J_2_2", …, "J_2_N" respectively represent the positions of the N finger nodes at the position index value (gx_2, gy_2). Likewise, for the Kth position index value (gx_K, gy_K), the corresponding vector is (J_K_1, J_K_2, …, J_K_N), where "J_K_1", "J_K_2", …, "J_K_N" respectively represent the positions of the N finger nodes at the position index value (gx_K, gy_K).
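The gathering of step S225 is a simple indexed lookup. The sketch below uses assumed names (`position_index_list`, `target_list`) corresponding to elements 360 and 390:

```python
import numpy as np

# Gather one (N, 3) node vector per detected hand from the predicted depth
# tensor, using the block indices found in the heat map step.
h, w, N = 4, 4, 21
predicted = np.arange(h * w * N * 3, dtype=float).reshape(h, w, N, 3)
position_index_list = [(1, 2), (3, 0)]   # K = 2 detected hands (illustrative)

target_list = [predicted[i, j] for (i, j) in position_index_list]

assert len(target_list) == 2             # one entry per hand
assert target_list[0].shape == (21, 3)   # N finger nodes, each with (x, y, z)
```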
Fig. 5A and 5B are schematic diagrams illustrating detection results according to an embodiment of the invention. Fig. 5A shows the detection result for one hand. Fig. 5B shows the detection result for two hands. By the above method, the finger nodes of one or more hands can be detected in an image.
Finally, in step S230, a transformation of a projection matrix (Projection Matrix) is performed on the K vectors to obtain K coordinate vectors in real space. Through the above steps, the hand poses appearing in the input image 300 can be tracked.
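The patent only states that a projection matrix transformation maps the vectors into real space. One common way such a step is realized, shown here purely as a hedged sketch, is pinhole back-projection with camera intrinsics; the intrinsic values (`fx`, `fy`, `cx`, `cy`) and the back-projection form are assumptions, not details from the patent:

```python
import numpy as np

# Assumed pinhole intrinsics for illustration only.
fx, fy, cx, cy = 500.0, 500.0, 160.0, 120.0

def back_project(nodes):
    """nodes: (N, 3) array of (u, v, z) -> (N, 3) array of (X, Y, Z)."""
    u, v, z = nodes[:, 0], nodes[:, 1], nodes[:, 2]
    X = (u - cx) * z / fx    # lateral offset scales with depth
    Y = (v - cy) * z / fy
    return np.stack([X, Y, z], axis=1)

nodes = np.array([[160.0, 120.0, 2.0],    # node at the principal point
                  [260.0, 120.0, 2.0]])   # node 100 px to the right of it
real = back_project(nodes)
assert np.allclose(real[0], [0.0, 0.0, 2.0])
assert np.allclose(real[1], [0.4, 0.0, 2.0])  # (260 - 160) * 2 / 500 = 0.4
```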
In summary, the embodiments of the present invention can complete two tasks simultaneously, namely detecting the target object and detecting the sub-targets included in the target object, in a single inference pass, without building a separate model for each task. Therefore, by applying the embodiment of the invention to multi-hand pose tracking, an image of any type can be input and a plurality of finger-joint combinations on the image, ranked by probability value, can be output.
In addition, as long as it is known whether the input source is a color image or a depth image, a data set of the same type can be selected according to the input source to retrain the model; the framework used by the embodiment of the invention can still complete hand detection and finger joint regression in one pass without changing the CNN model architecture.
The intermediate process of the embodiment of the invention does not need sub-images cropped by the bounding box of object detection, which avoids the degraded accuracy of finger joint estimation caused by poorly cropped sub-images. When K hands appear in an image, a conventional multi-hand pose tracking system needs to perform model inference K + 1 times, whereas the embodiment of the present invention can obtain the positions of the K hands and their finger joints simultaneously after 1 inference. Therefore, the embodiment of the invention can reduce latency on consumer terminals and improve the quality of user experience.
While the invention has been described with reference to preferred embodiments, it will be understood by those skilled in the art that various changes in form and detail may be made without departing from the spirit and scope of the invention as defined by the appended claims. Furthermore, it is not necessary for any embodiment or claim of the invention to achieve all of the objects, advantages, or features disclosed herein. In addition, the abstract and the title are provided to assist patent document searching and are not intended to limit the scope of the invention. Moreover, the terms "first," "second," and the like used herein or in the appended claims are used merely to name elements or to distinguish different embodiments or ranges, and are not intended to limit the upper or lower bound on the number of elements.
Claims (6)
1. An image recognition method, comprising:
inputting an image to a detection model to obtain a heat map tensor, a reference depth tensor, a weight tensor and a sub-target tensor, wherein the heat map tensor comprises a plurality of probability values for predicting occurrence of a target object in a plurality of blocks corresponding to a plurality of position index values of the image, the target object comprises a plurality of sub-targets, the reference depth tensor comprises a first depth value corresponding to each block in the plurality of blocks, the first depth value is a distance between an image capturing device for capturing the image and each block, the weight tensor comprises a plurality of weight values for optimizing the plurality of sub-targets, and the sub-target tensor comprises a plurality of coordinate positions for predicting the plurality of sub-targets in the image and a plurality of second depth values of the plurality of sub-targets;
obtaining K position index values from the heat map tensor;
obtaining a fusion tensor based on the weight tensor and the sub-target tensor, wherein the fusion tensor comprises a plurality of fusion depth values obtained based on the plurality of weight values and the plurality of second depth values;
obtaining a predicted depth tensor based on the fusion tensor and the reference depth tensor, wherein the predicted depth tensor comprises a plurality of predicted depth values obtained based on the plurality of fusion depth values and the plurality of first depth values;
taking out K vectors from the predicted depth tensor by referring to the K position index values; and
performing a transformation of a projection matrix on the K vectors to obtain K coordinate vectors in a real space.
2. The image recognition method of claim 1, wherein the heat map tensor comprises a plurality of block data corresponding to the plurality of blocks, each block data of the plurality of block data comprises a corresponding position index value and two probability values, the two probability values respectively representing the probability that the corresponding block includes a left hand and the probability that the corresponding block includes a right hand,
wherein obtaining K position index values from the heat map tensor comprises:
according to the two probability values, the K position index values corresponding to the K block data are taken out from the block data with the highest probability value in the plurality of block data.
3. The image recognition method of claim 1, wherein the resolution of the image is H × L, and the heat map tensor, the reference depth tensor, the weight tensor and the sub-target tensor, each with resolution reduced by a factor of S, are obtained after the image is input to the detection model,
based on the weight tensor and the sub-target tensor, obtaining the fusion tensor comprises:
convolving the weight tensor and the sub-target tensor with the following formula:
wherein ks is a kernel size, W is the weight tensor, V is the sub-target tensor, a ∈ {1, 2, …, H/S}, b ∈ {1, 2, …, L/S}, c ∈ {1, 2, …, N}, N is the number of sub-targets, and d ∈ {1, 2, 3}.
4. The image recognition method of claim 1, wherein the step of obtaining the predicted depth tensor based on the fusion tensor and the reference depth tensor comprises:
adding the plurality of fused depth values corresponding to each position index value in the fused tensor to the first depth value corresponding to each position index value in the reference depth tensor to obtain the plurality of predicted depth values corresponding to each position index value.
5. The image recognition method of claim 1, wherein the detection model is a convolutional neural network-based feature extractor.
6. The image recognition method of claim 1, wherein the target object is a hand and the plurality of sub-targets are finger nodes.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110584522.0A CN115471715A (en) | 2021-05-27 | 2021-05-27 | Image recognition method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115471715A (en) | 2022-12-13 |
Family
ID=84365241
Legal Events
Date | Code | Title | Description
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |