CN115471715A - Image recognition method - Google Patents

Image recognition method

Info

Publication number
CN115471715A
CN115471715A
Authority
CN
China
Prior art keywords
tensor
depth
sub
image
values
Prior art date
Legal status
Pending
Application number
CN202110584522.0A
Other languages
Chinese (zh)
Inventor
康学弘
刘一帆
陈奎廷
Current Assignee
Coretronic Corp
Original Assignee
Coretronic Corp
Priority date
Filing date
Publication date
Application filed by Coretronic Corp
Priority to CN202110584522.0A
Publication of CN115471715A

Landscapes

  • Image Analysis (AREA)

Abstract

The invention relates to an image recognition method, which comprises the following steps: inputting an image to a detection model to obtain a heat map tensor, a reference depth tensor, a weight tensor, and a sub-target tensor; obtaining K position index values from the heat map tensor; obtaining a fusion tensor based on the weight tensor and the sub-target tensor; obtaining a predicted depth tensor based on the fusion tensor and the reference depth tensor; taking K vectors out of the predicted depth tensor by referring to the K position index values; and performing a projection matrix transformation on the K vectors to obtain K coordinate vectors in real space. The method can complete two tasks at once through a single inference, namely detecting the target object and detecting the sub-targets included in the target object, without building a separate model for each task, so the latency on a consumer terminal can be reduced.

Description

Image recognition method
Technical Field
The present invention relates to an object tracking algorithm, and more particularly, to an image recognition method.
Background
Research and applications related to gestures and hand poses provide a way to communicate with computer systems. With the development of computer vision technologies such as Augmented Reality (AR), Virtual Reality (VR), and large-screen display systems, hand-related applications are gradually developing from conventional hand gesture recognition to hand pose estimation and tracking. Compared with simple gesture recognition, if the state of the whole hand, such as the position of each knuckle (joint), can be known, two-handed operation becomes more natural and smoother, which further broadens the range of applications.
In general, a conventional hand pose tracking system needs at least two stages of model processing, namely a hand detection model and a finger node detection model. The hand detection model detects the hand positions in each image, then the finger node detection model calculates the actual position of each finger node of each hand in two-dimensional or three-dimensional space, and the result is passed to the system for subsequent recognition or operation.
However, as the demands on computer vision technology grow, both real-time performance and high frame rate (FPS) analysis and recognition are required. The conventional two-stage hand pose tracking system therefore introduces high latency and degrades the quality of experience (QoE), and the process also involves complicated pre-processing or post-processing, making it difficult to apply to consumer terminals such as mobile phones or VR/AR glasses.
The background section is only provided to aid in understanding the present disclosure, and therefore the content disclosed in the background section may include some prior art that does not constitute part of the common general knowledge of a person skilled in the art. The statements in the background section do not represent an admission that such content was already known to a person skilled in the art before the filing of the present application.
Disclosure of Invention
The present invention provides an image recognition method, which can find the positions of the sub-targets included in a target object in an image in a single stage.
The image recognition method of the embodiment of the invention comprises the following steps: inputting the image to a detection model to obtain a heat map tensor, a reference depth tensor, a weight tensor, and a sub-target tensor; obtaining K position index values from the heat map tensor; obtaining a fusion tensor based on the weight tensor and the sub-target tensor; obtaining a predicted depth tensor based on the fusion tensor and the reference depth tensor; taking K vectors out of the predicted depth tensor by referring to the K position index values; and performing a projection matrix transformation on the K vectors to obtain K coordinate vectors in real space. The heat map tensor includes a plurality of probability values for predicting the appearance of the target object in a plurality of blocks corresponding to a plurality of position index values of the image, and the target object includes a plurality of sub-targets. The reference depth tensor includes a first depth value corresponding to each block, and the first depth value is used to predict the distance between the image capturing device that captures the image and each block. The weight tensor includes a plurality of weight values used to optimize the sub-targets. The sub-target tensor includes a plurality of coordinate positions for predicting the sub-targets in the image and second depth values of the sub-targets. The fusion tensor includes a plurality of fused depth values obtained based on the weight values and the second depth values. The predicted depth tensor includes a plurality of predicted depth values obtained based on the fused depth values and the first depth values.
The method of the above embodiment can complete two tasks at once through a single inference, namely detecting the target object and detecting the sub-targets included in the target object, without building a separate model for each task.
Drawings
Fig. 1 is a block diagram of an electronic device according to an embodiment of the invention.
Fig. 2 is a flowchart illustrating an image recognition method according to an embodiment of the invention.
FIG. 3 is a block diagram of an image recognition model according to an embodiment of the invention.
FIG. 4 is a schematic diagram of finger nodes of a hand in accordance with one embodiment of the present invention.
Fig. 5A and 5B are schematic diagrams illustrating detection results according to an embodiment of the invention.
Description of the reference numerals
100 electronic device
110 processor
120 memory
300 image
310 detection model
320 heat map tensor
330 reference depth tensor
340 weight tensor
350 sub-target tensor
360 position index List
370 fusion tensor
380 predicted depth tensor
390 target List
J01-J21 finger nodes
S205-S230 steps of the image recognition method
Detailed Description
The foregoing and other features and advantages of the invention will be apparent from the following, more particular description of preferred embodiments of the invention, as illustrated in the accompanying drawings. Directional terms as referred to in the following examples, for example: up, down, left, right, front or rear, etc., are simply directions with reference to the drawings. Accordingly, the directional terminology used is intended to be illustrative and is not intended to be limiting.
The invention provides an image recognition method, which can be realized by an electronic device. In order to make the content of the present invention clearer, the following embodiments are given as examples according to which the present invention can actually be implemented.
Fig. 1 is a block diagram of an electronic device according to an embodiment of the invention. Referring to fig. 1, an electronic device 100 includes a processor 110 and a memory 120. The processor 110 is coupled to a memory 120.
The processor 110 may be hardware (e.g., a chipset, a processor, etc.), a software component (e.g., an operating system, an application, etc.), or a combination of hardware and software components with computational processing capabilities. The processor 110 is, for example, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), or another programmable microprocessor, Digital Signal Processor (DSP), programmable controller, Application-Specific Integrated Circuit (ASIC), Programmable Logic Device (PLD), or the like.
The memory 120 is, for example, any type of fixed or removable random access memory, read-only memory, flash memory, secure digital card, hard disk, or other similar device or combination of devices. The memory 120 stores a plurality of program code segments, and the program code segments are installed and executed by the processor 110 to implement the image recognition method.
Fig. 2 is a flowchart illustrating an image recognition method according to an embodiment of the invention. FIG. 3 is a block diagram of an image recognition model according to an embodiment of the invention. The image recognition model of the present embodiment is a one-stage neural network (NN) model. The input to the image recognition model is a two-dimensional image 300 of an arbitrary type, and the output target list 390 includes a plurality of sub-target combinations ranked according to probability values.
Referring to fig. 2 and 3, in step S205, an image 300 is input to the detection model 310 to obtain a heat map (Heat-Map) tensor 320, a reference depth tensor 330, a weight tensor 340, and a sub-target tensor 350. Here, the tensor dimension of the image 300 is, for example, [H, L, C], where H is the height of the image, L is the width (length) of the image, and C is the number of channels of the image. For example, if the input source is a color (RGB-based) image, C = 3. If the input source is a depth-based image, C = 1.
The heat map tensor 320 includes the probability values for predicting the appearance of the target object in the blocks corresponding to the position index values of the image 300. The target object further includes a plurality of sub-targets. The reference depth tensor 330 includes a first depth value (serving as a reference depth) corresponding to each block of the image 300. The first depth value is the predicted distance between the image capturing device that captures the image 300 and each block. The weight tensor 340 includes a plurality of weight values used to optimize the plurality of sub-targets. The sub-target tensor 350 includes a coordinate position for predicting each sub-target in the image 300 and a second depth value corresponding to each sub-target.
The detection model 310 is a Convolutional Neural Network (CNN) based feature extractor. The architecture of the detection model 310 is similar to the YOLO fourth edition (YOLOv4) algorithm. The detection model 310 is a model architecture with a single input and multiple outputs, and the resolutions of the multiple output tensors are reduced by an integer factor S. For example, if the resolution of the image 300 is H × L, the resolution of the obtained heat map tensor 320, reference depth tensor 330, weight tensor 340, and sub-target tensor 350 is H/S × L/S.
If the device source of the input (image 300) is a color imager (color camera), the detection model 310 is trained using the color image data set. If the device source of the input (image 300) is a depth imaging device, the detection model 310 is trained using the depth image data set. Each data set includes three-dimensional positions of a plurality of objects and a Projection Matrix (Projection Matrix) of the image capturing device.
Here, the detected target object is a hand, and the sub-targets are the finger nodes of the hand. FIG. 4 is a schematic diagram of the finger nodes defined for a hand, in accordance with one embodiment of the present invention. The finger nodes of the hand may be defined as 21 finger nodes J01 to J21 as shown in fig. 4. The image recognition model of the present embodiment can detect K hands in the image 300 and the 21 finger nodes corresponding to each hand.
The heat map tensor 320 includes the probability values for predicting the appearance of a hand, the reference depth tensor 330 includes the distance (the first depth value) between the image capturing device that captures the image 300 and the hand, the weight tensor 340 includes the weight values for optimizing the finger nodes, and the sub-target tensor 350 includes the coordinate position for predicting each finger node in the image 300 and the second depth value corresponding to each finger node. The second depth value corresponding to each finger node refers to the distance of that finger node from the wrist.
The tensor dimension of the heat map tensor 320 is [H/S, L/S, 2], where the 1st and 2nd dimensions represent the position index value (i, j) of the block, i = {1, 2, ..., H/S}, j = {1, 2, ..., L/S}, and the 3rd dimension "2" represents that each position index value (i, j) corresponds to the probability values of two targets appearing (i.e., "left hand" and "right hand"). That is, the image 300 input to the detection model 310 is divided into H/S × L/S blocks of equal size, and two probability values are estimated for each block, namely the probability that a left hand appears and the probability that a right hand appears. Thus, the heat map tensor 320 includes H/S × L/S × 2 pieces of block data. Each probability value is between 0 and 1.
The tensor dimension of the reference depth tensor 330 is [H/S, L/S, 1], where the 1st and 2nd dimensions represent the position index value (i, j) of the block, and the 3rd dimension "1" represents that each block represented by the position index value (i, j) corresponds to 1 first depth value. The reference depth tensor 330 includes H/S × L/S × 1 first depth values.
The tensor dimension of the weight tensor 340 is [H/S, L/S, N], where the 1st and 2nd dimensions represent the position index value (i, j) of the block, and the 3rd dimension "N" represents the optimization weights corresponding to the N finger nodes included in the block represented by each position index value (i, j). The weight tensor 340 includes H/S × L/S × N weight values.
The tensor dimension of the sub-target tensor 350 is [H/S, L/S, N, 3], where the 1st and 2nd dimensions represent the position index value (i, j) of the block, the 3rd dimension "N" represents that each block represented by the position index value (i, j) corresponds to N finger nodes, and the 4th dimension "3" represents the predicted coordinate position of each finger node along x, y, and z. The sub-target tensor 350 includes H/S × L/S × N sets of coordinate positions (x, y, z), where x and y represent the position of a finger node in the image and z represents the depth value (i.e., the second depth value) of that finger node.
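To make the shapes of the input and the four output tensors concrete, the following NumPy sketch lays them out side by side. It is only an illustration: the numeric values of H, L, and S, the zero-filled placeholder arrays, and the variable names are assumptions for illustration, not values prescribed by the embodiment.

```python
import numpy as np

# Illustrative shapes only; H, L, S and the zero-filled placeholders are assumed.
H, L, C = 256, 256, 3        # image 300: height, width, channels (C = 3 for RGB, 1 for depth)
S, N = 8, 21                 # assumed downscale factor S and N = 21 finger nodes per hand

image      = np.zeros((H, L, C), dtype=np.float32)                # input image 300
heat_map   = np.zeros((H // S, L // S, 2), dtype=np.float32)      # tensor 320: left/right-hand probability per block
ref_depth  = np.zeros((H // S, L // S, 1), dtype=np.float32)      # tensor 330: first depth value per block
weight     = np.zeros((H // S, L // S, N), dtype=np.float32)      # tensor 340: optimization weight per finger node
sub_target = np.zeros((H // S, L // S, N, 3), dtype=np.float32)   # tensor 350: (x, y, z) per finger node
```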
Next, in step S210, K position index values are obtained from the heat map tensor 320. For example, among the H/S × L/S × 2 pieces of block data included in the heat map tensor 320, the K position index values corresponding to the K pieces of block data with the highest probability values are extracted and recorded in the position index list 360, where K is the number of target objects (e.g., hands). For example, the position index list 360 records the position index values (gx_1, gy_1), (gx_2, gy_2), ..., (gx_K, gy_K).
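As a minimal sketch of step S210, the K position index values can be read out of the heat map tensor as below. Scoring each block by the larger of its two probabilities and selecting with an argsort are assumptions; the embodiment only states that the K pieces of block data with the highest probability values are taken.

```python
import numpy as np

def top_k_position_indices(heat_map, k):
    # heat_map: (H/S, L/S, 2) probabilities for "left hand" and "right hand" per block.
    best_prob = heat_map.max(axis=-1)                        # assumed score: larger of the two probabilities
    flat = np.argsort(best_prob, axis=None)[::-1][:k]        # flat indices of the K highest scores
    gx, gy = np.unravel_index(flat, best_prob.shape)
    return list(zip(gx.tolist(), gy.tolist()))               # position index list 360: [(gx_1, gy_1), ..., (gx_K, gy_K)]
```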
In step S215, a fusion tensor 370 is obtained based on the weight tensor 340 and the sub-goal tensor 350. Here, the fusion tensor 370 is obtained by convolving the weight tensor 340 and the sub-target tensor 350 by the following formula. The fused tensor 370 includes a plurality of fused depth values obtained based on the weight values and the second depth value.
O(a, b, c, d) = Σ_{i=−⌊ks/2⌋}^{⌊ks/2⌋} Σ_{j=−⌊ks/2⌋}^{⌊ks/2⌋} W(a + i, b + j, c) · V(a + i, b + j, c, d)
where ks is the kernel size, W is the weight tensor 340, V is the sub-target tensor 350, a = {1, 2, ..., H/S}, b = {1, 2, ..., L/S}, c = {1, 2, ..., N}, N is the number of sub-targets (i.e., the number of finger nodes), and d = {1, 2, 3} (representing the three axes x, y, z). O(a, b, c, d) is the fusion tensor 370. The fusion tensor 370 has tensor dimensions [H/S, L/S, N, 3]. The 4th dimension "3" represents the coordinate position of each finger node along the x, y, z axes, and the depth value corresponding to z is the fused depth value after convolution.
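A minimal NumPy sketch of this fusion step, under the assumption of a zero-padded ks × ks window centered on each block (the formula's exact indexing is not fully specified in the text), is:

```python
import numpy as np

def fuse(weight, sub_target, ks=3):
    # weight: (H/S, L/S, N) weight tensor 340; sub_target: (H/S, L/S, N, 3) sub-target tensor 350.
    # Assumption: zero padding and a centered ks x ks window.
    A, B, N = weight.shape
    pad = ks // 2
    W = np.pad(weight, ((pad, pad), (pad, pad), (0, 0)))
    V = np.pad(sub_target, ((pad, pad), (pad, pad), (0, 0), (0, 0)))
    fused = np.zeros_like(sub_target)
    for a in range(A):
        for b in range(B):
            w = W[a:a + ks, b:b + ks, :]                     # (ks, ks, N) window of weights
            v = V[a:a + ks, b:b + ks, :, :]                  # (ks, ks, N, 3) window of node coordinates
            fused[a, b] = np.einsum('ijc,ijcd->cd', w, v)    # weighted sum over the window
    return fused                                             # fusion tensor 370: (H/S, L/S, N, 3)
```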
Thereafter, in step S220, a predicted depth tensor 380 is obtained based on the fusion tensor 370 and the reference depth tensor 330. The predicted depth tensor 380 includes a plurality of predicted depth values obtained based on the fused depth values and the first depth values. Specifically, the predicted depth tensor 380 is obtained by adding the fused depth value corresponding to each position index value in the fusion tensor 370 (i.e., the z value in the 4th dimension of the fusion tensor 370) to the first depth value corresponding to each position index value in the reference depth tensor 330 (i.e., the value in the 3rd dimension of the reference depth tensor 330). This is because the predicted depth value from the image capturing device to a finger node is the sum of the distance between the image capturing device and the hand (the first depth value) and the distance from that finger node to the wrist (the fused depth value).
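Continuing the sketch above, step S220 then reduces to adding each block's first depth value to the fused z values of its N finger nodes; the broadcasting layout below is an assumption about array shapes, not part of the embodiment.

```python
def predicted_depth_tensor(fused, ref_depth):
    # fused: (H/S, L/S, N, 3) fusion tensor 370; ref_depth: (H/S, L/S, 1) reference depth tensor 330.
    pred = fused.copy()
    pred[..., 2] += ref_depth      # camera-to-node depth = camera-to-hand depth + node-to-wrist depth
    return pred                    # predicted depth tensor 380: (H/S, L/S, N, 3)
```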
Next, in step S225, K vectors are extracted from the predicted depth tensor 380 with reference to the K position index values. Based on the position index values recorded in the position index list 360 obtained from the heat map tensor 320, the corresponding K vectors are obtained from the predicted depth tensor 380, and the target list 390 is obtained. The positions of the N finger nodes are recorded in each vector. For example, the target list 390 includes the vector (J_1_1, J_1_2, ..., J_1_N), the vector (J_2_1, J_2_2, ..., J_2_N), ..., and the vector (J_K_1, J_K_2, ..., J_K_N).
For the 1st position index value (gx_1, gy_1) of the position index list 360, the corresponding vector is (J_1_1, J_1_2, ..., J_1_N), where "J_1_1", "J_1_2", ..., "J_1_N" respectively represent the positions of the N finger nodes at the position index value (gx_1, gy_1). For the 2nd position index value (gx_2, gy_2) of the position index list 360, the corresponding vector is (J_2_1, J_2_2, ..., J_2_N), where "J_2_1", "J_2_2", ..., "J_2_N" respectively represent the positions of the N finger nodes at the position index value (gx_2, gy_2). For the Kth position index value (gx_K, gy_K) of the position index list 360, the corresponding vector is (J_K_1, J_K_2, ..., J_K_N), where "J_K_1", "J_K_2", ..., "J_K_N" respectively represent the positions of the N finger nodes at the position index value (gx_K, gy_K).
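In the same sketch, step S225 is an indexing operation: for each recorded position index value, the N finger-node vectors stored at that block are gathered into the target list. The list-of-arrays layout is an assumption.

```python
def build_target_list(pred_depth, position_indices):
    # pred_depth: (H/S, L/S, N, 3); position_indices: [(gx_1, gy_1), ..., (gx_K, gy_K)].
    # Each entry is an (N, 3) array holding the vector (J_k_1, J_k_2, ..., J_k_N) of one hand.
    return [pred_depth[gx, gy] for gx, gy in position_indices]
```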
Fig. 5A and 5B are schematic diagrams illustrating detection results according to an embodiment of the invention. Fig. 5A shows the detection result for one hand. Fig. 5B shows the detection result for two hands. By the above method, one or more finger nodes of a hand can be detected in the image.
Thereafter, in step S230, a projection matrix transformation is performed on the K vectors to obtain K coordinate vectors in real space. Through the above steps, the hand poses appearing in the input image 300 can be tracked.
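Step S230 is only described as applying a projection matrix transformation. The sketch below assumes a standard pinhole back-projection with a 3 × 3 intrinsic matrix, which may differ from the actual projection matrix provided with the data set.

```python
import numpy as np

def to_real_space(target_list, projection_matrix):
    # Assumption: projection_matrix is a 3x3 pinhole intrinsic matrix
    # [[fx, 0, cx], [0, fy, cy], [0, 0, 1]].
    fx, fy = projection_matrix[0, 0], projection_matrix[1, 1]
    cx, cy = projection_matrix[0, 2], projection_matrix[1, 2]
    coords = []
    for nodes in target_list:                 # nodes: (N, 3) image-plane x, y and depth z
        x, y, z = nodes[:, 0], nodes[:, 1], nodes[:, 2]
        X = (x - cx) * z / fx                 # back-project to real-space X
        Y = (y - cy) * z / fy                 # back-project to real-space Y
        coords.append(np.stack([X, Y, z], axis=-1))
    return coords                             # K coordinate vectors in real space
```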
In summary, the embodiments of the present invention can complete two tasks at once through a single inference, namely detecting the target object and detecting the sub-targets included in the target object, without building a separate model for each task. Therefore, by applying the embodiment of the invention to multi-hand pose tracking, a plurality of finger node combinations ranked according to probability values can be output for an input image of any type.
In addition, as long as it is known whether the input source is a color image or a depth image, data sets of the same type can be selected according to the input source type to retrain the model, and the framework used by the embodiment of the invention can still complete hand detection and finger node regression in one pass without changing the CNN model architecture.
In the intermediate process of the embodiment of the invention, there is no need to crop sub-images using the bounding boxes of object detection, so the problem of reduced finger node estimation accuracy caused by cropping poor sub-images is avoided. When K hands appear in an image, a conventional multi-hand pose tracking system needs to perform K + 1 model operations, whereas in the embodiment of the present invention the positions of the K hands and their finger nodes can be obtained simultaneously after a single operation. Therefore, the embodiment of the invention can reduce the latency on a consumer terminal and improve the quality of user experience.
While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Furthermore, it is not necessary for any embodiment or claim of the invention to achieve all of the objects, advantages, or features of the invention. In addition, the abstract and the title are provided to assist in patent document searching and are not intended to limit the scope of the invention. Furthermore, the terms "first," "second," and the like, as used herein or in the appended claims, are used merely to name elements or to distinguish one embodiment or range from another, and are not intended to limit the upper or lower bound on the number of elements.

Claims (6)

1. An image recognition method, comprising:
inputting an image to a detection model to obtain a heat map tensor, a reference depth tensor, a weight tensor, and a sub-target tensor, wherein the heat map tensor comprises a plurality of probability values for predicting occurrence of a target object in a plurality of blocks corresponding to a plurality of position index values of the image, the target object comprises a plurality of sub-targets, the reference depth tensor comprises a first depth value corresponding to each block in the plurality of blocks, the first depth value is a distance between an image capturing device for capturing the image and each block, the weight tensor comprises a plurality of weight values for optimizing the plurality of sub-targets, and the sub-target tensor comprises a plurality of coordinate positions for predicting the plurality of sub-targets in the image and a plurality of second depth values of the plurality of sub-targets;
obtaining K position index values from the heat map tensor;
obtaining a fusion tensor based on the weight tensor and the sub-target tensor, wherein the fusion tensor comprises a plurality of fusion depth values obtained based on the plurality of weight values and the plurality of second depth values;
obtaining a predicted depth tensor based on the fusion tensor and the reference depth tensor, wherein the predicted depth tensor comprises a plurality of predicted depth values obtained based on the plurality of fusion depth values and the plurality of first depth values;
taking out K vectors from the predicted depth tensor by referring to K position index values; and
performing a projection matrix transformation on the K vectors to obtain K coordinate vectors in real space.
2. The image recognition method of claim 1, wherein the heat map tensor comprises a plurality of block data corresponding to the plurality of blocks, each block data of the plurality of block data comprises a corresponding position index value and two probability values, which respectively represent the probability that the corresponding block includes a left hand and the probability that the corresponding block includes a right hand,
wherein obtaining K position index values from the heat map tensor comprises:
extracting, according to the two probability values, the K position index values corresponding to the K block data having the highest probability values among the plurality of block data.
3. The image recognition method of claim 1, wherein the resolution of the image is H × L, and the heat map tensor, the reference depth tensor, the weight tensor, and the sub-target tensor, each with resolution reduced by a factor of S, are obtained after the image is input to the detection model,
wherein obtaining the fusion tensor based on the weight tensor and the sub-target tensor comprises:
convolving the weight tensor and the sub-target tensor with the following formula:
O(a, b, c, d) = Σ_{i=−⌊ks/2⌋}^{⌊ks/2⌋} Σ_{j=−⌊ks/2⌋}^{⌊ks/2⌋} W(a + i, b + j, c) · V(a + i, b + j, c, d)
wherein ks is a kernel size, W is the weight tensor, V is the sub-target tensor, a = {1, 2, ..., H/S}, b = {1, 2, ..., L/S}, c = {1, 2, ..., N}, N is the number of sub-targets, and d = {1, 2, 3}.
4. The image recognition method of claim 1, wherein the step of obtaining the predicted depth tensor based on the fusion tensor and the reference depth tensor comprises:
adding the plurality of fused depth values corresponding to each position index value in the fusion tensor to the first depth value corresponding to each position index value in the reference depth tensor to obtain the plurality of predicted depth values corresponding to each position index value.
5. The image recognition method of claim 1, wherein the detection model is a convolutional neural network-based feature extractor.
6. The image recognition method of claim 1, wherein the target object is a hand and the sub-targets are finger nodes.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110584522.0A CN115471715A (en) 2021-05-27 2021-05-27 Image recognition method

Publications (1)

Publication Number Publication Date
CN115471715A true CN115471715A (en) 2022-12-13

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination