CN113220125A - Finger interaction method and device, electronic equipment and computer storage medium - Google Patents

Finger interaction method and device, electronic equipment and computer storage medium

Info

Publication number
CN113220125A
Authority
CN
China
Prior art keywords: finger, fingertip, detection model, hand, fine
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110548068.3A
Other languages
Chinese (zh)
Inventor
任子辉
林辉
段亦涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Netease Youdao Information Technology Beijing Co Ltd
Original Assignee
Netease Youdao Information Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Netease Youdao Information Technology Beijing Co Ltd filed Critical Netease Youdao Information Technology Beijing Co Ltd
Priority to CN202110548068.3A
Publication of CN113220125A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01: Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/017: Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/107: Static hand or arm
    • G06V 40/113: Recognition of static hand signs

Abstract

The embodiments of the present invention provide a finger interaction method and apparatus, an electronic device, and a computer storage medium, wherein the method comprises the following steps: acquiring one or more images containing a hand; detecting the categories and rough positions of hand parts in the image using a first detection model; detecting, using a second detection model, a fine fingertip position within the rough position of a finger among the hand parts; and determining a target object pointed at by the finger according to the rough position of the finger and the fine position of the fingertip. With the method provided by the embodiments of the present invention, the content pointed at by the fingertip can be matched more accurately, pointing in any direction over 360 degrees can be accommodated, and robustness is improved; in addition, using the first detection model and the second detection model to perform coarse finger positioning and fine fingertip positioning respectively improves detection accuracy and reduces response time.

Description

Finger interaction method and device, electronic equipment and computer storage medium
Technical Field
Embodiments of the present invention relate to the field of computer technology, in particular to artificial intelligence technologies such as image recognition, and more particularly to a finger interaction method and apparatus, and a related electronic device and computer storage medium.
Background
This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein may include concepts that could be pursued, but are not necessarily ones that have been previously conceived or pursued. Thus, unless otherwise indicated herein, what is described in this section is not prior art to the description and claims in this application and is not admitted to be prior art by inclusion in this section.
Word lookup is a common task while reading and studying, and traditional tools (paper dictionaries, electronic notes, mobile word-lookup apps, and the like) are time-consuming and inefficient to operate. With the development of artificial intelligence and big-data technology, more intelligent learning tools (dictionary pens, point-and-read machines, and the like) have appeared and greatly improved learning efficiency. However, the dictionary pen still has limitations in a learning scenario: the user must put down the pen in hand and pick up the dictionary pen to scan and look up a word, so the overall process is not simple enough.
In recent years, some vision-based finger-pointing word-lookup technologies have appeared, which free the user's hands and reduce interruption of reading and learning during word lookup. These techniques typically locate the fingertip position and then output the content closest to the fingertip coordinates. However, they locate only a single fingertip coordinate and can determine neither the finger direction nor which finger is pointing, so only the content above the fingertip can be output. If the finger is tilted, the content actually matched by these techniques may deviate considerably from the content the user actually wants.
Disclosure of Invention
Therefore, an improved finger interaction method and related product are needed, which can accurately match the content pointed by the user.
In this context, embodiments of the present invention are intended to provide a method of finger interaction and related products.
In a first aspect of embodiments of the present invention, there is provided a finger interaction method, comprising: acquiring one or more images containing a hand; detecting categories and rough positions of hand parts in the images using a first detection model; detecting, using a second detection model, a fine fingertip position within the rough position of a finger among the hand parts; and determining a target object pointed at by the finger according to the rough position of the finger and the fine position of the fingertip.
In an embodiment of the present invention, before the second detection model is used, the method further includes: determining the intent of the hand interaction according to the time-series position features of the hand parts of each category across the plurality of images; and performing detection using the second detection model in response to the intent of the hand interaction being a word-lookup intent.
In another embodiment of the present invention, the method further comprises: selecting a corresponding target region from the image according to the rough position of the finger among the hand parts; and providing the target region to the second detection model to detect the fine fingertip position.
In a further embodiment of the present invention, the rough position of the finger is represented by a bounding box, and the target region is a cropped image that includes part or all of the bounding box and part of the area adjacent to the bounding box; preferably, a frame of a preset size, centered on the fingertip within the bounding box, is cropped as the cropped image.
In one embodiment of the present invention, determining the target object pointed at by the finger based on the rough position of the finger and the fine position of the fingertip comprises: forming a fingertip vector with the center point of the bounding box as the starting point and the coordinates of the fine fingertip position as the end point; and matching a recognition object in the cropped image according to the fingertip vector as the target object pointed at by the finger; preferably, the recognition object closest to the coordinates of the fine fingertip position along the direction of the fingertip vector is taken as the target object pointed at by the finger.
In still another embodiment of the present invention, the method further comprises: performing optical character recognition on the target region before, after, or concurrently with the detection using the second detection model to obtain a recognition result; and determining the target object pointed at by the finger comprises determining the corresponding character in the recognition result.
In a further embodiment of the present invention, the method further comprises performing any one of the following operations on the target object: a read-aloud operation or a translation operation.
In one embodiment of the invention, the categories of the hand parts comprise at least one of: palm, thumb, index finger, middle finger, ring finger, and little finger.
In another embodiment of the present invention, the first detection model and the second detection model are each trained neural network models.
In a second aspect of embodiments of the present invention, there is provided a finger interaction apparatus comprising: an acquisition module configured to acquire one or more images containing a hand; a first detection module configured to detect categories and rough positions of hand parts in the images using a first detection model; a second detection module configured to detect, using a second detection model, a fine fingertip position within the rough position of a finger among the hand parts; and a determining module configured to determine a target object pointed at by the finger according to the rough position of the finger and the fine position of the fingertip.
In one embodiment of the present invention, the apparatus further comprises: a judging module configured to determine, before the second detection model is used, the intent of the hand interaction based on the time-series position features of the hand parts of each category across the plurality of images; and the second detection module is configured to perform detection using the second detection model in response to the intent of the hand interaction being a word-lookup intent.
In another embodiment of the present invention, the apparatus further comprises: a selection module configured to select a corresponding target region from the image according to the rough position of the finger among the hand parts, and to provide the target region to the second detection model to detect the fine fingertip position.
In a further embodiment of the present invention, the rough position of the finger is represented by a bounding box, and the target region is a cropped image that includes part or all of the bounding box and part of the area adjacent to the bounding box; preferably, a frame of a preset size, centered on the fingertip within the bounding box, is cropped as the cropped image.
In one embodiment of the invention, the determining module is further configured to: form a fingertip vector with the center point of the bounding box as the starting point and the coordinates of the fine fingertip position as the end point; and match a recognition object in the cropped image according to the fingertip vector as the target object pointed at by the finger; preferably, the recognition object closest to the coordinates of the fine fingertip position along the direction of the fingertip vector is taken as the target object pointed at by the finger.
In still another embodiment of the present invention, the apparatus further comprises: a recognition module configured to perform optical character recognition on the target region before, after, or concurrently with the operation of the second detection module to obtain a recognition result; and the determining module is further configured to determine the corresponding character in the recognition result according to the rough position of the finger and the fine position of the fingertip.
In still another embodiment of the present invention, the apparatus further comprises: an execution module configured to perform any one of the following operations on the target object: a read-aloud operation or a translation operation.
In one embodiment of the invention, the categories of the hand parts comprise at least one of: palm, thumb, index finger, middle finger, ring finger, and little finger.
In another embodiment of the present invention, the first detection model and the second detection model are each trained neural network models.
In a third aspect of embodiments of the present invention, there is provided an electronic apparatus, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to perform a method according to any one of the first aspect of embodiments of the present invention.
In a fourth aspect of embodiments of the present invention, there is provided a computer readable storage medium storing a computer program which, when executed by a processor, performs a method according to any one of the first aspect of embodiments of the present invention.
According to the finger interaction method and related products of the embodiments of the present invention, the target object pointed at by the finger is determined by separately detecting the rough position of the finger among the hand parts and the fine position of the fingertip, so the content pointed at by the fingertip can be matched more accurately, pointing in any direction over 360 degrees can be accommodated, and robustness is improved. In addition, using the first detection model and the second detection model to perform coarse finger positioning and fine fingertip positioning respectively improves detection accuracy and reduces response time. In some embodiments, different gestures can be recognized from the time-series position features of different hand parts, so the current finger interaction intent is judged accurately and false triggering is avoided. In other embodiments, a small region image of appropriate size can be cropped from the original image according to the rough position of the finger and fed into the second detection model to obtain the fine fingertip position; because the original image is cropped based on the rough finger position, the amount of data the second detection model must process is greatly reduced while accuracy is maintained, which further improves response speed and enhances the user experience.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
FIG. 1 schematically illustrates an application scenario according to an embodiment of the present invention;
FIG. 2 schematically shows a flow diagram of a finger interaction method according to an embodiment of the invention;
FIG. 3 schematically shows an application scenario for detecting a category and a rough location of a part on a hand using a first detection model according to an embodiment of the invention;
FIG. 4 is a diagram schematically illustrating an application scenario for detecting a fine position of a fingertip using a second detection model according to an embodiment of the present invention;
FIG. 5 is a schematic diagram illustrating an application scenario of object recognition based on fingertip vector matching according to an embodiment of the present invention;
FIG. 6 schematically illustrates a flow chart of another embodiment of a finger interaction method according to an embodiment of the present invention;
FIG. 7 schematically illustrates a functional block diagram of a finger interaction device according to an embodiment of the present invention; and
FIG. 8 schematically illustrates a block diagram of an exemplary computing system suitable for implementing embodiments of the present invention.
In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Detailed Description
The principles and spirit of the present invention will be described with reference to a number of exemplary embodiments. It is understood that these embodiments are given solely for the purpose of enabling those skilled in the art to better understand and to practice the invention, and are not intended to limit the scope of the invention in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
According to embodiments of the present invention, a finger interaction method, apparatus, and device and a computer-readable storage medium are provided. From the following description it can be understood that the method of the embodiments of the present invention determines the target object pointed at by the finger by separately detecting the rough position of the finger among the hand parts and the fine position of the fingertip, so the content pointed at by the fingertip can be matched more accurately, pointing in any direction over 360 degrees can be accommodated, and robustness is improved; in addition, using the first detection model and the second detection model to perform coarse finger positioning and fine fingertip positioning respectively improves detection accuracy and reduces response time.
Moreover, any number of elements in the drawings are by way of example and not by way of limitation, and any nomenclature is used solely for differentiation and not by way of limitation.
The principles and spirit of the present invention are explained in detail below with reference to several representative embodiments of the invention.
Summary of the Invention
The inventors have found that vision-based finger point-and-read word-lookup technologies generally use a single detection model to locate a fingertip coordinate. With this scheme it is impossible to determine the finger direction or which finger is pointing; with only one fingertip coordinate, only the content above (usually directly above) the fingertip can be matched. If the finger is tilted, the content actually matched by this scheme deviates considerably from the content the user actually wants. The patent with application publication number CN109325464A discloses an artificial-intelligence-based finger-reading character recognition method and translation method that adds a text-angle recognition function, but it does not solve the problem of finger direction and still cannot judge where the finger is pointing.
The inventors have also found that detecting the original image with a single detection model to match the content pointed at by the user suffers from low accuracy and long processing time, and these disadvantages become more pronounced as the resolution or sharpness of the original image increases.
Based on the above findings, the inventors propose determining the finger pointing direction in two stages, so that the characters pointed at by the fingertip can be matched more accurately and more quickly. The specific concept is as follows: first, one detection model detects the rough positions of the hand parts in the original image, which is the "coarse positioning" stage; then, based on the coarse positioning result, another detection model performs a second detection on the rough position of the finger to obtain the fine position of the fingertip, which is the "fine positioning" stage. Determining the target object pointed at by the finger by separately detecting the rough position of the finger and the fine position of the fingertip allows the content pointed at by the fingertip to be matched more accurately, accommodates pointing in any direction over 360 degrees, and improves robustness; in addition, using two detection models to perform coarse finger positioning and fine fingertip positioning respectively improves detection accuracy and reduces response time.
Having described the general principles of the invention, various non-limiting embodiments of the invention are described in detail below.
Application scene overview
First, referring to fig. 1, a finger interaction method and an application scenario of a related product according to an embodiment of the present invention are described in detail.
Fig. 1 schematically shows an application scenario according to an embodiment of the present invention. It should be noted that fig. 1 is only an example of an application scenario in which the embodiment of the present invention may be applied to help those skilled in the art understand the technical content of the present invention, and does not mean that the embodiment of the present invention may not be applied to other devices, systems, environments or scenarios.
As shown in fig. 1, in a finger point-and-search application scenario, a system architecture to which the finger interaction method according to the embodiment of the present invention is applied may include an image pickup apparatus 101, a cloud server 102, and a terminal device 103. The image pickup apparatus 101 is placed above the reading material to capture images in real time from a top-down view. The terminal device 103 may process the pictures acquired by the image pickup apparatus 101 in real time and output the result; for example, the terminal device 103 may use the first detection model and/or the second detection model to perform inference, information storage, and the like on the images acquired in real time. The terminal device 103 may be any of a variety of electronic devices including, but not limited to, smart desk lamps, smart phones, tablets, laptop and desktop computers, and the like.
The cloud server 102 may perform inference, information storage, and the like of the first detection model and/or the second detection model. The terminal device 103 may interact with the cloud server 102 through a network to receive or send messages and the like. For example, the cloud server 102 may receive a first inference result of the first detection model sent by the terminal device 103, and perform second inference on the first inference result by using the second detection model.
It should be noted that, the first detection model and/or the second detection model in the finger interaction method according to the embodiment of the present invention may be deployed on the terminal device 103, or may be deployed on the cloud server 102, which is not limited in the embodiment of the present invention.
In the process of learning or reading, if a user needs to search words, the user can use a finger to point at the position of the word to be searched, and the camera device 101 can transmit the image including the hand to the terminal device 103 for image processing after acquiring the image. Then, the terminal device 103 may execute the finger interaction method according to the embodiment of the present invention; alternatively, the terminal device 103 and the cloud server 102 may execute the finger interaction method according to the embodiment of the present invention. Finally, the pronunciation, paraphrase, etc. of the word may be obtained.
Exemplary method
The finger interaction method according to an exemplary embodiment of the present invention is described below with reference to fig. 2 in conjunction with the application scenario illustrated in fig. 1. It should be noted that the above application scenarios are merely illustrated for the convenience of understanding the spirit and principles of the present invention, and the embodiments of the present invention are not limited in this respect. Rather, embodiments of the present invention may be applied to any scenario where applicable.
Referring first to FIG. 2, a flow diagram of a finger interaction method according to an embodiment of the invention is schematically shown. As shown in fig. 2, the method 200 may include: in step 210, one or more images containing a hand may be acquired. In some embodiments, images of a specific area may be captured in real time by the image pickup apparatus 101 in fig. 1, which may be a module integrated in the terminal device 103, such as a camera. In some embodiments, the one or more images containing the hand may be pre-processed, in a manner including, but not limited to, smoothing and noise reduction, median filtering, edge detection, and the like.
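As an illustration of the pre-processing mentioned above, the following sketch applies smoothing, median filtering, and edge detection with OpenCV. The library choice, kernel sizes, and thresholds are assumptions for demonstration; the embodiments do not prescribe a specific implementation.

```python
import cv2

def preprocess(frame):
    """Illustrative pre-processing of a captured frame (parameter values are assumptions)."""
    denoised = cv2.GaussianBlur(frame, (5, 5), 0)   # smoothing / noise reduction
    filtered = cv2.medianBlur(denoised, 5)          # median filtering
    edges = cv2.Canny(filtered, 50, 150)            # optional edge map
    return filtered, edges
```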
Next, in step 220, the categories and rough positions of the hand parts in the image may be detected using the first detection model. In some embodiments, the first detection model may be a trained neural network model. The initial model of the first detection model may be any type of untrained or not yet fully trained artificial neural network, such as a deep learning model. Specifically, a technician can construct the initial model of the first detection model according to the actual application requirements (for example, which layers to include, such as convolutional, pooling, activation, and batch normalization layers, the number of each kind of layer, and the size and stride of the convolution kernels). Each layer of the initial model of the first detection model may be given initial parameters, which are continuously adjusted during training of the initial model.
Specifically, the training process of the initial model of the first detection model includes: first, acquiring a training sample set in which each image sample is annotated with the category information and rough position information of the hand parts; and then training the first detection model with a machine learning method based on the training sample set and a preset loss function. The preset loss function may include one or more of a confidence loss function, a classification loss function, and a regression loss function.
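A minimal sketch of such a preset loss is given below in PyTorch. The embodiments only name the three loss terms; the particular loss functions (BCE, cross-entropy, smooth L1) and the term weights are assumptions used here for illustration.

```python
import torch.nn as nn

class DetectionLoss(nn.Module):
    """Sketch of the preset loss: confidence + classification + box regression (weights are assumptions)."""
    def __init__(self, w_conf=1.0, w_cls=1.0, w_box=5.0):
        super().__init__()
        self.conf_loss = nn.BCEWithLogitsLoss()  # is there a hand part in this grid cell?
        self.cls_loss = nn.CrossEntropyLoss()    # palm / thumb / index finger / ...
        self.box_loss = nn.SmoothL1Loss()        # (bx, by, bh, bw) regression
        self.weights = (w_conf, w_cls, w_box)

    def forward(self, pred_conf, pred_cls, pred_box, gt_conf, gt_cls, gt_box):
        w_conf, w_cls, w_box = self.weights
        return (w_conf * self.conf_loss(pred_conf, gt_conf)
                + w_cls * self.cls_loss(pred_cls, gt_cls)
                + w_box * self.box_loss(pred_box, gt_box))
```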
In other embodiments, the categories of the hand parts include at least one of: palm, thumb, index finger, middle finger, ring finger, and little finger, where each finger category refers to the head (tip) region of that finger. Preferably, the categories of the hand parts include all six: palm, thumb, index finger, middle finger, ring finger, and little finger. In some embodiments, the rough position of a hand part may be represented by a bounding box. The bounding box may in turn be represented by one or more position parameters, such as an offset in the x coordinate (bx), an offset in the y coordinate (by), an offset in width (bw), and an offset in height (bh).
For example, the first detection model may be configured to detect three categories: palm, thumb, and index finger. The first detection model may divide the image containing the hand into N × N grid cells and detect one grid cell at a time. For each grid cell, the first detection model may output a vector y = (Pc, bx, by, bh, bw, c1, c2, c3), where Pc indicates whether a target is detected in the grid cell (if not, Pc = 0 and bx, by, bh, bw are all 0), (bx, by, bh, bw) are the position parameters of the target's bounding box, and (c1, c2, c3) indicate whether the target is a palm, a thumb, or an index finger.
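The following sketch decodes one such per-cell output vector into a category and bounding box. The 0.5 confidence threshold and the dictionary layout are assumptions; they are not specified by the embodiments.

```python
import numpy as np

CLASS_NAMES = ["palm", "thumb", "index_finger"]  # the three example categories

def decode_cell(y, conf_threshold=0.5):
    """Decode one grid cell's output y = (Pc, bx, by, bh, bw, c1, c2, c3); returns None if no target."""
    pc = float(y[0])
    if pc < conf_threshold:          # threshold value is an assumption
        return None
    bx, by, bh, bw = (float(v) for v in y[1:5])
    class_id = int(np.argmax(y[5:8]))
    return {"category": CLASS_NAMES[class_id], "confidence": pc, "bbox": (bx, by, bw, bh)}
```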
To better understand step 220 of the embodiment of the present invention, fig. 3 is a schematic diagram of an application scenario for detecting the categories and rough positions of the hand parts using the first detection model. As shown in fig. 3, an input image containing a hand is fed to the trained first detection model, which detects the various categories of fingers and the palm and frames their positions with bounding boxes. After detection, the input image is found to contain two finger categories, thumb and index finger, as well as a palm, and the first detection model marks the rough position of the palm (the largest dashed box), the rough position of the thumb (the small dashed box on the left), and the rough position of the index finger (the small dashed box at the top) with bounding boxes.
In yet another embodiment of the present invention, when the detection result of the first detection model is passed to the second detection model, the method 200 may further include: selecting a corresponding target region from the image according to the rough position of the finger among the hand parts; and providing the target region to the second detection model to detect the fine fingertip position. The cloud server 102 or the terminal device 103 in fig. 1 may perform the steps of selecting the target region according to the rough finger position and providing it to the second detection model. The target region may be a preset frame around the rough position of the finger; for example, it may be a frame of preset size with the rough position of the finger as its center point. Illustratively, the preset size here may be a preset fraction (e.g., 1/16) of the first detection model's input image, or a fixed size (e.g., 10 mm × 10 mm). After the cloud server 102 or the terminal device 103 in fig. 1 selects the target region, the target region may be sent to the second detection model to detect the fine fingertip position.
It should be noted that the embodiments of the present invention do not limit the size of the target region or its positional relationship to the rough position of the finger (for example, the rough finger position may lie at the center point of the target region, above it, or below it); as long as the target region contains the finger and the content it points at, the arrangement falls within the protection scope of the embodiments of the present invention.
In other embodiments, the target region is a cropped image that includes part or all of the bounding box and part of the area adjacent to the bounding box. The cloud server 102 or the terminal device 103 in fig. 1 may cut this target region out of the first detection model's input image as the cropped image. The cropped image may include the bounding boxes of some of the hand parts or of all of them, and may also include content in the area adjacent to a bounding box, for example the content adjacent to the index finger's bounding box. The embodiments of the present invention do not limit the specific extent of the adjacent area; as long as it includes the content pointed at by the finger, it falls within the protection scope of the embodiments of the present invention.
In other embodiments, selecting the corresponding target region from the image based on the rough position of the finger includes: cropping a frame of a preset size, centered on the fingertip within the bounding box, as the cropped image. Illustratively, the frame may be centered on the fingertip within the index finger's bounding box. The preset size here may be a preset fraction (e.g., 1/16) of the first detection model's input image, or a fixed size (e.g., 10 mm × 10 mm).
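A sketch of such a crop is shown below, taking a square patch whose side is a preset fraction of the image and clamping it to the image boundary. The fraction follows the 1/16 example above; the clamping behavior and minimum size are assumptions.

```python
def crop_target_region(image, center_x, center_y, fraction=1.0 / 16):
    """Crop a square patch centered on (center_x, center_y), e.g. the detected fingertip."""
    h, w = image.shape[:2]
    size = max(int(min(h, w) * fraction), 8)           # side length; lower bound is an assumption
    half = size // 2
    x0 = min(max(int(center_x) - half, 0), max(w - size, 0))
    y0 = min(max(int(center_y) - half, 0), max(h - size, 0))
    crop = image[y0:y0 + size, x0:x0 + size]
    return crop, (x0, y0)   # offset allows mapping crop coordinates back to the original image
```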
Returning to fig. 2, the flow may proceed to step 230, in which the fine fingertip position within the rough position of the finger may be detected using the second detection model. In some embodiments, the second detection model may be a trained neural network model. The initial model of the second detection model may be any type of untrained or not yet fully trained artificial neural network, such as a deep learning model. Specifically, a technician can construct the initial model of the second detection model according to the actual application requirements (for example, which layers to include, such as convolutional, pooling, activation, and batch normalization layers, the number of each kind of layer, and the size and stride of the convolution kernels). Each layer of the initial model of the second detection model may be given initial parameters, which are continuously adjusted during training of the initial model.
Specifically, the training process of the initial model of the second detection model includes: first, acquiring a training sample set in which each image sample is an image containing a finger, for example a target region containing the rough finger position output by the first detection model; and then training the second detection model with a machine learning method based on the training sample set and a preset loss function. The preset loss function may include one or more of a confidence loss function, a classification loss function, and a regression loss function.
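For illustration only, the sketch below shows one possible shape of such a second detection model: a small convolutional network that regresses normalized fingertip coordinates from the cropped patch. The layer sizes, the sigmoid output, and the smooth-L1 training objective are assumptions; the embodiments leave the concrete architecture to the practitioner.

```python
import torch
import torch.nn as nn

class FingertipRegressor(nn.Module):
    """Toy fingertip regressor: predicts (x, y) in [0, 1] relative to the cropped patch."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, 2)

    def forward(self, patch):
        return torch.sigmoid(self.head(self.features(patch).flatten(1)))

# Training step sketch (assumed objective): smooth-L1 between predicted and labeled points.
# loss = nn.SmoothL1Loss()(model(patches), gt_points)
```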
For better understanding of step 230 of the embodiment of the present invention, fig. 4 is a schematic diagram of an application scenario for detecting a fingertip fine position by using a second detection model according to the embodiment of the present invention. As shown in fig. 4, the target region including the index finger selected by the first detection model is input to the trained second detection model, and the second detection model detects a fine fingertip position (indicated by a star symbol in the drawing) from the rough position of the index finger.
Returning to FIG. 2, the method proceeds to step 240, in which the target object pointed at by the finger may be determined based on the rough position of the finger and the fine position of the fingertip. In some embodiments, the fine fingertip position can be represented as a point; a second point is then taken from the rough position of the finger, and the two points form a directed vector that represents the pointing direction of the finger. The target object is then determined according to the direction of this vector. In another embodiment, the fingertip vector may be formed with the center point of the finger's bounding box as the starting point and the coordinates of the fine fingertip position as the end point.
In still other embodiments, step 240 may include: forming a fingertip vector with the center point of the bounding box as the starting point and the coordinates of the fine fingertip position as the end point; and matching a recognition object in the cropped image according to the fingertip vector as the target object pointed at by the finger. A recognition object in the cropped image is an object obtained by machine recognition of the content of the cropped image. Each recognition object has a corresponding position in the cropped image, so the recognition object pointed at by the fingertip vector can be matched through the positional relationship between the fingertip vector and the recognition objects, and is taken as the target object pointed at by the fingertip.
In still other embodiments, matching the recognition objects in the cropped image according to the fingertip vector includes: acquiring the recognition object closest to the coordinates of the fine fingertip position along the direction of the fingertip vector as the target object pointed at by the fingertip.
To facilitate understanding of this matching process, fig. 5 is a schematic diagram illustrating an application scenario of identifying the target object by fingertip vector matching according to the embodiment of the present invention. As shown in fig. 5, the starting point of the fingertip vector is the center point of the finger's bounding box (the dashed box), and the end point is the coordinate of the fine fingertip position (the other end of the solid line inside the dashed box). The recognition objects in the cropped image include A, B, C, and D. The distances between A, B, C, D and the coordinates of the fine fingertip position along the direction of the fingertip vector are calculated, and the closest recognition object, B, is selected as the target object pointed at by the fingertip.
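The sketch below forms the fingertip vector from the bounding-box center (point S) and the fine fingertip position (point E), and then selects the closest candidate lying ahead of the fingertip along that vector. Filtering candidates by a positive projection onto the vector is an assumption; the embodiments only require the closest object along the fingertip direction.

```python
import numpy as np

def match_target(box_center, fingertip, candidates):
    """Pick the candidate pointed at by the fingertip vector.

    `candidates` is a list of (label, (x, y)) pairs, e.g. OCR words with their center coordinates.
    """
    s = np.asarray(box_center, dtype=float)   # point S: bounding-box center
    e = np.asarray(fingertip, dtype=float)    # point E: fine fingertip position
    v = e - s                                 # fingertip vector V
    v = v / (np.linalg.norm(v) + 1e-9)

    best, best_dist = None, float("inf")
    for label, pos in candidates:
        d = np.asarray(pos, dtype=float) - e
        if np.dot(d, v) <= 0:                 # keep only objects ahead of the fingertip (assumption)
            continue
        dist = float(np.linalg.norm(d))
        if dist < best_dist:
            best, best_dist = label, dist
    return best
```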
While the finger interaction method according to the embodiment of the present invention is described above with reference to fig. 2, it will be understood by those skilled in the art that the above description is exemplary and not restrictive.
In one embodiment of the present invention, prior to using the second detection model, the method 200 may further comprise: determining the intent of the hand interaction according to the time-series position features of the hand parts of each category across the plurality of images; and performing detection using the second detection model in response to the intent of the hand interaction being a word-lookup intent. After the categories and rough positions of the hand parts are obtained in step 220, the interaction intent can be determined from the time-series position features of the different hand parts. These time-series position features can be obtained by feeding a sequence of images to the first detection model: a single frame yields the categories and rough positions of the hand parts in that frame, and multiple consecutive frames yield the time-series position features of the different hand parts.
Preferably, the current hand interaction intent can be judged comprehensively from the time-series position features of the six categories of palm, thumb, index finger, middle finger, ring finger, and little finger. The current intent can be determined through gestures; for example, if the gesture shown in fig. 3 is detected, the current intent can be judged to be a word-lookup intent. According to the embodiments of the present invention, different gesture categories can be recognized by identifying the categories of the individual fingers and using the temporal information across successive frames, so the user's intent can be judged accurately and false triggering can be avoided.
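As a purely illustrative example of how such time-series features might feed an intent decision (the concrete rule below is an assumption, not taken from the embodiments), one could require that the index finger has been detected and has stayed roughly stationary over the last few frames:

```python
from collections import deque
import math

class IntentJudge:
    """Hypothetical word-lookup heuristic: index finger present and nearly still for `window` frames."""
    def __init__(self, window=5, max_move=10.0):   # window length and pixel tolerance are assumptions
        self.window = window
        self.max_move = max_move
        self.positions = deque(maxlen=window)

    def update(self, detections):
        """`detections` maps category name -> bounding-box center for the current frame."""
        pos = detections.get("index_finger")
        if pos is None:
            self.positions.clear()
            return False
        self.positions.append(pos)
        if len(self.positions) < self.window:
            return False
        (x0, y0), (x1, y1) = self.positions[0], self.positions[-1]
        return math.hypot(x1 - x0, y1 - y0) < self.max_move
```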
In some optional implementations of this embodiment, the method 200 may further include: performing optical character recognition on the target region before, after, or concurrently with the detection using the second detection model, to obtain a recognition result; and determining the target object pointed at by the finger includes determining the corresponding character in the recognition result. Optical character recognition (OCR) refers to the process in which an electronic device examines characters in an image or on paper, determines their shapes by detecting patterns of dark and light, and then translates those shapes into computer text using character recognition methods. Here, the cloud server 102 or the terminal device 103 in fig. 1 may perform optical character recognition on the target region to obtain the recognition result. The optical character recognition step may be performed before the detection using the second detection model, after the fine fingertip position has been detected, or at the same time as the detection using the second detection model. Preferably, the optical character recognition step is performed before the detection using the second detection model, so as to improve the response speed and reduce the time consumed.
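The following sketch runs OCR on the cropped region and returns word candidates with their center positions, ready for the vector matching above. pytesseract is used only as an illustrative engine; the embodiments do not name a particular OCR implementation.

```python
import pytesseract
from pytesseract import Output

def ocr_candidates(crop):
    """Return (word, center) pairs recognized in the cropped region."""
    data = pytesseract.image_to_data(crop, output_type=Output.DICT)
    candidates = []
    for word, x, y, w, h in zip(data["text"], data["left"], data["top"],
                                data["width"], data["height"]):
        if word.strip():
            candidates.append((word, (x + w / 2.0, y + h / 2.0)))
    return candidates
```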
In some optional implementations of this embodiment, the method 200 may further include: performing any one of the following operations on the target object: a read-aloud operation or a translation operation.
FIG. 6 shows a flow diagram of another embodiment of a finger interaction method according to an embodiment of the invention. As shown in fig. 6, step 601 is executed first: the camera collects images. Thereafter, each frame image is input to the palm and finger detection model to execute step 602: palm and finger detection. At step 602, it is detected whether the current frame contains a palm and fingers, and the finger categories (thumb, index finger, middle finger, ring finger, little finger) are identified. If the palm and finger detection model detects a palm and fingers, it outputs their categories and bounding boxes, and the center point of the finger's bounding box is taken as point S. After the rough positions of the various fingers are obtained, step 603 is executed: judging the point-and-read intent. At step 603, the current finger interaction intent may be determined from the time-series position features of the different fingers. If the current intent is a point-and-read intent, step 604 follows: a small image of suitable size, containing all or part of the finger and the surrounding text, is cropped with the finger as its center point.
Next, the cropped small image is input to the fingertip fine-position regression model to execute step 605: regressing the fine fingertip position to obtain an end point along the finger direction, denoted point E. The cropped small image is simultaneously passed through the optical character recognition module to execute step 607: performing optical character recognition to obtain the OCR result. At step 607, the OCR result includes the text content and text positions around the finger. After step 605, step 606 is executed: forming a fingertip vector with the center point of the finger's bounding box as the starting point and the coordinates of the fine fingertip position as the end point, i.e., a fingertip vector V starting at S and ending at E. Then step 608 is executed: matching the character currently pointed at using the fingertip vector. At step 608, the character closest to the fingertip along the finger direction can be found from the fingertip vector and the candidate text position boxes. Finally, step 609 is executed: outputting the read-aloud or translation result. Each subsequent frame image is processed in the same way.
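Tying the steps of FIG. 6 together, a per-frame driver might look like the sketch below, reusing the hypothetical helpers sketched earlier (crop_target_region, ocr_candidates, match_target, IntentJudge). The model interfaces are assumptions; the real first and second detection models would stand behind `palm_finger_model` and `fingertip_model`.

```python
def process_frame(frame, palm_finger_model, fingertip_model, intent_judge):
    """One pass of the FIG. 6 pipeline over a single frame (interfaces are assumptions)."""
    detections = palm_finger_model(frame)                     # step 602: category -> bounding-box center
    if "index_finger" not in detections:
        return None
    s = detections["index_finger"]                            # point S: bounding-box center
    if not intent_judge.update(detections):                   # step 603: point-and-read intent?
        return None
    crop, (ox, oy) = crop_target_region(frame, s[0], s[1])    # step 604: crop a small patch
    ex, ey = fingertip_model(crop)                            # step 605: point E in crop coordinates
    e = (ox + ex, oy + ey)                                    # map back to the original image
    words = [(w, (ox + cx, oy + cy)) for w, (cx, cy) in ocr_candidates(crop)]  # step 607: OCR
    return match_target(s, e, words)                          # steps 606 + 608: vector and match
```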
Exemplary devices
Having described the method of the exemplary embodiment of the present invention, the finger interaction device of the exemplary embodiment of the present invention is next described with reference to FIG. 7.
FIG. 7 schematically shows a functional block diagram of a finger interaction apparatus according to an embodiment of the present invention. As shown in fig. 7, the apparatus 700 may include: an acquisition module 710 configured to acquire one or more images containing a hand; a first detection module 720 configured to detect categories and rough positions of hand parts in the images using a first detection model; a second detection module 730 configured to detect, using a second detection model, a fine fingertip position within the rough position of a finger among the hand parts; and a determining module 740 configured to determine a target object pointed at by the finger according to the rough position of the finger and the fine position of the fingertip.
In one embodiment of the present invention, the apparatus may further include: a judging module configured to determine, before the second detection model is used, the intent of the hand interaction based on the time-series position features of the hand parts of each category across the plurality of images; and the second detection module is configured to perform detection using the second detection model in response to the intent of the hand interaction being a word-lookup intent.
In an embodiment of the present invention, the apparatus may further include a selection module configured to select a corresponding target region from the image according to the rough position of the finger among the hand parts, and to provide the target region to the second detection model to detect the fine fingertip position.
In another embodiment of the present invention, the rough position of the finger is represented by a bounding box, and the target region is a cropped image that includes part or all of the bounding box and part of the area adjacent to the bounding box; preferably, a frame of a preset size, centered on the fingertip within the bounding box, is cropped as the cropped image.
In yet another embodiment of the present invention, the determining module 740 is further configured to: form a fingertip vector with the center point of the bounding box as the starting point and the coordinates of the fine fingertip position as the end point; and match a recognition object in the cropped image according to the fingertip vector as the target object pointed at by the finger; preferably, the recognition object closest to the coordinates of the fine fingertip position along the direction of the fingertip vector is taken as the target object pointed at by the finger.
In another embodiment of the present invention, the apparatus may further include: a recognition module configured to perform optical character recognition on the target region before, after, or concurrently with the operation of the second detection module to obtain a recognition result; and the determining module 740 is further configured to determine the corresponding character in the recognition result according to the rough position of the finger and the fine position of the fingertip.
In another embodiment of the present invention, the apparatus further comprises: an execution module configured to perform any one of the following operations on the target object: a read-aloud operation or a translation operation.
In one embodiment of the invention, the categories of the hand parts comprise at least one of: palm, thumb, index finger, middle finger, ring finger, and little finger.
In another embodiment of the present invention, the first detection model and the second detection model are each trained neural network models.
The apparatus of the embodiments of the present invention has been described and explained in detail above in connection with the method, and will not be described again here.
Exemplary computing System
Having described the method and apparatus of exemplary embodiments of the present invention, a finger interaction system of exemplary embodiments of the present invention is described next with reference to FIG. 8.
In a third aspect of embodiments of the present invention, there is provided an electronic device comprising: at least one processor; and a memory storing program instructions that, when executed by the at least one processor, cause the device to perform the method according to any one of the first aspect of embodiments of the present invention.
FIG. 8 schematically illustrates a block diagram of an exemplary computing system 800 suitable for implementing embodiments of the present invention. As shown in fig. 8, computing system 800 may include device 810 (shown in phantom) and its peripherals according to embodiments of the present invention, where device 810 performs finger interaction methods and the like to implement the methods described above in connection with the embodiments of the present invention of fig. 1-6.
As shown in fig. 8, device 810 may include a Central Processing Unit (CPU) 801, which may be a general purpose CPU, a special purpose CPU, or other execution unit on which information processing and programs run. Further, the device 810 may also include a Random Access Memory (RAM) 802 and a Read Only Memory (ROM) 803, wherein the RAM 802 may be configured to store various types of data, including character sequences, mark sequences, and the like, and the various programs required for finger interaction, and the ROM 803 may be configured to store data required for initialization, basic input/output drivers, booting an operating system, and the like for each functional module in the device 810.
Further, the device 810 may also include other hardware or components, such as a hard disk controller 805, a keyboard controller 806, a serial interface controller 807, a parallel interface controller 808, a display controller 809, etc., as shown. It is understood that although various hardware or components are shown in the device 810, this is merely exemplary and not limiting, and one skilled in the art can add or remove corresponding hardware as needed.
The above-described CPU 801, random access memory 802, read only memory 803, hard disk controller 805, keyboard controller 806, serial interface controller 807, parallel interface controller 808, and display controller 809 of the device 810 of the embodiment of the present invention may be connected to each other by a bus system 804. In one embodiment, data interaction with peripheral devices may be accomplished via the bus system 804. In another embodiment, the CPU 801 may control other hardware components and their peripherals in the device 810 through the bus system 804.
Peripheral devices to device 810 may include, for example, hard disk 810, keyboard 811, serial peripheral device 812, parallel peripheral device 813, and display 814 as shown. The hard disk 810 may be coupled with a hard disk controller 805, the keyboard 811 may be coupled with a keyboard controller 806, the serial external device 812 may be coupled with a serial interface controller 807, the parallel external device 813 may be coupled with a parallel interface controller 808, and the display 814 may be coupled with a display controller 809. It should be understood that the block diagram of FIG. 8 is for exemplary purposes only and is not intended to limit the scope of the present invention. In some cases, certain devices may be added or subtracted as the case may be.
As will be appreciated by one skilled in the art, embodiments of the present invention may be embodied as a system, method or computer program product, and may therefore take the form of hardware, software, or a combination of the two. The term "computer readable medium" as used herein refers to any tangible medium that can contain, store, communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Furthermore, in some embodiments, the present invention may also be embodied as a computer program product in one or more computer-readable storage media storing a program (or program code) of the finger interaction method which, when executed by a processor, may perform the method according to any one of the first aspect of the embodiments of the present invention.
Any combination of one or more computer-readable media may be employed. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing. Computer program code for carrying out operations for embodiments of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C + + or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable medium that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable medium produce an article of manufacture including instruction means which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
From the above description of the finger interaction scheme and its embodiments, it can be seen that the target object pointed to by a finger is determined by detecting both the rough position of the finger and the fine position of its fingertip, so that the content pointed to by the fingertip is matched more accurately, pointing in any direction over the full 360 degrees can be accommodated, and robustness is improved. In addition, using a first detection model for coarse finger positioning and a second detection model for fine fingertip positioning improves detection accuracy while reducing response time. In some embodiments, different gestures can be recognized from the temporal position features of different parts of the hand, so that the current finger interaction intent is judged accurately and false triggering is avoided. In other embodiments, a small region image of suitable size can be cropped from the original image around the rough finger position and fed to the second detection model to obtain the fine fingertip position; because the second detection model processes only this crop rather than the full image, its data processing load is greatly reduced without sacrificing accuracy, which further improves response speed and user experience.
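As a minimal sketch of this two-stage pipeline (coarse detection of hand parts on the full frame, then fine fingertip localization on a small crop), the Python fragment below may help; the `coarse_model.detect` and `fine_model.locate_fingertip` interfaces, the 224-pixel crop size, and the placement of the crop around the top-center of the coarse finger box are illustrative assumptions, not details taken from this disclosure.

```python
def crop_around_finger(image, finger_box, crop_size=224):
    """Cut a fixed-size patch out of the original image around the coarse
    finger bounding box; for this sketch the fingertip is assumed to lie
    near the top-center of the box."""
    x1, y1, x2, y2 = finger_box
    cx, cy = (x1 + x2) // 2, y1
    half = crop_size // 2
    h, w = image.shape[:2]
    left, top = max(cx - half, 0), max(cy - half, 0)
    right, bottom = min(cx + half, w), min(cy + half, h)
    return image[top:bottom, left:right], (left, top)


def detect_pointed_fingertip(image, coarse_model, fine_model):
    """Run the first (coarse) model on the whole image, crop a small patch
    around the detected finger, and run the second (fine) model on the
    patch only, which keeps the second stage cheap."""
    parts = coarse_model.detect(image)            # [(category, box, score), ...]
    fingers = [p for p in parts if p[0] == "index_finger"]
    if not fingers:
        return None
    _, finger_box, _ = max(fingers, key=lambda p: p[2])
    crop, (ox, oy) = crop_around_finger(image, finger_box)
    tip_x, tip_y = fine_model.locate_fingertip(crop)  # coordinates in the crop
    return finger_box, (tip_x + ox, tip_y + oy)       # mapped back to the full image
```

Because the second model never sees more than the crop, its input size, and hence its data processing load, stays fixed regardless of the resolution of the captured image.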
It should be noted that although several modules or units of the apparatus are mentioned in the detailed description above, this division is merely exemplary and not mandatory. Indeed, according to embodiments of the invention, the features and functions of two or more of the apparatuses described above may be embodied in a single apparatus. Conversely, the features and functions of one apparatus described above may be further divided among and embodied by a plurality of apparatuses.
Moreover, although the operations of the methods of embodiments of the invention are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in that particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Rather, the steps depicted in the flowcharts may be executed in a different order. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one, and/or one step may be broken down into several.
Use of the verbs "comprise", "include" and their conjugations in this application does not exclude the presence of elements or steps other than those stated in this application. The article "a" or "an" preceding an element does not exclude the presence of a plurality of such elements.
While the spirit and principles of the invention have been described with reference to several particular embodiments, it is to be understood that the invention is not limited to the disclosed embodiments. Nor does the division into aspects imply that features in those aspects cannot be combined to advantage; that division is for convenience of presentation only. The invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims, and the scope of the claims is to be accorded the broadest interpretation so as to encompass all such modifications and equivalent structures and functions.
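To make the target-matching step concrete, the sketch below forms the fingertip vector from the center of the coarse bounding box to the fine fingertip coordinate and returns the recognized word lying closest to the fingertip along that direction; the candidate words, their coordinates, and the 45-degree angular tolerance are illustrative assumptions rather than values given in this disclosure.

```python
import math


def fingertip_vector(finger_box, fingertip):
    """Vector from the center of the coarse bounding box to the fine
    fingertip coordinate; its direction is the pointing direction."""
    x1, y1, x2, y2 = finger_box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    return fingertip[0] - cx, fingertip[1] - cy


def pick_target(finger_box, fingertip, candidates, max_angle_deg=45.0):
    """Among OCR candidates given as (text, center) pairs, keep those lying
    roughly along the pointing direction and return the closest one."""
    vx, vy = fingertip_vector(finger_box, fingertip)
    best, best_dist = None, float("inf")
    for text, (ox, oy) in candidates:
        dx, dy = ox - fingertip[0], oy - fingertip[1]
        dist = math.hypot(dx, dy)
        if dist == 0:
            return text
        cos = (vx * dx + vy * dy) / (math.hypot(vx, vy) * dist + 1e-9)
        if cos >= math.cos(math.radians(max_angle_deg)) and dist < best_dist:
            best, best_dist = text, dist
    return best


# Example in image coordinates (y grows downward): the fingertip sits above
# the box center, so the finger points toward the word "cat" above it.
box = (100, 200, 160, 320)                  # coarse finger bounding box
tip = (130, 190)                            # fine fingertip position
words = [("cat", (132, 150)), ("dog", (132, 400))]
print(pick_target(box, tip, words))         # -> cat
```

Restricting candidates to a cone around the pointing direction is only one way to realize "closest in the direction along the fingertip vector"; a different angular tolerance or distance measure could be used instead.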

Claims (20)

1. A finger interaction method, comprising:
acquiring one or more images containing a hand;
detecting the categories and rough positions of on-hand parts in the image by using a first detection model;
detecting, by using a second detection model, a fingertip fine position within the rough position of a finger among the on-hand parts; and
determining a target object pointed to by the finger according to the rough position of the finger and the fingertip fine position.
2. The method of claim 1, further comprising, prior to utilizing the second detection model:
judging an intent of hand interaction according to temporal position features of each category of on-hand part across the plurality of images; and
performing detection using the second detection model in response to the intent of the hand interaction being a word-fetching intent.
3. The method of claim 1 or 2, further comprising:
selecting a corresponding target region from the image according to the rough position of the finger among the on-hand parts; and
providing the target region to the second detection model to detect the fingertip fine position.
4. The method according to claim 3, wherein the rough position of the finger is represented by a bounding box, and the target region is a cropped image comprising part or all of the bounding box and a portion of its neighborhood; preferably, the cropped image is an image block of a preset size, cropped centered on the fingertip in the bounding box.
5. The method of claim 4, wherein determining the target object pointed to by the finger according to the rough position of the finger and the fingertip fine position comprises:
forming a fingertip vector by taking the center point of the bounding box as a starting point and the coordinate of the fingertip fine position as an end point; and
matching a recognition object in the cropped image according to the fingertip vector as the target object pointed to by the finger; preferably, acquiring the recognition object closest to the coordinate of the fingertip fine position in the direction along the fingertip vector as the target object pointed to by the finger.
6. The method of any of claims 1-5, further comprising:
performing optical character recognition on the target region before, after, or simultaneously with the detection performed using the second detection model, to obtain a recognition result; and
wherein determining the target object pointed to by the finger comprises determining a corresponding character in the recognition result.
7. The method of any of claims 1-6, further comprising performing any one of the following operations on the target object: a read-on operation or a translation operation.
8. The method of any of claims 1-7, wherein the category of the on-hand part comprises at least one of:
palm, thumb, index finger, middle finger, ring finger and little finger.
9. The method of any of claims 1-8, wherein the first detection model and the second detection model are each trained neural network models.
10. A finger interaction device, comprising:
an acquisition module configured to acquire one or more images including a hand;
a first detection module configured to detect the categories and rough positions of on-hand parts in the image using a first detection model;
a second detection module configured to detect, using a second detection model, a fingertip fine position within the rough position of a finger among the on-hand parts; and
a determination module configured to determine a target object pointed to by the finger according to the rough position of the finger and the fingertip fine position.
11. The apparatus of claim 10, further comprising:
a judgment module configured to judge, prior to utilization of the second detection model, an intent of hand interaction according to temporal position features of each category of on-hand part across the plurality of images;
wherein the second detection module is configured to perform detection using the second detection model in response to the intent of the hand interaction being a word-fetching intent.
12. The apparatus of claim 10 or 11, further comprising:
a selection module configured to select a corresponding target region from the image according to the rough position of the finger among the on-hand parts, and to provide the target region to the second detection model to detect the fingertip fine position.
13. The apparatus according to claim 12, wherein the rough position of the finger is represented by a bounding box, and the target region is a cropped image comprising part or all of the bounding box and a portion of its neighborhood; preferably, the cropped image is an image block of a preset size, cropped centered on the fingertip in the bounding box.
14. The apparatus of claim 13, wherein the determination module is further configured to:
forming a fingertip vector by taking the center point of the bounding box as a starting point and the coordinate of the fingertip fine position as an end point; and
matching a recognition object in the cropped image according to the fingertip vector as the target object pointed to by the finger; preferably, acquiring the recognition object closest to the coordinate of the fingertip fine position in the direction along the fingertip vector as the target object pointed to by the finger.
15. The apparatus of any of claims 10-14, further comprising:
a recognition module configured to perform optical character recognition on the target region before, after, or simultaneously with the detection performed by the second detection module, to obtain a recognition result; and
wherein the determination module is further configured to:
determine a corresponding character in the recognition result according to the rough position of the finger and the fingertip fine position.
16. The apparatus of any of claims 10-15, further comprising:
an execution module configured to perform any one of the following operations on the target object: a read-on operation or a translation operation.
17. The apparatus of any of claims 10-16, wherein the category of the on-hand part comprises at least one of:
palm, thumb, index finger, middle finger, ring finger and little finger.
18. The apparatus of any of claims 10-17, wherein the first detection model and the second detection model are each trained neural network models.
19. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to perform the method of any one of claims 1-9.
20. A computer storage medium storing a computer program which, when executed by a processor, implements the method according to any one of claims 1-9.
CN202110548068.3A 2021-05-19 2021-05-19 Finger interaction method and device, electronic equipment and computer storage medium Pending CN113220125A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110548068.3A CN113220125A (en) 2021-05-19 2021-05-19 Finger interaction method and device, electronic equipment and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110548068.3A CN113220125A (en) 2021-05-19 2021-05-19 Finger interaction method and device, electronic equipment and computer storage medium

Publications (1)

Publication Number Publication Date
CN113220125A true CN113220125A (en) 2021-08-06

Family

ID=77093516

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110548068.3A Pending CN113220125A (en) 2021-05-19 2021-05-19 Finger interaction method and device, electronic equipment and computer storage medium

Country Status (1)

Country Link
CN (1) CN113220125A (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101593022A (en) * 2009-06-30 2009-12-02 华南理工大学 A kind of quick human-computer interaction of following the tracks of based on finger tip
CN111460858A (en) * 2019-01-21 2020-07-28 杭州易现先进科技有限公司 Method and device for determining pointed point in image, storage medium and electronic equipment
CN110210040A (en) * 2019-04-28 2019-09-06 深圳传音控股股份有限公司 Text interpretation method, device, equipment and readable storage medium storing program for executing
CN112016346A (en) * 2019-05-28 2020-12-01 阿里巴巴集团控股有限公司 Gesture recognition method, device and system and information processing method
CN112036315A (en) * 2020-08-31 2020-12-04 北京百度网讯科技有限公司 Character recognition method, character recognition device, electronic equipment and storage medium
CN112651298A (en) * 2020-11-27 2021-04-13 深圳点猫科技有限公司 Point reading method, device, system and medium based on finger joint positioning
CN112784810A (en) * 2021-02-08 2021-05-11 风变科技(深圳)有限公司 Gesture recognition method and device, computer equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113959493A (en) * 2021-10-22 2022-01-21 网易有道信息技术(北京)有限公司 Test system and test method for finger reading equipment
CN113959493B (en) * 2021-10-22 2024-03-08 网易有道(杭州)智能科技有限公司 Finger-reading equipment testing system and testing method

Similar Documents

Publication Publication Date Title
KR102460737B1 (en) Method, apparatus, apparatus and computer readable storage medium for public handwriting recognition
KR102383624B1 (en) System and method for superimposed handwriting recognition technology
CN107656922B (en) Translation method, translation device, translation terminal and storage medium
CN104350509B (en) Quick attitude detector
US10007859B2 (en) System and method for superimposed handwriting recognition technology
US9904847B2 (en) System for recognizing multiple object input and method and product for same
CN107885430B (en) Audio playing method and device, storage medium and electronic equipment
US20140143721A1 (en) Information processing device, information processing method, and computer program product
EP3090382B1 (en) Real-time 3d gesture recognition and tracking system for mobile devices
CN109919077B (en) Gesture recognition method, device, medium and computing equipment
CN114365075B (en) Method for selecting a graphical object and corresponding device
CN112749646A (en) Interactive point-reading system based on gesture recognition
CN110363190A (en) A kind of character recognition method, device and equipment
CN113220125A (en) Finger interaction method and device, electronic equipment and computer storage medium
CN114223021A (en) Electronic device and method for processing handwriting input
CN111782041A (en) Typing method and device, equipment and storage medium
JP5735126B2 (en) System and handwriting search method
EP3295292B1 (en) System and method for superimposed handwriting recognition technology
US20240134507A1 (en) Modifying digital content including typed and handwritten text
WO2023170315A1 (en) Merging text blocks
WO2022180016A1 (en) Modifying digital content including typed and handwritten text
WO2023170314A1 (en) Creating text block sections
CN117935352A (en) Gesture recognition method, electronic device, and computer-readable storage medium
CN115331253A (en) Information providing method and system based on pointing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination