CN114627494A - Method for positioning hand area and display device


Info

Publication number
CN114627494A
Authority
CN
China
Prior art keywords
hand
joint point
loss
loss function
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210177270.4A
Other languages
Chinese (zh)
Inventor
任攀
郑贵桢
袁毅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hisense Electronic Technology Shenzhen Co ltd
Original Assignee
Hisense Electronic Technology Shenzhen Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hisense Electronic Technology Shenzhen Co ltd filed Critical Hisense Electronic Technology Shenzhen Co ltd
Priority to CN202210177270.4A priority Critical patent/CN114627494A/en
Publication of CN114627494A publication Critical patent/CN114627494A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods

Abstract

The preset convolutional neural network model used in the method is obtained by optimizing a loss function determined using hand joint points, so that joint point information assists in positioning the hand region and the accuracy with which the convolutional neural network model outputs the hand region can be improved. The method comprises the following steps: when an instruction for positioning a hand region is received, the image captured by the image acquisition device is input into a preset convolutional neural network model to determine the hand region, wherein the hand region is a rectangular region formed by the outer edge of the hand, and the preset convolutional neural network model is obtained by optimizing a loss function determined using hand joint points.

Description

Method for positioning hand area and display device
Technical Field
The present application relates to the field of hand region positioning technology, and in particular, to a method for positioning a hand region and a display device.
Background
At present, 2D (two-dimensional) and 3D (three-dimensional) gesture technologies are developing rapidly, widening the functions of display devices. For example, a display device may present special effects based on human hands and may also interact with the user through gestures.
Realizing these functions involves positioning the hand region of the human body. In the related art, because the fingers have too many degrees of freedom, that is, they can move in multiple directions and at multiple angles, detection of the hand region is often inaccurate, which affects how well these functions are realized. For example, the user interface shown in fig. 5 is the ideal special effect: the tip of the user's index finger points upward and a rotating virtual basketball appears on the fingertip. However, when the hand region is positioned inaccurately, the positioning range is too large or too small, and the special effect actually displays as shown in fig. 6 and fig. 7. Specifically, in fig. 6 there is a gap between the virtual basketball and the index finger, and in fig. 7 the virtual basketball covers part of the index finger, so the special effect is not well presented in either case and the user experience suffers. In addition, when a user interacts with the display device through gestures, hand region detection is a precondition for the hand posture estimation involved in the interaction; if the hand region cannot be positioned accurately, the interaction between the gesture and the display device is affected.
Therefore, how to improve the accuracy of positioning the hand region becomes a problem to be solved urgently by those skilled in the art.
Disclosure of Invention
Some embodiments of the present application provide a method for positioning a hand region and a display device. The preset convolutional neural network model used in the method is obtained by optimizing a loss function determined using hand joint points, so that the accuracy with which the convolutional neural network model outputs the hand region can be improved.
In a first aspect, there is provided a display device comprising:
a display for displaying a user interface;
a user interface for receiving an input signal;
a controller respectively coupled to the display and the user interface, for performing:
when an instruction for positioning a hand region is received, determining the hand region by inputting the image captured by the image acquisition device into a preset convolutional neural network model, wherein the hand region is a rectangular region formed by the outer edge of the hand, and the preset convolutional neural network model is obtained by optimizing a loss function determined using hand joint points.
In some embodiments, the controller is configured to obtain the preset convolutional neural network model by optimizing the loss function determined using hand joint points according to the following steps:
acquiring a training set, wherein the training set comprises a plurality of sample images and real information corresponding to the sample images, and the real information comprises a real hand classification, a real hand joint point position and a real hand region frame;
preprocessing each sample image; sequentially inputting all the preprocessed sample images into an initial convolutional neural network model, and outputting corresponding prediction information, wherein the prediction information comprises hand prediction classification, hand joint point prediction positions and hand prediction region frames;
inputting the real information and the prediction information into a loss function to obtain a loss value, wherein the loss function comprises a detection frame regression loss function, a detection classification loss function, a hand joint point detection regression loss function and a hand joint point detection penalty term, the penalty term is determined according to the hand prediction region frame and a joint point circumscribed rectangle, and the joint point circumscribed rectangle is the largest rectangle enclosed by the real positions of the hand joint points;
and adjusting the initial convolutional neural network model by using the loss value, and determining the adjusted initial convolutional neural network model as a preset convolutional neural network model.
In some embodiments, the controller is configured to determine the joint point detection penalty term (1-IOLBR) from the hand prediction region frame and the joint point circumscribed rectangle according to the following formula:

IOLBR = Intersection(Area_pre, Area_lbr) / Area_lbr

where Area_lbr is the area of the joint point circumscribed rectangle, Area_pre is the area of the hand prediction region frame, and Intersection(Area_pre, Area_lbr) is the overlapping area of the region of the hand prediction region frame and the region of the joint point circumscribed rectangle.
In some embodiments, the controller is configured to perform the pre-processing of each sample image by:
subtracting the revision value from the first RGB values of all the pixel points in the sample image to obtain a second RGB value;
and taking the sample image of the second RGB value as a preprocessed sample image.
In some embodiments, the loss function is formulated as:
loss_total = loss_box + loss_class + alpha*loss_landmark + beta*(1-IOLBR);

where loss_total is the total loss function, loss_box is the detection frame regression loss function, loss_class is the detection classification loss function, loss_landmark is the hand joint point detection regression loss function, (1-IOLBR) is the hand joint point detection penalty term, alpha is the weight of the hand joint point detection regression loss function, and beta is the weight of the hand joint point detection penalty term.
In a second aspect, there is provided a method of locating a hand region, comprising:
when an instruction for positioning a hand region is received, determining the hand region by inputting the image captured by the image acquisition device into a preset convolutional neural network model, wherein the hand region is a rectangular region formed by the outer edge of the hand, and the preset convolutional neural network model is obtained by optimizing a loss function determined using hand joint points.
In some embodiments, the step of obtaining a preset convolutional neural network model using a loss function optimization determined by hand joint points comprises:
acquiring a training set, wherein the training set comprises a plurality of sample images and real information corresponding to the sample images, and the real information comprises a real hand classification, a real hand joint point position and a real hand region frame;
preprocessing each sample image; sequentially inputting all the preprocessed sample images into an initial convolutional neural network model, and outputting corresponding prediction information, wherein the prediction information comprises hand prediction classification, hand joint point prediction positions and hand prediction region frames;
inputting the real information and the prediction information into a loss function to obtain a loss value, wherein the loss function comprises a detection frame regression loss function, a detection classification loss function, a hand joint point detection regression loss function and a hand joint point detection penalty term, the penalty term is determined according to the hand prediction region frame and a joint point circumscribed rectangle, and the joint point circumscribed rectangle is the largest rectangle enclosed by the real positions of the hand joint points;
and adjusting the initial convolutional neural network model by using the loss value, and determining the adjusted initial convolutional neural network model as a preset convolutional neural network model.
In some embodiments, the joint point detection penalty term (1-IOLBR) is determined from the hand prediction region frame and the joint point circumscribed rectangle by the following calculation formula:

IOLBR = Intersection(Area_pre, Area_lbr) / Area_lbr

where Area_lbr is the area of the joint point circumscribed rectangle, Area_pre is the area of the hand prediction region frame, and Intersection(Area_pre, Area_lbr) is the overlapping area of the region of the hand prediction region frame and the region of the joint point circumscribed rectangle.
In some embodiments, the step of pre-processing the sample image comprises:
subtracting the revision value from the first RGB values of all the pixel points in the sample image to obtain a second RGB value;
and taking the sample image of the second RGB value as a preprocessed sample image.
In some embodiments, the loss function is formulated as:
loss_total = loss_box + loss_class + alpha*loss_landmark + beta*(1-IOLBR);

where loss_total is the total loss function, loss_box is the detection frame regression loss function, loss_class is the detection classification loss function, loss_landmark is the hand joint point detection regression loss function, (1-IOLBR) is the hand joint point detection penalty term, alpha is the weight of the hand joint point detection regression loss function, and beta is the weight of the hand joint point detection penalty term.
In a third aspect, a storage medium is provided having stored thereon computer instructions that, when executed by a processor, cause a computer device to perform:
when an instruction for positioning a hand region is received, determining the hand region by inputting the image captured by the image acquisition device into a preset convolutional neural network model, wherein the hand region is a rectangular region formed by the outer edge of the hand, and the preset convolutional neural network model is obtained by optimizing a loss function determined using hand joint points.
In the above embodiments, a method for positioning a hand region and a display device are provided. The preset convolutional neural network model used in the method is obtained by optimizing a loss function determined using hand joint points, and joint point information is added to assist in positioning the hand region, so that the accuracy with which the convolutional neural network model outputs the hand region can be improved. The method comprises the following steps: when an instruction for positioning a hand region is received, the image captured by the image acquisition device is input into a preset convolutional neural network model to determine the hand region, wherein the hand region is a rectangular region formed by the outer edge of the hand, and the preset convolutional neural network model is obtained by optimizing a loss function determined using hand joint points.
Drawings
FIG. 1 illustrates an operational scenario between a display device and a control apparatus according to some embodiments;
FIG. 2 illustrates a hardware configuration block diagram of the control apparatus 100 according to some embodiments;
FIG. 3 illustrates a hardware configuration block diagram of the display apparatus 200 according to some embodiments;
FIG. 4 illustrates a software configuration diagram in the display device 200 according to some embodiments;
FIG. 5 illustrates a user interface including special effects of a desired effect, according to some embodiments;
FIG. 6 illustrates a user interface including special effects when the hand region is over-positioned, according to some embodiments;
FIG. 7 illustrates a user interface including special effects when the hand region is positioned too small, according to some embodiments;
FIG. 8 illustrates a flow chart of a method of locating a hand region according to some embodiments;
FIG. 9 illustrates a user interaction diagram in accordance with some embodiments;
FIG. 10 illustrates a user interface diagram according to some embodiments;
FIG. 11 illustrates a hand area schematic according to some embodiments;
FIG. 12 illustrates yet another hand area schematic according to some embodiments;
FIG. 13 illustrates a schematic diagram of a coordinate system according to some embodiments;
FIG. 14 illustrates a schematic diagram of another coordinate system according to some embodiments;
FIG. 15 illustrates a schematic diagram of all hand joint point locations of a hand, according to some embodiments;
FIG. 16 illustrates a schematic diagram of a joint point bounding rectangle, in accordance with some embodiments;
FIG. 17 illustrates a schematic diagram of another joint point circumscribing rectangle, in accordance with some embodiments;
FIG. 18 illustrates a schematic diagram of a coincidence region according to some embodiments;
FIG. 19 illustrates a hand area schematic according to some embodiments;
FIG. 20 illustrates another hand area schematic diagram in accordance with some embodiments;
FIG. 21 illustrates a schematic view of the hand regions of FIG. 19 and FIG. 20 superimposed, according to some embodiments.
Detailed Description
To make the purpose and embodiments of the present application clearer, the following will clearly and completely describe the exemplary embodiments of the present application with reference to the attached drawings in the exemplary embodiments of the present application, and it is obvious that the described exemplary embodiments are only a part of the embodiments of the present application, and not all of the embodiments.
It should be noted that the brief descriptions of the terms in the present application are only for convenience of understanding of the embodiments described below, and are not intended to limit the embodiments of the present application. These terms should be understood in their ordinary and customary meaning unless otherwise indicated.
The terms "first," "second," "third," and the like in the description and claims of this application and in the above-described drawings are used for distinguishing between similar or analogous objects or entities and not necessarily for describing a particular sequential or chronological order, unless otherwise indicated. It is to be understood that the terms so used are interchangeable under appropriate circumstances.
The terms "comprises" and "comprising," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a product or apparatus that comprises a list of elements is not necessarily limited to all elements expressly listed, but may include other elements not expressly listed or inherent to such product or apparatus.
The display device provided by the embodiment of the present application may have various implementation forms, and for example, the display device may be a television, a smart television, a laser projection device, a display (monitor), an electronic whiteboard (electronic whiteboard), an electronic desktop (electronic table), and the like. Fig. 1 and 2 are specific embodiments of a display device of the present application.
Fig. 1 is a schematic diagram of an operation scenario between a display device and a control apparatus according to an embodiment. As shown in fig. 1, a user may operate the display apparatus 200 through the smart device 300 or the control device 100.
In some embodiments, the control apparatus 100 may be a remote controller, and the communication between the remote controller and the display device includes an infrared protocol communication or a bluetooth protocol communication, and other short-distance communication methods, and controls the display device 200 in a wireless or wired manner. The user may input a user instruction through a key on a remote controller, voice input, control panel input, etc., to control the display apparatus 200.
In some embodiments, the smart device 300 (e.g., mobile terminal, tablet, computer, laptop, etc.) may also be used to control the display device 200. For example, the display device 200 is controlled using an application program running on the smart device.
In some embodiments, the display device may not receive instructions using the smart device or control device described above, but rather receive user control through touch or gestures, or the like.
In some embodiments, the display device 200 may also be controlled in a manner other than the control apparatus 100 and the smart device 300, for example, the voice command control of the user may be directly received through a module configured inside the display device 200 to obtain a voice command, or may be received through a voice control device provided outside the display device 200.
In some embodiments, the display device 200 is also in data communication with a server 400. The display device 200 may be communicatively connected through a Local Area Network (LAN), a Wireless Local Area Network (WLAN), or other networks. The server 400 may provide various contents and interactions to the display device 200. The server 400 may be one cluster or a plurality of clusters, and may include one or more types of servers.
Fig. 2 exemplarily shows a block diagram of a configuration of the control apparatus 100 according to an exemplary embodiment. As shown in fig. 2, the control device 100 includes a controller 110, a communication interface 130, a user input/output interface 140, a memory, and a power supply. The control apparatus 100 may receive an input operation instruction from a user and convert the operation instruction into an instruction recognizable and responsive by the display device 200, serving as an interaction intermediary between the user and the display device 200.
As shown in fig. 3, the display apparatus 200 includes at least one of a tuner demodulator 210, a communicator 220, a detector 230, an external device interface 240, a controller 250, a display 260, an audio output interface 270, a memory, a power supply, and a user interface.
In some embodiments the controller comprises a processor, a video processor, an audio processor, a graphics processor, a RAM, a ROM, a first interface to an nth interface for input/output.
The display 260 includes a display screen component for presenting pictures and a driving component for driving image display; it receives image signals output from the controller and displays video content, image content, menu manipulation interfaces, and the user manipulation UI interface.
The display 260 may be a liquid crystal display, an OLED display, and a projection display, and may also be a projection device and a projection screen.
The communicator 220 is a component for communicating with an external device or a server according to various communication protocol types. For example: the communicator may include at least one of a Wifi module, a bluetooth module, a wired ethernet module, and other network communication protocol chips or near field communication protocol chips, and an infrared receiver. The display apparatus 200 may establish transmission and reception of control signals and data signals with the external control apparatus 100 or the server 400 through the communicator 220.
A user interface for receiving control signals from the control apparatus 100 (e.g., an infrared remote control).
The detector 230 is used to collect signals of the external environment or interaction with the outside. For example, detector 230 includes a light receiver, a sensor for collecting ambient light intensity; alternatively, the detector 230 includes an image collector, such as a camera, which may be used to collect external environment scenes, attributes of the user, or user interaction gestures, or the detector 230 includes a sound collector, such as a microphone, which is used to receive external sounds.
The external device interface 240 may include, but is not limited to, the following: high Definition Multimedia Interface (HDMI), analog or data high definition component input interface (component), composite video input interface (CVBS), USB input interface (USB), RGB port, and the like. Or may be a composite input/output interface formed by the plurality of interfaces.
The tuner demodulator 210 receives a broadcast television signal through a wired or wireless reception manner, and demodulates an audio/video signal, such as an EPG data signal, from a plurality of wireless or wired broadcast television signals.
In some embodiments, the controller 250 and the modem 210 may be located in different separate devices, that is, the modem 210 may also be located in an external device of the main device where the controller 250 is located, such as an external set-top box.
The controller 250 controls the operation of the display device and responds to the user's operation through various software control programs stored in the memory. The controller 250 controls the overall operation of the display apparatus 200. For example: in response to receiving a user command for selecting a UI object to be displayed on the display 260, the controller 250 may perform an operation related to the object selected by the user command.
In some embodiments the controller comprises at least one of a Central Processing Unit (CPU), a video processor, an audio processor, a Graphics Processing Unit (GPU), a Random Access Memory (RAM), a Read-Only Memory (ROM), a first to an nth interface for input/output, a communication Bus (Bus), and the like.
A user may input a user command on a Graphical User Interface (GUI) displayed on the display 260, and the user input interface receives the user input command through the Graphical User Interface (GUI). Alternatively, the user may input the user command by inputting a specific sound or gesture, and the user input interface receives the user input command by recognizing the sound or gesture through the sensor.
A "user interface" is a media interface for interaction and information exchange between an application or operating system and a user that enables the conversion of the internal form of information to a form acceptable to the user. A commonly used presentation form of the User Interface is a Graphical User Interface (GUI), which refers to a User Interface related to computer operations and displayed in a graphical manner. It may be an interface element such as an icon, a window, a control, etc. displayed in the display screen of the electronic device, where the control may include a visual interface element such as an icon, a button, a menu, a tab, a text box, a dialog box, a status bar, a navigation bar, a Widget, etc.
Referring to fig. 4, in some embodiments, the system is divided into four layers, which are, from top to bottom, an Application (Applications) layer (referred to as an "Application layer"), an Application Framework (Application Framework) layer (referred to as a "Framework layer"), an Android runtime (Android runtime) layer and a system library layer (referred to as a "system runtime library layer"), and a kernel layer.
In some embodiments, at least one application program runs in the application program layer, and the application programs may be windows (windows) programs carried by an operating system, system setting programs, clock programs or the like; or an application developed by a third party developer. In particular implementations, the application packages in the application layer are not limited to the above examples.
The framework layer provides an Application Programming Interface (API) and a programming framework for the application. The application framework layer includes a number of predefined functions. The application framework layer acts as a processing center that decides to let the applications in the application layer act. The application program can access the resources in the system and obtain the services of the system in execution through the API interface.
As shown in fig. 4, in the embodiment of the present application, the application framework layer includes a manager (Managers), a Content Provider (Content Provider), and the like, where the manager includes at least one of the following modules: an Activity Manager (Activity Manager) is used for interacting with all activities running in the system; the Location Manager (Location Manager) is used for providing the system service or application with the access of the system Location service; a Package Manager (Package Manager) for retrieving various information related to an application Package currently installed on the device; a Notification Manager (Notification Manager) for controlling display and clearing of Notification messages; a Window Manager (Window Manager) is used to manage the icons, windows, toolbars, wallpapers, and desktop components on a user interface.
In some embodiments, the activity manager is used to manage the lifecycle of the various applications as well as general navigational fallback functions, such as controlling exit, opening, fallback, etc. of the applications. The window manager is used for managing all window programs, such as obtaining the size of a display screen, judging whether a status bar exists, locking the screen, intercepting the screen, controlling the change of the display window (for example, reducing the display window, displaying a shake, displaying a distortion deformation, and the like), and the like.
In some embodiments, the system runtime layer provides support for the upper layer, i.e., the framework layer, and when the framework layer is used, the android operating system runs the C/C + + library included in the system runtime layer to implement the functions to be implemented by the framework layer.
In some embodiments, the kernel layer is a layer between hardware and software. As shown in fig. 4, the kernel layer includes at least one of the following drivers: an audio driver, a display driver, a Bluetooth driver, a camera driver, a Wi-Fi driver, a USB driver, an HDMI driver, sensor drivers (e.g., fingerprint sensor, temperature sensor, pressure sensor), a power driver, and the like.
In some embodiments, the display device may extend the functionality of the display device through its installed applications. Illustratively, special effect showing functions and functions for interacting with the user can be provided for the user.
The display device implements special-effect display based on human hands and interacts with the user through gestures, and realizing these functions involves positioning the hand region. In the related art, because the fingers have too many degrees of freedom, that is, they can move in multiple directions and at multiple angles and the hand has a large range of motion, hand region detection is inaccurate, which affects how well the above functions are realized.
For example, the ideal effect of the special effect is that when the tip of the user's index finger is pointing upwards, a rotating virtual basketball appears on the tip of the index finger, simulating a scene in which the user is rotating the ball. A specific implementation may be to locate a hand region based on the identified image, the lower end of the virtual basketball being attached to the upper end of the hand region, and the index finger tip pointing to the center of the virtual basketball. As shown in FIG. 5, FIG. 5 illustrates a user interface including special effects of a desired effect according to some embodiments when the hand region is accurately positioned, i.e., drawn against the edge of the hand, with the top of the tip of the index finger against the top of the hand region 500 and the tip of the index finger against a virtual basketball. As can be seen from fig. 5, the currently displayed special effect may reflect the scene of the actual user turning the ball.
However, in the related art the hand region is not accurately positioned. As shown in fig. 6, which illustrates a user interface including a special effect when the positioning range of the hand region is too large, the hand region 500 does not fit the edge of the hand: the uppermost edge of the hand region 500 is some distance above the tip of the index finger, so the virtual basketball also floats some distance above the fingertip. This differs from a scene of a real user spinning a ball and degrades the display of the special effect.
As shown in fig. 7, which illustrates a user interface including a special effect when the hand region 500 is positioned too small according to some embodiments, the hand region 500 covers only part of the hand and the virtual basketball covers part of the index finger, so an overly small hand region also degrades the special effect.
In both cases the special effect cannot be displayed well, which affects the user experience.
In addition, when the interaction between the user and the display device is realized through the gesture, the hand region positioning is a precondition for estimating the hand posture involved in the interaction process, and if the hand region cannot be accurately positioned, the interaction effect between the gesture and the display device is affected.
Therefore, in some scenarios, how to improve the accuracy of positioning the hand region becomes an urgent problem to be solved by those skilled in the art.
In order to solve the above technical problem, embodiments of the present application provide a method for positioning a hand region. The preset convolutional neural network model used in the method is obtained by optimizing a loss function determined using hand joint points, and joint point information is added to assist in positioning the hand region, so that the accuracy with which the convolutional neural network model outputs the hand region can be improved.
It should be noted that the method in the embodiment of the present application may be applied to a display device, and may also be applied to a computer, VR glasses, AR glasses, and other terminals capable of carrying a camera.
As shown in fig. 8, fig. 8 illustrates a flow chart of a method of locating a hand region, the method including:
s100, receiving an instruction of positioning the hand area. In the embodiment of the application, the hand region needs to be positioned in various scenes, and the hand region of the user hand in the acquired image is mainly positioned.
In some embodiments, launching the first preset application may generate an instruction to locate the hand region. For example, the first preset application may be a game application in which a controlled object moves according to the recognized hand region. When the position of the hand area changes, the controlled object changes along with the change. In the embodiment of the application, when the game application is started, the instruction for positioning the hand area is directly generated.
As shown in fig. 9, fig. 9 illustrates a user interaction diagram according to some embodiments, and (a) of fig. 9 illustrates a user interface containing a game application control with the game application control displayed thereon. In some embodiments, when the game application control is displayed on the display device, other controls that can be implemented may also be displayed, which is not limited herein. The user can move the focus to the game application control through the control device and press the confirmation key on the control device, the user interface jumps to the user interface shown in fig. 9 (b), wherein (b) in fig. 9 shows a game screen, and the user can move the hand to control the movement of the controlled object (airplane) in the user interface. It should be noted that, in the user interface including the game screen in fig. 9 (b), in general, two controlled objects do not appear in the actual game process, and only the moved controlled object, that is, the controlled object pointed by the dotted arrow in the user interface, exists. The purpose of marking the controlled object before and after movement in fig. 9 (b) is to clearly display the movement process of the controlled object. Of course, in some embodiments, the controlled object before moving, that is, the controlled object at the start point of the dotted arrow may also be retained according to the requirements of the game setting.
In some embodiments, a second preset application is launched, the second preset application comprising an item. Starting the project, an instruction to locate the hand region may be generated. For example, the item may be a special effects item, a special effect may be added based on a hand, and when the special effects item is launched, an instruction to locate a hand region is generated.
For example, as shown in fig. 10, fig. 10 illustrates a user interface diagram according to some embodiments, in which a special effect control corresponding to a special effect item is displayed, and a user uses a control device to move a focus on the special effect control and presses a confirmation key on the control device, so as to generate an instruction for positioning a hand region. After the hand area is positioned, the special effect is added to the hand of the user by utilizing the hand area. In some embodiments, when the display device displays the special effect control, a control that can implement other functions may also be displayed, which is not limited herein.
In some embodiments, the special effect control may be displayed on a floating layer of the user interface, and the special effect control is not located at a lower portion of the user interface, so as to prevent the special effect control from blocking other display contents on the user interface.
In some embodiments, if it is desired to cancel adding an effect, the focus may again be moved to the effect control and a confirmation key on the control device is pressed, at which time the effect display is canceled.
In some embodiments, the special effects control may be invoked for display by pressing a particular key on the control device. Specifically, after entering the second preset application, the special effect control is not directly displayed on the user interface, because if the special effect control is always displayed, a part of the user interface may be blocked, and the integrity of content display in the user interface is affected. In the embodiment of the application, when a user presses a specific key on the control device, the display interface is controlled to display the special effect control, so that the special effect control can be displayed when needed, and the situation that the special effect control always shields the user interface is avoided.
In some embodiments, after the special effect control is selected, the special effect item is started and the special effect control is not displayed. In other embodiments, the user may press the particular key again, controlling the display device not to display the special effects control.
It should be noted that, in the embodiments of the present application, when the focus is moved to a control, the frame of the control may be thickened. Other forms may also indicate that a control is selected: for example, the shape of the control may change from square to circular when the focus moves onto it, or the control may be enlarged at a preset ratio when selected, e.g., the display areas of the video controls on the user interface are the same, and when one is selected its display area is increased to 1.2 times that of the original control. Because neither the position of the focus nor the form of the control is limited, any other form that lets the user easily distinguish the selected control is acceptable.
S200, when an instruction for positioning the hand region is received, the image captured by the image acquisition device is input into a preset convolutional neural network model to determine the hand region.
In this embodiment of the application, the image capturing device may be a device built in the display device, or may also be an external device connected to the display device, for example, as shown in fig. 1, the image capturing device 231 is installed on the display device 200. In some embodiments, the image capturing Device may be an RGB camera, which typically uses three independent CCD sensors (Charge-coupled devices) to acquire three color signals.
In the embodiment of the application, in order to determine the optimal hand region, namely the rectangular region formed by the outer edge of the hand, the preset convolutional neural network model is adopted to process the image. Illustratively, as shown in fig. 11-12, fig. 11 and 12 each illustrate a hand area diagram according to some embodiments, and fig. 11 and 12 illustrate optimal hand areas 500 for different poses of the hand.
In the embodiments of the present application, the preset convolutional neural network model is obtained by optimizing a loss function determined using hand joint points. Because the model is optimized using information related to the hand joint points, its parameters are set more accurately, which improves the accuracy of the data output by the preset convolutional neural network model.
In some embodiments, the step of obtaining a preset convolutional neural network model using a loss function optimization determined by hand joint points comprises:
acquiring a training set, wherein the training set comprises a plurality of sample images and real information corresponding to the sample images, and the real information comprises a real hand classification, a real hand joint point position and a real hand region frame.
In the embodiments of the present application, the sample images in the training set are captured by an image acquisition device. In some embodiments, in order to obtain a more accurate preset convolutional neural network model, the sample images are captured with the same image acquisition device mentioned above, that is, the device that captures the images input into the preset convolutional neural network model; this prevents differences between images of the same scene taken by different devices from interfering with learning the preset convolutional neural network model.
In the embodiments of the present application, the hand classification includes left hand and right hand. In some embodiments, a "1" may be used when the classification is left hand and a "0" when it is right hand. It is understood that other ways of identifying the left hand and the right hand can be adopted; any way of distinguishing the two is acceptable in the embodiments of the present application.
In some embodiments, the hand joint point locations are shown in the form of coordinates of hand joint points.
In some embodiments, the hand true area box is represented in coordinates of four vertices.
In some embodiments, the coordinates are arranged in accordance with a coordinate system a, as shown in fig. 13, fig. 13 schematically illustrating a schematic view of a coordinate system according to some embodiments. The coordinate system A takes the upper left corner of the image as an origin, is arranged to be in the positive direction of the X axis along the transverse direction to the right, and is arranged to be in the positive direction of the Y axis along the longitudinal direction to the down. In some embodiments, the coordinates are arranged in accordance with a coordinate system B, as shown in fig. 14, fig. 14 exemplarily showing a schematic view of another coordinate system according to some embodiments, the coordinate system B having the center of the image as an origin, being arranged in the positive x-axis direction to the right in the lateral direction, and being arranged in the positive y-axis direction to the down in the longitudinal direction.
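Purely for illustration, a point can be converted between coordinate system A and coordinate system B as in the sketch below, assuming the image width and height in pixels are known; the embodiments only fix the two conventions and do not prescribe such a routine.

    def corner_to_center(x, y, image_width, image_height):
        """Convert a point from coordinate system A (origin at the top-left
        corner, x to the right, y downward) to coordinate system B (origin
        at the image center, same axis directions)."""
        return x - image_width / 2.0, y - image_height / 2.0


    def center_to_corner(x, y, image_width, image_height):
        """Inverse conversion from coordinate system B back to system A."""
        return x + image_width / 2.0, y + image_height / 2.0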
In some embodiments, when the user's hand is fully extended, there are 21 degrees of freedom in the human hand, which in this context is understood to mean that the number of hand joint locations totals 21, including the finger pulp, joints, and palm root of each finger, regardless of the six degrees of freedom in relative space of the entire hand. As shown in fig. 15, fig. 15 illustrates a schematic diagram of overall hand joint positions of a hand according to some embodiments.
In some embodiments, the method for determining the real information corresponding to a sample image is as follows. The hand classification, hand joint point positions, and hand region frame are first determined automatically; the automatic mode directly outputs the hand classification, hand joint point positions, and hand region frame corresponding to the sample image. However, because the information in the training set must be accurate for the trained preset convolutional neural network model to be accurate, the automatically determined information is then corrected manually. Finally, the corrected hand classification, hand joint point positions, and hand region frame are taken as the real hand classification, real hand joint point positions, and real hand region frame, that is, the real information corresponding to the sample image. In the embodiments of the present application, combining the automatic mode and the manual mode to determine the real information corresponding to the sample images ensures the accuracy of the real information.
In the embodiments of the present application, because the number of sample images is large, determining the real information entirely by hand would require enormous labor. Therefore, the hand classification, hand joint point positions, and hand region frame are first determined automatically and then corrected manually, which reduces the manual workload while still ensuring the accuracy of the obtained real information.
In some embodiments, the database stores pre-stored images, and hand classifications, hand joint point locations, and hand region boxes corresponding to the pre-stored images. The automatic mode can be that the sample image is matched with a prestored image in a database, the similarity between the sample image and the prestored image is calculated, the prestored image with the maximum similarity with the sample image is screened out from the database according to the similarity, and a hand classification, a hand joint point position and a hand area frame corresponding to the prestored image are used as the hand classification, the hand joint point position and the hand area frame corresponding to the sample image obtained in the automatic mode. Of course, the embodiment of the present application is not limited to the specific automatic mode, and other modes may be adopted to determine the hand classification, the hand joint point position, and the hand region frame.
There are many methods for calculating the similarity between images. For example, histogram matching calculates the histograms of the sample image and a pre-stored image and then a normalized correlation measure between the two histograms, such as the Bhattacharyya distance or the histogram intersection distance. The present application does not limit the specific method for calculating the similarity between images, which is not repeated here.
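As an illustrative sketch only, the histogram matching described above might be implemented with OpenCV as follows; the use of cv2 and of the Bhattacharyya comparison method are assumptions for this sketch, not a limitation of the embodiments.

    import cv2


    def histogram_similarity(sample_img, stored_img, bins=32):
        """Compare two color images by their 3-channel histograms.
        Returns a similarity score in [0, 1]; 1 means identical histograms."""
        hists = []
        for img in (sample_img, stored_img):
            h = cv2.calcHist([img], [0, 1, 2], None, [bins] * 3,
                             [0, 256, 0, 256, 0, 256])
            cv2.normalize(h, h)
            hists.append(h)
        # Bhattacharyya distance: 0 for identical histograms, larger otherwise.
        dist = cv2.compareHist(hists[0], hists[1], cv2.HISTCMP_BHATTACHARYYA)
        return 1.0 - dist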
In the embodiment of the application, in order to reduce the interference of the environment on the hand image in the sample image, each sample image is preprocessed. Through the preprocessing process, the interference of the non-hand region in each sample image in the training set can be removed, and the method is beneficial to rapidly determining the hand classification, the hand joint point position and the hand region frame.
In some embodiments, the step of pre-processing each sample image comprises: and subtracting the revision value from the first RGB values of all the pixel points in the sample image to obtain a second RGB value. And taking the sample image of the second RGB value as a preprocessed sample image. In the embodiment of the application, an RGB camera is adopted to shoot a sample image, and each pixel in the sample image has a corresponding RGB value, namely a first RGB value.
In the embodiments of the present application, the revision value (a mean value) is obtained through extensive calculation and research, and it may differ in different scenarios.
Illustratively, the revision value may be (104, 117, 123). If the first RGB value is (255, 123, 204), the second RGB value is (255-104, 123-117, 204-123) = (151, 6, 81).
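A minimal sketch of this preprocessing step, assuming NumPy arrays in RGB order and the per-channel revision value from the example above; in practice the revision value is chosen per scene as described below.

    import numpy as np

    # Per-channel revision (mean) value taken from the example above; it is
    # an assumption here and would differ in other scenes.
    REVISION_VALUE = np.array([104.0, 117.0, 123.0], dtype=np.float32)


    def preprocess(sample_image: np.ndarray) -> np.ndarray:
        """Subtract the revision value from the first RGB value of every
        pixel to obtain the second RGB value, i.e. the preprocessed image."""
        return sample_image.astype(np.float32) - REVISION_VALUE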
In some embodiments, the step of determining the revision value may include: the preset revision values are pre-calculated in different environments, which may be understood as other areas than the hand in the sample image. And storing the environment image and the corresponding preset revision value into a database. And searching the stored environment image which is the same as the environment image in the sample image in the database, and extracting the corresponding preset revision value as the revision value in the preprocessing step.
If the exact same environmental image is not found in the database, the preset revision value corresponding to the most similar environmental image in the database may be used as the revision value in the preprocessing step according to the ranking with the similarity.
And sequentially inputting all the preprocessed sample images into an initial convolutional neural network model, and outputting corresponding prediction information, wherein the prediction information comprises hand prediction classification, hand joint point prediction positions and hand prediction region frames.
In the embodiment of the application, the input data of the initial convolutional neural network model is an image, the output data of the initial convolutional neural network model is a hand classification, a hand joint point position and a hand region frame, and the initial convolutional neural network model is designed according to the hand classification and the hand region frame. To distinguish from the above real data, the output data of the initial convolutional neural network model is referred to as hand prediction classification, hand joint prediction position, and hand prediction region box.
In some embodiments, the initial convolutional neural network model refers to a model that is not adjusted by the loss function, and the parameters in the initial convolutional neural network model are random values.
In the embodiment of the application, parameters in the initial convolutional neural network model are continuously adjusted through loss values corresponding to loss functions, the smaller the loss value is, the closer the data output by the preset convolutional neural network model is to the real data, and finally the initial convolutional neural network model after the parameters are adjusted for multiple times is used as the preset convolutional neural network model.
In some embodiments, the real information and the prediction information are input into a loss function to obtain a loss value, wherein the loss function includes a detection frame regression loss function, a detection classification loss function, a hand joint point detection regression loss function, and a hand joint point detection penalty term; the penalty term is determined according to the hand prediction region frame and a joint point circumscribed rectangle, and the joint point circumscribed rectangle is the largest rectangle enclosed by the real positions of the hand joint points. In the embodiments of the present application, the detection frame regression loss function and the detection classification loss function are commonly used when training convolutional neural network models in various fields; on top of these two, the loss function here further includes a hand joint point detection regression loss function and a hand joint point detection penalty term. Adding hand joint point information to the loss function makes the loss value calculated from it more accurate when the parameters of the initial convolutional neural network model are adjusted, so the preset convolutional neural network model obtained after adjustment is more accurate.
And adjusting the initial convolutional neural network model by using the loss value, and determining the adjusted initial convolutional neural network model as a preset convolutional neural network model. In the embodiment of the application, the parameters set by the initial convolutional neural network model are adjusted through the loss values, and the output effect of the preset convolutional neural network model obtained by the initial convolutional neural network model is more accurate after the parameters are adjusted through the loss values corresponding to a large amount of sample data.
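The adjustment described above could look like the following PyTorch-style sketch; the model, optimizer, data loader, and compute_loss helper are placeholders assumed for illustration and are not specified by the embodiments.

    def fit_preset_model(model, optimizer, data_loader, compute_loss, epochs=10):
        """Adjust the initial convolutional neural network model with the
        loss values; the adjusted model is then used as the preset model."""
        model.train()
        for _ in range(epochs):
            for images, real_info in data_loader:
                # Prediction info: hand classification, joint point positions,
                # and hand prediction region frame.
                pred_info = model(images)
                # Loss combining box regression, classification, joint point
                # regression, and the joint point detection penalty term.
                loss = compute_loss(pred_info, real_info)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        return model  # taken as the preset convolutional neural network model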
In some embodiments, the joint point detection penalty term (1-IOLBR) is determined from the hand prediction region frame and the joint point circumscribed rectangle according to the following formula:

IOLBR = Intersection(Area_pre, Area_lbr) / Area_lbr

where Area_lbr is the area of the joint point circumscribed rectangle, Area_pre is the area of the hand prediction region frame, and Intersection(Area_pre, Area_lbr) is the overlapping area of the region of the hand prediction region frame and the region of the joint point circumscribed rectangle.
It should be noted that the joint point circumscribed rectangle is the largest rectangle enclosed by the real positions of the hand joint points. For example, as shown in fig. 16 and 17, which respectively illustrate schematic diagrams of a joint point circumscribed rectangle according to some embodiments, the joint point circumscribed rectangle 600 may be determined from the real positions of the four most marginal hand joint points among all real hand joint point positions, with the four sides of the joint point circumscribed rectangle 600 respectively passing through these four joint points. In fig. 16 and 17, only the four most marginal hand joint points are labeled; the other joint points are not labeled.
In some embodiments, the joint point bounding rectangle is represented in coordinates of four vertices.
In some embodiments, the method for screening the real positions of the four most marginal hand joint points may be: and determining the position of the hand joint point with the maximum vertical coordinate, the minimum vertical coordinate, the maximum horizontal coordinate and the minimum horizontal coordinate in all the real hand joint points.
In one example, the position of the hand joint point with the largest ordinate is (2, 16), the position of the hand joint point with the smallest ordinate is (1, 2), the position of the hand joint point with the largest abscissa is (6, 13), and the position of the hand joint point with the smallest abscissa is (-1, 9), then the real positions of the hand joint points at the four most edges are determined to be (2, 16), (1, 2), (6, 13), and (-1, 9), respectively, and the four vertexes of the circumscribed rectangle of the determined joint points are (-1, 2), (-1, 16), (6, 2), and (6, 16), respectively.
In another example, the position of the hand joint point with the largest ordinate is (2, 16), the position of the hand joint point with the smallest ordinate is (-1, 1), the position of the hand joint point with the largest abscissa is (6, 13), the position of the hand joint point with the smallest abscissa is (-1, 1), the real positions of the hand joint points with the four most edges are determined to be (2, 16), (-1, 1), (6, 13) and (-1, 1), the vertex of (-1, 1) is taken as one vertex, and the four vertices of the circumscribed rectangle of the finally determined joint points are (-1, 1), (-1, 16), (6, 16), (6, 1), respectively.
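Under the assumption that joint point positions are given as (x, y) pairs in the coordinate system above, the circumscribed rectangle of these examples can be computed as sketched below; this is only an illustration, not the embodiments' exact procedure.

    def joint_circumscribed_rectangle(joint_points):
        """Return (x_min, y_min, x_max, y_max) of the rectangle determined by
        the most marginal joint points, i.e. the smallest and largest
        abscissas and ordinates among all real joint point positions."""
        xs = [x for x, _ in joint_points]
        ys = [y for _, y in joint_points]
        return min(xs), min(ys), max(xs), max(ys)


    # First example from the text: the four most marginal joint points
    # (2, 16), (1, 2), (6, 13), (-1, 9) give vertices (-1, 2), (-1, 16),
    # (6, 2), (6, 16).
    print(joint_circumscribed_rectangle([(2, 16), (1, 2), (6, 13), (-1, 9)]))
    # -> (-1, 2, 6, 16)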
Illustratively, as shown in fig. 18, which illustrates a schematic diagram of an overlapping region according to some embodiments, the overlapping region 800 is the overlap of the region where the hand region detection frame 700 is located and the region where the joint point circumscribed rectangle 600 is located, and Intersection(Area_pre, Area_lbr) is the area of the overlapping region 800.
In the embodiments of the present application, the joint point circumscribed rectangle necessarily lies within the optimal hand region. The larger the overlap between the region of the joint point circumscribed rectangle and the region of the hand prediction region frame, the more accurate the data output by the preset convolutional neural network model, and the smaller the corresponding loss value should be; therefore, the joint point detection penalty term (1-IOLBR) is used as a part of the loss function. When the overlapping area of the two regions is larger, the IOLBR value is larger and the penalty term (1-IOLBR) is smaller.
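For illustration only, the penalty term could be computed as in the sketch below, assuming both rectangles are given as (x_min, y_min, x_max, y_max) with a nonzero joint-rectangle area; this follows the formula above but is not the embodiments' code.

    def rect_area(rect):
        x_min, y_min, x_max, y_max = rect
        return max(0.0, x_max - x_min) * max(0.0, y_max - y_min)


    def intersection_area(rect_a, rect_b):
        x_min = max(rect_a[0], rect_b[0])
        y_min = max(rect_a[1], rect_b[1])
        x_max = min(rect_a[2], rect_b[2])
        y_max = min(rect_a[3], rect_b[3])
        return rect_area((x_min, y_min, x_max, y_max))


    def joint_point_penalty(pred_box, joint_rect):
        """Return 1 - IOLBR; it is 0 when the joint point circumscribed
        rectangle lies entirely inside the hand prediction region frame."""
        iolbr = intersection_area(pred_box, joint_rect) / rect_area(joint_rect)
        return 1.0 - iolbr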
In some embodiments, the loss function is formulated as:
loss_total = loss_box + loss_class + alpha * loss_landmark + beta * (1 - IOLBR);
wherein loss_total is the total loss function, loss_box is the detection box regression loss function, loss_class is the detection classification loss function, loss_landmark is the hand joint point detection regression loss function, (1 - IOLBR) is the hand joint point detection penalty term, alpha is the weight of the hand joint point detection regression loss function, and beta is the weight of the hand joint point detection penalty term.
In the embodiment of the application, in order to adapt to different scenarios, corresponding weights are set for both the hand joint point detection regression loss function and the hand joint point detection penalty term in the loss function, that is, alpha is the weight of the hand joint point detection regression loss function and beta is the weight of the hand joint point detection penalty term. When the preset convolutional neural network model is required to output more accurate hand joint point predicted positions, the weight alpha of the hand joint point detection regression loss function can be increased; when the preset convolutional neural network model is required to output a more accurate hand prediction region box, the weight beta of the hand joint point detection penalty term can be increased.
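As a minimal sketch of how the weighted total loss might be assembled during training, the snippet below combines the four terms; the numeric loss values and the alpha/beta settings in the example are illustrative only, and the individual terms are assumed to be computed elsewhere by standard box regression, classification and landmark regression losses.

```python
def total_loss(loss_box, loss_class, loss_landmark, iolbr_value, alpha=1.0, beta=1.0):
    """loss_total = loss_box + loss_class + alpha * loss_landmark + beta * (1 - IOLBR).
    Increase alpha to favor accurate joint point positions, beta to favor an
    accurate hand prediction region box; the defaults here are placeholders."""
    return loss_box + loss_class + alpha * loss_landmark + beta * (1.0 - iolbr_value)

# Example: emphasize the penalty term when box accuracy matters more.
loss = total_loss(loss_box=0.8, loss_class=0.3, loss_landmark=0.5,
                  iolbr_value=0.9, alpha=0.5, beta=2.0)  # 1.55
```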
In some embodiments, compared with the related art, the hand region is located more accurately after using the method for positioning a hand region described above. As shown in fig. 19, which schematically shows a hand region according to some embodiments, the hand region 900 in fig. 19 is determined by using the method for positioning a hand region in the embodiments of the present application. As shown in fig. 20, which illustrates another hand region schematic according to some embodiments, the hand region 1000 in fig. 20 is determined without using this method. Fig. 21 overlays fig. 19 and fig. 20; the comparison shows that the hand region obtained by using the method for positioning a hand region in the embodiments of the present application fits the palm more closely and the positioning effect is more accurate.
An embodiment of the present application further provides a display device, including:
a display for displaying a user interface;
a user interface for receiving an input signal;
a controller respectively coupled to the display and the user interface for performing:
when an instruction for positioning a hand region is received, determining the hand region from the image captured by the image acquisition device by using a preset convolutional neural network model, wherein the hand region is a rectangular region formed by the outer edge of the hand, and the preset convolutional neural network model is obtained by optimizing a loss function determined by hand joint points.
In an embodiment of the present application, there is provided a storage medium having stored thereon computer instructions that, when executed by a processor, cause a computer device to perform:
when an instruction for positioning a hand region is received, determining the hand region from the image captured by the image acquisition device by using a preset convolutional neural network model, wherein the hand region is a rectangular region formed by the outer edge of the hand, and the preset convolutional neural network model is obtained by optimizing a loss function determined by hand joint points.
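As a rough illustration of the controller and storage-medium behavior described above, the sketch below shows the inference step; OpenCV is used only as an example capture API, and preset_model is a hypothetical callable whose interface (returning a classification, joint point positions and a hand region box) is an assumption, not something the patent prescribes.

```python
import cv2  # used here only as an illustrative image acquisition API

def locate_hand_region(preset_model, camera_index=0):
    """Capture one frame and run the preset convolutional neural network
    model to obtain the hand region (rectangle around the outer edge of the hand)."""
    cap = cv2.VideoCapture(camera_index)
    ok, frame = cap.read()
    cap.release()
    if not ok:
        return None
    _, _, hand_region_box = preset_model(frame)  # e.g. (x_min, y_min, x_max, y_max)
    return hand_region_box
```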
In the above embodiments, a method for positioning a hand region and a display device are provided. The preset convolutional neural network model used in the method is obtained by optimizing a loss function determined by hand joint points, and joint point information is added to assist in positioning the hand region, so that the accuracy with which the convolutional neural network model outputs the hand region can be improved. The method comprises the following steps: when an instruction for positioning a hand region is received, determining the hand region from the image captured by the image acquisition device by using a preset convolutional neural network model, wherein the hand region is a rectangular region formed by the outer edge of the hand, and the preset convolutional neural network model is obtained by optimizing a loss function determined by hand joint points.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.
The foregoing description, for purposes of explanation, has been presented in conjunction with specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the embodiments to the precise forms disclosed above. Many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles and the practical application, to thereby enable others skilled in the art to best utilize the embodiments and various embodiments with various modifications as are suited to the particular use contemplated.

Claims (10)

1. A display device, comprising:
a display for displaying a user interface;
a user interface for receiving an input signal;
a controller respectively coupled to the display and the user interface for performing:
when an instruction for positioning a hand region is received, determining the hand region from the image captured by the image acquisition device by using a preset convolutional neural network model, wherein the hand region is a rectangular region formed by the outer edge of the hand, and the preset convolutional neural network model is obtained by optimizing a loss function determined by hand joint points.
2. The display device of claim 1, wherein the controller is configured to obtain the preset convolutional neural network model by optimizing the loss function determined by hand joint points according to the following steps:
acquiring a training set, wherein the training set comprises a plurality of sample images and real information corresponding to the sample images, and the real information comprises a real hand classification, a real hand joint point position and a real hand region frame;
preprocessing each sample image; sequentially inputting all the preprocessed sample images into an initial convolutional neural network model, and outputting corresponding prediction information, wherein the prediction information comprises hand prediction classification, hand joint point prediction positions and hand prediction region frames;
inputting the real information and the prediction information into a loss function to obtain a loss value, wherein the loss function comprises a detection box regression loss function, a detection classification loss function, a hand joint point detection regression loss function and a hand joint point detection penalty term, the hand joint point detection penalty term is determined according to a hand prediction region frame and a joint point circumscribed rectangle, and the joint point circumscribed rectangle is the maximum rectangle enclosed by the real positions of the hand joint points;
and adjusting the initial convolutional neural network model by using the loss value, and determining the adjusted initial convolutional neural network model as a preset convolutional neural network model.
3. The display device of claim 2, wherein the controller is configured to determine the joint point detection penalty term (1-IOLBR) from the hand prediction region box and the joint point circumscribed rectangle according to the following formula:
IOLBR = Intersection(Area_pre, Area_lbr) / Area_lbr
wherein Area_lbr is the area of the region where the joint point circumscribed rectangle is located, Area_pre is the area of the hand prediction region box, and Intersection(Area_pre, Area_lbr) is the overlapping area of the region where the hand prediction region box is located and the region where the joint point circumscribed rectangle is located.
4. The display device of claim 2, wherein the controller is configured to perform the pre-processing of each sample image according to the following steps:
subtracting the revision value from the first RGB values of all the pixel points in the sample image to obtain a second RGB value;
and taking the sample image of the second RGB value as a preprocessed sample image.
5. The display device of claim 2, wherein the loss function is formulated as:
loss_total = loss_box + loss_class + alpha * loss_landmark + beta * (1 - IOLBR);
wherein loss_total is the total loss function, loss_box is the detection box regression loss function, loss_class is the detection classification loss function, loss_landmark is the hand joint point detection regression loss function, (1 - IOLBR) is the hand joint point detection penalty term, alpha is the weight of the hand joint point detection regression loss function, and beta is the weight of the hand joint point detection penalty term.
6. A method of locating a hand region, comprising:
when an instruction for positioning a hand region is received, determining the hand region from the image captured by the image acquisition device by using a preset convolutional neural network model, wherein the hand region is a rectangular region formed by the outer edge of the hand, and the preset convolutional neural network model is obtained by optimizing a loss function determined by hand joint points.
7. The method of claim 6, wherein the step of obtaining the preset convolutional neural network model by optimizing the loss function determined by hand joint points comprises:
acquiring a training set, wherein the training set comprises a plurality of sample images and real information corresponding to the sample images, and the real information comprises a real hand classification, a real hand joint point position and a real hand area frame;
preprocessing each sample image; sequentially inputting all the preprocessed sample images into an initial convolutional neural network model, and outputting corresponding prediction information, wherein the prediction information comprises hand prediction classification, hand joint point prediction positions and hand prediction region frames;
inputting the real information and the prediction information into a loss function to obtain a loss value, wherein the loss function comprises a detection box regression loss function, a detection classification loss function, a hand joint point detection regression loss function and a hand joint point detection penalty term, the hand joint point detection penalty term is determined according to a hand prediction region frame and a joint point circumscribed rectangle, and the joint point circumscribed rectangle is the maximum rectangle enclosed by the real positions of the hand joint points;
and adjusting the initial convolutional neural network model by using the loss value, and determining the adjusted initial convolutional neural network model as a preset convolutional neural network model.
8. The method of claim 7, wherein the hand joint point detection penalty term (1-IOLBR) is determined according to the hand prediction region box and the joint point circumscribed rectangle by the following formula:
IOLBR = Intersection(Area_pre, Area_lbr) / Area_lbr
wherein Area_lbr is the area of the region where the joint point circumscribed rectangle is located, Area_pre is the area of the hand prediction region box, and Intersection(Area_pre, Area_lbr) is the overlapping area of the region where the hand prediction region box is located and the region where the joint point circumscribed rectangle is located.
9. The method of claim 7, wherein the step of preprocessing the sample image comprises:
subtracting the revision value from the first RGB values of all the pixel points in the sample image to obtain a second RGB value;
and taking the sample image of the second RGB value as a preprocessed sample image.
10. The method of claim 7, wherein the loss function is formulated as:
loss_total = loss_box + loss_class + alpha * loss_landmark + beta * (1 - IOLBR);
wherein loss_total is the total loss function, loss_box is the detection box regression loss function, loss_class is the detection classification loss function, loss_landmark is the hand joint point detection regression loss function, (1 - IOLBR) is the hand joint point detection penalty term, alpha is the weight of the hand joint point detection regression loss function, and beta is the weight of the hand joint point detection penalty term.
CN202210177270.4A 2022-02-25 2022-02-25 Method for positioning hand area and display device Pending CN114627494A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210177270.4A CN114627494A (en) 2022-02-25 2022-02-25 Method for positioning hand area and display device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210177270.4A CN114627494A (en) 2022-02-25 2022-02-25 Method for positioning hand area and display device

Publications (1)

Publication Number Publication Date
CN114627494A true CN114627494A (en) 2022-06-14

Family

ID=81900750

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210177270.4A Pending CN114627494A (en) 2022-02-25 2022-02-25 Method for positioning hand area and display device

Country Status (1)

Country Link
CN (1) CN114627494A (en)

Similar Documents

Publication Publication Date Title
US10038839B2 (en) Assisted text input for computing devices
US20180020158A1 (en) Camera Augmented Reality Based Activity History Tracking
US20130329023A1 (en) Text recognition driven functionality
US20140282269A1 (en) Non-occluded display for hover interactions
EP2843625A1 (en) Method for synthesizing images and electronic device thereof
CN110209273A (en) Gesture identification method, interaction control method, device, medium and electronic equipment
CN112866773B (en) Display equipment and camera tracking method in multi-person scene
CN114237419B (en) Display device and touch event identification method
CN113014846B (en) Video acquisition control method, electronic equipment and computer readable storage medium
JP2012238293A (en) Input device
CN113645494A (en) Screen fusion method, display device, terminal device and server
CN115129214A (en) Display device and color filling method
CN113825002A (en) Display device and focus control method
CN111050081B (en) Shooting method and electronic equipment
US11106278B2 (en) Operation method for multi-monitor and electronic system using the same
US11756302B1 (en) Managing presentation of subject-based segmented video feed on a receiving device
CN114627494A (en) Method for positioning hand area and display device
CN114760513A (en) Display device and cursor positioning method
CN116801027A (en) Display device and screen projection method
CN113158757B (en) Display device and gesture control method
CN113076031B (en) Display equipment, touch positioning method and device
CN112926420B (en) Display device and menu character recognition method
US20220345621A1 (en) Scene lock mode for capturing camera images
CN116934841A (en) Positioning method of hand area and display device
CN116935432A (en) Gesture recognition method and display device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination