CN116935431A - Determination method of centroid of human hand region and display device

Info

Publication number
CN116935431A
CN116935431A
Authority
CN
China
Prior art keywords
human hand
image
depth
coordinates
depth image
Prior art date
Legal status
Pending
Application number
CN202210319913.4A
Other languages
Chinese (zh)
Inventor
袁毅
黄志明
Current Assignee
Hisense Electronic Technology Shenzhen Co ltd
Original Assignee
Hisense Electronic Technology Shenzhen Co ltd
Priority date
Filing date
Publication date
Application filed by Hisense Electronic Technology Shenzhen Co ltd
Priority to CN202210319913.4A
Publication of CN116935431A
Legal status: Pending

Classifications

    • G06T 7/66 Analysis of geometric attributes of image moments or centre of gravity
    • G06F 3/017 Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G06T 7/11 Region-based segmentation
    • G06T 7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G06T 2207/10012 Stereo images
    • G06T 2207/10024 Color image

Abstract

The application discloses a method for determining the centroid of a human hand region and a display device. Coordinates of a target point in the depth image corresponding to the center point of the human hand detection frame are calculated based on the coordinates of that center point in the RGB image. The coordinates of the centroid of the human hand region in the depth image are then calculated based on the coordinates of the target point and the depth value of the human hand region where the target point is located. The method and the display device work on non-aligned RGBD images: no complex computation is needed to align the RGB image with the depth image. Instead, an epipolar constraint is used, which greatly simplifies the computation of the centroid of the human hand region in the depth image and narrows the search range from the RGB image to the corresponding feature point in the depth image. The centroid of the human hand region can therefore be determined in a timely manner, real-time interaction of the 3D gesture algorithm on an embedded terminal is achieved, and the immersive interaction experience of the user is improved.

Description

Determination method of centroid of human hand region and display device
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a method for determining a centroid of a human hand region and a display device.
Background
With the rapid development of artificial intelligence technology, 3D (three-dimensional) gesture recognition has become an important branch of human-computer interaction and an important technical field for realizing virtual technologies. When 3D gesture interaction is implemented, the coordinate position of the centroid of the human hand region is usually determined first, and human-computer interaction based on 3D gestures is then realized based on that coordinate position.
At present, when determining the coordinate position of the centroid of the human hand region, an RGB image and a depth image are generally collected by an RGBD camera. The RGB image and the depth image are first aligned to generate an RGBD image, then the detection result of the human hand detection frame in the aligned RGB image is mapped onto the depth image so as to obtain the human hand region in the depth image, and the position of the centroid of the human hand region in the depth image is then calculated.
However, aligning the RGB image with the depth image requires a large amount of computation and the running of many algorithms. If the computing power at the end where the 3D gesture function is deployed is limited, the real-time requirement of RGBD image alignment may not be met, so the centroid of the human hand region cannot be determined in real time, which affects the 3D gesture interaction experience.
Disclosure of Invention
The application provides a method for determining the centroid of a human hand region and a display device, which are used to solve the problem that the centroid of the human hand region cannot be determined in real time.
In a first aspect, the present application provides a display device, comprising: a display configured to display a user interface; an image collector configured to collect images; and a controller coupled to the display and the image collector, the controller being configured to:
when a user interacts based on gestures, acquire an RGB image and a depth image of the gesture collected by the image collector;
extract the center point of the human hand detection frame in the RGB image and determine the coordinates of the center point of the human hand detection frame in the RGB image, wherein the center point of the human hand detection frame is the center point of the collection frame enclosing the collected gesture;
calculate coordinates of a target point in the depth image based on the coordinates of the center point of the human hand detection frame, the target point in the depth image corresponding to the center point of the human hand detection frame in the RGB image and being used to characterize the human hand region in the depth image;
acquire depth information of the depth image, determine, based on the depth information, the depth value of the human hand region where the target point is located in the depth image, and calculate the coordinates of the centroid of the human hand region in the depth image based on the coordinates of the target point and the depth value of the human hand region where the target point is located.
In some embodiments of the present application, when executing the calculation of the coordinates of the target point in the depth image based on the coordinates of the center point of the human hand detection frame, the controller is further configured to:
acquire the pixel coordinates of the center point of the human hand detection frame in the two-dimensional coordinate system of the RGB image;
determine, in the depth image, the epipolar line generated when the RGB image is projected onto the depth image based on the pixel coordinates of the center point of the human hand detection frame, wherein the epipolar line in the depth image comprises a plurality of pixel points corresponding to the center point of the human hand detection frame in the RGB image;
and select, from the plurality of pixel points included on the epipolar line, the target pixel point with the shortest distance to the center point of the human hand detection frame as the target point in the depth image, and take the pixel coordinates of this target pixel point as the coordinates of the target point in the depth image.
In some embodiments of the application, when executing the determination, in the depth image, of the epipolar line generated when the RGB image is projected onto the depth image based on the pixel coordinates of the center point of the human hand detection frame, the controller is further configured to:
acquire the matrix parameters of the image collector used when collecting the images of the gesture, and calculate the epipolar line equation coefficients based on the pixel coordinates of the center point of the human hand detection frame and these matrix parameters;
and establish an epipolar line equation in the depth image based on the epipolar line equation coefficients, and solve the epipolar line equation, whereby the epipolar line generated when the RGB image is projected onto the depth image is determined in the depth image.
In some embodiments of the present application, when executing the determination, based on the depth information, of the depth value of the human hand region where the target point is located in the depth image, the controller is further configured to:
obtain the detection frame where the target point is located, and enlarge the detection frame where the target point is located in the image plane by a preset multiple to obtain the human hand region where the target point is located in the depth image, wherein the human hand region comprises a plurality of pixel points;
determine the depth value of each pixel point in the human hand region based on the depth information of the depth image;
and calculate a depth average value based on the depth value of each pixel point in the human hand region, and take the depth average value as the depth value of the human hand region where the target point is located in the depth image.
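By way of a non-limiting sketch of this embodiment (written in Python with NumPy; the function name, the bounding-box format and the enlargement factor are assumptions introduced for illustration, not values specified by the application), the enlargement of the detection frame and the averaging of depth values could look like:

```python
import numpy as np

def hand_region_depth(depth_img, box, scale=1.5):
    """Enlarge the detection frame by a preset multiple and return the mean
    depth of the pixels inside the enlarged region of the depth image."""
    h, w = depth_img.shape
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half_w, half_h = (x2 - x1) * scale / 2.0, (y2 - y1) * scale / 2.0
    # Clamp the enlarged frame to the image boundaries
    u1, v1 = max(0, int(cx - half_w)), max(0, int(cy - half_h))
    u2, v2 = min(w, int(cx + half_w)), min(h, int(cy + half_h))
    region = depth_img[v1:v2, u1:u2]
    valid = region[region > 0]          # ignore pixels without a valid depth measurement
    return float(valid.mean()) if valid.size else 0.0
```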
In some embodiments of the present application, when executing the calculation of the coordinates of the centroid of the human hand region in the depth image based on the coordinates of the target point and the depth value of the human hand region where the target point is located, the controller is further configured to:
calculate the centroid image coordinates in the depth image based on the coordinates of the target point in the depth image and the depth value of the human hand region, wherein the centroid image coordinates refer to the pixel coordinates of the centroid of the human hand region;
and convert the centroid image coordinates into the position coordinates of the centroid of the human hand region in the three-dimensional coordinate system of the depth image.
In a second aspect, the present application also provides a method for determining the centroid of a human hand region, the method comprising:
when a user interacts based on gestures, acquiring an RGB image and a depth image of the gesture collected by an image collector;
extracting the center point of the human hand detection frame in the RGB image, and determining the coordinates of the center point of the human hand detection frame in the RGB image, wherein the center point of the human hand detection frame is the center point of the collection frame enclosing the collected gesture;
calculating coordinates of a target point in the depth image based on the coordinates of the center point of the human hand detection frame, the target point in the depth image corresponding to the center point of the human hand detection frame in the RGB image and being used to characterize the human hand region in the depth image;
acquiring depth information of the depth image, determining, based on the depth information, the depth value of the human hand region where the target point is located in the depth image, and calculating the coordinates of the centroid of the human hand region in the depth image based on the coordinates of the target point and the depth value of the human hand region where the target point is located.
In some embodiments of the present application, the calculating of the coordinates of the target point in the depth image based on the coordinates of the center point of the human hand detection frame includes:
acquiring the pixel coordinates of the center point of the human hand detection frame in the two-dimensional coordinate system of the RGB image;
determining, in the depth image, the epipolar line generated when the RGB image is projected onto the depth image based on the pixel coordinates of the center point of the human hand detection frame, wherein the epipolar line in the depth image comprises a plurality of pixel points corresponding to the center point of the human hand detection frame in the RGB image;
and selecting, from the plurality of pixel points included on the epipolar line, the target pixel point with the shortest distance to the center point of the human hand detection frame as the target point in the depth image, and taking the pixel coordinates of this target pixel point as the coordinates of the target point in the depth image.
In some embodiments of the present application, the determining, in the depth image, of the epipolar line generated when the RGB image is projected onto the depth image based on the pixel coordinates of the center point of the human hand detection frame includes:
acquiring the matrix parameters of the image collector used when collecting the images of the gesture, and calculating the epipolar line equation coefficients based on the pixel coordinates of the center point of the human hand detection frame and these matrix parameters;
and establishing an epipolar line equation in the depth image based on the epipolar line equation coefficients, and solving the epipolar line equation, whereby the epipolar line generated when the RGB image is projected onto the depth image is determined in the depth image.
In some embodiments of the present application, the determining, based on the depth information, of the depth value of the human hand region where the target point is located in the depth image includes:
obtaining the detection frame where the target point is located, and enlarging the detection frame where the target point is located in the image plane by a preset multiple to obtain the human hand region where the target point is located in the depth image, wherein the human hand region comprises a plurality of pixel points;
determining the depth value of each pixel point in the human hand region based on the depth information of the depth image;
and calculating a depth average value based on the depth value of each pixel point in the human hand region, and taking the depth average value as the depth value of the human hand region where the target point is located in the depth image.
In some embodiments of the present application, the calculating of the coordinates of the centroid of the human hand region in the depth image based on the coordinates of the target point and the depth value of the human hand region where the target point is located includes:
calculating the centroid image coordinates in the depth image based on the coordinates of the target point in the depth image and the depth value of the human hand region, wherein the centroid image coordinates refer to the pixel coordinates of the centroid of the human hand region;
and converting the centroid image coordinates into the position coordinates of the centroid of the human hand region in the three-dimensional coordinate system of the depth image.
In a third aspect, the present application also provides a computer readable storage medium, which may store a program that, when executed, may implement some or all of the steps in embodiments of a method for determining a centroid of a human hand region provided by the present application.
According to the method and the display device for determining the centroid of the human hand region, when the user performs 3D gesture interaction, the image collector collects an RGB image and a depth image. The coordinates of the target point in the depth image corresponding to the center point of the human hand detection frame in the RGB image are calculated based on the coordinates of that center point. The depth value of the human hand region where the target point is located is determined from the depth information of the depth image, and the coordinates of the centroid of the human hand region in the depth image are then calculated based on the coordinates of the target point and that depth value. The method and the display device work on non-aligned RGBD images: no complex computation is needed to align the RGB image with the depth image. Instead, an epipolar constraint is used, which greatly simplifies the computation of the centroid of the human hand region in the depth image and narrows the search range from the RGB image to the corresponding feature point in the depth image. The centroid of the human hand region can therefore be determined in a timely manner, real-time interaction of the 3D gesture algorithm on an embedded terminal is achieved, and the immersive interaction experience of the user is improved.
Drawings
In order to more clearly illustrate the technical solution of the present application, the drawings that are needed in the embodiments will be briefly described below, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.
FIG. 1 illustrates a schematic diagram of an operational scenario between a display device and a control apparatus according to some embodiments;
fig. 2 shows a hardware configuration block diagram of the control apparatus 100 according to some embodiments;
fig. 3 illustrates a hardware configuration block diagram of a display device 200 according to some embodiments;
FIG. 4 illustrates a software configuration diagram in a display device 200 according to some embodiments;
FIG. 5 shows a schematic diagram of the human hand region deviation that occurs when the original alignment-based scheme is applied to a non-aligned RGBD image, according to some embodiments;
FIG. 6 illustrates a schematic diagram of abnormal gesture results due to a deviation of a human hand region, according to some embodiments;
FIG. 7 illustrates a flow chart of a method of determining a centroid of a human hand region according to some embodiments;
FIG. 8 illustrates a data flow diagram of a method of determining a centroid of a human hand region in accordance with some embodiments;
FIG. 9 illustrates a schematic diagram of epipolar geometry constraints according to some embodiments;
FIG. 10 illustrates a schematic diagram of a human hand detection frame center point in an RGB image according to some embodiments;
FIG. 11 illustrates a schematic view of epipolar lines in a depth image according to some embodiments;
FIG. 12 illustrates a schematic diagram of determining a target point on the epipolar line of a depth image according to some embodiments;
FIG. 13 illustrates a schematic view of a human hand region in a depth image according to some embodiments;
FIG. 14 illustrates a schematic diagram of centroids in a depth image, according to some embodiments;
fig. 15 illustrates a schematic diagram of a hand gesture according to some embodiments.
Detailed Description
For the purposes of making the objects and embodiments of the present application clearer, exemplary embodiments of the present application will be described in detail below with reference to the accompanying drawings, in which exemplary embodiments of the present application are illustrated. It is apparent that the exemplary embodiments described are only some, but not all, of the embodiments of the present application.
It should be noted that the brief description of terminology in the present application is only intended to facilitate understanding of the embodiments described below and is not intended to limit the embodiments of the present application. Unless otherwise indicated, these terms should be construed in their ordinary and customary meanings.
The terms "first", "second", "third" and the like in the description, in the claims and in the above-described figures are used for distinguishing between similar objects or entities and not necessarily for describing a particular sequential or chronological order, unless otherwise indicated. It is to be understood that the terms so used are interchangeable under appropriate circumstances.
The terms "comprises", "comprising" and "having", and any variations thereof, are intended to cover a non-exclusive inclusion, such that a product or apparatus that comprises a list of elements is not necessarily limited to all elements explicitly listed, but may include other elements not expressly listed or inherent to such product or apparatus. The term "module" refers to any known or later developed hardware, software, firmware, artificial intelligence, fuzzy logic, or combination of hardware and/or software code that is capable of performing the function associated with that element.
The display device provided by the embodiment of the application can have various implementation forms, for example, a television, an intelligent television, a computer, a laser projection device, a display (monitor), an electronic whiteboard (electronic bulletin board), an electronic desktop (electronic table) and the like. Fig. 1 and 2 are specific embodiments of a display device of the present application.
Fig. 1 illustrates a schematic diagram of an operational scenario between a display device and a control apparatus according to some embodiments. As shown in fig. 1, a user may operate the display device 200 through the smart device 300 or the control apparatus 100.
In some embodiments, the control apparatus 100 may be a remote controller. Communication between the remote controller and the display device includes infrared protocol communication, Bluetooth protocol communication and other short-range communication modes, and the display device 200 is controlled in a wireless or other wired mode. The user may control the display device 200 by inputting user instructions through keys on the remote control, voice input, control panel input, and the like.
In some embodiments, a smart device 300 (e.g., mobile terminal, tablet, computer, notebook, etc.) may also be used to control the display device 200. For example, the display device 200 is controlled using an application running on a smart device.
In some embodiments, the display device may also be controlled without using the smart device or control apparatus described above; instead, it may receive the user's control directly through touch, gestures, or the like.
In some embodiments, the display device 200 may also be controlled in a manner other than by the control apparatus 100 and the smart device 300. For example, the user's voice command may be received directly through a module for acquiring voice commands configured inside the display device 200, or through a voice control device configured outside the display device 200.
In some embodiments, the display device 200 is also in data communication with a server 400. The display device 200 may be allowed to make communication connections via a local area network (LAN), a wireless local area network (WLAN), and other networks. The server 400 may provide various contents and interactions to the display device 200. The server 400 may be one cluster or multiple clusters, and may include one or more types of servers.
Fig. 2 shows a hardware configuration block diagram of the control apparatus 100 according to some embodiments. As shown in fig. 2, the control device 100 includes a controller 110, a communication interface 130, a user input/output interface 140, a memory, and a power supply. The control apparatus 100 may receive an input operation instruction of a user, and convert the operation instruction into an instruction recognizable and responsive to the display device 200, and may perform an interaction between the user and the display device 200.
Fig. 3 illustrates a hardware configuration block diagram of a display device 200 according to some embodiments. As shown in fig. 3, the display apparatus 200 includes at least one of a modem 210, a communicator 220, a detector 230, an external device interface 240, a controller 250, a display 260, an audio output interface 270, a memory, a power supply, and a user interface.
In some embodiments the controller includes a processor, a video processor, an audio processor, a graphics processor, RAM, ROM, a first interface for input/output to an nth interface.
The display 260 includes a display screen component for presenting pictures and a driving component for driving image display, and is configured to receive image signals output from the controller and display video content, image content, menu manipulation interfaces, and user manipulation UI interfaces.
The display 260 may be a liquid crystal display, an OLED display, a projection device, or a projection screen.
The communicator 220 is a component for communicating with external devices or servers according to various communication protocol types. For example: the communicator may include at least one of a Wifi module, a bluetooth module, a wired ethernet module, or other network communication protocol chip or a near field communication protocol chip, and an infrared receiver. The display device 200 may establish transmission and reception of control signals and data signals with the external control device 100 or the server 400 through the communicator 220.
A user interface, which may be used to receive control signals from the control device 100 (e.g., an infrared remote control, etc.).
The detector 230 is used to collect signals of the external environment or interaction with the outside. For example, detector 230 includes a light receiver, a sensor for capturing the intensity of ambient light; alternatively, the detector 230 includes an image collector such as a camera, which may be used to collect external environmental scenes, user attributes, or user interaction gestures, or alternatively, the detector 230 includes a sound collector such as a microphone, or the like, which is used to receive external sounds.
The external device interface 240 may include, but is not limited to, the following: high Definition Multimedia Interface (HDMI), analog or data high definition component input interface (component), composite video input interface (CVBS), USB input interface (USB), RGB port, or the like. The input/output interface may be a composite input/output interface formed by a plurality of interfaces.
The modem 210 receives broadcast television signals in a wired or wireless manner and demodulates audio/video signals, as well as EPG data signals, from a plurality of wireless or wired broadcast television signals.
In some embodiments, the controller 250 and the modem 210 may be located in separate devices, i.e., the modem 210 may also be located in an external device to the main device in which the controller 250 is located, such as an external set-top box or the like.
The controller 250 controls the operation of the display device and responds to the user's operations through various software control programs stored on the memory. The controller 250 controls the overall operation of the display apparatus 200. For example: in response to receiving a user command to select a UI object to be displayed on the display 260, the controller 250 may perform an operation related to the object selected by the user command.
In some embodiments the controller includes at least one of a central processing unit (Central Processing Unit, CPU), video processor, audio processor, graphics processor (Graphics Processing Unit, GPU), RAM (Random Access Memory, RAM), ROM (Read-Only Memory, ROM), first to nth interfaces for input/output, a communication Bus (Bus), etc.
The user may input a user command through a Graphical User Interface (GUI) displayed on the display 260, and the user input interface receives the user input command through the Graphical User Interface (GUI). Alternatively, the user may input the user command by inputting a specific sound or gesture, and the user input interface recognizes the sound or gesture through the sensor to receive the user input command.
A "user interface" is a media interface for interaction and exchange of information between an application or operating system and a user, which enables conversion between an internal form of information and a user-acceptable form. A commonly used presentation form of the user interface is a graphical user interface (Graphic User Interface, GUI), which refers to a user interface related to computer operations that is displayed in a graphical manner. It may be an interface element such as an icon, a window, a control, etc. displayed in a display screen of the electronic device, where the control may include a visual interface element such as an icon, a button, a menu, a tab, a text box, a dialog box, a status bar, a navigation bar, a Widget, etc.
Fig. 4 illustrates a software configuration diagram in a display device 200 according to some embodiments. Referring to FIG. 4, in some embodiments, the system is divided into four layers, which are, from top to bottom, an application layer (referred to as the "application layer"), an application framework layer (referred to as the "framework layer"), an Android runtime and system library layer (referred to as the "system runtime layer"), and a kernel layer.
In some embodiments, at least one application program is running in the application program layer, and these application programs may be a Window (Window) program of an operating system, a system setting program, a clock program, or the like; or may be an application developed by a third party developer. In particular implementations, the application packages in the application layer are not limited to the above examples.
The framework layer provides an application programming interface (API) and a programming framework for the applications. The application framework layer includes a number of predefined functions and acts as a processing center that decides how the applications in the application layer act. Through the API interface, an application can access the resources in the system and obtain system services during execution.
As shown in fig. 4, the application framework layer in some embodiments of the present application includes a manager (manager), a Content Provider (Content Provider), and the like, wherein the manager includes at least one of the following modules: an Activity Manager (Activity Manager) is used to interact with all activities that are running in the system; a Location Manager (Location Manager) for providing system services or applications with access to system Location services; a Package Manager (Package Manager) for retrieving various information about an application Package currently installed on the device; a notification manager (Notification Manager) for controlling the display and clearing of notification messages; a Window Manager (Window Manager) is used to manage icons, windows, toolbars, wallpaper, desktop components, etc. on the user interface.
In some embodiments, the activity manager is used to manage the lifecycle of the individual applications as well as the usual navigation rollback functions, such as controlling the exit, opening, fallback, etc. of the applications. The window manager is used for managing all window programs, such as obtaining the size of the display screen, judging whether a status bar exists or not, locking the screen, intercepting the screen, controlling the change of the display window (for example, reducing the display window to display, dithering display, distorting display, etc.), etc.
In some embodiments, the system runtime layer provides support for the upper layer, the framework layer, and when the framework layer is in use, the android operating system runs the C/C++ libraries contained in the system runtime layer to implement the functions to be implemented by the framework layer.
In some embodiments, the kernel layer is a layer between hardware and software. As shown in fig. 4, the kernel layer contains at least one of the following drivers: audio drive, display drive, bluetooth drive, camera drive, WIFI drive, USB drive, HDMI drive, sensor drive (e.g., fingerprint sensor, temperature sensor, pressure sensor, etc.), and power supply drive, etc.
In some embodiments, a 3D gesture recognition function may be configured in the display device, so that the user can obtain a 3D gesture interaction experience while using the display device. The display device may be a VR/AR device or another device that can implement the 3D gesture recognition function. 3D gesture interaction technology is one of the most important interaction modes of VR/AR devices: it enables virtual interaction without relying on a handle and directly affects the user experience of the VR/AR device. Like many terminal-side algorithms, this technology must also run in real time, otherwise it is difficult to achieve immersive interaction.
3D gesture recognition technology can be applied to games and plays an important role in emerging technical fields such as intelligent control, for example controlling smart-home devices through gestures. 3D gesture recognition approaches include gesture recognition using motion sensors or data gloves, estimating 3D hand pose information from a single color image (RGB image) combined with a deep convolutional neural network, estimating 3D hand pose information from depth data combined with a deep convolutional neural network, and the like. An RGB image is an image with color; the images captured by a conventional camera (e.g., an RGB camera) are color images. The depth data may be determined from a Depth image captured by a depth camera. The depth data (hereinafter referred to as the depth value) refers to the distance between the subject photographed by the camera and the camera.
In some embodiments, mainstream schemes for implementing 3D gesture interaction include RGBD camera schemes and multi-camera schemes; RGBD cameras are widely studied due to their higher precision. RGBD camera schemes typically require the RGB image to be aligned with the Depth image so that the detection result of the RGB image can be mapped onto the Depth image, the human hand region in the Depth image can be obtained, and the centroid of the human hand region can then be calculated. The centroid of the human hand region refers to the center point position of the human hand region in the depth image.
The RGBD camera implementing 3D gesture recognition may include an RGB camera for capturing RGB images and a Depth camera for capturing Depth images (i.e., depth images). The two cameras may have different fields of view, so in the two images they acquire, the coordinates of the same captured object may deviate; that is, the same coordinates in the RGB image and the depth image may not correspond to the same object. For example, if the coordinates of object A in the RGB image are (100, 100), the coordinates of object A in the depth image may be (200, 200), while the coordinates (100, 100) in the depth image correspond to object B. Therefore, in order to form an RGBD image, the deviation between the RGB camera and the depth camera needs to be eliminated and the RGB image and the depth image aligned, so that the same coordinates in the two images represent the same object; the position of the centroid of the human hand region can then be determined, and accurate 3D gesture interaction can be realized based on that position.
It can be seen that, in the 3D gesture interaction scheme implemented with an RGBD camera, the coordinates of the centroid of the human hand region are essential input information for 3D hand pose estimation. The existing way of calculating the centroid of the human hand region depends on aligned RGB and Depth images, and the centroid is calculated from the Depth data of the Depth image region corresponding to the human hand detection frame result of the RGB image. Typically, each manufacturer uses a separate DSP chip to implement the alignment of the RGB image and the Depth image, but this inevitably increases the power consumption and cost of the VR/AR terminal device. Moreover, the RGBD image alignment function places a great demand on the computing power of the deployment end; because the computing power of the VR/AR device terminal is limited and the operation of multiple algorithms has to be taken into account, the limited resources cannot meet the real-time requirement of RGBD image alignment, which affects the real-time performance of the overall system.
Therefore, in order to ensure that the centroid of the human hand region can be determined in real time without increasing the power consumption and cost of the device, the embodiments of the present invention provide a display device in which the coordinate position of the centroid of the human hand region is calculated in a timely manner based on a non-aligned RGBD image rather than on an aligned RGBD image.
If the original alignment-based scheme is still used when the coordinates of the centroid of the human hand region are calculated from a non-aligned RGBD image, that is, if the scheme of calculating the centroid of the human hand region from an aligned RGBD image is applied to a non-aligned RGBD image, there is a deviation when determining the human hand region in the Depth image corresponding to the human hand detection frame region of the RGB image.
Fig. 5 shows a schematic diagram of the human hand region deviation that occurs when the original scheme is applied to a non-aligned RGBD image, according to some embodiments. Referring to (a) in fig. 5, in the RGB image, the upper-left corner coordinates of the human hand detection frame are (250, 20) and the lower-right corner coordinates are (410, 222). When the centroid of the human hand region is calculated from the non-aligned RGBD image using the alignment-based scheme, referring to (b) in fig. 5, the positions corresponding to the same coordinates (250, 20) and (410, 222) in the depth image differ from those in the RGB image; that is, the human hand region is deviated.
FIG. 6 illustrates a schematic diagram of an abnormal gesture result caused by the deviation of the human hand region, according to some embodiments. Referring to fig. 6, a deviated Depth image hand region affects the calculation of the centroid and thus the accuracy of hand pose estimation. For example, test results show that the deviation of the hand region severely affects the position estimation of the 21 hand joints, with a large number of estimated joints deviating from the actual joint positions; this seriously affects the subsequent 3D gesture interaction logic design and further degrades the user's 3D gesture interaction experience.
Therefore, in order to accurately solve for the centroid of the human hand region in the Depth image based on a non-aligned RGBD image with little computation, the display device provided by the embodiments of the present invention solves the epipolar line equation according to epipolar constraint theory, which greatly reduces the search range from the RGB image to the corresponding feature point in the Depth image; the corresponding feature point is then selected according to the pixel error obtained by re-projecting Depth camera pixels into the RGB camera. With this way of calculating the centroid of the human hand region in the Depth image from a non-aligned RGBD image, no complex computation is needed to align the RGB image with the Depth image; the epipolar constraint greatly simplifies the computation of the centroid of the human hand region in the Depth image, real-time operation of the 3D gesture algorithm on an embedded terminal is achieved, and the immersive interaction experience of the user is improved.
Therefore, the epipolar constraint theory and epipolar line search method adopted by the display device provided by the embodiments of the present invention do not require image-aligned data, involve only a small amount of computation to calculate the centroid of the human hand region in the Depth image, and are very suitable for deployment on an embedded end.
In some embodiments, the method for determining the centroid of the human hand region executed by the display device provided by the embodiments of the present invention can be applied to 3D gesture virtual interaction scenarios, that is, interaction between the user and VR/AR device applications, for example 3D gesture games, gesture interaction in virtual office meetings, gesture control of virtual industrial equipment assembly, and the like. These application scenarios all require the user to interact with virtual objects, which enriches playability and operability and provides an immersive experience. In addition, the method can be applied to any terminal platform with 3D interaction requirements, for example VR glasses, AR glasses, MR glasses, XR glasses, and the like.
In some embodiments, the input data of an RGBD-camera-based 3D gesture algorithm needs to satisfy the precondition of image alignment, but RGBD image alignment involves a large amount of computation and is usually handled by an additional DSP, which increases the hardware cost of the device. Moreover, the computing power of a VR/AR device terminal is limited and must also cover various algorithms and rendering, so the dependence of the 3D gesture interaction algorithm on computing power should be reduced as much as possible. With the method adopted by the display device provided by the embodiments of the present invention, the RGBD-camera-based 3D gesture scheme does not need a dedicated DSP to perform the alignment of the RGB image and the Depth image, so the hardware cost and power consumption of the VR/AR device are reduced while the effect is maintained, making it possible to deploy 3D gesture technology on more low-power devices. In addition, the embodiments of the present invention quickly project the center point of the human hand detection frame of the RGB image onto the Depth image based on epipolar constraint theory and an epipolar line search method, and realize real-time calculation of the centroid of the human hand region of the Depth image.
FIG. 7 illustrates a flow chart of a method of determining a centroid of a human hand region according to some embodiments; fig. 8 illustrates a data flow diagram of a method of determining a centroid of a human hand region in accordance with some embodiments. The display device provided by the embodiment of the invention comprises: a display configured to display a user interface; an image collector configured to collect an image; a controller coupled to the display and the image collector, the controller being configured to, in performing the method of determining the centroid of the human hand region shown in fig. 7 and 8, perform the steps of:
S1: when a user interacts based on gestures, acquire the RGB image and the depth image of the gesture collected by the image collector.
In order to realize 3D gesture recognition, an image collector is configured in the display device, and the image collector is a camera module. Thus, the image collector includes an RGB collector (i.e., an RGB camera) and a depth collector (i.e., a depth camera). The RGB camera is used to capture RGB images and the Depth camera is used to capture Depth images (i.e., depth images).
When a user interacts with a display device configured with the 3D gesture recognition function, for example when the user interacts with a VR/AR device, the user makes a control gesture toward the VR/AR device. At this time, the image collectors (the RGB collector and the depth collector) configured in the VR/AR device collect an RGB image and a depth image containing the gesture.
S2: extract the center point of the human hand detection frame in the RGB image and determine the coordinates of the center point of the human hand detection frame in the RGB image, where the center point of the human hand detection frame is the center point of the collection frame enclosing the collected gesture.
In some embodiments, the centroid of the human hand region refers to the center point position of the human hand region in the depth image. Therefore, when calculating the centroid of the human hand region from a non-aligned RGBD image, the position of the centroid of the human hand region in the corresponding depth image is searched for based on the center point of the human hand detection frame in the RGB image.
After the VR/AR device acquires the RGB image containing the gesture collected by the image collector, it performs image recognition on the RGB image to recognize the human hand detection frame. The human hand detection frame is the collection frame enclosing the gesture when the image collector collects the gesture. The center point is then extracted from the human hand detection frame, and the coordinates of the center point of the human hand detection frame in the RGB image are determined.
In some embodiments, a two-dimensional coordinate system is established in the RGB image, with the origin of coordinates at the upper-left corner of the RGB image, the positive X-axis pointing from left to right, and the positive Y-axis pointing from top to bottom. This two-dimensional coordinate system is a pixel coordinate system, and the center point of the human hand detection frame is a pixel point in this two-dimensional pixel coordinate system. Therefore, in the pixel coordinate system of the RGB image, the determined coordinates of the center point of the human hand detection frame are pixel coordinates, and this point is recorded as p_rgb = (u_rgb, v_rgb).
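By way of illustration only (the bounding-box format (x1, y1, x2, y2) is an assumption about the hand detector output, not a requirement of the application), the pixel coordinates of the center point of the human hand detection frame can be obtained as follows:

```python
def detection_frame_center(box):
    """Return the pixel coordinates (u, v) of the center of a detection frame
    given as (x1, y1, x2, y2) in the RGB pixel coordinate system
    (origin at the top-left corner, x to the right, y downward)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

# Example with the detection frame coordinates mentioned for FIG. 5
p_rgb = detection_frame_center((250, 20, 410, 222))   # -> (330.0, 121.0)
```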
S3: calculate the coordinates of the target point in the depth image based on the coordinates of the center point of the human hand detection frame, where the target point in the depth image corresponds to the center point of the human hand detection frame in the RGB image and is used to characterize the human hand region in the depth image.
To determine the coordinates of the centroid of the human hand region, the target point p_d in the depth image that corresponds to the center point p_rgb of the human hand detection frame needs to be determined.
Because the embodiments of the present invention calculate the coordinates of the centroid of the human hand region from a non-aligned RGBD image, the RGB image and the depth image are not aligned before the calculation. In order to avoid the deviation between the non-aligned RGB image and depth image, the point corresponding to the center point of the human hand detection frame in the RGB image, namely the target point, is found in the depth image according to the position of that center point.
The center point of the human hand region in the depth image, namely the centroid point, can be determined from the target point. Since the target point in the depth image corresponds to the center point of the human hand detection frame in the RGB image, the same object is represented at corresponding coordinates in the RGB image and the depth image, which in turn ensures the accuracy of the centroid of the human hand region.
For this purpose, epipolar constraint theory can be used to solve, in the depth image, the epipolar line equation corresponding to the point p_rgb in the RGB image. The feature points satisfying this epipolar line equation are all points in the depth image corresponding to the center point p_rgb of the human hand detection frame in the RGB image. Therefore, the target point p_d uniquely corresponding to the center point p_rgb of the human hand detection frame in the RGB image still needs to be searched for within the point set contained on the epipolar line of the depth image.
In some embodiments, when uniquely determining the target point, the controller is further configured to perform the following steps in performing step S3, i.e. performing a calculation of coordinates of the target point in the depth image based on coordinates of a center point of the human hand detection frame:
Step 31: acquire the pixel coordinates of the center point of the human hand detection frame in the two-dimensional coordinate system of the RGB image.
Step 32: determine, in the depth image, the epipolar line generated when the RGB image is projected onto the depth image based on the pixel coordinates of the center point of the human hand detection frame, where the epipolar line in the depth image comprises a plurality of pixel points corresponding to the center point of the human hand detection frame in the RGB image.
Step 33: select, from the plurality of pixel points included on the epipolar line, the target pixel point with the shortest distance to the center point of the human hand detection frame as the target point in the depth image, and take the pixel coordinates of this target pixel point as the coordinates of the target point in the depth image.
The RGB image is a planar image, so the coordinates of the center point of the human hand detection frame in the RGB image are pixel coordinates in the two-dimensional pixel coordinate system, namely the point p_rgb.
In step 32, the epipolar line equation is calculated from the pixel coordinates of the center point of the human hand detection frame using epipolar constraint theory. This greatly reduces the search range from the RGB image to the corresponding feature point in the Depth image and greatly simplifies the calculation of the centroid of the human hand region of the Depth image.
FIG. 9 illustrates a schematic diagram of the epipolar geometry constraint according to some embodiments. In some embodiments, epipolar constraint theory describes the intrinsic mapping relationship between two views; it is independent of the external scene and depends only on the intrinsic parameters and relative pose of the two cameras. The epipolar geometry constraint relationship according to epipolar constraint theory is shown in FIG. 9, in which O_rgb and O_d represent the RGB camera center and the Depth camera center, respectively. For a point p_rgb in the RGB image there is a corresponding point p_d in the Depth image. The line O_rgb-p_rgb and the line O_d-p_d intersect at a point P in three-dimensional space. The plane formed by the three points O_rgb, O_d and P is the epipolar plane. The intersection lines between the epipolar plane and the two image planes are the epipolar lines, where the epipolar line l_rgb is the line connecting the point p_rgb and the epipole e_rgb, and the epipolar line l_d is the line connecting the point p_d and the epipole e_d; the epipoles e_rgb and e_d are the intersection points between the line O_rgb-O_d and the two image planes, respectively. Because the RGB image point p_rgb carries no Depth information, its position when projected into the Depth camera is not unique, but it must lie on the epipolar line l_d.
Therefore, based on epipolar constraint theory, the mapping relationship between the RGB image containing the user gesture and the depth image, acquired by the RGB collector and the depth collector of the image collector, is used to create the epipolar line l_d in the depth image, so as to determine the set of feature points in the depth image that correspond to the center point of the human hand detection frame in the RGB image.
In some embodiments of the present invention, after the RGB image and the depth image are acquired by the image collector, the centroid of the human hand region is calculated directly without alignment processing. Therefore, there may be a deviation between the projections of the RGB image and the depth image, so that the RGB image plane and the depth image plane are not parallel but at an angle; that is, the two image planes may intersect.
If, on the other hand, there were no deviation between the projections of the RGB image and the depth image, the two image planes would be parallel and would not intersect.
In some embodiments, when determining the epipolar line in the depth image, that is, when executing step 32 of determining, in the depth image, the epipolar line generated when the RGB image is projected onto the depth image based on the pixel coordinates of the center point of the human hand detection frame, the controller is further configured to perform the following steps:
Step 321: acquire the matrix parameters of the image collector used when collecting the images of the gesture, and calculate the epipolar line equation coefficients based on the pixel coordinates of the center point of the human hand detection frame and these matrix parameters.
Step 322: establish the epipolar line equation in the depth image based on the epipolar line equation coefficients, and solve the epipolar line equation to determine, in the depth image, the epipolar line generated when the RGB image is projected onto the depth image.
Since there is a pair of corresponding points p_rgb and p_d in the RGB image and the depth image (written in homogeneous pixel coordinates), the following equation can be derived according to epipolar constraint theory:

p_d^T · K_d^{-T} · [t]_x · R · K_rgb^{-1} · p_rgb = 0

where R and t are the rotation matrix and the translation matrix from the RGB camera to the Depth camera, K_rgb and K_d are the intrinsic matrices of the RGB camera and the Depth camera, respectively, and ^T denotes the transpose. In addition, [t]_x denotes the antisymmetric matrix of the translation vector t = (t_1, t_2, t_3)^T, given by:

[t]_x = [  0   -t_3   t_2 ;  t_3   0   -t_1 ;  -t_2   t_1   0 ]
thus, based on the principle of epipolar constraint theory, matrix parameters of the image collectors (RGB camera and depth camera) are obtained when capturing images based on gestures, including those in the above equations. Pixel coordinates based on center point of human hand detection frame in RGB imageAnd the image collector calculates the epipolar equation coefficient A, B, C based on the matrix parameters of the gesture-based image collection.
The epipolar line equation coefficients can be derived from the following equation:

[A, B, C]^T = K_d^{-T} · [t]_x · R · K_rgb^{-1} · p_rgb

where p_rgb = (u_rgb, v_rgb, 1)^T is the homogeneous pixel coordinate of the center point. Based on the above parameters, the epipolar line equation A·u + B·v + C = 0 is established in the depth image, where (u, v) are pixel coordinates in the depth image. Solving this epipolar line equation determines, in the depth image, the epipolar line generated when the RGB image is projected onto the depth image.
FIG. 10 illustrates a schematic diagram of the center point of the human hand detection frame in the RGB image according to some embodiments; fig. 11 illustrates a schematic diagram of the epipolar line in the depth image according to some embodiments. Referring to fig. 10, the image collector collects an RGB image of the gesture made by the user, and the human hand detection frame is displayed around the area where the gesture is located; the center point p_rgb of the human hand detection frame is the target feature point in the RGB image. Referring to fig. 11, when searching the depth image for the target point corresponding to the target feature point p_rgb of the RGB image, the epipolar line equation is obtained by solving according to epipolar constraint theory, and the epipolar line is formed by all points in the depth image that correspond to the target feature point. The epipolar line can be understood as the projection line generated when the center point of the human hand detection frame in the RGB image is projected onto the depth image.
In step 33, the polar line in the depth image includes a plurality of pixel points corresponding to the center point of the human hand detection frame in the RGB image, i.e. all points on the polar line are in a corresponding relationship with the center point of the human hand detection frame in the RGB image.
Then, in order to select the point that uniquely corresponds to the center point of the human hand detection frame in the RGB image, that corresponding point can be searched for within the set of points contained in the epipolar line of the depth image. The search rule is to traverse the pixel points on the epipolar line in the depth image, project them to the RGB camera, and take the point that yields the shortest distance to the center point.
According to the formula:

d_i = || p_i - p1 ||, p_i ∈ A

the distance between each pixel point p_i on the epipolar line and the point p1 is calculated respectively. The target pixel point generating the shortest distance is taken as the target point p2 in the depth image, and the pixel coordinates of this target pixel point are taken as the coordinates of the target point in the depth image. Here, p_i is the pixel currently traversed, and the set A is the coordinate set of the pixels contained in the epipolar line of the depth image.
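As a hedged illustration of this search rule, the sketch below traverses the candidate pixels of set A and keeps the one at minimum Euclidean distance; the optional project_to_rgb callable, which would map a depth pixel back towards the RGB camera before measuring the distance, is an assumption of this sketch.

```python
import numpy as np

def nearest_point_on_epipolar_line(candidates, p1, project_to_rgb=None):
    """Pick the pixel of the epipolar line whose (optionally projected) position is
    closest to the RGB detection-frame center point p1 = (u1, v1)."""
    p1 = np.asarray(p1, dtype=float)
    best_point, best_dist = None, float("inf")
    for p in candidates:                      # traverse set A
        q = np.asarray(project_to_rgb(p) if project_to_rgb else p, dtype=float)
        dist = np.linalg.norm(q - p1)         # Euclidean distance to the center point
        if dist < best_dist:
            best_point, best_dist = p, dist
    return best_point
```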
Fig. 12 illustrates a schematic diagram of determining the target point on the epipolar line of a depth image, according to some embodiments. Referring to fig. 12, the target point can be determined on the epipolar line of the depth image based on the foregoing. The target point p2 in the depth image is the point corresponding to the center point p1 of the human hand detection frame in the RGB image; therefore, the human hand region in the depth image may be determined based on the target point.
S4, acquiring depth information of the depth image, determining a depth value of a human hand area where the target point is located in the depth image based on the depth information, and calculating coordinates of a centroid of the human hand area in the depth image based on the coordinates of the target point and the depth value of the human hand area where the target point is located.
When a human hand region in a depth image is determined based on a target point, since the coordinates of the target point are pixel coordinates in a two-dimensional pixel coordinate system, the coordinates of the two-dimensional pixel coordinate system need to be converted into coordinates in a three-dimensional coordinate system. At this time, the depth information of the depth image is obtained, and the depth information includes the depth value of each pixel point, so that the depth value of the hand area where the target point is located in the depth image can be determined.
Because the position of the hand region comprises a plurality of pixel points, in order to represent the depth value of the hand region accurately, the depth values of the plurality of pixel points contained in the hand region can be calculated.
In some embodiments, the controller, when performing step S4, i.e., determining the depth value of the human hand region where the target point is located in the depth image based on the depth information, is further configured to perform the following steps:
step 411, obtaining a detection frame where the target point is located, and performing plane amplification on the detection frame where the target point is located according to a preset multiple to obtain a human hand area where the target point is located in the depth image, wherein the human hand area comprises a plurality of pixel points.
Step 412, determining a depth value of each pixel point in the human hand region based on the depth information of the depth image.
And 413, calculating a depth average value based on the depth value of each pixel point in the human hand region, and taking the depth average value as the depth value of the human hand region where the target point is located in the depth image.
The target point corresponding to the center point of the human hand detection frame in the RGB image is determined in the depth image based on epipolar constraint theory; the target point is a pixel point on the epipolar line. The position occupied by the target point is smaller than the position occupied by the user's gesture in the depth image, that is, the target point alone cannot enclose the gesture, so the position of the target point cannot completely represent the position of the hand region.
Therefore, a human hand region capable of enclosing the gesture in the depth image needs to be determined based on the position of the target point. The detection frame where the target point is located is obtained, and the plane of the detection frame is enlarged by a preset multiple to obtain the human hand region in the depth image. The preset multiple can be 1-1.5, for example a plane amplification of 1.2 times, as illustrated in the sketch below.
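A minimal sketch of this plane amplification step, assuming the detection frame is given as (x, y, w, h) in pixel coordinates; the function name, the box format and the clipping to the image bounds are assumptions of this sketch.

```python
def enlarge_box(box, scale=1.2, image_size=None):
    """Enlarge a detection frame (x, y, w, h) about its center by `scale`
    (e.g. 1.0-1.5), optionally clipping the result to the image bounds (W, H)."""
    x, y, w, h = box
    cx, cy = x + w / 2.0, y + h / 2.0          # center of the original frame
    new_w, new_h = w * scale, h * scale        # plane amplification
    x0, y0 = cx - new_w / 2.0, cy - new_h / 2.0
    if image_size is not None:                 # keep the enlarged frame inside the image
        W, H = image_size
        x0, y0 = max(0.0, x0), max(0.0, y0)
        new_w, new_h = min(new_w, W - x0), min(new_h, H - y0)
    return (x0, y0, new_w, new_h)
```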
Fig. 13 illustrates a schematic diagram of a human hand region in a depth image according to some embodiments. Referring to fig. 13, an area obtained by enlarging a detection frame of a target point in a depth image by a preset multiple may enclose a user gesture, and thus, the enlarged area is taken as a human hand area.
When the depth value of the human hand region is calculated, the human hand region is composed of a plurality of pixel points. Accordingly, the depth value of each pixel point in the human hand region is determined based on the depth information of the depth image. The depth average value of all pixel points in the human hand region is calculated, and this depth average value is taken as the depth value of the human hand region where the target point is located in the depth image.
The depth average value is calculated according to the formula:

d_avg = (1/N) * (d_1 + d_2 + ... + d_N)

where d_n is the depth value of the n-th pixel point, and N is the total number of pixel points contained in the human hand region.
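A short sketch of this averaging step over the enlarged hand region; the array layout of the depth map and the skipping of zero-valued (invalid) depth readings are assumptions of this sketch rather than requirements of the embodiment.

```python
import numpy as np

def hand_region_depth(depth_map, box):
    """Average depth value of the hand region.

    depth_map : H x W array of per-pixel depth values of the depth image
    box       : (x, y, w, h) enlarged hand region in pixel coordinates
    """
    x, y, w, h = (int(round(v)) for v in box)
    patch = depth_map[y:y + h, x:x + w]
    valid = patch[patch > 0]          # assumption: treat zero depth as invalid
    return float(valid.mean()) if valid.size else 0.0
```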
Thus, after the coordinates (u2, v2) of the target point in the depth image and the depth value d of the human hand region where the target point is located are determined, the coordinates of the centroid of the human hand region in the depth image can be calculated.
In some embodiments, the coordinates of the target point are pixel coordinates, and the depth value is likewise expressed at the pixel level. Thus, the coordinates of the centroid of the human hand region determined in this way are also pixel coordinates. In order to facilitate the AR/VR device responding to the control command of the user's gesture, the pixel coordinates are converted into coordinates in three-dimensional space, so as to complete the corresponding function.
Accordingly, the controller, upon performing step S4, i.e. based on the coordinates of the target point and the depth value of the human hand region in which the target point is located, calculates coordinates of a centroid of the human hand region in the depth image, is further configured to perform the steps of:
Step 421, calculating the coordinates of the centroid image in the depth image based on the coordinates of the target point in the depth image and the depth value of the human hand region, wherein the coordinates of the centroid image refer to the coordinates of the pixels of the centroid of the human hand region.
And step 422, converting the barycenter image coordinates into position coordinates of the barycenter of the human hand region under the three-dimensional coordinate system where the depth image is located.
Once the plane coordinates (u2, v2) of the target point in the two-dimensional pixel coordinate system of the depth image and the depth value d of the human hand region are determined, the centroid image coordinates in the depth image can be determined. The centroid image coordinates are the pixel coordinates (u2, v2, d) of the centroid of the human hand region in the three-dimensional pixel coordinate system.
In order to facilitate the image collector responding to the user's gesture, the pixel coordinates also need to be converted into position coordinates in the three-dimensional space coordinate system. The conversion maps the centroid image coordinates (u2, v2, d) through the pinhole camera model:

x = (u2 - cx) * d / fx,  y = (v2 - cy) * d / fy,  z = d

where fx, fy, cx and cy are taken from the intrinsic matrix of the depth camera, so that the coordinates of the centroid of the hand region in the three-dimensional space coordinate system of the depth camera are determined as (x, y, z).
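A minimal sketch of this back-projection, assuming a standard pinhole model and an intrinsic matrix K_depth = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]; the exact conversion used by the embodiment is not spelled out, so this is an illustrative assumption.

```python
def pixel_to_camera_coords(u, v, depth, K_depth):
    """Back-project a depth-image pixel (u, v) with depth value `depth` into
    3D camera coordinates (x, y, z) of the depth camera."""
    fx, fy = K_depth[0][0], K_depth[1][1]
    cx, cy = K_depth[0][2], K_depth[1][2]
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    z = depth
    return (x, y, z)
```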
Fig. 14 illustrates a schematic diagram of centroids in a depth image according to some embodiments. After converting the three-dimensional pixel coordinates to spatial coordinates, a schematic diagram of the three-dimensional spatial location of the centroid of the human hand region is shown in fig. 14. After the coordinate conversion, the coordinates of the centroid of the human hand region are no longer represented by pixel points, but by spatial coordinates (x, y, z) of the human hand relative to the actual position of the camera.
The origin of coordinates of the three-dimensional space coordinate system may be an image collector, such as a lens in a depth camera, or a lens in an RGB camera; the x-axis forward direction is the left-to-right direction, the y-axis forward direction is the top-to-bottom direction, and the z-axis forward direction is the direction from the camera to the human hand.
In some embodiments, after computing the centroid of the human hand region based on the non-aligned RGBD images, a subsequent 3D gesture interaction application may be performed based on the centroid of the human hand region. For example, the calculated centroid coordinates of the hand region are input to the hand gesture module for performing other related functions. The hand gesture module is a module configured in the image collector for collecting gestures and responding to gesture operations.
In the case where the RGB image and the Depth image are not aligned, the method creatively combines the epipolar geometry principle with a search method to find the centroid of the human hand region in the Depth image, which reduces the amount of calculation without affecting the hand gesture effect.
Fig. 15 illustrates a schematic diagram of a hand gesture according to some embodiments. With reference to fig. 6 and fig. 15, according to the method provided by the embodiment of the invention, the centroid of the hand region is calculated based on the non-aligned RGBD image, so that poor hand gesture recognition effect caused by deviation of the RGB image and the depth image can be avoided, and the optimized hand gesture effect is closer to the actual hand gesture of the user.
Therefore, according to the display device provided by the embodiment of the application, when the user performs 3D gesture interaction, the RGB image and the depth image are acquired by the image collector. The coordinates of the target point in the depth image corresponding to the center point of the human hand detection frame in the RGB image are calculated based on the coordinates of that center point. The depth value of the human hand region where the target point is located is determined according to the depth information of the depth image, and the coordinates of the centroid of the human hand region in the depth image are then calculated based on the coordinates of the target point and the depth value of the human hand region where the target point is located. The display device operates on the non-aligned RGBD images and achieves the effect of aligning the RGB image and the Depth image without complex calculation; the epipolar constraint greatly simplifies the calculation of the centroid of the human hand region in the Depth image and narrows the search range from the RGB image to the corresponding feature point in the Depth image, so that the centroid of the human hand region can be determined in time, real-time interaction of the 3D gesture algorithm on an embedded terminal is realized, and the user's immersive interaction experience is improved.
Fig. 7 illustrates a flow chart of a method of determining a centroid of a human hand region according to some embodiments. Referring to fig. 7, the present application further provides a method for determining a centroid of a human hand region, which is performed by the display device provided in the foregoing embodiment, and includes:
S1, when a user interacts based on gestures, acquiring RGB images and depth images which are acquired by an image acquisition unit and are based on the gestures;
S2, extracting a central point of a human hand detection frame in the RGB image, and determining coordinates of the central point of the human hand detection frame in the RGB image, wherein the central point of the human hand detection frame is a central point of an acquisition frame where the gesture is acquired;
S3, calculating coordinates of a target point in a depth image based on the coordinates of the center point of the human hand detection frame, wherein the target point in the depth image corresponds to the center point of the human hand detection frame in an RGB image, and the target point is used for representing a human hand region in the depth image;
S4, acquiring depth information of the depth image, determining a depth value of a human hand area where a target point is located in the depth image based on the depth information, and calculating coordinates of a mass center of the human hand area in the depth image based on the coordinates of the target point and the depth value of the human hand area where the target point is located.
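Purely as an illustration, the sketch below strings steps S1-S4 together using the helper functions sketched earlier in this description; the detect_hand detector, the reuse of the RGB frame size for the depth-image hand region, and all parameter names are assumptions, not the embodiment's implementation.

```python
def hand_centroid_from_rgbd(rgb_image, depth_image, detect_hand,
                            K_rgb, K_depth, R, t, scale=1.2):
    """Illustrative end-to-end pipeline for steps S1-S4 on non-aligned RGBD images."""
    # S1/S2: hand detection frame and its center point in the RGB image
    x, y, w, h = detect_hand(rgb_image)        # assumed detector, returns (x, y, w, h)
    p1 = (x + w / 2.0, y + h / 2.0)

    # S3: epipolar line in the depth image, then the nearest corresponding target point
    A, B, C = epipolar_line_coefficients(p1, K_rgb, K_depth, R, t)
    height, width = depth_image.shape
    if abs(B) < 1e-9:
        return None                            # degenerate epipolar line, not handled here
    candidates = [(u, -(A * u + C) / B) for u in range(width)]
    candidates = [(u, v) for u, v in candidates if 0 <= v < height]
    target = nearest_point_on_epipolar_line(candidates, p1)
    if target is None:
        return None

    # S4: average depth over the enlarged hand region, then back-project to 3D
    box = enlarge_box((target[0] - w / 2.0, target[1] - h / 2.0, w, h),
                      scale, image_size=(width, height))
    d = hand_region_depth(depth_image, box)
    return pixel_to_camera_coords(target[0], target[1], d, K_depth)
```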
In some embodiments of the present application, the calculating coordinates of the target point in the depth image based on the coordinates of the center point of the human hand detection frame includes:
Acquiring pixel coordinates of a center point of the human hand detection frame in a two-dimensional coordinate system where the RGB image is located;
determining polar lines generated when the RGB image is projected to the depth image in the depth image based on pixel coordinates of the central point of the human hand detection frame, wherein the polar lines in the depth image comprise a plurality of pixel points corresponding to the central point of the human hand detection frame in the RGB image;
and selecting a target pixel point with the shortest distance from the center point of the human hand detection frame from a plurality of pixel points included on the polar line as a target point in the depth image, and taking the pixel coordinates of the target pixel point as the coordinates of the target point in the depth image.
In some embodiments of the present application, the determining, in the depth image, epipolar lines generated when the RGB image is projected onto the depth image based on pixel coordinates of a center point of the human hand detection frame includes:
acquiring matrix parameters of the acquired image collector when acquiring images based on the gestures, and calculating polar equation coefficients based on pixel coordinates of a center point of the human hand detection frame and the matrix parameters of the image collector when acquiring the images based on the gestures;
and establishing an epipolar line equation in a depth image based on the epipolar line equation coefficient, and solving the epipolar line equation, wherein epipolar lines generated by the RGB image when projected to the depth image are determined in the depth image.
In some embodiments of the present application, the determining, based on the depth information, a depth value of a human hand region where the target point is located in the depth image includes:
obtaining a detection frame where the target point is located, and carrying out plane amplification on the detection frame where the target point is located according to a preset multiple to obtain a human hand area where the target point is located in a depth image, wherein the human hand area comprises a plurality of pixel points;
determining a depth value of each pixel point in the human hand region based on the depth information of the depth image;
and calculating a depth average value based on the depth value of each pixel point in the human hand region, and taking the depth average value as the depth value of the human hand region where the target point is located in the depth image.
In some embodiments of the present application, the calculating coordinates of a centroid of a human hand region in a depth image based on coordinates of the target point and a depth value of the human hand region where the target point is located includes:
calculating centroid image coordinates in the depth image based on the coordinates of the target point in the depth image and the depth value of the human hand region, wherein the centroid image coordinates refer to pixel coordinates of the centroid of the human hand region;
and converting the barycenter image coordinates into position coordinates of the barycenter of the human hand region under the three-dimensional coordinate system where the depth image is located.
According to the method for determining the centroid of the human hand region and the display device provided above, when the user performs 3D gesture interaction, the image collector collects RGB images and depth images. The coordinates of the target point in the depth image corresponding to the center point of the human hand detection frame in the RGB image are calculated based on the coordinates of that center point. The depth value of the human hand region where the target point is located is determined according to the depth information of the depth image, and the coordinates of the centroid of the human hand region in the depth image are then calculated based on the coordinates of the target point and the depth value of the human hand region where the target point is located. The method and the display device operate on the non-aligned RGBD images and achieve the effect of aligning the RGB image and the Depth image without complex calculation; the epipolar constraint greatly simplifies the calculation of the centroid of the human hand region in the Depth image and narrows the search range from the RGB image to the corresponding feature point in the Depth image, so that the centroid of the human hand region can be determined in time, real-time interaction of the 3D gesture algorithm on an embedded terminal is realized, and the user's immersive interaction experience is improved.
In a specific implementation, the present invention further provides a computer readable storage medium, where the computer readable storage medium may store a program, and the program, when executed, may include some or all of the steps of each embodiment of the method for determining a centroid of a human hand region provided by the present invention. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random-access memory (RAM), or the like.
It will be apparent to those skilled in the art that the techniques of the embodiments of the present application may be implemented in software plus a necessary general purpose hardware platform. Based on such understanding, the technical solutions in the embodiments of the present application, in essence or the part contributing to the prior art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium.
The same or similar parts between the various embodiments in this specification are referred to each other. In particular, for the method embodiment of determining the centroid of the human hand region, since it is substantially similar to the display device embodiment, the description is relatively simple, and reference is made to the description in the display device embodiment for the relevant points.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the application.
The foregoing description, for purposes of explanation, has been presented in conjunction with specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the embodiments to the precise forms disclosed above. Many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the present disclosure and to enable others skilled in the art to best utilize the embodiments.

Claims (10)

1. A display device, characterized by comprising:
a display configured to display a user interface;
an image collector configured to collect an image;
a controller coupled to the display, the image collector, the controller configured to:
when a user interacts based on gestures, RGB images and depth images which are acquired by an image acquisition unit and are based on the gestures are acquired;
extracting a human hand detection frame center point in the RGB image, and determining coordinates of the human hand detection frame center point in the RGB image, wherein the human hand detection frame center point is a center point of an acquisition frame where the gesture is acquired;
calculating coordinates of a target point in a depth image based on coordinates of the human hand detection frame center point, the target point in the depth image corresponding to the human hand detection frame center point in an RGB image, the target point being used to characterize a human hand region in the depth image;
The method comprises the steps of obtaining depth information of the depth image, determining a depth value of a human hand area where a target point is located in the depth image based on the depth information, and calculating coordinates of a mass center of the human hand area in the depth image based on the coordinates of the target point and the depth value of the human hand area where the target point is located.
2. The display device of claim 1, wherein the controller, when executing the calculating coordinates of the target point in the depth image based on coordinates of the center point of the human hand detection frame, is further configured to:
acquiring pixel coordinates of a center point of the human hand detection frame in a two-dimensional coordinate system where the RGB image is located;
determining polar lines generated when the RGB image is projected to the depth image in the depth image based on pixel coordinates of the central point of the human hand detection frame, wherein the polar lines in the depth image comprise a plurality of pixel points corresponding to the central point of the human hand detection frame in the RGB image;
and selecting a target pixel point with the shortest distance from the center point of the human hand detection frame from a plurality of pixel points included on the polar line as a target point in the depth image, and taking the pixel coordinates of the target pixel point as the coordinates of the target point in the depth image.
3. The display device of claim 2, wherein the controller, when executing the pixel coordinates based on the human hand detection frame center point, to determine in the depth image a epipolar line generated by the RGB image when projected to the depth image, is further configured to:
acquiring matrix parameters of the acquired image collector when acquiring images based on the gestures, and calculating polar equation coefficients based on pixel coordinates of a center point of the human hand detection frame and the matrix parameters of the image collector when acquiring the images based on the gestures;
and establishing an epipolar line equation in a depth image based on the epipolar line equation coefficient, and solving the epipolar line equation, wherein epipolar lines generated by the RGB image when projected to the depth image are determined in the depth image.
4. The display device of claim 1, wherein the controller, when executing the determining the depth value for the region of the human hand in the depth image where the target point is located based on the depth information, is further configured to:
obtaining a detection frame where the target point is located, and carrying out plane amplification on the detection frame where the target point is located according to a preset multiple to obtain a human hand area where the target point is located in a depth image, wherein the human hand area comprises a plurality of pixel points;
Determining a depth value of each pixel point in the human hand region based on the depth information of the depth image;
and calculating a depth average value based on the depth value of each pixel point in the human hand region, and taking the depth average value as the depth value of the human hand region where the target point is located in the depth image.
5. The display device of claim 1, wherein the controller, when executing the computing the coordinates of the centroid of the human hand region in the depth image based on the coordinates of the target point and the depth value of the human hand region in which the target point is located, is further configured to:
calculating centroid image coordinates in the depth image based on the coordinates of the target point in the depth image and the depth value of the human hand region, wherein the centroid image coordinates refer to pixel coordinates of the centroid of the human hand region;
and converting the barycenter image coordinates into position coordinates of the barycenter of the human hand region under the three-dimensional coordinate system where the depth image is located.
6. A method of determining a centroid of a human hand region, the method comprising:
when a user interacts based on gestures, RGB images and depth images which are acquired by an image acquisition unit and are based on the gestures are acquired;
Extracting a human hand detection frame center point in the RGB image, and determining coordinates of the human hand detection frame center point in the RGB image, wherein the human hand detection frame center point is a center point of an acquisition frame where the gesture is acquired;
calculating coordinates of a target point in a depth image based on coordinates of the human hand detection frame center point, the target point in the depth image corresponding to the human hand detection frame center point in an RGB image, the target point being used to characterize a human hand region in the depth image;
the method comprises the steps of obtaining depth information of the depth image, determining a depth value of a human hand area where a target point is located in the depth image based on the depth information, and calculating coordinates of a mass center of the human hand area in the depth image based on the coordinates of the target point and the depth value of the human hand area where the target point is located.
7. The method of claim 6, wherein calculating coordinates of the target point in the depth image based on coordinates of the center point of the human hand detection frame comprises:
acquiring pixel coordinates of a center point of the human hand detection frame in a two-dimensional coordinate system where the RGB image is located;
determining polar lines generated when the RGB image is projected to the depth image in the depth image based on pixel coordinates of the central point of the human hand detection frame, wherein the polar lines in the depth image comprise a plurality of pixel points corresponding to the central point of the human hand detection frame in the RGB image;
And selecting a target pixel point with the shortest distance from the center point of the human hand detection frame from a plurality of pixel points included on the polar line as a target point in the depth image, and taking the pixel coordinates of the target pixel point as the coordinates of the target point in the depth image.
8. The method of claim 7, wherein the determining the epipolar line in the depth image generated by the RGB image when projected to the depth image based on the pixel coordinates of the center point of the human hand detection frame comprises:
acquiring matrix parameters of the acquired image collector when acquiring images based on the gestures, and calculating polar equation coefficients based on pixel coordinates of a center point of the human hand detection frame and the matrix parameters of the image collector when acquiring the images based on the gestures;
and establishing an epipolar line equation in a depth image based on the epipolar line equation coefficient, and solving the epipolar line equation, wherein epipolar lines generated by the RGB image when projected to the depth image are determined in the depth image.
9. The method of claim 6, wherein determining the depth value of the region of the human hand in which the target point is located in the depth image based on the depth information comprises:
Obtaining a detection frame where the target point is located, and carrying out plane amplification on the detection frame where the target point is located according to a preset multiple to obtain a human hand area where the target point is located in a depth image, wherein the human hand area comprises a plurality of pixel points;
determining a depth value of each pixel point in the human hand region based on the depth information of the depth image;
and calculating a depth average value based on the depth value of each pixel point in the human hand region, and taking the depth average value as the depth value of the human hand region where the target point is located in the depth image.
10. The method of claim 6, wherein calculating coordinates of a centroid of a human hand region in the depth image based on the coordinates of the target point and a depth value of the human hand region in which the target point is located comprises:
calculating centroid image coordinates in the depth image based on the coordinates of the target point in the depth image and the depth value of the human hand region, wherein the centroid image coordinates refer to pixel coordinates of the centroid of the human hand region;
and converting the barycenter image coordinates into position coordinates of the barycenter of the human hand region under the three-dimensional coordinate system where the depth image is located.
CN202210319913.4A 2022-03-29 2022-03-29 Determination method of centroid of human hand region and display device Pending CN116935431A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210319913.4A CN116935431A (en) 2022-03-29 2022-03-29 Determination method of centroid of human hand region and display device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210319913.4A CN116935431A (en) 2022-03-29 2022-03-29 Determination method of centroid of human hand region and display device

Publications (1)

Publication Number Publication Date
CN116935431A true CN116935431A (en) 2023-10-24

Family

ID=88390981

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210319913.4A Pending CN116935431A (en) 2022-03-29 2022-03-29 Determination method of centroid of human hand region and display device

Country Status (1)

Country Link
CN (1) CN116935431A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination