CN112657176A - Binocular projection man-machine interaction method combined with portrait behavior information - Google Patents

Binocular projection man-machine interaction method combined with portrait behavior information

Info

Publication number
CN112657176A
Authority
CN
China
Prior art keywords
image
coordinate system
binocular
computer interaction
point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011642041.2A
Other languages
Chinese (zh)
Inventor
谢巍
许练濠
吴伟林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202011642041.2A priority Critical patent/CN112657176A/en
Publication of CN112657176A publication Critical patent/CN112657176A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a binocular projection human-computer interaction method combined with portrait behavior information. The method includes: performing edge detection on the area captured by the camera; performing line detection on the edge-detected image; solving the mapping relationship under viewing-angle transformation through a homography transformation matrix; identifying the interactive objects with the YOLOv3 target detection algorithm and obtaining the game coordinates mapped from the projection-area coordinates into the Unity3D development scene; obtaining the depth information of the interactive objects through binocular camera ranging; and obtaining the skeletal key-point positions of the interactor through a Kinect camera, then virtualizing the character object and the corresponding interactive actions in the Unity3D software. The invention can perform human-computer interaction by means of deep learning, and by combining the depth information of the interactive objects measured by the binocular camera with the portrait behavior information of the interactor, the interactivity of the projection human-computer interaction system can be greatly improved.

Description

Binocular projection man-machine interaction method combined with portrait behavior information
Technical Field
The invention relates to the technical fields of image processing, feature extraction, feature analysis, computer vision, convolutional neural networks, target detection, human-computer interaction and the like, in particular to a binocular projection human-computer interaction method combining portrait behavior information.
Background
With the development of science and technology, human-computer interaction technology has become diversified. People are no longer satisfied with simply presenting virtual scenes and have begun to explore ways of interacting with the virtual world, so more and more novel human-computer interaction technologies have emerged. Human-computer interaction techniques fall into several categories: traditional interaction technology with a keyboard and mouse as input; interaction technologies based on touch-screen devices, such as smart phones and tablet computers; and non-contact interaction technologies based on machine vision and image processing, such as virtual keyboards and gesture interaction systems.
The keyboard and mouse are currently the most mature interactive devices, and keyboard-and-mouse-based human-computer interaction was the earliest form applied to computer operation. This interaction mode is stable and responsive and is widely used in daily office work. Its disadvantage, however, is that a complete human-computer interaction process requires a keyboard, a mouse, a screen for displaying a graphical interface and other devices, which are numerous and bulky. With the development of touch-screen technology and mobile equipment, the display and interaction functions of new mobile products such as smart phones have been integrated into a single screen, and their portability and ease of operation have made mobile devices rapidly and widely used, changing people's way of life. People's increasingly diversified demands on human-computer interaction drive the search for more natural and less constrained interaction modes.
In the literature (Goto H, Takemura D, Kawasaki Y, et al. Development of an Information Projection Using a Projector-Camera System [J]. Electronics and Communications in Japan, 2013, 96(11): 70-81), Hiroki Goto et al. studied a camera-projection interaction system based on the frame-difference method and hand skin-color extraction: the hand is first separated from the scene based on the clustering characteristics of hand skin color in the HSV and YCbCr spaces, and the fingertip positions are then detected on the separated foreground image using template matching, thereby realizing projection interaction between a user and a computer or home television. In the literature (Fitriani, Goh W.B. Interacting with projected media on deformable surfaces [C]. Rio de Janeiro, Brazil: IEEE International Conference on Computer Vision, 2007: 1-6), Fitriani et al. propose a human-computer interaction system based on a deformable projection surface, which projects a virtual scene onto the surface of a deformable object, detects the deformation produced when a user touches the projection screen, and analyses the interaction information through an image processing algorithm and a deformation model of the object. The above schemes based on machine vision and image processing algorithms share the following problems: the diversity of the projection scene cannot be guaranteed, and the dependence on peripherals is large. For example, in an interactive system based on hand skin color, the hand-foreground separation algorithm fails when the projected scene is similar to the skin color of the hand.
Disclosure of Invention
The invention discloses a binocular projection human-computer interaction method combined with portrait behavior information, and aims to carry out human-computer interaction with a deep learning method while combining the depth information of the interactive object measured by a binocular camera and the portrait behavior information of the interactor, so as to greatly improve the interactivity of the projection human-computer interaction system.
The invention is realized by at least one of the following technical schemes.
A binocular projection human-computer interaction method combined with portrait behavior information comprises the following steps:
acquiring image data by using a camera, and carrying out edge detection on a camera capturing area;
carrying out straight line detection on the image after edge detection through Hough straight line detection to realize area positioning of a projection area;
solving the mapping relation under the view angle transformation through the homography transformation matrix;
identifying the interactive objects by using a YOLOv3 target detection algorithm, and obtaining game coordinates mapped to the Unity3D development scene from the projection area coordinates according to the solved homography transformation matrix;
obtaining depth information of an interactive object through binocular camera ranging;
and virtualizing the character object by obtaining the positions of the skeleton key points of the interactive person, and generating corresponding interactive actions according to the distribution of the character joint points.
Preferably, the camera image is converted to grayscale before edge detection; an edge is a set of points in the image where the brightness changes markedly, and a large-scale Canny filter operator is adopted to detect the edges of the image.
Preferably, a gaussian filter is used to smooth the image and filter out noise when edge detection is performed;
a gaussian kernel of size (2k +1) × (2k +1) is set by the following formula:
$$K_{ij} = \frac{1}{2\pi\sigma^{2}} \exp\!\left(-\frac{(i-k-1)^{2}+(j-k-1)^{2}}{2\sigma^{2}}\right), \qquad i, j \in [1, 2k+1]$$

where $K_{ij}$ is the (i, j)-th element of the kernel, k is a positive integer and $\sigma^{2}$ is the variance of the Gaussian function; setting σ = 1.4 and k = 1 gives the 3×3 Gaussian convolution kernel;
convolving the Gaussian kernel with a gray image to obtain a smooth image;
calculating the gradient strength and direction of each pixel point in the image, and utilizing Sobel operators in the horizontal direction and the vertical direction:
$$S_x = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}, \qquad S_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}$$

where $S_x$ is the Sobel operator in the horizontal direction and $S_y$ is the Sobel operator in the vertical direction; each is convolved with the smoothed image to obtain the first derivatives $G_x$, $G_y$ of each pixel point in the two directions, from which the gradient of the pixel point is calculated:

$$G = \sqrt{G_x^{2} + G_y^{2}}, \qquad \theta = \arctan\!\left(\frac{G_y}{G_x}\right)$$
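As an illustration only (not part of the patent text), the following minimal NumPy/OpenCV sketch builds the (2k+1)×(2k+1) Gaussian kernel from the formula above with σ = 1.4 and k = 1, smooths a grayscale frame with it, and computes the Sobel gradient magnitude and direction; the image path "capture.png" is a placeholder assumption.

```python
# Sketch of the smoothing and gradient step described above.
# Assumptions: OpenCV + NumPy available; "capture.png" stands in for a camera frame.
import cv2
import numpy as np

def gaussian_kernel(k: int = 1, sigma: float = 1.4) -> np.ndarray:
    """(2k+1)x(2k+1) Gaussian kernel K_ij from the formula above, normalized to sum to 1."""
    size = 2 * k + 1
    i, j = np.mgrid[1:size + 1, 1:size + 1]
    K = np.exp(-((i - k - 1) ** 2 + (j - k - 1) ** 2) / (2 * sigma ** 2)) / (2 * np.pi * sigma ** 2)
    return K / K.sum()

gray = cv2.cvtColor(cv2.imread("capture.png"), cv2.COLOR_BGR2GRAY).astype(np.float64)
smooth = cv2.filter2D(gray, cv2.CV_64F, gaussian_kernel())   # smoothed image

# Horizontal / vertical Sobel operators Sx, Sy.
Sx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
Sy = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], dtype=np.float64)
# filter2D performs correlation; for the gradient magnitude only the sign would differ.
Gx = cv2.filter2D(smooth, cv2.CV_64F, Sx)
Gy = cv2.filter2D(smooth, cv2.CV_64F, Sy)

G = np.sqrt(Gx ** 2 + Gy ** 2)   # gradient magnitude per pixel
theta = np.arctan2(Gy, Gx)       # gradient direction
```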
preferably, the non-maxima suppression comprises the steps of:
1) comparing the gradient strength of the current pixel with the two pixels along the positive and negative gradient directions;
2) if the gradient strength of the current pixel is the largest compared with the other two pixels, the pixel point is retained as an edge point; otherwise, the pixel point is suppressed.
Preferably, the performing line detection on the image after the edge detection through hough line detection to realize the area positioning process of the projection area specifically includes:
for a straight line y = kx + b in the Cartesian coordinate system, where (x, y) represents a coordinate point in that coordinate system, k represents the slope of the line and b represents its intercept, the line is transformed into b = y - xk; defining the abscissa of the Hough space as k and the ordinate as b, b = y - xk is a straight line in Hough space with slope -x and intercept y; several points (x1, y1), (x2, y2), …, (xn, yn) on the same straight line in the Cartesian coordinate system correspond to several straight lines in Hough space, and the common intersection point (k, b) of these lines is the slope and intercept of that same straight line in the Cartesian coordinate system;

performing the Hough transform in polar form, specifically, expressing a straight line by the polar equation ρ = x cos θ + y sin θ, where ρ is the polar distance, i.e. the distance from the origin to the line in the polar coordinate space, and θ is the polar angle, i.e. the angle between the x-axis and the line segment that passes through the origin and is perpendicular to the line; defining the abscissa of the Hough space as θ and the ordinate as ρ, the coordinates (x1, y1), (x2, y2), …, (xn, yn) of several points on the same straight line in the polar coordinate system correspond to several curves in Hough space, and the common intersection point (θ, ρ) of these curves is the polar angle and polar distance of that same straight line in the polar coordinate system;

intersecting the four longest boundary lines of the projection area pairwise to obtain the four vertex coordinates (x_lt, y_lt), (x_lb, y_lb), (x_rb, y_rb), (x_rt, y_rt) of the projection area.
Preferably, the solving of the mapping relationship under the view angle transformation through the homography transformation matrix specifically includes:
setting the X '-Y' plane to be vertical to the Z axis of the X-Y-Z space coordinate system and to be intersected with the Z axis to be a point (0,0,1), namely, the point (X ', Y') under the X '-Y' plane coordinate system is a point (X ', Y', 1) under the X-Y-Z space coordinate system; and describing the mapping relation between the X-Y plane coordinate system and the X-Y-Z space coordinate system by using a homography matrix H:
$$\begin{bmatrix} X \\ Y \\ Z \end{bmatrix} = H \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}, \qquad H = \begin{bmatrix} h_1 & h_2 & h_3 \\ h_4 & h_5 & h_6 \\ h_7 & h_8 & h_9 \end{bmatrix}$$

where h1~h9 are the nine transformation parameters of the homography matrix; the mapping relation from the x-y plane coordinate system to the x′-y′ plane coordinate system is then obtained as:

$$x' = \frac{X}{Z} = \frac{h_1 x + h_2 y + h_3}{h_7 x + h_8 y + h_9}, \qquad y' = \frac{Y}{Z} = \frac{h_4 x + h_5 y + h_6}{h_7 x + h_8 y + h_9}$$

The H matrix has nine transformation parameters but actually only eight degrees of freedom; multiplying the H matrix by a scaling factor k,

$$(kH)\begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} kX \\ kY \\ kZ \end{bmatrix}, \qquad \frac{kX}{kZ} = \frac{X}{Z} = x', \quad \frac{kY}{kZ} = \frac{Y}{Z} = y'$$

i.e. k*H and H actually represent the same mapping relation, so that H only has eight degrees of freedom, and the homography matrix H is solved either by adding a constraint to the homography matrix H or by setting h9 to 1.
Preferably, with h9 set to 1, the equations to be solved are as follows:
$$x' = \frac{h_1 x + h_2 y + h_3}{h_7 x + h_8 y + 1}, \qquad y' = \frac{h_4 x + h_5 y + h_6}{h_7 x + h_8 y + 1}$$

Alternatively, the homography matrix H is constrained to have modulus 1, as follows:

$$\|H\| = \sqrt{h_1^{2} + h_2^{2} + \cdots + h_9^{2}} = 1$$

The equations to be solved are then:

$$x' = \frac{h_1 x + h_2 y + h_3}{h_7 x + h_8 y + h_9}, \qquad y' = \frac{h_4 x + h_5 y + h_6}{h_7 x + h_8 y + h_9}, \qquad \text{s.t. } \|H\| = 1$$

Defining, for the four vertices of the projection area obtained above in the pixel coordinate system, their respective target coordinate points in the projection-scene coordinate system yields four point correspondences (x_i, y_i) → (x_i', y_i'), from which the H matrix is solved:

$$\begin{bmatrix} x_i' \\ y_i' \\ 1 \end{bmatrix} \sim H \begin{bmatrix} x_i \\ y_i \\ 1 \end{bmatrix}, \qquad i = 1, 2, 3, 4$$
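For illustration, a minimal NumPy sketch of this solving step, assuming h9 = 1: each of the four vertex correspondences contributes the two linear equations x'(h7·x + h8·y + 1) = h1·x + h2·y + h3 and y'(h7·x + h8·y + 1) = h4·x + h5·y + h6, giving an 8×8 linear system. The vertex and target coordinates below are placeholders, not values from the patent.

```python
# Sketch: homography with h9 = 1 from four point correspondences
# (pixel-coordinate vertices -> projection-scene target points).
import numpy as np

src = np.array([[102, 87], [110, 598], [821, 604], [815, 80]], dtype=np.float64)   # (x, y), assumed
dst = np.array([[0, 0], [0, 1080], [1920, 1080], [1920, 0]], dtype=np.float64)     # (x', y'), assumed

A, b = [], []
for (x, y), (xp, yp) in zip(src, dst):
    # x'(h7 x + h8 y + 1) = h1 x + h2 y + h3
    A.append([x, y, 1, 0, 0, 0, -xp * x, -xp * y])
    b.append(xp)
    # y'(h7 x + h8 y + 1) = h4 x + h5 y + h6
    A.append([0, 0, 0, x, y, 1, -yp * x, -yp * y])
    b.append(yp)

h = np.linalg.solve(np.asarray(A), np.asarray(b))   # h1..h8 (h9 = 1)
H = np.append(h, 1.0).reshape(3, 3)

def map_point(H, x, y):
    """Apply the homography: pixel coordinates -> projection-scene coordinates."""
    X, Y, Z = H @ np.array([x, y, 1.0])
    return X / Z, Y / Z
```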
preferably, the identifying the interactive object by using the YOLOv3 target detection algorithm, and obtaining the game coordinate mapped from the projection area coordinate to the Unity3D development scene according to the solved homography transformation matrix specifically includes:
the loss function of YOLOv3 is as follows:
$$
\begin{aligned}
Loss ={} & \lambda_{coord}\sum_{i=1}^{S^2}\sum_{j=1}^{B} I_{ij}^{obj}\Big[(x_{ij}-\hat{x}_{ij})^2+(y_{ij}-\hat{y}_{ij})^2+(w_{ij}-\hat{w}_{ij})^2+(h_{ij}-\hat{h}_{ij})^2\Big] \\
&+\sum_{i=1}^{S^2}\sum_{j=1}^{B} I_{ij}^{obj}\big(C_{ij}-\hat{C}_{ij}\big)^2+\lambda_{noobj}\sum_{i=1}^{S^2}\sum_{j=1}^{B} I_{ij}^{noobj}\big(C_{ij}-\hat{C}_{ij}\big)^2 \\
&+\sum_{i=1}^{S^2}\sum_{j=1}^{B} I_{ij}^{obj}\sum_{c=1}^{classes}\big(p_{ij}(c)-\hat{p}_{ij}(c)\big)^2
\end{aligned}
$$

where the first term is the coordinate error loss and $\lambda_{coord}$ is the coordinate loss function coefficient; S denotes dividing the input image into S×S grids; B denotes the number of bounding boxes contained in one grid; $I_{ij}^{obj}$ indicates whether the j-th box of the i-th grid contains an object, taking the value 1 when it does and 0 when it does not; x, y denote the centre coordinates of a box and w, h its width and height, with $r_{ij}$ and $\hat{r}_{ij}$ denoting the x, y, w, h of the predicted box and the real box respectively. The second and third terms are the confidence loss: $I_{ij}^{noobj}$ indicates whether the j-th box of the i-th grid contains no object, taking the value 1 when it does not and 0 when it does; $\lambda_{noobj}$ balances the loss weights of grids with and without objects, its purpose being to reduce the confidence loss of grid boxes that contain no object; $C_{ij}$ and $\hat{C}_{ij}$ denote the predicted and real confidences of the j-th box of the i-th grid. The fourth term is the classification loss: classes denotes the number of categories, and $p_{ij}(c)$, $\hat{p}_{ij}(c)$ denote the predicted and real probabilities that the j-th box of the i-th grid belongs to a class-c object.
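The following NumPy sketch mirrors the structure of this loss for illustration only; the tensor layout (S·S cells × B boxes × (x, y, w, h, C, class probabilities)) and the coefficient values λ_coord = 5 and λ_noobj = 0.5 are assumptions, and details such as anchors and multi-scale prediction in the full YOLOv3 are omitted.

```python
# Simplified sum-of-squares loss following the four terms defined above.
import numpy as np

def yolo_like_loss(pred, target, lambda_coord=5.0, lambda_noobj=0.5):
    obj = (target[..., 4] == 1.0)   # I_ij^obj: j-th box of i-th cell holds an object
    noobj = ~obj                     # I_ij^noobj

    coord = lambda_coord * np.sum((pred[..., :4][obj] - target[..., :4][obj]) ** 2)
    conf_obj = np.sum((pred[..., 4][obj] - target[..., 4][obj]) ** 2)
    conf_noobj = lambda_noobj * np.sum((pred[..., 4][noobj] - target[..., 4][noobj]) ** 2)
    cls = np.sum((pred[..., 5:][obj] - target[..., 5:][obj]) ** 2)
    return coord + conf_obj + conf_noobj + cls

S, B, classes = 13, 3, 2
pred = np.random.rand(S * S, B, 5 + classes)
target = np.zeros((S * S, B, 5 + classes))
target[0, 0, :5] = [0.5, 0.5, 0.2, 0.3, 1.0]   # one ground-truth box in cell 0
target[0, 0, 5] = 1.0                           # belonging to class 0
print(yolo_like_loss(pred, target))
```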
Preferably, the obtaining of the depth information of the interactive object through binocular camera ranging specifically includes:
correcting the original image according to the calibration result, wherein the two corrected images are positioned on the same plane and are parallel to each other;
matching pixel points of the two corrected images to obtain a disparity map;
and calculating the depth of each pixel according to the matching result, thereby obtaining a depth map.
Preferably, obtaining the skeletal key-point positions of the interactor through the Kinect camera and then virtualizing the character object and the corresponding interactive actions in the Unity3D software specifically includes:
identifying the skeletal structure of the target person using the skeleton-tracking function of the Kinect somatosensory device, acquiring the depth data of the target person and the related information of the color image, displaying the obtained skeletal structure information of the person in real time, and saving the person's joint-point data; the interactivity of the human-computer interaction system is improved by virtualizing a character object in the Unity3D software to imitate the interactive actions of the interactor.
Compared with the prior art, the invention has the beneficial effects that: the target detection is carried out by a deep learning method, so that the interference of environmental factors can be avoided, the diversity of interactive scenes is ensured, and meanwhile, the interactivity of a human-computer interaction system is improved by combining the depth information of an interactive object acquired by a binocular camera and the human behavior information of an interactive person acquired by a Kinect camera.
Drawings
Fig. 1 is a hardware schematic diagram of a binocular projection human-computer interaction method combining with portrait behavior information according to the embodiment;
fig. 2 is a flowchart of a binocular projection human-computer interaction method combining with portrait behavior information according to the present embodiment;
FIG. 3 is a schematic diagram of detection of a Cartesian coordinate Hough line according to the embodiment;
fig. 4 is a schematic diagram of a hough line detection algorithm of the polar coordinate system in this embodiment;
FIG. 5 is a schematic diagram of homography transformation of the present embodiment;
fig. 6 is a frame diagram of an actual binocular system according to the present embodiment;
fig. 7 is a frame diagram of an ideal binocular vision system of the present embodiment.
Detailed Description
The invention is further described with reference to the following figures and specific embodiments.
As shown in fig. 2, a binocular projection human-computer interaction method combining portrait behavior information includes the following steps:
s1, performing edge detection on the camera capturing area through a large-scale canny filtering operator, and performing non-maximum suppression;
carrying out graying processing on the camera image;
the edge is a set of points in the image where the brightness changes markedly, and the gradient reflects this rate of change numerically; based on the principle that the boundary of the projection area must not be missed, a large-scale Canny filter operator is adopted to detect the edges of the image;
using a Gaussian filter to smooth the image and filter out noise;
a gaussian kernel of size (2k +1) × (2k +1) is set by the following formula:
$$K_{ij} = \frac{1}{2\pi\sigma^{2}} \exp\!\left(-\frac{(i-k-1)^{2}+(j-k-1)^{2}}{2\sigma^{2}}\right), \qquad i, j \in [1, 2k+1]$$

where $K_{ij}$ is the (i, j)-th element of the kernel, k is a positive integer and $\sigma^{2}$ is the variance of the Gaussian function; setting σ = 1.4 and k = 1 gives the 3×3 Gaussian convolution kernel;
convolving the Gaussian kernel with a gray image to obtain a smooth image;
calculating the gradient strength and direction of each pixel point in the image, and utilizing Sobel operators in the horizontal direction and the vertical direction:
$$S_x = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}, \qquad S_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}$$

where $S_x$ is the Sobel operator in the horizontal direction and $S_y$ is the Sobel operator in the vertical direction; each is convolved with the smoothed image to obtain the first derivatives $G_x$, $G_y$ of each pixel point in the two directions, from which the gradient of the pixel point is calculated:

$$G = \sqrt{G_x^{2} + G_y^{2}}, \qquad \theta = \arctan\!\left(\frac{G_y}{G_x}\right)$$
non-maxima suppression is applied to eliminate spurious responses due to edge detection:
for each pixel point on the obtained gradient image, whether the point is kept or eliminated cannot be determined by a single threshold alone; since the final edge image is expected to describe the contour of the source image accurately, non-maximum suppression is required:
1) comparing the gradient strength of the current pixel with the two pixels along the positive and negative gradient directions;
2) if the gradient strength of the current pixel is the largest compared with the other two pixels, the pixel point is retained as an edge point; otherwise, the pixel point is suppressed;
as the scale of the convolution kernel increases, the detected edges become more pronounced; based on the principle that the boundary of the projection area must not be missed, a large-scale Canny operator is adopted to detect the edges of the image.
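A minimal OpenCV sketch of step S1 is given below for illustration; cv2.Canny performs the gradient computation, non-maximum suppression and hysteresis thresholding internally, and the kernel size and thresholds shown are assumed values rather than values specified by the patent.

```python
# Sketch of step S1: grayscale conversion, large-scale Gaussian smoothing, Canny edges.
import cv2

frame = cv2.imread("capture.png")                      # placeholder camera frame
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
smooth = cv2.GaussianBlur(gray, (7, 7), 1.4)           # larger kernel keeps the strong boundary edges
edges = cv2.Canny(smooth, threshold1=50, threshold2=150)
```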
S2, carrying out straight line detection on the image after edge detection through Hough straight line detection to realize area positioning of a projection area;
hough line detection maps each point on a cartesian coordinate system to a straight line in Hough space by using the principle of point-line duality of the cartesian coordinate system and Hough space, so that an intersection passing through a plurality of straight lines in Hough space corresponds to a straight line passing through a plurality of points in the cartesian coordinate system, as shown in fig. 3.
Specifically, for a straight line y = kx + b in the Cartesian coordinate system, where (x, y) represents a coordinate point in that coordinate system, k the slope of the line and b its intercept, the line is transformed into b = y - xk. Defining the abscissa of the Hough space as k and the ordinate as b, b = y - xk is a straight line in Hough space with slope -x and intercept y. Several points (x1, y1), (x2, y2), …, (xn, yn) on the same straight line in the Cartesian coordinate system correspond to several straight lines in Hough space, and the common intersection point (k, b) of these lines is the slope and intercept of that same straight line in the Cartesian coordinate system.
Because the slope of a vertical line in the image cannot be calculated, the Hough transform is typically performed in polar form. Specifically, a straight line is expressed by the polar equation ρ = x cos θ + y sin θ, where ρ is the polar distance, i.e. the distance from the origin to the line in the polar coordinate space, and θ is the polar angle, i.e. the angle between the x-axis and the line segment that passes through the origin and is perpendicular to the line. Defining the abscissa of the Hough space as θ and the ordinate as ρ, the coordinates (x1, y1), (x2, y2), …, (xn, yn) of several points on the same straight line in the polar coordinate system correspond to several curves in Hough space, and the common intersection point (θ, ρ) of these curves is the polar angle and polar distance of that same straight line in the polar coordinate system; a schematic diagram is shown in FIG. 4.
The four longest boundary lines of the projection area are intersected pairwise to obtain the coordinates of the four vertices of the projection area, namely the top-left, bottom-left, bottom-right and top-right points (x_lt, y_lt), (x_lb, y_lb), (x_rb, y_rb), (x_rt, y_rt).
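For illustration, a short OpenCV sketch of step S2 using polar-form Hough line detection and pairwise intersection; taking the four strongest Hough lines as the projection-area boundaries is a simplifying assumption.

```python
# Sketch of step S2: polar-form Hough lines on the edge image, then pairwise intersection.
import cv2
import numpy as np

edges = cv2.Canny(cv2.cvtColor(cv2.imread("capture.png"), cv2.COLOR_BGR2GRAY), 50, 150)
lines = cv2.HoughLines(edges, rho=1, theta=np.pi / 180, threshold=120)   # (rho, theta) pairs
boundaries = [l[0] for l in lines[:4]]   # assumed: the 4 strongest lines bound the projection area

def intersect(l1, l2):
    """Intersection of two lines given as (rho, theta): solves x cos(t) + y sin(t) = rho."""
    (r1, t1), (r2, t2) = l1, l2
    A = np.array([[np.cos(t1), np.sin(t1)], [np.cos(t2), np.sin(t2)]])
    if abs(np.linalg.det(A)) < 1e-9:      # near-parallel lines: no useful intersection
        return None
    return np.linalg.solve(A, np.array([r1, r2]))

vertices = [p for i in range(4) for j in range(i + 1, 4)
            if (p := intersect(boundaries[i], boundaries[j])) is not None]
```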
S3, solving the mapping relation under the view angle transformation through the homography transformation matrix; the homography transformation diagram is shown in figure 5.
Homography transformations reflect the process of mapping from one two-dimensional plane to three-dimensional space, and then from three-dimensional space to another two-dimensional plane. The homography transformation describes nonlinear transformation between two coordinate systems, so that the homography transformation has wide application in the fields of image splicing, image correction, augmented reality and the like.
X-Y-Z is a three-dimensional space coordinate system, which can be understood as the world coordinate system; x-y is the pixel-plane space coordinate system; and x′-y′ is the plane coordinate system of the target (projection-scene) plane. The homography transform can be described as follows: a point (x, y) in the x-y coordinate system corresponds to a straight line l passing through the origin and that point in the X-Y-Z coordinate system:

$$l: \; \begin{bmatrix} X \\ Y \\ Z \end{bmatrix} = t \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}, \qquad t \in \mathbb{R}$$

This straight line intersects the x′-y′ coordinate plane at the point (x′, y′), and the process from the point (x, y) to the point (x′, y′) is referred to as a homography transformation.
The solving process of the homography transformation is as follows:
let the X '-Y' plane be perpendicular to the Z-axis of the X-Y-Z space coordinate system and intersect the Z-axis at point (0,0,1), i.e., point (X ', Y') in the X '-Y' plane coordinate system is point (X ', Y', 1) in the X-Y-Z space coordinate system. And describing the mapping relation between the X-Y plane coordinate system and the X-Y-Z space coordinate system by using a homography matrix H:
$$\begin{bmatrix} X \\ Y \\ Z \end{bmatrix} = H \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}, \qquad H = \begin{bmatrix} h_1 & h_2 & h_3 \\ h_4 & h_5 & h_6 \\ h_7 & h_8 & h_9 \end{bmatrix}$$

where h1~h9 are the nine transformation parameters of the homography matrix; the mapping relation from the x-y plane coordinate system to the x′-y′ plane coordinate system is then obtained as:

$$x' = \frac{X}{Z} = \frac{h_1 x + h_2 y + h_3}{h_7 x + h_8 y + h_9}, \qquad y' = \frac{Y}{Z} = \frac{h_4 x + h_5 y + h_6}{h_7 x + h_8 y + h_9}$$

The H matrix has nine transformation parameters, but actually only eight degrees of freedom; multiplying the H matrix by a scaling factor k,

$$(kH)\begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} kX \\ kY \\ kZ \end{bmatrix}, \qquad \frac{kX}{kZ} = x', \quad \frac{kY}{kZ} = y'$$

that is, k*H and H actually represent the same mapping relationship, so H has only eight degrees of freedom. One way to solve for H is to set h9 to 1, in which case the equations to be solved are:

$$x' = \frac{h_1 x + h_2 y + h_3}{h_7 x + h_8 y + 1}, \qquad y' = \frac{h_4 x + h_5 y + h_6}{h_7 x + h_8 y + 1}$$

Another approach is to add a constraint to the homography matrix H, making its modulus equal to 1, as follows:

$$\|H\| = \sqrt{h_1^{2} + h_2^{2} + \cdots + h_9^{2}} = 1$$

The equations to be solved are then:

$$x' = \frac{h_1 x + h_2 y + h_3}{h_7 x + h_8 y + h_9}, \qquad y' = \frac{h_4 x + h_5 y + h_6}{h_7 x + h_8 y + h_9}, \qquad \text{s.t. } \|H\| = 1$$

Defining, for the four vertices of the projection area obtained above in the pixel coordinate system, their respective target coordinate points in the projection-scene coordinate system yields four point correspondences, from which the H matrix is solved:

$$\begin{bmatrix} x_i' \\ y_i' \\ 1 \end{bmatrix} \sim H \begin{bmatrix} x_i \\ y_i \\ 1 \end{bmatrix}, \qquad i = 1, 2, 3, 4$$
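As a sketch of this step using OpenCV (an alternative to solving the linear system by hand), the four detected vertices can be mapped to the corners of the projection-scene coordinate system with cv2.getPerspectiveTransform; the 1920×1080 target size and the vertex values are placeholder assumptions.

```python
# Sketch of step S3 with OpenCV: pixel-coordinate vertices -> projection-scene corners.
import cv2
import numpy as np

pixel_vertices = np.float32([[102, 87], [110, 598], [821, 604], [815, 80]])   # lt, lb, rb, rt (assumed)
scene_corners = np.float32([[0, 0], [0, 1080], [1920, 1080], [1920, 0]])      # assumed scene size
H = cv2.getPerspectiveTransform(pixel_vertices, scene_corners)                # 3x3 homography

# Map an arbitrary detected point (e.g. the centre of a YOLOv3 bounding box) into the scene.
pt = np.float32([[[433.0, 268.0]]])                                           # shape (1, 1, 2)
scene_pt = cv2.perspectiveTransform(pt, H)[0, 0]
```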
s4, identifying the interactive objects by researching a convolutional neural network technology and utilizing a YOLOv3 target detection algorithm, and obtaining game coordinates mapped to a Unity3D development scene from the projection area coordinates according to the solved homography transformation matrix;
s5, obtaining depth information of the interactive object by ranging the binocular camera through a convolutional neural network;
further, interactive objects are identified by using a YOLOv3 target detection algorithm, and a game coordinate process of mapping the projection area coordinates to the Unity3D development scene is obtained according to the solved homography transformation matrix.
Coordinate positioning of the arrow position in the image is a typical object detection problem. Current research on target detection in the field of convolutional neural networks is mainly divided into two families of algorithms, two-stage and one-stage: a two-stage target detection algorithm first computes candidate regions from the input image and then classifies and refines the candidate regions through a convolutional neural network; representatives of such algorithms are the R-CNN series, SPP-Net and the like. A one-stage target detection algorithm converts the detection and localization of the target into a regression problem and realizes end-to-end target detection directly with a single convolutional neural network; one-stage algorithms are represented by the YOLO series, SSD, Retina-Net and the like. The two families have their respective advantages and disadvantages: two-stage algorithms are superior to one-stage algorithms in accuracy and precision, while one-stage algorithms have a great advantage in speed.
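For illustration, a sketch of running a pre-trained YOLOv3 model with OpenCV's DNN module to locate the interactive object in a camera frame; the yolov3.cfg / yolov3.weights file names, the 416×416 input size and the 0.5 confidence threshold are placeholder assumptions, not values from the patent.

```python
# Sketch: YOLOv3 inference via OpenCV DNN to find the interactive object's box centre.
import cv2
import numpy as np

net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")   # assumed model files
layer_names = net.getUnconnectedOutLayersNames()

frame = cv2.imread("capture.png")
h, w = frame.shape[:2]
blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)

boxes = []
for out in net.forward(layer_names):
    for det in out:                         # det = [cx, cy, bw, bh, objectness, class scores...]
        scores = det[5:]
        conf = det[4] * scores[np.argmax(scores)]
        if conf > 0.5:
            cx, cy = det[0] * w, det[1] * h  # box centre in pixel coordinates
            boxes.append((cx, cy, int(np.argmax(scores)), float(conf)))
```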
The YOLOv3 loss function is designed as follows:
$$
\begin{aligned}
Loss ={} & \lambda_{coord}\sum_{i=1}^{S^2}\sum_{j=1}^{B} I_{ij}^{obj}\Big[(x_{ij}-\hat{x}_{ij})^2+(y_{ij}-\hat{y}_{ij})^2+(w_{ij}-\hat{w}_{ij})^2+(h_{ij}-\hat{h}_{ij})^2\Big] \\
&+\sum_{i=1}^{S^2}\sum_{j=1}^{B} I_{ij}^{obj}\big(C_{ij}-\hat{C}_{ij}\big)^2+\lambda_{noobj}\sum_{i=1}^{S^2}\sum_{j=1}^{B} I_{ij}^{noobj}\big(C_{ij}-\hat{C}_{ij}\big)^2 \\
&+\sum_{i=1}^{S^2}\sum_{j=1}^{B} I_{ij}^{obj}\sum_{c=1}^{classes}\big(p_{ij}(c)-\hat{p}_{ij}(c)\big)^2
\end{aligned}
$$

where the first term is the coordinate error loss and $\lambda_{coord}$ is the coordinate loss function coefficient; S denotes dividing the input image into S×S grids; B denotes the number of bounding boxes contained in one grid; $I_{ij}^{obj}$ indicates whether the j-th box of the i-th grid contains an object, taking the value 1 when it does and 0 when it does not; x, y denote the centre coordinates of a box and w, h its width and height, with $r_{ij}$ and $\hat{r}_{ij}$ denoting the x, y, w, h of the predicted box and the real box respectively. The second and third terms are the confidence loss: $I_{ij}^{noobj}$ indicates whether the j-th box of the i-th grid contains no object, taking the value 1 when it does not and 0 when it does; $\lambda_{noobj}$ balances the loss weights of grids with and without objects, its purpose being to reduce the confidence loss of grid boxes that contain no object; $C_{ij}$ and $\hat{C}_{ij}$ denote the predicted and real confidences of the j-th box of the i-th grid. The fourth term is the classification loss: classes denotes the number of categories, and $p_{ij}(c)$, $\hat{p}_{ij}(c)$ denote the predicted and real probabilities that the j-th box of the i-th grid belongs to a class-c object.
S6, obtaining the positions of the skeletal key points of the interactor through a Kinect camera, virtualizing the character object in the Unity3D software, and generating corresponding interactive actions according to the distribution of the character's joint points.
Calibrating the binocular cameras to obtain internal and external parameters and distortion coefficients of the two cameras;
the purpose of camera calibration is as follows: first, to restore the real world position of the object imaged by the camera, it is necessary to know how the world object is transformed into the computer image plane, i.e. to solve the internal and external parameter matrix.
Second, perspective projection with a real camera has a significant problem: distortion. Another purpose of camera calibration is therefore to solve for the distortion coefficients, which are then used for image rectification.
Correcting the original image according to the calibration result, wherein the two corrected images are positioned on the same plane and are parallel to each other;
the main task of the binocular camera system is distance measurement, and the parallax distance measurement formula is derived under the ideal condition of the binocular system, but in the real binocular stereo vision system, two camera image planes which are completely aligned in a coplanar line do not exist, as shown in the attached figure 6, wherein p is a certain point on an object to be measured, and O is a certain point on the object to be measured1And O2The optical centers of the two cameras are respectively, so that the three-dimensional correction is carried out, namely, two images which are not in coplanar line alignment in practice are corrected into coplanar line alignment (the coplanar line alignment is that when two camera image planes are on the same plane and the same point is projected to the two camera image planes, the same line of two pixel coordinate systems is needed), the actual binocular system is corrected into an ideal binocular system, as shown in figure 7, wherein p is a certain point on an object to be measured, and O is a certain point on the object to be measuredRAnd OTThe optical centers of the two cameras are respectively, the imaging points of the point P on the photoreceptors of the two cameras are P and P', f is the focal length of the cameras, B is the center distance of the two cameras, and X isRAnd XTThe distances from the imaging points on the image planes of the left camera and the right camera to the left edge of the image plane respectively, and z is the required depth information.
Matching pixel points of the two corrected images, wherein the pixel points are used for matching corresponding image points of the same scene on left and right views to obtain a parallax image;
calculating the depth of each pixel according to the matching result, thereby obtaining a depth map;
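A compact OpenCV sketch of this binocular ranging pipeline (rectification, disparity, depth) is shown below for illustration; all calibration quantities are placeholders standing in for the real calibration output, and the depth is recovered as z = f·B/d from the disparity d.

```python
# Sketch: stereo rectification from calibration results, SGBM disparity, depth map.
import cv2
import numpy as np

size = (1280, 720)                                              # assumed image size
K1 = K2 = np.array([[900.0, 0, 640], [0, 900.0, 360], [0, 0, 1]])
D1 = D2 = np.zeros(5)
R, T = np.eye(3), np.array([[-60.0], [0.0], [0.0]])             # assumed 60 mm baseline

R1, R2, P1, P2, Q, _, _ = cv2.stereoRectify(K1, D1, K2, D2, size, R, T)
map1x, map1y = cv2.initUndistortRectifyMap(K1, D1, R1, P1, size, cv2.CV_32FC1)
map2x, map2y = cv2.initUndistortRectifyMap(K2, D2, R2, P2, size, cv2.CV_32FC1)

left = cv2.remap(cv2.imread("left.png", 0), map1x, map1y, cv2.INTER_LINEAR)
right = cv2.remap(cv2.imread("right.png", 0), map2x, map2y, cv2.INTER_LINEAR)

sgbm = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128, blockSize=9)
disparity = sgbm.compute(left, right).astype(np.float32) / 16.0   # SGBM output is fixed-point x16

f, B = P1[0, 0], abs(T[0, 0])                                     # focal length (px), baseline (mm)
depth = np.where(disparity > 0, f * B / disparity, 0)             # z = f*B / (XR - XT), in mm
```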
Further, obtaining the skeletal key-point positions of the interactor through the Kinect camera and then virtualizing the character object and the corresponding interactive actions in the Unity3D software specifically includes:
The skeletal structure of the target person is identified using the skeleton-tracking function of the Kinect somatosensory device. The depth data of the target person and the related information of the color image are acquired directly from the Kinect sensor, the obtained skeletal structure information of the person is displayed in real time, and the data of the person's 20 joint points are saved.
The interactivity of the man-machine interaction system is improved by virtualizing character objects in the Unity3D software and designing a plurality of corresponding actions, such as squatting, standing, shooting postures and the like, in the Unity3D software according to the character joint distribution information so as to simulate the interactive actions of an interactor.
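As an illustrative sketch only: a simple rule for deriving one interactive action (squat vs. stand) from the distribution of joint points; the joint-dictionary format and the thresholds are assumptions, since the patent does not specify the data layout, and in the actual system such a result would drive the virtual character in Unity3D.

```python
# Sketch: classify a posture from skeletal joint positions (assumed format: metres, y up).
def classify_action(joints: dict) -> str:
    """joints maps joint name -> (x, y, z); returns a coarse action label."""
    hip_y = (joints["hip_left"][1] + joints["hip_right"][1]) / 2
    knee_y = (joints["knee_left"][1] + joints["knee_right"][1]) / 2
    head_y = joints["head"][1]

    if head_y - hip_y < 0.45 and hip_y - knee_y < 0.25:
        return "squat"        # hips dropped near knee height, torso compressed
    return "stand"

sample = {
    "head": (0.0, 1.60, 2.0), "hip_left": (-0.1, 0.95, 2.0), "hip_right": (0.1, 0.95, 2.0),
    "knee_left": (-0.1, 0.50, 2.0), "knee_right": (0.1, 0.50, 2.0),
}
print(classify_action(sample))    # -> "stand"
```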
As shown in FIG. 1, the invention uses the projector 2 to project the interface of the mobile phone terminal 1; the interactor then performs interactive actions, the binocular camera 3 and the Kinect camera 4 transmit images over the network to the cloud server 5 for data processing and analysis, and the result is transmitted back to the mobile phone terminal 1 to display the game scene.
In order that those skilled in the art will better understand the disclosure, the invention will be described in further detail with reference to the accompanying drawings and specific embodiments. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Claims (10)

1. A binocular projection human-computer interaction method combined with portrait behavior information, characterized in that the binocular projection human-computer interaction method comprises the following steps:
acquiring image data with a camera and performing edge detection on the area captured by the camera;
performing line detection on the edge-detected image through Hough line detection to locate the projection area;
solving the mapping relationship under viewing-angle transformation through a homography transformation matrix;
identifying interactive objects with the YOLOv3 target detection algorithm, and obtaining the game coordinates mapped from the projection-area coordinates into the Unity3D development scene according to the solved homography transformation matrix;
obtaining the depth information of the interactive objects through binocular camera ranging;
obtaining the positions of the skeletal key points of the interactor, virtualizing the character object, and generating the corresponding interactive actions according to the distribution of the character's joint points.

2. The binocular projection human-computer interaction method combined with portrait behavior information according to claim 1, characterized in that the camera image is converted to grayscale before edge detection; an edge is the set of points in the image where the brightness changes markedly, and a large-scale Canny filter operator is used to detect the image edges.

3. The binocular projection human-computer interaction method combined with portrait behavior information according to claim 2, characterized in that a Gaussian filter is first used during edge detection to smooth the image and filter out noise;
a Gaussian kernel of size (2k+1)×(2k+1) is set by the following formula:

$$K_{ij} = \frac{1}{2\pi\sigma^{2}} \exp\!\left(-\frac{(i-k-1)^{2}+(j-k-1)^{2}}{2\sigma^{2}}\right), \qquad i, j \in [1, 2k+1]$$

where k is a positive integer and σ² is the variance of the Gaussian function; setting σ = 1.4 and k = 1 gives the 3×3 Gaussian convolution kernel;
this Gaussian kernel is convolved with the grayscale image to obtain the smoothed image;
the gradient strength and direction of each pixel in the image are calculated using the horizontal and vertical Sobel operators

$$S_x = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}, \qquad S_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}$$

where S_x is the Sobel operator in the horizontal direction and S_y is the Sobel operator in the vertical direction; each is convolved with the smoothed image to obtain the first derivatives G_x, G_y of each pixel in the two directions, from which the gradient of the pixel is calculated:

$$G = \sqrt{G_x^{2} + G_y^{2}}, \qquad \theta = \arctan\!\left(\frac{G_y}{G_x}\right)$$

4. The binocular projection human-computer interaction method combined with portrait behavior information according to claim 3, characterized in that non-maximum suppression comprises the following steps:
1) comparing the gradient strength of the current pixel with the two pixels along the positive and negative gradient directions;
2) if the gradient strength of the current pixel is the largest compared with the other two pixels, the pixel is retained as an edge point; otherwise the pixel is suppressed.

5. The binocular projection human-computer interaction method combined with portrait behavior information according to claim 4, characterized in that performing line detection on the edge-detected image through Hough line detection to locate the projection area specifically comprises:
for a straight line y = kx + b in the Cartesian coordinate system, where (x, y) denotes a coordinate point in that coordinate system, k the slope of the line and b its intercept, the line is transformed into b = y - xk; defining the abscissa of the Hough space as k and the ordinate as b, b = y - xk is a straight line in Hough space with slope -x and intercept y; several points (x1, y1), (x2, y2), …, (xn, yn) on the same straight line in the Cartesian coordinate system correspond to several straight lines in Hough space, and the common intersection point (k, b) of these lines is the slope and intercept of that same straight line in the Cartesian coordinate system;
the Hough transform is performed in polar form, specifically by expressing a straight line with the polar equation ρ = x cos θ + y sin θ, where ρ is the polar distance, i.e. the distance from the origin to the line in the polar coordinate space, and θ is the polar angle, i.e. the angle between the x-axis and the line segment that passes through the origin and is perpendicular to the line; defining the abscissa of the Hough space as θ and the ordinate as ρ, the coordinates (x1, y1), (x2, y2), …, (xn, yn) of several points on the same straight line in the polar coordinate system correspond to several curves in Hough space, and the common intersection point (θ, ρ) of these curves is the polar angle and polar distance of that same straight line in the polar coordinate system;
the four longest projection-area boundary lines obtained are intersected pairwise to obtain the four vertex coordinates (x_lt, y_lt), (x_lb, y_lb), (x_rb, y_rb), (x_rt, y_rt) of the projection area.

6. The binocular projection human-computer interaction method combined with portrait behavior information according to claim 5, characterized in that solving the mapping relationship under viewing-angle transformation through the homography transformation matrix specifically comprises:
letting the x′-y′ plane be perpendicular to the Z axis of the X-Y-Z space coordinate system and intersect the Z axis at the point (0, 0, 1), i.e. a point (x′, y′) in the x′-y′ plane coordinates is the point (x′, y′, 1) in the X-Y-Z space coordinate system; the mapping relationship between the x-y plane coordinate system and the X-Y-Z space coordinate system is described by the homography matrix H:

$$\begin{bmatrix} X \\ Y \\ Z \end{bmatrix} = H \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}, \qquad H = \begin{bmatrix} h_1 & h_2 & h_3 \\ h_4 & h_5 & h_6 \\ h_7 & h_8 & h_9 \end{bmatrix}$$

where h1~h9 are the nine transformation parameters of the homography matrix; the mapping relationship from the x-y plane coordinate system to the x′-y′ plane coordinate system is then obtained as

$$x' = \frac{X}{Z} = \frac{h_1 x + h_2 y + h_3}{h_7 x + h_8 y + h_9}, \qquad y' = \frac{Y}{Z} = \frac{h_4 x + h_5 y + h_6}{h_7 x + h_8 y + h_9}$$

the H matrix has nine transformation parameters but in fact only eight degrees of freedom; multiplying the H matrix by a scaling factor k,

$$(kH)\begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} kX \\ kY \\ kZ \end{bmatrix}, \qquad \frac{kX}{kZ} = x', \quad \frac{kY}{kZ} = y'$$

i.e. k*H and H actually represent the same mapping relationship, so H has only eight degrees of freedom; the homography matrix H is solved either by adding a constraint to the homography matrix H or by setting h9 to 1.

7. The binocular projection human-computer interaction method combined with portrait behavior information according to claim 6, characterized in that, with h9 set to 1, the equations to be solved are

$$x' = \frac{h_1 x + h_2 y + h_3}{h_7 x + h_8 y + 1}, \qquad y' = \frac{h_4 x + h_5 y + h_6}{h_7 x + h_8 y + 1}$$

alternatively, a constraint is added to the homography matrix H so that its modulus equals 1, as follows:

$$\|H\| = \sqrt{h_1^{2} + h_2^{2} + \cdots + h_9^{2}} = 1$$

the equations to be solved are then

$$x' = \frac{h_1 x + h_2 y + h_3}{h_7 x + h_8 y + h_9}, \qquad y' = \frac{h_4 x + h_5 y + h_6}{h_7 x + h_8 y + h_9}, \qquad \text{s.t. } \|H\| = 1$$

from the four vertices of the projection area obtained above in the pixel coordinate system, their respective target coordinate points in the projection-scene coordinate system are defined, and the H matrix can then be solved from the four point correspondences:

$$\begin{bmatrix} x_i' \\ y_i' \\ 1 \end{bmatrix} \sim H \begin{bmatrix} x_i \\ y_i \\ 1 \end{bmatrix}, \qquad i = 1, 2, 3, 4$$

8. The binocular projection human-computer interaction method combined with portrait behavior information according to claim 7, characterized in that identifying interactive objects with the YOLOv3 target detection algorithm and obtaining the game coordinates mapped from the projection-area coordinates into the Unity3D development scene according to the solved homography transformation matrix specifically comprises:
the loss function of YOLOv3 is as follows:

$$
\begin{aligned}
Loss ={} & \lambda_{coord}\sum_{i=1}^{S^2}\sum_{j=1}^{B} I_{ij}^{obj}\Big[(x_{ij}-\hat{x}_{ij})^2+(y_{ij}-\hat{y}_{ij})^2+(w_{ij}-\hat{w}_{ij})^2+(h_{ij}-\hat{h}_{ij})^2\Big] \\
&+\sum_{i=1}^{S^2}\sum_{j=1}^{B} I_{ij}^{obj}\big(C_{ij}-\hat{C}_{ij}\big)^2+\lambda_{noobj}\sum_{i=1}^{S^2}\sum_{j=1}^{B} I_{ij}^{noobj}\big(C_{ij}-\hat{C}_{ij}\big)^2 \\
&+\sum_{i=1}^{S^2}\sum_{j=1}^{B} I_{ij}^{obj}\sum_{c=1}^{classes}\big(p_{ij}(c)-\hat{p}_{ij}(c)\big)^2
\end{aligned}
$$

where the first term is the coordinate error loss and λ_coord is the coordinate loss function coefficient; S denotes dividing the input image into S×S grids; B denotes the number of bounding boxes contained in one grid; I_ij^obj indicates whether the j-th box of the i-th grid contains an object, taking the value 1 when it does and 0 when it does not; x, y denote the centre coordinates of a box and w, h its width and height, with r_ij and r̂_ij denoting the x, y, w, h of the predicted box and the real box respectively; the second and third terms are the confidence loss, where I_ij^noobj indicates whether the j-th box of the i-th grid contains no object, taking the value 1 when it does not and 0 when it does, λ_noobj balances the loss weights of grids with and without objects so as to reduce the confidence loss of grid boxes containing no object, and C_ij and Ĉ_ij denote the predicted and real confidences of the j-th box of the i-th grid; classes denotes the number of categories, and p_ij(c), p̂_ij(c) denote the predicted and real probabilities that the j-th box of the i-th grid belongs to a class-c object.

9. The binocular projection human-computer interaction method combined with portrait behavior information according to claim 8, characterized in that obtaining the depth information of the interactive objects through binocular camera ranging specifically comprises:
correcting the original images according to the calibration result, the two corrected images lying in the same plane and being parallel to each other;
matching pixel points between the two corrected images to obtain a disparity map;
calculating the depth of each pixel according to the matching result, thereby obtaining a depth map.

10. The binocular projection human-computer interaction method combined with portrait behavior information according to claim 9, characterized in that obtaining the skeletal key-point positions of the interactor through the Kinect camera and then virtualizing the character object and the corresponding interactive actions in the Unity3D software specifically comprises:
identifying the skeletal structure of the target person using the skeleton-tracking function of the Kinect somatosensory device, acquiring the depth data of the target person and the related information of the color image, displaying the obtained skeletal structure information in real time and saving the person's joint-point data; and virtualizing a character object in the Unity3D software to imitate the interactive actions of the interactor.
CN202011642041.2A 2020-12-31 2020-12-31 Binocular projection man-machine interaction method combined with portrait behavior information Pending CN112657176A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011642041.2A CN112657176A (en) 2020-12-31 2020-12-31 Binocular projection man-machine interaction method combined with portrait behavior information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011642041.2A CN112657176A (en) 2020-12-31 2020-12-31 Binocular projection man-machine interaction method combined with portrait behavior information

Publications (1)

Publication Number Publication Date
CN112657176A true CN112657176A (en) 2021-04-16

Family

ID=75412210

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011642041.2A Pending CN112657176A (en) 2020-12-31 2020-12-31 Binocular projection man-machine interaction method combined with portrait behavior information

Country Status (1)

Country Link
CN (1) CN112657176A (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015024407A1 (en) * 2013-08-19 2015-02-26 国家电网公司 Power robot based binocular vision navigation system and method based on
CN107481267A (en) * 2017-08-14 2017-12-15 华南理工大学 A kind of shooting projection interactive system and method based on binocular vision
CN111354007A (en) * 2020-02-29 2020-06-30 华南理工大学 Projection interaction method based on pure machine vision positioning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHEN Pengyan et al.: "Kinect protagonist position detection and somatosensory interaction in Unity3D", Journal of Shenyang Ligong University *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113487668A (en) * 2021-05-25 2021-10-08 北京工业大学 Radius-unlimited learnable cylindrical surface back projection method
WO2022252239A1 (en) * 2021-05-31 2022-12-08 浙江大学 Computer vision-based mobile terminal application control identification method
CN113506210A (en) * 2021-08-10 2021-10-15 深圳市前海动竞体育科技有限公司 A method and video shooting device for automatically generating athlete bitmaps in basketball games
CN115061577A (en) * 2022-08-11 2022-09-16 北京深光科技有限公司 Hand projection interaction method, system and storage medium
CN115061577B (en) * 2022-08-11 2022-11-11 北京深光科技有限公司 Hand projection interaction method, system and storage medium

Similar Documents

Publication Publication Date Title
Memo et al. Head-mounted gesture controlled interface for human-computer interaction
US10732725B2 (en) Method and apparatus of interactive display based on gesture recognition
Rekimoto Matrix: A realtime object identification and registration method for augmented reality
CN112657176A (en) Binocular projection man-machine interaction method combined with portrait behavior information
CN104317391B (en) A kind of three-dimensional palm gesture recognition exchange method and system based on stereoscopic vision
CN107679537B (en) A Pose Estimation Algorithm for Objects in Untextured Space Based on Contour Point ORB Feature Matching
CN102508574B (en) Projection-screen-based multi-touch detection method and multi-touch system
US11308655B2 (en) Image synthesis method and apparatus
CN111401266B (en) Method, equipment, computer equipment and readable storage medium for positioning picture corner points
CN111354007B (en) A projection interaction method based on pure machine vision positioning
CN109359514B (en) A joint strategy method for gesture tracking and recognition for deskVR
CN108604379A (en) System and method for determining the region in image
CN104081307A (en) Image processing apparatus, image processing method, and program
Caputo et al. 3D Hand Gesture Recognition Based on Sensor Fusion of Commodity Hardware.
CN108734194A (en) A kind of human joint points recognition methods based on single depth map of Virtual reality
CN112912936B (en) Mixed reality system, program, mobile terminal device and method
CN113220114B (en) An embeddable non-contact elevator button interaction method integrated with face recognition
US20230351724A1 (en) Systems and Methods for Object Detection Including Pose and Size Estimation
JP2021520577A (en) Image processing methods and devices, electronic devices and storage media
CN108305321B (en) Three-dimensional human hand 3D skeleton model real-time reconstruction method and device based on binocular color imaging system
CN110147162A (en) An enhanced assembly teaching system based on fingertip features and its control method
CN118261985B (en) Three-coordinate machine intelligent positioning detection method and system based on stereoscopic vision
EP3309713B1 (en) Method and device for interacting with virtual objects
Kim et al. ThunderPunch: A bare-hand, gesture-based, large interactive display interface with upper-body-part detection in a top view
Shi et al. Error elimination method in moving target tracking in real-time augmented reality

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20210416