Disclosure of Invention
The invention discloses a binocular projection human-computer interaction method combined with portrait behavior information. It aims to carry out human-computer interaction using a deep learning method while combining the depth information of the interactive object measured by a binocular camera with the portrait behavior information of the interactor, so as to greatly improve the interactivity of a projection human-computer interaction system.
The invention is realized by at least one of the following technical schemes.
A binocular projection human-computer interaction method combined with portrait behavior information comprises the following steps:
acquiring image data by using a camera, and carrying out edge detection on a camera capturing area;
carrying out straight line detection on the image after edge detection through Hough straight line detection to realize area positioning of a projection area;
solving the mapping relation under the view angle transformation through the homography transformation matrix;
identifying the interactive objects by using a YOLOv3 target detection algorithm, and obtaining game coordinates mapped to the Unity3D development scene from the projection area coordinates according to the solved homography transformation matrix;
obtaining depth information of an interactive object through binocular camera ranging;
and obtaining the positions of the skeletal key points of the interactor, virtualizing a character object, and generating corresponding interactive actions according to the distribution of the character's joint points.
Preferably, the camera image is converted to grayscale before edge detection; an edge is a set of points in the image whose brightness changes markedly, and a large-scale Canny filter operator is adopted to detect the edges of the image.
Preferably, when edge detection is performed, a Gaussian filter is first used to smooth the image and filter out noise;
a Gaussian kernel of size (2k+1) × (2k+1) is set by the following formula:
where k is a positive integer, i, j ∈ [1, 2k+1], and σ² is the variance of the Gaussian function; letting σ = 1.4 and k = 1 yields the Gaussian convolution kernel:
convolving the Gaussian kernel with a gray image to obtain a smooth image;
calculating the gradient strength and direction of each pixel point in the image, and utilizing Sobel operators in the horizontal direction and the vertical direction:
where Sx is the Sobel operator in the horizontal direction and Sy is the Sobel operator in the vertical direction; each is convolved with the smoothed image to obtain the first derivatives Gx and Gy of each pixel point in the two directions, from which the gradient of the pixel point is calculated:
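The formulas referenced above are not reproduced in this text. For reference, the standard forms assumed here (a sketch of the usual Canny pipeline, not quoted from the original) are the Gaussian kernel, the horizontal and vertical Sobel operators, and the gradient magnitude and direction:

$$
H_{ij}=\frac{1}{2\pi\sigma^{2}}\exp\!\left(-\frac{(i-k-1)^{2}+(j-k-1)^{2}}{2\sigma^{2}}\right),\qquad i,j\in[1,2k+1]
$$

$$
S_x=\begin{bmatrix}-1&0&1\\-2&0&2\\-1&0&1\end{bmatrix},\qquad
S_y=\begin{bmatrix}-1&-2&-1\\0&0&0\\1&2&1\end{bmatrix}
$$

$$
G=\sqrt{G_x^{2}+G_y^{2}},\qquad \theta=\arctan\!\left(\frac{G_y}{G_x}\right)
$$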
preferably, the non-maxima suppression comprises the steps of:
1) comparing the gradient strength of the current pixel with that of the two pixels along the positive and negative gradient directions;
2) if the gradient strength of the current pixel is the largest of the three, retaining the pixel as an edge point; otherwise, suppressing it.
Preferably, performing line detection on the edge-detected image through Hough line detection to locate the projection area specifically comprises:
for a straight line y = kx + b on a Cartesian coordinate system, where (x, y) denotes a coordinate point in that coordinate system, k denotes the slope of the line and b its intercept, the line is rewritten as b = y − xk; defining the abscissa in Hough space as k and the ordinate as b, then b = y − xk is a straight line in Hough space with slope −x and intercept y. Several points (x1, y1), (x2, y2), …, (xn, yn) on the same straight line in the Cartesian coordinate system correspond to several straight lines in Hough space, and the common intersection point (k, b) of these lines gives the slope and intercept of that straight line in the Cartesian coordinate system;
performing the Hough transform in polar form: a straight line is expressed by the polar equation ρ = x cos θ + y sin θ, where ρ is the polar distance, i.e., the distance from the origin to the line in the polar coordinate space, and θ is the polar angle, i.e., the angle between the x axis and the line segment through the origin perpendicular to the line. Defining the abscissa in Hough space as θ and the ordinate as ρ, the coordinates (x1, y1), (x2, y2), …, (xn, yn) of several points on the same straight line in the polar coordinate system correspond to several curves in Hough space, and the common intersection point (θ, ρ) of these curves gives the polar angle and polar distance of that straight line in the polar coordinate system;
calculating the pairwise intersection points of the four longest boundary lines of the projection region to obtain the four vertex coordinates (xlt, ylt), (xlb, ylb), (xrb, yrb), (xrt, yrt).
Preferably, the solving of the mapping relationship under the view angle transformation through the homography transformation matrix specifically includes:
setting the x'-y' plane to be perpendicular to the Z axis of the X-Y-Z space coordinate system and to intersect the Z axis at the point (0, 0, 1); that is, a point (x', y') in the x'-y' plane coordinate system is the point (x', y', 1) in the X-Y-Z space coordinate system. The mapping relation between the x-y plane coordinate system and the X-Y-Z space coordinate system is described by a homography matrix H:
where h1 to h9 are the 9 transformation parameters of the homography matrix; the mapping relation from the x-y plane coordinate system to the x'-y' plane coordinate system is then obtained as:
the H matrix has 9 transformation parameters but in fact only 8 degrees of freedom; multiplying the H matrix by a scaling factor k:
that is, k·H and H represent the same mapping relation, so H has only 8 degrees of freedom, and the homography matrix H is solved either by adding a norm constraint to H or by setting h9 = 1.
Preferably, h9 is set to 1, and the equation to be solved is as follows:
Alternatively, the homography matrix H is constrained to have unit norm, as follows:
the equation to be solved is then:
The target coordinates, in the projection scene coordinate system, of the four projection-area vertices obtained in the pixel coordinate system are defined, and the H matrix is solved accordingly:
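The homography matrix and mapping equations referenced above are likewise not reproduced in this text. The standard forms consistent with the description (a sketch, with s denoting the projective scale factor) are:

$$
H=\begin{bmatrix}h_1&h_2&h_3\\h_4&h_5&h_6\\h_7&h_8&h_9\end{bmatrix},\qquad
s\begin{bmatrix}x'\\y'\\1\end{bmatrix}=H\begin{bmatrix}x\\y\\1\end{bmatrix},
$$

so that

$$
x'=\frac{h_1x+h_2y+h_3}{h_7x+h_8y+h_9},\qquad
y'=\frac{h_4x+h_5y+h_6}{h_7x+h_8y+h_9}.
$$

Each of the four vertex correspondences contributes two such equations, giving eight equations for the eight unknowns once h9 = 1 (or the unit-norm constraint) is imposed.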
preferably, the identifying the interactive object by using the YOLOv3 target detection algorithm, and obtaining the game coordinate mapped from the projection area coordinate to the Unity3D development scene according to the solved homography transformation matrix specifically includes:
the loss function of YOLOv3 is as follows:
where the first term is the coordinate error loss, λ
coordIs a coordinate loss function coefficient; s denotes dividing an input image intoS × S grids; b represents the number of frames included in one mesh;
whether the jth frame of the ith grid contains an object or not is represented, the containing time value is 1, and the non-containing time value is 0; x and y respectively represent the center coordinates of the frame; w and h respectively represent the length and width of the frame; r is
ij、
X, y, w, h representing the prediction box and the real box, respectively; the second term and the third term are confidence loss,
whether the jth frame of the ith grid does not contain an object or not is represented, the value of the non-containing time is 1, and the value of the containing time is 0; lambda [ alpha ]
noobjTo balance the loss weights of object-bearing and object-free meshes, the goal is to reduce the confidence loss of the mesh borders without objects; c
ijAnd
respectively representing the predicted and real confidence coefficients of the jth frame of the ith grid; classes represents the number of categories; p is a radical of
ij(c),
And the prediction probability and the real probability of the jth frame of the ith grid belonging to the class c object are shown.
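The loss formula itself is not reproduced in this text. A representative squared-error form consistent with the term-by-term description above (a hedged reconstruction, not quoted from the original) is:

$$
\begin{aligned}
L={}&\lambda_{coord}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[(x_{ij}-\hat{x}_{ij})^{2}+(y_{ij}-\hat{y}_{ij})^{2}+(w_{ij}-\hat{w}_{ij})^{2}+(h_{ij}-\hat{h}_{ij})^{2}\right]\\
&+\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\,(C_{ij}-\hat{C}_{ij})^{2}
+\lambda_{noobj}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{noobj}\,(C_{ij}-\hat{C}_{ij})^{2}\\
&+\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\sum_{c\in classes}\left(p_{ij}(c)-\hat{p}_{ij}(c)\right)^{2}
\end{aligned}
$$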
Preferably, the obtaining of the depth information of the interactive object through binocular camera ranging specifically includes:
correcting the original image according to the calibration result, wherein the two corrected images are positioned on the same plane and are parallel to each other;
matching pixel points of the two corrected images to obtain a disparity map;
and calculating the depth of each pixel according to the matching result, thereby obtaining a depth map.
Preferably, obtaining the positions of the interactor's skeletal key points through the Kinect camera, and virtualizing the character object and the corresponding interaction actions in the Unity3D software, specifically include:
identifying the skeletal structure of the target person by means of the skeleton tracking function of the Kinect motion-sensing device, acquiring the depth data of the target person and the related information of the color image, displaying the obtained skeletal structure information in real time, and saving the joint point data of the person; the interactivity of the human-computer interaction system is improved by virtualizing a character object in the Unity3D software to simulate the interactive actions of the interactor.
Compared with the prior art, the invention has the beneficial effects that: the target detection is carried out by a deep learning method, so that the interference of environmental factors can be avoided, the diversity of interactive scenes is ensured, and meanwhile, the interactivity of a human-computer interaction system is improved by combining the depth information of an interactive object acquired by a binocular camera and the human behavior information of an interactive person acquired by a Kinect camera.
Detailed Description
The invention is further described with reference to the following figures and specific embodiments.
As shown in fig. 2, a binocular projection human-computer interaction method combining portrait behavior information includes the following steps:
S1, performing edge detection on the camera capturing area through a large-scale Canny filter operator, and performing non-maximum suppression;
carrying out graying processing on the camera image;
an edge is a set of points in the image whose brightness changes markedly, and the gradient numerically reflects how fast this change is; on the principle that no part of the projection-area boundary may be missed, a large-scale Canny filter operator is adopted to detect the edges of the image;
using a Gaussian filter to smooth the image and filter out noise;
a Gaussian kernel of size (2k+1) × (2k+1) is set by the following formula:
where k is a positive integer, i, j ∈ [1, 2k+1], and σ² is the variance of the Gaussian function; letting σ = 1.4 and k = 1 yields the Gaussian convolution kernel:
convolving the Gaussian kernel with a gray image to obtain a smooth image;
calculating the gradient strength and direction of each pixel point in the image, and utilizing Sobel operators in the horizontal direction and the vertical direction:
where Sx is the Sobel operator in the horizontal direction and Sy is the Sobel operator in the vertical direction; each is convolved with the smoothed image to obtain the first derivatives Gx and Gy of each pixel point in the two directions, from which the gradient of the pixel point is calculated:
non-maxima suppression is applied to eliminate spurious responses due to edge detection:
for each pixel on the obtained gradient image, whether the point should be kept or eliminated cannot be decided by a single threshold alone; since the final edge image is expected to describe the source image contours accurately, non-maximum suppression is required:
1) comparing the gradient strength of the current pixel with that of the two pixels along the positive and negative gradient directions;
2) if the gradient strength of the current pixel is the largest of the three, retaining the pixel as an edge point; otherwise, suppressing it;
as the convolution kernel scale increases, the detected edges become more pronounced; on the principle that the boundary of the projection area must not be missed, a large-scale Canny operator is adopted to detect the edges of the image.
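A minimal sketch of this edge-detection step, assuming OpenCV is used on the Python side; the file name, thresholds, and aperture size are illustrative assumptions, not values from the original:

```python
import cv2

# Load the camera frame and convert it to grayscale.
frame = cv2.imread("capture.png")          # illustrative file name
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# Smooth with a Gaussian filter before edge detection (3x3 kernel and
# sigma = 1.4 correspond to the k = 1, sigma = 1.4 setting above).
blurred = cv2.GaussianBlur(gray, (3, 3), 1.4)

# Canny edge detection; cv2.Canny applies Sobel gradients, non-maximum
# suppression and double thresholding internally. A larger aperture
# (Sobel kernel) size plays the role of the "large-scale" operator.
edges = cv2.Canny(blurred, threshold1=50, threshold2=150, apertureSize=5)

cv2.imwrite("edges.png", edges)
```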
S2, carrying out straight line detection on the image after edge detection through Hough straight line detection to realize area positioning of a projection area;
Hough line detection maps each point on the Cartesian coordinate system to a straight line in Hough space by using the point-line duality between the Cartesian coordinate system and Hough space, so that an intersection point shared by several straight lines in Hough space corresponds to a straight line passing through several points in the Cartesian coordinate system, as shown in fig. 3.
Specifically, for a straight line y = kx + b on a Cartesian coordinate system, where (x, y) denotes a coordinate point in that coordinate system, k denotes the slope of the line and b its intercept, the line is rewritten as b = y − xk. Defining the abscissa in Hough space as k and the ordinate as b, then b = y − xk is a straight line in Hough space with slope −x and intercept y. Several points (x1, y1), (x2, y2), …, (xn, yn) on the same straight line in the Cartesian coordinate system correspond to several straight lines in Hough space, and the common intersection point (k, b) of these lines gives the slope and intercept of that straight line in the Cartesian coordinate system.
Because the slope of a vertical line in the image cannot be calculated, the Hough transform is typically performed in polar form. Specifically, a straight line is expressed by the polar equation ρ = x cos θ + y sin θ, where ρ is the polar distance, i.e., the distance from the origin to the line in the polar coordinate space, and θ is the polar angle, i.e., the angle between the x axis and the line segment through the origin perpendicular to the line. Defining the abscissa in Hough space as θ and the ordinate as ρ, the coordinates (x1, y1), (x2, y2), …, (xn, yn) of several points on the same straight line in the polar coordinate system correspond to several curves in Hough space, and the common intersection point (θ, ρ) of these curves gives the polar angle and polar distance of that straight line in the polar coordinate system, as shown schematically in fig. 4.
Calculating the pairwise intersection points of the four longest boundary lines of the projection region yields the four vertex coordinates of the projection area, namely the top-left, bottom-left, bottom-right and top-right points (xlt, ylt), (xlb, ylb), (xrb, yrb), (xrt, yrt).
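A sketch of this Hough-based location of the projection area, assuming OpenCV and reusing the edge image from the previous sketch; the assumption that the four strongest lines are the borders, and the threshold values, are simplifications for illustration:

```python
import cv2
import numpy as np

def line_intersection(l1, l2):
    """Intersection of two lines given in polar form (rho, theta)."""
    r1, t1 = l1
    r2, t2 = l2
    A = np.array([[np.cos(t1), np.sin(t1)],
                  [np.cos(t2), np.sin(t2)]])
    b = np.array([r1, r2])
    x, y = np.linalg.solve(A, b)   # solves x*cos(t) + y*sin(t) = rho for both lines
    return x, y

edges = cv2.imread("edges.png", cv2.IMREAD_GRAYSCALE)

# Standard Hough transform in polar form; returns (rho, theta) pairs.
lines = cv2.HoughLines(edges, rho=1, theta=np.pi / 180, threshold=120)
borders = [l[0] for l in lines[:4]]   # assume the 4 strongest lines are the borders

# Pairwise intersections of non-parallel border lines give the four vertices.
vertices = []
for i in range(4):
    for j in range(i + 1, 4):
        if abs(borders[i][1] - borders[j][1]) > np.pi / 6:   # skip near-parallel pairs
            vertices.append(line_intersection(borders[i], borders[j]))
print(vertices)   # roughly (x_lt, y_lt), (x_lb, y_lb), (x_rb, y_rb), (x_rt, y_rt)
```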
S3, solving the mapping relation under the view angle transformation through the homography transformation matrix; the homography transformation diagram is shown in figure 5.
Homography transformations reflect the process of mapping from one two-dimensional plane to three-dimensional space, and then from three-dimensional space to another two-dimensional plane. The homography transformation describes nonlinear transformation between two coordinate systems, so that the homography transformation has wide application in the fields of image splicing, image correction, augmented reality and the like.
X-Y-Z is a three-dimensional space coordinate system, which can be understood as the world coordinate system; x-y is the pixel plane coordinate system; and x'-y' is the plane coordinate system of the projection scene. The homography transform can be described as follows: a point (x, y) in the x-y coordinate system corresponds to a straight line l passing through the origin and that point in the X-Y-Z coordinate system:
the straight line intersects the x '-y' coordinate system plane at point (x ', y'), and the process from point (x, y) to point (x ', y') is referred to as a homography transformation.
The solving process of the homography transformation is as follows:
Let the x'-y' plane be perpendicular to the Z axis of the X-Y-Z space coordinate system and intersect the Z axis at the point (0, 0, 1); that is, a point (x', y') in the x'-y' plane coordinate system is the point (x', y', 1) in the X-Y-Z space coordinate system. The mapping relation between the x-y plane coordinate system and the X-Y-Z space coordinate system is described by a homography matrix H:
where h1 to h9 are the 9 transformation parameters of the homography matrix; the mapping relation from the x-y plane coordinate system to the x'-y' plane coordinate system is then obtained as:
The H matrix has 9 transformation parameters but in fact only 8 degrees of freedom; multiplying the H matrix by a scaling factor k:
that is, k·H and H represent the same mapping relation, so H has only 8 degrees of freedom. One way is to set h9 to 1, and the equation to be solved is:
Another approach is to constrain the homography matrix H to have unit norm, as follows:
the equation to be solved is then:
The target coordinates, in the projection scene coordinate system, of the four projection-area vertices obtained in the pixel coordinate system are defined, and the H matrix is solved accordingly:
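A sketch of solving the homography from the four vertex correspondences, assuming OpenCV; the vertex and target coordinates shown are illustrative values, not from the original:

```python
import cv2
import numpy as np

# Four projection-area vertices in the pixel coordinate system (illustrative).
src = np.float32([[102, 85], [98, 612], [873, 620], [880, 90]])
# Their target coordinates in the projection scene coordinate system (illustrative).
dst = np.float32([[0, 0], [0, 720], [1280, 720], [1280, 0]])

# With exactly four correspondences the 8 unknowns of H (h9 normalised to 1)
# are determined directly.
H = cv2.getPerspectiveTransform(src, dst)

# Map an arbitrary pixel coordinate into the projection scene coordinate system.
point = np.float32([[[400, 300]]])
mapped = cv2.perspectiveTransform(point, H)
print(H, mapped)
```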
S4, identifying the interactive objects by means of convolutional neural network technology, using the YOLOv3 target detection algorithm, and obtaining the game coordinates mapped from the projection area coordinates to the Unity3D development scene according to the solved homography transformation matrix;
S5, obtaining depth information of the interactive object through binocular camera ranging;
further, interactive objects are identified by using a YOLOv3 target detection algorithm, and a game coordinate process of mapping the projection area coordinates to the Unity3D development scene is obtained according to the solved homography transformation matrix.
Locating the coordinates of the arrow position in the image is a typical object detection problem. Current target detection research in the convolutional neural network field falls mainly into two classes of algorithms, two-stage and one-stage: a two-stage target detection algorithm first computes candidate regions from the input image and then classifies and refines the candidate regions with a convolutional neural network; representatives of this class are the R-CNN series, SPP-Net, and the like. A one-stage target detection algorithm casts the detection and localization of the target as a regression problem and realizes end-to-end target detection directly with a single convolutional neural network; representatives are the YOLO series, SSD, Retina-Net, and the like. The two classes have their respective advantages and disadvantages: two-stage algorithms are superior in accuracy and precision, while one-stage algorithms have a clear advantage in speed.
The YOLOv3 loss function is designed as follows:
where the first term is the coordinate error loss; λcoord is the coordinate loss coefficient; S denotes that the input image is divided into S × S grid cells; B denotes the number of bounding boxes contained in one grid cell; the indicator for the j-th bounding box of the i-th grid cell takes the value 1 if the box contains an object and 0 otherwise; x and y denote the center coordinates of a bounding box, and w and h its width and height, with the predicted box and the ground-truth box each contributing their own x, y, w, h. The second and third terms are the confidence loss; the complementary indicator takes the value 1 if the j-th bounding box of the i-th grid cell contains no object and 0 if it does; λnoobj balances the loss weights of grid cells with and without objects, its purpose being to reduce the confidence loss contributed by boxes that contain no object; Cij and its ground-truth counterpart denote the predicted and real confidences of the j-th bounding box of the i-th grid cell. classes denotes the number of categories, and pij(c) and its ground-truth counterpart denote the predicted and real probabilities that the object in the j-th bounding box of the i-th grid cell belongs to class c.
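A sketch of how a detected bounding box, as produced by the YOLOv3 detector, could be mapped through the solved homography into Unity3D game coordinates; the normalisation to scene units and the bottom-left axis convention are assumptions about the game scene setup, not part of the original:

```python
import numpy as np
import cv2

def to_game_coords(box, H, scene_w=1280, scene_h=720):
    """Map the center of a detected bounding box (x_min, y_min, x_max, y_max),
    given in camera pixel coordinates, through the homography H into the
    projection scene, then normalise to Unity3D scene units (assumed convention)."""
    cx = (box[0] + box[2]) / 2.0
    cy = (box[1] + box[3]) / 2.0
    pt = np.float32([[[cx, cy]]])
    sx, sy = cv2.perspectiveTransform(pt, H)[0, 0]
    # Normalise to [0, 1] and flip y so the origin matches Unity3D's
    # bottom-left convention (an assumption, not stated in the original).
    return sx / scene_w, 1.0 - sy / scene_h

# Example: a box reported by the detector, mapped with an identity homography.
H = np.eye(3)
print(to_game_coords((380, 250, 420, 300), H))
```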
S6, obtaining the positions of the interactor's skeletal key points through a Kinect camera, virtualizing the character object in the Unity3D software, and generating corresponding interactive actions according to the distribution of the character's joint points.
Calibrating the binocular cameras to obtain internal and external parameters and distortion coefficients of the two cameras;
the purpose of camera calibration is as follows: first, to restore the real world position of the object imaged by the camera, it is necessary to know how the world object is transformed into the computer image plane, i.e. to solve the internal and external parameter matrix.
Second, the perspective projection of the camera has a significant problem — distortion. Another purpose of camera calibration is to solve distortion coefficients and then use them for image rectification.
Correcting the original image according to the calibration result, wherein the two corrected images are positioned on the same plane and are parallel to each other;
the main task of the binocular camera system is distance measurement, and the parallax distance measurement formula is derived under the ideal condition of the binocular system, but in the real binocular stereo vision system, two camera image planes which are completely aligned in a coplanar line do not exist, as shown in the attached figure 6, wherein p is a certain point on an object to be measured, and O is a certain point on the object to be measured1And O2The optical centers of the two cameras are respectively, so that the three-dimensional correction is carried out, namely, two images which are not in coplanar line alignment in practice are corrected into coplanar line alignment (the coplanar line alignment is that when two camera image planes are on the same plane and the same point is projected to the two camera image planes, the same line of two pixel coordinate systems is needed), the actual binocular system is corrected into an ideal binocular system, as shown in figure 7, wherein p is a certain point on an object to be measured, and O is a certain point on the object to be measuredRAnd OTThe optical centers of the two cameras are respectively, the imaging points of the point P on the photoreceptors of the two cameras are P and P', f is the focal length of the cameras, B is the center distance of the two cameras, and X isRAnd XTThe distances from the imaging points on the image planes of the left camera and the right camera to the left edge of the image plane respectively, and z is the required depth information.
Matching pixel points between the two corrected images, i.e., finding the corresponding image points of the same scene point in the left and right views, to obtain a disparity map;
calculating the depth of each pixel according to the matching result, thereby obtaining a depth map;
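A sketch of the rectification, matching, and depth steps, assuming OpenCV; all calibration values, image files, and matcher parameters below are illustrative placeholders, not taken from the original:

```python
import cv2
import numpy as np

# Illustrative calibration values; in practice these come from stereo calibration.
image_size = (1280, 720)
K1 = K2 = np.array([[800.0, 0, 640], [0, 800.0, 360], [0, 0, 1]])
d1 = d2 = np.zeros(5)
R = np.eye(3)                          # relative rotation between the cameras
T = np.array([[-60.0], [0.0], [0.0]])  # baseline of 60 mm (assumed units)

left_img = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right_img = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Stereo rectification: makes the two image planes coplanar and row-aligned.
R1, R2, P1, P2, Q, _, _ = cv2.stereoRectify(K1, d1, K2, d2, image_size, R, T)
map1x, map1y = cv2.initUndistortRectifyMap(K1, d1, R1, P1, image_size, cv2.CV_32FC1)
map2x, map2y = cv2.initUndistortRectifyMap(K2, d2, R2, P2, image_size, cv2.CV_32FC1)
left_rect = cv2.remap(left_img, map1x, map1y, cv2.INTER_LINEAR)
right_rect = cv2.remap(right_img, map2x, map2y, cv2.INTER_LINEAR)

# Stereo matching: semi-global block matching produces a disparity map.
matcher = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128, blockSize=5)
disparity = matcher.compute(left_rect, right_rect).astype(np.float32) / 16.0

# Depth from disparity: z = f * B / d, or equivalently reproject with the
# Q matrix returned by stereoRectify to obtain a full depth map.
depth = cv2.reprojectImageTo3D(disparity, Q)[:, :, 2]
```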
further, the obtaining of the skeletal key point position of the interactor through the Kinect camera, and then the virtualization of the character object and the corresponding interactive action process in the Uinty3D software specifically include:
identifying the skeletal structure of the target person using the skeleton tracking function of the Kinect motion-sensing device; the depth data of the target person and the related information of the color image are acquired directly from the Kinect sensor, the obtained skeletal structure information of the person is displayed in real time, and the 20 joint point data of the person are saved.
The interactivity of the human-computer interaction system is improved by virtualizing a character object in the Unity3D software and designing a number of corresponding actions, such as squatting, standing, and shooting postures, according to the character joint distribution information, so as to simulate the interactive actions of the interactor.
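A sketch of how the tracked joint positions could be turned into discrete interactive actions for the virtual character; the joint names follow the 20-joint Kinect skeleton, while the threshold values and the action set (squat, stand, shoot) are assumptions made for illustration:

```python
def classify_action(joints):
    """joints: dict mapping joint names to (x, y, z) positions in meters,
    as delivered by the Kinect skeleton tracker (y pointing up).
    Returns one of the action labels used to drive the virtual character."""
    head_y = joints["Head"][1]
    hip_y = joints["HipCenter"][1]
    hand_r = joints["HandRight"]
    shoulder_r = joints["ShoulderRight"]

    # A raised right hand extended forward at roughly shoulder height is
    # treated as a shooting posture (simplified heuristic).
    if hand_r[1] > shoulder_r[1] - 0.05 and abs(hand_r[2] - shoulder_r[2]) > 0.35:
        return "shoot"
    # A small head-to-hip height difference indicates a squat.
    if head_y - hip_y < 0.45:
        return "squat"
    return "stand"

# Example frame of joint data (illustrative values, in meters).
frame = {
    "Head": (0.0, 1.65, 2.0),
    "HipCenter": (0.0, 0.95, 2.0),
    "ShoulderRight": (0.2, 1.45, 2.0),
    "HandRight": (0.25, 1.5, 1.55),
}
print(classify_action(frame))   # the resulting label would then drive the Unity3D character
```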
The invention uses the projector 2 to project a mobile phone interface through the mobile phone end 1, the interactive object starts to perform interactive action, the binocular camera 3 and the Kinect camera 4 transmit images to the cloud server 5 through the network for data processing and analysis, and the result is transmitted back to the mobile phone end 1 for displaying the game scene.
In order that those skilled in the art will better understand the disclosure, the invention will be described in further detail with reference to the accompanying drawings and specific embodiments. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.