CN108521594B - Free viewpoint video playing method based on motion sensing camera gesture recognition - Google Patents

Free viewpoint video playing method based on motion sensing camera gesture recognition Download PDF

Info

Publication number
CN108521594B
CN108521594B (application CN201810246578.3A)
Authority
CN
China
Prior art keywords
gesture
viewpoint
palm
current frame
switching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810246578.3A
Other languages
Chinese (zh)
Other versions
CN108521594A (en)
Inventor
朱威
宋广振
王义锋
陈朋
郑雅羽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN201810246578.3A priority Critical patent/CN108521594B/en
Publication of CN108521594A publication Critical patent/CN108521594A/en
Application granted granted Critical
Publication of CN108521594B publication Critical patent/CN108521594B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/41Structure of client; Structure of client peripherals
    • H04N21/422Input-only peripherals, i.e. input devices connected to specially adapted client devices, e.g. global positioning system [GPS]
    • H04N21/42204User interfaces specially adapted for controlling a client device through a remote control device; Remote control devices therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • G06V40/28Recognition of hand or arm movements, e.g. recognition of deaf sign language

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Signal Processing (AREA)
  • User Interface Of Digital Computer (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a free viewpoint video playing method based on somatosensory camera gesture recognition, which comprises the steps of acquiring a frame of depth image and human skeleton data, preprocessing the depth image with a bilateral filtering method, calculating the right palm center position and the right arm position, obtaining the right palm contour, extracting the right palm contour features and the right arm position features to judge the gesture, and converting the recognized gesture into the corresponding viewpoint playing operation. Because the depth image is preprocessed with bilateral filtering, the generated depth image better retains the edge contour information in the image. A simple and fast gesture recognition method is designed, which effectively reduces the stutter experienced when switching viewpoints through user gestures. A new method for locally zooming the free viewpoint video picture in and out and a new method for switching viewpoints are designed: the viewpoint picture is locally zoomed in or out and the viewpoint is switched by recognizing the user's gesture, which gives the user a strong sense of immersion when watching free viewpoint video.

Description

Free viewpoint video playing method based on motion sensing camera gesture recognition
Technical Field
The invention relates to the field of free viewpoint video playing, in particular to a free viewpoint video playing method based on somatosensory camera gesture recognition.
Background
With the development of information technology, people's requirements for video playing have become more and more demanding: black-and-white video images have developed into today's color video images, and the original standard definition and high definition have developed into today's ultra and ultra-high definition. However, human vision has physiological limits, and the requirements on resolution and chroma are approaching saturation. High definition and stereoscopy are the two major development directions of video and images in the twenty-first century; high definition has basically been achieved, and stereoscopy is the future development direction. According to the mainstream classification, the display modes of stereoscopic video can be divided into naked-eye 3D display and VR display. Naked-eye 3D display can in turn be divided into two modes: the traditional mode uses a dedicated 3D display, while the other directly uses an ordinary display for free viewpoint playing. The display mode used in the invention plays free viewpoint video with an ordinary display and a somatosensory (motion sensing) camera. At present there are many developers of motion sensing cameras, and the mainstream manufacturers include Microsoft, Intel and Leap Motion. Intel has developed the RealSense motion sensing camera, which, compared with Microsoft's Kinect camera, has a relatively short recognition distance and focuses on short-range applications. Leap Motion developed the Leap Motion Controller, which is very small compared with the former two cameras and is only suitable for close-range gesture recognition and finger tracking. Microsoft's Kinect cameras comprise Kinect v1 and Kinect v2; Kinect v2 is a motion sensing camera based on the ToF (Time of Flight) principle, generates depth images of high quality, and provides human skeleton tracking, basic gesture operation and face tracking functions. The camera provides an SDK (software development kit) for development on the Windows platform, whose API can be called directly for skeleton tracking to obtain the three-dimensional coordinate information of the corresponding postures and skeleton points.
From the technical point of view, gesture recognition can be divided into vision-based gesture recognition and data-glove gesture recognition based on various sensors. The recognition accuracy of the data-glove technique is higher than that of the vision-based technique, but the special data-glove equipment is too expensive to be popularized in daily life. Vision-based gesture recognition can be completed with only a camera, is convenient to use and moderately priced, so the invention performs gesture recognition through a motion sensing camera. The general workflow of a vision-based gesture recognition system (see the survey of vision-based gesture recognition research [J], Acta Electronica Sinica, 2000(02): 118-) is as follows: first, a video data stream is acquired by one or more cameras; then the system detects, according to the interaction model of the gesture input, whether a gesture occurs in the data stream, and if so, the gesture is segmented from the video signal; next, a gesture model is selected for gesture analysis, which includes feature detection and model parameter estimation; in the recognition stage, the gestures are classified according to the model parameters and gesture descriptions are generated as required; finally, the system drives the specific application according to the generated description. The patent with application number 201710263069.7 mainly obtains the palm center and the arm center through a Kinect camera to extract the hand contour, then extracts the position features of the fingertips and finger roots, compares these features with those in a gesture template library to recognize the gesture, and finally puts the recognized gesture into a queue and compares it with the recognition results of the first four frames, taking the gesture that occurs most often as the final recognition result. The final recognition result of this method is accurate, but its processing delay is large, so it is not suitable for real-time switching of free viewpoints. The patent with application number 201610711131.X obtains the depth image data and skeleton data in a gesture video through a Kinect camera, performs preliminary processing on the obtained depth image with median filtering, then performs gesture detection and hand position tracking through the HandTracker class in the NITE library, extracts the contour information of the palm through edge detection, and finally recognizes the gesture by matching against multiple gestures in a decision-tree gesture library. This gesture recognition method has complicated steps and high computational complexity, which is not conducive to real-time switching of the free viewpoint.
Disclosure of Invention
In order to enhance the user's interactivity and sense of participation while a free viewpoint video is playing, and thereby bring the user a better 3D viewing experience, the invention provides a free viewpoint video playing method based on somatosensory camera gesture recognition, which locally zooms the viewpoint picture in and out and switches viewpoints by recognizing the user's gestures. The method comprises the following steps:
(1) acquiring a frame of depth image and human skeleton data
A frame of depth image data I1 and the corresponding human skeleton data S1 are acquired in real time through the motion sensing camera, where I1 is w pixels wide and h pixels high, and S1 contains the three-dimensional skeleton coordinate data of three points: the right arm (elbow) joint, the right wrist joint and the right palm. The acquired frame data are the current frame data.
(2) Depth image preprocessing using bilateral filtering
The depth image I1 obtained in step (1) is preprocessed with a bilateral filtering method, and the bilaterally filtered depth image is denoted I'1. The bilateral filter has the characteristic of better retaining the edge information of the depth image while filtering out noise.
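As an illustration of this preprocessing step, the following minimal C++/OpenCV sketch applies a bilateral filter to an 8-bit depth map; the kernel diameter and the two sigma values are example parameters assumed here, not values specified by the invention.

#include <opencv2/opencv.hpp>

// Minimal sketch: bilateral filtering of a depth frame (assumed already
// converted from the 16-bit Kinect depth map to an 8-bit gray image).
cv::Mat preprocessDepth(const cv::Mat& depth8u)
{
    cv::Mat filtered;
    // d = 9: neighbourhood diameter; sigmaColor = 50: how strongly pixels with
    // differing gray values are mixed; sigmaSpace = 50: spatial extent.
    cv::bilateralFilter(depth8u, filtered, 9, 50.0, 50.0);
    return filtered;   // corresponds to I'1 in the description
}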
(3) Calculating right palm center position and right arm position using human skeletal data
Because the invention switches viewpoints by recognizing user gestures while the free viewpoint video is playing, and the existing methods for calculating the palm center position and the arm position are too complex, a fast calculation method for the palm center position and the arm position is used. The invention directly takes the three-dimensional coordinates of the right palm joint in S1 as the right palm center position coordinates Hc(x1, y1, z1); the right arm position is given by the line connecting the right arm (elbow) joint point Ar(x2, y2, z2) and the right wrist joint point Wr(x3, y3, z3).
(4) Obtaining the outline of the right palm according to the center position of the right palm and the preprocessed depth image
Hc(x1, y1, z1) is mapped to the corresponding position in the depth image I'1, giving H'c(x'1, y'1, z'1). In the practical application scenario of gesture recognition, the hand is closest to the motion sensing camera, i.e. the gray value of the hand area is smaller than that of the other areas and the difference is obvious. Taking the gray value G1 of H'c(x'1, y'1, z'1) as reference, a rectangular area of the depth image I'1 centered on H'c(x'1, y'1, z'1), with width w' pixels and height h' pixels, where w' = w/4 and h' = h/3, is searched; within this rectangular area, the pixels whose gray values lie in [G1 - 10, G1 + 10] are extracted, and this area is marked H. Finally, the gray value of area H is set to 255 and the gray values of the remaining background areas are set to 0 to obtain the right palm contour.
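A minimal sketch of this region extraction is shown below, assuming the palm center has already been mapped to pixel (cx, cy) of the filtered 8-bit depth image; the function name and intermediate variables are introduced here for illustration only.

#include <algorithm>
#include <opencv2/opencv.hpp>

// Sketch of step (4): threshold a w/4 x h/3 window around the mapped palm
// centre to the gray range [G1-10, G1+10] and binarise it (palm = 255).
cv::Mat extractPalmMask(const cv::Mat& depth, int cx, int cy)
{
    const int w = depth.cols, h = depth.rows;
    cv::Rect roi(cx - w / 8, cy - h / 6, w / 4, h / 3);   // w' = w/4, h' = h/3
    roi &= cv::Rect(0, 0, w, h);                          // clip to the image

    const int g = depth.at<uchar>(cy, cx);                // reference gray value G1
    const int lo = std::max(g - 10, 0), hi = std::min(g + 10, 255);

    cv::Mat mask = cv::Mat::zeros(depth.size(), CV_8UC1), roiMask;
    cv::inRange(depth(roi), cv::Scalar(lo), cv::Scalar(hi), roiMask);
    roiMask.copyTo(mask(roi));                            // region H = 255, rest = 0
    return mask;
}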
(5) Extracting the outline characteristic of the right palm and the position characteristic of the right arm
(5-1) Features of region H are extracted. Convex hull detection is performed on region H; if region H is a polygonal area with five convex points and four concave points, processing jumps to step (5-2) for further processing, otherwise it jumps to step (5-3).
(5-2) Region H is further examined to discriminate the fingertips. All convex and concave points in region H are numbered alternately in sequence as z[n], n = 0, 1, ..., 8, giving a single path from z[0] to z[8]. The convex points (fingertips) and concave points (finger roots) in the right palm contour are distinguished by judging whether the path from one point to the next runs downward or upward: if the direction from the current point to the next point is downward, the point is a fingertip; if it is upward, the point is a finger root. All fingertip positions, together with the previously determined right palm center position, are taken as the right palm contour features. The Euclidean distance xn from the first right-hand fingertip to the right palm center and the Euclidean distance zn from the second right-hand fingertip to the right palm center are calculated, and processing jumps to step (6-1) to further judge whether the gesture is a zoom-in or zoom-out gesture.
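The following sketch illustrates one plausible reading of this fingertip/finger-root discrimination, assuming the nine alternating hull points are already ordered along the contour and that image y coordinates grow downward (so a downward path means increasing y); the structure and function names are assumptions made for illustration.

#include <vector>
#include <opencv2/opencv.hpp>

struct PalmFeature { std::vector<cv::Point> tips, roots; };

// Classify each hull point z[i] (i = 0..7) by the direction of the path to
// z[i+1]: path going down -> fingertip, path going up -> finger root.  The
// last point z[8] is not classified by this loop.
PalmFeature classifyHullPoints(const std::vector<cv::Point>& z)
{
    PalmFeature f;
    for (size_t i = 0; i + 1 < z.size(); ++i) {
        if (z[i + 1].y > z[i].y) f.tips.push_back(z[i]);   // downward path
        else                     f.roots.push_back(z[i]);  // upward path
    }
    return f;
}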
(5-3) Using the current-frame point Ar(x2, y2, z2) and point Wr(x3, y3, z3), the angle θ between the right arm and the spatial X axis representing the horizontal rightward direction is extracted. The Euclidean distance d between the two points, i.e. their actual distance in three-dimensional space, is obtained with formula (1):

d = √((x3 − x2)² + (y3 − y2)² + (z3 − z2)²)    (1)

Then, from the distance |x3 − x2| of points Wr(x3, y3, z3) and Ar(x2, y2, z2) along the spatial X axis and the Euclidean distance d, the angle θ between the right arm of the current frame and the spatial X axis representing the horizontal rightward direction is solved, and processing jumps to step (6-2) for further judgment.
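A small sketch of this computation under stated assumptions: d is the three-dimensional distance of formula (1), and θ is taken as the angle between the elbow-to-wrist vector and the horizontal X axis. The signed difference x3 − x2 (rather than its absolute value) is used here so that θ can span 0° to 180°, which the later thresholds [100°, 160°] appear to require; this reading is an assumption, not wording taken from the description.

#include <cmath>

struct Point3 { double x, y, z; };

// Angle (in degrees) between the elbow->wrist vector and the horizontal X axis.
// Assumes the two points are distinct (d > 0).
double armAngleDeg(const Point3& Ar, const Point3& Wr)
{
    const double dx = Wr.x - Ar.x, dy = Wr.y - Ar.y, dz = Wr.z - Ar.z;
    const double d  = std::sqrt(dx * dx + dy * dy + dz * dz);  // formula (1)
    const double pi = std::acos(-1.0);
    return std::acos(dx / d) * 180.0 / pi;                     // 0..180 degrees
}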
(6) Judging the gesture by using the outline characteristic of the right palm and the position characteristic of the right arm
(6-1) The xn and zn of the current frame are compared with the Euclidean distance xm1 from the first right-hand fingertip to the right palm center and the Euclidean distance zm1 from the second right-hand fingertip to the right palm center in the frame m1 frames earlier, where m1 takes a value in the range [10, 60], to recognize the zoom-in and zoom-out gestures: if xn > xm1 and zn > zm1, the current-frame output gesture is judged to be a zoom-in gesture and processing jumps to step (7); if xn < xm1 and zn < zm1, the current-frame output gesture is judged to be a zoom-out gesture and processing jumps to step (7); if neither case holds, or the first and second right-hand fingertip data do not exist in the frame m1 frames earlier, it is judged that the current frame contains no zoom gesture and processing of the current frame ends.
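A compact sketch of this comparison, assuming the fingertip-to-palm distances of the most recent frames are kept in a small history buffer; the buffer, the Gesture enumeration and the default m1 value are assumptions introduced for illustration.

#include <deque>
#include <optional>

enum class Gesture { ZoomIn, ZoomOut };

struct FingerDist { double x, z; };   // distances of first/second fingertip to palm centre

// Compare the current distances with those recorded m1 frames earlier.
std::optional<Gesture> detectZoom(std::deque<FingerDist>& history,
                                  FingerDist current, std::size_t m1 = 30)
{
    std::optional<Gesture> g;
    if (history.size() >= m1) {
        const FingerDist& past = history[history.size() - m1];
        if (current.x > past.x && current.z > past.z)      g = Gesture::ZoomIn;
        else if (current.x < past.x && current.z < past.z) g = Gesture::ZoomOut;
    }
    history.push_back(current);
    if (history.size() > 60) history.pop_front();   // 60 = upper bound of m1
    return g;                                       // empty: no zoom gesture this frame
}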
(6-2) It is judged whether θ of the current frame belongs to [85°, 95°]; if so, processing jumps to step (6-3) to further judge the leftward and rightward viewpoint-switching gestures, otherwise processing of the current frame ends.
(6-3) The angle θm2 between the right arm and the spatial X axis representing the horizontal rightward direction in the frame m2 frames earlier is examined to recognize the leftward and rightward viewpoint-switching gestures, where m2 takes a value in the range [10, 60]. If θm2 satisfies the relation of formula (2), i.e. θm2 belongs to [20°, 80°],

20° ≤ θm2 ≤ 80°    (2)

the current-frame output gesture is judged to be a leftward viewpoint-switching gesture, the viewpoint switching span Vleft of the corresponding number of units is calculated, and processing jumps to step (7). If θm2 satisfies the relation of formula (3), i.e. θm2 belongs to [100°, 160°],

100° ≤ θm2 ≤ 160°    (3)

the current-frame output gesture is judged to be a rightward viewpoint-switching gesture, the viewpoint switching span Vright of the corresponding number of units is calculated, and processing jumps to step (7). If θm2 satisfies neither of the above cases, or θm2 does not exist in the frame m2 frames earlier, it is judged that the current frame contains no viewpoint-switching gesture and processing of the current frame data ends.
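A minimal sketch of this decision, assuming the arm angle of each processed frame is pushed into a history buffer by the caller; the SwitchGesture enumeration, the history container and the default m2 value are assumptions introduced for illustration.

#include <deque>
#include <optional>

enum class SwitchGesture { Left, Right };

// The current frame must hold the arm near vertical (theta in [85, 95] deg);
// the frame m2 frames earlier decides the switching direction.
std::optional<SwitchGesture> detectSwitch(const std::deque<double>& thetaHistory,
                                          double thetaNow, std::size_t m2 = 30)
{
    if (thetaNow < 85.0 || thetaNow > 95.0 || thetaHistory.size() < m2)
        return std::nullopt;                              // no switch gesture
    const double past = thetaHistory[thetaHistory.size() - m2];
    if (past >= 20.0  && past <= 80.0)  return SwitchGesture::Left;   // formula (2)
    if (past >= 100.0 && past <= 160.0) return SwitchGesture::Right;  // formula (3)
    return std::nullopt;
}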
(7) Converting the recognized gesture into corresponding viewpoint playing operation
(7-1) If the recognized gesture is a local-picture zoom-in gesture, the playing operation is realized as follows: the currently playing picture is paused, the spatial coordinates of the right palm center of the current gesture are mapped to the specific position in the image, the picture at that position is locally zoomed in by the corresponding functional module, and finally the zoomed-in picture is output to the display to complete the operation.
(7-2) If the recognized gesture is a local-picture zoom-out gesture, the playing operation is realized as follows: the currently playing picture is paused, the spatial coordinates of the right palm center of the current gesture are mapped to the specific position in the image, the picture at that position is locally zoomed out by the corresponding functional module, and finally the zoomed-out picture is output to the display to complete the operation.
(7-3) If the currently recognized gesture is a leftward viewpoint-switching gesture, a leftward switching signal is sent, a new viewpoint is generated by the free viewpoint synthesis technique, and the new viewpoint is finally output to the display to complete the operation.
(7-4) If the currently recognized gesture is a rightward viewpoint-switching gesture, a rightward switching signal is sent, a new viewpoint is generated by the free viewpoint synthesis technique, and the new viewpoint is finally output to the display to complete the operation.
(7-5) Processing of the current frame data ends.
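As a simple illustration of how the recognized gestures could be dispatched to the playing operations above, the following sketch maps each gesture to a player command; the Player interface and the zoom factors are assumptions introduced here and are not defined by the invention.

#include <iostream>

enum class Action { ZoomIn, ZoomOut, ViewLeft, ViewRight };

struct Player {
    void zoomAt(double px, double py, double factor) {            // pause + local zoom
        std::cout << "zoom x" << factor << " at (" << px << ", " << py << ")\n";
    }
    void switchViewpoint(int units, bool toLeft) {                 // free-viewpoint synthesis
        std::cout << "switch " << units << (toLeft ? " unit(s) left\n" : " unit(s) right\n");
    }
};

void dispatch(Player& p, Action a, double px, double py, int units)
{
    switch (a) {
    case Action::ZoomIn:    p.zoomAt(px, py, 2.0); break;          // example factor
    case Action::ZoomOut:   p.zoomAt(px, py, 0.5); break;
    case Action::ViewLeft:  p.switchViewpoint(units, true);  break;
    case Action::ViewRight: p.switchViewpoint(units, false); break;
    }
}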
The technical conception of the invention is as follows: first, considering that the quality of the depth image acquired by an ordinary motion sensing camera is poor, a bilateral filtering method is used to perform preliminary processing on the depth image after it is acquired, so as to obtain a better edge contour image; then the right palm center position and the right arm position are determined from the skeleton data of the motion sensing camera, the right palm contour is obtained from the right palm center position and the preprocessed depth image, and the right palm contour features and the right arm position features are further extracted; next, the zoom-in and zoom-out gestures for the free viewpoint video are judged from the right palm contour features, and the leftward and rightward viewpoint-switching gestures are judged from the angle between the line connecting the right arm positions and the horizontal rightward direction; finally, the recognized gesture signals are sent to the corresponding modules to realize the corresponding playing functions.
Compared with the prior art, the invention has the following beneficial effects:
the invention uses the bilateral filtering method to preprocess the depth image, so that the generated depth image better retains the edge contour information in the image. A simple and quick gesture recognition method is designed, and the blockage of the viewpoint switching by the user gesture is effectively reduced. A brand-new method for amplifying and reducing the local picture of the free viewpoint video and a method for switching viewpoints are designed, the local amplification and reduction of the viewpoint picture and the viewpoint switching are carried out by recognizing the gesture of a user, and a good blending feeling can be brought to the user when the user watches the free viewpoint video.
Drawings
FIG. 1 is a block flow diagram of the present invention;
FIG. 2 is a block diagram of a decision process of the present invention;
FIG. 3 is a schematic diagram of the rightward viewpoint-switching gesture in the present invention, which is formed by moving from the gesture in the left diagram to the gesture in the right diagram;
FIG. 4 is a schematic diagram of the leftward viewpoint-switching gesture in the present invention, which is formed by moving from the gesture in the left diagram to the gesture in the right diagram.
Detailed Description
The present invention will be described in detail below with reference to examples and drawings, but the present invention is not limited thereto.
The invention adopts a second-generation Microsoft Kinect motion sensing camera and a PC with an Intel Core i5-6500 @ 2.30 GHz processor as the hardware platform. The embodiment of the invention is a free viewpoint video playing method based on Kinect camera gesture recognition, which, as shown in FIG. 1, specifically comprises the following steps:
(1) acquiring a frame of depth image and human skeleton data;
(2) preprocessing the depth image by adopting a bilateral filtering method;
(3) calculating the center position of the right palm and the position of the right arm by using the human skeleton data;
(4) obtaining a right palm outline according to the center position of the right palm and the preprocessed depth image;
(5) extracting the outline characteristic of the right palm and the position characteristic of the right arm;
(6) judging and recognizing the gesture by using the outline characteristic of the right palm and the position characteristic of the right arm;
(7) and converting the recognized gesture into a corresponding viewpoint playing operation.
The step (1) specifically comprises the following steps:
(1-1) The relevant header files of the Kinect camera SDK are called together with OpenCV to obtain the video stream data in real time, and one frame of current depth image data I1 is obtained directly through the Depth function interface, where I1 is w pixels wide and h pixels high.
(1-2) The skeleton point data S1 are obtained through the drawSkeleton function; specifically, the three-dimensional skeleton coordinate data of the ELBOW_RIGHT, WRIST_RIGHT and HAND_RIGHT points of the human body are obtained in real time through the whichone variable.
The step (2) specifically comprises the following steps:
First, the depth image I1 is filtered with a bilateral filter: for each pixel, the gray values of its neighbouring points and their Euclidean distances to the central point are weighted and combined, and the filtered value of that pixel is obtained from this weighted combination. The depth image optimized by bilateral filtering is denoted I'1 and is used in the subsequent steps to acquire the right palm contour and the right arm position.
The step (3) specifically comprises the following steps:
the bone data S1The three-dimensional coordinates of the HAND _ RIGHT skeleton point in (1) are taken as the coordinates H of the center position of the RIGHT palmcPassing through ELBOW _ RIGHT point Ar(x2,y2,z2) And WRIST _ RIGHT point Wr(x3,y3,z3) The right arm position is calculated by connecting the two points.
The step (4) specifically comprises the following steps:
taking H'c(x’1,y’1,z’1) Gray value G of1For reference, search for matching depth image I'1H 'of'c(x’1,y’1,z’1) The method comprises the steps of obtaining a rectangular area with the center width of w 'pixel points and the center height of h' pixel points, wherein w 'is w/4, h' is h/3, and extracting a gray value in [ G ] from the rectangular area1-10,G1+10]The inner pixel area, this area is marked H. And finally setting the gray value of the H area as 255 and the gray values of the rest background areas as 0 to obtain the right palm profile.
The step (5) specifically comprises the following steps:
(5-1) Features of region H are extracted. Convex hull detection is performed on region H; if region H is a polygonal area with five convex points and four concave points, processing jumps to step (5-2) for further processing, otherwise it jumps to step (5-3).
(5-2) Region H is further examined to discriminate the fingertips. All convex and concave points in region H are numbered alternately in sequence as z[n], n = 0, 1, ..., 8, giving a single path from z[0] to z[8]. The convex points (fingertips) and concave points (finger roots) in the right palm contour are distinguished by judging whether the path from one point to the next runs downward or upward: if the direction from the current point to the next point is downward, the point is a fingertip; if it is upward, the point is a finger root. All fingertip positions, together with the previously determined right palm center position, are taken as the right palm contour features. The Euclidean distance xn from the first right-hand fingertip to the right palm center and the Euclidean distance zn from the second right-hand fingertip to the right palm center are calculated, and processing jumps to step (6-1) to further judge whether the gesture is a zoom-in or zoom-out gesture.
(5-3) Using the current-frame point Ar(x2, y2, z2) and point Wr(x3, y3, z3), the angle θ between the right arm and the spatial X axis representing the horizontal rightward direction is extracted. The Euclidean distance d between the two points, i.e. their actual distance in three-dimensional space, is obtained with formula (1):

d = √((x3 − x2)² + (y3 − y2)² + (z3 − z2)²)    (1)

Then, from the distance |x3 − x2| of points Wr(x3, y3, z3) and Ar(x2, y2, z2) along the spatial X axis and the Euclidean distance d, the angle θ between the right arm of the current frame and the spatial X axis representing the horizontal rightward direction is solved, and processing jumps to step (6-2) for further judgment.
(6) Judging the gesture by using the outline characteristic of the right palm and the position characteristic of the right arm
(6-1) The xn and zn of the current frame are compared with the Euclidean distance xm1 from the first right-hand fingertip to the right palm center and the Euclidean distance zm1 from the second right-hand fingertip to the right palm center in the frame m1 frames earlier, where m1 takes a value in the range [10, 60], to recognize the zoom-in and zoom-out gestures: if xn > xm1 and zn > zm1, the current-frame output gesture is judged to be a zoom-in gesture and processing jumps to step (7); if xn < xm1 and zn < zm1, the current-frame output gesture is judged to be a zoom-out gesture and processing jumps to step (7); if neither case holds, or the first and second right-hand fingertip data do not exist in the frame m1 frames earlier, it is judged that the current frame contains no zoom gesture and processing of the current frame ends.
(6-2) It is judged whether θ of the current frame belongs to [85°, 95°]; if so, processing jumps to step (6-3) to further judge the leftward and rightward viewpoint-switching gestures, otherwise processing of the current frame ends.
(6-3) The angle θm2 between the right arm and the spatial X axis representing the horizontal rightward direction in the frame m2 frames earlier is examined to recognize the leftward and rightward viewpoint-switching gestures, where m2 takes a value in the range [10, 60]. If θm2 satisfies the relation of formula (2),

20° ≤ θm2 ≤ 80°    (2)

the current-frame output gesture is judged to be a leftward viewpoint-switching gesture, the viewpoint switching span Vleft of the corresponding number of units is calculated, and processing jumps to step (7). If θm2 satisfies the relation of formula (3),

100° ≤ θm2 ≤ 160°    (3)

the current-frame output gesture is judged to be a rightward viewpoint-switching gesture, the viewpoint switching span Vright of the corresponding number of units is calculated, and processing jumps to step (7). If neither relation holds, or θm2 does not exist in the frame m2 frames earlier, it is judged that the current frame contains no viewpoint-switching gesture and processing of the current frame data ends.
The step (7) specifically comprises:
(7-1) If a zoom-in gesture is detected in the current frame, the currently playing picture is paused, the spatial coordinates of the right palm center position of the current gesture are mapped to the specific position in the free viewpoint image, the picture at that position is locally zoomed in through the resize() function of OpenCV, and finally the zoomed-in picture is output to the display to complete the zoom-in operation.
(7-2) If a zoom-out gesture is detected in the current frame, the currently playing picture is paused, the spatial coordinates of the right palm center position of the current gesture are mapped to the specific position in the free viewpoint image, the picture at that position is locally zoomed out through the resize() function of OpenCV, and finally the zoomed-out picture is output to the display to complete the zoom-out operation.
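One way such a local zoom could be sketched with OpenCV is shown below; the window size, the zoom factor and the paste-back scheme are assumptions made for illustration, the embodiment only states that the resize() function is used.

#include <opencv2/opencv.hpp>

// Crop a window around the mapped palm position, rescale it with cv::resize
// and paste it back onto a copy of the frame (factor > 1 zooms in, < 1 out).
cv::Mat localZoom(const cv::Mat& frame, cv::Point centre, double factor)
{
    const int win = 200;                                       // example window size
    cv::Rect roi(centre.x - win / 2, centre.y - win / 2, win, win);
    roi &= cv::Rect(0, 0, frame.cols, frame.rows);

    cv::Mat zoomed;
    cv::resize(frame(roi), zoomed, cv::Size(), factor, factor, cv::INTER_LINEAR);

    cv::Mat out = frame.clone();
    cv::Rect dst(centre.x - zoomed.cols / 2, centre.y - zoomed.rows / 2,
                 zoomed.cols, zoomed.rows);
    dst &= cv::Rect(0, 0, out.cols, out.rows);
    zoomed(cv::Rect(0, 0, dst.width, dst.height)).copyTo(out(dst));
    return out;
}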
(7-3) If a leftward viewpoint-switching gesture is detected in the current frame, the corresponding signal is triggered and sent to the free viewpoint synthesis end, where a new viewpoint is generated by the free viewpoint synthesis technique; the generated new viewpoint lies the corresponding number of units to the right of the current viewpoint, and switching the viewpoint to the left by one unit corresponds to an actual viewpoint shift of 30 mm to the left. The new viewpoint is then sent to the display to complete the viewpoint-switching process.
(7-4) If a rightward viewpoint-switching gesture is detected in the current frame, the corresponding signal is triggered and sent to the free viewpoint synthesis end, where a new viewpoint is generated by the free viewpoint synthesis technique; the generated new viewpoint lies the corresponding number of units to the left of the current viewpoint, and switching the viewpoint to the right by one unit corresponds to an actual viewpoint shift of 30 mm to the right. The new viewpoint is then sent to the display to complete the viewpoint-switching process.
(7-5) Processing of the current frame data ends.

Claims (5)

1. A free viewpoint video playing method based on somatosensory camera gesture recognition is characterized by comprising the following steps:
(1) acquiring a frame of depth image and human skeleton data:
a frame of depth image I1 and the corresponding human skeleton data S1 are acquired in real time through the motion sensing camera, where I1 is w pixels wide and h pixels high, and S1 contains the three-dimensional skeleton coordinate data of three points: the right arm joint, the right wrist joint and the right palm;
(2) the depth image is preprocessed by adopting a bilateral filtering method:
the depth image I1 is filtered with a bilateral filtering method to eliminate noise interference in the depth image and optimize the edge contour information in the depth image, and the bilaterally filtered depth image is denoted I'1;
(3) Calculating a right palm center position and a right arm position using human skeletal data:
the three-dimensional coordinates of the right palm joint point in S1 are taken as the right palm center position coordinates Hc(x1, y1, z1), and the right arm position is calculated from the line connecting the arm joint point Ar(x2, y2, z2) and the wrist joint point Wr(x3, y3, z3);
(4) obtaining a right palm outline according to the center position of the right palm and the preprocessed depth image:
Hc(x1, y1, z1) is mapped to the depth image I'1 to obtain a corresponding point; taking the gray value of this point as reference, the region of the depth image I'1 adjacent in gray value is matched to obtain a region H, and region H and the gray values of the other areas of I'1 are binarized to obtain a palm contour map;
(5) extracting the outline characteristic of the right palm and the position characteristic of the right arm:
it is first judged whether region H is a polygonal area with five convex points and four concave points; if so, the fingertips and finger roots are further judged, and the Euclidean distance xn from the first right-hand fingertip to the right palm center and the Euclidean distance zn from the second right-hand fingertip to the right palm center are obtained; if not, the angle θ between the right arm of the current frame and the spatial X axis representing the horizontal rightward direction is obtained from Ar(x2, y2, z2) and Wr(x3, y3, z3);
(6) judging and recognizing the gesture by utilizing the palm contour characteristics and the arm position characteristics:
if the Euclidean distance xn from the first right-hand fingertip to the right palm center of the current frame and the Euclidean distance zn from the second right-hand fingertip to the right palm center are obtained in step (5), xn and zn are compared with the Euclidean distance xm1 from the first right-hand fingertip to the right palm center and the Euclidean distance zm1 from the second right-hand fingertip to the right palm center in the frame m1 frames earlier, where m1 takes a value in the range [10, 60], to recognize a zoom-in gesture or a zoom-out gesture; if neither the zoom-in nor the zoom-out gesture condition is met, processing of the current frame ends;
if the angle θ between the right arm of the current frame and the spatial X axis representing the horizontal rightward direction is obtained in step (5), it is judged whether θ of the current frame belongs to [85°, 95°]; if so, the angle θm2 between the right arm and the spatial X axis representing the horizontal rightward direction in the frame m2 frames earlier is judged to recognize a leftward viewpoint-switching gesture or a rightward viewpoint-switching gesture, where m2 takes a value in the range [10, 60]; if the viewpoint-switching gesture condition is not met, processing of the current frame ends;
(7) converting the recognized gesture into a corresponding operation:
converting the recognized zooming-in gesture or zooming-out gesture into zooming-in or zooming-out operation of a video picture; converting the recognized gesture for switching the viewpoint leftwards or switching the viewpoint rightwards into leftward switching operation of the viewpoint or rightward switching operation of the viewpoint; the processing of the current frame data is ended.
2. The free viewpoint video playing method based on somatosensory camera gesture recognition according to claim 1, wherein in step (4), Hc(x1, y1, z1) is mapped to the depth image I'1 to obtain H'c(x'1, y'1, z'1); taking the gray value G1 of H'c(x'1, y'1, z'1) as reference, a rectangular area of the depth image I'1 centered on H'c(x'1, y'1, z'1), with width w' pixels and height h' pixels, where w' = w/4 and h' = h/3, is searched; within this rectangular area, the pixels whose gray values lie in [G1 - 10, G1 + 10] are extracted, and this area is marked H; finally, the gray value of area H is set to 255 and the gray values of the remaining background areas are set to 0 to obtain the right palm contour.
3. The method for playing the free viewpoint video based on the somatosensory camera gesture recognition as claimed in claim 1, wherein the step (5) comprises the following substeps:
(5-1) convex hull detection is performed on region H; if region H is a polygonal area with five convex points and four concave points, all the convex and concave points of region H are numbered alternately in sequence as z[n], n = 0, 1, ..., 8, giving a path from z[0] to z[8]; the convex points (fingertips) and concave points (finger roots) of the palm contour are distinguished by judging whether the path from one point to the next runs downward or upward: if the direction from the current point to the next point is downward, the point is a fingertip, and if it is upward, the point is a finger root; all the fingertip positions, together with the previously determined right palm center position, are taken as the right palm contour features; the Euclidean distance xn from the first right-hand fingertip to the right palm center and the Euclidean distance zn from the second right-hand fingertip to the right palm center are calculated;
(5-2) if region H does not have five convex points and four concave points, the angle θ between the right arm and the spatial X axis representing the horizontal rightward direction is extracted from the current-frame point Ar(x2, y2, z2) and point Wr(x3, y3, z3): the Euclidean distance d between the two points, i.e. their actual distance in three-dimensional space, is obtained with formula (1),

d = √((x3 − x2)² + (y3 − y2)² + (z3 − z2)²)    (1)

and then, from the distance |x3 − x2| of points Wr(x3, y3, z3) and Ar(x2, y2, z2) along the spatial X axis and the Euclidean distance d, the angle θ between the right arm of the current frame and the spatial X axis representing the horizontal rightward direction is calculated.
4. The method for playing the free viewpoint video based on the somatosensory camera gesture recognition as claimed in claim 1, wherein the step (6) comprises the following sub-steps:
(6-1) if the Euclidean distance xn from the first right-hand fingertip to the right palm center of the current frame and the Euclidean distance zn from the second right-hand fingertip to the right palm center are obtained in step (5), it is judged whether the zoom-in or zoom-out gesture condition is met: if xn > xm1 and zn > zm1, the current-frame output gesture is judged to be a zoom-in gesture; if xn < xm1 and zn < zm1, the current-frame output gesture is judged to be a zoom-out gesture; if neither case holds, or the first and second right-hand fingertip data do not exist in the frame m1 frames earlier, it is judged that the current frame contains no zoom gesture, and processing of the current frame ends;
(6-2) if the angle θ between the right arm of the current frame and the spatial X axis representing the horizontal rightward direction is obtained in step (5), it is judged whether θ belongs to [85°, 95°]; if so, whether the gesture is a leftward or rightward viewpoint-switching gesture is further judged through step (6-3), otherwise processing of the current frame ends;
(6-3) the angle θm2 between the right arm and the spatial X axis representing the horizontal rightward direction in the frame m2 frames earlier is taken: if θm2 belongs to [20°, 80°], i.e. satisfies formula (2),

20° ≤ θm2 ≤ 80°    (2)

the current-frame output gesture is judged to be a leftward viewpoint-switching gesture and the viewpoint switching span Vleft of the corresponding number of units is calculated; if θm2 belongs to [100°, 160°], i.e. satisfies formula (3),

100° ≤ θm2 ≤ 160°    (3)

the current-frame output gesture is judged to be a rightward viewpoint-switching gesture and the viewpoint switching span Vright of the corresponding number of units is calculated; if θm2 satisfies neither of the above cases, or θm2 does not exist in the frame m2 frames earlier, it is judged that the current frame contains no viewpoint-switching gesture, and processing of the current frame data ends.
5. The method for playing the free viewpoint video based on the somatosensory camera gesture recognition as claimed in claim 1, wherein the step (7) comprises the following sub-steps:
(7-1) if the recognized gesture is a local screen zooming-in gesture, realizing a playing operation in the following way: pausing a current playing picture, mapping the space coordinate of the center of the right palm of the current gesture to a specific position in the image, then locally amplifying the picture at the position through a corresponding functional module, and finally outputting the amplified picture to a display to finish the operation;
(7-2) if the recognized gesture is a zoom-out gesture of the local screen, implementing a play operation by: pausing a current playing picture, mapping the space coordinate of the center of the right palm of the current gesture to a specific position in the image, then locally reducing the picture of the position through a corresponding functional module, and finally outputting the reduced picture to a display to finish the operation;
(7-3) if the currently recognized gesture is a gesture for switching the viewpoint leftwards, sending a gesture signal for switching leftwards, generating a new viewpoint through a free viewpoint synthesis technology, and finally outputting the new viewpoint to a display to complete the operation;
(7-4) if the currently recognized gesture is a gesture for switching the viewpoint to the right, sending a gesture signal for switching the viewpoint to the right, generating a new viewpoint through a free viewpoint synthesis technology, and finally outputting the new viewpoint to a display to complete the operation;
and (7-5) ending the processing of the current frame data.
CN201810246578.3A 2018-03-23 2018-03-23 Free viewpoint video playing method based on motion sensing camera gesture recognition Active CN108521594B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810246578.3A CN108521594B (en) 2018-03-23 2018-03-23 Free viewpoint video playing method based on motion sensing camera gesture recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810246578.3A CN108521594B (en) 2018-03-23 2018-03-23 Free viewpoint video playing method based on motion sensing camera gesture recognition

Publications (2)

Publication Number Publication Date
CN108521594A CN108521594A (en) 2018-09-11
CN108521594B true CN108521594B (en) 2020-08-11

Family

ID=63434053

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810246578.3A Active CN108521594B (en) 2018-03-23 2018-03-23 Free viewpoint video playing method based on motion sensing camera gesture recognition

Country Status (1)

Country Link
CN (1) CN108521594B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110989844A (en) * 2019-12-16 2020-04-10 广东小天才科技有限公司 Input method, watch, system and storage medium based on ultrasonic waves
CN112346642B (en) * 2020-11-10 2022-03-08 交控科技股份有限公司 Train information display method and device, electronic equipment and system
CN114041767A (en) * 2021-10-11 2022-02-15 宁波春建电子科技有限公司 Heart rate detection method based on depth camera and millimeter wave radar

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102520791A (en) * 2011-11-28 2012-06-27 北京盈胜泰科技术有限公司 Wireless gesture recognition device
CN102622083A (en) * 2012-02-24 2012-08-01 北京盈胜泰科技术有限公司 Hand gesture recognition method and hand gesture recognition device
CN103167230A (en) * 2011-12-17 2013-06-19 富泰华工业(深圳)有限公司 Electronic equipment and method controlling shooting according to gestures thereof

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100027845A1 (en) * 2008-07-31 2010-02-04 Samsung Electronics Co., Ltd. System and method for motion detection based on object trajectory
US9063573B2 (en) * 2011-02-17 2015-06-23 The Board Of Trustees Of The Leland Stanford Junior University Method and system for touch-free control of devices

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102520791A (en) * 2011-11-28 2012-06-27 北京盈胜泰科技术有限公司 Wireless gesture recognition device
CN103167230A (en) * 2011-12-17 2013-06-19 富泰华工业(深圳)有限公司 Electronic equipment and method controlling shooting according to gestures thereof
CN102622083A (en) * 2012-02-24 2012-08-01 北京盈胜泰科技术有限公司 Hand gesture recognition method and hand gesture recognition device

Also Published As

Publication number Publication date
CN108521594A (en) 2018-09-11

Similar Documents

Publication Publication Date Title
Li et al. Delving into egocentric actions
CN108388882B (en) Gesture recognition method based on global-local RGB-D multi-mode
WO2019228473A1 (en) Method and apparatus for beautifying face image
US8615108B1 (en) Systems and methods for initializing motion tracking of human hands
US9092665B2 (en) Systems and methods for initializing motion tracking of human hands
CN108062525B (en) Deep learning hand detection method based on hand region prediction
JP4768196B2 (en) Apparatus and method for pointing a target by image processing without performing three-dimensional modeling
US6204852B1 (en) Video hand image three-dimensional computer interface
US20110025834A1 (en) Method and apparatus of identifying human body posture
CN102096471B (en) Human-computer interaction method based on machine vision
CN108521594B (en) Free viewpoint video playing method based on motion sensing camera gesture recognition
JP5438601B2 (en) Human motion determination device and program thereof
CN104813258A (en) Data input device
KR101642402B1 (en) Apparatus and method for capturing digital image for guiding photo composition
CN104821010A (en) Binocular-vision-based real-time extraction method and system for three-dimensional hand information
CN110032932B (en) Human body posture identification method based on video processing and decision tree set threshold
TW201514830A (en) Interactive operation method of electronic apparatus
CN102880865A (en) Dynamic gesture recognition method based on complexion and morphological characteristics
CN103995595A (en) Game somatosensory control method based on hand gestures
JP6331761B2 (en) Determination device, determination method, and determination program
JP2010057105A (en) Three-dimensional object tracking method and system
JP2014029656A (en) Image processor and image processing method
CN106484108A (en) Chinese characters recognition method based on double vision point gesture identification
Mohatta et al. Robust hand gestural interaction for smartphone based AR/VR applications
Ramirez-Giraldo et al. Kernel based hand gesture recognition using kinect sensor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant