CN117392759B - Action recognition method based on AR teaching aid - Google Patents

Action recognition method based on AR teaching aid

Info

Publication number: CN117392759B (application CN202311685269.3A)
Authority: CN (China)
Prior art keywords: gesture, color label, layer, representing, convolution layer
Legal status: Active
Other languages: Chinese (zh)
Other versions: CN117392759A (en)
Inventors: 凌艳, 陆海燕
Assignee (original and current): Chengdu Aeronautic Polytechnic
Application filed by Chengdu Aeronautic Polytechnic
Priority to CN202311685269.3A
Publication of application CN117392759A, publication of grant CN117392759B
Application granted

Classifications

    • G (Physics); G06 (Computing; Calculating or Counting); G06V (Image or Video Recognition or Understanding)
    • G06V40/28: Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G06V10/273: Segmentation of patterns in the image field; removing elements interfering with the pattern to be recognised
    • G06V10/454: Biologically inspired filters integrated into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V10/56: Extraction of image or video features relating to colour
    • G06V10/806: Fusion of extracted features at the sensor, preprocessing, feature extraction or classification level
    • G06V10/82: Arrangements for image or video recognition or understanding using neural networks

Abstract

The invention discloses an action recognition method based on an AR teaching aid, belonging to the technical field of action recognition, comprising the following steps: S1, acquiring a gesture global image of a user on the AR teaching aid, cutting the gesture global image, and generating a gesture local image; S2, constructing a gesture recognition model; S3, inputting the gesture local image into the gesture recognition model and determining the gesture action of the user. The method precisely cuts the global image so that the generated local image contains only the hand, which allows the gesture action to be extracted quickly and accurately in the subsequent steps. The invention also builds a gesture recognition model that performs feature extraction and feature fusion on the local image, so that the user's gesture can be recognized accurately, making it easy for the user to control the AR teaching aid, improving the user experience and reducing the interaction latency of the AR teaching aid.

Description

Action recognition method based on AR teaching aid
Technical Field
The invention belongs to the technical field of action recognition, and particularly relates to an action recognition method based on an AR teaching aid.
Background
With the progress of technology, AR technology is developing rapidly and its applications are increasingly accepted. The AR sand table is being widely used as a new type of AR educational tool. It combines three-dimensional virtual content with the real environment, so that students experience and learn various contents through multiple senses such as vision, hearing and touch. In education, the AR sand table is fused with a solid model by means of acoustic, optical, electrical, image, three-dimensional animation and computer programming techniques. Students and teachers operate the AR sand table through gesture actions: the gesture graphics or image information is converted into data and fed into the head-mounted device equipped with the AR sand table, where it is matched with the three-dimensional spatial information of the sand table so that the sand table can be controlled. However, the existing AR sand table is not accurate enough and responds slowly when recognizing user actions, so the invention provides an action recognition method based on an AR teaching aid.
Disclosure of Invention
In order to solve the above problems, the invention provides an action recognition method based on an AR teaching aid.
The technical scheme of the invention is as follows: an action recognition method based on an AR teaching aid comprises the following steps:
s1, acquiring a gesture global image of a user on an AR teaching aid, cutting the gesture global image, and generating a gesture local image;
s2, constructing a gesture recognition model;
s3, inputting the gesture local image into a gesture recognition model, and determining gesture actions of a user.
Further, S1 comprises the following sub-steps:
s11, calculating color label values of all pixel points in the gesture global image;
s12, taking the pixel point with the largest color label value as a standard pixel point;
s13, calculating the difference between the color label values of the rest pixel points in the gesture global image and the color label values of the standard pixel points to obtain a color label difference set;
s14, determining invalid pixel points in the gesture global image according to the color label difference value set;
s15, removing the invalid pixel points from the gesture global image to generate a gesture local image.
The beneficial effects of the above further scheme are: when a student or teacher uses the digital sand table, operations are completed by waving gestures above it, so the invention collects an image at the moment the gesture is made. However, this image may contain redundant background that interferes with gesture recognition, so the invention first cuts the gesture global image and determines a local image containing only the hand. The three-channel color values of the local image are generally similar, so pixel points belonging to background noise are screened out by computing the color label values of the pixel points and are removed from the gesture global image. A gesture local image containing only the hand is thus obtained, which makes gesture recognition in the subsequent steps fast and convenient and improves recognition efficiency and accuracy.
Further, in S11, the color label value C_{x,y} of the pixel point with abscissa x and ordinate y in the gesture global image is computed from the pixel's RGB channel values and those of the image-center pixel, wherein x_0 represents the abscissa of the pixel point at the center of the gesture global image, y_0 represents its ordinate, R_{x,y}, G_{x,y} and B_{x,y} represent the red, green and blue channel values of the pixel point with abscissa x and ordinate y, R_{x_0,y_0}, G_{x_0,y_0} and B_{x_0,y_0} represent the red, green and blue channel values of the pixel point with abscissa x_0 and ordinate y_0, and log(·) represents a logarithmic function.
Further, S14 includes the sub-steps of:
s141, sorting all color label differences of the color label difference set from small to large, and taking the leading portion of the sorted differences, whose count is obtained from L with an upward rounding (ceiling) function, as a first color label difference subset; wherein L represents the number of color label differences in the color label difference set;
s142, randomly dividing the rest color label difference values except the first color label difference value subset in the color label difference value set into a second color label difference value subset and a third color label difference value subset;
s143, determining a color label threshold according to the first color label difference value subset, the second color label difference value subset and the third color label difference value subset;
s144, eliminating pixel points corresponding to the color label difference values larger than the color label threshold value from the gesture global image, and generating a gesture local image.
Further, in S143, the color label threshold σ is computed from the three subsets, wherein u_m represents the m-th color label difference in the first color label difference subset, v_n represents the n-th color label difference in the second color label difference subset, w_k represents the k-th color label difference in the third color label difference subset, max(·) represents the maximum function, min(·) represents the minimum function, v_ave represents the average of all color label differences in the second color label difference subset, w_ave represents the average of all color label differences in the third color label difference subset, and e represents the base of the exponential function.
Further, in S2, the gesture recognition model includes an input layer, a first feature convolution layer, a second feature convolution layer, an operator, a fully connected layer, and an output layer;
the input end of the input layer is used as the input end of the gesture recognition model, the first output end of the input layer is connected with the input end of the first characteristic convolution layer, and the second output layer of the input layer is connected with the first input end of the second characteristic convolution layer; the first output end of the first characteristic convolution layer is connected with the first input end of the arithmetic unit, and the second output end of the first characteristic convolution layer is connected with the second input end of the second characteristic convolution layer; the output end of the second characteristic convolution layer is connected with the second input end of the arithmetic unit; the output end of the arithmetic unit is connected with the input end of the full-connection layer; the output end of the full-connection layer is connected with the input end of the output layer; the output end of the output layer is used as the output end of the gesture recognition model.
The beneficial effects of the above further scheme are: the input layer feeds the gesture local image into the gesture recognition model. The first feature convolution layer extracts feature information of the gesture local image from the pixel values of its pixel points; the second feature convolution layer fuses the feature information extracted by the first feature convolution layer with the pixel values of the pixel points in the gesture local image, which increases the richness of the features. The fully connected layer fuses the feature information extracted by the first feature convolution layer and by the second feature convolution layer once more through an addition operation, raising the feature dimension, and the recognition result is finally output through the output layer.
Further, the expression of the first feature convolution layer is defined in terms of the following quantities: G represents the output of the first feature convolution layer, σ(·) represents the activation function, Z represents the matrix of pixel values, z_{1,1}, ..., z_{I,J} represent the pixel values of the pixel points in the gesture local image, I represents the number of pixel rows of the gesture local image, J represents the number of pixel columns of the gesture local image, w_p represents the weight of the p-th convolution kernel in the first feature convolution layer, o_p represents the offset of the p-th convolution kernel in the first feature convolution layer, α_p represents the step size of the p-th convolution kernel in the first feature convolution layer, b_p represents the number of channels of the p-th convolution kernel in the first feature convolution layer, and P represents the number of convolution kernels of the first feature convolution layer.
Further, the expression of the second feature convolution layer is defined in terms of the following quantities: H represents the output of the second feature convolution layer, σ(·) represents the activation function, Z represents the matrix of pixel values, z_{1,1}, ..., z_{I,J} represent the pixel values of the pixel points in the gesture local image, I represents the number of pixel rows of the gesture local image, J represents the number of pixel columns of the gesture local image, W_q represents the weight of the q-th convolution kernel in the second feature convolution layer, O_q represents the offset of the q-th convolution kernel in the second feature convolution layer, β_q represents the step size of the q-th convolution kernel in the second feature convolution layer, B_q represents the number of channels of the q-th convolution kernel in the second feature convolution layer, P represents the number of convolution kernels of the first feature convolution layer, and Q represents the number of convolution kernels of the second feature convolution layer.
Further, the expression of the fully connected layer is defined in terms of the following quantities: T represents the output of the fully connected layer, a bias is associated with the k-th neuron in the fully connected layer, K represents the number of neurons of the fully connected layer, P represents the number of convolution kernels of the first feature convolution layer, Q represents the number of convolution kernels of the second feature convolution layer, G represents the output of the first feature convolution layer, and H represents the output of the second feature convolution layer.
The beneficial effects of the invention are as follows: the invention discloses an action recognition method based on an AR teaching aid. By recognizing the global image of the gesture the user makes over the AR teaching aid (i.e. the AR sand table) and taking the background noise of the global image into account, the global image is precisely cut so that the generated local image contains only the hand, which allows the gesture action to be extracted quickly and accurately in the subsequent steps. The invention also builds a gesture recognition model that performs feature extraction and feature fusion on the local image, so that the user's gesture can be recognized accurately, making it easy for the user to control the AR teaching aid, improving the user experience and reducing the interaction latency of the AR teaching aid.
Drawings
FIG. 1 is a flow chart of an AR teaching aid based motion recognition method;
fig. 2 is a schematic diagram of a gesture recognition model.
Detailed Description
Embodiments of the present invention are further described below with reference to the accompanying drawings.
As shown in fig. 1, the invention provides an action recognition method based on an AR teaching aid, which comprises the following steps:
s1, acquiring a gesture global image of a user on an AR teaching aid, cutting the gesture global image, and generating a gesture local image;
s2, constructing a gesture recognition model;
s3, inputting the gesture local image into a gesture recognition model, and determining gesture actions of a user.
In an embodiment of the present invention, S1 comprises the following sub-steps:
s11, calculating color label values of all pixel points in the gesture global image;
s12, taking the pixel point with the largest color label value as a standard pixel point;
s13, calculating the difference between the color label values of the rest pixel points in the gesture global image and the color label values of the standard pixel points to obtain a color label difference set;
s14, determining invalid pixel points in the gesture global image according to the color label difference value set;
s15, removing the invalid pixel points from the gesture global image to generate a gesture local image.
In the invention, when a student or teacher uses the digital sand table, operations are completed by waving gestures above it, so the invention collects an image at the moment the gesture is made. However, this image may contain redundant background that interferes with gesture recognition, so the invention first cuts the gesture global image and determines a local image containing only the hand. The three-channel color values of the local image are generally similar, so pixel points belonging to background noise are screened out by computing the color label values of the pixel points and are removed from the gesture global image. A gesture local image containing only the hand is thus obtained, which makes gesture recognition in the subsequent steps fast and convenient and improves recognition efficiency and accuracy.
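To make S11 to S15 concrete, the following minimal NumPy sketch walks through the cropping pipeline. It is only an illustration: the patent expresses the color label value and the threshold through formulas that are not reproduced in this text, so the color_label function below uses a placeholder log ratio and the threshold is supplied by the caller; the function names and the zero-filling of removed pixels are assumptions made for the sketch, not the patent's own implementation.

    import numpy as np

    def color_label(img, x, y, x0, y0):
        # Placeholder for the patent's color label value C_{x,y}: the real
        # formula combines the RGB channel values of pixel (x, y) and of the
        # image-center pixel (x0, y0) through a logarithmic function.
        r, g, b = img[y, x].astype(np.float64)
        r0, g0, b0 = img[y0, x0].astype(np.float64)
        return np.log(1.0 + r + g + b) - np.log(1.0 + r0 + g0 + b0)

    def crop_gesture(global_img, threshold):
        # S11-S15: keep only the pixels whose color label difference from the
        # standard (maximum-label) pixel does not exceed the threshold.
        h, w, _ = global_img.shape
        y0, x0 = h // 2, w // 2                                     # image-center pixel
        labels = np.array([[color_label(global_img, x, y, x0, y0)
                            for x in range(w)] for y in range(h)])  # S11
        std = labels.max()                                          # S12: standard pixel value
        diffs = std - labels                                        # S13: difference set
        valid = diffs <= threshold                                  # S14: invalid pixels fail this test
        local_img = global_img.copy()
        local_img[~valid] = 0                                       # S15: remove invalid pixels
        return local_img

In this sketch the removal of invalid pixels is realized by zeroing them out so that the returned array keeps its original shape; an actual implementation could equally crop to the bounding box of the valid region.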
In the embodiment of the present invention, in S11, the color label value C_{x,y} of the pixel point with abscissa x and ordinate y in the gesture global image is computed from the pixel's RGB channel values and those of the image-center pixel, wherein x_0 represents the abscissa of the pixel point at the center of the gesture global image, y_0 represents its ordinate, R_{x,y}, G_{x,y} and B_{x,y} represent the red, green and blue channel values of the pixel point with abscissa x and ordinate y, R_{x_0,y_0}, G_{x_0,y_0} and B_{x_0,y_0} represent the red, green and blue channel values of the pixel point with abscissa x_0 and ordinate y_0, and log(·) represents a logarithmic function.
In an embodiment of the present invention, S14 includes the sub-steps of:
s141, sorting all color label differences of the color label difference set from small to large, and taking the leading portion of the sorted differences, whose count is obtained from L with an upward rounding (ceiling) function, as a first color label difference subset; wherein L represents the number of color label differences in the color label difference set;
s142, randomly dividing the rest color label difference values except the first color label difference value subset in the color label difference value set into a second color label difference value subset and a third color label difference value subset;
s143, determining a color label threshold according to the first color label difference value subset, the second color label difference value subset and the third color label difference value subset;
s144, eliminating pixel points corresponding to the color label difference values larger than the color label threshold value from the gesture global image, and generating a gesture local image.
In the embodiment of the present invention, in S143, the color label threshold σ is computed from the three subsets, wherein u_m represents the m-th color label difference in the first color label difference subset, v_n represents the n-th color label difference in the second color label difference subset, w_k represents the k-th color label difference in the third color label difference subset, max(·) represents the maximum function, min(·) represents the minimum function, v_ave represents the average of all color label differences in the second color label difference subset, w_ave represents the average of all color label differences in the third color label difference subset, and e represents the base of the exponential function.
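A small Python sketch of S141 to S144 is given below for orientation. The fraction taken for the first subset and the closed-form expression for the threshold σ appear in the patent only as formulas that are not reproduced here, so a one-third split and a simple combination of the subset statistics stand in for them; both are assumptions of the sketch, not the patented formula.

    import math
    import random

    def color_label_threshold(diff_set, first_fraction=1.0 / 3.0):
        # S141: sort the color label differences from small to large and take
        # the leading, upward-rounded portion as the first subset.
        diffs = sorted(diff_set)
        cut = math.ceil(len(diffs) * first_fraction)   # stand-in for the patented count
        first, rest = diffs[:cut], diffs[cut:]
        # S142: randomly split the remaining differences into the second and
        # third subsets.
        random.shuffle(rest)
        half = len(rest) // 2
        second, third = rest[:half], rest[half:]
        # S143 (placeholder): combine the extreme value of the first subset with
        # the averages v_ave and w_ave of the second and third subsets.
        v_ave = sum(second) / len(second) if second else 0.0
        w_ave = sum(third) / len(third) if third else 0.0
        return 0.5 * (max(first) + 0.5 * (v_ave + w_ave))

    def invalid_flags(diffs, sigma):
        # S144: differences above the threshold mark invalid pixel points.
        return [d > sigma for d in diffs]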
In the embodiment of the present invention, as shown in fig. 2, in S2, the gesture recognition model includes an input layer, a first feature convolution layer, a second feature convolution layer, an operator, a fully connected layer, and an output layer;
the input end of the input layer serves as the input end of the gesture recognition model; the first output end of the input layer is connected with the input end of the first feature convolution layer, and the second output end of the input layer is connected with the first input end of the second feature convolution layer; the first output end of the first feature convolution layer is connected with the first input end of the operator, and the second output end of the first feature convolution layer is connected with the second input end of the second feature convolution layer; the output end of the second feature convolution layer is connected with the second input end of the operator; the output end of the operator is connected with the input end of the fully connected layer; the output end of the fully connected layer is connected with the input end of the output layer; and the output end of the output layer serves as the output end of the gesture recognition model.
In the invention, the input layer feeds the gesture local image into the gesture recognition model. The first feature convolution layer extracts feature information of the gesture local image from the pixel values of its pixel points; the second feature convolution layer fuses the feature information extracted by the first feature convolution layer with the pixel values of the pixel points in the gesture local image, which increases the richness of the features. The fully connected layer fuses the feature information extracted by the first feature convolution layer and by the second feature convolution layer once more through an addition operation, raising the feature dimension, and the recognition result is finally output through the output layer.
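As an aid to reading fig. 2, the following PyTorch sketch wires up the same connection scheme: one branch convolves the pixel values directly, a second branch fuses those features with the pixel values, an addition combines the two branches, and a fully connected stage produces the output. Kernel sizes, channel counts, pooling and the number of gesture classes are not specified above and are therefore arbitrary assumptions; the standard Conv2d and Linear modules only approximate the patent's own layer expressions.

    import torch
    import torch.nn as nn

    class GestureRecognitionModel(nn.Module):
        def __init__(self, num_classes=10):
            super().__init__()
            # First feature convolution layer: operates on the pixel values Z.
            self.first_conv = nn.Sequential(
                nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU())
            # Second feature convolution layer: fuses the input image with the
            # features of the first layer (3 + 16 input channels).
            self.second_conv = nn.Sequential(
                nn.Conv2d(3 + 16, 16, kernel_size=3, padding=1), nn.ReLU())
            self.pool = nn.AdaptiveAvgPool2d(1)
            # Fully connected layer followed by the output layer.
            self.fc = nn.Linear(16, 64)
            self.out = nn.Linear(64, num_classes)

        def forward(self, z):                  # z: (batch, 3, H, W) gesture local image
            g = self.first_conv(z)             # output G of the first branch
            h = self.second_conv(torch.cat([z, g], dim=1))   # output H, fused with pixel values
            fused = self.pool(g + h).flatten(1)              # operator: addition of G and H
            return self.out(torch.relu(self.fc(fused)))      # fully connected + output layer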
In the embodiment of the present invention, the expression of the first feature convolution layer is defined in terms of the following quantities: G represents the output of the first feature convolution layer, σ(·) represents the activation function, Z represents the matrix of pixel values, z_{1,1}, ..., z_{I,J} represent the pixel values of the pixel points in the gesture local image, I represents the number of pixel rows of the gesture local image, J represents the number of pixel columns of the gesture local image, w_p represents the weight of the p-th convolution kernel in the first feature convolution layer, o_p represents the offset of the p-th convolution kernel in the first feature convolution layer, α_p represents the step size of the p-th convolution kernel in the first feature convolution layer, b_p represents the number of channels of the p-th convolution kernel in the first feature convolution layer, and P represents the number of convolution kernels of the first feature convolution layer.
In the embodiment of the present invention, the expression of the second feature convolution layer is defined in terms of the following quantities: H represents the output of the second feature convolution layer, σ(·) represents the activation function, Z represents the matrix of pixel values, z_{1,1}, ..., z_{I,J} represent the pixel values of the pixel points in the gesture local image, I represents the number of pixel rows of the gesture local image, J represents the number of pixel columns of the gesture local image, W_q represents the weight of the q-th convolution kernel in the second feature convolution layer, O_q represents the offset of the q-th convolution kernel in the second feature convolution layer, β_q represents the step size of the q-th convolution kernel in the second feature convolution layer, B_q represents the number of channels of the q-th convolution kernel in the second feature convolution layer, P represents the number of convolution kernels of the first feature convolution layer, and Q represents the number of convolution kernels of the second feature convolution layer.
In the embodiment of the invention, the expression of the fully connected layer is defined in terms of the following quantities: T represents the output of the fully connected layer, a bias is associated with the k-th neuron in the fully connected layer, K represents the number of neurons of the fully connected layer, P represents the number of convolution kernels of the first feature convolution layer, Q represents the number of convolution kernels of the second feature convolution layer, G represents the output of the first feature convolution layer, and H represents the output of the second feature convolution layer.
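Putting the two sketches together, a hypothetical end-to-end call corresponding to S1 and S3 could look like the snippet below; crop_gesture and GestureRecognitionModel are the illustrative helpers defined earlier, and the image size, threshold value and number of classes are arbitrary choices for the example.

    import numpy as np
    import torch

    global_img = np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)  # stand-in for a captured frame
    local_img = crop_gesture(global_img, threshold=0.5)                    # S1: gesture local image
    x = torch.from_numpy(local_img).permute(2, 0, 1).float().unsqueeze(0) / 255.0
    model = GestureRecognitionModel(num_classes=10)
    gesture_id = model(x).argmax(dim=1).item()                             # S3: predicted gesture action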
Those of ordinary skill in the art will recognize that the embodiments described herein are intended to help the reader understand the principles of the present invention, and it should be understood that the scope of the invention is not limited to such specific statements and embodiments. Those of ordinary skill in the art can make various other specific modifications and combinations from the teachings of the present disclosure without departing from its spirit, and such modifications and combinations remain within the scope of the present disclosure.

Claims (5)

1. An action recognition method based on an AR teaching aid is characterized by comprising the following steps:
s1, acquiring a gesture global image of a user on an AR teaching aid, cutting the gesture global image, and generating a gesture local image;
s2, constructing a gesture recognition model;
s3, inputting the gesture local image into a gesture recognition model, and determining gesture actions of a user;
the step S1 comprises the following substeps:
s11, calculating color label values of all pixel points in the gesture global image;
s12, taking the pixel point with the largest color label value as a standard pixel point;
s13, calculating the difference between the color label values of the rest pixel points in the gesture global image and the color label values of the standard pixel points to obtain a color label difference set;
s14, determining invalid pixel points in the gesture global image according to the color label difference value set;
s15, removing invalid pixel points from the gesture global image to generate a gesture local image;
in S11, the color label value C_{x,y} of the pixel point with abscissa x and ordinate y in the gesture global image is computed from the pixel's RGB channel values and those of the image-center pixel, wherein x_0 represents the abscissa of the pixel point at the center of the gesture global image, y_0 represents its ordinate, R_{x,y}, G_{x,y} and B_{x,y} represent the red, green and blue channel values of the pixel point with abscissa x and ordinate y, R_{x_0,y_0}, G_{x_0,y_0} and B_{x_0,y_0} represent the red, green and blue channel values of the pixel point with abscissa x_0 and ordinate y_0, and log(·) represents a logarithmic function;
the step S14 includes the sub-steps of:
s141, sorting all color label differences of the color label difference set from small to large, and taking the leading portion of the sorted differences, whose count is obtained from L with an upward rounding (ceiling) function, as a first color label difference subset; wherein L represents the number of color label differences in the color label difference set;
s142, randomly dividing the rest color label difference values except the first color label difference value subset in the color label difference value set into a second color label difference value subset and a third color label difference value subset;
s143, determining a color label threshold according to the first color label difference value subset, the second color label difference value subset and the third color label difference value subset;
s144, removing pixel points corresponding to the color label difference values larger than the color label threshold value from the gesture global image to generate a gesture local image;
in S143, the color label threshold σ is computed from the three subsets, wherein u_m represents the m-th color label difference in the first color label difference subset, v_n represents the n-th color label difference in the second color label difference subset, w_k represents the k-th color label difference in the third color label difference subset, max(·) represents the maximum function, min(·) represents the minimum function, v_ave represents the average of all color label differences in the second color label difference subset, w_ave represents the average of all color label differences in the third color label difference subset, and e represents the base of the exponential function.
2. The AR teaching aid-based motion recognition method according to claim 1, wherein in S2, the gesture recognition model includes an input layer, a first feature convolution layer, a second feature convolution layer, an operator, a fully connected layer, and an output layer;
the input end of the input layer serves as the input end of the gesture recognition model; the first output end of the input layer is connected with the input end of the first feature convolution layer, and the second output end of the input layer is connected with the first input end of the second feature convolution layer; the first output end of the first feature convolution layer is connected with the first input end of the operator, and the second output end of the first feature convolution layer is connected with the second input end of the second feature convolution layer; the output end of the second feature convolution layer is connected with the second input end of the operator; the output end of the operator is connected with the input end of the fully connected layer; the output end of the fully connected layer is connected with the input end of the output layer; and the output end of the output layer serves as the output end of the gesture recognition model.
3. The AR teaching aid based motion recognition method according to claim 2, wherein the expression of the first feature convolution layer is defined in terms of the following quantities: G represents the output of the first feature convolution layer, σ(·) represents the activation function, Z represents the matrix of pixel values, z_{1,1}, ..., z_{I,J} represent the pixel values of the pixel points in the gesture local image, I represents the number of pixel rows of the gesture local image, J represents the number of pixel columns of the gesture local image, w_p represents the weight of the p-th convolution kernel in the first feature convolution layer, o_p represents the offset of the p-th convolution kernel in the first feature convolution layer, α_p represents the step size of the p-th convolution kernel in the first feature convolution layer, b_p represents the number of channels of the p-th convolution kernel in the first feature convolution layer, and P represents the number of convolution kernels of the first feature convolution layer.
4. The AR teaching aid based motion recognition method according to claim 2, wherein the expression of the second feature convolution layer is defined in terms of the following quantities: H represents the output of the second feature convolution layer, σ(·) represents the activation function, Z represents the matrix of pixel values, z_{1,1}, ..., z_{I,J} represent the pixel values of the pixel points in the gesture local image, I represents the number of pixel rows of the gesture local image, J represents the number of pixel columns of the gesture local image, W_q represents the weight of the q-th convolution kernel in the second feature convolution layer, O_q represents the offset of the q-th convolution kernel in the second feature convolution layer, β_q represents the step size of the q-th convolution kernel in the second feature convolution layer, B_q represents the number of channels of the q-th convolution kernel in the second feature convolution layer, P represents the number of convolution kernels of the first feature convolution layer, and Q represents the number of convolution kernels of the second feature convolution layer.
5. The action recognition method based on the AR teaching aid according to claim 2, wherein the expression of the fully connected layer is defined in terms of the following quantities: T represents the output of the fully connected layer, a bias is associated with the k-th neuron in the fully connected layer, K represents the number of neurons of the fully connected layer, P represents the number of convolution kernels of the first feature convolution layer, Q represents the number of convolution kernels of the second feature convolution layer, G represents the output of the first feature convolution layer, and H represents the output of the second feature convolution layer.
CN202311685269.3A 2023-12-11 2023-12-11 Action recognition method based on AR teaching aid Active CN117392759B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202311685269.3A (CN117392759B) | 2023-12-11 | 2023-12-11 | Action recognition method based on AR teaching aid

Publications (2)

Publication Number | Publication Date
CN117392759A (en) | 2024-01-12
CN117392759B (en) | 2024-03-12

Family

ID=89463428

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202311685269.3A (Active, CN117392759B) | Action recognition method based on AR teaching aid | 2023-12-11 | 2023-12-11

Country Status (1)

Country Link
CN (1) CN117392759B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108345867A (en) * 2018-03-09 2018-07-31 南京邮电大学 Gesture identification method towards Intelligent household scene
CN108983980A (en) * 2018-07-27 2018-12-11 河南科技大学 A kind of mobile robot basic exercise gestural control method
CN112749664A (en) * 2021-01-15 2021-05-04 广东工贸职业技术学院 Gesture recognition method, device, equipment, system and storage medium

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
HSV Brightness Factor Matching for Gesture Recognition System; Mokhtar M. Hasan et al.; International Journal of Image Processing; 2010-12-31; Vol. 4, No. 5; pp. 456-467 *
A dynamic gesture recognition method based on image feature fusion; 陈茜 et al.; 《计算机与数字工程》; 2023-06-20; Vol. 51, No. 6; pp. 1381-1386 *
Gesture segmentation based on the YCbCr color space; 杨红玲, 宣士斌, 莫愿斌, 赵洪; 广西民族大学学报(自然科学版); 2017-08-15; No. 03; pp. 66-71 *
Design of a natural hand interaction rehabilitation system based on augmented reality; 卞方舟; 《中国优秀硕士学位论文全文数据库 (医药卫生科技辑)》; 2023-01-15; No. 1; E060-1237 *
Research on a gesture recognition system based on computer vision; 周航; 《中国优秀硕士学位论文全文数据库 (信息科技辑)》; 2008-05-15; No. 5; I138-13 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant