CN114758268A - Gesture recognition method and device and intelligent equipment - Google Patents

Gesture recognition method and device and intelligent equipment

Info

Publication number
CN114758268A
CN114758268A
Authority
CN
China
Prior art keywords
video frame image
difference
gesture recognition
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210262703.6A
Other languages
Chinese (zh)
Inventor
邵池
焦继超
胡淑萍
王玥
庞建新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ubtech Robotics Corp
Original Assignee
Ubtech Robotics Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ubtech Robotics Corp filed Critical Ubtech Robotics Corp
Priority to CN202210262703.6A
Publication of CN114758268A
Legal status: Pending

Landscapes

  • Image Analysis (AREA)

Abstract

The application is applicable to the technical field of gesture recognition and provides a gesture recognition method, an apparatus, and an intelligent device. The gesture recognition method includes: acquiring a video clip, where the video clip includes at least two video frame images; determining the difference between a first video frame image and a second video frame image to obtain a difference result, where the first video frame image and the second video frame image are adjacent video frame images obtained from the video clip and the first video frame image is the video frame image immediately following the second video frame image; and performing dynamic gesture recognition on the first video frame image whose difference result indicates that the difference satisfies a condition. By this method, the accuracy of the obtained dynamic gesture recognition result can be improved.

Description

Gesture recognition method and device and intelligent equipment
Technical Field
The present application belongs to the field of gesture recognition technology, and in particular, to a gesture recognition method, an apparatus, an intelligent device, and a computer-readable storage medium.
Background
Gestures are a natural form of communication between humans, and gesture recognition is also one of the important research directions for human-computer interaction.
Gesture recognition can be divided into static gesture recognition and dynamic gesture recognition. Static gesture recognition performs recognition on a single input picture and outputs the gesture category in that picture; dynamic gesture recognition performs recognition on a sequence of consecutive pictures over a period of time and outputs the gesture category across those pictures. Unlike static gesture recognition, dynamic gesture recognition must learn the relationship of gestures across successive frames in the time dimension, since a dynamic gesture is a continuous process.
For video-based dynamic gesture recognition, the gesture recognition model cannot know which frames in a video segment mark the start and end of a given gesture, so prediction is usually performed segment by segment over a sliding window, and a prediction of the gesture category is output for each video segment. However, predictions obtained this way have a certain error rate, which degrades the user experience.
Disclosure of Invention
The embodiments of the application provide a gesture recognition method, an apparatus, an intelligent device, and a computer-readable storage medium, which can solve the problem of low recognition accuracy when existing dynamic gesture recognition methods are used for gesture recognition.
In a first aspect, an embodiment of the present application provides a gesture recognition method, including:
acquiring a video clip, wherein the video clip comprises at least two video frame images;
determining the difference between a first video frame image and a second video frame image to obtain a difference result, wherein the first video frame image and the second video frame image are adjacent video frame images obtained from the video clip, and the first video frame image is the video frame image immediately following the second video frame image;
and performing dynamic gesture recognition on the first video frame image of which the difference result indicates that the difference meets the condition, wherein the condition comprises that the difference is greater than a preset difference threshold value.
In a second aspect, an embodiment of the present application provides a gesture recognition apparatus, including:
the video clip acquisition module is used for acquiring a video clip, and the video clip comprises at least two video frame images;
a difference result determining module, configured to determine a difference between a first video frame image and a second video frame image to obtain a difference result, where the first video frame image and the second video frame image are adjacent video frame images obtained from the video clip, and the first video frame image is a video frame image subsequent to the second video frame image;
and the dynamic gesture recognition module is used for performing dynamic gesture recognition on the first video frame image of which the difference result indicates that the difference meets the condition, wherein the condition comprises that the difference is greater than a preset difference threshold value.
In a third aspect, an embodiment of the present application provides an intelligent device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor implements the method according to the first aspect when executing the computer program.
In a fourth aspect, the present application provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the computer program implements the method according to the first aspect.
In a fifth aspect, embodiments of the present application provide a computer program product, which, when run on a smart device, causes the smart device to perform the method according to the first aspect.
Compared with the prior art, the embodiment of the application has the advantages that:
in the embodiment of the present application, the difference between the first video frame image and the video frame image preceding it (i.e., the second video frame image) is determined before dynamic gesture recognition is performed, and dynamic gesture recognition is performed on the first video frame image only when that difference is greater than the preset difference threshold. A difference greater than the threshold indicates that, compared with the second video frame image, the first video frame image contains a moving picture. Performing dynamic gesture recognition only on first video frame images whose difference result indicates that the difference satisfies the condition therefore ensures that the processed video frame images do not contain static gestures, which improves the accuracy of the obtained dynamic gesture recognition result and, in turn, the user experience.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the embodiments or the description of the prior art will be briefly described below.
FIG. 1 is a flowchart of a gesture recognition method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a grayscale map corresponding to a first video frame image according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a grayscale map corresponding to a second video frame image according to an embodiment of the present application;
FIG. 4 is a schematic diagram of the difference map corresponding to FIG. 2 and FIG. 3 according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a binary map generated from FIG. 4 according to another embodiment of the present application;
FIG. 6 is a schematic diagram of the binary image of FIG. 5 after morphological processing according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a mask image according to another embodiment of the present application;
FIG. 8 is a schematic diagram of a skin color region provided by an embodiment of the present application;
FIG. 9 is a schematic structural diagram of a gesture recognition apparatus according to an embodiment of the present application;
FIG. 10 is a schematic structural diagram of an intelligent device provided in an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Furthermore, in the description of the present application and the appended claims, the terms "first," "second," and the like are used only to distinguish between descriptions and are not to be understood as indicating or implying relative importance.
Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise.
Embodiment one:
when the gesture category is predicted segment by segment using a sliding window, a video segment may contain only a static gesture (note that a static gesture is different from the static gesture recognition above: a static gesture means the user's hand shows no change or displacement across consecutive frames). For example, the user may simply hold a hand in the picture without performing a dynamic gesture action of any preset category. Even so, the dynamic gesture recognition model still returns a gesture-category prediction for that video segment according to the probabilities, which results in a poor user experience.
To give the dynamic gesture recognition model better performance in practical use, the embodiments of the application provide a gesture recognition method that achieves this improvement without retraining the dynamic gesture recognition model.
The gesture recognition method provided by the embodiment of the application is described below with reference to the drawings.
Fig. 1 shows a flowchart of a gesture recognition method provided in an embodiment of the present application, which is detailed as follows:
step S11, acquiring a video clip, where the video clip includes at least two video frame images.
For example, when the gesture recognition method of the embodiment of the present application is applied to a robot capable of recognizing gestures, the robot, once started, acquires video frame images in real time, and these video frame images constitute the video clip.
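As a rough illustration of this acquisition step, the following sketch keeps the most recent frames in a sliding window that plays the role of the video clip. It assumes an OpenCV-compatible camera; the window length WINDOW_LEN and the camera index are illustrative choices, not values fixed by this application.

```python
# Hedged sketch: frame acquisition into a sliding window acting as the video clip.
from collections import deque

import cv2

WINDOW_LEN = 16                      # assumed clip length in frames

cap = cv2.VideoCapture(0)            # default camera (assumed device index)
clip = deque(maxlen=WINDOW_LEN)      # the "video clip": most recent frames

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    clip.append(frame)               # newest frame = the "first video frame image"
    if len(clip) >= 2:
        # clip[-2] is the "second video frame image"; the difference check of
        # step S12 would run here on clip[-1] and clip[-2].
        pass

cap.release()
```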
Step S12, determining a difference between a first video frame image and a second video frame image to obtain a difference result, where the first video frame image and the second video frame image are adjacent video frame images obtained from the video clip, and the first video frame image is a video frame image subsequent to the second video frame image.
In the embodiment of the application, when the video clip is one of the clips in a video file acquired in advance, the gesture recognition is offline, i.e., the real-time requirement is low. In this case, gesture recognition can be performed on the video frame images of the clip in order from front to back: the second video frame image of the clip is first processed as the "first video frame image" of this embodiment, then the third video frame image of the clip is processed as the "first video frame image", and so on for the remaining video frame images of the clip. When the video clip is being acquired currently, the most recently acquired video frame image is usually used as the first video frame image.
Since an intelligent device such as a robot usually stores the video stream it acquires, after the first video frame image is determined, the video frame image preceding it can be fetched from the stored video stream (i.e., the video clip) and used as the second video frame image. The first video frame image is then compared with the second video frame image, e.g., by comparing the pixel values at corresponding positions of the two images, to determine the difference between them. The difference here refers to the places where the two video frame images differ; for example, if the pixel values at the same corresponding position are different, the two images differ at that position.
Step S13, performing dynamic gesture recognition on the first video frame image whose difference result indicates that the difference satisfies a condition, where the condition includes that the difference is greater than a preset difference threshold.
Specifically, when the absolute difference between the pixel values of the first and second video frame images at a corresponding position is greater than 0, there is a difference between the two images at that position. In this case, the condition that the difference must satisfy includes: the absolute difference of the pixel values at a corresponding position is greater than a preset first threshold (the preset difference threshold includes the first threshold). Further, the difference may be considered to satisfy the condition only when the number of pixel points whose absolute pixel-value difference exceeds the first threshold is itself greater than a preset second threshold; in that case, the preset difference threshold includes both the first threshold and the second threshold.
In the embodiment of the present application, the difference between the first video frame image and the video frame image preceding it (i.e., the second video frame image) is determined before dynamic gesture recognition is performed, and dynamic gesture recognition is performed on the first video frame image only when that difference is greater than the preset difference threshold. A difference greater than the threshold indicates that, compared with the second video frame image, the first video frame image contains a moving picture. Performing dynamic gesture recognition only on first video frame images whose difference result indicates that the difference satisfies the condition therefore ensures that the processed video frame images do not contain static gestures, which improves the accuracy of the obtained dynamic gesture recognition result and, in turn, the user experience.
In some embodiments, a large difference between the first and second video frame images may be caused by a change in the background rather than in the gesture, so it is necessary to determine whether the change corresponds to a gesture. In this case, step S13 includes:
a1, detecting whether a target area of a target video frame image includes a skin color area, wherein the target video frame image is the first video frame image whose difference result indicates that the difference satisfies a condition, and the target area is an area where the first video frame image and the second video frame image have a difference.
Specifically, detect whether pixel values matching human skin color exist among the pixel points of the target area; if so, the region where the pixel points with those values are located is taken as the skin color region.
And A2, performing dynamic gesture recognition on the target video frame image with the skin color area.
In this embodiment, it is considered that the user's hand, such as the palm, is usually not covered by clothes; therefore, when the user's gesture changes, the changed pixel points usually include pixel values that match skin color. That is, after determining that the difference of the first video frame image satisfies the condition, the embodiment further determines whether the changed region of the first video frame image includes a skin color region, and performs dynamic gesture recognition on the first video frame image only after determining that it does. This further improves the accuracy of the subsequently obtained dynamic gesture recognition.
In some embodiments, the step S12 includes:
b1, converting the first video frame image and the second video frame image into gray scale images respectively.
In this embodiment, to facilitate subsequent matching, both the first video frame image and the second video frame image are converted into grayscale images.
B2, calculating the absolute difference between the pixel value of the gray scale image of the first video frame image and the pixel value of the gray scale image of the second video frame image to obtain a difference image.
Specifically, assuming the grayscale map of the first video frame image is denoted gray and the grayscale map of the second video frame image is denoted pri_gray, the absolute difference between the pixel values of the two grayscale maps is calculated using the following formula:
abs_gray = abs(pri_gray - gray).
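A minimal sketch of steps B1-B2, assuming 8-bit BGR frames as delivered by OpenCV, might look as follows; cv2.absdiff is used because plain uint8 subtraction would wrap around on negative results.

```python
# Hedged sketch of steps B1-B2: grayscale conversion and absolute difference.
import cv2

def frame_difference(cur_image, pri_image):
    """Return the difference map abs_gray = |pri_gray - gray|."""
    gray = cv2.cvtColor(cur_image, cv2.COLOR_BGR2GRAY)      # first (current) frame
    pri_gray = cv2.cvtColor(pri_image, cv2.COLOR_BGR2GRAY)  # second (previous) frame
    return cv2.absdiff(pri_gray, gray)                      # per-pixel absolute difference
```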
in this embodiment, the difference between the pixel values of the two grayscale maps may be negative, while for an 8-bit image the pixel values lie between 0 and 255. To ensure the result can be displayed correctly and still reflect the difference between the two pixel values, the absolute difference between the grayscale pixel values is calculated.
B3, determining the difference result according to the pixel value of the difference image.
In this embodiment, the pixel value of each pixel point in the difference map is the absolute difference between the corresponding grayscale pixel values of the first and second video frame images; when the absolute difference is not 0, the two images differ at that position. The difference result can therefore be accurately determined from the pixel values of the difference map.
In some embodiments, the preset difference threshold includes a first threshold and a second threshold, and the B3 includes:
and B31, setting the pixel value larger than the first threshold value in the difference image as a first pixel value, and setting the pixel value not larger than the first threshold value in the difference image as a second pixel value, so as to obtain a binary image corresponding to the first video frame image.
In this embodiment, the first pixel value is usually set to 255 and the second pixel value to 0, but both may be set to other values between 0 and 255; this is not limited here.
In this embodiment, it is considered that changes in lighting or the presence of dust can also produce a difference between the first and second video frame images, but such changes are small, i.e., the resulting absolute pixel-value differences are small. The first threshold can therefore be set to, e.g., 40, which both prevents small disturbances (not caused by gesture changes) from being treated as differences and still allows genuine differences between the two video frame images to be detected.
In some embodiments, the first threshold may also be determined by:
count the distribution of the pixel values of all pixel points in the difference map, determine the pixel values in the most concentrated part of the distribution, and select one of them as the first threshold. For example, if most pixel values of the difference map are around 50, the first threshold is set to 50. Setting the threshold this way improves its accuracy.
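One way to realize this heuristic is sketched below; taking the histogram mode as "the most concentrated pixel value" is an interpretation of the description above, not a prescribed implementation.

```python
# Hedged sketch: pick the first threshold from the distribution of difference values.
import numpy as np

def estimate_first_threshold(abs_gray):
    hist, _ = np.histogram(abs_gray, bins=256, range=(0, 256))
    return int(np.argmax(hist))  # pixel value where the distribution is most concentrated
```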
In some embodiments, the noise points of the difference map may be filtered, and then the binary image corresponding to the first video frame image may be generated according to the difference map with the noise points filtered.
Specifically, determine the types of noise that may exist in the difference map, select a corresponding filtering method according to those types, and denoise the difference map with the selected method. For example, since the difference map usually contains salt-and-pepper noise and the median filter is well suited to removing it, median filtering may be applied to the difference map, and the binary image may then be generated from the denoised result.
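A minimal sketch of this denoise-then-binarize step, assuming the fixed first threshold of 40 discussed above and an assumed 5x5 median kernel:

```python
# Hedged sketch of median filtering plus binarization (step B31).
import cv2

FIRST_THRESHOLD = 40  # value discussed above; other choices are possible

def binarize_difference(abs_gray):
    blu_gray = cv2.medianBlur(abs_gray, 5)  # suppress salt-and-pepper noise
    _, binary = cv2.threshold(blu_gray, FIRST_THRESHOLD, 255, cv2.THRESH_BINARY)
    return binary  # first pixel value = 255, second pixel value = 0
```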
And B32, if the number of the first pixel values in the binary image corresponding to the first video frame image is greater than the second threshold, obtaining a difference result indicating that the difference satisfies the condition, otherwise, obtaining a difference result indicating that the difference does not satisfy the condition.
Specifically, when the user's gesture changes, a considerable number of pixel points change, so the second threshold must not be set too small. In some embodiments, the second threshold may be set to 200.
In B31 and B32 above, the difference of the first video frame image is considered to satisfy the condition only when the number of first pixel values in its binary image exceeds the second threshold; since the first pixel values come from absolute differences greater than the first threshold, a larger count of first pixel values indicates a larger difference between the first and second video frame images. The difference results obtained this way are therefore more accurate.
In some embodiments, before step B32, the method includes:
and performing morphological processing on the binary image. Correspondingly, step B32 specifically includes:
and if the number of the first pixel values in the morphologically processed binary image corresponding to the first video frame image is greater than the second threshold, obtaining a difference result indicating that the difference satisfies the condition, otherwise, obtaining a difference result indicating that the difference does not satisfy the condition.
In the embodiment of the application, the binary image obtained by threshold segmentation often shows an incomplete, fragmentary object shape; morphological processing fills in the shape or removes redundant pixels, so that the count of first pixel values in the morphologically processed binary image is more accurate. The morphological processing in the embodiment of the application includes erosion and dilation operations.
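A sketch of this morphological clean-up followed by the count check of B32, with an assumed 3x3 structuring element and the second threshold of 200 mentioned above:

```python
# Hedged sketch: erosion + dilation, then compare the count of first pixel
# values (255) against the second threshold.
import cv2
import numpy as np

SECOND_THRESHOLD = 200  # value discussed above

def difference_satisfies_condition(binary):
    kernel = np.ones((3, 3), np.uint8)
    eroded = cv2.erode(binary, kernel)       # remove isolated noise pixels
    mask_gray = cv2.dilate(eroded, kernel)   # restore / fill in the object shape
    sum_p = int(np.count_nonzero(mask_gray == 255))
    return mask_gray, sum_p > SECOND_THRESHOLD
```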
In some embodiments, the step a1 includes:
and A11, determining a target area of the target video frame image according to the binary image corresponding to the target video frame image.
In this embodiment, the target video frame image and its corresponding binary image have the same resolution, so each pixel point of the target video frame image has a corresponding pixel point in the binary image. If the pixel value of a pixel point in the binary image is the first pixel value, the corresponding pixel value in the target video frame image remains unchanged; if it is the second pixel value, the corresponding pixel value in the target video frame image is set to 0. For example, suppose the first pixel value is 255, the second pixel value is 0, the pixel value of the target video frame image at (0, 0) is 200, and the pixel value of the corresponding binary image at (0, 0) is 0; then, when a new image is determined from the target video frame image and its binary image, the pixel value of the new image at (0, 0) is set to 0. Conversely, if the binary image's pixel value at (0, 0) is 255, the new image's pixel value at (0, 0) is set to 200. With the pixel values of the new image set this way, every region whose pixel values are not 0 belongs to the target area of the target video frame image.
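This per-pixel rule is equivalent to masking the current frame with the binary image; a minimal OpenCV sketch of step A11:

```python
# Hedged sketch of step A11: keep only the target (changed) region of the frame.
import cv2

def extract_target_region(cur_image, mask_gray):
    # Pixels where mask_gray is 0 become 0; the rest keep their original values.
    return cv2.bitwise_and(cur_image, cur_image, mask=mask_gray)
```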
And A12, converting the pixel points corresponding to the target area into YUV space to obtain new pixel points.
The YUV space is also referred to as the YCrCb space.
In this embodiment, skin color regions are easier to delineate in the YUV space, so the pixel points of the target area need to be converted into the YUV space. For example, if the target video frame image is in the Red Green Blue (RGB) space, it needs to be converted into the YUV space.
And A13, if the new pixel point exists in the preset elliptical area, judging that the target area comprises a skin color area.
The size of the elliptical area may be determined empirically, or may be determined based on the size of the new image (i.e., the image determined based on the target video frame image and its corresponding binary image).
After the pixel points are converted into the YUV space, pixel points whose values match skin color cluster inside an elliptical region; therefore, whether the target area contains a skin color region can be judged quickly simply by checking whether any new pixel points fall inside that elliptical region.
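A hedged sketch of this elliptical-region test follows; the ellipse centre, axes, and rotation are commonly cited literature values for the Cr-Cb skin model and are assumptions here, since this application does not fix them.

```python
# Hedged sketch of steps A12-A13: YCrCb conversion and elliptical skin test.
import cv2
import numpy as np

# 256x256 lookup table: 1 where the (Cr, Cb) pair falls inside the skin ellipse.
# Centre/axes/angle are assumed literature values, not taken from this application.
skin_lut = np.zeros((256, 256), np.uint8)
cv2.ellipse(skin_lut, (113, 156), (24, 16), 43, 0, 360, 1, -1)

def skin_pixel_count(mask_image):
    ycrcb = cv2.cvtColor(mask_image, cv2.COLOR_BGR2YCrCb)
    cr, cb = ycrcb[..., 1], ycrcb[..., 2]
    return int(skin_lut[cr, cb].sum())  # number of new pixel points inside the ellipse
```

Note that zeroed background pixels convert to (Cr, Cb) = (128, 128), which falls outside the assumed ellipse, so the masked-out region does not count as skin.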
In some embodiments, the target new pixel points are the new pixel points that lie in the preset elliptical region, and the region of the target area corresponding to the target new pixel points is the skin color region. In this case, step A2 includes:
and detecting whether the skin color area of the target video frame image is larger than a preset area threshold value, and if so, performing dynamic gesture recognition on the target video frame image.
Specifically, the number of new pixel points in the preset elliptical region is accumulated to obtain the size of the skin color region.
In the embodiment of the application, a small skin color area indicates that the gesture in the first video frame image has not changed; therefore, performing dynamic gesture recognition only on target video frame images with a sufficiently large skin color area improves the accuracy of the dynamic gesture recognition.
In some embodiments, the gesture recognition method provided in the embodiments of the present application further includes:
and not performing dynamic gesture recognition on the first video frame image of which the difference result indicates that the difference does not meet the condition.
Specifically, when the difference between the first and second video frame images does not satisfy the condition, the difference between them is small, i.e., compared with the second video frame image the first video frame image contains no changed picture. Not performing dynamic gesture recognition on the first video frame image in this case improves the accuracy of the subsequently obtained dynamic gesture recognition results.
In order to more clearly describe the gesture recognition method provided in the embodiment of the present application, a specific example is described below.
(1) Calculating the gray difference between two video frame images
1) Assume the grayscale map of the first video frame image is gray (as shown in fig. 2) and the grayscale map of the second video frame image is pri_gray (as shown in fig. 3). Calculate the difference map between pri_gray and gray (as shown in fig. 4), i.e., obtain the absolute differences of the pixel values of the two grayscale maps:
abs_gray = abs(pri_gray - gray) (1)
2) Apply median filtering to the difference map to remove its noise points, obtaining a new difference map blu_gray.
3) Binarize blu_gray. Specifically, set a binarization threshold: pixel values (also called pixel grayscale values) greater than the threshold are changed to 255, and pixel values less than the threshold are changed to 0, giving the binarized grayscale map shown in fig. 5. Since the binary image obtained by threshold segmentation often shows an incomplete object shape, morphological processing can fill in the shape or remove redundant pixels. After morphological processing of fig. 5, the schematic diagram shown in fig. 6 is obtained.
After the binary image produced by the erosion and dilation operations (denoted mask_gray) is obtained, its pixels can be traversed to count sum_p, the number of pixel points with grayscale value 255. When sum_p is greater than the second threshold (e.g., 200), the pixel value change between the first and second video frame images is considered large, i.e., the picture contains a moving part.
(2) Skin color detection in the changed region
The method in step (1) determines whether a moving part exists in the picture, but judging motion alone is insufficient: it is further necessary to determine whether the moving part contains skin color, so that misjudgments caused by changes in other objects can be screened out.
First, combine the mask_gray obtained in step (1) with the first video frame image (denoted cur_image): in cur_image, the pixel values corresponding to the regions where mask_gray is 0 are also set to 0, giving the mask_image shown in fig. 7.
Second, determine whether skin color pixels exist in the resulting mask_image. The principle is to convert the RGB image into the YCrCb space, where skin color pixels cluster inside an elliptical region. Specifically, define an ellipse of a given size, convert each RGB pixel point into the YCrCb space, and check whether the converted point falls inside the defined ellipse; if so, the pixel point is judged to be a skin pixel. The resulting skin color region is shown in fig. 8.
Finally, set an area threshold skin_thre, traverse the image, and count the number of pixels in the skin color region; when this number is greater than skin_thre, the moving part is considered to contain skin color.
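Putting the worked example together, a hedged end-to-end sketch reusing the helper functions from the earlier sketches (skin_thre is again an assumed value):

```python
# Hedged end-to-end sketch of the worked example, reusing frame_difference,
# binarize_difference, difference_satisfies_condition, extract_target_region,
# and skin_pixel_count from the sketches above.
SKIN_THRE = 300  # assumed area threshold skin_thre

def should_run_dynamic_recognition(cur_image, pri_image):
    abs_gray = frame_difference(cur_image, pri_image)           # step (1): gray diff
    binary = binarize_difference(abs_gray)                      # median filter + binarize
    mask_gray, moving = difference_satisfies_condition(binary)  # morphology + sum_p check
    if not moving:
        return False                                            # no moving part: skip frame
    mask_image = extract_target_region(cur_image, mask_gray)    # step (2): changed region
    return skin_pixel_count(mask_image) > SKIN_THRE             # skin in the moving part?
```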
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
Embodiment two:
Fig. 9 shows a structural block diagram of a gesture recognition apparatus provided in the embodiment of the present application, which corresponds to the gesture recognition method in the foregoing embodiment, and only shows portions related to the embodiment of the present application for convenience of description.
Referring to fig. 9, the gesture recognition apparatus 9 includes: a video clip acquisition module 91, a difference result determination module 92 and a dynamic gesture recognition module 93. Wherein:
the video segment acquiring module 91 is configured to acquire a video segment, where the video segment includes at least two video frame images.
A difference result determining module 92, configured to determine a difference between a first video frame image and a second video frame image to obtain a difference result, where the first video frame image and the second video frame image are adjacent video frame images obtained from the video clip, and the first video frame image is a video frame image subsequent to the second video frame image.
A dynamic gesture recognition module 93, configured to perform dynamic gesture recognition on the first video frame image whose difference result indicates that the difference satisfies a condition, where the condition includes that the difference is greater than a preset difference threshold.
In the embodiment of the present application, the difference between the first video frame image and the video frame image preceding it (i.e., the second video frame image) is determined before dynamic gesture recognition is performed, and dynamic gesture recognition is performed on the first video frame image only when that difference is greater than the preset difference threshold. A difference greater than the threshold indicates that, compared with the second video frame image, the first video frame image contains a moving picture. Performing dynamic gesture recognition only on first video frame images whose difference result indicates that the difference satisfies the condition therefore ensures that the processed video frame images do not contain static gestures, which improves the accuracy of the obtained dynamic gesture recognition result and, in turn, the user experience.
In some embodiments, the dynamic gesture recognition module 93 includes:
a skin color region detection unit, configured to detect whether a target area of a target video frame image includes a skin color region, where the target video frame image is the first video frame image whose difference result indicates that the difference satisfies a condition, and the target area is an area in which the first video frame image and the second video frame image differ.
And the dynamic gesture recognition unit is used for performing dynamic gesture recognition on the target video frame image with the skin color area.
In some embodiments, the difference result determining module 92 includes:
and a grayscale image conversion unit for converting the first video frame image and the second video frame image into grayscale images, respectively.
And a difference map determining unit, configured to calculate an absolute difference between a pixel value of the grayscale map of the first video frame image and a pixel value of the grayscale map of the second video frame image, so as to obtain a difference map.
A difference result determining unit for determining the difference result according to the pixel value of the difference image.
In some embodiments, the preset difference threshold includes a first threshold and a second threshold, and the difference result determining unit includes:
and the binary image generating unit is used for setting the pixel value larger than the first threshold value in the difference image as a first pixel value, and setting the pixel value not larger than the first threshold value in the difference image as a second pixel value to obtain a binary image corresponding to the first video frame image.
And a difference result generation unit, configured to obtain a difference result indicating that the difference satisfies the condition if the number of the first pixel values in the binary image corresponding to the first video frame image is greater than the second threshold, and otherwise, obtain a difference result indicating that the difference does not satisfy the condition.
In some embodiments, the binary image generation unit is specifically configured to: after filtering the noise points of the difference map, set the pixel values of the denoised difference map that are greater than the first threshold to the first pixel value and those not greater than the first threshold to the second pixel value, obtaining the binary image corresponding to the first video frame image.
In some embodiments, the gesture recognition device 9 further comprises:
and the morphology processing module is used for carrying out morphology processing on the binary image.
Correspondingly, the difference result generation unit is specifically configured to:
and if the number of the first pixel values in the morphologically processed binary image corresponding to the first video frame image is greater than the second threshold, obtaining a difference result indicating that the difference satisfies the condition, otherwise, obtaining a difference result indicating that the difference does not satisfy the condition.
In some embodiments, the skin color region detecting unit includes:
and the target area determining unit is used for determining the target area of the target video frame image according to the binary image corresponding to the target video frame image.
And the new pixel point determining unit is used for converting the pixel point corresponding to the target area into a YUV space to obtain a new pixel point.
And the judging unit is used for judging that the target area contains a skin color area if the new pixel point exists in the preset elliptical area.
In some embodiments, the target new pixel points are the new pixel points in the preset elliptical region, the region of the target area corresponding to the target new pixel points is the skin color area, and the dynamic gesture recognition unit is specifically configured to:
and detecting whether the skin color area of the target video frame image is larger than a preset area threshold value, and if so, performing dynamic gesture recognition on the target video frame image.
In some embodiments, the gesture recognition apparatus further includes:
and the non-response module is used for not performing dynamic gesture recognition on the first video frame image of which the difference result indicates that the difference does not meet the condition.
It should be noted that, for the information interaction, execution process, and other contents between the above-mentioned devices/units, the specific functions and technical effects thereof are based on the same concept as those of the embodiment of the method of the present application, and specific reference may be made to the part of the embodiment of the method, which is not described herein again.
Embodiment three:
fig. 10 is a schematic structural diagram of an intelligent device according to an embodiment of the present application. As shown in fig. 10, the smart device 10 of this embodiment includes: at least one processor 100 (only one processor is shown in fig. 10), a memory 101, and a computer program 102 stored in the memory 101 and executable on the at least one processor 100, wherein the steps of any of the method embodiments are implemented when the processor 100 executes the computer program 102.
The intelligent device 10 may be a computing device such as a robot, a mobile phone, a desktop computer, a notebook, a palm computer, and a cloud server. The smart device may include, but is not limited to, a processor 100, a memory 101. Those skilled in the art will appreciate that fig. 10 is merely an example of the smart device 10 and does not constitute a limitation of the smart device 10 and may include more or less components than those shown, or combine certain components, or different components, such as input output devices, network access devices, etc.
The processor 100 may be a Central Processing Unit (CPU), or another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 101 may in some embodiments be an internal storage unit of the smart device 10, such as a hard disk or a memory of the smart device 10. In other embodiments, the memory 101 may also be an external storage device of the Smart device 10, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), or the like, provided on the Smart device 10. Further, the memory 101 may also include both an internal storage unit and an external storage device of the smart device 10. The memory 101 is used for storing an operating system, an application program, a BootLoader (BootLoader), data, and other programs, such as program codes of the computer program. The memory 101 may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
An embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements the steps in the foregoing method embodiments.
The embodiments of the present application provide a computer program product, which when running on an intelligent device, enables the intelligent device to implement the steps in the above method embodiments when executed.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, all or part of the processes in the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of the method embodiments described above. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include at least: any entity or apparatus capable of carrying the computer program code to a photographing apparatus/smart device, a recording medium, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, such as a USB flash disk, a removable hard disk, a magnetic disk, or an optical disk. In certain jurisdictions, in accordance with legislation and patent practice, computer-readable media may not include electrical carrier signals or telecommunications signals.
In the above embodiments, the description of each embodiment has its own emphasis, and reference may be made to the related description of other embodiments for parts that are not described or recited in any embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/network device and method may be implemented in other ways. For example, the above-described apparatus/network device embodiments are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implementing, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. A gesture recognition method, comprising:
acquiring a video clip, wherein the video clip comprises at least two video frame images;
determining the difference between a first video frame image and a second video frame image to obtain a difference result, wherein the first video frame image and the second video frame image are adjacent video frame images obtained from the video clip, and the first video frame image is the video frame image immediately following the second video frame image;
and performing dynamic gesture recognition on the first video frame image of which the difference result indicates that the difference meets the condition, wherein the condition comprises that the difference is greater than a preset difference threshold value.
2. The gesture recognition method according to claim 1, wherein the performing dynamic gesture recognition on the first video frame image of which the difference result indicates that a difference satisfies a condition comprises:
detecting whether a target area of a target video frame image contains a skin color area, wherein the target video frame image is the first video frame image of which the difference result indicates that the difference meets a condition, and the target area is an area in which the first video frame image and the second video frame image differ;
and performing dynamic gesture recognition on the target video frame image with the skin color area.
3. The gesture recognition method of claim 2, wherein the determining the difference between the first video frame image and the second video frame image to obtain a difference result comprises:
respectively converting the first video frame image and the second video frame image into gray-scale images;
calculating an absolute difference value between the pixel value of the gray scale image of the first video frame image and the pixel value of the gray scale image of the second video frame image to obtain a difference image;
and determining the difference result according to the pixel value of the difference image.
4. The gesture recognition method according to claim 3, wherein the preset difference threshold includes a first threshold and a second threshold, and the determining the difference result according to the pixel values of the difference map includes:
setting the pixel value larger than the first threshold value in the difference image as a first pixel value, and setting the pixel value not larger than the first threshold value in the difference image as a second pixel value to obtain a binary image corresponding to the first video frame image;
if the number of the first pixel values in the binary image corresponding to the first video frame image is larger than the second threshold, obtaining a difference result indicating that the difference meets the condition, otherwise, obtaining a difference result indicating that the difference does not meet the condition.
5. The gesture recognition method of claim 4, wherein the detecting whether the target region of the target video frame image contains a skin tone region comprises:
determining a target area of the target video frame image according to the binary image corresponding to the target video frame image;
converting the pixel point corresponding to the target area into a YUV space to obtain a new pixel point;
and if the new pixel point exists in the preset elliptical area, judging that the target area comprises a skin color area.
6. The method according to claim 5, wherein the target new pixel point is a new pixel point in the preset elliptical region, and in the target region, a region corresponding to the target new pixel point is a skin color region, and the dynamic gesture recognition on the target video frame image with the skin color region comprises:
and detecting whether the skin color area of the target video frame image is larger than a preset area threshold value, and if so, performing dynamic gesture recognition on the target video frame image.
7. The gesture recognition method according to any one of claims 1 to 6, characterized in that the gesture recognition method further comprises:
not performing dynamic gesture recognition on the first video frame image of which the difference result indicates that the difference does not satisfy the condition.
8. A gesture recognition apparatus, comprising:
the video clip acquisition module is used for acquiring a video clip, and the video clip comprises at least two video frame images;
a difference result determining module, configured to determine a difference between a first video frame image and a second video frame image to obtain a difference result, where the first video frame image and the second video frame image are adjacent video frame images obtained from the video clip, and the first video frame image is a video frame image subsequent to the second video frame image;
and the dynamic gesture recognition module is used for performing dynamic gesture recognition on the first video frame image of which the difference result indicates that the difference meets the condition, wherein the condition comprises that the difference is greater than a preset difference threshold value.
9. A smart device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 7.
CN202210262703.6A 2022-03-17 2022-03-17 Gesture recognition method and device and intelligent equipment Pending CN114758268A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210262703.6A CN114758268A (en) 2022-03-17 2022-03-17 Gesture recognition method and device and intelligent equipment

Publications (1)

Publication Number Publication Date
CN114758268A 2022-07-15

Family

ID=82327453

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210262703.6A Pending CN114758268A (en) 2022-03-17 2022-03-17 Gesture recognition method and device and intelligent equipment

Country Status (1)

Country Link
CN (1) CN114758268A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116600137A (en) * 2023-07-17 2023-08-15 深圳市东明炬创电子股份有限公司 Video image compression storage or decompression method, device, equipment and medium
CN116600137B (en) * 2023-07-17 2023-11-17 深圳市东明炬创电子股份有限公司 Video image compression storage or decompression method, device, equipment and medium
CN117615440A (en) * 2024-01-24 2024-02-27 荣耀终端有限公司 Mode switching method and related device
CN117615440B (en) * 2024-01-24 2024-05-24 荣耀终端有限公司 Mode switching method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination